(Elena Volodina, August 17, 2021)
Contact info: swell@svenska.gu.se
The most recent version of this ReadMe file is available at https://spraakbanken.github.io/swell-release-v1/Readme-SweLL-gold
The SweLL-gold corpus is a corpus of essays written by adult learners of Swedish. It was collected during 2017-2021 in the SweLL project https://spraakbanken.gu.se/en/projects/swell/, and contains 502 essays that have been pseudonymized, normalized and correction-annotated. More information about this corpus can be found in the Metadata section below, and in the articles in the References section (e.g. Volodina et al., 2019).
Descriptive files:
Data files:
swellOriginal-folder contains
8.1 sourceSweLL.txt - raw text files with original essays (pseudonymized versions)
8.2 sourceSweLL.xml - an xml version of the original essays following Korp format. Attributes and variables are described in the Metadata file below. No linguistic annotation is added to this version
8.3 sourceSweLL_Ling_annotated.xml - an xml version (Korp format) of the original essays with linguistic annotation added by the Sparv pipeline.
swellTarget-folder contains
9.1 targetSweLL.txt - raw text files with normalized essays
9.2 targetSweLL.xml - an xml version of the normalized essays following Korp format. Attributes and variables are described in the Metadata file below. No linguistic annotation is added to this version
9.3 targetSweLL_Ling_annotated.xml - an xml version (Korp format) of the normalized essays with linguistic annotation added by the Sparv pipeline.
Note! Each format includes a so-called SVALA-link (or full-text link) for each essay. Following that link opens the full text of the essay in the SVALA tool (Wirén et al., 2019) as a parallel representation: the original, the target, and the tags on the links between original and target tokens. In SVALA, you can experiment with various annotation modes and tagsets. However, please be aware that annotations added there are not saved to any database.
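As a starting point for working with the xml versions listed above, the following is a minimal Python sketch that reports which element tags a file contains and which attributes are present on essay-level elements. It assumes that each essay is wrapped in an element named "text", as is common in Korp-style exports; that tag name is an assumption, not something this ReadMe specifies, so check it against the Metadata section and pass a different essay_tag if needed. The function name summarize_xml is purely illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(source, essay_tag="text"):
    """Return (tag_counts, essay_attribute_names) for an xml corpus file.

    source    -- a file path or a binary file object
    essay_tag -- ASSUMPTION: name of the element wrapping one essay;
                 adjust it to whatever the Metadata section documents
    """
    tag_counts = Counter()
    attribute_names = set()
    # iterparse streams the file, so the whole corpus is never held in memory
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag_counts[elem.tag] += 1
        if elem.tag == essay_tag:
            attribute_names.update(elem.attrib)
            elem.clear()  # release the finished essay subtree
    return tag_counts, sorted(attribute_names)
```

Called as summarize_xml("swellOriginal/sourceSweLL.xml") (the path is relative to wherever the release was unpacked, which is also an assumption), it returns a Counter of tag names and a sorted list of attribute names. Comparing that list with the attributes described in the Metadata section is a quick check that the file was read as intended.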
Please note the agreement conditions:
Cite one or several of the following when using this corpus:
Beáta Megyesi, Sofia Johansson, Dan Rosén, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén and Elena Volodina (2018). Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish. Proceedings of the 7th NLP4CALL workshop. https://ep.liu.se/en/conference-article.aspx?series=ecp&issue=152&Article_No=6
Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue. https://nejlt.ep.liu.se/article/view/1374
Mats Wirén, Arild Matsson, Dan Rosén and Elena Volodina (2019). SVALA: Annotation of Second-Language Learner Text Based on Mostly Automatic Alignment of Parallel Corpora. CLARIN-2018 post-conference volume. LiUP Press. https://ep.liu.se/en/conference-article.aspx?series=ecp&issue=159&Article_No=23
…more references can be found at https://spraakbanken.gu.se/en/projects/swell