Metadata description: the SweLL-gold corpus

(Elena Volodina, August 17, 2021)


1. General information

2. Metadata description for the SweLL-gold subcorpus

3. Manual annotation in the SweLL-gold subcorpus

1. General information

Total nr corr-annotated essays: 502 (SweLL-gold v.1.0). NOTE! In the SweLL corpus we use the term CORRECTION ANNOTATION instead of the more traditional ERROR ANNOTATION.

Nr sentences: 7 807

Nr tokens: 147 842 (incl. punctuation, original version)

Proficiency levels represented (based on the level of the courses of Swedish): A, B, C (A=Beginner, B=Intermediate, C=Advanced)

Manual coding/labeling: pseudonymization, normalization, correction annotation.

Inter-annotator agreement (IAA): 88% by Fleiss’ kappa and 76% by Krippendorff’s alpha. IAA was calculated on the basis of 10% of the essays (i.e. 50 essays) that were double-annotated with correction labels, and is counted based on annotated edges only.
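For readers who want to see how a Fleiss' kappa figure of this kind is obtained, below is a minimal sketch of the standard formula. It is not the script used in the SweLL project: the function name `fleiss_kappa`, the input layout (one list of labels per annotated edge) and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for units that were each labelled by the same number of annotators.

    `ratings` holds one inner list of labels per annotated unit (here: per edge).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(labels) != n_raters for labels in ratings):
        raise ValueError("every unit needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        category_totals.update(counts)
        # Share of annotator pairs that agree on this unit.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (
            n_raters * (n_raters - 1)
        )

    observed = agreement_sum / n_items
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        return 1.0  # a single label was used throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: two annotators labelling five edges.
toy_edges = [
    ["M-Def", "M-Def"],
    ["S-FinV", "S-FinV"],
    ["O", "O"],
    ["L-W", "L-W"],
    ["M-Def", "M-Num"],
]
kappa = fleiss_kappa(toy_edges)
```

In the toy data the annotators agree on four of five edges, so the observed agreement is 0.8, and kappa is lower than that because part of the agreement is expected by chance.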

NOTE the use of NL-tokens (i.e. New Line breaks) to preserve line breaks in the original learner writings. NL-tokens do not correspond to any punctuation, are added to the beginning of the sentence that follows them, and are counted towards running tokens.
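Since NL-tokens count towards running tokens, token counts differ depending on whether they are filtered out. A minimal sketch follows, assuming the NL-token appears as the sign ␤ in a tokenized essay; the constant `NL_TOKEN`, the function `count_tokens` and the sample essay are hypothetical.

```python
NL_TOKEN = "␤"  # assumed surface form of the NL-token


def count_tokens(sentences, include_nl=True):
    """Count running tokens in a list of tokenized sentences."""
    return sum(
        1
        for sentence in sentences
        for token in sentence
        if include_nl or token != NL_TOKEN
    )


# Hypothetical essay: the second sentence starts on a new line,
# so the NL-token is attached to its beginning.
essay = [
    ["Hej", "!"],
    [NL_TOKEN, "Jag", "bor", "i", "Sverige", "."],
]

with_nl = count_tokens(essay)
without_nl = count_tokens(essay, include_nl=False)
```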

Other subcorpora: Apart from the SweLL-gold subcorpus, the SweLL collection contains other (sub)corpora collected in other projects, e.g. TISUS (2007), SpIn, SW1203. Other (sub)corpora may be added to the SweLL collection.

1.1 Description of the project

The purpose of the SweLL infrastructure project was to set up an infrastructure for the collection, digitization, normalization, and annotation of learner written production, and to make a linguistically annotated corpus available. Because a parallel normalized version of each text is available, researchers can search for various types of linguistic structures without having to guess what such a structure might look like in learner language.

The SweLL infrastructure v1 consists of:

More information: https://spraakbanken.gu.se/en/projects/swell

The SweLL corpus is maintained at the University of Gothenburg, Språkbanken-Text https://spraakbanken.gu.se

For approved users, SweLL essays are available through the Korp search interface and as a full dataset via a link that is sent once access is approved. To get access, apply using this form: https://sunet.artologik.net/gu/swell.

1.2 Description of the SweLL-gold subcorpus

Read the article: https://nejlt.ep.liu.se/article/view/1374

Personal data management: Essays and personal metadata were collected with the learners' consent. The consent allows the use of essays for research by registered (approved) users. Handwritten essays were transcribed in a secure encrypted environment (SweLL kiosk). All essays were manually pseudonymized (using SweLL kiosk) based on the Pseudonymization guidelines: https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines


Mode: All essays were written as an exam or in a classroom as placement, formative or final tests. Most of the essays were written by hand and were transcribed later according to the Transcription guidelines: https://spraakbanken.github.io/swell-project/Transcription_guidelines


Pseudonymized essays were normalized and corr-annotated by L2 specialists, as described in section 3.

Time constraints and access to allowed materials vary between tasks (see details in Task Metadata for each particular task).

Full text access via Korp is secured through a link to the SVALA annotation tool where the full text opens. Note that the NL sign (␤) marks a new paragraph in the original student writing.

For further information on access to the SweLL corpus, see the webpages:

https://spraakbanken.gu.se/en/projects/swell/swell4users and https://spraakbanken.gu.se/en/projects/swell/l2korp

1.3 To cite the SweLL-gold subcorpus

2. Metadata description for the SweLL-gold subcorpus

2.1 Administrative information

Based on the “Core metadata for learner corpora: draft 1.0, December 2017” by Sylviane Granger and Magali Paquot

Administrative info  
Corpus title SweLL-gold, a corpus in a bigger SweLL collection
Distributor Språkbanken Text (https://spraakbanken.gu.se/), SweLL-infrastructure component (swell@svenska.gu.se)
Availability free of charge; access regulated by the GDPR restrictions (application form: https://sunet.artologik.net/gu/swell)
License CLARIN-ID, -PRIV, -NORED, -BY (explanations: https://www.kielipankki.fi/support/clarin-eula/#res)
Edition version 1
Character encoding UTF-8
Markup language / file formats Files are distributed in three formats: xml, json, raw texts
Corpus design info  
L2 target language Swedish; courses taken in Sweden
L1 (mother tongue) multiple; represented as iso-codes and usual names
Period of collection 2017-2020
Corpus size 502 essays; 7 807 sentences; 147 842 tokens, incl. punctuation (original version)
Corpus mode written language: essays collected from classroom/exam setting
Annotation transcription, pseudonymization, normalization, correction annotation, automatic linguistic annotation
Written versions One version of each essay (no multiple submissions of the same text)
Longitudinal some recurrent students appear, see section 2.4 (schools with IDs B, C, E, G, K)
Proficiency levels (approximate) development levels based on course level:
  A:Beginner –> Swedish for Immigrants (SFI)
  B:Intermediate –> Grundläggande vuxenutbildning (SVA)
  C:Advanced –> Gymnasiet; Universitetskurser; TISUS
Proficiency level type Course-level based
Official language testing mixed sources, including some official testing, e.g. TISUS
Comparison data (L1 source) N/A
Corpus annotation info  
Manual annotation yes, pseudonymization, normalization, correction annotation
  Transcription guidelines: https://spraakbanken.github.io/swell-project/Transcription_guidelines
  Pseudonymization guidelines: https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines
  Normalization guidelines: https://spraakbanken.github.io/swell-project/Normalization_guidelines
  Correction annotation guidelines: https://gupea.ub.gu.se/handle/2077/69434
Automatic annotation yes, using SPARV pipeline: https://spraakbanken.gu.se/en/tools/sparv/annotations
  part-of-speech tagging, incl morpho-syntactic description: https://spraakbanken.gu.se/korp/markup/msdtags.html
  lemmatization, incl word sense disambiguation and multi-word identification, SALDO-based
  dependency parsing: https://cl.lingfil.uu.se/~nivre/swedish_treebank/dep.html
Correction annotated yes, see section 3.2

2.2 Personal metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

Student ID [student ID / student_id] 451 unique students, e.g. C16. Letter prefix (for a school) + a running number
Age [age / age] A 5-year age interval indicating the age at the moment of writing an essay, e.g. 31-35. Ages 16-70 are represented
Birth year in 5-year intervals [birthyear_interval / birth year] 1950-1954 -- 2000-2004
Gender [gender / gender] Kvinna, Man, Annat, Vill inte säga
Time in Sweden (sum in months) [residence / time_in_sweden] 0 - 315
Native language(s) [native language / iso_l1] 81 unique languages in 117 unique combinations of 1-4 languages
  For ease of interpretation, the full name of the language is also provided in the xml files and in the metadata excel
Education background [education level / edu_level] 1= 0-6 years of schooling (including elementary school) => 34
  2= 7-9 years (including high school) => 56
  3= 10-13 years (including upper-secondary education) => 155
  4= 14+ years (including university education)=> 257
Elementary education outside Sweden (nr months) 12 - 156
Elementary education in Sweden (nr months) 12 - 132
Introductory education in Sweden (nyanlända) (months) 1 - 36
Gymnasial education outside Sweden (nr months) 2 - 168
Gymnasial education in Sweden (nr months) 1 - 72
Professional education outside Sweden (nr months) 12 - 60
Professional education in Sweden (nr months) 3 - 72
University education outside Sweden (nr months) 12 - 228
University education in Sweden (nr months) 6 - 96
Professional degree Free text (e.g. Ekonom)
Educational degree Free text
Education: additional comments Free text
Language information  
Education in L1 (modersmålsundervisning/hemspråk, education in Sweden) Language name
Characteristics below are available in the material, but not in the xml files or Korp  
Length of education in L1, nr months (modersmålsundervisning/hemspråk) 1 - 168
Swedish proficiency courses Self-instruction, formal education
Length of Swedish proficiency courses, nr months 1 - 336
All known languages List of languages
Other known language(s) except mother tongue(s) List of languages
Best written language(s) [writing_language] List of languages; not used for filtering in Korp
Best spoken language(s) Language name(s)
Language(s) used with the family Language name(s)
Language(s) used with friends Language name(s)
Metacomment Free comment added by an assistant
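The age and birth year attributes above are given as 5-year intervals. The sketch below shows one way to bin exact values into intervals that match the examples in the table (31-35 for age; 1950-1954 and 2000-2004 for birth year). The interval boundaries are inferred from those examples, and the function names are hypothetical.

```python
def age_interval(age):
    """Map an exact age to a 5-year interval such as '31-35'.

    Assumes intervals start at ages ending in 1 or 6 (16-20, 21-25, ...).
    """
    lower = (age - 1) // 5 * 5 + 1
    return f"{lower}-{lower + 4}"


def birth_year_interval(year):
    """Map a birth year to a 5-year interval such as '1950-1954'.

    Assumes intervals start at years ending in 0 or 5.
    """
    lower = year // 5 * 5
    return f"{lower}-{lower + 4}"


example_age = age_interval(33)
example_birth_year = birth_year_interval(1952)
```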

2.3 Task metadata

Administrative Metadata  
Task ID Task ID in the corpus. A letter prefix (for a school) + T(ask) + a running number, e.g. AT14. Total of 44 unique tasks
Semester (time span) VT-2018, HT-2018, VT-2019, HT-2019, VT-2020
Task date [task date / task_date] Year-week, e.g. 2018-W20
Datum [datum] Automatic Korp value, e.g. “2014-01-01”, in this case - a derivative of task_date. Datum is used by Korp search engine to create trend diagrams.
Course type / school type [school_type] A generic description of the type of school/education where the essay was collected
Course level / course subject [course_subject] Behörighetsgivande kurs
  Förberedande kurs
  Grundläggande SVA dk3
  Inplaceringsprov SFI
  SFI B / C / D
  SVA 2 / 3
Grading scale [grading_scale] A-F
  SFI inplacering
  Uppgiften har inte betyg
Writing task details  
Task type Behörighetstest
  Formativ skrivuppgift
  Test inför NP (Nationella Prov)
Task - format / mode [task_format] Mode of the essay writing: Handskriven, Digital
Task duration (in minutes) 25--180
Text type / genre Argumenterande
  Informellt mejl
Task instructions Free text alt. reference to an attachment
Allowed aids Bilingual dictionary, Monolingual dictionary, Internet, etc.
Additional material Free text comment
Additional comments Free text comment
Additional comment on coursebooks used Book title(s) / free text
Approximate level Approximate mapping of the course type to the level of proficiency.
  An additional marker “Fortsättning” (i.e. Continuation) is added to the other descriptors where necessary. Note that the two descriptors always come in alphabetical order, thus sometimes taking the form “Fortsättning, Nybörjare” (i.e. Continuation, Beginner)
Tasks - subject / topic [task_subject] A topic of the essay, e.g.
  1. Din första arbetsplats alt. Kvinnors arbete; 2. Mejl till kusin på besök i Sverige
  1. När jag var liten och gick till skolan första gången; 2. Mejl till kusin på besök i Sverige
  Argumenterande text om arbetsmoral
  Argumenterande text/brev
  Berätta hur du bor!
  Berätta utifrån texten "Alma berättar"
  Beskriv - En god relation
  Beskriva - Min första kärlek
  Brott - orsaker och konsekvenser
  Demokratiska val - hur gammal ska man behöva vara för att rösta och varför?
  Diskuterande text om pengars betydelse
  En kulturupplevelse
  En plats du tycker om
  En viktig plats
  Enkel utredande text om litterära teman
  Ge tips och råd
  Ge tips och råd - en anställningsintervju
  Kommunikation och sociala medier
  Mejl till en vän
  Mina första intryck
  Objektivt utredande uppgift
  Om din bostad och om att bo
  Referat av texten "Giftermål ett större steg än barn"
  Skriv en insändare
  Skriv ett brev
  Skriv ett mail
  Skriv om en känd person
  Två sätt att uppfostra
  Utredande text (pm) övning inför NP
  Världens lyckligaste länder
  Övnings-pm inför NP
Topic domain [lessontext_topic] Not available for SweLL-gold; present in some SweLL-pilot corpora and in COCTAILL. Topics in accordance with COCTAILL taxonomy < https://www.aclweb.org/anthology/W14-3510.pdf >
Task_url In certain cases where handouts were used, attachments are available at the urls.
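The task date is stored as a year-week value such as 2018-W20, while Korp works with a full date (Datum). How Korp derives the date is not specified here; the sketch below simply assumes ISO week numbering and picks the Monday of the given week. The function name `week_start` is hypothetical.

```python
from datetime import date


def week_start(task_date):
    """Return the Monday of a year-week string such as '2018-W20'.

    Assumption: the week number follows ISO 8601 week numbering.
    """
    year_part, week_part = task_date.split("-W")
    return date.fromisocalendar(int(year_part), int(week_part), 1)


example_datum = week_start("2018-W20").isoformat()
```

`date.fromisocalendar` requires Python 3.8 or later.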

2.4 Essay metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

Descriptive metadata  
• Essay ID [essay ID / essay_id] Essay ID consists of a student ID (e.g. A1) + task ID (e.g. AT1) == A1AT1
• Essay per student [uppsatser/student / nr_essay_student] numbers 1-5, indicating “recurrent students” if the value is 2 or above. For example, “2” means that there are 2 essays written by that particular student
  1 essay per student => 177
  2 essays per student => 151
  3 essays per student => 92
  4 essays per student => 63
  5 essays per student => 19
• Grade [grade / grade] Where available, is indicated for each task and student
• Full text [full text / svala_link] A link to the full essay that opens in SVALA annotation tool
Additional attributes A few attributes that are present in other SweLL subcorpora, and have been added to each SweLL subcorpus for the sake of interoperability
• Result on the writing assignment [written proficiency / written_result] TISUS-attribute
• Reading comprehension 1 / LF1 [reading comprehension (result), part 1 / lf1_result] TISUS-attribute
• Reading comprehension 2 / LF2 [reading comprehension (result), part 2 / lf2_result] TISUS-attribute
• Reading comprehension, sum [reading comprehension (sum) / lf_sum] TISUS-attribute
• Oral proficiency [oral proficiency / oral_result] TISUS-attribute
• Final grade [final grade / final_grade] TISUS-attribute
Proficiency level [proficiency level (CEFR) / cefr_level] Present in some of the SweLL subcorpora (e.g. TISUS, SW1203, SpIn) but absent in SweLL-gold v1
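Because an essay ID is a student ID followed by a task ID (e.g. A1 + AT1 == A1AT1), it can be split back into its two parts. A minimal sketch follows, assuming that every school prefix is a single uppercase letter as in the school listing; the pattern and the function name `split_essay_id` are hypothetical.

```python
import re

# Student ID: school letter + running number, e.g. "A1".
# Task ID: school letter + "T" + running number, e.g. "AT1".
ESSAY_ID = re.compile(r"^(?P<student_id>[A-Z]\d+)(?P<task_id>[A-Z]T\d+)$")


def split_essay_id(essay_id):
    """Split an essay ID such as 'A1AT1' into (student_id, task_id)."""
    match = ESSAY_ID.match(essay_id)
    if match is None:
        raise ValueError(f"not a recognizable essay ID: {essay_id!r}")
    return match.group("student_id"), match.group("task_id")


student_id, task_id = split_essay_id("A1AT1")
```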

2.5 School metadata

School listing   Description
• Source [not available in Korp for filtering, but present in the xml files] a letter indicating the school where the essays have been collected, see the listing below
A Vuxenutbildningscentrum Inplacering SFI-utbildning: A-D
B Gymnasieskola SVA
C Komvux/SFI SFI A-D
E Behörighetsgivande kurser / Universitetet motsv gymnasiet
F Behörighetsgivande kurser / Universitetet motsv gymnasiet
G SFI-provet SFI A-D
H TISUS-prov motsv gymnasiet
J Grundläggande vuxenutbildning SVA dk 1-4
K Grundläggande SVA SVA dk 1-4
L SVA-kurser på gymnasienivå vuxen utbildning
M Prov för antagning och inplacering till förberedande respektive behörighetsgivande kurser / Universitetet  

3. Manual annotation in the SweLL-gold subcorpus

3.1 Pseudonymization codes

Pseudonymization guidelines: https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines


Category type Codes Pseudonym / details
Names firstname_male replaced by an equivalent
Geographical data city (+foreign for non-Swedish ones) Swedish names replaced with dummy names or X-stad randomly
  area (+foreign for non-Swe ones) All other names with equivalent ones (i.e. cities with other cities)
  country Sverige is not replaced; other countries replaced with X-land
  geo X-geoplats
  place fake name or X-plats
  region fake name or X-region
  zip_code 00000
Institutions school replaced with X-skola randomly
  other_institution replaced with X-institution randomly
Transportation transport_name replaced with X-linjen randomly
Age age_digits replaced with a random number in the +-2 span from the actual number
Dates date_digits 11/11/1111
  day replaced randomly
  month_digit 11/11
  month_word randomly
  year +-2
  e-mail email@dot.com
  license_nr (e.g. cars) ABS 000
  phone_nr 0000-000000
  person_nr 123456-000
  url url@com
  Zip-code 000 00
Sensitive edu (education) Markup only, no replacement
  prof (profession)
  fam (family members)
  sensitive (free text with hints on ethnic & sexual info, religious & political views)
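The "+-2" replacements for age_digits and year can be illustrated as follows. This is a sketch of the idea only, not the SweLL kiosk implementation: whether the original number itself may be returned is not stated above and is assumed to be allowed here, and the function name `shift_number` is hypothetical.

```python
import random


def shift_number(value, span=2, rng=random):
    """Return a random integer within +-span of the actual number.

    Assumption: the offset is drawn uniformly from -span..+span, zero included.
    """
    return value + rng.randint(-span, span)


rng = random.Random(0)  # seeded only to make the example repeatable
pseudonymized_age = shift_number(34, rng=rng)
pseudonymized_year = shift_number(1987, rng=rng)
```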

3.2 Correction annotation codes

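Normalization guidelines: https://spraakbanken.github.io/swell-project/Normalization_guidelines

Correction annotation guidelines: https://gupea.ub.gu.se/handle/2077/69434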


To see an overview of the number of tags at various approximate levels, see Appendix in Volodina et al. (2021): https://arxiv.org/pdf/2105.06681.pdf

O misspelling  
O-Cap capitalization  
O-Comp compounding  
LEXIS (derivational morphology included)    
L any other lexical problem (not represented in the data)
L-Der wrong derivation mechanism used  
L-FL mix of foreign language in Swedish  
L-Ref reference error  
L-W wrong word choice  
MORPHOLOGY (inflectional) ≈ PHRASE LEVEL*    
M-Adj/adv adjective instead of adverb  
M-Case case problem  
M-Def definiteness problem  
M-F problem on form / morphology level  
M-Gend problem with gender  
M-Num problem with number  
M-Other any other, not listed, problem on morphology level  
M-Verb any problem on the verb or verb phrase level  
S-Adv word order, sentence adverbial placement  
S-Clause clause vs phrase level  
S-Comp phrase vs compound structure  
S-Ext extensive change  
S-FinV word order, verb placement  
S-M missing word  
S-MSubj missing subject  
S-Other any other problem on the syntactic level  
S-R word redundant (i.e. removed in the target)  
S-Type change of construction type/phrase level  
S-WO any other word order problem  
P general miss on punctuation (not represented in the data)
P-M missing  
P-R redundant  
P-Sent sentence segmentation  
P-W wrong  
C consistency change  
Cit-FL use of foreign language for citation  
Com! comment on an essay level  
OBS! comment on a token level  
X unintelligible  
Unid unidentified (not represented in the data)
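Most correction codes consist of a family prefix and an optional subtype separated by a hyphen (e.g. M-Def, S-FinV), so tags can be grouped by prefix for a quick overview. A minimal sketch follows; the function name `tag_family` and the sample tag list are hypothetical, and codes without a hyphen (e.g. O, X, Com!) are treated as their own family.

```python
from collections import Counter


def tag_family(tag):
    """Return the part of a correction code before the first hyphen, e.g. 'M' for 'M-Def'."""
    return tag.split("-", 1)[0]


# Hypothetical list of correction tags collected from one essay.
tags = ["M-Def", "M-Num", "S-FinV", "O", "O-Cap", "L-W", "P-M", "S-M"]
family_counts = Counter(tag_family(tag) for tag in tags)
```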