(Elena Volodina, August 17, 2021)
2. Metadata description for the SweLL-gold subcorpus
3. Manual annotation in the SweLL-gold subcorpus
Total nr corr-annotated essays: 502 (SweLL-gold v.1.0). NOTE! In the SweLL corpus we use the term CORRECTION ANNOTATION instead of the more traditional ERROR ANNOTATION.
Nr sentences: 7 807
Nr tokens: 147 842 (incl. punctuation, original version)
Proficiency levels represented (based on the level of the courses of Swedish): A, B, C (A=Beginner, B=Intermediate, C=Advanced)
Manual coding/labeling: pseudonymization, normalization, correction annotation.
Inter-annotator agreement (IAA): 88% by Fleiss’ kappa and 76% by Krippendorff’s alpha. IAA was calculated on the basis of 10% of the essays (i.e. 50 essays) that were double-annotated with correction labels, and is counted based on annotated edges only.
NOTE the use of NL-tokens (i.e. New Line breaks) to preserve line breaks in the original learner writings. NL-tokens do not correspond to any punctuation, are added to the beginning of the sentence that follows them, and are counted towards running tokens.
Other subcorpora: Apart from the SweLL-gold subcorpus, the SweLL collection contains other (sub)corpora collected in other projects, e.g. TISUS (2007), SpIn, SW1203. Other (sub)corpora may be added to the SweLL collection.
The purpose of the SweLL infrastructure project was to set up an infrastructure for the collection, digitization, normalization, and annotation of learners' written production, and to make available a linguistically annotated corpus. Because a parallel normalized version is available, researchers can search for various types of linguistic structures without having to guess what such a structure might look like.
The SweLL infrastructure v1 consists of:
More information: https://spraakbanken.gu.se/en/projects/swell
The SweLL corpus is maintained at the University of Gothenburg, Språkbanken-Text https://spraakbanken.gu.se
For approved users, SweLL essays are available through the Korp search interface and as a full dataset via a link that is sent to approved users. To get access, apply using this form: https://sunet.artologik.net/gu/swell.
Read the article: https://nejlt.ep.liu.se/article/view/1374
Personal data management: Essays and personal metadata were collected with the consent of the learners. The consent allows the use of essays for research by registered (approved) users. Handwritten essays were transcribed in a secure encrypted environment (SweLL kiosk). All essays were manually pseudonymized (using SweLL kiosk) based on the Pseudonymization guidelines:
https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines
Mode: All essays were written as an exam or in a classroom, as placement, formative or final tests. Most of the essays were written by hand and were transcribed later according to the Transcription Guidelines:
https://spraakbanken.github.io/swell-project/Transcription_guidelines
Pseudonymized essays were normalized and corr-annotated by L2 specialists, as described in section 3.
Time constraints and access to allowed materials vary between tasks (see details in Task Metadata for each particular task).
Full text access via Korp is provided through a link to the SVALA annotation tool, where the full text opens. Note that the NL sign marks a new paragraph in the original student writing.
To get further information on access to the SweLL corpus, see the webpage:
https://spraakbanken.gu.se/en/projects/swell/swell4users and https://spraakbanken.gu.se/en/projects/swell/l2korp
Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue. https://nejlt.ep.liu.se/article/view/1374
Other articles can be found on the project website: https://spraakbanken.gu.se/en/projects/swell/
(Not available through Korp)
Based on the “Core metadata for learner corpora: draft 1.0, December 2017” by Sylviane Granger and Magali Paquot
Administrative info | |
---|---|
Corpus title | SweLL-gold, a corpus in a bigger SweLL collection |
Distributor | Språkbanken Text (https://spraakbanken.gu.se/), SweLL-infrastructure component (swell@svenska.gu.se) |
Availability | free of charge; access regulated by the GDPR restrictions (application form: https://sunet.artologik.net/gu/swell) |
License | CLARIN-ID, -PRIV, -NORED, -BY (explanations: https://www.kielipankki.fi/support/clarin-eula/#res) |
Edition | version 1 |
Character encoding | UTF-8 |
Markup language / file formats | Files are distributed in three formats: xml, json, raw texts |
Corpus design info | |
L2 target language | Swedish; courses taken in Sweden |
L1 (mother tongue) | multiple; represented as iso-codes and usual names |
Period of collection | 2017-2020 |
Corpus size | 502 essays; 7 807 sentences; 147 842 tokens, incl. punctuation (original version) |
Corpus mode | written language: essays collected from classroom/exam setting |
Annotation | transcription, pseudonymization, normalization, correction annotation, automatic linguistic annotation |
Written versions | One version of each essay (no multiple submissions of the same text) |
Longitudinal | some recurrent students appear, see section 2.4 (schools with IDs B, C, E, G, K) |
Proficiency levels | (approximate) development levels based on course level: |
A:Beginner –> Swedish for Immigrants (SFI) | |
B:Intermediate –> Grundläggande vuxenutbildning (SVA) | |
C:Advanced –> Gymnasiet; Universitetskurser; TISUS | |
Proficiency level type | Course-level based |
Official language testing | mixed sources, including some official testing, e.g. TISUS |
Comparison data (L1 source) | N/A |
Corpus annotation info | |
Manual annotation | yes, pseudonymization, normalization, correction annotation |
Transcription guidelines: https://spraakbanken.github.io/swell-project/Transcription_guidelines | |
Pseudonymization guidelines: https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines | |
Normalization guidelines: https://spraakbanken.github.io/swell-project/Normalization_guidelines | |
Correction annotation guidelines: https://gupea.ub.gu.se/handle/2077/69434 | |
Automatic annotation | yes, using SPARV pipeline: https://spraakbanken.gu.se/en/tools/sparv/annotations |
part-of-speech tagging, incl morpho-syntactic description: https://spraakbanken.gu.se/korp/markup/msdtags.html | |
lemmatization, incl word sense disambiguation and multi-word identification, SALDO-based | |
dependency parsing: https://cl.lingfil.uu.se/~nivre/swedish_treebank/dep.html | |
Correction annotated | yes, see section 3.2 |
Explanatory term [attribute name in Korp / attribute name in the xml file]
General | |
---|---|
Student ID [student ID / student_id] | 451 unique students, e.g. C16. Letter prefix (for a school) + a running number |
Age [age / age] | A 5-year age interval indicating the age at the moment of writing an essay, e.g. 31-35. Ages 16-70 are represented |
Birth year in 5-year intervals [birthyear_interval / birth year] | 1950-1954 -- 2000-2004 |
Gender [gender / gender] | Kvinna, Man, Annat, Vill inte säga |
Time in Sweden (sum in months) [residence / time_in_sweden] | 0 - 315 |
Native language(s) [native language / iso_l1] | 81 unique languages in 117 unique combinations of 1-4 languages |
For ease of interpretation, the full name of the language is also provided in the xml files and in the metadata Excel file | |
Education | |
Education background [education level / edu_level] | 1= 0-6 years of schooling (including elementary school) => 34 |
2= 7-9 years (including high school) => 56 | |
3= 10-13 years (including upper-secondary education) => 155 | |
4= 14+ years (including university education)=> 257 | |
Elementary education outside Sweden (nr months) | 12 - 156 |
Elementary education in Sweden (nr months) | 12 - 132 |
Introductory education in Sweden (nyanlända) (months) | 1 - 36 |
Gymnasial education outside Sweden (nr months) | 2 - 168 |
Gymnasial education in Sweden (nr months) | 1 - 72 |
Professional education outside Sweden (nr months) | 12 - 60 |
Professional education in Sweden (nr months) | 3 - 72 |
University education outside Sweden (nr months) | 12 - 228 |
University education in Sweden (nr months) | 6 - 96 |
Professional degree | Free text (e.g. Ekonom) |
Educational degree | Free text |
Education: additional comments | Free text |
Language information | |
Education in L1 (modersmålsundervisning/hemspråk, education in Sweden) | Language name |
• Characteristics below are available in the material, but not in the xml files or Korp | |
Length of education in L1, nr months (modersmålsundervisning/hemspråk) | 1 - 168 |
Swedish proficiency courses | Self-instruction, formal education |
Length of Swedish proficiency courses, nr months | 1 - 336 |
All known languages | List of languages |
Other known language(s) except mother tongue(s) | List of languages |
Best written language(s) [writing_language] | List of languages; not used for filtering in Korp |
Best spoken language(s) | Language name(s) |
Language(s) used with the family | Language name(s) |
Language(s) used with friends | Language name(s) |
Metacomment | Free comment added by an assistant |
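The 5-year intervals used for Age and Birth year in the table above can be reproduced with a few lines of Python. This is a minimal sketch, not the project's own code: the function names are hypothetical, and the interval boundaries are inferred only from the examples given here (31-35 for age, 1950-1954 for birth year).

```python
def age_interval(age: int) -> str:
    """Map an age to a 5-year interval label such as '31-35'.

    Assumption: intervals start at 1, 6, 11, ... (inferred from the
    example '31-35' and the represented range 16-70).
    """
    if age < 1:
        raise ValueError("age must be a positive integer")
    lower = ((age - 1) // 5) * 5 + 1
    return f"{lower}-{lower + 4}"


def birthyear_interval(year: int) -> str:
    """Map a birth year to a 5-year interval label such as '1950-1954'.

    Assumption: intervals start at years divisible by 5 (inferred from
    the range '1950-1954 -- 2000-2004').
    """
    lower = (year // 5) * 5
    return f"{lower}-{lower + 4}"
```

For example, `age_interval(33)` gives the label `31-35`, and `birthyear_interval(1987)` gives `1985-1989`.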
Administrative Metadata | |
---|---|
Task ID | Task ID in the corpus. A letter prefix (for a school) + T(ask) + a running number, e.g. AT14. Total of 44 unique tasks |
Semester (time span) | VT-2018, HT-2018, VT-2019, HT-2019, VT-2020 |
Task date [task date / task_date] | Year-week, e.g. 2018-W20 |
Datum [datum] | Automatic Korp value, e.g. “2014-01-01”; in this case, a derivative of task_date. Datum is used by the Korp search engine to create trend diagrams. |
Course type / school type [school_type] | A generic description of the type of school/education where the essay was collected |
Ungdomsgymnasiet | |
Universitetet | |
Vuxenutbildningen | |
Course level / course subject [course_subject] | Behörighetsgivande kurs |
Förberedande kurs | |
Grundläggande SVA dk3 | |
Inplaceringsprov SFI | |
SFI B / C / D | |
SVA 2 / 3 | |
TISUS | |
Grading scale [grading_scale] | A-F |
G/U | |
SFI inplacering | |
Uppgiften har inte betyg | |
Writing task details | |
Task type | Behörighetstest |
Formativ skrivuppgift | |
Inplaceringsprov | |
Mitterminsprov | |
Slutprov | |
Test inför NP (Nationella Prov) | |
Task - format / mode [task_format] | Mode of the essay writing: Handskriven, Digital |
Task duration (in minutes) | 25--180 |
Text type / genre | Argumenterande |
Berättande | |
Beskrivande | |
Förklarande | |
Informellt mejl | |
Instruerande | |
Resonerande | |
Utredande | |
Återgivande | |
Task instructions | Free text alt. reference to an attachment |
Allowed aids | Bilingual dictionary, Monolingual dictionary, Internet, etc. |
Additional material | Free text comment |
Additional comments | Free text comment |
Additional comment on coursebooks used | Book title(s) / free text |
Approximate level | Approximate mapping of the course type to the level of proficiency. |
Nybörjare | |
Fortsättning | |
Avancerad | |
additional marker “Fortsättning” (i.e. Continuation) added where necessary to the other descriptors. Note that the two descriptors always appear in alphabetical order, thus sometimes taking the form “Fortsättning, Nybörjare” (i.e. Continuation, Beginner) | |
Tasks - subject / topic [task_subject] | A topic of the essay, e.g. |
1. Din första arbetsplats alt. Kvinnors arbete; 2. Mejl till kusin på besök i Sverige | |
1. När jag var liten och gick till skolan första gången; 2. Mejl till kusin på besök i Sverige | |
Argumenterande text om arbetsmoral | |
Argumenterande text/brev | |
Berätta hur du bor! | |
Berätta utifrån texten "Alma berättar" | |
Beskriv - En god relation | |
Beskriva - Min första kärlek | |
Brott - orsaker och konsekvenser | |
Demokratiska val - hur gammal ska man behöva vara för att rösta och varför? | |
Diskuterande text om pengars betydelse | |
En kulturupplevelse | |
En plats du tycker om | |
En viktig plats | |
Enkel utredande text om litterära teman | |
Familjen | |
Ge tips och råd | |
Ge tips och råd - en anställningsintervju | |
Insändare | |
Kommunikation och sociala medier | |
Mejl till en vän | |
Mina första intryck | |
Objektivt utredande uppgift | |
Om din bostad och om att bo | |
Referat av texten "Giftermål ett större steg än barn" | |
Skriv en insändare | |
Skriv ett brev | |
Skriv ett mail | |
Skriv om en känd person | |
Två sätt att uppfostra | |
Utredande text (pm) övning inför NP | |
Världens lyckligaste länder | |
Övnings-pm inför NP | |
etc | |
Topic domain [lessontext_topic] | Not available for SweLL-gold; present in some SweLL-pilot corpora and in COCTAILL. Topics in accordance with COCTAILL taxonomy < https://www.aclweb.org/anthology/W14-3510.pdf > |
Task_url | In certain cases where handouts were used, attachments are available at the urls. |
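The task_date attribute above uses a year-week notation (e.g. 2018-W20). If a calendar date is needed for sorting or plotting, one option is to convert it with Python's standard library. This is a sketch under stated assumptions: the function name is hypothetical, the value is assumed to follow ISO 8601 week numbering, and picking the Monday of the week is an arbitrary choice; how Korp actually derives Datum from task_date is not specified here.

```python
import datetime
import re


def parse_task_date(task_date: str) -> datetime.date:
    """Convert a 'YYYY-Www' value (e.g. '2018-W20') to a calendar date.

    Assumptions: ISO 8601 week numbering; the Monday of the week is
    returned as a representative day. Requires Python 3.8+.
    """
    match = re.fullmatch(r"(\d{4})-W(\d{1,2})", task_date)
    if match is None:
        raise ValueError(f"not a year-week value: {task_date!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Day 1 of an ISO week is Monday.
    return datetime.date.fromisocalendar(year, week, 1)
```

With this sketch, `parse_task_date("2018-W20")` returns the date 2018-05-14.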
Explanatory term [attribute name in Korp / attribute name in the xml file]
Descriptive metadata | |
---|---|
• Essay ID [essay ID / essay_id] | Essay ID consists of a student ID (e.g. A1) + task ID (e.g. AT1) == A1AT1 |
• Essay per student [uppsatser/student / nr_essay_student] | numbers 1-5, indicating “recurrent students” if the value is 2 or above. For example, “2” means that there are 2 essays written by that particular student |
1 essay per student => 177 | |
2 essays per student => 151 | |
3 essays per student => 92 | |
4 essays per student => 63 | |
5 essays per student => 19 | |
• Grade [grade / grade] | Where available, is indicated for each task and student |
• Full text [full text / svala_link] | A link to the full essay that opens in SVALA annotation tool |
Additional attributes | A few attributes that are present in other SweLL subcorpora, and have been added to each SweLL subcorpus for the sake of interoperability |
• Result on the writing assignment [written proficiency / written_result] | TISUS-attribute |
• Reading comprehension 1 / LF1 [reading comprehension (result), part 1 / lf1_result] | TISUS-attribute |
• Reading comprehension 2 / LF2 [reading comprehension (result), part 2 / lf2_result] | TISUS-attribute |
• Reading comprehension, sum [reading comprehension (sum) / lf_sum] | TISUS-attribute |
• Oral proficiency [oral proficiency / oral_result] | TISUS-attribute |
• Final grade [final grade / final_grade] | TISUS-attribute |
Proficiency level [proficiency level (CEFR) / cefr_level] | Present in some of the SweLL subcorpora (e.g. TISUS, SW1203, SpIn), but absent in SweLL-gold v1 |
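Since an essay ID is the concatenation of a student ID and a task ID (A1 + AT1 == A1AT1), it can be split back into its parts, for instance to group essays by student or by task. The sketch below is illustrative only: the names are hypothetical, and it assumes that the school prefix is a single uppercase letter in both IDs, as in the examples given in this document.

```python
import re
from typing import NamedTuple


class EssayId(NamedTuple):
    student_id: str  # e.g. "A1": school letter + running number
    task_id: str     # e.g. "AT1": school letter + "T" + running number


# Assumption: single uppercase school letter in both parts.
_ESSAY_ID = re.compile(r"(?P<student>[A-Z]\d+)(?P<task>[A-Z]T\d+)")


def split_essay_id(essay_id: str) -> EssayId:
    """Split an essay ID such as 'A1AT1' into student ID and task ID."""
    match = _ESSAY_ID.fullmatch(essay_id)
    if match is None:
        raise ValueError(f"unexpected essay ID format: {essay_id!r}")
    return EssayId(match.group("student"), match.group("task"))
```

For example, `split_essay_id("A1AT1")` yields the student ID `A1` and the task ID `AT1`.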
School listing | Description | |
---|---|---|
• Source [not available in Korp for filtering, but present in the xml files] | a letter indicating the school where the essays were collected, see the listing below | |
A | Vuxenutbildningscentrum | Inplacering SFI-utbildning: A-D |
B | Gymnasieskola | SVA |
C | Komvux/SFI | SFI A-D |
E | Behörighetsgivande kurser / Universitetet | motsv gymnasiet |
F | Behörighetsgivande kurser / Universitetet | motsv gymnasiet |
G | SFI-provet | SFI A-D |
H | TISUS-prov | motsv gymnasiet |
J | Grundläggande vuxenutbildning | SVA dk 1-4 |
K | Grundläggande SVA | SVA dk 1-4 |
L | SVA-kurser på gymnasienivå | vuxen utbildning |
M | Prov för antagning och inplacering till förberedande respektive behörighetsgivande kurser / Universitetet |
Pseudonymization guidelines:
https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines
Category type | Codes | Pseudonym / details |
---|---|---|
Names | firstname_male | replaced by an equivalent |
firstname_female | ||
firstname_unknown | ||
initials | ||
middlename | ||
surname | ||
Geographical data | city (+foreign for non-Swedish ones) | Swedish names replaced with dummy names or X-stad randomly |
area (+foreign for non-Swe ones) | All other names with equivalent ones (i.e. cities with other cities) | |
country | Sverige is not replaced; other countries replaced with X-land | |
geo | X-geoplats | |
place | fake name or X-plats | |
region | fake name or X-region | |
street_nr | ||
zip_code | 00000 | |
Institutions | school | replaced with X-skola randomly |
work | ||
other_institution | replaced with X-institution randomly | |
Transportation | transport_name | replaced with X-linjen randomly |
transport_nr | ||
Age | age_digits | replaced with a random number in the +-2 span from the actual number |
age_string | ||
Dates | date_digits | 11/11/1111 |
day | replaced randomly | |
month_digit | 11/11 | |
month_word | randomly | |
year | +-2 | |
Miscellaneous | ||
account_nr | ||
email\@dot.com | ||
extra | ||
license_nr (e.g. cars) | ABS 000 | |
other_nr_seq | ||
phone_nr | 0000-000000 | |
person_nr | 123456-000 | |
url | url\@com | |
Zip-code | 000 00 | |
Sensitive | edu (education) | Markup only, no replacement |
prof (profession) | ||
fam (family members) | ||
sensitive (free text with hints on ethnic & sexual info, religious & political views) | ||
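The numeric replacements in the table above (age_digits and year, both shifted within a +-2 span) can be illustrated with a short Python sketch. It is not the SweLL kiosk implementation: the function name is hypothetical, and it assumes that the span is inclusive and that the original value itself may be drawn.

```python
import random
from typing import Optional


def pseudonymize_number(value: int, span: int = 2,
                        rng: Optional[random.Random] = None) -> int:
    """Replace a number with a random one within +-span of the original.

    Assumptions: the span is inclusive on both ends and an offset of 0
    (keeping the original value) is allowed.
    """
    if rng is None:
        rng = random.Random()
    return value + rng.randint(-span, span)
```

For an actual age of 35, the result is one of 33, 34, 35, 36 or 37. Passing a seeded `random.Random` instance makes the replacement reproducible.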
Normalization guidelines:
https://spraakbanken.github.io/swell-project/Normalization_guidelines
Correction annotation guidelines:
https://gupea.ub.gu.se/handle/2077/69434
To see an overview of the number of tags at various approximate levels, see Appendix in Volodina et al. (2021): https://arxiv.org/pdf/2105.06681.pdf
Description | ||
---|---|---|
ORTHOGRAPHY | ||
O | misspelling | |
O-Cap | capitalization | |
O-Comp | compounding | |
LEXIS (derivational morphology included) | ||
L | any other lexical problem | (not represented in the data) |
L-Der | wrong derivation mechanism used | |
L-FL | mix of foreign language in Swedish | |
L-Ref | reference error | |
L-W | wrong word choice | |
MORPHOLOGY (inflectional) ≈ PHRASE LEVEL* | ||
M-Adj/adv | adjective instead of adverb | |
M-Case | case problem | |
M-Def | definiteness problem | |
M-F | problem on form / morphology level | |
M-Gend | problem with gender | |
M-Num | problem with number | |
M-Other | any other, not listed, problem on morphology level | |
M-Verb | any problem on the verb or verb phrase level | |
SYNTAX ≈ CLAUSE LEVEL | ||
S-Adv | word order, sentence adverbial placement | |
S-Clause | clause vs phrase level | |
S-Comp | phrase vs compound structure | |
S-Ext | extensive change | |
S-FinV | word order, verb placement | |
S-M | missing word | |
S-MSubj | missing subject | |
S-Other | any other problem on the syntactic level | |
S-R | word redundant (i.e. removed in the target) | |
S-Type | change of construction type/phrase level | |
S-WO | any other word order problem | |
PUNCTUATION | ||
P | general miss on punctuation | (not represented in the data) |
P-M | missing | |
P-R | redundant | |
P-Sent | sentence segmentation | |
P-W | wrong | |
OTHER | ||
C | consistency change | |
Cit-FL | use of foreign language for citation | |
Com! | comment on an essay level | |
OBS! | comment on a token level | |
X | unintelligible | |
Unid | unidentified | (not represented in the data) |
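When counting correction tags, it is often convenient to aggregate them by the top-level categories of the taxonomy above. The following is a minimal sketch with hypothetical names; the mapping is read directly off the table (the part before the first hyphen identifies the category, except for the tags listed under OTHER).

```python
# Top-level categories keyed by the tag prefix before the first hyphen.
CATEGORY_BY_PREFIX = {
    "O": "ORTHOGRAPHY",
    "L": "LEXIS",
    "M": "MORPHOLOGY",
    "S": "SYNTAX",
    "P": "PUNCTUATION",
}

# Tags listed under OTHER in the table; they do not follow the prefix scheme.
OTHER_TAGS = {"C", "Cit-FL", "Com!", "OBS!", "X", "Unid"}


def tag_category(tag: str) -> str:
    """Return the top-level category for a correction tag, e.g. 'S-FinV' -> 'SYNTAX'."""
    if tag in OTHER_TAGS:
        return "OTHER"
    prefix = tag.split("-", 1)[0]
    if prefix not in CATEGORY_BY_PREFIX:
        raise ValueError(f"unknown correction tag: {tag!r}")
    return CATEGORY_BY_PREFIX[prefix]
```

For example, `tag_category("M-Adj/adv")` returns `MORPHOLOGY` and `tag_category("Cit-FL")` returns `OTHER`.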