A subcorpus in the SweLL-pilot collection
(Elena Volodina, August 17, 2021)
2. Metadata description for the TISUS subcorpus
Total nr essays: 105 (June 19, 2021)
Nr sentences: 3 422
Nr tokens: 60 632, incl. punctuation and NL-tokens (see below)
Levels represented (CEFR scale): B2, C1
No manual coding/labeling was performed, except very basic anonymization control
NOTE: NL-tokens (i.e. New Line breaks) are used to preserve line breaks in the original learner writings. NL-tokens do not correspond to any punctuation. Each NL-token is added to the beginning of the sentence that follows the line break and is counted towards running tokens.
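Because NL-tokens count towards running tokens, token totals differ depending on whether they are included. The sketch below illustrates the difference under stated assumptions: sentences are represented as plain lists of token strings, and the marker string `"<NL>"` is a hypothetical placeholder, not the actual representation used in the TISUS files.

```python
# Hypothetical placeholder for an NL-token; the real marker in the
# distributed files may look different.
NL_MARKER = "<NL>"


def count_tokens(sentences, include_nl=True):
    """Count running tokens in a list of tokenized sentences.

    With include_nl=True, NL-tokens are counted like any other token,
    which matches how the corpus statistics are reported.
    """
    total = 0
    for sentence in sentences:
        for token in sentence:
            if include_nl or token != NL_MARKER:
                total += 1
    return total


# The NL-token sits at the beginning of the sentence that follows the break.
example = [
    ["Hej", "!"],
    [NL_MARKER, "Jag", "heter", "Anna", "."],
]
with_nl = count_tokens(example)
without_nl = count_tokens(example, include_nl=False)
```

Here `with_nl` and `without_nl` differ by exactly the number of NL-tokens in the example.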
TISUS is a subcorpus of the SweLL-pilot collection of learner essays written by adult immigrants learning Swedish in Sweden. The essays were collected from an official exam, Test In Swedish for University Studies (TISUS), on one occasion in 2006. The setting of the essay collection gave the subcorpus its name. Learners filled in consent forms and allowed the use of their essays and personal metadata for research.
Essays were collected and transcribed on the initiative of individual exam assessors at Stockholm University, Sweden. All later work on linguistic annotation, file format conversions and CEFR mappings was funded by Språkbanken and the Center for Language Technology (CLT), both at the University of Gothenburg, Sweden, during 2012-2016. Part of the work on metadata harmonization was carried out later within the L2 profiles project and the SweLL infrastructure project.
For approved users, TISUS is available through the Korp search interface and as a full dataset via a link sent upon approval. To get access, apply using this form: https://sunet.artologik.net/gu/swell.
The TISUS subcorpus is maintained at the University of Gothenburg, Språkbanken Text (https://spraakbanken.gu.se), as part of the SweLL infrastructure.
(Version from 2016) Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. https://arxiv.org/pdf/1604.06583.pdf
(Version from 2021) The above article + updated statistics in this documentation. The major difference between the versions from 2016 and 2021 is that all attributes and the corresponding values (e.g. level, language names, etc.) have been harmonized with several other learner subcorpora in the SweLL collection, so that several of them can be searched at the same time.
Personal data management: Essays were collected with the learners' consent. The consent allows the use of essays for research by registered (approved) users. Handwritten essays were transcribed and manually anonymized by trained L2 specialists.
Mode/format: All essays were written in an exam setting. Most of the essays were written by hand and were transcribed later by L2 specialists.
Time constraints: 150 minutes
No access to extra materials was allowed except handouts. However, no handouts are made available to corpus users (see details in Task Metadata).
CEFR levels are indicative only. They were assigned by mapping grades on the written assignment according to the following scheme: failed (i.e. grades 1-2) => B2, grades 3-5 => C1. This mapping rests on the assumption that, since the test was announced as equivalent to CEFR level C1, all “passed” students on the written assignment had competence of that level or higher. A more rigorous re-grading is left for future work.
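The mapping scheme above can be written as a small function. This is a minimal sketch, assuming the written-assignment grade is available as an integer from 1 to 5; the function name is illustrative and not part of any SweLL tooling.

```python
def cefr_from_written_grade(grade: int) -> str:
    """Map a TISUS written-assignment grade (1-5) to an indicative CEFR level."""
    if grade not in range(1, 6):
        raise ValueError(f"grade must be between 1 and 5, got {grade!r}")
    # Grades 1-2 count as failed and map to B2; grades 3-5 map to C1.
    return "B2" if grade <= 2 else "C1"


levels = [cefr_from_written_grade(g) for g in (1, 2, 3, 4, 5)]
```

The result is only as reliable as the mapping itself, which the documentation describes as indicative.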
To get more information on access to TISUS and other SweLL corpora, see the webpage: https://spraakbanken.gu.se/en/projects/swell/swell4users and https://spraakbanken.gu.se/projekt/swell/l2korp
Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. https://arxiv.org/pdf/1604.06583.pdf
(Not available through Korp)
Based on the “Core metadata for learner corpora: draft 1.0, December 2017” by Sylviane Granger and Magali Paquot
Administrative info | |
---|---|
Corpus title | TISUS, a subcorpus in SweLL-pilot |
Distributor | Språkbanken Text (https://spraakbanken.gu.se/), SweLL-infrastructure component (swell@svenska.gu.se) |
Availability | free of charge; access regulated by the GDPR restrictions (application form: https://sunet.artologik.net/gu/swell) |
License | CLARIN-ID, -PRIV, -NORED, -BY (explanations: https://www.kielipankki.fi/support/clarin-eula/#res) |
Edition | version 1 |
Character encoding | UTF-8 |
Markup language / file formats | Files are distributed in three formats: xml, json, raw texts |
Corpus design info | |
L2 target language | Swedish; exam taken in Sweden |
L1 (mother tongue) | multiple; represented as ISO codes and usual names |
Period of collection | 2006 |
Corpus size | 105 essays; 3 422 sentences; 60 632 tokens (incl punctuation) |
Corpus mode | written language: essays collected from an official exam |
Annotation | transcription, anonymization control |
Transcription guidelines | see article: http://arxiv.org/pdf/1604.06583v1.pdf |
Written versions | One version of each essay (no multiple submissions of the same text) |
Longitudinal | no |
Proficiency levels | (approximate) CEFR grades represented: B2, C1 (see 1.2 above) |
Proficiency level type | Text based |
Official language testing | yes |
Comparison data (L1 source) | N/A |
Corpus annotation info | |
Manual annotation | anonymization control, CEFR mapping |
Automatic annotation | yes, using SPARV pipeline: https://spraakbanken.gu.se/en/tools/sparv/annotations |
part-of-speech tagging, incl morpho-syntactic description: https://spraakbanken.gu.se/korp/markup/msdtags.html | |
lemmatization, incl word sense disambiguation and multi-word identification, SALDO-based | |
dependency parsing: https://cl.lingfil.uu.se/~nivre/swedish_treebank/dep.html | |
Correction annotated | no |
Explanatory term [attribute name in Korp / attribute name in the xml file]
General | |
---|---|
• Student ID [student ID / student_id] | 105 unique students, e.g. T16. Letter prefix (for a school) + a running number |
• Age in 5-year intervals [age / age] | A 5-year age interval indicating the age at the moment of writing an essay, e.g. 16-20. Age groups in TISUS: 16-45 |
• Birth year in 5-year intervals [birth year / birthyear_interval] | e.g. 1960-1964; birth year spans between 1960 and 1989 are represented |
• Gender [gender / gender] | Kvinna, Man |
• Time in Sweden (sum in months) [residence / time_in_sweden] | spans between 0-264 months |
• Native language(s) [native language / iso_l1] | 32 unique languages in combinations of 1-2 languages |
For ease of interpretation, the full name of the language is provided as well in Korp, in the xml files and in the metadata Excel file | |
• Education background [education level / edu_level] | 1 = 0-6 years of schooling (including elementary school) => 3 |
2 = 7-9 years (including high school) => 28 | |
3 = 10-13 years (including upper-secondary education) => 30 | |
4 = 14+ years (including university education) => 20 | |
5 = ??? => 24 | |
• Exam [exam / exam] | Educational certificate, if available. Free text. N/A for TISUS |
• Writing language [writing language / writing_language] | The best language in writing. N/A for TISUS |
Explanatory term [attribute name in Korp / attribute name in the xml file]
Descriptive metadata | |
---|---|
• Task ID [task id / task_id] | Task ID in the corpus. A letter prefix (for a school) + T(ask) + a running number, e.g. TT1. One unique task for all essays |
• Datum [date / datum] | Automatic Korp value, e.g. “2014-01-01”; in this case a derivative of task_date. Datum is used by the Korp search engine to create trend diagrams. |
• Task date [task date / task_date] | The year and week (year-week) when the essay was written, e.g. 2006-W20 (TISUS data collected from one exam occasion) |
• Course type / school form [school type / school_type] | N/A |
• Course level [course subject / course_subject] | N/A |
Grading principles of the TISUS components | Several scales have been used on the different parts of the test; later mapped to the CEFR levels for the written proficiency (see 1.2 above) |
• Result on the writing assignment [written proficiency / written_result] | Grades 1-5. Grading principles for the written assignment: 5 = 2 / 8.5 = 3 / 11 = 4 / 14 = 5 |
• Reading comprehension 1 / LF1 [reading comprehension (result), part 1 / lf1_result] | Result on the reading comprehension test 1; max points => 27 |
• Reading comprehension 2 / LF2 [reading comprehension (result), part 2 / lf2_result] | Result on the reading comprehension test 2; max points => 15 |
• Reading comprehension, sum [reading comprehension (sum) / lf_sum] | The sum of the two reading comprehension tests; LF values have been multiplied by two different indices: 1.48 for LF1 and 1.33 for LF2 |
Grading principles (grades 1 (lowest)-5 (best)) for reading comprehension based on LF1 and LF2 are: 20 = 2 / 40 = 3 / 48 = 4 / 55 = 5 | |
• Oral proficiency [oral proficiency / oral_result] | Result on the spoken part of the test, the grading scale 1 (lowest) - 5 (best) |
• Final grade [final grade / final_grade] | Grade on the whole TISUS test: Godkänd (=pass) or Underkänd (=fail) |
• Grading scale for the essays [grading scale / grading_scale] | (mapped to) CEFR (B1-C2) |
Writing task details | |
• Task type [task type / task_type] | Behörighetsprov |
• Task - format [task format / task_format] | Handwritten |
• Task duration (in minutes) | 150 minutes |
• Text type / genre [genre / text_types] | Argumenterande |
• Task instructions | N/A for TISUS |
• Allowed aids | None except handouts |
• Tasks - subject / topic [task_subject / subject] | E.g. Stress i dagenssamhälle. One task subject for all essays |
• Lessontext_topic [lessontext_topic / lessontext_topic] | Topics in accordance with COCTAILL taxonomy https://www.aclweb.org/anthology/W14-3510.pdf. N/A for TISUS |
• Task_url | N/A for TISUS |
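
The weighted reading comprehension sum described in the table above can be sketched as follows. The weights (1.48 for LF1, 1.33 for LF2) come from the table. Reading the grading principles "20 = 2 / 40 = 3 / 48 = 4 / 55 = 5" as minimum weighted points per grade, with grade 1 for anything below 20, is an assumption. The function names are illustrative only.

```python
LF1_WEIGHT = 1.48  # index applied to the LF1 result (max 27 raw points)
LF2_WEIGHT = 1.33  # index applied to the LF2 result (max 15 raw points)

# Assumed interpretation: minimum weighted points required for each grade.
GRADE_THRESHOLDS = [(55, 5), (48, 4), (40, 3), (20, 2)]


def reading_sum(lf1: float, lf2: float) -> float:
    """Weighted sum of the two reading comprehension results."""
    return lf1 * LF1_WEIGHT + lf2 * LF2_WEIGHT


def reading_grade(lf_sum: float) -> int:
    """Grade 1 (lowest) to 5 (best) under the assumed threshold reading."""
    for min_points, grade in GRADE_THRESHOLDS:
        if lf_sum >= min_points:
            return grade
    return 1


max_sum = reading_sum(27, 15)  # both tests at their maximum raw score
max_grade = reading_grade(max_sum)
```

With both tests at their maximum, the weighted sum comes out just under 60, which is consistent with 55 being the threshold for the top grade.
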
Explanatory term [attribute name in Korp / attribute name in the xml file]
Descriptive metadata | |
---|---|
• Essay ID [essay ID / essay_id] | Essay ID consists of a student ID (e.g. T1) + task ID (e.g. TT1) == T1TT1 |
• Essays per student [uppsatser/student / nr_essay_student] | numbers 1-5, indicating “recurrent students” if the value is 2 or above. No recurrent students in TISUS |
• CEFR level [proficiency level / cefr_level] | B2 => 32 |
C1 => 73 | |
• Grade [grade / grade] | In TISUS, grade and CEFR level are identical |
• Full text [full text / svala_link] | A link to the full essay that opens in SVALA annotation tool |
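
The essay ID scheme in the table above is a plain concatenation of student ID and task ID. The sketch below builds an essay ID and splits it again. The regular expression is an assumption derived from the ID descriptions in this documentation (student ID = letter prefix + running number; task ID = letter prefix + T + running number) and has not been checked against the full dataset.

```python
import re

# Assumed shape: <letters><digits> for the student, <letters>T<digits> for the task.
ESSAY_ID_RE = re.compile(r"^([A-Z]+\d+)([A-Z]+T\d+)$")


def make_essay_id(student_id: str, task_id: str) -> str:
    """Concatenate student ID and task ID, e.g. T1 + TT1 -> T1TT1."""
    return student_id + task_id


def split_essay_id(essay_id: str) -> tuple:
    """Split an essay ID back into (student_id, task_id)."""
    match = ESSAY_ID_RE.match(essay_id)
    if match is None:
        raise ValueError(f"unexpected essay ID format: {essay_id!r}")
    return match.group(1), match.group(2)


essay_id = make_essay_id("T16", "TT1")
student_id, task_id = split_essay_id(essay_id)
```

Since TISUS has one task for all essays, the task part is the same for every essay ID in this subcorpus.
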
Explanatory term [attribute name in Korp / attribute name in the xml file]
Description | |
---|---|
School ID [source / source] | Letter T for TISUS |
a high-stakes exam that qualifies non-Swedish-speaking students for admission to the university | |
School-type for all essays is “universitetet” (not yet added as an attribute to the corpus, but planned for future updates) | |
Approximate level [approximate level / approximate_level] | Level assigned based on the type of course/exam, roughly split into A:Beginners, B:Intermediate and C:Advanced, with an additional marker “Fortsättning” (i.e. Continuation) added where necessary. Note that the two descriptors always appear in alphabetical order, thus sometimes taking the form “Fortsättning, Nybörjare” (i.e. Continuation, Beginner) |