swell-release-v1

Metadata description: SW1203 (sub)corpus

A subcorpus in the SweLL-pilot collection

(Elena Volodina, August 17, 2021)


Contents

1. General information

2. Metadata description for the SW1203 subcorpus


1. General information

Total nr essays: 141 (June 19, 2021)

Nr sentences: 3 145

Nr tokens: 52 518, incl. punctuation

Levels represented (CEFR scale): B1, B2, C1, C2

No manual coding/labeling was performed, except very basic anonymization

NOTE the use of NL-tokens (i.e. New Line breaks) to preserve line breaks in the original learner writings. NL-tokens do not correspond to any punctuation, are added to the beginning of the sentence that follows the line break, and are counted towards running tokens.
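Because NL-tokens count towards running tokens, token totals computed with and without them will differ. The following is a minimal sketch of that difference, not the project's own counting code. It assumes sentences are already available as lists of token strings, and the marker string "NL" is a hypothetical placeholder; the actual form of the NL-token in the distributed files may differ.

```python
# Hypothetical marker string; the actual NL-token form in the files may differ.
NL_TOKEN = "NL"


def count_tokens(sentences, include_nl=True):
    """Count running tokens in a list of tokenized sentences.

    With include_nl=False, NL-tokens are skipped, which gives the number
    of word and punctuation tokens only.
    """
    return sum(
        1
        for sentence in sentences
        for token in sentence
        if include_nl or token != NL_TOKEN
    )


# Two toy sentences; the second one starts on a new line in the original
# essay, so (per the note above) it begins with an NL-token.
sentences = [
    ["Hej", "!"],
    [NL_TOKEN, "Jag", "heter", "Anna", "."],
]

total_with_nl = count_tokens(sentences)
total_without_nl = count_tokens(sentences, include_nl=False)
```

Counts reported in this documentation include NL-tokens, so a comparison with corpora that do not mark line breaks would need the second variant.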


1.1 Description of the project

SW1203 is a subcorpus of the SweLL-pilot collection of learner essays written by adult immigrants learning Swedish in Sweden. The essays were collected from preparatory (language) courses for prospective university students during the academic year 2012-2013. (Almost) every student wrote three (3) essays: an entrance exam, a mid-term exam and a final exam. The setting of the essay collection gave this subcorpus its name (SWedish-year-12-nr.essays-03). Learners filled in consent forms permitting the use of their essays and personal metadata for research.

Essays were collected on the initiative of individual university teachers, who performed the transcription and anonymization. All later work on linguistic annotation, file format conversions and CEFR grading was funded by Språkbanken and Center for Language Technology (CLT), both at the University of Gothenburg, Sweden, during 2012-2016. Part of the work on metadata harmonization was carried out later as part of the L2 profiles project and the SweLL infrastructure project.

For approved users, SW1203 is available through the Korp search interface and as a full dataset via a link sent after approval. To get access, apply using this form: https://sunet.artologik.net/gu/swell.

The SW1203 corpus is maintained at the University of Gothenburg, Språkbanken Text (https://spraakbanken.gu.se), as part of the SweLL infrastructure.


1.2 Description of the SW1203 subcorpus

(Version from 2016) Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. https://arxiv.org/pdf/1604.06583.pdf

(Version from 2021) The above article + updated statistics in this documentation. The major difference between the versions from 2016 and 2021 is that 51 additional essays have been added. In addition, all attributes and their values (e.g. level, language names, etc.) have been harmonized with several other learner subcorpora in the SweLL collection, so that several of them can be searched at the same time.

Personal data management: Essays were collected with the learners' consent. The consent allows the use of essays for research by registered (approved) users. Handwritten essays were transcribed and manually anonymized by trained L2 specialists.

Mode/format: All essays were written in an exam setting. Most of the essays were written by hand and were later transcribed according to the Transcription guidelines (see Volodina et al., 2016).

Time constraints: 3.5 hours. No access to extra materials was allowed except handouts (see details in Task metadata).

The CEFR level of each essay was assigned independently by two CEFR experts; see the grading document <https://spraakbanken.gu.se/sites/spraakbanken.gu.se/files/Bedomningar_SW1203.pdf>

For more information on access to SW1203 and other SweLL corpora, see the webpages https://spraakbanken.gu.se/en/projects/swell/swell4users and https://spraakbanken.gu.se/projekt/swell/l2korp


1.3 To cite the SW1203 subcorpus

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. https://arxiv.org/pdf/1604.06583.pdf


2. Metadata description for the SW1203 subcorpus


2.1 Administrative information

(Not available through Korp)

Based on the “Core metadata for learner corpora: draft 1.0, December 2017” by Sylviane Granger and Magali Paquot

Administrative info  
Corpus title SW1203, a subcorpus in SweLL-pilot
Distributor Språkbanken Text (https://spraakbanken.gu.se/), SweLL-infrastructure component (swell@svenska.gu.se)
Availability free of charge; access regulated by the GDPR restrictions (application form: https://sunet.artologik.net/gu/swell)
License CLARIN-ID, -PRIV, -NORED, -BY (explanations: https://www.kielipankki.fi/support/clarin-eula/#res)
Edition version 1
Character encoding UTF-8
Markup language / file formats Files are distributed in three formats: xml, json, raw texts
Corpus design info  
L2 target language Swedish; courses taken in Sweden
L1 (mother tongue) multiple; represented as iso-codes and usual names
Period of collection 2012-2013
Corpus size 141 essays; 3 145 sentences; 52 518 tokens (incl. punctuation)
Corpus mode written language: essays collected from three exams during a period of one academic term
Annotation transcription, anonymization
Transcription guidelines see article: http://arxiv.org/pdf/1604.06583v1.pdf
Written versions One version of each essay (no multiple submissions of the same text)
Longitudinal yes, in a way: (almost) each student wrote three (3) essays during the same term/course
Proficiency levels CEFR grades represented: B1, B2, C1, C2
Proficiency level type Text based
Official language testing no
Comparison data (L1 source) N/A
Corpus annotation info  
Manual annotation anonymization, CEFR grading
Automatic annotation yes, using SPARV pipeline: https://spraakbanken.gu.se/en/tools/sparv/annotations
  part-of-speech tagging, incl morpho-syntactic description: https://spraakbanken.gu.se/korp/markup/msdtags.html
  lemmatization, incl word sense disambiguation and multi-word identification, SALDO-based
  dependency parsing: https://cl.lingfil.uu.se/~nivre/swedish_treebank/dep.html
Correction annotated no

2.2 Personal metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

General  
• Student ID [student ID / student_id] 54 unique students, e.g. W16. Letter prefix (for a school) + a running number
• Age in 5-year intervals [age / age] A 5-year age interval indicating the age at the moment of writing an essay, e.g. 16-20; age groups 16-50 are represented
• Birth year in 5-year intervals [birth year / birthyear_interval] e.g. 1960-1964; birth year spans between 1960 and 1994 are represented
• Gender [gender / gender] Kvinna, Man
• Time in Sweden (sum in months) [residence / time_in_sweden] N/A
• Native language(s) [native language / iso_l1] 27 unique languages in combinations of 1-3 languages
  For ease of interpretation, the full name of the language is provided as well in the xml files and in the metadata excel
• Education background [education level / edu_level] 1= 0-6 years of schooling (including elementary school) => 0
  2= 7-9 years (including high school) => 0
  3= 10-13 years (including upper-secondary education) => 54
  4= 14+ years (including university education) => 0
• Exam [exam / exam] Educational certificate, if available. Free text. N/A for SW1203
• Writing language [writing language / writing_language] The best language in writing. N/A for SW1203
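The 5-year intervals used for the age and birthyear_interval attributes can be illustrated with a small sketch. The bin alignment is an assumption inferred only from the examples given above (age "16-20", birth year "1960-1964"); the function names are hypothetical and are not part of any SweLL tooling.

```python
def age_interval(age):
    """Map an age to a 5-year interval aligned like the example "16-20".

    Assumption: intervals run 16-20, 21-25, 26-30, and so on.
    """
    lower = ((age - 1) // 5) * 5 + 1
    return f"{lower}-{lower + 4}"


def birth_year_interval(year):
    """Map a birth year to a 5-year interval aligned like "1960-1964".

    Assumption: intervals start at years divisible by 5.
    """
    lower = (year // 5) * 5
    return f"{lower}-{lower + 4}"


example_age = age_interval(18)
example_birth_year = birth_year_interval(1962)
```

Under these assumptions the represented age groups (16-50) correspond to seven intervals, as do the represented birth years (1960-1994).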

2.3 Task metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

Descriptive metadata  
• Task ID [task id / task_id] Task ID in the corpus. A letter prefix (for a school) + T(ask) + a running number, e.g. WT14. Total of 8 unique tasks
• Datum [date / datum] Automatic Korp value, e.g. “2014-01-01”, in this case derived from task_date. Datum is used by the Korp search engine to create trend diagrams.
  inträdesprov (entrance exam) A - HT12: 2012-08-30; VT13: 2012-12-12
  mitterminsprov (mid-term exam) B - HT12: 2012-10-16; VT13: 2013-03-20
  slutprov (final exam) C - HT12: 2012-12-12; VT13: 2013-05-21
  omprov (re-sit exam) D - HT12: 2013-01-10; VT13: 2013-06-05
  (HT12 = autumn term 2012; VT13 = spring term 2013)
• Task date [task date / task_date] The year and week (year-week) when the essay was written, e.g. 2012-W44 (SW1203 data collected during 2012-2013)
• Course type / school form [school type / school_type] A generic description of the type of school/education where the essay was collected
  Universitetet
• Course level [course subject / course_subject] Behörighetsgivande kurs
• Grading scale [grading scale / grading_scale] CEFR (B1-C2)
Writing task details  
• Task type [task type / task_type] A: Inträdesuppsats (entrance essay; 52 essays)
  B: Mitterminsuppsats (mid-term essay; 41 essays)
  C: Slutprovsuppsats (final exam essay; 45 essays)
  D: Omprov (re-sit exam; 3 essays)
• Task - format [task format / task_format] Handwritten
• Task duration (in minutes) 210 minutes
• Text type / genre [genre / text_types] Argumenterande
• Task instructions yes, for some tasks
• Allowed aids None except handouts
• Tasks - subject / topic [task_subject / subject] E.g. Yttrandefrihet
• Lessontext_topic [lessontext_topic / lessontext_topic] Topics in accordance with COCTAILL taxonomy https://www.aclweb.org/anthology/W14-3510.pdf. N/A for SW1203
• Task_url links where available
  Ett brev till politikerna i kommunen: handout
  Det goda livet: handout
  Internet som mötesplats: handout
  Traditioner och traditioners betydelse:
  Familjtyper (handout N/A)
  Turism (handout N/A)
  Yttrandefrihet (handout N/A)
  Engelska som världsspråk (handout N/A)
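The relation between a calendar date (as in the Datum attribute) and the year-week form of task_date can be sketched as follows. This is an illustration only: it assumes that the week numbers follow ISO 8601 week numbering, which this documentation does not state explicitly, and the function name is hypothetical.

```python
from datetime import date


def to_task_date(day):
    """Format a calendar date as year-week, e.g. "2012-W44".

    Assumption: ISO 8601 week numbering (weeks start on Monday).
    """
    iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
    return f"{iso[0]}-W{iso[1]:02d}"


# The HT12 mid-term exam date listed under Datum.
midterm_week = to_task_date(date(2012, 10, 16))

# A date that falls in the week used as the task_date example.
example_week = to_task_date(date(2012, 10, 30))
```

Note that the ISO year can differ from the calendar year for dates close to New Year, which is why the sketch takes the year from isocalendar() rather than from the date itself.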

2.4 Essay metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

Descriptive metadata  
• Essay ID [essay ID / essay_id] Essay ID consists of a student ID (e.g. W1) + task ID (e.g. WT1) == W1WT1
• Essays per student [uppsatser/student / nr_essay_student] numbers 1-5, indicating “recurrent” (longitudinal) students if the value is 2 or above. For example, “2” means that there are 2 essays written by that particular student
  1 essay/student => 5
  2 essays/student => 28
  3 essays/student => 96
  4 essays/student => 12
• CEFR level [proficiency level / cefr_level] B1 => 40 essays
  B2 => 71 essays
  C1 => 23 essays
  C2 => 7 essays
• Grade [grade / grade] In SW1203 grade and CEFR level are the same
• Full text [full text / svala_link] A link to the full essay that opens in SVALA annotation tool
Additional attributes A few attributes that are present in other SweLL subcorpora, and have been added to each SweLL subcorpus for the sake of interoperability.
• Result on the writing assignment [written proficiency / written_result] TISUS-attribute
• Reading comprehension 1 / LF1 [reading comprehension (result), part 1 / lf1_result] TISUS-attribute
• Reading comprehension 2 / LF2 [reading comprehension (result), part 2 / lf2_result] TISUS-attribute
• Reading comprehension, sum [reading comprehension (sum) / lf_sum] TISUS-attribute
• Oral proficiency [oral proficiency / oral_result] TISUS-attribute
• Final grade [final grade / final_grade] TISUS-attribute
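The essay-level figures above can be cross-checked with simple arithmetic. The sketch below assumes that the numbers listed under "Essays per student" count essays rather than students; that reading is an interpretation, supported by the fact that the figures then add up to the 141 essays and 54 students reported elsewhere in this documentation. The helper names and the ID pattern (uppercase letter prefix + running number) are hypothetical.

```python
import re


def make_essay_id(student_id, task_id):
    """Concatenate student ID and task ID, e.g. "W1" + "WT1" -> "W1WT1"."""
    return student_id + task_id


def split_essay_id(essay_id):
    """Split an essay ID back into (student_id, task_id).

    Assumption: both IDs are uppercase letters followed by digits, and the
    task ID has a "T" immediately before its running number.
    """
    match = re.fullmatch(r"([A-Z]+\d+)([A-Z]+T\d+)", essay_id)
    if match is None:
        raise ValueError(f"unexpected essay ID: {essay_id}")
    return match.group(1), match.group(2)


# Essays per student: key = essays written by a student, value = essays
# contributed by all students in that group (figures from the list above).
essays_by_group = {1: 5, 2: 28, 3: 96, 4: 12}
cefr_counts = {"B1": 40, "B2": 71, "C1": 23, "C2": 7}

total_essays = sum(essays_by_group.values())
# Dividing each group's essay count by essays-per-student gives students.
total_students = sum(n // k for k, n in essays_by_group.items())
total_graded = sum(cefr_counts.values())
```

With the figures from this section, the totals agree with the 141 essays and 54 unique students given in sections 1 and 2.2.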

2.5 School metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

Description  
School ID [source / source] Letter W for SW1203; a university course for prospective students, preparatory before the TISUS exam
Approximate level [approximate level / approximate_level] Level assigned based on the type of course, roughly split into A:Beginners, B:Intermediate and C:Advanced, with an additional marker “Fortsättning” (i.e. Continuation) added where necessary. Note that the two descriptors always appear in alphabetical order, thus sometimes taking the form “Fortsättning, Nybörjare” (i.e. Continuation, Beginner)
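
The xml attribute names given in square brackets throughout section 2 can be read with a standard XML parser. The following is a minimal sketch under stated assumptions: the element names ("corpus", "text") and the sample values are invented for illustration, and the real layout of the distributed xml files may differ. To avoid depending on an element name, the sketch collects any element that carries an essay_id attribute.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample; element names and values are placeholders, only the
# attribute names (essay_id, student_id, task_id, cefr_level) come from
# this documentation.
SAMPLE = """<corpus>
  <text essay_id="W1WT1" student_id="W1" task_id="WT1" cefr_level="B1">...</text>
  <text essay_id="W2WT1" student_id="W2" task_id="WT1" cefr_level="B2">...</text>
  <text essay_id="W1WT2" student_id="W1" task_id="WT2" cefr_level="B2">...</text>
</corpus>"""


def essay_attributes(root):
    """Return the attribute dict of every element that has an essay_id."""
    return [dict(el.attrib) for el in root.iter() if "essay_id" in el.attrib]


root = ET.fromstring(SAMPLE)
essays = essay_attributes(root)

# Distribution of CEFR levels and number of essays per student.
levels = Counter(e["cefr_level"] for e in essays)
essays_per_student = Counter(e["student_id"] for e in essays)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE); the path is left out here because the file names in the release are not described in this document.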