swell-release-v1

Metadata description: SpIn (sub)corpus

A subcorpus in the SweLL-pilot collection

(Elena Volodina, August 17, 2021)

The most recent version of this document is available at https://spraakbanken.github.io/swell-release-v1/Metadata-SpIn

1. General information

1.1 Description of the project
1.2 Description of the SpIn subcorpus
1.3 To cite the SpIn subcorpus

2. Metadata description for the SpIn subcorpus

2.1 Administrative information
2.2 Personal metadata
2.3 Task metadata
2.4 Essay metadata
2.5 School metadata
2.6 Transcription and anonymization

1. General information

Total nr essays: 256 (June 19, 2021)

Nr sentences: 4 302

Nr tokens: 46 911, incl. punctuation

Levels represented: A1, A2, B1, B2

No manual coding/labeling was performed, except very basic anonymization

NOTE the use of NL-tokens (i.e. New Line breaks) to preserve line breaks in the original learner writings. An NL-token does not correspond to any punctuation mark, is added at the beginning of the sentence that follows the line break, and is counted towards running tokens.
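As a rough illustration of this counting convention, the sketch below assumes that a line break is represented by a literal token `NL` placed first in the sentence that follows it. The token string and the list-of-lists sentence representation are assumptions made for the example, not the format of the distributed files.

```python
# Hypothetical representation: each sentence is a list of token strings,
# and a line break is marked by the placeholder token "NL" (assumed name).
NL = "NL"


def count_tokens(sentences, include_nl=True):
    """Count running tokens, optionally leaving out NL-tokens."""
    return sum(
        1
        for sentence in sentences
        for token in sentence
        if include_nl or token != NL
    )


sentences = [
    ["Jag", "heter", "NN", "."],
    [NL, "Jag", "bor", "i", "Sverige", "."],  # NL opens the sentence after the break
]

total_with_nl = count_tokens(sentences)                     # NL counted as a token
total_without_nl = count_tokens(sentences, include_nl=False)  # NL left out
```

Since NL-tokens are counted towards running tokens, the corpus statistics follow the `include_nl=True` convention; filtering them out may be preferable when computing, for example, average sentence length in words.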

1.1 Description of the project

SpIn is a subcorpus of the SweLL-pilot collection of learner essays written by adult immigrants learning Swedish in Sweden. The essays were collected from courses given at the Center for Language Introduction (Centrum för SpråkIntroduktion), which gave this subcorpus its name. Learners filled in consent forms allowing the use of their essays and personal metadata for research.

The work on this corpus was funded by Språkbanken and the Center for Language Technology (CLT), both at the University of Gothenburg, Sweden, during 2012-2016. Part of the work on transcription, anonymization and metadata harmonization was carried out later within the L2 profiles project and the SweLL infrastructure project.

For approved users, SpIn is available through the Korp search interface and as a full dataset via a link sent after approval. To get access, apply using this form: https://sunet.artologik.net/gu/swell.

The SpIn corpus is maintained at the University of Gothenburg, Språkbanken Text (https://spraakbanken.gu.se), as part of the SweLL infrastructure.

1.2 Description of the SpIn subcorpus

(Version from 2016) Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. < https://arxiv.org/pdf/1604.06583.pdf >

(Version from 2021) The above article + updated statistics in this documentation. The major difference between the 2016 and 2021 versions is that an additional 112 essays have been transcribed and anonymized. In addition, all attributes and their values (e.g. level, language names, etc.) have been harmonized across several learner subcorpora in the SweLL collection, so that several of them can be searched at the same time. This is the first time the corpus is officially released to users.

Personal data management: Essays were collected with the learners' consent. The consent allows registered (approved) users to use the essays for research. Handwritten essays were transcribed and manually anonymized according to the principles described in Volodina et al. (2016).

Mode/format: All essays were written as mid-term exams, in a classroom setting. Most of the essays were written by hand and were transcribed later according to the Transcription guidelines (see Volodina et al., 2016).

Time constraints vary between tasks; no access to extra materials was allowed (see details in Task metadata).

For more information on access to SpIn and other SweLL corpora, see the webpages https://spraakbanken.gu.se/en/projects/swell/swell4users and https://spraakbanken.gu.se/projekt/swell/l2korp

1.3 To cite the SpIn subcorpus

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg, Monica Sandell. 2016. SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies. Proceedings of LREC 2016, Slovenia. https://arxiv.org/pdf/1604.06583.pdf

2. Metadata description for the SpIn subcorpus

2.1 Administrative information

(Not available through Korp)

Based on the “Core metadata for learner corpora: draft 1.0, December 2017” by Sylviane Granger and Magali Paquot

Administrative info
Corpus title	SpIn, a subcorpus in SweLL-pilot
Distributor	Språkbanken Text (https://spraakbanken.gu.se/), SweLL-infrastructure component (swell@svenska.gu.se)
Availability	free of charge; access regulated by the GDPR restrictions (application form: https://sunet.artologik.net/gu/swell)
License	CLARIN-ID, -PRIV, -NORED, -BY (explanations: https://www.kielipankki.fi/support/clarin-eula/#res)
Edition	version 1
Character encoding	UTF-8
Markup language / file formats	Files are distributed in three formats: xml, json, raw texts
Corpus design info
L2 target language	Swedish; courses taken in Sweden
L1 (mother tongue)	multiple; represented as iso-codes and usual names
Period of collection	2012-2016
Corpus size	256 essays; 4 302 sentences; 46 911 tokens (incl punctuation)
Corpus mode	written language: essays collected from (classroom/midterm) exams
Annotation	transcription, anonymization
Transcription guidelines	see article: http://arxiv.org/pdf/1604.06583v1.pdf
Written versions	One version of each essay (no multiple submissions of the same text)
Longitudinal	no; but some students are recurrent with up to five (5) essays written in different periods of education
Proficiency levels	CEFR grades represented: A1, A2, B1, (B2)
Proficiency level type	Text based
Official language testing	no
Comparison data (L1 source)	N/A
Corpus annotation info
Manual annotation	yes, anonymization
Automatic annotation	yes, using SPARV pipeline: https://spraakbanken.gu.se/en/tools/sparv/annotations
	part-of-speech tagging, incl morpho-syntactic description: https://spraakbanken.gu.se/korp/markup/msdtags.html
	lemmatization, incl word sense disambiguation and multi-word identification, SALDO-based
	dependency parsing: https://cl.lingfil.uu.se/~nivre/swedish_treebank/dep.html
Correction annotated	no

2.2 Personal metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

*General*
• Student ID [student ID / student_id]	166 unique students, e.g. S16. Letter prefix (for a school) + a running number
• Age in 5-year intervals [age / age]	A 5-year age interval indicating the age at the moment of writing an essay. Age groups in SpIn: 16-20; 21-25; 26-30
• Birth year in 5-year intervals [birth year / birthyear_interval]	1990-1994 => 7 students
	1995-1999 => 151 students
	2000-2004 => 3 students
	unknown => 5 students
• Gender [gender / gender]	Kvinna => 54
	Man => 110
	Unknown => 2
• Time in Sweden (sum in months) [residence / time_in_sweden]	0 - 42
• Native language(s) [native language / iso_l1]	29 unique languages in 38 unique combinations of 1-3 languages
	For ease of interpretation, the full name of the language is provided alongside the iso-code in Korp, in the xml files and in the metadata Excel file
• Education background [education level / edu_level]	1 = 0-6 years of schooling (including elementary school) => 65
	2 = 7-9 years (including high school) => 44
	3 = 10-13 years (including upper-secondary education) => 52
	4 = 14+ years (including university education) => 0
	N/A => 5
• Exam [exam / exam]	Educational certificate, if available. Free text.
• Writing language [writing language / writing_language]	The best language in writing. N/A for SpIn
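The per-attribute counts above can be checked against the stated total of 166 unique students. The sketch below copies the figures from the table by hand; the dictionary layout and the function name are illustrative only.

```python
# Counts copied from the personal metadata table above.
TOTAL_STUDENTS = 166

distributions = {
    "birth year": {"1990-1994": 7, "1995-1999": 151, "2000-2004": 3, "unknown": 5},
    "gender": {"Kvinna": 54, "Man": 110, "Unknown": 2},
    "education level": {"1": 65, "2": 44, "3": 52, "4": 0, "N/A": 5},
}


def check_totals(dists, expected):
    """Map each attribute name to True if its counts sum to `expected`."""
    return {name: sum(counts.values()) == expected for name, counts in dists.items()}


results = check_totals(distributions, TOTAL_STUDENTS)
for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```

The same kind of check can be rerun on counts derived from the xml or json files to confirm that a local copy of the dataset matches this documentation.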

2.3 Task metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

*Administrative Metadata*
• Task ID [task id / task_id]	Task ID in the corpus. A letter prefix (for a school) + T(ask) + a running number, e.g. ST14. Total of 37 unique tasks
• Datum [date / datum]	Automatic Korp value, e.g. “2014-01-01”; in this case a derivative of task_date. Datum is used by the Korp search engine to create trend diagrams.
• Task date [task date / task_date]	The year and week (year-week) when the essay was written, e.g. 2012-W44 (SpIn data collected during 2012-2016)
• Course type / school form [school type / school_type]	A generic description of the type of school/education from which the essay was collected
	Ungdomsgymnasiet
• Course level [course subject / course_subject]	Språkintroduktion för nyanlända
• Grading scale [grading scale / grading_scale]	CEFR (A1-B2)
*Writing task details*
• Task type [task type / task_type]	E.g. formativ skrivuppgift, mitterminsprov, exam
• Task - format [task format / task_format]	Handwritten, Digital
• Task duration (in minutes)	25–180
• Text type / genre [genre / text_types]	E.g. berättande, argumenterande
• Task instructions	N/A for SpIn
• Allowed aids	None
• Tasks - subject / topic [task_subject / subject]	A topic of the essay, e.g. Min skola, Bästa dag i mitt liv, etc.
• Lessontext_topic [lessontext_topic / lessontext_topic]	Topics in accordance with COCTAILL taxonomy < https://www.aclweb.org/anthology/W14-3510.pdf >
• Task_url	In certain cases where handouts were used, attachments are available at the urls. However, very few handouts for the tasks were made available
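For users who need calendar dates rather than year-week values, a `task_date` such as `2012-W44` can be expanded into the span of days it covers. The sketch below assumes that the week numbers follow ISO 8601 numbering (weeks start on Monday), which this documentation does not state explicitly. It is not a description of how Korp derives `datum`.

```python
import re
from datetime import date, timedelta

# Assumed shape of task_date, based on the example "2012-W44".
TASK_DATE = re.compile(r"^(\d{4})-W(\d{1,2})$")


def week_span(task_date):
    """Return (Monday, Sunday) of the ISO week named by a 'YYYY-Www' string."""
    match = TASK_DATE.match(task_date)
    if match is None:
        raise ValueError(f"not a year-week value: {task_date!r}")
    year, week = int(match.group(1)), int(match.group(2))
    monday = date.fromisocalendar(year, week, 1)  # requires Python 3.8+
    return monday, monday + timedelta(days=6)


start, end = week_span("2012-W44")
print(start.isoformat(), end.isoformat())
```

Because only the week is recorded, any single date chosen within this span is an approximation of when the essay was written.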

2.4 Essay metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

*Description*
• Essay ID [essay ID / essay_id]	Essay ID consists of a student ID (e.g. S1) + task ID (e.g. ST1) == S1ST1
• Essays per student [uppsatser/student / nr_essay_student]	numbers 1-5, indicating “recurrent” (longitudinal) students if the value is 2 or above. For example, “2” means that there are 2 essays written by that particular student
	1 essay/student => 32
	2 essays/student => 66
	3 essays/student => 18
	4 essays/student => 40
	5 essays/student => 20
• CEFR level [proficiency level / cefr_level]	A1 => 59 essays
	A2 => 143 essays
	B1 => 46
	B2 => 2
	Unknown => 6
• Grade [grade / grade]	reported as the CEFR level
• Full text [full text / svala_link]	A link to the full essay that opens in the SVALA annotation tool
*Additional attributes*	A few attributes that are present in other SweLL subcorpora, and have been added to each SweLL subcorpus for the sake of interoperability.
• Result on the writing assignment [written proficiency / written_result]	TISUS-attribute
• Reading comprehension 1 / LF1 [reading comprehension (result), part 1 / lf1_result]	TISUS-attribute
• Reading comprehension 2 / LF2 [reading comprehension (result), part 2 / lf2_result]	TISUS-attribute
• Reading comprehension, sum [reading comprehension (sum) / lf_sum]	TISUS-attribute
• Oral proficiency [oral proficiency / oral_result]	TISUS-attribute
• Final grade [final grade / final_grade]	TISUS-attribute
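Since an essay ID is the concatenation of a student ID and a task ID, it can be split back into its two parts. The pattern below is inferred from the examples S16, ST14 and S1ST1 (one uppercase school letter plus a running number, then the school letter, `T` and a running number). It is an assumption and may need adjusting for IDs in other SweLL subcorpora.

```python
import re

# Assumed pattern, inferred from the documented examples only.
ESSAY_ID = re.compile(r"^(?P<student>[A-Z]\d+)(?P<task>[A-Z]T\d+)$")


def split_essay_id(essay_id):
    """Split an essay ID such as 'S1ST1' into (student_id, task_id)."""
    match = ESSAY_ID.match(essay_id)
    if match is None:
        raise ValueError(f"unexpected essay ID format: {essay_id!r}")
    return match.group("student"), match.group("task")


student_id, task_id = split_essay_id("S1ST1")
print(student_id, task_id)
```

Grouping essays by the extracted student ID is one way to collect the texts of recurrent students without relying on the `nr_essay_student` attribute.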

2.5 School metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

*Description*
School ID [source / source]	Letter S for Centrum för Språkintroduktion
	a one-year intensive program that aims to prepare newly arrived refugees/learners for the next transitional training stage before they can proceed with the upper-secondary studies at national Swedish schools
Approximate level [approximate level / approximate_level]	Level assigned based on the type of course, roughly split into A:Beginners, B:Intermediate and C:Advanced, with an additional marker “Fortsättning” (i.e. Continuation) added where necessary. Note that the two descriptors always come in alphabetical order, so the value sometimes takes the form “Fortsättning, Nybörjare” (i.e. Continuation, Beginner)
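Because the descriptors are stored in alphabetical order rather than in a meaningful order, it can be convenient to treat the value as an unordered set when filtering. The sketch below assumes that descriptors are separated by a comma, as in the example value; the function name is illustrative.

```python
def parse_level(value):
    """Split an approximate_level value into an order-independent set of descriptors."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


level = parse_level("Fortsättning, Nybörjare")
is_beginner = "Nybörjare" in level
is_continuation = "Fortsättning" in level
```

With this representation, a check such as `"Nybörjare" in level` works regardless of whether the continuation marker is present.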

2.6 Transcription and anonymization

*Description*
Illegible handwriting	rendered by the sign @ for each illegible letter
Names, streets	anonymized using format: NN, NN-gata
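Since each illegible letter is rendered as one `@`, the amount of illegible material in a transcription can be estimated by counting that character. The sketch below assumes that `@` occurs in the transcriptions only as the illegibility marker; the sample sentence is invented for illustration.

```python
def illegible_stats(text):
    """Return (number of '@' characters, their share of non-whitespace characters)."""
    chars = [c for c in text if not c.isspace()]
    marked = sum(1 for c in chars if c == "@")
    share = marked / len(chars) if chars else 0.0
    return marked, share


# Invented example: four illegible letters and an anonymized street name.
marked, share = illegible_stats("Jag bor på NN-gata i @@@@borg")
print(marked, round(share, 3))
```

Essays with a high share of `@` characters may be worth excluding from analyses that depend on complete word forms.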
