swell-release-v1

Metadata description: the SweLL-gold corpus

(Elena Volodina, August 17, 2021)

The most recent version of this document is available at https://spraakbanken.github.io/swell-release-v1/Metadata-SweLL

1. General information

1.1 Description of the project
1.2 Description of the SweLL-gold subcorpus
1.3 To cite the SweLL-gold subcorpus

2. Metadata description for the SweLL-gold subcorpus

2.1 Administrative information
2.2 Personal metadata
2.3 Task metadata
2.4 Essay metadata
2.5 School metadata

3. Manual annotation in the SweLL-gold subcorpus

3.1 Pseudonymization codes
3.2 Correction annotation codes

1. General information

Total nr of correction-annotated essays: 502 (SweLL-gold v.1.0). NOTE! In the SweLL corpus we use the term CORRECTION ANNOTATION instead of the more traditional ERROR ANNOTATION.

Nr sentences:

Original version: 7 807
Normalized version: 8 137

Nr tokens:

Original version: 147 842, incl. punctuation
Normalized version: 151 851, incl. punctuation

Proficiency levels represented (based on the level of the courses of Swedish): A, B, C (A=Beginner, B=Intermediate, C=Advanced)

Manual coding/labeling: pseudonymization, normalization, correction annotation.

Inter-annotator agreement (IAA): 88% by Fleiss’ kappa and 76% by Krippendorff’s alpha. IAA was calculated on the basis of 10% of the essays (i.e. 50 essays) that were double-annotated with correction labels, and is counted based on annotated edges only.
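
For readers unfamiliar with the measure, the sketch below shows how a Fleiss' kappa value is computed from a table of label counts, using the standard textbook formula. The function name `fleiss_kappa` and the toy input are illustrative assumptions; this is not the project's IAA script, and it does not reproduce the edge-based counting used for SweLL-gold.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of label counts.

    counts[i][j] = number of annotators who gave item i the label j.
    Every item must be rated by the same number of annotators (>= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall label proportions.
    n_labels = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    p_expected = sum(p * p for p in proportions)

    # Undefined (division by zero) if every rating uses one single label.
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example (invented data): two annotators, two labels, four items.
toy_counts = [[2, 0], [0, 2], [2, 0], [0, 2]]
print(fleiss_kappa(toy_counts))
```

In the toy table both annotators agree on every item, so the value is 1.0; values near 0 indicate agreement at chance level.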

NOTE the use of NL-tokens (i.e. New Line breaks) to preserve line breaks in the original learner writings. NL-tokens do not correspond to any punctuation, are added to the beginning of the sentence that follows the line break, and are counted towards running tokens.
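
The effect of NL-tokens on token counts can be illustrated with a small sketch. It assumes that sentences are available as lists of token strings and that the NL-token is represented by the sign ␤; both the data layout and the helper names are assumptions made for illustration, not a description of the distributed files.

```python
NL_TOKEN = "␤"  # assumed representation of the NL-token


def count_tokens(sentences):
    """Return (running tokens incl. NL-tokens, number of NL-tokens)."""
    total = sum(len(sentence) for sentence in sentences)
    nl_count = sum(tok == NL_TOKEN for sentence in sentences for tok in sentence)
    return total, nl_count


def restore_lines(sentences):
    """Rebuild the original line layout: each NL-token starts a new line."""
    lines = [[]]
    for sentence in sentences:
        for tok in sentence:
            if tok == NL_TOKEN:
                lines.append([])
            else:
                lines[-1].append(tok)
    return [" ".join(line) for line in lines]


# Invented example: the NL-token opens the sentence that follows the break.
example = [["Hej", "!"], ["␤", "Jag", "bor", "här", "."]]
print(count_tokens(example))
print(restore_lines(example))
```

Subtracting the second value from the first gives a token count without NL-tokens, which may be useful when comparing with corpora that do not encode line breaks.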

Other subcorpora: Apart from the SweLL-gold subcorpus, the SweLL collection contains other (sub)corpora collected in other projects, e.g. TISUS (2007), SpIn, SW1203. Other (sub)corpora may be added to the SweLL collection.

1.1 Description of the project

The purpose of the SweLL infrastructure project was to set up an infrastructure for the collection, digitization, normalization, and annotation of learner written production, and to make available a linguistically annotated corpus. Because a parallel normalized version is available, researchers can search for various types of linguistic structures without having to guess what such a structure might look like in learner writing.

The SweLL infrastructure v1 consists of:

a data collection portal
annotation tools for L2 analysis
an annotated corpus of L2 written production
specific search solutions for L2-material facilitating filtering for e.g. texts written by male writers or writers at a certain proficiency level.

More information: https://spraakbanken.gu.se/en/projects/swell

The SweLL corpus is maintained at the University of Gothenburg, Språkbanken-Text https://spraakbanken.gu.se

For approved users, SweLL essays are available through the Korp search interface and as a full dataset via a link that is sent to approved users. To get access, apply using this form: https://sunet.artologik.net/gu/swell.

1.2 Description of the SweLL-gold subcorpus

Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue.

Read the article: https://nejlt.ep.liu.se/article/view/1374

Personal data management: Essays and personal metadata were collected with the consent of the learners. The consent allows the use of essays for research by registered (approved) users. Handwritten essays were transcribed in a secure encrypted environment (SweLL kiosk). All essays were manually pseudonymized (using SweLL kiosk) based on the Pseudonymization guidelines:

https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines

Mode: All essays were written as an exam or in a classroom as placement, formative or final tests. Most of the essays were written by hand and were transcribed later according to the Transcription Guidelines:

https://spraakbanken.github.io/swell-project/Transcription_guidelines

Pseudonymized essays were normalized and corr-annotated by L2 specialists, as described in section 3.

Time constraints and access to allowed materials vary between tasks (see details in Task Metadata for each particular task).

Full text access via Korp is secured through a link to the SVALA annotation tool where the full text opens. Note that the NL sign (␤) marks a new paragraph in the original student writing.

To get further information on access to the SweLL corpus, see the webpage:

https://spraakbanken.gu.se/en/projects/swell/swell4users and https://spraakbanken.gu.se/en/projects/swell/l2korp

1.3 To cite the SweLL-gold subcorpus

Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg and Mats Wirén (2019). The SweLL Language Learner Corpus: From Design to Annotation. Northern European Journal of Language Technology, Special Issue. https://nejlt.ep.liu.se/article/view/1374
Other articles can be found on the project website: https://spraakbanken.gu.se/en/projects/swell/

2. Metadata description for the SweLL-gold subcorpus

2.1 Administrative information

(Not available through Korp)

Based on the “Core metadata for learner corpora: draft 1.0, December 2017” by Sylviane Granger and Magali Paquot

Administrative info
Corpus title	SweLL-gold, a corpus in a bigger SweLL collection
Distributor	Språkbanken Text (https://spraakbanken.gu.se/), SweLL-infrastructure component (swell@svenska.gu.se)
Availability	free of charge; access regulated by the GDPR restrictions (application form: https://sunet.artologik.net/gu/swell)
License	CLARIN-ID, -PRIV, -NORED, -BY (explanations: https://www.kielipankki.fi/support/clarin-eula/#res)
Edition	version 1
Character encoding	UTF-8
Markup language / file formats	Files are distributed in three formats: xml, json, raw texts
Corpus design info
L2 target language	Swedish; courses taken in Sweden
L1 (mother tongue)	multiple; represented as iso-codes and usual names
Period of collection	2017-2020
Corpus size	502 essays; 7 807 sentences; 147 842 tokens, incl punctuation (original version)
Corpus mode	written language: essays collected from classroom/exam setting
Annotation	transcription, pseudonymization, normalization, correction annotation, automatic linguistic annotation
Written versions	One version of each essay (no multiple submissions of the same text)
Longitudinal	some recurrent students appear, see section 2.4 (schools with IDs B, C, E, G, K)
Proficiency levels	(approximate) development levels based on course level:
	A:Beginner –> Swedish for Immigrants (SFI)
	B:Intermediate –> Grundläggande vuxenutbildning (SVA)
	C:Advanced –> Gymnasiet; Universitetskurser; TISUS
Proficiency level type	Course-level based
Official language testing	mixed sources, including some official testing, e.g. TISUS
Comparison data (L1 source)	N/A
Corpus annotation info
Manual annotation	yes, pseudonymization, normalization, correction annotation
	Transcription guidelines: https://spraakbanken.github.io/swell-project/Transcription_guidelines
	Pseudonymization guidelines: https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines
	Normalization guidelines: https://spraakbanken.github.io/swell-project/Normalization_guidelines
	Correction annotation guidelines: https://gupea.ub.gu.se/handle/2077/69434
Automatic annotation	yes, using SPARV pipeline: https://spraakbanken.gu.se/en/tools/sparv/annotations
	part-of-speech tagging, incl morpho-syntactic description: https://spraakbanken.gu.se/korp/markup/msdtags.html
	lemmatization, incl word sense disambiguation and multi-word identification, SALDO-based
	dependency parsing: https://cl.lingfil.uu.se/~nivre/swedish_treebank/dep.html
Correction annotated	yes, see section 3.2

2.2 Personal metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

*General*
Student ID [student ID / student_id]	451 unique students, e.g. C16. Letter prefix (for a school) + a running number
Age [age / age]	A 5-year age interval indicating the age at the moment of writing an essay, e.g. 31-35. Ages 16-70 are represented
Birth year in 5-year intervals [birthyear_interval / birth year]	1950-1954 -- 2000-2004
Gender [gender / gender]	Kvinna, Man, Annat, Vill inte säga
Time in Sweden (sum in months) [residence / time_in_sweden]	0 - 315
Native language(s) [native language / iso_l1]	81 unique languages in 117 unique combinations of 1-4 languages
	For ease of interpretation, the full name of the language is also provided in the xml files and in the metadata Excel file
*Education*
Education background [education level / edu_level]	1= 0-6 years of schooling (including elementary school) => 34
	2= 7-9 years (including high school) => 56
	3= 10-13 years (including upper-secondary education) => 155
	4= 14+ years (including university education)=> 257
Elementary education outside Sweden (nr months)	12 - 156
Elementary education in Sweden (nr months)	12 - 132
Introductory education in Sweden (nyanlända) (months)	1 - 36
Gymnasial education outside Sweden (nr months)	2 - 168
Gymnasial education in Sweden (nr months)	1 - 72
Professional education outside Sweden (nr months)	12 - 60
Professional education in Sweden (nr months)	3 - 72
University education outside Sweden (nr months)	12 - 228
University education in Sweden (nr months)	6 - 96
Professional degree	Free text (e.g. Ekonom)
Educational degree	Free text
Education: additional comments	Free text
*Language information*
Education in L1 (modersmålsundervisning/hemspråk, education in Sweden)	Language name
• Characteristics below are available in the material, but not in the xml files or Korp
Length of education in L1, nr months (modersmålsundervisning/hemspråk)	1 - 168
Swedish proficiency courses	Self-instruction, formal education
Length of Swedish proficiency courses, nr months	1 - 336
All known languages	List of languages
Other known language(s) except mother tongue(s)	List of languages
Best written language(s) [writing_language]	List of languages; not used for filtering in Korp
Best spoken language(s)	Language name(s)
Language(s) used with the family	Language name(s)
Language(s) used with friends	Language name(s)
Metacomment	Free comment added by an assistant
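
The 5-year intervals used for age and birth year can be sketched as follows. The interval boundaries are inferred from the examples given in the table (31-35 for age, 1950-1954 for birth year); the binning procedure itself is not documented here, so the function names and the arithmetic are assumptions.

```python
def age_interval(age):
    """Map an age to a 5-year interval such as '31-35' (boundaries assumed)."""
    start = (age - 1) // 5 * 5 + 1
    return f"{start}-{start + 4}"


def birthyear_interval(year):
    """Map a birth year to a 5-year interval such as '1950-1954' (assumed)."""
    start = year // 5 * 5
    return f"{start}-{start + 4}"


print(age_interval(33))          # falls in the interval 31-35
print(birthyear_interval(1952))  # falls in the interval 1950-1954
```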

2.3 Task metadata

*Administrative Metadata*
Task ID	Task ID in the corpus. A letter prefix (for a school) + T(ask) + a running number, e.g. AT14. Total of 44 unique tasks
Semester (time span)	VT-2018, HT-2018, VT-2019, HT-2019, VT-2020
Task date [task date / task_date]	Year-week, e.g. 2018-W20
Datum [datum]	Automatic Korp value, e.g. “2014-01-01”; in SweLL-gold it is derived from task_date. Datum is used by the Korp search engine to create trend diagrams.
Course type / school type [school_type]	A generic description of the type of school/education where the essay was collected
	Ungdomsgymnasiet
	Universitetet
	Vuxenutbildningen
Course level / course subject [course_subject]	Behörighetsgivande kurs
	Förberedande kurs
	Grundläggande SVA dk3
	Inplaceringsprov SFI
	SFI B / C / D
	SVA 2 / 3
	TISUS
Grading scale [grading_scale]	A-F
	G/U
	SFI inplacering
	Uppgiften har inte betyg
*Writing task details*
Task type	Behörighetstest
	Formativ skrivuppgift
	Inplaceringsprov
	Mitterminsprov
	Slutprov
	Test inför NP (Nationella Prov)
Task - format / mode [task_format]	Mode of the essay writing: Handskriven, Digital
Task duration (in minutes)	25--180
Text type / genre	Argumenterande
	Berättande
	Beskrivande
	Förklarande
	Informellt mejl
	Instruerande
	Resonerande
	Utredande
	Återgivande
Task instructions	Free text alt. reference to an attachment
Allowed aids	Bilingual dictionary, Monolingual dictionary, Internet, etc.
Additional material	Free text comment
Additional comments	Free text comment
Additional comment on coursebooks used	Book title(s) / free text
Approximate level	Approximate mapping of the course type to the level of proficiency.
	Nybörjare
	Fortsättning
	Avancerad
	additional marker “Fortsättning” (i.e. Continuation) added where necessary to the other descriptors. Note that the two descriptors always appear in alphabetical order, thus sometimes taking the form “Fortsättning, Nybörjare” (i.e. Continuation, Beginner)
Tasks - subject / topic [task_subject]	A topic of the essay, e.g.
	1. Din första arbetsplats alt. Kvinnors arbete; 2. Mejl till kusin på besök i Sverige
	1. När jag var liten och gick till skolan första gången; 2. Mejl till kusin på besök i Sverige
	Argumenterande text om arbetsmoral
	Argumenterande text/brev
	Berätta hur du bor!
	Berätta utifrån texten "Alma berättar"
	Beskriv - En god relation
	Beskriva - Min första kärlek
	Brott - orsaker och konsekvenser
	Demokratiska val - hur gammal ska man behöva vara för att rösta och varför?
	Diskuterande text om pengars betydelse
	En kulturupplevelse
	En plats du tycker om
	En viktig plats
	Enkel utredande text om litterära teman
	Familjen
	Ge tips och råd
	Ge tips och råd - en anställningsintervju
	Insändare
	Kommunikation och sociala medier
	Mejl till en vän
	Mina första intryck
	Objektivt utredande uppgift
	Om din bostad och om att bo
	Referat av texten "Giftermål ett större steg än barn"
	Skriv en insändare
	Skriv ett brev
	Skriv ett mail
	Skriv om en känd person
	Två sätt att uppfostra
	Utredande text (pm) övning inför NP
	Världens lyckligaste länder
	Övnings-pm inför NP
	etc
Topic domain [lessontext_topic]	Not available for SweLL-gold; present in some SweLL-pilot corpora and in COCTAILL. Topics in accordance with COCTAILL taxonomy < https://www.aclweb.org/anthology/W14-3510.pdf >
Task_url	In certain cases where handouts were used, attachments are available at the urls.
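
The year-week format of task_date can be converted to a calendar date as sketched below. The sketch assumes that the value follows ISO 8601 week numbering and picks the Monday of the week; how Korp actually derives datum from task_date is not specified in this document, so the function name and the choice of weekday are assumptions.

```python
import re
from datetime import date


def task_week_to_date(task_date):
    """Convert a value such as '2018-W20' to the Monday of that ISO week."""
    match = re.fullmatch(r"(\d{4})-W(\d{1,2})", task_date)
    if match is None:
        raise ValueError(f"not a year-week value: {task_date!r}")
    year, week = int(match[1]), int(match[2])
    # Weekday 1 = Monday; this choice is an assumption, not documented behaviour.
    return date.fromisocalendar(year, week, 1)


print(task_week_to_date("2018-W20").isoformat())
```

`date.fromisocalendar` requires Python 3.8 or later.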

2.4 Essay metadata

Explanatory term [attribute name in Korp / attribute name in the xml file]

*Descriptive metadata*
• Essay ID [essay ID / essay_id]	Essay ID consists of a student ID (e.g. A1) + task ID (e.g. AT1) == A1AT1
• Essay per student [uppsatser/student / nr_essay_student]	numbers 1-5, indicating “recurrent students” if the value is 2 or above. For example, “2” means that there are 2 essays written by that particular student
	1 essay per student => 177
	2 essays per student => 151
	3 essays per student => 92
	4 essays per student => 63
	5 essays per student => 19
• Grade [grade / grade]	Indicated for each task and student, where available
• Full text [full text / svala_link]	A link to the full essay that opens in SVALA annotation tool
*Additional attributes*	A few attributes that are present in other SweLL subcorpora, and have been added to each SweLL subcorpus for the sake of interoperability
• Result on the writing assignment [written proficiency / written_result]	TISUS-attribute
• Reading comprehension 1 / LF1 [reading comprehension (result), part 1 / lf1_result]	TISUS-attribute
• Reading comprehension 2 / LF2 [reading comprehension (result), part 2 / lf2_result]	TISUS-attribute
• Reading comprehension, sum [reading comprehension (sum) / lf_sum]	TISUS-attribute
• Oral proficiency [oral proficiency / oral_result]	TISUS-attribute
• Final grade [final grade / final_grade]	TISUS-attribute
Proficiency level [proficiency level (CEFR) / cefr_level]	Present in some SweLL subcorpora (e.g. TISUS, SW1203, SpIn), but absent in SweLL-gold v1
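
Since an essay ID is the concatenation of a student ID and a task ID, it can be split back into its parts. The sketch below relies only on the ID patterns described in this document (student ID: school letter + running number; task ID: school letter + T + running number); the regular expression and the function name are illustrative assumptions.

```python
import re

# Student ID, e.g. "A1" or "C16"; task ID, e.g. "AT1" or "AT14".
ESSAY_ID = re.compile(r"(?P<student>(?P<school>[A-Z])\d+)(?P<task>[A-Z]T\d+)")


def split_essay_id(essay_id):
    """Split an essay ID such as 'A1AT1' into (school, student ID, task ID)."""
    match = ESSAY_ID.fullmatch(essay_id)
    if match is None:
        raise ValueError(f"unexpected essay ID format: {essay_id!r}")
    return match["school"], match["student"], match["task"]


print(split_essay_id("A1AT1"))
```

Grouping essays by the returned student ID is one way to find the recurrent students described under "Essay per student".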

2.5 School metadata

*School listing*		*Description*
• Source [not available in Korp for filtering, but present in the xml files]	a letter indicating the school where the essays were collected, see the listing below
A	Vuxenutbildningscentrum	Inplacering SFI-utbildning: A-D
B	Gymnasieskola	SVA
C	Komvux/SFI	SFI A-D
E	Behörighetsgivande kurser / Universitetet	motsv gymnasiet
F	Behörighetsgivande kurser / Universitetet	motsv gymnasiet
G	SFI-provet	SFI A-D
H	TISUS-prov	motsv gymnasiet
J	Grundläggande vuxenutbildning	SVA dk 1-4
K	Grundläggande SVA	SVA dk 1-4
L	SVA-kurser på gymnasienivå	vuxen utbildning
M	Prov för antagning och inplacering till förberedande respektive behörighetsgivande kurser / Universitetet

3. Manual annotation in the SweLL-gold subcorpus

3.1 Pseudonymization codes

Pseudonymization guidelines:

https://spraakbanken.github.io/swell-project/Pseudonymization_guidelines

*Category type*	*Codes*	*Pseudonym / details*
Names	firstname_male	replaced by an equivalent
	firstname_female
	firstname_unknown
	initials
	middlename
	surname
Geographical data	city (+foreign for non-Swedish ones)	Swedish names replaced with dummy names or X-stad randomly
	area (+foreign for non-Swe ones)	All other names with equivalent ones (i.e. cities with other cities)
	country	Sverige is not replaced; other countries replaced with X-land
	geo	X-geoplats
	place	fake name or X-plats
	region	fake name or X-region
	street_nr
	zip_code	00000
Institutions	school	replaced with X-skola randomly
	work
	other_institution	replaced with X-institution randomly
Transportation	transport_name	replaced with X-linjen randomly
	transport_nr
Age	age_digits	replaced with a random number in the +-2 span from the actual number
	age_string
Dates	date_digits	11/11/1111
	day	replaced randomly
	month_digit	11/11
	month_word	randomly
	year	+-2
Miscellaneous
	account_nr
	e-mail	email@dot.com
	extra
	license_nr (e.g. cars)	ABS 000
	other_nr_seq
	phone_nr	0000-000000
	person_nr	123456-000
	url	url@com
	Zip-code	000 00
Sensitive	edu (education)	Markup only, no replacement
	prof (profession)
	fam (family members)
	sensitive (free text with hints on ethnic & sexual info, religious & political views)
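
The "+-2" replacement listed for age_digits and year can be sketched as follows. The sketch assumes a uniform draw from the five values within two units of the original, which means the original value itself may be drawn; the actual procedure used in the SweLL kiosk is not described here, so the function name and sampling details are assumptions.

```python
import random


def shift_within_two(value, rng=random):
    """Return a value at most 2 away from the original (uniform draw assumed)."""
    return value + rng.randint(-2, 2)


rng = random.Random(2021)  # fixed seed so the illustration is repeatable
print(shift_within_two(34, rng))    # an age between 32 and 36
print(shift_within_two(2015, rng))  # a year between 2013 and 2017
```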

3.2 Correction annotation codes

Normalization guidelines:

https://spraakbanken.github.io/swell-project/Normalization_guidelines

Correction annotation guidelines:

https://gupea.ub.gu.se/handle/2077/69434

To see an overview of the number of tags at various approximate levels, see Appendix in Volodina et al. (2021): https://arxiv.org/pdf/2105.06681.pdf

Volodina, Elena, Yousuf Ali Mohammed, and Julia Klezl (2021). DaLAJ - a dataset for linguistic acceptability judgments for Swedish. Proceedings of the 10th NLP4CALL, Linköping University Press.

*Description*
*ORTHOGRAPHY*
O	misspelling
O-Cap	capitalization
O-Comp	compounding
*LEXIS (derivational morphology included)*
L	any other lexical problem	(not represented in the data)
L-Der	wrong derivation mechanism used
L-FL	mix of foreign language in Swedish
L-Ref	reference error
L-W	wrong word choice
*MORPHOLOGY (inflectional) ≈ PHRASE LEVEL*
M-Adj/adv	adjective instead of adverb
M-Case	case problem
M-Def	definiteness problem
M-F	problem on form / morphology level
M-Gend	problem with gender
M-Num	problem with number
M-Other	any other, not listed, problem on morphology level
M-Verb	any problem on the verb or verb phrase level
*SYNTAX ≈ CLAUSE LEVEL*
S-Adv	word order, sentence adverbial placement
S-Clause	clause vs phrase level
S-Comp	phrase vs compound structure
S-Ext	extensive change
S-FinV	word order, verb placement
S-M	missing word
S-MSubj	missing subject
S-Other	any other problem on the syntactic level
S-R	word redundant (i.e. removed in the target)
S-Type	change of construction type/phrase level
S-WO	any other word order problem
*PUNCTUATION*
P	general miss on punctuation	(not represented in the data)
P-M	missing
P-R	redundant
P-Sent	sentence segmentation
P-W	wrong
*OTHER*
C	consistency change
Cit-FL	use of foreign language for citation
Com!	comment on an essay level
OBS!	comment on a token level
X	unintelligible
Unid	unidentified	(not represented in the data)
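
When counting correction labels, it is often convenient to group them by the top-level categories of the table above. The sketch below derives the category from the label prefix; the mapping is copied from this table, while the function name and the example labels are illustrative assumptions.

```python
from collections import Counter

CATEGORY_BY_PREFIX = {
    "O": "ORTHOGRAPHY",
    "L": "LEXIS",
    "M": "MORPHOLOGY",
    "S": "SYNTAX",
    "P": "PUNCTUATION",
}
# Labels in the OTHER group do not share a common prefix, so they are listed.
OTHER_LABELS = {"C", "Cit-FL", "Com!", "OBS!", "X", "Unid"}


def label_category(label):
    """Return the top-level category of a correction label such as 'M-Def'."""
    if label in OTHER_LABELS:
        return "OTHER"
    prefix = label.split("-", 1)[0]
    if prefix not in CATEGORY_BY_PREFIX:
        raise ValueError(f"unknown correction label: {label!r}")
    return CATEGORY_BY_PREFIX[prefix]


# Invented list of labels, tallied per category.
labels = ["O", "M-Def", "S-FinV", "M-Verb", "P-M", "Cit-FL"]
print(Counter(label_category(label) for label in labels))
```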
