swell-release-v1

ReadMe: SweLL-pilot collection

(Elena Volodina, August 17, 2021)

Contact info: swell@svenska.gu.se

The most recent version of this ReadMe file is available at https://spraakbanken.github.io/swell-release-v1/Readme-SweLL-pilot

The SweLL-pilot corpus is a collection of essays written by adult learners of Swedish. It was collected between 2007 and 2016 and contains 502 essays that have been anonymized and graded with CEFR labels. In 2020-2021 the SweLL-pilot collection was added to the SweLL portal to ensure that its format and attributes are comparable with those of the SweLL-gold collection.

There are three subcorpora in the SweLL-pilot collection: SpIn (256 essays), SW1203 (141 essays) and TISUS (105 essays).

More information about each subcorpus can be found in the Metadata files below, and some further details in Volodina et al. (2016). Note, however, that Volodina et al. (2016) report a slightly different number of essays, because some of the essays were transcribed and added to the collection at a later stage.

SweLL-pilot.zip contains

  1. the current ReadMe file https://spraakbanken.github.io/swell-release-v1/Readme-SweLL-pilot

  2. SpIn-folder

    2.1 spIn_metadataDescription.pdf - a pdf version (as of August 2021) of the metadata description available at https://spraakbanken.github.io/swell-release-v1/Metadata-SpIn. The online version is updated continuously, so it may be worth checking it for later changes.

    2.2 spIn_metadata.xlsx - a spreadsheet containing all metadata ordered by essay-ID. Attributes and variables are described in the Metadata file below (and in its online counterpart). In addition, the Excel file provides four further fields:

    • sentences, i.e. number of sentences per essay
    • tokens, i.e. number of tokens per essay
    • correction_lables, i.e. number of correction labels used in the essay (relevant only for SweLL-gold essays)
    • pseudo_lables, i.e. number of pseudonymization labels used in the essay (relevant only for SweLL-gold essays)

    2.3 spIn.json - all SpIn essays (256) in a json format (aka SVALA format); see Wirén et al. (2019) for a description. The json representation contains three objects: “source”, “target” and “edges”. Edges are links from token ids in the source to token ids in the target, with tags describing the differences between the two. Additionally, pseudonymization tags may be attached to the links.

    2.4 spIn.txt - raw essay texts (i.e. containing no annotation)

    2.5 spIn.xml - an xml version of the original essays following the Korp format. Attributes and variables are described in the Metadata file below.

    2.6 spIn_Ling_annotated.xml - an xml version (Korp format) of the original essays with linguistic annotation produced by the Sparv pipeline.

  3. SW1203-folder

    3.1 sw_metadataDescription.pdf - a pdf version (as of August 2021) of the metadata description available at https://spraakbanken.github.io/swell-release-v1/Metadata-SW1203. The online version is updated continuously, so it may be worth checking it for later changes.

    3.2 sw_metadata.xlsx - a spreadsheet containing all metadata ordered by essay-ID. Attributes and variables are described in the Metadata file below (and in its online counterpart). In addition, the Excel file provides four further fields:

    • sentences, i.e. number of sentences per essay
    • tokens, i.e. number of tokens per essay
    • correction_lables, i.e. number of correction labels used in the essay (relevant only for SweLL-gold essays)
    • pseudo_lables, i.e. number of pseudonymization labels used in the essay (relevant only for SweLL-gold essays)

    3.3 sw.json - all SW1203 essays (141) in a json format (aka SVALA format); see Wirén et al. (2019) for a description. The json representation contains three objects: “source”, “target” and “edges”. Edges are links from token ids in the source to token ids in the target, with tags describing the differences between the two. Additionally, pseudonymization tags may be attached to the links.

    3.4 sw.txt - raw essay texts (i.e. containing no annotation)

    3.5 sw.xml - an xml version of the original essays following the Korp format. Attributes and variables are described in the Metadata file below.

    3.6 sw_Ling_annotated.xml - an xml version (Korp format) of the original essays with linguistic annotation produced by the Sparv pipeline.

  4. TISUS-folder

    4.1 tisus_metadataDescription.pdf - a pdf version (as of August 2021) of the metadata description available at https://spraakbanken.github.io/swell-release-v1/Metadata-TISUS. The online version is updated continuously, so it may be worth checking it for later changes.

    4.2 tisus_metadata.xlsx - a spreadsheet containing all metadata ordered by essay-ID. Attributes and variables are described in the Metadata file below (and in its online counterpart). In addition, the Excel file provides four further fields:

    • sentences, i.e. number of sentences per essay
    • tokens, i.e. number of tokens per essay
    • correction_lables, i.e. number of correction labels used in the essay (relevant only for SweLL-gold essays)
    • pseudo_lables, i.e. number of pseudonymization labels used in the essay (relevant only for SweLL-gold essays)

    4.3 tisus.json - all TISUS essays (105) in a json format (aka SVALA format); see Wirén et al. (2019) for a description. The json representation contains three objects: “source”, “target” and “edges”. Edges are links from token ids in the source to token ids in the target, with tags describing the differences between the two. Additionally, pseudonymization tags may be attached to the links.

    4.4 tisus.txt - raw essay texts (i.e. containing no annotation)

    4.5 tisus.xml - an xml version of the original essays following the Korp format. Attributes and variables are described in the Metadata file below.

    4.6 tisus_Ling_annotated.xml - an xml version (Korp format) of the original essays with linguistic annotation produced by the Sparv pipeline.
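
As a minimal sketch of how the Korp-format xml files listed above might be processed, the Python snippet below counts sentences and tokens per essay. The element names (text, sentence, w), the essay_id attribute and the inline sample are assumptions made for illustration only; check them against the actual files and the metadata descriptions before relying on them. The counts it produces are not guaranteed to match the sentences and tokens fields in the metadata spreadsheets.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real files may use other element or attribute names.
SAMPLE = """<corpus>
  <text essay_id="A1">
    <sentence><w>Jag</w><w>bor</w><w>här</w><w>.</w></sentence>
    <sentence><w>Hej</w><w>!</w></sentence>
  </text>
  <text essay_id="B2">
    <sentence><w>Tack</w><w>.</w></sentence>
  </text>
</corpus>"""


def count_per_essay(root, id_attr="essay_id"):
    """Return {essay id: {"sentences": n, "tokens": m}} for every <text> element."""
    counts = {}
    for text in root.iter("text"):
        sentences = list(text.iter("sentence"))
        # Tokens are counted as <w> elements inside each sentence.
        tokens = sum(len(list(s.iter("w"))) for s in sentences)
        counts[text.get(id_attr)] = {"sentences": len(sentences), "tokens": tokens}
    return counts


# For a real file, one would parse it instead, e.g. ET.parse("spIn.xml").getroot()
counts = count_per_essay(ET.fromstring(SAMPLE))
for essay_id, c in counts.items():
    print(essay_id, c["sentences"], c["tokens"])
```

The attribute name is passed as a parameter so that it can be swapped for whichever essay identifier the metadata description actually specifies.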

Note that each format includes a so-called SVALA link (or full-text link) for each essay. Following that link opens the essay in full text in a parallel representation (original, target, and tags on the links between original and target tokens) in the SVALA tool (Wirén et al. 2019). In SVALA you can experiment with various annotation modes and tagsets. Please be aware, however, that any annotations you add are not saved to a database.
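
The same parallel representation can also be inspected programmatically from the json files. The following Python sketch pairs source and target tokens through the edges. It assumes, for illustration only, that an essay object holds “source” and “target” as lists of tokens with "id" and "text" fields, and “edges” as a mapping whose values list the linked token ids under "ids" and any tags under "labels". The sample data and the tag name EXAMPLE-TAG are hypothetical, and how the essays are grouped inside each json file is not covered here; see Wirén et al. (2019) for the authoritative description of the format.

```python
import json

# Hypothetical single-essay sample in an assumed SVALA-like layout.
SAMPLE_JSON = """
{
  "source": [
    {"id": "s0", "text": "Jag "},
    {"id": "s1", "text": "bor "},
    {"id": "s2", "text": "i "},
    {"id": "s3", "text": "Sverige "}
  ],
  "target": [
    {"id": "t0", "text": "Jag "},
    {"id": "t1", "text": "bor "},
    {"id": "t2", "text": "i "},
    {"id": "t3", "text": "Sverige "}
  ],
  "edges": {
    "e-s0-t0": {"ids": ["s0", "t0"], "labels": []},
    "e-s1-t1": {"ids": ["s1", "t1"], "labels": []},
    "e-s2-t2": {"ids": ["s2", "t2"], "labels": []},
    "e-s3-t3": {"ids": ["s3", "t3"], "labels": ["EXAMPLE-TAG"]}
  }
}
"""


def aligned_pairs(essay):
    """Yield (source text, target text, labels) for each edge of one essay."""
    source = {tok["id"]: tok["text"] for tok in essay["source"]}
    target = {tok["id"]: tok["text"] for tok in essay["target"]}
    for edge in essay["edges"].values():
        # An edge may link several tokens on either side, so join them.
        src = "".join(source[i] for i in edge["ids"] if i in source).strip()
        tgt = "".join(target[i] for i in edge["ids"] if i in target).strip()
        yield src, tgt, edge.get("labels", [])


sample = json.loads(SAMPLE_JSON)
pairs = list(aligned_pairs(sample))
for src, tgt, labels in pairs:
    print(f"{src!r} -> {tgt!r} {labels}")
```

Looking tokens up by id rather than by position keeps the sketch valid when an edge links an unequal number of source and target tokens.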

Metadata descriptions:

Reminder of the access agreement

Please note the agreement conditions:

Always cite

References