Data format
Each MultiGEC subcorpus is composed of a train, a development and a test set. Each set consists of two or more essay-aligned files: one containing the original learner essays and one or more containing reference (i.e. corrected/normalized) texts.
Naming conventions
Data file names combine the following components:
- the first component is the two-letter ISO 639 code for the language
- corpus is the name of the subcorpus, lowercased
- orig indicates that the file contains original essays, whereas ref1 ... refn indicate the n-th reference file
- train, dev and test indicate the corresponding dataset splits
For example, the file containing the second reference set for the development split of the NatWebInf Czech subcorpus combines the components cs, natwebinf, ref2 and dev.
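The components above can be recovered from a file name with a small parser. This is only a sketch: the exact name template is not reproduced in this section, so the hyphen separator and the `.md` extension are assumptions inferred from the `*-orig-*.md` and `*-refX-*.md` patterns used in the metadata description, and `DataFileName`, `parse_data_file_name` and the sample file name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DataFileName(NamedTuple):
    lang: str                 # two-letter ISO 639 language code
    corpus: str               # lowercased subcorpus name
    kind: str                 # "orig" or "refN"
    ref_index: Optional[int]  # N for "refN", None for "orig"
    split: str                # "train", "dev" or "test"


# Assumed layout: <lang>-<corpus>-<orig|refN>-<split>.md
_NAME = re.compile(
    r"(?P<lang>[a-z]{2})-(?P<corpus>.+)-"
    r"(?P<kind>orig|ref(?P<n>\d+))-(?P<split>train|dev|test)\.md"
)


def parse_data_file_name(name: str) -> Optional[DataFileName]:
    """Return the name components, or None if the name does not fit the assumed layout."""
    m = _NAME.fullmatch(name)
    if m is None:
        return None
    n = m.group("n")
    return DataFileName(
        lang=m.group("lang"),
        corpus=m.group("corpus"),
        kind=m.group("kind"),
        ref_index=int(n) if n is not None else None,
        split=m.group("split"),
    )
```

Because the corpus group is matched greedily and the last two components are fixed vocabularies, a subcorpus name that itself contains hyphens would still be parsed as a single component.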
File format
Internally, each file follows this simple Markdown-based format:
### essay_id = 1
Full text of the first essay/reference.
Whitespace, including newline characters, is preserved, but for the sake of readability TWO consecutive newline characters separate subsequent essays.
### essay_id = 2
Full text of the second essay/reference.
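A file in this format can be split into essays by looking for the `### essay_id = ...` header lines. The sketch below is a minimal reader, not an official loader: `parse_essays` is a hypothetical name, essay IDs are kept as strings, and trailing newlines of each essay are assumed to be separators and are therefore dropped.

```python
import re

# A header line such as "### essay_id = 1" opens each essay.
HEADER = re.compile(r"^### essay_id = (.+)$", re.MULTILINE)


def parse_essays(text: str) -> dict[str, str]:
    """Map each essay_id (as a string) to the text that follows its header."""
    essays = {}
    matches = list(HEADER.finditer(text))
    for i, m in enumerate(matches):
        start = m.end() + 1  # skip the newline that ends the header line
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Assumption: trailing newlines only separate essays, so strip them.
        essays[m.group(1).strip()] = text[start:end].rstrip("\n")
    return essays
```

Newlines inside an essay are left untouched, so multi-paragraph essays keep their internal layout.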
orig and refn files are aligned at the essay level in the sense that reference corrections with essay_id = Y are relative to the essay with essay_id = Y in the corresponding orig file. Note, however, that not all refn files contain corrections for all essays in their orig counterpart (that is, some subcorpora only have a second reference for some of the essays).
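The essay-level alignment can be sketched as a join on essay_id that tolerates references covering only part of the originals. The function name `align_essays` and the choice to raise an error on reference IDs without an original are assumptions made for illustration; both inputs are plain mappings from essay_id to text.

```python
def align_essays(orig: dict[str, str], ref: dict[str, str]) -> list[tuple[str, str, str]]:
    """Pair each original essay with its reference, skipping essays without one."""
    unknown = set(ref) - set(orig)
    if unknown:
        # A reference should always point at an existing original essay.
        raise ValueError(f"references without an original essay: {sorted(unknown)}")
    # Keep the order of the original file; essays missing from ref are skipped.
    return [(eid, orig[eid], ref[eid]) for eid in orig if eid in ref]
```

Skipping rather than failing on missing references matches the note that some subcorpora only provide a second reference for a subset of the essays.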
Each subcorpus is associated with a README file and a YAML file that summarizes basic metadata in a machine-readable format:
target_language: two-letter ISO language code of the texts in the corpus, e.g. "sv" for Swedish
source_corpus: name of the source corpus
learner_type: L1|L2|mixed # whether the authors of the essays are first-language speakers, second-language learners or both/unknown (e.g. if the corpus is crowdsourced)
short_description: short description of the corpus
links: # optional
- link to paper
- link to data sheet or similar
- ...
contact: maintainer@institution.xx # contact information for the main data provider
availability: open|restricted # "open" if the subcorpus is free to use outside of the shared task, restricted otherwise. This refers to the MultiGEC subcorpus, not necessarily to the source corpus
name: name of the license
url: link to the full text of the license
sentence_aligned: false|true # true if the source corpus is sentence-aligned, false otherwise
original_essays: # stats in reference to the *-orig-*.md files
  total: a+b+c
  train: a
  dev: b
  test: c
reference_essays_1: # stats in reference to the *-refX-*.md files
  correction_style: minimal|fluency
  total: a+b+c
  train: a
  dev: b
  test: c
# reference_essays_2, reference_essays_3...
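Once such a metadata file has been loaded into a dictionary (for instance with PyYAML's `yaml.safe_load`), its internal consistency can be checked. The sketch below assumes that the per-split counts are nested under `original_essays` and each `reference_essays_N` key; `check_metadata` is a hypothetical helper, not part of any MultiGEC tooling.

```python
def check_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    # Fields with a closed set of allowed values, as listed in the template.
    enums = {
        "learner_type": {"L1", "L2", "mixed"},
        "availability": {"open", "restricted"},
    }
    for key, allowed in enums.items():
        if meta.get(key) not in allowed:
            problems.append(f"{key}: unexpected value {meta.get(key)!r}")
    for key, stats in meta.items():
        is_ref = key.startswith("reference_essays_")
        if key != "original_essays" and not is_ref:
            continue
        # total is documented as a+b+c, i.e. the sum of the three splits.
        splits = sum(stats.get(s, 0) for s in ("train", "dev", "test"))
        if stats.get("total") != splits:
            problems.append(f"{key}: total does not equal train + dev + test")
        if is_ref and stats.get("correction_style") not in {"minimal", "fluency"}:
            problems.append(f"{key}: unexpected correction_style")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every inconsistency in a subcorpus at once.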