MultiGEC is a dataset for Multilingual Grammatical Error Correction in 12 European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian) that was originally compiled in the context of MultiGEC-2025, the first text-level GEC shared task.
Access
The MultiGEC dataset is subject to the terms of use listed here. At the moment, the data can be obtained by registering for the MultiGEC-2025 shared task. A stable download link will be provided soon.
Overview
The MultiGEC dataset is divided into 17 subcorpora covering different languages, domains and correction styles, summarized below. More detailed information about each subcorpus is available as machine-readable metadata, whose format is described here.
language code | subcorpus | learners | # essays (train) | # essays (dev) | # essays (test) | # essays (total) | hypothesis sets | minimal | fluency | peculiarities |
---|---|---|---|---|---|---|---|---|---|---|
cs | NatWebInf | L1 (web) | 3620 | 1291 | 1256 | 6167 | 2 | ✓ | | |
cs | Romani | L1 (Romani children) | 3247 | 179 | 173 | 3599 | 2 | ✓ | | |
cs | SecLearn | L2 | 2057 | 173 | 177 | 2407 | 2 | ✓ | | |
cs | NatForm | L1 (students) | 227 | 88 | 76 | 391 | 2 | ✓ | | |
en | Write & Improve | L2 | 4040 | 506 | 504 | 5050 | 1 | ✓ | | separate download |
et | EIC | L2 | 206 | 26 | 26 | 258 | 3 | ✓ | ✓ | |
et | EKIL2 | L2 | 1202 | 150 | 151 | 1503 | 2 | ✓ | | |
de | Merlin | L2 | 827 | 103 | 103 | 1033 | 1 | ✓ | | pre-tokenized |
el | GLCII | L2 | 1031 | 129 | 129 | 1289 | 1 | ✓ | | |
is | IceEC | L1 (mixed) | 140 | 18 | 18 | 176 | 1 | ✓ | | pre-tokenized |
is | IceL2EC | L2 | 155 | 19 | 19 | 193 | 1 | ✓ | | pre-tokenized; includes text fragments |
it | Merlin | L2 | 651 | 81 | 81 | 813 | 1 | ✓ | | |
lv | LaVA | L2 | 813 | 101 | 101 | 1015 | 1 | ✓ | | |
ru | RULEC-GEC | mixed (L2 + heritage) | 2539 | 1969 | 1535 | 6043 | 3 | ✓ | ✓ | pre-tokenized; includes text fragments; separate download |
sl | Solar-Eval | L1 (students) | 10 | 50 | 49 | 109 | 1 | ✓ | | |
sv | SweLL_gold | L2 | 402 | 50 | 50 | 502 | 1 | ✓ | | |
uk | UA-GEC | mixed (crowdsourced) | 1706 | 87 | 79 | 1872 | 4 | ✓ | ✓ | |
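
As a quick sanity check on the figures in the table, the split sizes can be transcribed into a small Python structure and aggregated per language. This is only a sketch: the names `SUBCORPORA` and `essays_per_language` are illustrative and are not part of the dataset or of its metadata format, and the numbers are copied by hand from the table rather than read from the distributed files.

```python
from collections import defaultdict

# (language code, subcorpus, train, dev, test, total), copied from the table.
# Hypothetical structure for illustration; not the official metadata format.
SUBCORPORA = [
    ("cs", "NatWebInf", 3620, 1291, 1256, 6167),
    ("cs", "Romani", 3247, 179, 173, 3599),
    ("cs", "SecLearn", 2057, 173, 177, 2407),
    ("cs", "NatForm", 227, 88, 76, 391),
    ("en", "Write & Improve", 4040, 506, 504, 5050),
    ("et", "EIC", 206, 26, 26, 258),
    ("et", "EKIL2", 1202, 150, 151, 1503),
    ("de", "Merlin", 827, 103, 103, 1033),
    ("el", "GLCII", 1031, 129, 129, 1289),
    ("is", "IceEC", 140, 18, 18, 176),
    ("is", "IceL2EC", 155, 19, 19, 193),
    ("it", "Merlin", 651, 81, 81, 813),
    ("lv", "LaVA", 813, 101, 101, 1015),
    ("ru", "RULEC-GEC", 2539, 1969, 1535, 6043),
    ("sl", "Solar-Eval", 10, 50, 49, 109),
    ("sv", "SweLL_gold", 402, 50, 50, 502),
    ("uk", "UA-GEC", 1706, 87, 79, 1872),
]


def essays_per_language(rows):
    """Sum the total number of essays over all subcorpora of each language."""
    totals = defaultdict(int)
    for lang, name, train, dev, test, total in rows:
        # The three splits should add up to the reported total.
        assert train + dev + test == total, (lang, name)
        totals[lang] += total
    return dict(totals)


if __name__ == "__main__":
    per_language = essays_per_language(SUBCORPORA)
    for lang, n in sorted(per_language.items()):
        print(f"{lang}: {n}")
    print("all:", sum(per_language.values()))
```

With the values as listed, every row passes the split check, and the 17 subcorpora add up to 32420 essays across the 12 languages, with Czech contributing the largest share (12564).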