Overall corpus statistics

The table below gives the size, in number of texts, of each of the 17 subcorpora that make up the MultiGEC dataset. For the sake of readability, we only report numbers for the original texts (orig) and the first two hypothesis sets (hyp1, hyp2).

subcorpus                   train                dev                  test                 total
                            orig   hyp1   hyp2   orig   hyp1   hyp2   orig   hyp1   hyp2   orig   hyp1   hyp2
Czech - NatWebInf           3620   3620   0      1291   1291   687    1256   1256   1216   6167   6167   1903
Czech - Romani              3247   3247   0      179    179    84     173    173    163    3599   3599   247
Czech - SecLearn            2057   2057   183    173    173    97     177    177    170    2407   2407   450
Czech - NatForm             227    227    0      88     88     47     76     76     74     391    391    121
English - Write & Improve   4040   4040   0      506    506    0      504    504    0      5050   5050   0
Estonian - EIC              206    206    206    26     26     26     26     26     26     258    258    258
Estonian - EKIL2            1202   1202   1202   150    150    150    151    151    151    1503   1503   1503
German - Merlin             827    827    0      103    103    0      103    103    0      1033   1033   0
Greek - GLCII               1031   1031   0      129    129    0      129    129    0      1289   1289   0
Icelandic - IceEC           140    140    0      18     18     0      18     18     0      176    176    0
Icelandic - IceL2EC         155    155    0      19     19     0      19     19     0      193    193    0
Italian - Merlin            651    651    0      81     81     0      81     81     0      813    813    0
Latvian - LaVA              813    813    0      101    101    0      101    101    0      1015   1015   0
Russian - RULEC-GEC         2539   2539   0      1969   1969   0      1535   1535   1535   6043   6043   1535
Slovene - Solar-Eval        10     10     0      50     50     0      49     49     0      109    109    0
Swedish - SweLL-gold        402    402    0      50     50     0      50     50     0      502    502    0
Ukrainian - UA-GEC          1706   1706   0      87     87     87     79     79     79     1872   1872   166
total                       22873  22873  1591   5020   5020   1178   4527   4527   3414   32420  32420  6183
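
The "total" columns above are plain sums over the train, dev and test splits, which is easy to check mechanically. The following is a minimal sketch and not part of any MultiGEC tooling: `ROWS`, `split_sums` and `inconsistent_rows` are hypothetical names, and only three rows are copied from the table for brevity.

```python
# Text counts copied from three rows of the overview table.
# Each split maps to a tuple (orig, hyp1, hyp2).
ROWS = {
    "Czech - NatWebInf": {
        "train": (3620, 3620, 0),
        "dev": (1291, 1291, 687),
        "test": (1256, 1256, 1216),
        "total": (6167, 6167, 1903),
    },
    "Estonian - EIC": {
        "train": (206, 206, 206),
        "dev": (26, 26, 26),
        "test": (26, 26, 26),
        "total": (258, 258, 258),
    },
    "Ukrainian - UA-GEC": {
        "train": (1706, 1706, 0),
        "dev": (87, 87, 87),
        "test": (79, 79, 79),
        "total": (1872, 1872, 166),
    },
}


def split_sums(row):
    """Return train + dev + test for each of orig, hyp1 and hyp2."""
    return tuple(
        sum(values) for values in zip(row["train"], row["dev"], row["test"])
    )


def inconsistent_rows(rows):
    """List the subcorpora whose reported total differs from the split sum."""
    return [name for name, row in rows.items() if split_sums(row) != row["total"]]


if __name__ == "__main__":
    # An empty list means every copied row is internally consistent.
    print(inconsistent_rows(ROWS))
```

The same check extends to the remaining rows by adding them to `ROWS` in the same shape.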

Subcorpus-specific statistics

The following tables contain detailed statistics for the 17 language-specific MultiGEC subcorpora. The numbers of sentences and tokens were recomputed to ensure cross-language consistency, so they might differ from what is reported in the papers introducing the source corpora.

Czech - NatWebInf

split  tokens                  sentences               texts
       orig    hyp1    hyp2    orig    hyp1    hyp2    orig    hyp1    hyp2
train  83725   86805   63976   6463    7706    5550    3620    3620    0
dev    29827   33142   17954   2270    2895    1565    1291    1291    687
test   25707   29400   28563   2059    2842    2692    1256    1256    1216
total  139259  149347  110493  10792   13443   9807    6167    6167    1903
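
Each table in this section pairs three measures (tokens, sentences, texts) with one column per version (orig, hyp1, and so on). As an illustration of how that two-level layout can be read programmatically, the sketch below parses the Czech - NatWebInf rows into a dictionary keyed by (measure, version). The names `parse_table` and `tokens_per_text` are hypothetical, and the tokens-per-text average is a derived figure used only for illustration, not a statistic reported for MultiGEC.

```python
# The Czech - NatWebInf rows, copied as whitespace-separated text.
TABLE = """\
train 83725 86805 63976 6463 7706 5550 3620 3620 0
dev 29827 33142 17954 2270 2895 1565 1291 1291 687
test 25707 29400 28563 2059 2842 2692 1256 1256 1216
total 139259 149347 110493 10792 13443 9807 6167 6167 1903
"""

MEASURES = ("tokens", "sentences", "texts")
VERSIONS = ("orig", "hyp1", "hyp2")  # adjust for tables with more or fewer versions


def parse_table(text, measures=MEASURES, versions=VERSIONS):
    """Map each split name to a dict {(measure, version): count}."""
    keys = [(m, v) for m in measures for v in versions]
    table = {}
    for line in text.strip().splitlines():
        split, *cells = line.split()
        if len(cells) != len(keys):
            raise ValueError(f"expected {len(keys)} cells in row {split!r}")
        table[split] = dict(zip(keys, map(int, cells)))
    return table


def tokens_per_text(table, split, version):
    """Average tokens per text, or None when the cell reports 0 texts."""
    row = table[split]
    texts = row[("texts", version)]
    return row[("tokens", version)] / texts if texts else None


if __name__ == "__main__":
    stats = parse_table(TABLE)
    print(stats["dev"][("sentences", "hyp1")])
    print(tokens_per_text(stats, "total", "orig"))
```

Keying cells by (measure, version) keeps the parser independent of how many hypothesis sets a given subcorpus has; only `VERSIONS` needs to change.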

Czech - Romani

split  tokens                  sentences               texts
       orig    hyp1    hyp2    orig    hyp1    hyp2    orig    hyp1    hyp2
train  277020  294217  0       18198   21393   0       3247    3247    0
dev    14437   15219   7612    900     1144    550     179     179     84
test   15533   16315   15414   967     1300    1139    173     173     163
total  306990  325751  23026   20065   23837   1689    3599    3599    247

Czech - SecLearn

split  tokens                  sentences               texts
       orig    hyp1    hyp2    orig    hyp1    hyp2    orig    hyp1    hyp2
train  329894  335339  32331   27741   29433   2706    2057    2057    183
dev    31933   32209   19511   2608    2754    1569    173     173     97
test   35085   35505   33836   2710    2914    2730    177     177     170
total  396912  403053  85678   33059   35101   7005    2407    2407    450

Czech - NatForm

split  tokens                  sentences               texts
       orig    hyp1    hyp2    orig    hyp1    hyp2    orig    hyp1    hyp2
train  44034   45165   0       3245    3304    0       227     227     0
dev    22118   22468   12172   1537    1555    878     88      88      47
test   19886   20237   19645   1433    1492    1423    76      76      74
total  86038   87870   31817   6215    6351    2301    391     391     121

English - Write & Improve

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  676366  686379  37341   39074   4040    4040
dev    88628   89877   4307    4669    506     506
test   92915   94276   4911    5324    504     504
total  857909  870532  46559   49067   5050    5050

Estonian - EIC

split  tokens                          sentences                       texts
       orig    hyp1    hyp2    hyp3    orig    hyp1    hyp2    hyp3    orig    hyp1    hyp2    hyp3
train  33718   33799   33783   33817   2849    2928    2906    2931    206     206     206     206
dev    4465    4471    4460    4472    366     373     372     373     26      26      26      26
test   4319    4343    4332    4341    385     391     388     392     26      26      26      26
total  42502   42613   42575   42630   3600    3692    3666    3696    258     258     258     258

Estonian - EKIL2

split  tokens                  sentences               texts
       orig    hyp1    hyp2    orig    hyp1    hyp2    orig    hyp1    hyp2
train  187960  189527  189437  14400   14779   14740   1202    1202    1202
dev    24396   24460   24450   1853    1896    1890    150     150     150
test   22952   23117   23106   1676    1731    1727    151     151     151
total  235308  237104  236993  17929   18406   18357   1503    1503    1503

German - Merlin

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  117172  120477  8455    9290    827     827
dev    15739   16144   1102    1206    103     103
test   13343   13755   1029    1121    103     103
total  146254  150376  10586   11617   1033    1033

Greek - GLCII

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  206577  213666  12167   13066   1031    1031
dev    26257   26884   1538    1663    129     129
test   24512   25456   1525    1658    129     129
total  257346  266006  15230   16387   1289    1289

Icelandic - IceEC

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  141302  141411  7146    7211    140     140
dev    16011   16033   784     789     18      18
test   19160   19135   905     909     18      18
total  176473  176579  8835    8909    176     176

Icelandic - IceL2EC

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  124604  124493  5470    5599    155     155
dev    18880   18751   741     789     19      19
test   14310   14288   595     617     19      19
total  157794  157532  6806    7005    193     193

Italian - Merlin

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  82769   83733   6620    6769    651     651
dev    10624   10713   818     848     81      81
test   10482   10566   845     854     81      81
total  103875  105012  8283    8471    813     813

Latvian - LaVA

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  147888  151745  17254   18236   813     813
dev    18413   18949   2228    2359    101     101
test   17894   18311   2091    2188    101     101
total  184195  189005  21573   22783   1015    1015

Russian - RULEC-GEC

split  tokens                          sentences                       texts
       orig    hyp1    hyp2    hyp3    orig    hyp1    hyp2    hyp3    orig    hyp1    hyp2    hyp3
train  88173   88363   0       0       5191    5171    0       0       2539    2539    0       0
dev    43521   43661   0       0       2688    2682    0       0       1969    1969    0       0
test   91134   91881   90703   91665   5321    5311    5338    5361    1535    1535    1535    1535
total  222828  223905  90703   91665   13200   13164   5338    5361    6043    6043    1535    1535

Slovene - Solar-Eval

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  5053    5178    253     296     10      10
dev    31316   31623   1672    1825    50      50
test   33467   33609   1775    1908    49      49
total  69836   70410   3700    4029    109     109

Swedish - SweLL-gold

split  tokens          sentences       texts
       orig    hyp1    orig    hyp1    orig    hyp1
train  120035  123372  6294    6860    402     402
dev    13182   13499   724     770     50      50
test   12016   12376   653     704     50      50
total  145233  149247  7671    8334    502     502

Ukrainian - UA-GEC

split  tokens                                  sentences                               texts
       orig    hyp1    hyp2    hyp3    hyp4    orig    hyp1    hyp2    hyp3    hyp4    orig    hyp1    hyp2    hyp3    hyp4
train  458693  462431  0       460401  0       29429   30057   0       30078   0       1706    1706    0       1706    0
dev    23866   24106   24168   23954   23949   1318    1338    1370    1337    1380    87      87      87      87      87
test   19951   20121   20158   20023   19995   1089    1143    1193    1143    1192    79      79      79      79      79
total  502510  506658  44326   504378  43944   31836   32538   2563    32558   2572    1872    1872    166     1872    166