Overall corpus statistics
The following table gives an overview of the sizes of the 17 subcorpora that make up the MultiGEC dataset, in terms of number of texts. For the sake of readability, we report numbers only for the first two hypothesis sets (hyp1 and hyp2).

| subcorpus | train orig | train hyp1 | train hyp2 | dev orig | dev hyp1 | dev hyp2 | test orig | test hyp1 | test hyp2 | total orig | total hyp1 | total hyp2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Czech - NatWebInf | 3620 | 3620 | 0 | 1291 | 1291 | 687 | 1256 | 1256 | 1216 | 6167 | 6167 | 1903 |
| Czech - Romani | 3247 | 3247 | 0 | 179 | 179 | 84 | 173 | 173 | 163 | 3599 | 3599 | 247 |
| Czech - SecLearn | 2057 | 2057 | 183 | 173 | 173 | 97 | 177 | 177 | 170 | 2407 | 2407 | 450 |
| Czech - NatForm | 227 | 227 | 0 | 88 | 88 | 47 | 76 | 76 | 74 | 391 | 391 | 121 |
| English - Write & Improve | 4040 | 4040 | 0 | 506 | 506 | 0 | 504 | 504 | 0 | 5050 | 5050 | 0 |
| Estonian - EIC | 206 | 206 | 206 | 26 | 26 | 26 | 26 | 26 | 26 | 258 | 258 | 258 |
| Estonian - EKIL2 | 1202 | 1202 | 1202 | 150 | 150 | 150 | 151 | 151 | 151 | 1503 | 1503 | 1503 |
| German - Merlin | 827 | 827 | 0 | 103 | 103 | 0 | 103 | 103 | 0 | 1033 | 1033 | 0 |
| Greek - GLCII | 1031 | 1031 | 0 | 129 | 129 | 0 | 129 | 129 | 0 | 1289 | 1289 | 0 |
| Icelandic - IceEC | 140 | 140 | 0 | 18 | 18 | 0 | 18 | 18 | 0 | 176 | 176 | 0 |
| Icelandic - IceL2EC | 155 | 155 | 0 | 19 | 19 | 0 | 19 | 19 | 0 | 193 | 193 | 0 |
| Italian - Merlin | 651 | 651 | 0 | 81 | 81 | 0 | 81 | 81 | 0 | 813 | 813 | 0 |
| Latvian - LaVA | 813 | 813 | 0 | 101 | 101 | 0 | 101 | 101 | 0 | 1015 | 1015 | 0 |
| Russian - RULEC-GEC | 2539 | 2539 | 0 | 1969 | 1969 | 0 | 1535 | 1535 | 1535 | 6043 | 6043 | 1535 |
| Slovene - Solar-Eval | 10 | 10 | 0 | 50 | 50 | 0 | 49 | 49 | 0 | 109 | 109 | 0 |
| Swedish - SweLL-gold | 402 | 402 | 0 | 50 | 50 | 0 | 50 | 50 | 0 | 502 | 502 | 0 |
| Ukrainian - UA-GEC | 1706 | 1706 | 0 | 87 | 87 | 87 | 79 | 79 | 79 | 1872 | 1872 | 166 |
| total | 22873 | 22873 | 1591 | 5020 | 5020 | 1178 | 4527 | 4527 | 3414 | 32420 | 32420 | 6183 |

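In the overview above, each "total" figure should equal the sum of the corresponding train, dev and test figures. The following is a minimal sketch of how that arithmetic could be checked; the names `ROWS`, `split_sums` and `inconsistent_rows` are hypothetical helpers introduced for illustration, and only three subcorpora are copied in to keep the example short.

```python
# Text counts as (orig, hyp1, hyp2) per split, copied from the overview table.
# Only three subcorpora are included here; the structure is an assumption
# made for this sketch, not part of the MultiGEC release.
ROWS = {
    "Czech - NatWebInf": {
        "train": (3620, 3620, 0),
        "dev": (1291, 1291, 687),
        "test": (1256, 1256, 1216),
        "total": (6167, 6167, 1903),
    },
    "Estonian - EIC": {
        "train": (206, 206, 206),
        "dev": (26, 26, 26),
        "test": (26, 26, 26),
        "total": (258, 258, 258),
    },
    "Ukrainian - UA-GEC": {
        "train": (1706, 1706, 0),
        "dev": (87, 87, 87),
        "test": (79, 79, 79),
        "total": (1872, 1872, 166),
    },
}


def split_sums(row):
    """Sum the train, dev and test counts column-wise (orig, hyp1, hyp2)."""
    return tuple(
        sum(row[split][i] for split in ("train", "dev", "test"))
        for i in range(3)
    )


def inconsistent_rows(rows):
    """Return the names of subcorpora whose splits do not add up to the total."""
    return [name for name, row in rows.items() if split_sums(row) != row["total"]]


if __name__ == "__main__":
    # An empty list means every included row adds up.
    print(inconsistent_rows(ROWS))
```

The same check could be extended to the token and sentence counts in the subcorpus-specific tables by adding the corresponding tuples.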
Subcorpus-specific statistics
The following tables contain detailed statistics for the 17 language-specific MultiGEC subcorpora.
The numbers of sentences and tokens were recomputed to ensure cross-language consistency, so they might differ from those reported in the papers introducing the source corpora.
Czech - NatWebInf

| split | tokens orig | tokens hyp1 | tokens hyp2 | sentences orig | sentences hyp1 | sentences hyp2 | texts orig | texts hyp1 | texts hyp2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 83725 | 86805 | 63976 | 6463 | 7706 | 5550 | 3620 | 3620 | 0 |
| dev | 29827 | 33142 | 17954 | 2270 | 2895 | 1565 | 1291 | 1291 | 687 |
| test | 25707 | 29400 | 28563 | 2059 | 2842 | 2692 | 1256 | 1256 | 1216 |
| total | 139259 | 149347 | 110493 | 10792 | 13443 | 9807 | 6167 | 6167 | 1903 |

Czech - Romani

| split | tokens orig | tokens hyp1 | tokens hyp2 | sentences orig | sentences hyp1 | sentences hyp2 | texts orig | texts hyp1 | texts hyp2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 277020 | 294217 | 0 | 18198 | 21393 | 0 | 3247 | 3247 | 0 |
| dev | 14437 | 15219 | 7612 | 900 | 1144 | 550 | 179 | 179 | 84 |
| test | 15533 | 16315 | 15414 | 967 | 1300 | 1139 | 173 | 173 | 163 |
| total | 306990 | 325751 | 23026 | 20065 | 23837 | 1689 | 3599 | 3599 | 247 |

Czech - SecLearn

| split | tokens orig | tokens hyp1 | tokens hyp2 | sentences orig | sentences hyp1 | sentences hyp2 | texts orig | texts hyp1 | texts hyp2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 329894 | 335339 | 32331 | 27741 | 29433 | 2706 | 2057 | 2057 | 183 |
| dev | 31933 | 32209 | 19511 | 2608 | 2754 | 1569 | 173 | 173 | 97 |
| test | 35085 | 35505 | 33836 | 2710 | 2914 | 2730 | 177 | 177 | 170 |
| total | 396912 | 403053 | 85678 | 33059 | 35101 | 7005 | 2407 | 2407 | 450 |

Czech - NatForm

| split | tokens orig | tokens hyp1 | tokens hyp2 | sentences orig | sentences hyp1 | sentences hyp2 | texts orig | texts hyp1 | texts hyp2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 44034 | 45165 | 0 | 3245 | 3304 | 0 | 227 | 227 | 0 |
| dev | 22118 | 22468 | 12172 | 1537 | 1555 | 878 | 88 | 88 | 47 |
| test | 19886 | 20237 | 19645 | 1433 | 1492 | 1423 | 76 | 76 | 74 |
| total | 86038 | 87870 | 31817 | 6215 | 6351 | 2301 | 391 | 391 | 121 |

English - Write & Improve

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 676366 | 686379 | 37341 | 39074 | 4040 | 4040 |
| dev | 88628 | 89877 | 4307 | 4669 | 506 | 506 |
| test | 92915 | 94276 | 4911 | 5324 | 504 | 504 |
| total | 857909 | 870532 | 46559 | 49067 | 5050 | 5050 |

Estonian - EIC

| split | tokens orig | tokens hyp1 | tokens hyp2 | tokens hyp3 | sentences orig | sentences hyp1 | sentences hyp2 | sentences hyp3 | texts orig | texts hyp1 | texts hyp2 | texts hyp3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 33718 | 33799 | 33783 | 33817 | 2849 | 2928 | 2906 | 2931 | 206 | 206 | 206 | 206 |
| dev | 4465 | 4471 | 4460 | 4472 | 366 | 373 | 372 | 373 | 26 | 26 | 26 | 26 |
| test | 4319 | 4343 | 4332 | 4341 | 385 | 391 | 388 | 392 | 26 | 26 | 26 | 26 |
| total | 42502 | 42613 | 42575 | 42630 | 3600 | 3692 | 3666 | 3696 | 258 | 258 | 258 | 258 |

Estonian - EKIL2

| split | tokens orig | tokens hyp1 | tokens hyp2 | sentences orig | sentences hyp1 | sentences hyp2 | texts orig | texts hyp1 | texts hyp2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 187960 | 189527 | 189437 | 14400 | 14779 | 14740 | 1202 | 1202 | 1202 |
| dev | 24396 | 24460 | 24450 | 1853 | 1896 | 1890 | 150 | 150 | 150 |
| test | 22952 | 23117 | 23106 | 1676 | 1731 | 1727 | 151 | 151 | 151 |
| total | 235308 | 237104 | 236993 | 17929 | 18406 | 18357 | 1503 | 1503 | 1503 |

German - Merlin

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 117172 | 120477 | 8455 | 9290 | 827 | 827 |
| dev | 15739 | 16144 | 1102 | 1206 | 103 | 103 |
| test | 13343 | 13755 | 1029 | 1121 | 103 | 103 |
| total | 146254 | 150376 | 10586 | 11617 | 1033 | 1033 |

Greek - GLCII

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 206577 | 213666 | 12167 | 13066 | 1031 | 1031 |
| dev | 26257 | 26884 | 1538 | 1663 | 129 | 129 |
| test | 24512 | 25456 | 1525 | 1658 | 129 | 129 |
| total | 257346 | 266006 | 15230 | 16387 | 1289 | 1289 |

Icelandic - IceEC

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 141302 | 141411 | 7146 | 7211 | 140 | 140 |
| dev | 16011 | 16033 | 784 | 789 | 18 | 18 |
| test | 19160 | 19135 | 905 | 909 | 18 | 18 |
| total | 176473 | 176579 | 8835 | 8909 | 176 | 176 |

Icelandic - IceL2EC

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 124604 | 124493 | 5470 | 5599 | 155 | 155 |
| dev | 18880 | 18751 | 741 | 789 | 19 | 19 |
| test | 14310 | 14288 | 595 | 617 | 19 | 19 |
| total | 157794 | 157532 | 6806 | 7005 | 193 | 193 |

Italian - Merlin

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 82769 | 83733 | 6620 | 6769 | 651 | 651 |
| dev | 10624 | 10713 | 818 | 848 | 81 | 81 |
| test | 10482 | 10566 | 845 | 854 | 81 | 81 |
| total | 103875 | 105012 | 8283 | 8471 | 813 | 813 |

Latvian - LaVA

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 147888 | 151745 | 17254 | 18236 | 813 | 813 |
| dev | 18413 | 18949 | 2228 | 2359 | 101 | 101 |
| test | 17894 | 18311 | 2091 | 2188 | 101 | 101 |
| total | 184195 | 189005 | 21573 | 22783 | 1015 | 1015 |

Russian - RULEC-GEC

| split | tokens orig | tokens hyp1 | tokens hyp2 | tokens hyp3 | sentences orig | sentences hyp1 | sentences hyp2 | sentences hyp3 | texts orig | texts hyp1 | texts hyp2 | texts hyp3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 88173 | 88363 | 0 | 0 | 5191 | 5171 | 0 | 0 | 2539 | 2539 | 0 | 0 |
| dev | 43521 | 43661 | 0 | 0 | 2688 | 2682 | 0 | 0 | 1969 | 1969 | 0 | 0 |
| test | 91134 | 91881 | 90703 | 91665 | 5321 | 5311 | 5338 | 5361 | 1535 | 1535 | 1535 | 1535 |
| total | 222828 | 223905 | 90703 | 91665 | 13200 | 13164 | 5338 | 5361 | 6043 | 6043 | 1535 | 1535 |

Slovene - Solar-Eval

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 5053 | 5178 | 253 | 296 | 10 | 10 |
| dev | 31316 | 31623 | 1672 | 1825 | 50 | 50 |
| test | 33467 | 33609 | 1775 | 1908 | 49 | 49 |
| total | 69836 | 70410 | 3700 | 4029 | 109 | 109 |

Swedish - SweLL-gold

| split | tokens orig | tokens hyp1 | sentences orig | sentences hyp1 | texts orig | texts hyp1 |
| --- | --- | --- | --- | --- | --- | --- |
| train | 120035 | 123372 | 6294 | 6860 | 402 | 402 |
| dev | 13182 | 13499 | 724 | 770 | 50 | 50 |
| test | 12016 | 12376 | 653 | 704 | 50 | 50 |
| total | 145233 | 149247 | 7671 | 8334 | 502 | 502 |

Ukrainian - UA-GEC

| split | tokens orig | tokens hyp1 | tokens hyp2 | tokens hyp3 | tokens hyp4 | sentences orig | sentences hyp1 | sentences hyp2 | sentences hyp3 | sentences hyp4 | texts orig | texts hyp1 | texts hyp2 | texts hyp3 | texts hyp4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| train | 458693 | 462431 | 0 | 460401 | 0 | 29429 | 30057 | 0 | 30078 | 0 | 1706 | 1706 | 0 | 1706 | 0 |
| dev | 23866 | 24106 | 24168 | 23954 | 23949 | 1318 | 1338 | 1370 | 1337 | 1380 | 87 | 87 | 87 | 87 | 87 |
| test | 19951 | 20121 | 20158 | 20023 | 19995 | 1089 | 1143 | 1193 | 1143 | 1192 | 79 | 79 | 79 | 79 | 79 |
| total | 502510 | 506658 | 44326 | 504378 | 43944 | 31836 | 32538 | 2563 | 32558 | 2572 | 1872 | 1872 | 166 | 1872 | 166 |