The Computational SLA working group invites you to participate in the first shared task on Multilingual Grammatical Error Detection, MultiGED, which includes five languages: Czech, English, German, Italian and Swedish.
The results will be presented on 22 May 2023 at the NLP4CALL workshop, co-located with the NoDaLiDa conference to be held in the Faroe Islands.
To register for/express interest in the shared task, please fill in this form.
To get important information and updates about the shared task, please join the MultiGED-2023 Google Group.
Official system evaluation will be carried out on CodaLab.
The results are roughly ordered by F0.5 score and show only each team's best of two permitted submissions (selected by F0.5).
| Team Name | Czech | | | English - FCE | | | English - REALEC | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F0.5 | P | R | F0.5 | P | R | F0.5 |
| EliCoDe | 82.01 | 51.79 | 73.44 | 73.64 | 50.34 | 67.40 | 44.32 | 40.73 | 43.55 |
| DSL-MIM-HUS | 58.31 | 55.69 | 57.76 | 72.36 | 37.81 | 61.18 | 62.81 | 28.88 | 50.86 |
| Brainstorm Thinkers | 62.35 | 23.44 | 46.81 | 70.21 | 37.55 | 59.81 | 48.19 | 31.22 | 43.46 |
| VLP-char | 34.93 | 63.95 | 38.42 | 20.76 | 29.53 | 22.07 | - | - | - |
| NTNU-TRH | 80.65 | 6.49 | 24.54 | 81.37 | 1.84 | 8.45 | 51.34 | 1.13 | 5.19 |
| su-dali | - | - | - | - | - | - | - | - | - |
| Team Name | German | | | Italian | | | Swedish | | |
|---|---|---|---|---|---|---|---|---|---|
| | P | R | F0.5 | P | R | F0.5 | P | R | F0.5 |
| EliCoDe | 84.78 | 73.75 | 82.32 | 86.67 | 67.96 | 82.15 | 81.80 | 66.34 | 78.16 |
| DSL-MIM-HUS | 77.80 | 51.92 | 70.75 | 75.72 | 38.67 | 63.55 | 74.85 | 44.92 | 66.05 |
| Brainstorm Thinkers | 77.94 | 47.55 | 69.11 | 70.65 | 36.46 | 59.49 | 73.81 | 39.94 | 63.11 |
| VLP-char | 25.18 | 44.27 | 27.56 | 25.79 | 44.24 | 28.14 | 26.40 | 55.00 | 29.46 |
| NTNU-TRH | 83.56 | 15.58 | 44.61 | 93.38 | 19.84 | 53.62 | 80.12 | 5.09 | 20.31 |
| su-dali | - | - | - | - | - | - | 82.41 | 27.18 | 58.60 |
The aim of this shared task is to detect tokens in need of correction across five different languages, labeling them as either correct (“c”) or incorrect (“i”), i.e. performing binary classification at the token level, as shown in the example below.
Token | Label |
---|---|
I | c |
saws | i |
the | c |
show | c |
last | c |
nigt | i |
. | c |
We particularly encourage the development of multilingual systems that can process all five languages with a single model, although this is not a requirement for participation.
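As an illustration only, the sketch below shows one possible starting point for such a single-model system: fine-tuning a pretrained multilingual encoder for binary token classification with the Hugging Face `transformers` library. The model choice (`xlm-roberta-base`), the helper name `align_labels` and all settings are our own assumptions, not a prescribed baseline:

```python
# Illustrative sketch, not an official baseline: one multilingual
# token-classification model trained on the concatenation of all five
# training sets. Model and hyperparameters are assumptions.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "xlm-roberta-base"  # any multilingual encoder would do
LABELS = ["c", "i"]              # correct / incorrect, as in the task data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={l: i for i, l in enumerate(LABELS)},
)

def align_labels(tokens, labels):
    """Map word-level c/i labels onto subword tokens; -100 marks positions
    (special tokens, word continuations) that the loss should ignore."""
    enc = tokenizer(tokens, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for word_id in enc.word_ids():
        if word_id is None or word_id == prev:
            aligned.append(-100)          # special token or subword continuation
        else:
            aligned.append(LABELS.index(labels[word_id]))
        prev = word_id
    enc["labels"] = aligned
    return enc
```

Training one such model on all five training sets together is one straightforward way to obtain a single multilingual system.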
We provide training, development and test data for each of the five languages: Czech, English, German, Italian and Swedish. The training and development sets are already available in the MultiGED-2023 GitHub repository; the test sets will be released during the test phase. More information about each corpus is given below.
Some of these datasets have already been used in Grammatical Error Detection/Correction (GED/GEC) research, but we also release two new datasets: REALEC (English) and SweLL-gold (Swedish). Where possible, we use the same train/dev/test splits as previous work (GECCC, FCE, Falko-MERLIN), and only create new splits where necessary (REALEC, MERLIN, SweLL). All datasets are derived from annotated second language learner essays.
Please let us know if you find any errors or inconsistencies in a dataset by submitting an Issue or Pull Request on the GitHub repo. Any changes to the data will be announced via the MultiGED-2023 Google Group (so please join it!).
Source corpus | Language | Split | Nr sentences | Nr tokens | Nr errors | Error rate |
---|---|---|---|---|---|---|
GECCC | Czech | Total | 35,453 | 399,742 | 84,041 | 0.210 |
FCE | English | Total | 33,243 | 531,416 | 50,860 | 0.096 |
REALEC* | English | Total | 8,136 | 177,769 | 16,608 | 0.093 |
Falko-MERLIN | German | Total | 24,079 | 381,134 | 57,897 | 0.152 |
MERLIN | Italian | Total | 7,949 | 99,698 | 14,893 | 0.149 |
SweLL-gold | Swedish | Total | 8,553 | 145,507 | 27,274 | 0.187 |
* dev and test sets only
Data is provided in a tab-separated format consisting of two columns: the first column contains the token and the second contains the label (c or i), as in the Task Description above. Note that there are no column headers, sentences are separated by an empty line, and double quotes are escaped (`\"`). System output is expected to be generated in the same format.
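For reference, a minimal reader for this format might look as follows. This is our own sketch rather than part of the official tooling, and the function name `read_multiged_tsv` and the example filename are hypothetical:

```python
# A minimal sketch (not official tooling) for reading the MultiGED TSV
# format: token<TAB>label per line, blank line between sentences, no
# header row; double quotes appear escaped as \" in the data.
def read_multiged_tsv(path):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:              # blank line marks a sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, label = line.split("\t")
            current.append((token, label))
    if current:                       # file may lack a trailing blank line
        sentences.append(current)
    return sentences

# Hypothetical usage:
# sentences = read_multiged_tsv("sv_swell_train.tsv")
```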
Participants may use additional resources to build their systems provided that the resource is publicly available for research purposes. This includes monolingual data, artificial data, pretrained models, syntactic parsers, etc. After the shared task, we encourage participants to share any newly created resources with the community.
| Language | Corpus name | Corpus license | MultiGED license |
|---|---|---|---|
| Czech | GECCC | CC BY-SA 4.0 | CC BY-SA 4.0 |
| English | FCE | custom | custom |
| English | REALEC | CC BY-SA 4.0 | CC BY-SA 4.0 |
| German | Falko | CC BY-SA 4.0 | CC BY-SA 4.0 |
| German | MERLIN | CC BY-SA 4.0 | CC BY-SA 4.0 |
| Italian | MERLIN | CC BY-SA 4.0 | CC BY-SA 4.0 |
| Swedish | SweLL-gold | CLARIN-ID, -PRIV, -NORED, -BY | CC BY 4.0 |
Evaluation will be carried out in terms of token-based Precision, Recall and F0.5, to be consistent with previous work on error detection (Bell et al., 2019; Kaneko and Komachi, 2019; Yuan et al., 2021).
Example:
| Token | Hypothesis | Reference | Meaning |
|---|---|---|---|
| I | c | c | - |
| saws | i | i | True Positive |
| the | c | c | - |
| show | i | c | False Positive |
| last | c | c | - |
| nigt | c | i | False Negative |
| . | c | c | - |
F0.5 is used instead of F1 because humans judge false positives more harshly than false negatives and so precision is more important than recall.
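Concretely, with true positives, false positives and false negatives counted over tokens labelled "i", F0.5 = (1 + 0.5²) × P × R / (0.5² × P + R). The sketch below is our own illustration of this token-level computation; official scores come from `eval.py` on CodaLab:

```python
# Illustrative token-level scoring (not the official script). Inputs are
# flat lists of "c"/"i" labels, hypothesis and reference aligned
# token-for-token.
def prf05(hyp_labels, ref_labels):
    tp = sum(h == "i" and r == "i" for h, r in zip(hyp_labels, ref_labels))
    fp = sum(h == "i" and r == "c" for h, r in zip(hyp_labels, ref_labels))
    fn = sum(h == "c" and r == "i" for h, r in zip(hyp_labels, ref_labels))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    beta2 = 0.5 ** 2
    f05 = (1 + beta2) * p * r / (beta2 * p + r) if p + r else 0.0
    return p, r, f05

# The example above: tp=1 (saws), fp=1 (show), fn=1 (nigt)
hyp = ["c", "i", "c", "i", "c", "c", "c"]
ref = ["c", "i", "c", "c", "c", "i", "c"]
print(prf05(hyp, ref))  # (0.5, 0.5, 0.5)
```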
Although official evaluation will be carried out on CodaLab, we include our evaluation script, `eval.py`, in this repository; it can be used to evaluate system performance independently. The script can be run from the command line as follows:
```
python3 eval.py -hyp <hyp_tsv> -ref <ref_tsv>
```
It is assumed that the `hyp_tsv` and `ref_tsv` files are in the same two-column tab-separated format as the data provided in this shared task. Note that the script processes a single language at a time, so you will need to run it once per language to evaluate multiple languages.
We encourage you to submit a paper with your system description to the NLP4CALL workshop special track. We follow the same requirements for paper submissions as the NLP4CALL workshop, i.e. we use the same template and apply the same page limit. All papers will be reviewed by the organizing committee.
Accepted papers will be published in the workshop proceedings through the NEALT Proceedings Series and also made available through the ACL Anthology.
Further instructions on this will follow.
Please join the MultiGED-2023 Google Group to ask questions, hold discussions and browse previously answered questions.