multiged-2023

Shared task on Multilingual Grammatical Error Detection (MultiGED-2023)

The Computational SLA working group invites you to participate in the first shared task on Multilingual Grammatical Error Detection, MultiGED, which includes five languages: Czech, English, German, Italian and Swedish.

The results will be presented on 22nd May, 2023, at the NLP4CALL workshop, co-located with the NoDaLiDa conference to be held in the Faroe Islands.

To register for/express interest in the shared task, please fill in this form.
To get important information and updates about the shared task, please join the MultiGED-2023 Google Group.
Official system evaluation will be carried out on CodaLab.


Competition results

The results are roughly ordered by F0.5-score and show only each team's best of two submissions (based on F0.5).

| Team Name | Czech P | Czech R | Czech F0.5 | English (FCE) P | English (FCE) R | English (FCE) F0.5 | English (REALEC) P | English (REALEC) R | English (REALEC) F0.5 |
|:----------|--------:|--------:|-----------:|----------------:|----------------:|-------------------:|-------------------:|-------------------:|----------------------:|
| EliCoDe | 82.01 | 51.79 | 73.44 | 73.64 | 50.34 | 67.40 | 44.32 | 40.73 | 43.55 |
| DSL-MIM-HUS | 58.31 | 55.69 | 57.76 | 72.36 | 37.81 | 61.18 | 62.81 | 28.88 | 50.86 |
| Brainstorm Thinkers | 62.35 | 23.44 | 46.81 | 70.21 | 37.55 | 59.81 | 48.19 | 31.22 | 43.46 |
| VLP-char | 34.93 | 63.95 | 38.42 | 20.76 | 29.53 | 22.07 | - | - | - |
| NTNU-TRH | 80.65 | 6.49 | 24.54 | 81.37 | 1.84 | 8.45 | 51.34 | 1.13 | 5.19 |
| su-dali | - | - | - | - | - | - | - | - | - |
| Team Name | German P | German R | German F0.5 | Italian P | Italian R | Italian F0.5 | Swedish P | Swedish R | Swedish F0.5 |
|:----------|---------:|---------:|------------:|----------:|----------:|-------------:|----------:|----------:|-------------:|
| EliCoDe | 84.78 | 73.75 | 82.32 | 86.67 | 67.96 | 82.15 | 81.80 | 66.34 | 78.16 |
| DSL-MIM-HUS | 77.80 | 51.92 | 70.75 | 75.72 | 38.67 | 63.55 | 74.85 | 44.92 | 66.05 |
| Brainstorm Thinkers | 77.94 | 47.55 | 69.11 | 70.65 | 36.46 | 59.49 | 73.81 | 39.94 | 63.11 |
| VLP-char | 25.18 | 44.27 | 27.56 | 25.79 | 44.24 | 28.14 | 26.40 | 55.00 | 29.46 |
| NTNU-TRH | 83.56 | 15.58 | 44.61 | 93.38 | 19.84 | 53.62 | 80.12 | 5.09 | 20.31 |
| su-dali | - | - | - | - | - | - | 82.41 | 27.18 | 58.60 |

Task description

The aim of this shared task is to detect tokens in need of correction across five different languages, labeling them as either correct (“c”) or incorrect (“i”), i.e. performing binary classification at the token level, as shown in the example below.

| Token | Label |
|:------|:------|
| I | c |
| saws | i |
| the | c |
| show | c |
| last | c |
| nigt | i |
| . | c |

We particularly encourage development of multilingual systems that can process all languages using a single model, but this is not a mandatory requirement to participate in the task.

Data

We provide training, development and test data for each of the five languages: Czech, English, German, Italian and Swedish. The training and development datasets are already available in the MultiGED-2023 GitHub repository, and test sets will be released during the test phase. More information about each corpus is available below.

Some of these datasets are already used in Grammatical Error Detection/Correction (GED/GEC) research, but we also release two new datasets: REALEC (English) and SweLL-gold (Swedish). Where possible, we use the same train/dev/test split as previous work (GECCC, FCE, Falko-MERLIN), and only create new splits when necessary (REALEC, MERLIN, SweLL). All datasets are derived from annotated second language learner essays.

Please let us know if you find any errors/inconsistencies in a dataset by submitting an Issue/Pull Request on the GitHub repo. Any changes to the data will be announced via the MultiGED-2023 Google Group (so please join it!).

| Source corpus | Language | Split | Nr sentences | Nr tokens | Nr errors | Error rate |
|:--------------|:---------|:------|-------------:|----------:|----------:|-----------:|
| GECCC | Czech | Total | 35,453 | 399,742 | 84,041 | 0.210 |
| FCE | English | Total | 33,243 | 531,416 | 50,860 | 0.096 |
| REALEC* | English | Total | 8,136 | 177,769 | 16,608 | 0.093 |
| Falko-MERLIN | German | Total | 24,079 | 381,134 | 57,897 | 0.152 |
| MERLIN | Italian | Total | 7,949 | 99,698 | 14,893 | 0.149 |
| SweLL-gold | Swedish | Total | 8,553 | 145,507 | 27,274 | 0.187 |

* dev and test sets only

Data Format

Data is provided in a tab-separated format consisting of two columns: the first column contains the token and the second column contains the label (c or i), as in the Task Description. Note that there are no column headers, each sentence is separated by an empty line, and double quotes are escaped (\"). It is expected that system output is generated in the same format.
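As an informal illustration, a minimal Python reader for this format might look as follows (the function name and file path are hypothetical, not part of the official tooling):

```python
def read_multiged_tsv(path):
    """Read a MultiGED-style two-column TSV: one token<TAB>label line
    per token, no header, sentences separated by blank lines."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks the end of a sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                # Double quotes in tokens are escaped as \" in the data;
                # this sketch keeps them escaped rather than unescaping.
                token, label = line.split("\t")
                current.append((token, label))
    if current:  # file may lack a trailing blank line
        sentences.append(current)
    return sentences
```

Writing system output is simply the reverse: emit one token and label per line, tab-separated, with a blank line between sentences.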

External Data

Participants may use additional resources to build their systems provided that the resource is publicly available for research purposes. This includes monolingual data, artificial data, pretrained models, syntactic parsers, etc. After the shared task, we encourage participants to share any newly created resources with the community.

Data Licenses

| Language | Corpus name | Corpus license | MultiGED license |
|:---------|:------------|:---------------|:-----------------|
| Czech | GECCC | CC BY-SA 4.0 | CC BY-SA 4.0 |
| English | FCE | custom | custom |
| English | REALEC | CC BY-SA 4.0 | CC BY-SA 4.0 |
| German | Falko | CC BY-SA 4.0 | CC BY-SA 4.0 |
| German | MERLIN | CC BY-SA 4.0 | CC BY-SA 4.0 |
| Italian | MERLIN | CC BY-SA 4.0 | CC BY-SA 4.0 |
| Swedish | SweLL-gold | CLARIN-ID, -PRIV, -NORED, -BY | CC BY 4.0 |

Evaluation

Evaluation will be carried out in terms of token-based Precision, Recall and F0.5 to be consistent with previous work on error detection (Bell et al., 2019; Kaneko and Komachi, 2019; Yuan et al., 2021).

Example:

| Token | Hypothesis | Reference | Meaning |
|:------|:-----------|:----------|:---------------|
| I | c | c | - |
| saws | i | i | True Positive |
| the | c | c | - |
| show | i | c | False Positive |
| last | c | c | - |
| nigt | c | i | False Negative |
| . | c | c | - |

F0.5 is used instead of F1 because it weights precision twice as heavily as recall: humans judge false positives (correct tokens flagged as errors) more harshly than false negatives (missed errors), so precision matters more than recall.
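For intuition, the three metrics can be computed from token-level counts as sketched below (this is an illustration, not the official eval.py). On the example above there is one true positive, one false positive and one false negative, so all three scores come out to 0.5:

```python
def token_prf(tp, fp, fn, beta=0.5):
    """Precision, recall and F_beta from token-level counts.
    beta=0.5 weights precision twice as heavily as recall."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if (p or r) else 0.0
    return p, r, f

# Counts from the Hypothesis/Reference example: TP=1, FP=1, FN=1
print(token_prf(1, 1, 1))  # -> (0.5, 0.5, 0.5)
```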

Although official evaluation will be carried out on CodaLab, we include our evaluation script in this repository, eval.py, which can be used to evaluate system performance independently. This script can be run from the command line as follows:

python3 eval.py -hyp <hyp_tsv> -ref <ref_tsv>

It is assumed that the hyp_tsv and ref_tsv files are in the same two-column tab-separated format as the data provided in this shared task. Note that the script processes a single language at a time, so you will need to call it several times to evaluate multiple languages.

Publication

We encourage you to submit a paper with your system description to the NLP4CALL workshop special track. We follow the same requirements for paper submissions as the NLP4CALL workshop, i.e. we use the same template and apply the same page limit. All papers will be reviewed by the organizing committee.

Accepted papers will be published in the workshop proceedings through the NEALT Proceedings Series and also made available in the ACL Anthology.

Further instructions on this will follow.

Timeline

Organizers

Contact information and forum for discussions

Please join the MultiGED-2023 Google Group to ask questions, hold discussions and browse previously answered questions.