multiged-2023

Shared task on Multilingual Grammatical Error Detection (MultiGED-2023)

The Computational SLA working group invites you to participate in the first shared task on Multilingual Grammatical Error Detection, MultiGED, which includes five languages: Czech, English, German, Italian and Swedish.

The results will be presented on 22nd May, 2023, at the NLP4CALL workshop, co-located with the NoDaLiDa conference to be held in the Faroe Islands.

To register for/express interest in the shared task, please fill in this form.
To get important information and updates about the shared task, please join the MultiGED-2023 Google Group.
Official system evaluation will be carried out on CodaLab.


Competition results

The results are roughly ordered by F0.5-score and show only each team's best of two submissions (based on F0.5).

| Team Name | Czech P | Czech R | Czech F0.5 | English (FCE) P | English (FCE) R | English (FCE) F0.5 | English (REALEC) P | English (REALEC) R | English (REALEC) F0.5 |
|:----------|--------:|--------:|-----------:|----------------:|----------------:|-------------------:|-------------------:|-------------------:|----------------------:|
| EliCoDe | 82.01 | 51.79 | 73.44 | 73.64 | 50.34 | 67.40 | 44.32 | 40.73 | 43.55 |
| DSL-MIM-HUS | 58.31 | 55.69 | 57.76 | 72.36 | 37.81 | 61.18 | 62.81 | 28.88 | 50.86 |
| Brainstorm Thinkers | 62.35 | 23.44 | 46.81 | 70.21 | 37.55 | 59.81 | 48.19 | 31.22 | 43.46 |
| VLP-char | 34.93 | 63.95 | 38.42 | 20.76 | 29.53 | 22.07 | - | - | - |
| NTNU-TRH | 80.65 | 6.49 | 24.54 | 81.37 | 1.84 | 8.45 | 51.34 | 1.13 | 5.19 |
| su-dali | - | - | - | - | - | - | - | - | - |
| Team Name | German P | German R | German F0.5 | Italian P | Italian R | Italian F0.5 | Swedish P | Swedish R | Swedish F0.5 |
|:----------|---------:|---------:|------------:|----------:|----------:|-------------:|----------:|----------:|-------------:|
| EliCoDe | 84.78 | 73.75 | 82.32 | 86.67 | 67.96 | 82.15 | 81.80 | 66.34 | 78.16 |
| DSL-MIM-HUS | 77.80 | 51.92 | 70.75 | 75.72 | 38.67 | 63.55 | 74.85 | 44.92 | 66.05 |
| Brainstorm Thinkers | 77.94 | 47.55 | 69.11 | 70.65 | 36.46 | 59.49 | 73.81 | 39.94 | 63.11 |
| VLP-char | 25.18 | 44.27 | 27.56 | 25.79 | 44.24 | 28.14 | 26.40 | 55.00 | 29.46 |
| NTNU-TRH | 83.56 | 15.58 | 44.61 | 93.38 | 19.84 | 53.62 | 80.12 | 5.09 | 20.31 |
| su-dali | - | - | - | - | - | - | 82.41 | 27.18 | 58.60 |

Task description

The aim of this shared task is to detect tokens in need of correction across five different languages, labeling them as either correct (“c”) or incorrect (“i”), i.e. performing binary classification at the token level, as shown in the example below.

| Token | Label |
|:------|:------|
| I | c |
| saws | i |
| the | c |
| show | c |
| last | c |
| nigt | i |
| . | c |

We particularly encourage development of multilingual systems that can process all languages using a single model, but this is not a mandatory requirement to participate in the task.

Data

We provide training, development and test data for each of the five languages: Czech, English, German, Italian and Swedish. The training and development datasets are already available in the MultiGED-2023 GitHub repository, and test sets will be released during the test phase. More information about each corpus is available below.

Some of these datasets are already used in Grammatical Error Detection/Correction (GED/GEC) research, but we also release two new datasets: REALEC (English) and SweLL-gold (Swedish). Where possible, we use the same train/dev/test split as previous work (GECCC, FCE, Falko-MERLIN), and only create new splits when necessary (REALEC, MERLIN, SweLL). All datasets are derived from annotated second language learner essays.

Please let us know if you find any errors/inconsistencies in a dataset by submitting an Issue/Pull Request on the GitHub repo. Any changes to the data will be announced via the MultiGED-2023 Google Group (so please join it!).

| Source corpus | Language | Split | Nr sentences | Nr tokens | Nr errors | Error rate |
|:--------------|:---------|:------|-------------:|----------:|----------:|-----------:|
| GECCC | Czech | Total | 35,453 | 399,742 | 84,041 | 0.210 |
| FCE | English | Total | 33,243 | 531,416 | 50,860 | 0.096 |
| REALEC* | English | Total | 8,136 | 177,769 | 16,608 | 0.093 |
| Falko-MERLIN | German | Total | 24,079 | 381,134 | 57,897 | 0.152 |
| MERLIN | Italian | Total | 7,949 | 99,698 | 14,893 | 0.149 |
| SweLL-gold | Swedish | Total | 8,553 | 145,507 | 27,274 | 0.187 |

* dev and test sets only

Data Format

Data is provided in a tab-separated format consisting of two columns: the first column contains the token and the second column contains the label (c or i), as in the Task Description. Note that there are no column headers, each sentence is separated by an empty line, and double quotes are escaped (\"). It is expected that system output is generated in the same format.
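As an informal illustration, a minimal Python reader for this format might look as follows (the function name and file path are hypothetical, not part of the official tooling):

```python
def read_multiged_tsv(path):
    """Read a MultiGED-style two-column TSV: one token<TAB>label line
    per token, no header, sentences separated by blank lines."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # blank line marks the end of a sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                # Double quotes in tokens are escaped as \" in the data;
                # this sketch keeps them escaped rather than unescaping.
                token, label = line.split("\t")
                current.append((token, label))
    if current:  # file may lack a trailing blank line
        sentences.append(current)
    return sentences
```

Writing system output is simply the reverse: emit one token and label per line, tab-separated, with a blank line between sentences.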

External Data

Participants may use additional resources to build their systems provided that the resource is publicly available for research purposes. This includes monolingual data, artificial data, pretrained models, syntactic parsers, etc. After the shared task, we encourage participants to share any newly created resources with the community.

Data Licenses

| Language | Corpus name | Corpus license | MultiGED license |
|:---------|:------------|:---------------|:-----------------|
| Czech | GECCC | CC BY-SA 4.0 | CC BY-SA 4.0 |
| English | FCE | custom | custom |
| English | REALEC | CC BY-SA 4.0 | CC BY-SA 4.0 |
| German | Falko | CC BY-SA 4.0 | CC BY-SA 4.0 |
| German | MERLIN | CC BY-SA 4.0 | CC BY-SA 4.0 |
| Italian | MERLIN | CC BY-SA 4.0 | CC BY-SA 4.0 |
| Swedish | SweLL-gold | CLARIN-ID, -PRIV, -NORED, -BY | CC BY 4.0 |

Evaluation

Evaluation will be carried out in terms of token-based Precision, Recall and F0.5 to be consistent with previous work on error detection (Bell et al., 2019; Kaneko and Komachi, 2019; Yuan et al., 2021).

Example:

| Token | Hypothesis | Reference | Meaning |
|:------|:-----------|:----------|:---------------|
| I | c | c | - |
| saws | i | i | True Positive |
| the | c | c | - |
| show | i | c | False Positive |
| last | c | c | - |
| nigt | c | i | False Negative |
| . | c | c | - |

F0.5 is used instead of F1 because it weights precision twice as heavily as recall: humans judge false positives (correct tokens flagged as errors) more harshly than false negatives (missed errors), so precision matters more than recall.
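For intuition, the three metrics can be computed from token-level counts as sketched below (this is an illustration, not the official eval.py). On the example above there is one true positive, one false positive and one false negative, so all three scores come out to 0.5:

```python
def token_prf(tp, fp, fn, beta=0.5):
    """Precision, recall and F_beta from token-level counts.
    beta=0.5 weights precision twice as heavily as recall."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if (p or r) else 0.0
    return p, r, f

# Counts from the Hypothesis/Reference example: TP=1, FP=1, FN=1
print(token_prf(1, 1, 1))  # -> (0.5, 0.5, 0.5)
```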

Although official evaluation will be carried out on CodaLab, we include our evaluation script in this repository, eval.py, which can be used to evaluate system performance independently. This script can be run from the command line as follows:

python3 eval.py -hyp <hyp_tsv> -ref <ref_tsv>

It is assumed that the hyp_tsv and ref_tsv files are in the same two-column tab-separated format as the data provided in this shared task. Note that the script processes a single language at a time, so you will need to call it several times to evaluate multiple languages.

Publication

We encourage you to submit a paper with your system description to the NLP4CALL workshop special track. We follow the same requirements for paper submissions as the NLP4CALL workshop, i.e. we use the same template and apply the same page limit. All papers will be reviewed by the organizing committee.

Accepted papers will be published in the workshop proceedings through the NEALT Proceedings Series and also made available in the ACL Anthology.

Further instructions on this will follow.

Timeline

Organizers

Contact information and forum for discussions

Please join the MultiGED-2023 Google Group to ask questions, hold discussions and browse previously answered questions.