Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text such as spelling, punctuation, grammatical, and word choice errors.
GEC is typically formulated as a sentence correction task. A GEC system takes a potentially erroneous sentence as input and is expected to transform it into its corrected version. See the example given below:
Input (Erroneous) | Output (Corrected) |
---|---|
She see Tom is catched by policeman in park at last night. | She saw Tom caught by a policeman in the park last night. |
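
To make the input/output relationship concrete, the sketch below derives token-level edits from the sentence pair above. It is only an illustration, not how any of the systems listed here work: it assumes whitespace tokenization and uses Python's standard `difflib.SequenceMatcher` for alignment, and the helper names `extract_edits` and `apply_edits` are hypothetical.

```python
from difflib import SequenceMatcher


def extract_edits(source_tokens, target_tokens):
    """Return (start, end, replacement_tokens) spans over the source tokens."""
    matcher = SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replace, delete and insert operations
            edits.append((i1, i2, target_tokens[j1:j2]))
    return edits


def apply_edits(source_tokens, edits):
    """Rebuild the corrected tokens by applying edits from right to left."""
    tokens = list(source_tokens)
    # Right-to-left application keeps earlier span offsets valid.
    for start, end, replacement in sorted(
        edits, key=lambda e: (e[0], e[1]), reverse=True
    ):
        tokens[start:end] = replacement
    return tokens


# Whitespace tokenization is a simplification made for this sketch.
source = "She see Tom is catched by policeman in park at last night.".split()
target = "She saw Tom caught by a policeman in the park last night.".split()

edits = extract_edits(source, target)
for start, end, replacement in edits:
    print(source[start:end], "->", replacement)
```

Applying the extracted edits back to the source with `apply_edits(source, edits)` reproduces the corrected token list, which is a quick sanity check on the alignment.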
The CoNLL-2014 shared task test set is the most widely used dataset to benchmark GEC systems. The test set contains 1,312 English sentences with error annotations by 2 expert annotators. Models are evaluated with the MaxMatch scorer (Dahlmeier and Ng, 2012), which computes a span-based Fβ-score (with β set to 0.5 so that precision is weighted twice as much as recall).
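
As a rough illustration of the Fβ computation only, the sketch below scores a set of system edits against a set of gold edits. It is not the MaxMatch scorer, which additionally searches for the system edit sequence that best matches the gold annotations. The function names, the `(start, end, correction)` edit representation, the example edits, and the convention of returning 1.0 for an empty denominator are all assumptions of this sketch.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from counts of correct, spurious and missed edits."""
    # Assumption of this sketch: an empty denominator yields 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    # With beta = 0.5, precision counts twice as much as recall.
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def span_f_beta(system_edits, gold_edits, beta=0.5):
    """Score exact-match edit spans; each edit is (start, end, correction)."""
    system, gold = set(system_edits), set(gold_edits)
    tp = len(system & gold)
    return f_beta(tp, len(system - gold), len(gold - system), beta)


# Hypothetical edits over source token offsets, for illustration only.
gold = {(1, 2, "saw"), (3, 5, "caught"), (6, 6, "a"), (8, 8, "the"), (9, 10, "")}
system = {(1, 2, "saw"), (3, 5, "caught"), (6, 6, "the")}

print(round(span_f_beta(system, gold), 4))          # 0.5882
print(round(span_f_beta(system, gold, beta=1), 4))  # 0.5
```

Here precision is 2/3 and recall is 2/5, so F0.5 (about 0.59) is higher than F1 (0.5): a system that proposes few but mostly correct edits is rewarded under β = 0.5.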
The shared task setting restricts systems to publicly available training datasets to ensure a fair comparison between systems. The highest published scores on the CoNLL-2014 test set are given below. A distinction is made between papers that report results in the restricted CoNLL-2014 shared task setting, training on publicly available datasets only (Restricted), and those that make use of large, non-public datasets (Unrestricted).
Restricted:
Model | F0.5 | Paper / Source | Code |
---|---|---|---|
CNN Seq2Seq + Quality Estimation (Chollampatt and Ng, EMNLP 2018) | 56.52 | Neural Quality Estimation of Grammatical Error Correction | Official |
SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 56.25 | Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation | NA |
Transformer (Junczys-Dowmunt et al., 2018) | 55.8 | Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task | Official |
CNN Seq2Seq (Chollampatt and Ng, 2018) | 54.79 | A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction | Official |
Unrestricted:
Model | F0.5 | Paper / Source | Code |
---|---|---|---|
CNN Seq2Seq + Fluency Boost (Ge et al., 2018) | 61.34 | Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study | NA |
Restricted: uses only publicly available datasets. Unrestricted: uses non-public datasets.
Bryant and Ng (2015) released 8 additional annotations (in addition to the two official annotations) for the CoNLL-2014 shared task test set. The scores below are evaluated against all 10 annotations.
Restricted:
Model | F0.5 | Paper / Source | Code |
---|---|---|---|
SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 72.04 | Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation | NA |
CNN Seq2Seq (Chollampatt and Ng, 2018) | 70.14 (measured by Ge et al., 2018) | A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction | Official |
Unrestricted:
Model | F0.5 | Paper / Source | Code |
---|---|---|---|
CNN Seq2Seq + Fluency Boost (Ge et al., 2018) | 76.88 | Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study | NA |
The JFLEG test set, released by Napoles et al. (2017), consists of 747 English sentences with 4 references for each sentence. Models are evaluated with the GLEU metric (Napoles et al., 2016).
Restricted:
Model | GLEU | Paper / Source | Code |
---|---|---|---|
SMT + BiGRU (Grundkiewicz and Junczys-Dowmunt, 2018) | 61.50 | Near Human-Level Performance in Grammatical Error Correction with Hybrid Machine Translation | NA |
Transformer (Junczys-Dowmunt et al., 2018) | 59.9 | Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task | NA |
CNN Seq2Seq (Chollampatt and Ng, 2018) | 57.47 | A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction | Official |
Unrestricted:
Model | GLEU | Paper / Source | Code |
---|---|---|---|
CNN Seq2Seq + Fluency Boost and inference (Ge et al., 2018) | 62.37 | Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study | NA |