This is an attempt to create a model that is able to fix spelling errors and common typos.
An English work-in-progress model and interactive demo can be found here, and a German version here.
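Once trained, correcting text is a single sequence-to-sequence generation call. A minimal inference sketch using the Hugging Face transformers pipeline; the model path is a placeholder for your own trained checkpoint, not a published model name:

```python
# Minimal inference sketch. Assumption: a fine-tuned BART checkpoint was
# saved to "path/to/spelling-model"; replace this with your own model.
from transformers import pipeline

fix_spelling = pipeline("text2text-generation", model="path/to/spelling-model")

print(fix_spelling("lets do a comparsion", max_length=256))
# -> [{'generated_text': '...'}] containing the corrected sentence
```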
To set up the project, create a virtual environment and install the dependencies:

```sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
To generate the training data, simply run these two scripts:

```sh
sh convert_leipzig_data.sh
python generate_dataset.py
```
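Conceptually, the dataset generation step pairs clean sentences with synthetically corrupted copies. A rough sketch of this kind of character-level noise injection; the actual corruption rules in generate_dataset.py may differ:

```python
import random

def corrupt(sentence: str, p: float = 0.05) -> str:
    """Randomly delete, duplicate, or swap characters to simulate typos."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = random.random()
        if r < p:                                # delete this character
            i += 1
        elif r < 2 * p:                          # duplicate this character
            out.extend([chars[i], chars[i]])
            i += 1
        elif r < 3 * p and i + 1 < len(chars):   # swap with the next character
            out.extend([chars[i + 1], chars[i]])
            i += 2
        else:                                    # keep the character as-is
            out.append(chars[i])
            i += 1
    return "".join(out)

clean = "This is a correct sentence."
print(corrupt(clean), "\t", clean)  # one (noisy, clean) training pair
```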
Optional: If you want to combine multiple languages, you can edit and run the `combine.sh` script (a sketch of what this step amounts to follows below).
By default, this creates the English dataset. To switch to a different language, change the language tag in those two scripts.
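The combine step is not specified here, but conceptually it amounts to concatenating and shuffling the per-language pair files. A hypothetical Python equivalent; the file names and the tab-separated format are placeholders, not the script's actual contents:

```python
import random

# Assumption: one tab-separated (noisy, clean) pair per line, one file per language.
files = ["data/en.train.tsv", "data/de.train.tsv"]  # placeholder paths

lines = []
for path in files:
    with open(path, encoding="utf-8") as f:
        lines.extend(f.readlines())

random.shuffle(lines)  # mix languages so training batches are not monolingual

with open("data/combined.train.tsv", "w", encoding="utf-8") as f:
    f.writelines(lines)
```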
To train the English model, run:

```sh
sh train_bart_model.sh
```

For the German model, run:

```sh
sh train_de_bart_model.sh
```
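Under the hood, this is standard sequence-to-sequence fine-tuning of BART on (noisy, clean) sentence pairs. A condensed sketch with Hugging Face transformers; the checkpoint, file name, column names, and hyperparameters are assumptions, not the repo's exact settings:

```python
# Condensed seq2seq fine-tuning sketch. Assumptions: a CSV named train.csv
# with "noisy" and "clean" columns, and an English BART base checkpoint.
from datasets import load_dataset
from transformers import (
    BartForConditionalGeneration,
    BartTokenizerFast,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/bart-base"
tokenizer = BartTokenizerFast.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

dataset = load_dataset("csv", data_files={"train": "train.csv"})

def preprocess(batch):
    # The noisy text is the encoder input; the clean text becomes the labels.
    inputs = tokenizer(batch["noisy"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["clean"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset["train"].map(preprocess, batched=True,
                                 remove_columns=["noisy", "clean"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="spelling-bart",
                                  per_device_train_batch_size=16,
                                  num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```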
This is an open research project; improvements and contributions are welcome. If we achieve promising results, we will publish them in a more formal way (a paper). All contributors will be recognized.
- How do we evaluate the quality of the model, apart from computing CER on synthetic data? (See the CER sketch after this list.)
- What are good data sets to train on?
  - https://github.com/snukky/wikiedits
  - https://github.com/mhagiwara/github-typo-corpus
    - Too much noise; it does not work well.
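For reference, CER (character error rate) is the character-level Levenshtein distance between the model output and the reference text, normalized by the reference length. A self-contained sketch:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(cer("comparison", "comparsion"))  # 0.2 -> two edits over ten characters
```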