This is an attempt to create a model that is able to fix spelling errors and common typos.
A work-in-progress English model and an interactive demo can be found here, and a German version here.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
To generate the training data simply run these two scripts:
sh combine.sh
python generate_dataset.py
By default, this creates the English dataset. To switch to a different language, change the language tag in both scripts.
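As a rough illustration of the kind of training data a spelling-correction model needs, the sketch below builds (noisy, clean) sentence pairs by injecting random character-level typos. This is an assumption for illustration only and is not taken from generate_dataset.py; the names add_typo and make_pair and the error_rate and seed parameters are hypothetical.

```python
import random
from typing import Tuple


def add_typo(word: str, rng: random.Random) -> str:
    """Return `word` with one random character-level edit applied.

    Hypothetical helper, not part of this repository's scripts.
    """
    if len(word) < 2:
        return word  # too short to corrupt without emptying it
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["swap", "drop", "double"])
    if op == "swap":
        # Transpose two adjacent characters.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        # Delete one character.
        return word[:i] + word[i + 1:]
    # Duplicate one character.
    return word[:i] + word[i] + word[i:]


def make_pair(sentence: str, error_rate: float = 0.2, seed: int = 0) -> Tuple[str, str]:
    """Build a (noisy, clean) pair; each word is corrupted with probability `error_rate`."""
    rng = random.Random(seed)
    noisy = [
        add_typo(word, rng) if rng.random() < error_rate else word
        for word in sentence.split()
    ]
    return " ".join(noisy), sentence


if __name__ == "__main__":
    noisy, clean = make_pair("this is a simple test sentence", error_rate=0.5)
    print(noisy)
    print(clean)
```

A sequence-to-sequence model would then be trained with the noisy sentence as input and the clean sentence as target. Real typos are less uniform than this random noise, so a sketch like this is only a starting point.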
For the English model, run:
sh train_bart_model.sh
For the German model, run:
sh train_de_bart_model.sh
This is an open research project; improvements and contributions are welcome. If we achieve promising results, we will publish them in a more formal way (paper). All contributors will be recognized.
- https://github.com/snukky/wikiedits
- https://github.com/mhagiwara/github-typo-corpus
  - Too much noise, does not work well.