Spelling correction based on pretrained transformer models

Purpose

This is an attempt to create a model that is able to fix spelling errors and common typos.

A work-in-progress English model and interactive demo can be found here, and a German version here.

Install

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Generate Training Data

To generate the training data simply run these two scripts:

sh combine.sh
python generate_dataset.py

By default, this creates the English dataset. To switch to a different language, change the language tag in both scripts.
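
For readers unfamiliar with how training pairs for spelling correction are usually built, here is a minimal sketch of one common approach: take clean sentences and inject random character-level typos, so the model learns to map the noisy text back to the clean text. This is an illustration only and is not taken from generate_dataset.py, which may work differently. The names corrupt_word and make_pair and the error_rate parameter are hypothetical.

```python
import random

# Hypothetical helpers for illustration; generate_dataset.py may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit: swap, delete, insert or replace."""
    if len(word) < 2:
        return word  # too short to corrupt without emptying it
    op = rng.choice(["swap", "delete", "insert", "replace"])
    i = rng.randrange(len(word) - 1)  # position of the edit
    if op == "swap":
        # exchange the characters at positions i and i + 1
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # replace the character at position i
    return word[:i] + rng.choice(ALPHABET) + word[i + 1:]


def make_pair(sentence: str, rng: random.Random, error_rate: float = 0.2):
    """Return (noisy, clean): the model input and its training target."""
    noisy = [
        corrupt_word(w, rng) if rng.random() < error_rate else w
        for w in sentence.split()
    ]
    return " ".join(noisy), sentence


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    noisy, clean = make_pair("the quick brown fox jumps over the lazy dog", rng)
    print(noisy, "->", clean)
```

Purely random edits like these are only a rough stand-in for real human typos, so a dataset built this way may need additional, more realistic error sources.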

How to train a model:

For English, run sh train_bart_model.sh; for the German model, run sh train_de_bart_model.sh.

Contribute:

This is an open research project; improvements and contributions are welcome. If we achieve promising results, we will publish them more formally (as a paper). All contributors will be recognized.

Possible Datasets:

https://github.com/snukky/wikiedits
https://github.com/mhagiwara/github-typo-corpus
- Too much noise, does not work well.

JulienBrochier/spelling
