Kragen Sitaker did amazing work back in 2005/2006 'liberating' the OED first edition, which is now (mostly) in the public domain, and posted fairly good scans of volumes 1-6 on archive.org.
The home pages for the six volumes are:
- http://www.archive.org/details/oed01arch
- http://www.archive.org/details/oed02arch
- http://www.archive.org/details/oed03arch
- http://www.archive.org/details/oed04arch
- http://www.archive.org/details/newenglishdict05murrmiss
- (6a) http://www.archive.org/details/oed6aarch
- (6b) http://www.archive.org/details/oed6barch
However, at the time he was unable to do much on the OCR front (no doubt because of the poor performance of open source OCR, particularly on a text as complex as the OED, which has lots of non-standard English and font changes). With the better open source OCR engines now available, it should be possible to convert the OED back into text and perhaps wikify it to allow for gradual proofreading and correction.
- Download source pdfs::

    mkdir -p cache/pdf
    cd cache/pdf
    wget http://www.archive.org/download/oed01arch/oed01arch.pdf
Note jp2 is also available: http://www.archive.org/download/oed01arch/oed01arch_jp2.tar
- Install requirements.
  - tesseract for doing OCR. On Debian/Ubuntu::

      apt-get install tesseract-ocr
  - Python and pypdf
- Run the script::

    python oed.py
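
The download and OCR steps above could be scripted roughly as follows. This is only a sketch and not the contents of oed.py, which are not shown here. It assumes that every volume follows the same download URL pattern as oed01arch (only that one is confirmed above) and that page images have already been extracted from the pdfs. The function names and the page file names are hypothetical.

```python
import subprocess

# Archive.org item identifiers, taken from the home page URLs listed above.
VOLUME_IDS = [
    "oed01arch",
    "oed02arch",
    "oed03arch",
    "oed04arch",
    "newenglishdict05murrmiss",
    "oed6aarch",
    "oed6barch",
]


def pdf_url(item_id):
    """Build a pdf download URL for one archive.org item.

    Assumption: every item follows the oed01arch pattern
    /download/<id>/<id>.pdf; only oed01arch is confirmed above.
    """
    return "http://www.archive.org/download/%s/%s.pdf" % (item_id, item_id)


def tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for OCRing one page image with tesseract.

    tesseract writes the recognised text to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the output file name.

    Requires the tesseract binary to be installed and on the PATH.
    """
    subprocess.check_call(tesseract_command(image_path, output_base, lang))
    return output_base + ".txt"
```

For example, `pdf_url("oed02arch")` gives the candidate URL to pass to wget, and `ocr_page("page-0001.tif", "page-0001")` would write the text of one (hypothetical) page image to `page-0001.txt`. Keeping the command construction separate from the call that runs it makes the sketch easy to check without tesseract installed.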