This tool automates extracting tables from PDF documents. It goes through every page, detects where the tables are, and converts them to CSV. Using pytesseract, it reads the text in each cell and determines the cell's position in the table.
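To illustrate the last step, here is a minimal sketch of how recognised cells could be arranged into rows and columns and written to CSV. It is not the tool's actual code: the `Cell` type, the `cells_to_rows` and `rows_to_csv` helpers, and the `row_tolerance` value are assumptions made for this example, and the cell text is treated as already recognised (in the tool, pytesseract supplies it).

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Cell:
    x: int     # left edge of the cell in pixels
    y: int     # top edge of the cell in pixels
    text: str  # text already recognised by OCR


def cells_to_rows(cells, row_tolerance=10):
    """Group cells into rows by their top edge, then order each row left to right."""
    rows = []
    for cell in sorted(cells, key=lambda c: c.y):
        # A cell whose top edge is within the tolerance of the current row's
        # first cell joins that row; otherwise it starts a new row.
        if rows and abs(cell.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    return [[c.text for c in sorted(row, key=lambda c: c.x)] for row in rows]


def rows_to_csv(rows):
    """Serialise a list of rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Cells in arbitrary order, with slightly uneven coordinates as OCR would give.
cells = [
    Cell(210, 52, "Price"), Cell(10, 50, "Item"),
    Cell(12, 101, "Apple"), Cell(208, 99, "1.20"),
]
table_csv = rows_to_csv(cells_to_rows(cells))
print(table_csv)
```

The tolerance absorbs the few pixels of jitter between cells that belong to the same row; a value that is too large would merge neighbouring rows.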
You can use the tool either by running the Python script directly with command-line flags, or by starting a web server that hosts a page for uploading files, processes them on the server, and returns the CSV files while displaying the current progress.
While processing, it shows the status of each page and lets you download each result individually, or all of them together at the end.
Input table as an image in PDF file
Parsed table
Required Python libraries
pip install pytesseract opencv-python tqdm progressbar pdf2image pymupdf fitz frontend tools
# Optional for webserver
pip install aiohttp eventlet
Tesseract installation on Linux using apt
sudo apt install tesseract-ocr tesseract-ocr-rus
Linux using pacman
sudo pacman -S poppler
sudo pacman -S tesseract # When prompted, select the language you need, for example rus (94)
sudo pacman -S tesseract-data-rus tesseract-data-eng
Windows
- Download the Tesseract installer (.exe) from https://github.com/UB-Mannheim/tesseract/wiki.
- Install it to C:\Program Files (x86)\Tesseract-OCR
- Open a Windows command prompt (in your virtual environment, if you use one) or an Anaconda prompt.
- Run
pip install pytesseract
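If pytesseract cannot find the Tesseract binary after installation, it can be pointed at it explicitly. The sketch below is one possible way to do that, not part of this tool: `find_tesseract` and `configure_pytesseract` are hypothetical helpers, and the default path assumes the install directory given above. `pytesseract.pytesseract.tesseract_cmd` is the attribute pytesseract reads to locate the executable.

```python
import os
import shutil

# Install location from the Windows steps above.
DEFAULT_WINDOWS_PATH = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(DEFAULT_WINDOWS_PATH,)):
    """Return the path to the tesseract executable, or None if it is not found."""
    # Prefer whatever is already on PATH (typical on Linux).
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to known install locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(path):
    """Tell pytesseract which binary to call (requires pytesseract to be installed)."""
    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = path


tesseract_path = find_tesseract()
print(tesseract_path or "tesseract not found; check the installation path")
```

Calling `configure_pytesseract(tesseract_path)` once at startup, when a path was found, should be enough for later OCR calls to use that binary.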
From a local PDF file
python3 recognise.py --client --input example/rencap2021.pdf --limit 10
From a remote PDF file
python3 recognise.py --client --remote https://github.com/pavtiger/Parse-tables-from-PDF/raw/master/example/rencap2021.pdf --limit 10
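For the --remote case, the sketch below shows one way a PDF could be downloaded to a local file before processing. It illustrates the general approach only and is not the tool's actual code: the `fetch_pdf` helper, the `remote.pdf` file name and the 30-second timeout are made up for this example. It uses only the standard library and checks the `%PDF-` signature so that an HTML error page is not mistaken for a document.

```python
import os
import shutil
import tempfile
import urllib.request


def fetch_pdf(url, dest_dir=None):
    """Download url to a local file and return its path; reject non-PDF content."""
    dest_dir = dest_dir or tempfile.mkdtemp()
    dest = os.path.join(dest_dir, "remote.pdf")

    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url, timeout=30) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)

    # Every PDF file starts with the "%PDF-" signature.
    with open(dest, "rb") as f:
        if f.read(5) != b"%PDF-":
            raise ValueError(f"{url} does not look like a PDF file")
    return dest
```

Calling `fetch_pdf` with the URL from the command above would return a local path that could then be handled like a file given with --input.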
All output is written to the output/ directory. You can find example results in example/.
You can also change the render quality (200 or higher is recommended):
python3 recognise.py --client --input example/rencap2021.pdf --limit 10 --quality 300
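To give a feel for why a higher quality setting costs more memory, here is a rough back-of-the-envelope sketch. It assumes that the quality value is the render resolution in DPI, that pages are A4, and that each page is held as an uncompressed 3-byte-per-pixel RGB image; none of this is confirmed by the tool itself, and `page_pixels` and `page_megabytes` are hypothetical helpers.

```python
A4_INCHES = (8.27, 11.69)  # width and height of an A4 page in inches


def page_pixels(dpi, size_inches=A4_INCHES):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(size_inches[0] * dpi), round(size_inches[1] * dpi)


def page_megabytes(dpi, bytes_per_pixel=3):
    """Approximate size of one uncompressed RGB page image, in megabytes."""
    width, height = page_pixels(dpi)
    return width * height * bytes_per_pixel / 1_000_000


for dpi in (200, 300):
    print(dpi, page_pixels(dpi), round(page_megabytes(dpi), 1))
```

Under these assumptions the pixel count grows with the square of the DPI, so going from 200 to 300 makes each page image 2.25 times larger. The sketch covers a single page image only; it does not model how many pages or intermediate images are kept in memory at once.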
To start the web server instead:
python3 recognise.py --server
All available flags:
- input - Path to the input PDF file to convert
- remote - Link to a remote location from which to obtain the PDF file
- limit - Process only the first N pages (-1 for all)
- quality - PDF page render quality (default 200). Higher values consume more RAM, but going below 200 is strongly discouraged because it causes recognition errors. For reference, 300 requires 8 GB of RAM
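As an illustration of how these flags fit together, the following is a minimal argparse sketch. It is not taken from recognise.py: the flag names and the quality default come from this README, while the help texts for --client and --server, the argument types, and the -1 default for --limit are assumptions.

```python
import argparse


def build_parser():
    """Build a command-line parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Extract tables from a PDF to CSV")
    # Mode switches (help texts are assumptions for this sketch).
    parser.add_argument("--client", action="store_true",
                        help="process a single PDF from the command line")
    parser.add_argument("--server", action="store_true",
                        help="start the web server")
    # Input source: a local path or a remote URL.
    parser.add_argument("--input", help="path to the input PDF file")
    parser.add_argument("--remote", help="link to a remote PDF file")
    # Processing options (the -1 default for --limit is assumed).
    parser.add_argument("--limit", type=int, default=-1,
                        help="process only the first N pages (-1 for all)")
    parser.add_argument("--quality", type=int, default=200,
                        help="PDF page render quality")
    return parser


# Parse the same arguments as the render-quality example command.
args = build_parser().parse_args(
    ["--client", "--input", "example/rencap2021.pdf", "--limit", "10", "--quality", "300"]
)
print(args)
```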