Awesome Open Source


This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

  1. Tesseract OCR

    sudo apt-get install tesseract-ocr

  2. ImageMagick

    sudo apt-get install imagemagick

  3. PDF Utilities

    sudo apt-get install poppler-utils

  4. Python packages

    sudo pip install -r requirements.txt
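
To confirm the system packages above are installed before running anything, a small check along these lines may help. It is a sketch, not part of the project: it assumes the packages provide the usual `tesseract`, `convert`, and `pdftoppm` executables, and the helper name `missing_tools` is made up for illustration.

```python
import shutil

# Executables assumed to come from the packages listed above:
# tesseract-ocr -> tesseract, imagemagick -> convert, poppler-utils -> pdftoppm
REQUIRED_TOOLS = ("tesseract", "convert", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result means every listed executable was found on PATH; it does not check versions or installed language data.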


  1. Clear the pdf/ folder and copy all the PDF files you want to scan into it.

  2. Run the OCR:

  3. The resulting text files will be available in the txt/ folder once the process completes.
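
For orientation, the general pdf/ to txt/ flow described in these steps could look roughly like the sketch below. This is not the project's own script, whose command is not shown here. It assumes pages are rendered with `pdftoppm` and recognized with `tesseract`, and it skips any ImageMagick preprocessing and any table-specific handling. The function names and the `pages` working folder are hypothetical.

```python
import subprocess
from pathlib import Path


def pdftoppm_cmd(pdf, prefix, dpi=300):
    """Command that renders each page of `pdf` to `<prefix>-<n>.png`."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def tesseract_cmd(image, out_base):
    """Command that OCRs `image` and writes `<out_base>.txt`."""
    return ["tesseract", str(image), str(out_base)]


def ocr_folder(pdf_dir="pdf", txt_dir="txt", work_dir="pages"):
    """Hypothetical driver: OCR every PDF in pdf_dir into txt_dir."""
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # One working folder per PDF keeps page images from colliding.
        page_dir = Path(work_dir) / pdf.stem
        page_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(pdftoppm_cmd(pdf, page_dir / "page"), check=True)
        for image in sorted(page_dir.glob("page-*.png")):
            out_base = Path(txt_dir) / f"{pdf.stem}-{image.stem}"
            subprocess.run(tesseract_cmd(image, out_base), check=True)
```

Calling `ocr_folder()` from the repository root would write one text file per page. The 300 DPI default is an assumption chosen for legible OCR input, not a value taken from the project.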


  1. If the above doesn't work for you, try the alternate method.

  2. Save your file as input.pdf in the root directory.

  3. Run

