Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
2019 Ccf Bdci Ocr Mczj Ocr Identificationidelement | 706 | | | | 2 years ago | | | 2 | mit | Python |
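Source code from the Tianchen Poxiao (天晨破晓) team, winner of the Best Innovation Exploration Award at the 2019 CCF-BDCI competition and champion of the OCR-based ID card element extraction task | ||||||||||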
Mybox | 103 | | | | a day ago | | | 34 | apache-2.0 | Java |
Easy tools of document, image, file, network, data, color, and media. | ||||||||||
Automatic_number_plate_recognition_yolo_ocr | 68 | | | | 6 days ago | | | 2 | mit | Python |
Automatic number plate recognition using tech: Yolo, OCR, Scene text detection, scene text recognition, flask, torch | ||||||||||
Image Table Ocr | 54 | | | | 2 years ago | 3 | December 28, 2020 | 1 | mit | Python |
Turn images of tables into CSV data. Detect tables from images and run OCR on the cells. | ||||||||||
Pdf To Csv Table Extactor | 45 | | | | 4 years ago | | | 5 | wtfpl | Python |
Extract tables from scanned PDF documents into a CSV file using OCR and image processing | ||||||||||
Apollo17 | 33 | | | | 3 years ago | | | 1 | agpl-3.0 | HTML |
Apollo 17 Mission Timeline Reconstruction | ||||||||||
Docs2csv | 25 | | | | 7 years ago | | | 2 | | Ruby |
Scan a folder of document files of all types and extract the text into a CSV suitable for Overview | ||||||||||
Segmentation Free_ocr | 9 | | | | 5 years ago | | | | gpl-3.0 | Python |
Recognize Chinese and English without segmentation | ||||||||||
Eulexis_off_line | 9 | | | | a year ago | | | 3 | gpl-3.0 | C++ |
Ancient Greek lemmatisation tool | ||||||||||
Nara Scripts | 7 | | | | 3 years ago | | | 5 | other | Python |
Scripts used in the work of the US National Archives | ||||||||||

This Python package contains modules that help find tabular data in a PDF or image and extract it into CSV format.
Given an image that contains a table…
Extract the text into a CSV format…
PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
$5,16.66,"154,097"
$7,40.01,"64,169"
$10,26.67,"96,283"
$20,100.00,"25,677"
$30,290.83,"8,829"
$50,239.66,"10,714"
$100,919.66,"2,792"
$500,"6,652.07",386
"$40,000","855,899.99",3
1,i223,
Toa,,
,,
,,"* Based upon 2,567,700"
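Because the output quotes any field that contains a comma, it can be read back with a standard CSV parser. The following is a minimal sketch using Python's `csv` module on a few rows copied from the output above. The `to_number` helper is hypothetical and not part of `table_ocr`.

```python
import csv
import io

# A few rows copied from the extracted output above.
sample = '''PRIZE,ODDS 1 IN:,# OF WINNERS*
$3,9.09,"282,447"
"$40,000","855,899.99",3
'''

rows = list(csv.reader(io.StringIO(sample)))
header, data = rows[0], rows[1:]


def to_number(text):
    """Strip "$" and thousands separators, then parse as a float."""
    return float(text.replace("$", "").replace(",", ""))


for prize, odds, winners in data:
    # Quoted fields such as "282,447" arrive as a single string.
    print(to_number(prize), to_number(odds), int(to_number(winners)))
```

The trailing rows of the full output (`1,i223,` and `Toa,,`) appear to be OCR noise, so a real consumer would likely need to filter or validate rows before converting them.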
Along with the Python requirements listed in setup.py, which pip installs automatically with this package, some of the modules have a few external requirements.
I haven’t looked into the minimum required versions of these dependencies, so the versions listed here are simply the ones I’m using.
- `pdfimages` 20.09.0 of Poppler
- `tesseract` 5.0.0 of Tesseract
- `mogrify` 7.0.10 of ImageMagick
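Since pip does not install these external tools, it can help to confirm they are on the `PATH` before running anything. The sketch below checks for the three executables named above with `shutil.which`. The `find_missing_tools` helper is hypothetical, not something `table_ocr` provides, and it does not check versions.

```python
import shutil

# Executable names taken from the external requirements above.
EXTERNAL_TOOLS = ("pdfimages", "tesseract", "mogrify")


def find_missing_tools(tools=EXTERNAL_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing external tools:", ", ".join(missing))
else:
    print("All external tools found.")
```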
A demo module downloads an image from a URL, tries to extract tables from it, and processes the cells into a CSV. You can try it out with one of the images included in this repo.
pip3 install table_ocr
python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png
That will run against the following image:
The following should be printed to your terminal after running the above commands.
Running `extract_tables.main([/tmp/demo_p9on6m8o/simple.png]).`
Extracted the following tables from the image:
[('/tmp/demo_p9on6m8o/simple.png', ['/tmp/demo_p9on6m8o/simple/table-000.png'])]
Processing tables for /tmp/demo_p9on6m8o/simple.png.
Processing table /tmp/demo_p9on6m8o/simple/table-000.png.
Extracted 18 cells from /tmp/demo_p9on6m8o/simple/table-000.png
Cells:
/tmp/demo_p9on6m8o/simple/cells/000-000.png: Cell
/tmp/demo_p9on6m8o/simple/cells/000-001.png: Format
/tmp/demo_p9on6m8o/simple/cells/000-002.png: Formula
...
Here is the entire CSV output:
Cell,Format,Formula
B4,Percentage,None
C4,General,None
D4,Accounting,None
E4,Currency,"=PMT(B4/12,C4,D4)"
F4,Currency,=E4*C4
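The cell filenames in the demo output appear to encode position: `000-000`, `000-001`, and `000-002` line up with the header row `Cell,Format,Formula`. As a rough illustration of how per-cell text could be assembled into CSV rows, here is a sketch that assumes a `<row>-<column>` naming scheme. The `cells_to_csv` helper and the row index used for the `E4` line are assumptions for illustration, not the package's actual implementation.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def cells_to_csv(cell_text):
    """Build CSV text from a {cell image path: OCR text} mapping.

    Assumes names look like "<row>-<column>.png", as the demo output
    suggests. Illustration only; this is not table_ocr's own code.
    """
    rows = defaultdict(dict)
    for name, text in cell_text.items():
        row, col = (int(part) for part in Path(name).stem.split("-"))
        rows[row][col] = text

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in sorted(rows):
        cols = rows[row]
        # Emit each row's cells in column order.
        writer.writerow([cols[c] for c in sorted(cols)])
    return out.getvalue()


demo = {
    "/tmp/demo/simple/cells/000-000.png": "Cell",
    "/tmp/demo/simple/cells/000-001.png": "Format",
    "/tmp/demo/simple/cells/000-002.png": "Formula",
    # Row index 004 for the E4 line is assumed from its position in the CSV.
    "/tmp/demo/simple/cells/004-002.png": "=PMT(B4/12,C4,D4)",
    "/tmp/demo/simple/cells/004-000.png": "E4",
    "/tmp/demo/simple/cells/004-001.png": "Currency",
}
print(cells_to_csv(demo))
```

The CSV writer quotes the formula because it contains commas, which matches the quoting seen in the output above.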
The package is split into modules with narrow focuses.
- `pdf_to_images` uses Poppler and ImageMagick to extract images from a PDF.
- `extract_tables` finds and extracts table-looking things from an image.
- `extract_cells` extracts and orders cells from a table.
- `ocr_image` uses Tesseract to OCR the text from an image of a cell.
- `ocr_to_csv` converts into a CSV the directory structure that `ocr_image` outputs.

The outputs of a previous module can be used by a subsequent module so that they can be chained together to create the entire workflow, as demonstrated by the following shell script.
#!/bin/sh
# The PDF to process is the first argument.
PDF=$1
# Extract page images from the PDF.
python -m table_ocr.pdf_to_images "$PDF" | grep .png > /tmp/pdf-images.txt
# Find tables in each page image.
cat /tmp/pdf-images.txt | xargs -I{} python -m table_ocr.extract_tables {} | grep table > /tmp/extracted-tables.txt
# Split each table into cell images.
cat /tmp/extracted-tables.txt | xargs -I{} python -m table_ocr.extract_cells {} | grep cells > /tmp/extracted-cells.txt
# OCR each cell image.
cat /tmp/extracted-cells.txt | xargs -I{} python -m table_ocr.ocr_image {}
# Combine the OCR text for each table into a CSV.
for image in $(cat /tmp/extracted-tables.txt); do
    dir=$(dirname "$image")
    python -m table_ocr.ocr_to_csv $(find "$dir/cells" -name "*.txt")
done
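The same chain can be driven from Python by invoking each module as a subprocess, exactly as the script does with `python -m`. This is a sketch under a few assumptions: each module prints one path per line (as the `grep` filters in the script suggest), `ocr_to_csv` writes the CSV to standard output, and the helper names `run_module` and `pdf_to_csv` are hypothetical. Only the command-line entry points shown in the script are used.

```python
import subprocess
import sys
from pathlib import Path


def run_module(module, args):
    """Run `python -m table_ocr.<module> <args...>` and return stdout lines."""
    result = subprocess.run(
        [sys.executable, "-m", f"table_ocr.{module}", *args],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


def pdf_to_csv(pdf_path, run=run_module):
    """Chain the five modules in the same order as the shell script."""
    # Substring filters stand in for the script's grep calls.
    images = [ln for ln in run("pdf_to_images", [pdf_path]) if ".png" in ln]
    tables = [ln for img in images
              for ln in run("extract_tables", [img]) if "table" in ln]
    cells = [ln for tbl in tables
             for ln in run("extract_cells", [tbl]) if "cells" in ln]
    for cell in cells:
        run("ocr_image", [cell])

    csv_texts = []
    for tbl in tables:
        # Mirrors `find $dir/cells -name "*.txt"`; sorting is an added choice.
        cells_dir = Path(tbl).parent / "cells"
        txt_files = sorted(str(p) for p in cells_dir.rglob("*.txt"))
        if txt_files:  # unlike the script, skip tables with no OCR output
            csv_texts.append("\n".join(run("ocr_to_csv", txt_files)))
    return csv_texts


if __name__ == "__main__" and len(sys.argv) > 1:
    for text in pdf_to_csv(sys.argv[1]):
        print(text)
```

Passing the runner as a parameter keeps the chaining logic separate from process execution, so the flow can be exercised without Poppler, ImageMagick, or Tesseract installed.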
The package was written in a literate programming style. The source code at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html is meant to act as the documentation and reference material.