Project Name	Stars	Repos Using This	Packages Using This	Most Recent Commit	Total Releases	Latest Release	Open Issues	License	Language
Lambda Text Extractor	143			6 years ago				apache-2.0	Python
AWS Lambda functions to extract text from various binary formats.
Pd3f	131			a year ago			13	agpl-3.0	HTML
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Php Apache Tika	104	3	3	8 months ago	38	April 14, 2023		mit	PHP
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Doc_processing_toolkit	52			7 years ago			4	other	Python
Python library to extract text from PDF, and default to OCR when text extraction fails.
Wagtail_textract	31			6 months ago	8	September 06, 2019	14	bsd-3-clause	Python
Text extraction for Wagtail document search
Mimeograph	28	2	3	11 years ago	11	March 08, 2017	6		CoffeeScript
CoffeeScript lib for PDF OCR and text extraction
Aiopytesseract	13			4 months ago	13	November 21, 2023		apache-2.0	Python
A Python asyncio wrapper for Tesseract-OCR.
Tesseractocr	12			9 years ago				mit	Shell
Full text extraction using the Open Source Tesseract OCR software https://code.google.com/p/tesseract-ocr/ and imagemagick
Cosmic Cube	5			10 years ago					Python
PDF image analysis and selective text extraction using tesseract

Alternatives To Cosmic Cube


Lambda Text Extractor ⭐ 143

AWS Lambda functions to extract text from various binary formats.

most recent commit 6 years ago

Pd3f ⭐ 131

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

most recent commit a year ago

Php Apache Tika ⭐ 104

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

dependent packages 3, total releases 38, most recent commit 8 months ago

Packagist package: vaites/php-apache-tika

Doc_processing_toolkit ⭐ 52

Python library to extract text from PDF, and default to OCR when text extraction fails.

most recent commit 7 years ago

Wagtail_textract ⭐ 31

Text extraction for Wagtail document search

total releases 8, most recent commit 6 months ago

Mimeograph ⭐ 28

CoffeeScript lib for PDF OCR and text extraction

dependent packages 3, total releases 11, most recent commit 11 years ago

Aiopytesseract ⭐ 13

A Python asyncio wrapper for Tesseract-OCR.

total releases 13, most recent commit 4 months ago

Tesseractocr ⭐ 12

Full text extraction using the Open Source Tesseract OCR software https://code.google.com/p/tesseract-ocr/ and imagemagick

most recent commit 9 years ago

Cosmic Cube ⭐ 5

PDF image analysis and selective text extraction using tesseract

most recent commit 10 years ago


Alternative Project Comparisons

Cosmic Cube vs Lambda Text Extractor

Cosmic Cube vs Pd3f

Cosmic Cube vs Php Apache Tika

Cosmic Cube vs Doc_processing_toolkit

Cosmic Cube vs Wagtail_textract

Cosmic Cube vs Mimeograph

Cosmic Cube vs Aiopytesseract

Cosmic Cube vs Tesseractocr

Popular Tesseract Projects

Tesseract ⭐ 56,096

Tesseract Open Source OCR Engine (main repository)

dependent packages 7, total releases 1, latest release February 27, 2018, most recent commit 3 months ago

Tesseract.js ⭐ 32,523

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

dependent packages 224, total releases 66, latest release October 30, 2023, most recent commit 3 months ago

Ocrmypdf ⭐ 11,136

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

dependent packages 11, total releases 227, latest release November 29, 2023, most recent commit 3 months ago

Faceai ⭐ 6,666

An entry-level project for face, video, and text detection and recognition.

most recent commit 4 years ago

Ripgrep All ⭐ 6,000

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

total releases 18, latest release May 19, 2020, most recent commit 3 months ago

Popular Text Extraction Projects

Sumy ⭐ 3,343

Module for automatic summarization of text documents and HTML pages.

dependent packages 14, total releases 16, latest release October 23, 2022, most recent commit 3 months ago

Trafilatura ⭐ 2,447

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

dependent packages 66, total releases 39, latest release November 29, 2023, most recent commit 3 months ago

Unipdf ⭐ 2,231

Golang PDF library for creating and processing PDF files (pure go)

dependent packages 45, total releases 72, latest release November 11, 2023, most recent commit 3 months ago

Tika Python ⭐ 1,316

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

dependent packages 54, total releases 35, latest release January 02, 2023, most recent commit 8 months ago

Image Text Localization Recognition ⭐ 928

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

most recent commit 7 months ago


Downloads, Dependent Repos, Dependent Packages, Total Releases, Latest Releases data powered by Libraries.io.