Project Name	Stars	Repos Using This	Packages Using This	Most Recent Commit	Total Releases	Latest Release	Open Issues	License	Language
Node Tika	128	15	5	4 years ago	23	February 22, 2017	10	mit	Java
Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
Php Apache Tika	104	3	3	8 months ago	38	April 14, 2023		mit	PHP
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Imagecat	84			6 years ago					Java
ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files (images, but could be extended to other files) in place, and to extract metadata and OCR information from those files/images using Tika and Tesseract OCR.
Harvester	59			7 years ago			3	gpl-3.0	JavaScript
Web crawling and document processing through a usable interface.
Rtika	52	1		a year ago	8	April 25, 2020	3	apache-2.0	R
R Interface to Apache Tika
Doc_processing_toolkit	52			7 years ago			4	other	Python
Python library to extract text from PDF, and default to OCR when text extraction fails.
Cogstack Pipeline	39			a year ago				other	Java
Distributed, fault tolerant batch processing for Natural Language Applications and Search, using remote partitioning
Pdf Discovery Demo	24			a year ago			2	apache-2.0	JavaScript
Demonstration of searching PDF documents with Solr, Tika, and Tesseract
Tika Server	18			3 years ago			2	apache-2.0	Java
Apache Tika Server with Tesseract 4 Docker Setup
Tika Service	12			a year ago				apache-2.0	Java
Apache Tika running as a web service

Alternatives To Pdf Discovery Demo

Node Tika ⭐ 128

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.

dependent packages 5 · total releases 23 · most recent commit 4 years ago

Php Apache Tika ⭐ 104

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

dependent packages 3 · total releases 38 · most recent commit 8 months ago

Imagecat ⭐ 84

ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files (images, but could be extended to other files) in place, and to extract metadata and OCR information from those files/images using Tika and Tesseract OCR.

most recent commit 6 years ago

Harvester ⭐ 59

Web crawling and document processing through a usable interface.

most recent commit 7 years ago

Rtika ⭐ 52

R Interface to Apache Tika

total releases 8 · most recent commit a year ago

Doc_processing_toolkit ⭐ 52

Python library to extract text from PDF, and default to OCR when text extraction fails.

most recent commit 7 years ago

Cogstack Pipeline ⭐ 39

Distributed, fault tolerant batch processing for Natural Language Applications and Search, using remote partitioning

most recent commit a year ago

Pdf Discovery Demo ⭐ 24

Demonstration of searching PDF documents with Solr, Tika, and Tesseract

most recent commit a year ago

Tika Server ⭐ 18

Apache Tika Server with Tesseract 4 Docker Setup

most recent commit 3 years ago

Tika Service ⭐ 12

Apache Tika running as a web service

most recent commit a year ago

Popular Tesseract Projects

Tesseract ⭐ 56,096

Tesseract Open Source OCR Engine (main repository)

dependent packages 7 · total releases 1 · latest release February 27, 2018 · most recent commit 3 months ago

Tesseract.js ⭐ 32,523

Pure Javascript OCR for more than 100 Languages 📖🎉🖥

dependent packages 224 · total releases 66 · latest release October 30, 2023 · most recent commit 3 months ago

Ocrmypdf ⭐ 11,136

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

dependent packages 11 · total releases 227 · latest release November 29, 2023 · most recent commit 3 months ago

Faceai ⭐ 6,666

An entry-level project for face, video, and text detection and recognition.

most recent commit 4 years ago

Ripgrep All ⭐ 6,000

rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

total releases 18 · latest release May 19, 2020 · most recent commit 3 months ago

Popular Tika Projects

S3_website ⭐ 2,259

Manage an S3 website: sync, deliver via CloudFront, benefit from advanced S3 website features.

total releases 109 · latest release October 11, 2017 · most recent commit a year ago

Tika ⭐ 2,007

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

dependent packages 570 · total releases 66 · latest release October 17, 2023 · most recent commit 3 months ago

Tika Python ⭐ 1,316

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

dependent packages 54 · total releases 35 · latest release January 02, 2023 · most recent commit 8 months ago

Fscrawler ⭐ 1,279

Elasticsearch File System Crawler (FS Crawler)

dependent packages 1 · total releases 5 · latest release January 10, 2022 · most recent commit 3 months ago

Lingua ⭐ 622

The most accurate natural language detection library for Java and the JVM, suitable for long and short text alike

dependent packages 3 · total releases 17 · latest release August 02, 2022 · most recent commit 5 months ago
