Awesome Open Source
The Top 16 Text Extraction Open Source Projects
Sumy ⭐ 2,472
Module for automatic summarization of text documents and HTML pages.

Unipdf ⭐ 1,093
Golang PDF library for creating and processing PDF files (pure Go).

Tika Python ⭐ 970
Tika-Python is a Python binding to the Apache Tika™ REST services, allowing Tika to be called natively in the Python community.

Image Text Localization Recognition ⭐ 775
A general list of resources for image text localization and recognition; a collection of papers and implementations for scene text localization and recognition.

Unidoc ⭐ 696
This repository has moved! https://github.com/unidoc/unipdf

Justext ⭐ 410
Heuristic-based boilerplate removal tool.

Nlp ⭐ 368
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

Pdftools ⭐ 331
Text Extraction, Rendering and Converting of PDF Documents

Datashare ⭐ 241
Better analyze information, in all its forms

Srt ⭐ 200
A simple library for parsing, modifying, and composing SRT files.

Breadability ⭐ 184
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Lambda Text Extractor ⭐ 153
AWS Lambda functions to extract text from various binary formats.

Php Apache Tika ⭐ 73
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

Wikipedia_ner ⭐ 60
📖 Labeled examples from wiki dumps in Python

Pdfio.jl ⭐ 54
PDF Reader Library for Native Julia.

Articleparse ⭐ 6
Heuristic text extraction from news sites in Python3
1-16 of 16 projects