Awesome Open Source
Search results for text extraction
93 search results found
Sumy
⭐
3,343
Module for automatic summarization of text documents and HTML pages.
Trafilatura
⭐
2,447
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Unipdf
⭐
2,231
Golang PDF library for creating and processing PDF files (pure go)
Tika Python
⭐
1,316
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Image Text Localization Recognition
⭐
928
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.
Unidoc
⭐
691
This repository has moved! https://github.com/unidoc/unipdf
Regexgenerator
⭐
656
This project contains the source code of a tool for generating regular expressions for text extraction: 1. automatically, 2. based only on examples of the desired behavior, 3. without any external hint about what the target regex should look like
Datashare
⭐
519
A self-hosted search engine for documents.
Justext
⭐
509
Heuristic based boilerplate removal tool
Pdftools
⭐
480
Text Extraction, Rendering and Converting of PDF Documents
Srt
⭐
389
A simple library and set of tools for parsing, modifying, and composing SRT files.
Nlp
⭐
387
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Crestify
⭐
232
Intelligent Bookmarking
Textproposals
⭐
191
Implementation of the method proposed in the papers "TextProposals: a Text-specific Selective Search Algorithm for Word Spotting in the Wild" and "Object Proposals for Text Extraction in the Wild" (Gomez & Karatzas), 2016 and 2015 respectively.
Breadability
⭐
191
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)
Cutie
⭐
144
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Lambda Text Extractor
⭐
143
AWS Lambda functions to extract text from various binary formats.
Pd3f
⭐
131
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Aut
⭐
128
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Pdfio.jl
⭐
117
PDF Reader Library for Native Julia.
Php Apache Tika
⭐
104
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Benchmarks
⭐
93
Benchmarking PDF libraries
Pdfparser
⭐
88
Python binding to libpoppler with focus on text extraction
Cat
⭐
83
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
Text_extraction
⭐
80
This code implements the method proposed in the paper “Multi-script text extraction from natural scenes” (Gomez & Karatzas), to appear at the ICDAR 2013 conference.
Eaten
⭐
71
EATEN: Entity-aware Attention for Single Shot Visual Text Extraction
Wikipedia_ner
⭐
56
📖 Labeled examples from wiki dumps in Python
Doc_processing_toolkit
⭐
52
Python library to extract text from PDF, and default to OCR when text extraction fails.
Mueller Report
⭐
46
The ██redacted Mueller Report
Datasheet Scrubber
⭐
45
Ocr Open Dataset
⭐
44
A list of open datasets for OCR.
Extend
⭐
43
Entity Disambiguation as text extraction (ACL 2022)
Text Extraction Evaluation
⭐
42
Framework for evaluating text extraction algorithms implemented as web services
Mobi
⭐
37
Python-based software to unpack KindleGen-generated ebooks
Untagger
⭐
35
Removal and full text extraction of HTML in Swift inspired by Boilerpipe
Spark Ai Summit 2020 Text Extraction
⭐
33
Pdf Text Data Extractor
⭐
32
PDF text data extraction web app with OCR for scanned documents
Alchemy_api
⭐
31
Provides a client API library for AlchemyAPI's awesome NLP services. Allows you to make parallel or serial requests.
Pdf Text Extraction Benchmark
⭐
31
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
Docwire
⭐
31
DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality
Wagtail_textract
⭐
31
Text extraction for Wagtail document search
Any Text
⭐
31
Get text content from any file
Querido Diario Toolbox
⭐
30
This project empowers anyone who wants to process data in the context of Querido Diário and run their own analyses.
Boilerpy3
⭐
29
Python port of Boilerpipe library
Office Text Extractor
⭐
28
Yet another library to extract text from MS Office and PDF files
Mimeograph
⭐
28
CoffeeScript lib for PDF OCR and text extraction
Herbgobbler
⭐
28
Tool for Text Extraction from erb/rhtml files for internationalization (i18n) purposes
Bte
⭐
27
BTE: Body Text Extraction
The Natural Language Processing Workshop
⭐
27
Pnlp
⭐
25
NLP pre-/post-processing tools.
Text_extraction
⭐
25
Extracts the main conclusions (key ideas) from research reports in finance-related fields
Screaming Frog Shingling
⭐
21
Uses Screaming Frog Internal HTML with text extraction along with a shingling algorithm to compare content duplication across the pages of a crawled site.
Nlp_competitions
⭐
19
A list of NLP competitions, including solutions.
Img2txt
⭐
19
Easy formatted text extraction from images using Google Vision API
Mirusan
⭐
17
A PDF collection reader with built-in full-text search engine
Pd3f Core
⭐
16
📑 Python Package to reconstruct the original continuous text from PDFs with language models
Textextractor2.0
⭐
14
🔥 This web app extracts text in an image.
Jatsdecoder
⭐
14
A text extraction and manipulation toolset for NISO-JATS coded XML files
Scummtr
⭐
13
Fan translation tools for SCUMM engine games
Tokyo
⭐
13
tokyo is a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.
Aiopytesseract
⭐
13
A Python asyncio wrapper for Tesseract-OCR.
Text Extraction From Video Frames
⭐
13
An optical character recognition (OCR) system built using OpenCV and TensorFlow.
Tesseractocr
⭐
12
Full text extraction using the Open Source Tesseract OCR software https://code.google.com/p/tesseract-ocr/ and imagemagick
Ocrd_calamari
⭐
12
Recognize text using Calamari OCR and the OCR-D framework
Apache Tika Lambda Layer
⭐
12
AWS Lambda layer containing latest version of Apache Tika
Medinify
⭐
11
Python text classification package with a focus on medical text.
Pine
⭐
11
A simple image to text OCR scanner for macOS
Movie Classfiction Pased On It S Arabic Subtitle
⭐
10
Classifies English movies using their Arabic subtitles
Tjbot Node Red
⭐
10
TJBot Node-RED Examples that use Watson Cognitive APIs
Pdf_text_extract
⭐
10
AWS Lambda function written in Python to perform text extraction (using Slate) from a PDF put to S3 & indexed in ElasticSearch.
Tesseract Ocr Wrapper
⭐
9
This is a highly efficient Python wrapper for tesseract-ocr.
Arxiv Fulltext
⭐
9
arXiv plain text extraction
Articleparse
⭐
8
Heuristic text extraction from news sites in Python3
Hotpdf
⭐
8
hotpdf is a fast PDF scraping library to extract text and find text within PDF documents
Pcl Parser
⭐
8
PCL5 parser and renderer, with some support for PLJ and HPGL
Video_text_detection
⭐
8
Bachelor Thesis | Text extraction from complex video scenes
Idcarddataextractorwithondeviceml
⭐
8
Text extraction from Aadhaar Card and PAN Card for Indian citizens using on-device (Android) machine learning.
Opendiscoverplatformcasestudy
⭐
8
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.
Opendiscoversdk
⭐
8
.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.
Pdfboxlight
⭐
8
Port of Apache PDFBox for Android
Textractor Translator
⭐
8
Translate visual novels and other games in real time
Jatstemplate
⭐
8
Basic JATS document template generator plugin for OJS
Ifilterextractor
⭐
6
A simple component to extract just the text from any file that has an IFilter installed. Available as a C++ COM component and as a C# .NET library.
Apyhub.js
⭐
6
ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.
Whatsapp Chat Text Mining
⭐
6
R code that demonstrates text extraction from a WhatsApp chat and represents the result as a word cloud.
Voice Prescription
⭐
6
A GUI application built with Tkinter that helps doctors prepare prescriptions more efficiently. It uses speech-to-text conversion and text extraction to prepare prescriptions in the correct format. This project was submitted for Smart India Hackathon 2020.
Ext Tika
⭐
6
A TYPO3 CMS extension that provides Apache Tika functionality
Gridfs Uploader
⭐
5
[deprecated] gridfs-uploader
Pdfi
⭐
5
PDF parsing, drawing, and text extraction
Tecroom
⭐
5
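Online summary documentation of a tech stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.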
Spa
⭐
5
The JavaScript front-end for rendering text-extraction on PDF documents
Tika Page Extractor
⭐
5
Tika per page PDF extractor server returning content as JSON.
Cosmic Cube
⭐
5
PDF image analysis and selective text extraction using tesseract
Copyright 2018-2024 Awesome Open Source. All rights reserved.