Parsr

Transforms PDF, Documents and Images into Enriched Structured Data


Turn your documents into data!

Français | Portuguese | Spanish | 中文

  • Parsr is a minimal-footprint toolchain for cleaning, parsing and extracting data from documents (image, pdf, docx, eml). It generates organized, ready-to-use data in JSON, Markdown (MD), CSV/Pandas DF or TXT formats.

  • It provides analysts, data scientists and developers with clean, structured and label-enriched information for ready-to-use applications, ranging from data entry and document analysis automation to archival, among many others.

  • Currently, Parsr can perform document cleaning, hierarchy regeneration (words, lines, paragraphs), and detection of headings, tables, lists, tables of contents, page numbers, headers/footers, links, and other elements. Check out all the features.
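
To give a feel for working with the regenerated hierarchy (words, lines, paragraphs), here is a minimal Python sketch that pulls text out of a JSON export. The nested `pages` / `elements` / `content` layout and the `type` values in `sample` are a simplified assumption made for illustration, and `element_text` and `texts_by_type` are hypothetical helper names; inspect a real export for the actual field names.

```python
from typing import Any, Dict, List


def element_text(element: Dict[str, Any]) -> str:
    """Recursively join the words found under an element into one string."""
    content = element.get("content", "")
    if isinstance(content, str):
        # Leaf node (assumed to be a word): its content is the text itself.
        return content
    # Container node (assumed line, paragraph, heading): recurse into children.
    parts = (element_text(child) for child in content)
    return " ".join(part for part in parts if part)


def texts_by_type(document: Dict[str, Any], wanted_type: str) -> List[str]:
    """Collect the text of every top-level page element of the given type."""
    found = []
    for page in document.get("pages", []):
        for element in page.get("elements", []):
            if element.get("type") == wanted_type:
                found.append(element_text(element))
    return found


# Hand-written stand-in for an export; NOT real Parsr output.
sample = {
    "pages": [
        {
            "elements": [
                {"type": "heading", "content": [
                    {"type": "line", "content": [
                        {"type": "word", "content": "Annual"},
                        {"type": "word", "content": "Report"},
                    ]},
                ]},
                {"type": "paragraph", "content": [
                    {"type": "line", "content": [
                        {"type": "word", "content": "Revenue"},
                        {"type": "word", "content": "grew."},
                    ]},
                ]},
            ]
        }
    ]
}

print(texts_by_type(sample, "heading"))
print(texts_by_type(sample, "paragraph"))
```

The recursion means the same helper works at any level of the hierarchy, whether it is handed a word, a line or a whole paragraph.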


Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation for details on using the API.
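
As a rough illustration of scripting against the running container, the sketch below uses only the Python standard library to build request URLs for a server at http://localhost:3001 and to check whether anything is answering there. The `/api/v1/document` path and the helper names `endpoint` and `is_reachable` are assumptions made for illustration and are not confirmed by this README; check the API documentation for the actual routes.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Address taken from the `docker run -p 3001:3001` command above.
BASE_URL = "http://localhost:3001"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the server address and an API path into a full URL."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at `url`, even with an error status."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed response.
        return False


# Hypothetical route, shown only to demonstrate the URL helper.
print(endpoint("/api/v1/document"))
```

Calling `is_reachable(BASE_URL)` before submitting documents gives a quick way to confirm that the container has finished starting.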

  1. To install the Python client for the Parsr API, issue:

    pip install parsr-client
    

    To try out the Python client in a Jupyter Notebook, head over to the Jupyter demo.

  2. To use the GUI tool (the API needs to already be running), issue:
    docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
    
    Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Parsr depends on the following third-party libraries, listed with their licenses:

  1. QPDF: Apache http://qpdf.sourceforge.net
  2. ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
  3. Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
  4. PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
  5. Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
  6. Camelot: MIT https://github.com/camelot-dev/camelot
  7. MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
  8. Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2020 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).
