Awesome Open Source

Search results for html extraction

81 search results found

Swiftsoup ⭐ 4,203

SwiftSoup: Pure Swift HTML parser, with the best of DOM, CSS, and jQuery (supports Linux, iOS, Mac, tvOS, watchOS)

Python Goose ⭐ 3,741

HTML Content / Article Extractor, web scraping library in Python

Textract ⭐ 3,699

extract text from any document. no muss. no fuss.

Webplotdigitizer ⭐ 2,375

Online tool to extract numerical data from plot images.

Html To React Components ⭐ 2,101

Converts HTML pages into React components

Scrapely ⭐ 1,668

A pure-python HTML screen-scraping library

Textract ⭐ 1,487

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

Excalibur ⭐ 1,319

A web interface to extract tabular data from PDFs

Parsel ⭐ 1,010

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Mlscraper ⭐ 935

🤖 Scrape data from HTML websites automatically by just providing examples

Snappysnippet ⭐ 802

Chrome extension that allows easy extraction of CSS and HTML from selected element.

Extracts machine-readable metadata and content from Web pages

Npm Pdfreader ⭐ 522

🚜 Parse text and tables from PDF files.

Python Boilerpipe ⭐ 498

Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages

An exercise in unsupervised machine learning: extracting article text from HTML documents.

Cx Extractor Python ⭐ 368

Python implementation of a general-purpose web page body-text extraction algorithm based on line-block distribution functions, with added English support. Supports both Chinese and English.

Extract Loader ⭐ 303

webpack loader to extract HTML and CSS from the bundle

Express Ejs Layouts ⭐ 274

Layout support for ejs in express.

PYthon Automated Term Extraction

Readability ⭐ 207

Readability is an Elixir library for extracting and curating articles.

Openscraping Lib Csharp ⭐ 205

Turn unstructured HTML pages into structured data. The OpenScraping library can extract information from HTML pages using a JSON config file with XPath rules. It can scrape even multi-level complex objects such as tables and forum posts. This is the C# version.

Pluck text in a fast and intuitive way 🐓

Autolink Java ⭐ 188

Java library to extract links (URLs, email addresses) from plain text; fast, small and smart

Extract URLs to stylesheets, scripts, links, images or HTML imports from HTML

Extract data or evaluate value from HTML/XML documents using XPath

Grunt Critical ⭐ 153

Grunt task to extract & inline critical-path CSS from HTML

Nibbler ⭐ 142

A cute HTML scraper / data extraction tool in under 70 lines of code

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

Cascadia ⭐ 128

Command-line CSS selector tool built on the Go cascadia package

Microdataphp ⭐ 119

Extract microdata from HTML using PHP. Based on foolip's MicrodataJS implementation of the Microdata DOM API.

Readabilitybundle ⭐ 117

A bundle of html content extraction algorithms

Html2rss ⭐ 106

📰 Build RSS 2.0 feeds from websites (and JSON APIs) with a few CSS selectors.

Data Mining On Social Media ⭐ 105

Python scripts to extract tweets and facebook posts from public users.

Htmldate ⭐ 101

Fast and robust date extraction from web pages, with Python or on the command-line

Ingredients ⭐ 98

Extract recipe ingredients from any recipe website on the internet.

Hyponymyextraction ⭐ 90

HyponymyExtraction and Graph: hypernym-hyponym extraction and visualization for words, based on a knowledge-base concept schema, the Baike encyclopedia knowledge base, and structured extraction from online search text

Extractotron ⭐ 87

Placeholder for some ideas about OpenStreetMap extracts

Chorrrds ⭐ 87

R package to extract music chords

Automatic Item List Extraction

A Python library to detect and extract listing data from HTML pages.

Webarticle2text ⭐ 79

[DEPRECATED] A script to extract the main article text from an arbitrary webpage.

A readability parser which can extract title, content, images from html pages

Render and parse dynamic web pages from R

FEVER (Fact Extraction and VERification) Annotation Platform and Baselines

Intelligent Web Data Extractor

Whatwordwhere ⭐ 74

Tooling to extract data from scanned paper forms OCR-ed by Tesseract using the HOCR standard.

Extract Mongo Schema ⭐ 70

Extract schema from Mongo database, including foreign keys

Extraction Toolkit

Pdf Extract ⭐ 68

PDF parser and converter to HTML

Easygettext ⭐ 67

Simple gettext tokens extraction tools for HTML and Jade files.

Crawlista ⭐ 65

Crawlista is a support library for Clojure applications that crawl the Web

Osmdata.xyz ⭐ 65

This project provides global data extracts based on OpenStreetMap data as GeoPackages.

Newspaperjs ⭐ 63

News extraction and scraping. Article Parsing

Mifit Data Export ⭐ 57

Set of Unix tools to grab data from Mi Fit Android app, most of this is courtesy of xmxm

Selectorlib ⭐ 55

A library to read a YAML file with XPath or CSS selectors and extract data from HTML pages using them

In-the-wild extraction of entities, found using Flair and displayed in an elegant front end.

Html Text ⭐ 52

Extract text from HTML

Html Table Extractor ⭐ 51

Extract data from HTML tables

Node Boilerpipe ⭐ 50

A node.js wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Chef Metroextractor ⭐ 50

Creates metro extracts/shapefiles from OSM planet data

Drugbank ⭐ 44

User-friendly extensions of the DrugBank database

Extract To React ⭐ 43

Chrome/Chromium extension for easy HTML to React conversion.

Colorgram Js ⭐ 43

Color extraction library

Yellowpages Scraper ⭐ 43

Yellowpages.com Web Scraper written in Python and LXML to extract business details available based on a particular category and location.

Extract rich metadata from URLs

Article Title ⭐ 42

Extract the article title of an HTML document

An implementation of the Goose HTML Content / Article Extractor algorithm in golang

Articletext ⭐ 35

Golang package to extract useful text from an HTML document

Tl Create ⭐ 32

tl-create is a cross-platform command line tool to create an X.509 trust list from various trust stores. (Keywords: CABFORUM, eIDAS, WebPKI)

Named Entity Recognition & Relation Extraction (named entity recognition and relation classification)

Tools for automatic extraction of activation coordinates from published neuroimaging articles.

Nlp Flask Website ⭐ 30

A simple Flask website for NLP tasks, including text preprocessing, keyword extraction, and text summarization. Created: 30 Jan 2019

Html Frontmatter ⭐ 28

Extract key-value metadata from HTML comments

Extract Html Diff ⭐ 27

Extract the differences between two HTML pages

Html2csv ⭐ 25

A utility that extracts tables from HTML documents and converts them to CSV format

Alchemyapi_java ⭐ 24

Please note that this legacy AlchemyAPI SDK is no longer supported by IBM. Please use the Watson SDKs https://github.com/watson-developer-cloud?utf8=✓&q

Metro Extracts ⭐ 24

DEPRECATED. See readme for alternative ways to get "city-sized chunks" of OpenStreetMap data

Access version 11.1 of the Varieties of Democracy (V-Dem) dataset

Sunflower ⭐ 23

Easily extract content from a bunch of similarly-formatted HTML files.

Django Xadminlte ⭐ 23

AdminLTE theme and plugins for django-xadmin

Css Chunks Html Webpack Plugin ⭐ 23

Injecting css chunks extracted using extract-css-chunks-webpack-plugin to HTML for html-webpack-plugin

Framework7 Template Webpack ⭐ 22

Deprecated! Framework7 Vue Webpack starter app template with hot-reload & css extraction

Grablinks ⭐ 21

A simple and streamlined Python script to extract and filter links from a remote HTML resource.

Inlinecssparser ⭐ 21

A Visual Studio extension that helps to extract inline styles into a separate CSS file.

Url Metadata Extractor ⭐ 21

API that extracts metadata from a URL.

Screaming Frog Shingling ⭐ 21

Uses Screaming Frog Internal HTML with text extraction along with a shingling algorithm to compare content duplication across the pages of a crawled site.

Openvenues ⭐ 21

Wsi Analysis ⭐ 21

Python scripts for automatic Whole-Slide Image preprocessing.

Official implementation of the paper "Towards Zero-Shot Relation Extraction with Attribute Representation Learning."

Alchemyapi_csharp ⭐ 20

Please note that this legacy AlchemyAPI SDK is no longer supported by IBM. Please use the Watson SDKs https://github.com/watson-developer-cloud?utf8=✓&q

Php Article Extractor ⭐ 20

A PHP library to extract article text from web pages

Citeseerextractor ⭐ 19

React Stylematic ⭐ 18

A stylematic wrapper for React

Farm2table ⭐ 18

Seamless HTML table extraction for Python

[abandoned] statistical HTML content extraction in python

Html Parser ⭐ 18

The HTML-Parser distribution is a collection of modules that parse and extract information from HTML documents

Open Data Inception ⭐ 18

Linkedin Extractor ⭐ 18

Given a LinkedIn profile URL, returns structured metadata.

JSoup DSL for Kotlin

Puppypaste ⭐ 17

Extract HTML clipboard contents without losing the structure, as you'd get from pasting into TextEdit or Notepad.

Copyright 2018-2024 Awesome Open Source. All rights reserved.