Awesome Open Source
The Top 106 Scraping Open Source Projects
Scrapy
⭐
39,473
Scrapy, a fast high-level web crawling & scraping framework for Python.
Colly
⭐
12,872
Elegant Scraper and Crawler Framework for Golang
Requests Html
⭐
11,418
Pythonic HTML Parsing for Humans™
Webmagic
⭐
9,557
A scalable web crawler framework for Java.
Tabula
⭐
4,959
Tabula is a tool for liberating data tables trapped inside PDF files
Headless Chrome Crawler
⭐
4,876
Distributed crawler powered by Headless Chrome
Ferret
⭐
4,369
Declarative web scraping
Autoscraper
⭐
3,216
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Apify Js
⭐
2,688
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Thal
⭐
2,292
Getting started with Puppeteer and Chrome Headless for Web Scraping
Googlescraper
⭐
2,233
A Python module to scrape several search engines (like Google, Yandex, Bing, DuckDuckGo, ...), including asynchronous networking support.
Panther
⭐
2,191
A browser testing and web crawling library for PHP and Symfony
Embed
⭐
1,688
Get info from any web service or page
Awesome Puppeteer
⭐
1,492
A curated list of awesome puppeteer resources.
Geziyor
⭐
1,220
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Artoo
⭐
1,026
artoo.js - the client-side scraping companion.
Django Dynamic Scraper
⭐
1,014
Creating Scrapy scrapers via the Django admin interface
Scrapy Cluster
⭐
911
This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
Instagram Scraper
⭐
906
Scrape the Instagram frontend. Inspired by twitter-scraper by @kennethreitz.
Lulu
⭐
786
[Unmaintained] A simple and clean video/music/image downloader 👾
Imagescraper
⭐
619
✂️ High performance, multi-threaded image scraper
Parsel
⭐
604
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
Newcrawler
⭐
584
Free Web Scraping Tool with Java
Easy Scraping Tutorial
⭐
565
Simple but useful Python web scraping tutorial code.
Facebook_data_analyzer
⭐
511
Analyze your Facebook data export with Ruby. Download the zip file from Facebook and get info about friend rankings by message count, vocabulary, contacts, friends-added statistics, and more.
Gazpacho
⭐
507
🥫 The simple, fast, and modern web scraping library
Nickjs
⭐
493
Web scraping library made by the Phantombuster team. Modern, simple & works on all websites. (Deprecated)
Geeksforgeeks.pdf
⭐
482
Topic wise PDFs of Geeks for Geeks articles. (Last updated in October 2018)
Oj
⭐
475
Tools for various online judges. Downloading sample cases, generating additional test cases, testing your code, and submitting it.
Scrapple
⭐
461
A framework for creating semi-automatic web content extractors
Dataflowkit
⭐
446
Extract structured data from websites. Website scraping.
Facebook Scraper
⭐
432
Scrape Facebook public pages without an API key
Crawly
⭐
380
Crawly, a high-level web crawling & scraping framework for Elixir.
Jekyll
⭐
373
Jekyll-based static site for The Programming Historian
Coronadatascraper
⭐
371
COVID-19 Coronavirus data scraped from government and curated data sources.
Post Tuto Deployment
⭐
365
Build and deploy a machine learning app from scratch 🚀
Comic Dl
⭐
354
Comic-dl is a command line tool to download manga and comics from various comic and manga sites. Supported sites : readcomiconline.to, mangafox.me, comic naver and many more.
Katana
⭐
329
A Python tool for Google hacking
Socialreaper
⭐
321
Social media scraping / data collection library for Facebook, Twitter, Reddit, YouTube, Pinterest, and Tumblr APIs
Elixir Scrape
⭐
308
Scrape any website, article or RSS/Atom Feed with ease!
Social Media Profiles Regexs
⭐
302
📇 Extract social media profiles and more with regular expressions
Spidermon
⭐
299
Scrapy extension for monitoring spider execution.
Linkedin
⭐
291
Linkedin Scraper using Selenium Web Driver, Chromium headless, Docker and Scrapy
Sasila
⭐
280
A flexible, friendly crawler framework
Scrapy Crawlera
⭐
277
Crawlera middleware for Scrapy
Gopa
⭐
276
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Lambdasoup
⭐
275
Functional HTML scraping and rewriting with CSS in OCaml
Clean Text
⭐
261
🧹 Python package for text cleaning
Undetected Chromedriver
⭐
252
Custom Selenium Chromedriver up to v88 | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome, Botprotect)
Musoq
⭐
251
Use SQL on various data sources
Jsoup Annotations
⭐
243
Jsoup Annotations POJO
Memorious
⭐
239
Distributed crawling framework for documents and structured data.
List Of User Agents
⭐
232
List of major web + mobile browser user agent strings. +1 Bonus script to scrape :)
Edu Mail Generator
⭐
232
Generate Free Edu Mail(s) within minutes
Reaper
⭐
229
Social media scraping / data collection tool for the Facebook, Twitter, Reddit, YouTube, Pinterest, and Tumblr APIs
Arachnid
⭐
223
Crawl all unique internal links found on a given website, and extract SEO related information - supports javascript based sites
Scrape Linkedin Selenium
⭐
220
`scrape_linkedin` is a Python package that allows you to scrape personal LinkedIn profiles and company pages, turning the data into structured JSON.
Scrapysharp
⭐
214
A rebirth of https://bitbucket.org/rflechner/scrapysharp
Goose Parser
⭐
211
Universal scraping tool, which allows you to extract data using multiple environments
Transistor
⭐
205
Transistor, a Python web scraping framework for intelligent use cases.
Idt
⭐
199
Image Dataset Tool (idt) is a cli tool designed to make the otherwise repetitive and slow task of creating image datasets into a fast and intuitive process.
Jsonframe Cheerio
⭐
195
Simple multi-level scraper with JSON input/output for Cheerio
Antch
⭐
193
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Juriscraper
⭐
188
An API to scrape American court websites for metadata.
Loconotion
⭐
184
Turn Notion pages into lightweight, customizable static websites
Anime Dl
⭐
183
Anime-dl is a command-line program to download anime from CrunchyRoll and Funimation.
Jikan Rest
⭐
168
The REST API for Jikan
Linkedin Learning Downloader
⭐
168
Linkedin Learning videos downloader
Xquery
⭐
154
Extract data or evaluate value from HTML/XML documents using XPath
Serpscrap
⭐
152
SEO Python scraper to extract data from major search engine result pages. Extracts data like URL, title, snippet, rich snippet, and type from search results for given keywords. Detects ads or makes automated screenshots. You can also fetch the text content of URLs provided in the search results or supplied on your own. It is useful for SEO and business-related research tasks.
Linkedin Profile Scraper
⭐
148
🕵️♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Sqrape
⭐
144
Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)
Shadow Useragent
⭐
144
Pick the most common user-agents on the Internet 👻
Fantasy Basketball
⭐
141
Scraping statistics, predicting NBA player performance with neural networks and boosting algorithms, and optimising lineups for Draft Kings with a genetic algorithm. Capstone Project for Machine Learning Engineer Nanodegree by Udacity.
Search Engine Google
⭐
136
🕷 Google client for SERPS
Educative.io Downloader
⭐
135
📖 This tool downloads courses from educative.io for offline use. It uses your login credentials to download the course.
Udemycoursegrabber
⭐
131
Your will to enroll in a Udemy course is here, but the money isn't? Search no more! This Python program searches for your desired course on more than [insert big number here] websites, compares the last updated dates, and gives you back the download link of the latest one, but you also have the choice to see the other ones as well!
Phpscraper
⭐
131
PHP Scraper - a highly opinionated web-interface for PHP
Torchbear
⭐
125
🔥🐻 The Speakeasy Scripting Engine Which Combines Speed, Safety, and Simplicity
Secret Agent
⭐
120
The web browser that's built for scraping.
Scan For Webcams
⭐
120
scan for webcams on the internet
Od Database
⭐
119
Distributed crawler, database and web frontend for public directories indexing
Seleniumcrawler
⭐
117
An example using Selenium webdrivers for Python and the Scrapy framework to create a web scraper to crawl an ASP site
Laravel Bank Statements
⭐
105
Laravel package to collect your bank statement history. Currently supports parsing statement history from BCA, Mandiri, BNI, and MUAMALAT e-banking websites.
Souqscraper
⭐
104
Simple scripts to level up your scraping skills, and source code for the Level UP playlist on YouTube
D4n155
⭐
103
OWASP D4N155 - Intelligent and dynamic wordlist using OSINT
Languagepod101 Scraper
⭐
99
Python scraper for Language Pods such as Japanesepod101.com 👹 🗾 🍣 Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨
Dotnetcrawler
⭐
95
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extendable to your custom requirements. Medium link : https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Nintendeals
⭐
91
Library with a set of tools for scraping information about Nintendo games and their prices across all regions (NA, EU and JP).
Billy
⭐
85
legacy backend for Open States
Pastepwn
⭐
83
Python framework to scrape Pastebin pastes and analyze them
Google Covid19 Mobility Reports
⭐
83
Data extraction of Google's COVID-19 Mobility Reports
Humanoid
⭐
82
Node.js package to bypass CloudFlare's anti-bot JavaScript challenges
Detect Cms
⭐
79
PHP Library for detecting CMS
Grawler
⭐
78
Grawler is a tool written in PHP which comes with a web interface that automates the task of using google dorks, scrapes the results, and stores them in a file.
Email Extractor
⭐
77
The main functionality is to extract all the emails from one or several URLs
Viewstate
⭐
75
ASP.NET View State Decoder
Nimquery
⭐
74
Nim library for querying HTML using CSS-selectors (like JavaScript's document.querySelector)
Api Store
⭐
69
Contains all the public APIs listed in Phantombuster's API store. Pull requests welcome!
Torrengo
⭐
62
Torrengo is a CLI (command line) program written in Go which concurrently searches torrents from various sources.
1-100 of 106 projects