Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Crawlee | 12,871 | 42 | 7 days ago | 747 | December 10, 2023 | 96 | apache-2.0 | TypeScript | ||
Crawlee is a web scraping and browser automation library for Node.js for building reliable crawlers in JavaScript and TypeScript. It can extract data for AI, LLMs, RAG, or GPTs and download HTML, PDF, JPG, PNG, and other files from websites. It works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP, in both headful and headless mode, with proxy rotation. | ||||||||||
Browser Fingerprinting | 3,353 | | | | a year ago | | | 7 | | JavaScript |
Analysis of Bot Protection systems with available countermeasures 🚿. How to defeat anti-bot systems 👻 and get around browser fingerprinting scripts 🕵️‍♂️ when scraping the web? | ||||||||||
Thal | 2,268 | | | | 4 years ago | | | | mit | JavaScript |
Getting started with Puppeteer and Chrome Headless for Web Scraping | ||||||||||
Phpscraper | 486 | | | | 7 months ago | 34 | June 18, 2023 | 20 | gpl-3.0 | PHP |
A universal web-util for PHP. | ||||||||||
Browsertrix Crawler | 470 | | | | 6 months ago | | | 91 | agpl-3.0 | JavaScript |
Run a high-fidelity browser-based crawler in a single Docker container | ||||||||||
Zimit | 209 | | | | 5 months ago | | | 31 | gpl-3.0 | Python |
Make a ZIM file from any Web site and surf offline! | ||||||||||
Aws Pdf Textract Pipeline | 148 | | | | 6 months ago | | | 5 | mit | TypeScript |
🔍 Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript. | ||||||||||
Gpt4v Scraper | 126 | | | | 6 months ago | | | | | JavaScript |
AI agent that can SEE 👁️, control, navigate, & do stuff for you on your browser. | ||||||||||
Actor Scraper | 93 | 1 | 2 | a year ago | 12 | May 28, 2019 | 13 | apache-2.0 | JavaScript | |
House of Apify Scrapers. Generic scraping actors with a simple UI to handle complex web crawling and scraping use cases. | ||||||||||
Browser Pool | 77 | 7 | 2 years ago | 82 | June 20, 2022 | 8 | TypeScript | |||
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent. |