Project Name	Stars	Most Recent Commit	Total Releases	Latest Release	Open Issues	License	Language	Description
Awesome Web Archiving	1,669	3 months ago			3	cc0-1.0		An Awesome List for getting started with web archiving
Browsertrix Crawler	470	3 months ago			91	agpl-3.0	JavaScript	Run a high-fidelity browser-based crawler in a single Docker container
Warcdb	380	4 months ago	4	October 22, 2023	7	apache-2.0	Python	WarcDB: Web crawl data as SQLite databases.
Warcreate	187	5 months ago			58	mit	JavaScript	Chrome extension to "Create WARC files from any webpage"
Squidwarc	163	4 years ago			9	apache-2.0	JavaScript	Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
Warc Parquet	96	3 months ago	10	September 13, 2023	5		Rust	🗄️ A simple CLI for converting WARC to Parquet.
Wget Lua	72	4 months ago			10	gpl-3.0	C	Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Warcworker	33	4 years ago			6	gpl-3.0	Python	A dockerized, queued high fidelity web archiver based on Squidwarc
Httrack2warc	20	2 years ago			2	apache-2.0	Java	Converts HTTrack crawls to WARC files
Sandcrawler	19	a year ago			2		HTML	Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

Alternatives To Sandcrawler

Awesome Web Archiving ⭐ 1,669

An Awesome List for getting started with web archiving

most recent commit 3 months ago

Browsertrix Crawler ⭐ 470

Run a high-fidelity browser-based crawler in a single Docker container

most recent commit 3 months ago

Warcdb ⭐ 380

WarcDB: Web crawl data as SQLite databases.

total releases 4 | most recent commit 4 months ago

Warcreate ⭐ 187

Chrome extension to "Create WARC files from any webpage"

most recent commit 5 months ago

Squidwarc ⭐ 163

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

most recent commit 4 years ago

Warc Parquet ⭐ 96

🗄️ A simple CLI for converting WARC to Parquet.

total releases 10 | most recent commit 3 months ago

Wget Lua ⭐ 72

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

most recent commit 4 months ago

Warcworker ⭐ 33

A dockerized, queued high fidelity web archiver based on Squidwarc

most recent commit 4 years ago

Httrack2warc ⭐ 20

Converts HTTrack crawls to WARC files

most recent commit 2 years ago

Sandcrawler ⭐ 19

Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

most recent commit a year ago

Popular Web Archiving Projects

Archivebox ⭐ 19,489

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

dependent packages 1 | total releases 26 | latest release November 04, 2023 | most recent commit 22 days ago

Conifer ⭐ 1,434

Collect and revisit web pages.

most recent commit 5 months ago

Pywb ⭐ 1,259

Core Python Web Archiving Toolkit for replay and recording of web archives

dependent packages 3 | total releases 98 | latest release May 19, 2023 | most recent commit 5 months ago

Archiveweb.page ⭐ 674

A High-Fidelity Web Archiving Extension for Chrome and Chromium based browsers!

dependent packages 2 | total releases 15 | latest release October 07, 2023 | most recent commit 3 months ago

Ipwb ⭐ 577

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

total releases 241 | latest release August 16, 2023 | most recent commit 3 months ago

Popular Crawler Projects

Scrapy ⭐ 49,918

Scrapy, a fast high-level web crawling & scraping framework for Python.

dependent packages 445 | total releases 96 | latest release September 18, 2023 | most recent commit 3 months ago

Lux ⭐ 24,752

👾 Fast and simple video download library and CLI tool written in Go

dependent packages 8 | total releases 40 | latest release November 06, 2023 | most recent commit 17 days ago

Colly ⭐ 21,902

Elegant Scraper and Crawler Framework for Golang

dependent packages 328 | total releases 22 | latest release March 08, 2022 | most recent commit a month ago

Easyspider ⭐ 20,149

A visual no-code/code-free web crawler/spider. EasySpider: visual software for browser automation testing, data collection, and crawling that works graphically without code

most recent commit 16 days ago

Proxy_pool ⭐ 19,442

Python ProxyPool for web spider

most recent commit 3 months ago