Sandcrawler

Backend tools, specific to the Internet Archive (IA), for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
Alternatives To Sandcrawler
Project Name | Stars | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language | Description
--- | --- | --- | --- | --- | --- | --- | --- | ---
Awesome Web Archiving | 1,669 | 3 months ago | | | 3 | cc0-1.0 | | An Awesome List for getting started with web archiving
Browsertrix Crawler | 470 | 3 months ago | | | 91 | agpl-3.0 | JavaScript | Run a high-fidelity browser-based crawler in a single Docker container
Warcdb | 380 | 4 months ago | 4 | October 22, 2023 | 7 | apache-2.0 | Python | WarcDB: Web crawl data as SQLite databases.
Warcreate | 187 | 5 months ago | | | 58 | mit | JavaScript | Chrome extension to "Create WARC files from any webpage"
Squidwarc | 163 | 4 years ago | | | 9 | apache-2.0 | JavaScript | Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
Warc Parquet | 96 | 3 months ago | 10 | September 13, 2023 | 5 | | Rust | 🗄️ A simple CLI for converting WARC to Parquet.
Wget Lua | 72 | 4 months ago | | | 10 | gpl-3.0 | C | Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Warcworker | 33 | 4 years ago | | | 6 | gpl-3.0 | Python | A dockerized, queued high fidelity web archiver based on Squidwarc
Httrack2warc | 20 | 2 years ago | | | 2 | apache-2.0 | Java | Converts HTTrack crawls to WARC files
Sandcrawler | 19 | a year ago | | | 2 | | HTML | Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki