Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Awesome Web Archiving | 1,669 | | | | 3 months ago | | | 3 | cc0-1.0 | |
An Awesome List for getting started with web archiving | ||||||||||
Browsertrix Crawler | 470 | | | | 3 months ago | | | 91 | agpl-3.0 | JavaScript |
Run a high-fidelity browser-based crawler in a single Docker container | ||||||||||
Warcdb | 380 | | | | 4 months ago | 4 | October 22, 2023 | 7 | apache-2.0 | Python |
WarcDB: Web crawl data as SQLite databases. | ||||||||||
Warcreate | 187 | | | | 5 months ago | | | 58 | mit | JavaScript |
Chrome extension to "Create WARC files from any webpage" | ||||||||||
Squidwarc | 163 | | | | 4 years ago | | | 9 | apache-2.0 | JavaScript |
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head | ||||||||||
Warc Parquet | 96 | | | | 3 months ago | 10 | September 13, 2023 | 5 | | Rust |
🗄️ A simple CLI for converting WARC to Parquet. | ||||||||||
Wget Lua | 72 | | | | 4 months ago | | | 10 | gpl-3.0 | C |
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication. | ||||||||||
Warcworker | 33 | | | | 4 years ago | | | 6 | gpl-3.0 | Python |
A dockerized, queued high fidelity web archiver based on Squidwarc | ||||||||||
Httrack2warc | 20 | | | | 2 years ago | | | 2 | apache-2.0 | Java |
Converts HTTrack crawls to WARC files | ||||||||||
Sandcrawler | 19 | | | | a year ago | | | 2 | | HTML |
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki | ||||||||||