Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Heritrix3 | 2,579 | 2 | 6 months ago | 9 | July 27, 2022 | 48 | other | Java | ||
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. | ||||||||||
Awesome Web Archiving | 1,669 | | | | 4 months ago | | | 3 | cc0-1.0 | |
An Awesome List for getting started with web archiving | ||||||||||
Grab Site | 1,254 | | | | a month ago | | | 92 | other | Python |
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns | ||||||||||
Awesome Datahoarding | 892 | | | | 8 months ago | | | 4 | | |
List of data-hoarding-related tools | ||||||||||
Brozzler | 613 | 2 | 3 months ago | 23 | January 02, 2020 | 40 | apache-2.0 | Python | ||
brozzler - distributed browser-based web crawler | ||||||||||
Archivebot | 328 | | | | 5 months ago | | | 169 | mit | Python |
ArchiveBot, an IRC bot for archiving websites | ||||||||||
Tumblr_crawler | 258 | | | | 6 years ago | | | 2 | gpl-3.0 | Python |
A multi-threaded crawler for Tumblr. | ||||||||||
Google Group Crawler | 213 | | | | 2 years ago | | | 6 | | Shell |
[Deprecated] Get (almost) original messages from Google Groups archives. Your data is yours. | ||||||||||
Cc Crawl Statistics | 97 | | | | 5 months ago | | | | apache-2.0 | Python |
Statistics of Common Crawl monthly archives mined from URL index files | ||||||||||
Wget Lua | 72 | | | | 5 months ago | | | 10 | gpl-3.0 | C |
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication. |