Heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Alternatives To Heritrix3
Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language
Heritrix3 | 2,579 | 26 months ago | 9 | July 27, 2022 | 48 | other | Java
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Awesome Web Archiving | 1,669 | 3 months ago | 3 | cc0-1.0
An Awesome List for getting started with web archiving
Grab Site | 1,254 | a month ago | 92 | other | Python
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Awesome Datahoarding | 892 | 7 months ago | 4
List of data-hoarding related tools
Brozzler | 613 | 23 months ago | 23 | January 02, 2020 | 40 | apache-2.0 | Python
brozzler - distributed browser-based web crawler
Archivebot | 328 | 5 months ago | 169 | mit | Python
ArchiveBot, an IRC bot for archiving websites
Tumblr_crawler | 258 | 6 years ago | 2 | gpl-3.0 | Python
This is a Multi-thread crawler for Tumblr.
Google Group Crawler | 213 | 2 years ago | 6 | Shell
[Deprecated] Get (almost) original messages from google group archives. Your data is yours.
Cc Crawl Statistics | 97 | 4 months ago | apache-2.0 | Python
Statistics of Common Crawl monthly archives mined from URL index files
Wget Lua | 72 | 4 months ago | 10 | gpl-3.0 | C
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.