Awesome Open Source
Search results for archive crawler
13 search results found
Heritrix3
⭐
2,579
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Awesome Web Archiving
⭐
1,669
An Awesome List for getting started with web archiving
Grab Site
⭐
1,254
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
Awesome Datahoarding
⭐
892
List of data-hoarding related tools
Brozzler
⭐
613
brozzler - distributed browser-based web crawler
Archivebot
⭐
328
ArchiveBot, an IRC bot for archiving websites
Tumblr_crawler
⭐
258
A multi-threaded crawler for Tumblr.
Google Group Crawler
⭐
213
[Deprecated] Get (almost) original messages from google group archives. Your data is yours.
Cc Crawl Statistics
⭐
97
Statistics of Common Crawl monthly archives mined from URL index files
Wget Lua
⭐
72
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Wayback_archiver
⭐
39
Ruby gem to send URLs to Wayback Machine
Warcworker
⭐
33
A dockerized, queued high fidelity web archiver based on Squidwarc
Warc
⭐
19
Parse WARC (Web Archive Files) as a node.js stream
Actor Templates
⭐
19
This project is the 🏠 home of Apify actor template projects to help users quickly get started.
Web2warc
⭐
17
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
Mailinglistscraper
⭐
15
A python web scraper for public email lists.
Gzipstream
⭐
15
gzipstream allows Python to process multi-part gzip files from a streaming source
Sentry
⭐
14
Parallelized web crawler written in Golang
Waybackprov
⭐
13
utility to fetch provenance information from Internet Archive's Wayback Machine
Jawa
⭐
11
Web Archive Crawler (OSDI'22)
Ccrawl
⭐
11
Simple CORPORA list crawler
Ukwa Manage
⭐
10
Shepherding our web archives from crawl to access.
Webarchiver
⭐
9
Decentralized web archiving
Fess Testdata
⭐
9
Test Data Repository for Crawling/Parsing
Gallery Explorer
⭐
8
Gallery Information Explorer
Bamboo
⭐
8
Web archive collection manager
Chronicrawl
⭐
8
Experimental continuous web crawler for web archiving
Linkbak
⭐
8
linkbak is a web page archiver: it reads a list of links and dumps the corresponding pages in HTML and PDF.
Crawl Cfgov
⭐
8
Archive the HTML of consumerfinance.gov daily
Collyfront
⭐
8
This is the web UI for [Colly](https://github.com/gocolly/colly).
Datasurvey
⭐
7
Crawl a directory of files and generate a summary of what is available.
Fhirmaker
⭐
7
crawl public medical imaging archives, create Patient and DiagnosticReport resources which in turn are discoverable via a FHIR API
Httrack2arc
⭐
7
HTTrack2Arc is a tool that converts crawls made by HTTrack to Internet Archive ARC files.
Crawlrss
⭐
7
Crawl RSS - Heritrix 3 add-on
Spiderman
⭐
6
Exploration in browser-assisted web crawling
Photo De Duplication
⭐
5
Using OODT to leverage large scale photo duplication
Za
⭐
5
Scalable web crawling service
Webarticlecurator
⭐
5
Web Article Curator
Internet Archive Link Extractor
⭐
5
Tool for extracting external links of a URL from Internet Archive snapshots