Awesome Open Source

Search results for archive crawler

13 search results found

Heritrix3 ⭐ 2,579

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

Awesome Web Archiving ⭐ 1,669

An Awesome List for getting started with web archiving

Grab Site ⭐ 1,254

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Awesome Datahoarding ⭐ 892

List of data-hoarding related tools

Brozzler ⭐ 613

brozzler - distributed browser-based web crawler

Archivebot ⭐ 328

ArchiveBot, an IRC bot for archiving websites

Tumblr_crawler ⭐ 258

A multi-threaded crawler for Tumblr.

Google Group Crawler ⭐ 213

[Deprecated] Get (almost) original messages from Google Groups archives. Your data is yours.

Cc Crawl Statistics ⭐ 97

Statistics of Common Crawl monthly archives mined from URL index files

Wget Lua ⭐ 72

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

Wayback_archiver ⭐ 39

Ruby gem to send URLs to Wayback Machine

Warcworker ⭐ 33

A dockerized, queued high fidelity web archiver based on Squidwarc

Parse WARC (Web Archive Files) as a node.js stream

Actor Templates ⭐ 19

This project is the 🏠 home of Apify actor template projects to help users quickly get started.

Web2warc ⭐ 17

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

Mailinglistscraper ⭐ 15

A python web scraper for public email lists.

Gzipstream ⭐ 15

gzipstream allows Python to process multi-part gzip files from a streaming source

Parallelized web crawler written in Golang

Waybackprov ⭐ 13

Utility to fetch provenance information from the Internet Archive's Wayback Machine

Web Archive Crawler (OSDI'22)

Simple CORPORA list crawler

Ukwa Manage ⭐ 10

Shepherding our web archives from crawl to access.

Webarchiver ⭐ 9

Decentralized web archiving

Fess Testdata ⭐ 9

Test Data Repository for Crawling/Parsing

Gallery Explorer ⭐ 8

Gallery Information Explorer

Web archive collection manager

Chronicrawl ⭐ 8

Experimental continuous web crawler for web archiving

linkbak is a web page archiver: it reads a list of links and dumps the corresponding pages as HTML and PDF.

Crawl Cfgov ⭐ 8

Archive the HTML of consumerfinance.gov daily

Collyfront ⭐ 8

This is the web UI for [Colly](https://github.com/gocolly/colly).

Datasurvey ⭐ 7

Crawl a directory of files and generate a summary of what is available.

Fhirmaker ⭐ 7

Crawls public medical imaging archives and creates Patient and DiagnosticReport resources, which are in turn discoverable via a FHIR API

Httrack2arc ⭐ 7

HTTrack2Arc is a tool that converts crawls made by HTTrack to Internet Archive ARC files.

Crawl RSS - Heritrix 3 add-on

Spiderman ⭐ 6

Exploration in browser-assisted web crawling

Photo De Duplication ⭐ 5

Using OODT to leverage large scale photo duplication

Scalable web crawling service

Webarticlecurator ⭐ 5

Web Article Curator

Internet Archive Link Extractor ⭐ 5

Tool for extracting external links of a URL from Internet Archive snapshots

Related Searches

Python Crawler (4,545)

Python Archive (1,902)

Javascript Archive (1,148)

Javascript Crawler (1,142)

Crawler Scrapy (988)

Scraper Crawler (896)

Php Archive (870)

Java Crawler (807)

Crawler Spider (709)

Archive Zip (651)

Copyright 2018-2024 Awesome Open Source. All rights reserved.