Awesome Open Source

Search results for web archiving

65 search results found

Archivebox ⭐ 19,721

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Awesome Web Archiving ⭐ 1,669

An Awesome List for getting started with web archiving

Conifer ⭐ 1,458

Collect and revisit web pages.

Core Python Web Archiving Toolkit for replay and recording of web archives

Archiveweb.page ⭐ 674

A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Replayweb.page ⭐ 574

Serverless replay of web archives directly in the browser

Browsertrix Crawler ⭐ 470

Run a high-fidelity browser-based crawler in a single Docker container

Auto Archiver ⭐ 439

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

Webrecorder Player ⭐ 424

Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)

Indelible links

WarcDB: Web crawl data as SQLite databases.

Archivenow ⭐ 376

A Tool To Push Web Resources Into Web Archives

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

Waybackpy ⭐ 235

Wayback Machine API interface & a command-line tool

Archiveror ⭐ 188

Archiveror will help you preserve the webpages you love. 💾

Warcreate ⭐ 187

Chrome extension to "Create WARC files from any webpage"

Streaming WARC/ARC library for fast web archive IO

Squidwarc ⭐ 163

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, either headed or headless

Social Feed Manager user interface application.

Ph Submissions ⭐ 133

The repository and website hosting the peer review process for new Programming Historian lessons

Archivebox Browser Extension ⭐ 130

Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.

Electron Archivebox ⭐ 127

Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)

Cdx_toolkit ⭐ 121

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Archivespark ⭐ 118

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at Internet Archive.

🐋 One-Click User Instigated Preservation

Browsertrix Cloud ⭐ 113

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

Perpetual Access To The Scholarly Record

Warc Parquet ⭐ 96

🗄️ A simple CLI for converting WARC to Parquet.

Awesome Memento ⭐ 73

A list of things related to software, literature, and other content for 🕣 Memento

Wget Lua ⭐ 72

Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.

Node Warc ⭐ 62

Parse And Create Web ARChive (WARC) files with node.js

Recover lost websites from the Web Infrastructure

A server to collect & archive websites that also supports video downloads

Memgator ⭐ 48

A Memento Aggregator CLI and Server in Go

Warcworker ⭐ 33

A dockerized, queued high fidelity web archiver based on Squidwarc

Outbackcdx ⭐ 28

Web archive index server based on RocksDB

Quickcacheandarchivesearch ⭐ 27

Quick Cache and Archive search buttons

Web Snap ⭐ 25

Create "perfect" snapshots of web pages

Homebrew Archivebox ⭐ 22

Homebrew formula for the ArchiveBox self-hosted internet archiving solution.

Munin Indexer ⭐ 21

A social media open post web archiving tool

Metawarc ⭐ 21

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives

Httrack2warc ⭐ 20

Converts HTTrack crawls to WARC files

Sandcrawler ⭐ 19

Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki

Cc Notebooks ⭐ 18

Various Jupyter notebooks about Common Crawl data

Bookmark Archiver ⭐ 18

🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...

Cdxj Indexer ⭐ 17

CDXJ Indexing of WARC/ARCs

Pdf_trio ⭐ 15

A PDF classifier ensemble with REST API service

Seeder - Czech webarchive curating tool and public site

Pip Archivebox ⭐ 13

Official Python package for ArchiveBox, the self-hosted internet archiving solution.

Debian Archivebox ⭐ 13

Home of the official apt/deb package for Ubuntu/Debian-based systems.

Conifer Deploy ⭐ 13

Conifer setup and deployment via Ansible

Httpreserve ⭐ 10

Digital Preservation of HTTP in documentary heritage.

Dat Share ⭐ 10

A prototype server to swarm multiple DATs for Webrecorder

Ukwa Manage ⭐ 10

Shepherding our web archives from crawl to access.

Webarchiver ⭐ 9

Decentralized web archiving

Chronicrawl ⭐ 8

Experimental continuous web crawler for web archiving

Internet Archiving Talk ⭐ 8

🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.

Web archive collection manager

Hadoopconcatgz ⭐ 7

A Splitable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

Chrome debugging protocol client for Java

Digestbox ⭐ 7

DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.

Capture Urls ⭐ 5

Archive a list of URLs using the Wayback Machine

CLI implementation of httpreserve that can test links and retrieve Internet Archive replacements

Warcprotocol ⭐ 5

Parser for WARC (aka WebArchive) files

Copyright 2018-2024 Awesome Open Source. All rights reserved.