Awesome Open Source
Search results for web archiving
65 search results found
Archivebox
⭐
19,721
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Awesome Web Archiving
⭐
1,669
An Awesome List for getting started with web archiving
Conifer
⭐
1,458
Collect and revisit web pages.
Pywb
⭐
1,312
Core Python Web Archiving Toolkit for replay and recording of web archives
Archiveweb.page
⭐
674
A High-Fidelity Web Archiving Extension for Chrome and Chromium-based browsers!
Ipwb
⭐
577
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Replayweb.page
⭐
574
Serverless replay of web archives directly in the browser
Browsertrix Crawler
⭐
470
Run a high-fidelity browser-based crawler in a single Docker container
Auto Archiver
⭐
439
Automatically archive links to videos, images, and social media content from Google Sheets (and more).
Webrecorder Player
⭐
424
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
Perma
⭐
389
Indelible links
Warcdb
⭐
380
WarcDB: Web crawl data as SQLite databases.
Archivenow
⭐
376
A Tool To Push Web Resources Into Web Archives
Wail
⭐
330
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Waybackpy
⭐
235
Wayback Machine API interface & a command-line tool
Archiveror
⭐
188
Archiveror will help you preserve the webpages you love. 💾
Warcreate
⭐
187
Chrome extension to "Create WARC files from any webpage"
Warcio
⭐
173
Streaming WARC/ARC library for fast web archive IO
Squidwarc
⭐
163
Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium in headed or headless mode
Sfm Ui
⭐
148
Social Feed Manager user interface application.
Ph Submissions
⭐
133
The repository and website hosting the peer review process for new Programming Historian lessons
Archivebox Browser Extension
⭐
130
Official ArchiveBox browser extension: automatically/manually preserve your browsing history using ArchiveBox.
Electron Archivebox
⭐
127
Desktop Electron app for ArchiveBox internet archiver. (ALPHA: not ready for general use)
Cdx_toolkit
⭐
121
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Archivespark
⭐
118
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.
Wail
⭐
116
🐋 One-Click User Instigated Preservation
Browsertrix Cloud
⭐
113
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
Fatcat
⭐
98
Perpetual Access To The Scholarly Record
Warc Parquet
⭐
96
🗄️ A simple CLI for converting WARC to Parquet.
Awesome Memento
⭐
73
A list of things related to software, literature, and other content for 🕣 Memento
Wget Lua
⭐
72
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Node Warc
⭐
62
Parse And Create Web ARChive (WARC) files with node.js
Warrick
⭐
59
Recover lost websites from the Web Infrastructure
Collect
⭐
57
A server to collect & archive websites that also supports video downloads
Memgator
⭐
48
A Memento Aggregator CLI and Server in Go
Warcworker
⭐
33
A dockerized, queued, high-fidelity web archiver based on Squidwarc
Outbackcdx
⭐
28
Web archive index server based on RocksDB
Quickcacheandarchivesearch
⭐
27
Quick Cache and Archive search buttons
Web Snap
⭐
25
Create "perfect" snapshots of web pages
Homebrew Archivebox
⭐
22
Homebrew formula for the ArchiveBox self-hosted internet archiving solution.
Munin Indexer
⭐
21
A social media open post web archiving tool
Metawarc
⭐
21
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Httrack2warc
⭐
20
Converts HTTrack crawls to WARC files
Sandcrawler
⭐
19
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
Cc Notebooks
⭐
18
Various Jupyter notebooks about Common Crawl data
Bookmark Archiver
⭐
18
🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...
Cdxj Indexer
⭐
17
CDXJ Indexing of WARC/ARCs
Pdf_trio
⭐
15
A PDF classifier ensemble with REST API service
Seeder
⭐
15
Seeder - Czech webarchive curating tool and public site
Pip Archivebox
⭐
13
Official Python package for ArchiveBox, the self-hosted internet archiving solution.
Debian Archivebox
⭐
13
Home of the official apt/deb package for Ubuntu/Debian-based systems.
Conifer Deploy
⭐
13
Conifer setup and deployment via Ansible
Httpreserve
⭐
10
Digital Preservation of HTTP in documentary heritage.
Dat Share
⭐
10
A prototype server to swarm multiple DATs for Webrecorder
Ukwa Manage
⭐
10
Shepherding our web archives from crawl to access.
Webarchiver
⭐
9
Decentralized web archiving
Chronicrawl
⭐
8
Experimental continuous web crawler for web archiving
Internet Archiving Talk
⭐
8
🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.
Bamboo
⭐
8
Web archive collection manager
Hadoopconcatgz
⭐
7
A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz
Chropro
⭐
7
Chrome debugging protocol client for Java
Digestbox
⭐
7
DigestBox takes any webpage URL (news article, video link, comment thread, etc.) and gives you just the raw content. It's powered by ArchiveBox.io under the hood.
Capture Urls
⭐
5
Archive a list of URLs using the Wayback Machine
Linkstat
⭐
5
CLI implementation of httpreserve that can test links and retrieve internet archive replacements
Warcprotocol
⭐
5
Parser for WARC (aka WebArchive) files