Awesome Open Source
Search results for python warc
62 search results found
Archivebox
⭐
19,721
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
Conifer
⭐
1,434
Collect and revisit web pages.
Ipwb
⭐
577
InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS
Warcprox
⭐
348
WARC writing MITM HTTP/S proxy
Wail
⭐
330
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Archivebot
⭐
328
ArchiveBot, an IRC bot for archiving websites
Cc Pyspark
⭐
280
Process Common Crawl data with Python and Spark
Bitextor
⭐
260
Bitextor generates translation memories from multilingual websites
Zimit
⭐
209
Make a ZIM file from any Web site and surf offline!
Warc
⭐
196
Python library for reading and writing WARC files
Warcio
⭐
173
Streaming WARC/ARC library for fast web archive IO
Cocrawler
⭐
159
CoCrawler is a versatile web crawler built using modern tools and concurrency.
Webarchiveplayer
⭐
156
NOTE: This project is no longer being actively developed. Check out Webrecorder Player for the latest player. https://github.com/webrecorder/webrecorderplayer-e (Legacy: Desktop application for browsing web archives (WARC and ARC))
Cdx_toolkit
⭐
121
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Warcat
⭐
96
Tool and library for handling Web ARChive (WARC) files.
Warctools
⭐
84
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
Warcit
⭐
63
Convert Directories, Files and ZIP Files to Web Archives (WARC)
Warc Proxy
⭐
57
Serving content from a WARC
Warcmiddleware
⭐
42
WarcMiddleware lets users seamlessly download a mirror copy of a website when running a web crawl with the Python web crawler Scrapy.
Newsgrabber
⭐
34
Grabbing all news.
Pywb Webrecorder
⭐
34
Check out https://github.com/webrecorder/webrecorder for the newer version matching https://webrecorder.io
Chatnoir Resiliparse
⭐
33
A robust web archive analytics toolkit
Liveweb
⭐
32
Liveweb proxy of the Wayback Machine project
Warc2zim
⭐
30
Command line tool to convert a file in the WARC format to a file in the ZIM format
Webarchive Indexing
⭐
30
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
Forum Dl
⭐
26
Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
Indie Map
⭐
24
🗺 A public IndieWeb social graph and dataset.
Python Webarchive
⭐
24
Create WebKit/Safari .webarchive files on any platform
Metawarc
⭐
21
metawarc: a command-line tool for metadata extraction from files from WARC (Web ARChive)
Har2warc
⭐
21
Convert HTTP Archive (HAR) -> Web Archive (WARC) format
Warcproxy
⭐
21
Saves proxied HTTP traffic to a WARC file.
Megawarc
⭐
18
Nondestructive warc-in-tar to warc conversion
Cdxj Indexer
⭐
17
CDXJ Indexing of WARC/ARCs
Cc Lambda
⭐
16
Search the Common Crawl using lambda functions
Gzipstream
⭐
15
gzipstream allows Python to process multi-part gzip files from a streaming source
Historian Warc 1
⭐
14
The Historian's WARC Toolkit
Warcmitmproxy
⭐
14
HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.
Commoncrawl Warc Retrieval
⭐
14
Python tools to retrieve text from CommonCrawl WARC files based on cdx index.
Parler Data Tools
⭐
12
Fyp Autotextsum
⭐
12
Automatic Text Summarization with Machine Learning
Warcmerge
⭐
10
Merging WARCs into a single WARC file
Warctozip Service
⭐
10
An HTTP-based warc-to-zip converter
Reprozip Web
⭐
10
ReproZip for the Preservation of Web Applications
Eis Warc Archiver
⭐
10
ARCHIVED--Docker app to crawl URLs and generate WARCs
Py Wasapi Client
⭐
9
A client for the Archive-It and Webrecorder WASAPI Data Transfer API
Webarchiver
⭐
9
Decentralized web archiving
Draintasker
⭐
9
a tool for continuously ingesting w/arc files into the archive
Warc Content
⭐
9
simple warc archive content browser
Cc Mrjob
⭐
9
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
Saitan
⭐
8
Saitan lets you save a webpage from the Internet to a web archiving tool such as the Internet Archive’s Wayback Machine or archive.is. It can also download a local copy of the page and all its components into a WARC file, and it can timestamp that file to prove that it existed prior to some point in time.
Common_crawl_insight
⭐
7
Warctozip
⭐
7
Convert a warc to a zip with Hanzo warc-tools and warctozip.py
Sfm Elk
⭐
7
Proof-of-concept analytics dashboard for Social Feed Manager using ELK stack
Cdx Summary
⭐
6
Summarize web archive capture index (CDX) files.
Webrender Phantomjs
⭐
6
A RESTful API for rendering web pages in PhantomJS
Commoncrawl_downloader
⭐
5
Wdps
⭐
5
Web Data Processing Systems 2018 (VU course XM_40020)
Warcreplay
⭐
5
Creates a proxy that lets you view the contents of a WARC file as though you were browsing the live web.
Pywb Ipfs
⭐
5
Experimental recording and replay of WARCs to/from IPFS (https://ipfs.io/)
Pywb Warcbase
⭐
5
pywb support for warcbase
Off Topic Memento Toolkit
⭐
5
This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.
Webarticlecurator
⭐
5
Web Article Curator
Copyright 2018-2024 Awesome Open Source. All rights reserved.