Awesome Open Source

Search results for python warc

62 search results found

Archivebox ⭐ 19,721

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

Conifer ⭐ 1,434

Collect and revisit web pages.

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

Warcprox ⭐ 348

WARC writing MITM HTTP/S proxy

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

Archivebot ⭐ 328

ArchiveBot, an IRC bot for archiving websites

Cc Pyspark ⭐ 280

Process Common Crawl data with Python and Spark

Bitextor ⭐ 260

Bitextor generates translation memories from multilingual websites

Make a ZIM file from any Web site and surf offline!

Python library for reading and writing warc files

Streaming WARC/ARC library for fast web archive IO

Cocrawler ⭐ 159

CoCrawler is a versatile web crawler built using modern tools and concurrency.

Webarchiveplayer ⭐ 156

NOTE: This project is no longer being actively developed. Check out Webrecorder Player for the latest player. https://github.com/webrecorder/webrecorderplayer-e (Legacy: Desktop application for browsing web archives (WARC and ARC))

Cdx_toolkit ⭐ 121

A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine

Tool and library for handling Web ARChive (WARC) files.

Warctools ⭐ 84

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

Convert Directories, Files and ZIP Files to Web Archives (WARC)

Warc Proxy ⭐ 57

Serving content from a WARC

Warcmiddleware ⭐ 42

WarcMiddleware lets users seamlessly download a mirror copy of a website when running a web crawl with the Python web crawler Scrapy.

Newsgrabber ⭐ 34

Grabbing all news.

Pywb Webrecorder ⭐ 34

Check out https://github.com/webrecorder/webrecorder for the newer version matching https://webrecorder.io

Chatnoir Resiliparse ⭐ 33

A robust web archive analytics toolkit

Liveweb proxy of the Wayback Machine project

Warc2zim ⭐ 30

Command line tool to convert a file in the WARC format to a file in the ZIM format

Webarchive Indexing ⭐ 30

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

Forum Dl ⭐ 26

Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC

Indie Map ⭐ 24

🗺 A public IndieWeb social graph and dataset.

Python Webarchive ⭐ 24

Create WebKit/Safari .webarchive files on any platform

Metawarc ⭐ 21

metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives

Har2warc ⭐ 21

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

Warcproxy ⭐ 21

Saves proxied HTTP traffic to a WARC file.

Megawarc ⭐ 18

Nondestructive warc-in-tar to warc conversion

Cdxj Indexer ⭐ 17

CDXJ Indexing of WARC/ARCs

Cc Lambda ⭐ 16

Search the common crawl using lambda functions

Gzipstream ⭐ 15

gzipstream allows Python to process multi-part gzip files from a streaming source

Historian Warc 1 ⭐ 14

The Historian's WARC Toolkit

Warcmitmproxy ⭐ 14

HTTP(S) proxy that saves traffic to a WARC file, using libmitmproxy.

Commoncrawl Warc Retrieval ⭐ 14

Python tools to retrieve text from CommonCrawl WARC files based on cdx index.

Parler Data Tools ⭐ 12

Fyp Autotextsum ⭐ 12

Automatic Text Summarization with Machine Learning

Warcmerge ⭐ 10

Merging WARCs into a single WARC file

Warctozip Service ⭐ 10

An HTTP-based warc-to-zip converter

Reprozip Web ⭐ 10

ReproZip for the Preservation of Web Applications

Eis Warc Archiver ⭐ 10

ARCHIVED--Docker app to crawl URLs and generate WARCs

Py Wasapi Client ⭐ 9

A client for the Archive-It And Webrecorder WASAPI Data Transfer API

Webarchiver ⭐ 9

Decentralized web archiving

Draintasker ⭐ 9

a tool for continuously ingesting w/arc files into the archive

Warc Content ⭐ 9

simple warc archive content browser

Demonstration of using Python to process the Common Crawl dataset with the mrjob framework

Saitan saves a webpage from the Internet to a web archiving service such as the Internet Archive’s Wayback Machine or archive.is. It can also download a local copy of the page and all its components as a WARC file, and timestamp that file to prove it existed prior to some point in time.

Common_crawl_insight ⭐ 7

Warctozip ⭐ 7

Convert a warc to a zip with Hanzo warc-tools and warctozip.py

Proof-of-concept analytics dashboard for Social Feed Manager using ELK stack

Cdx Summary ⭐ 6

Summarize web archive capture index (CDX) files.

Webrender Phantomjs ⭐ 6

A RESTful API for rendering web pages in PhantomJS

Commoncrawl_downloader ⭐ 5

Web Data Processing Systems 2018 (VU course XM_40020)

Warcreplay ⭐ 5

Creates a proxy that lets you view the contents of a Warc file as though you were browsing the live web.

Pywb Ipfs ⭐ 5

Experimental recording and replay of WARCs to/from IPFS (https://ipfs.io/)

Pywb Warcbase ⭐ 5

pywb support for warcbase

Off Topic Memento Toolkit ⭐ 5

This system evaluates a collection of mementos (archived web pages) to determine which are off topic. The collection can be part of an Archive-It collection, a single TimeMap, or stored in a WARC file.

Webarticlecurator ⭐ 5

Web Article Curator

Copyright 2018-2024 Awesome Open Source. All rights reserved.