Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Trafilatura | 2,447 | 66 | | | 3 months ago | 39 | November 29, 2023 | 66 | gpl-3.0 | Python |
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments | ||||||||||
Weibo_terminater | 2,265 | | | | 5 years ago | | | 9 | | Python |
Final Weibo crawler: scrape anything from Weibo (comments, post contents, followers). The Terminator | ||||||||||
Bookcorpus | 698 | | | | 9 months ago | | | 5 | mit | Python |
Crawl BookCorpus | ||||||||||
Commoncrawl | 466 | | | | 6 years ago | | | 8 | | C++ |
Common Crawl support library to access 2008-2012 crawl archives (ARC files) | ||||||||||
Ptt Chat Generator | 190 | | | | 4 years ago | | | 4 | mit | Python |
PTT push-comment generator | ||||||||||
Corpuscrawler | 176 | | | | 5 months ago | | | 16 | other | Python |
Crawler for linguistic corpora | ||||||||||
Indonesian Nlp Resources | 98 | | | | 4 years ago | | | | mit | |
Data resources for Indonesian-language NLP | ||||||||||
Ktspeechcrawler | 73 | | | | 4 years ago | | | 2 | mit | Python |
Automatically constructing corpus for automatic speech recognition from YouTube videos | ||||||||||
Worldfactbook Dataset | 36 | | | | 10 years ago | | | | | CSS |
Teneo | 22 | | | | 11 years ago | | | | | Java |