Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Lighthouse Parade | 341 | 8 days ago | 13 | October 20, 2022 | 24 | mit | TypeScript | |||
A Node.js command line tool that crawls a domain and gathers lighthouse performance data for every page. | ||||||||||
Gulp Sitespeedio | 37 | 14 | 9 | 5 years ago | 11 | December 14, 2017 | 1 | mit | JavaScript | |
Test performance budgets and performance best practice rules using sitespeed.io http://www.sitespeed.io | ||||||||||
Torequests | 22 | 13 days ago | mit | Python | ||||||
Async wrapper for requests / aiohttp, and some crawler toolkits. Let synchronization code enjoy the performance of asynchronous programming. | ||||||||||
Stock Fundamental Data Scraping And Analysis | 14 | 2 years ago | Jupyter Notebook | |||||||
Project on building a web crawler to collect the fundamentals of the stock and review their performance in one go | ||||||||||
Iptv Extreme | 11 | 7 years ago | 3 | mit | C++ | |||||
Lighting fast IPTV version, performance wise | ||||||||||
Api Wanplus Crawler | 6 | 4 years ago | 10 | March 29, 2017 | 1 | mit | Python | |||
A small custom lol-esports-api of wanplus.com | ||||||||||
Spiderman | 5 | 5 years ago | November 30, 2020 | mit | PHP | |||||
high performance crawler framework | ||||||||||
Repo Crawler | 3 | 3 years ago | 1 | Kotlin | ||||||
A small utility for crawling model repositories |
Cheap modular c++ crawler library
This library, although STABLE is a WORK IN PROGRESS. Feel free to use is, ask about it and contribute.
This C++ library is designed to run as many simultaneous downloads as it is physically possible on a single machine. It is implemented using Boost.Asio with libcurl MULTI_SOCKET.
A sample executable driver is available which uses the library to download a list of URLs. You can use it as a guideline of how to integrate CheapCrawler in your own project.
The library has 2 main components the crawler and the downloader.
Some utilities are also available:
Flow:
While (keepCrawling)
Pull URL list from the crawling dispatcher
Prepare robots.txt download for each host and protocol
Try download each robots.txt
log each downloaded robots.txt result
if no robots.txt => crawl all
if download error => set result to download error, don't crawl
else (successful download of robots.txt)
(NOT IMPLEMENTED YET) For all forbidden urls by robots.txt:
=> remove from download queue
Download allowed URLs with 2 secs default delay between download requests per same host
Some additional specifications:
keepCrawling()
returns falseImplements the Downloader interface of the crawler in a performant oriented implementation using libcurl and Boost.Asio. Multiple simultaneous downloads are possible with a fixed number of threads usage. Should theoretically support multiple hundreds of simultaneous downloads.
I developed this library on macOS and haven't tested it on anything else, but it should be easy to cross compile it on Linux.
To be able to use CheapCrawler you will need the following:
Here is how you can build the library and the driver executable separately.
git clone https://github.com/EugenCazacu/CheapCrawler.git
mkdir build && cd build
conan install ../CheapCrawler
cmake -DCHEAP_CRAWLER_RUN_TESTS=ON -DCMAKE_BUILD_TYPE=Release ../CheapCrawler
make
./bin/crawlerDriver -u "http://example.com"
Here is how I integrate CheapCrawler into my project. There is definitely room for improvement here.
git remote add CheapCrawler https://github.com/EugenCazacu/CheapCrawler.git
git subtree add --squash --prefix=external/CheapCrawler CheapCrawler master
To update to a new CheapCrawler version:
git subtree pull --squash --prefix=external/CheapCrawler CheapCrawler master
After adding the CheapCrawler
directory to your cmake build, you can use the crawlerLibrary
target.
I am developing this library along with another private project. I will be adding features and fixing issues as I need them. If you would like to use this library and need support, feel free to contact me and I will try to help you. This library is a work in progress and if you find any problems with it, please submit an issue. If you want to contribute, please get in contact with me beforehand to make sure I will be able to accept your contribution.
Here is a list of additional features I plan and hope to implement: