Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Node Crawler | 6,494 | 518 | 140 | 7 months ago | 31 | December 30, 2022 | 42 | mit | JavaScript | |
Web Crawler/Spider for NodeJS + server-side jQuery ;-) | ||||||||||
Headless Chrome Crawler | 5,051 | 10 | 12 | 2 years ago | 21 | June 11, 2018 | 28 | mit | JavaScript | |
Distributed crawler powered by Headless Chrome | ||||||||||
Node Web Crawler | 104 | 9 years ago | 4 | mit | JavaScript | |||||
A web scraper with a web user interface which shows scraping stats in realtime. Uses Node.JS, jQuery, socket.io and Express. | ||||||||||
Status Jquery Crawler | 69 | 11 years ago | 1 | other | JavaScript | |||||
Check for broken links in your website with jQuery | ||||||||||
Sitequery | 50 | 3 | 12 years ago | 19 | March 24, 2012 | 1 | JavaScript | |||
A node.js module for reactive webcrawling | ||||||||||
Dom Query | 46 | 6 years ago | November 27, 2020 | 4 | mit | PHP | ||||
A jQuery-like interface for DOM Crawling | ||||||||||
Json Web Crawler | 17 | 1 | a year ago | 19 | December 22, 2021 | JavaScript | ||||
Use JSON to list all elements (with css 3 and jquery selector) that you want to crawl. | ||||||||||
Crawler Client | 14 | 5 years ago | 2 | JavaScript | ||||||
crawler dev tools using electron webview | ||||||||||
Cnbeta | 14 | 7 years ago | Python | |||||||
Fetch all news from the cnbeta homepage with one click | ||||||||||
Pywebquery | 10 | 12 years ago | Python | |||||||
A jQuery-like Pythonic web crawler library, based on BeautifulSoup and wget |
Distributed crawler powered by Headless Chrome
Crawlers based on simple requests to HTML files are generally fast. However, they sometimes end up capturing empty bodies, especially when the websites are built on modern frontend frameworks such as AngularJS, React and Vue.js.
Powered by Headless Chrome, the crawler provides simple APIs to crawl these dynamic websites.
yarn add headless-chrome-crawler
# or "npm i headless-chrome-crawler"
Note: headless-chrome-crawler contains Puppeteer. During installation, it automatically downloads a recent version of Chromium. To skip the download, see Environment variables.
const HCCrawler = require('headless-chrome-crawler');

(async () => {
  const crawler = await HCCrawler.launch({
    // Function to be evaluated in browsers
    evaluatePage: (() => ({
      title: $('title').text(),
    })),
    // Function to be called with evaluated results from browsers
    onSuccess: (result => {
      console.log(result);
    }),
  });
  // Queue a request
  await crawler.queue('https://example.com/');
  // Queue multiple requests
  await crawler.queue(['https://example.net/', 'https://example.org/']);
  // Queue a request with custom options
  await crawler.queue({
    url: 'https://example.com/',
    // Emulate a tablet device
    device: 'Nexus 7',
    // Enable screenshot by passing options
    screenshot: {
      path: './tmp/example-com.png'
    },
  });
  await crawler.onIdle(); // Resolved when no queue is left
  await crawler.close(); // Close the crawler
})();
See here for the full examples list. The examples can be run from the root folder as follows:
NODE_PATH=../ node examples/priority-queue.js
See here for the API reference.
See here for the debugging tips.
There are roughly two types of crawlers. One is static and the other is dynamic.
Static crawlers are based on simple requests to HTML files. They are generally fast, but fail to scrape content when the HTML changes dynamically in the browser.
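To make the limitation concrete, here is a minimal sketch of what a static crawler "sees" when it does not execute JavaScript. The `staticText` helper and both HTML strings are hypothetical and written only for illustration; they are not part of this crawler's API, and the regex-based tag stripping is a rough approximation rather than a real HTML parser.

```javascript
// Hypothetical helper: approximate the text a static crawler can extract
// from raw HTML without running any JavaScript.
function staticText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, ' ') // drop script/style blocks
    .replace(/<[^>]+>/g, ' ')                                // drop the remaining tags
    .replace(/\s+/g, ' ')                                    // collapse whitespace
    .trim();
}

// A server-rendered page: the content is already present in the HTML.
const serverRendered = '<html><body><h1>Hello</h1><p>World</p></body></html>';

// A typical single-page-app shell: the content only appears after app.js runs.
const spaShell =
  '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';

console.log(JSON.stringify(staticText(serverRendered))); // → "Hello World"
console.log(JSON.stringify(staticText(spaShell)));       // → ""
```

The second result is the "empty body" case: the markup that a plain HTTP request returns contains no content until a browser executes the page's scripts.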
Dynamic crawlers based on PhantomJS and Selenium handle such dynamic applications well. However, PhantomJS's maintainer has stepped down and recommended switching to Headless Chrome, which is fast and stable. Selenium is still a well-maintained cross-browser platform that runs on Chrome, Safari, IE and so on, but crawlers do not need such cross-browser support.
This crawler is dynamic and based on Headless Chrome.
This crawler is built on top of Puppeteer.
Puppeteer provides low- to mid-level APIs to manipulate Headless Chrome, so you can build your own crawler with it. This way you have more control over which features to implement to satisfy your needs.
However, most crawlers require common features such as following links and obeying robots.txt. This crawler is a general solution for most crawling purposes. If you want to quickly start crawling with Headless Chrome, this crawler is for you.
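As a rough illustration of the bookkeeping those common features involve, the sketch below follows links breadth-first over an in-memory site while skipping paths disallowed by robots.txt. It is not this crawler's implementation: `parseDisallow`, `isAllowed`, `crawl`, the `site` map and the sample robots.txt are all hypothetical names and data, and the robots.txt handling is deliberately simplified (only the `User-agent: *` group, prefix matching, no `Allow` rules or wildcards).

```javascript
// Collect Disallow prefixes from the "User-agent: *" group (simplified).
function parseDisallow(robotsTxt) {
  const rules = [];
  let inStarGroup = false;
  for (const rawLine of robotsTxt.split('\n')) {
    const line = rawLine.split('#')[0].trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') inStarGroup = value === '*';
    else if (field === 'disallow' && inStarGroup && value) rules.push(value);
  }
  return rules;
}

// A path is allowed when no Disallow prefix matches it.
function isAllowed(rules, path) {
  return !rules.some(prefix => path.startsWith(prefix));
}

// Breadth-first crawl: `site` maps a path to the paths it links to.
function crawl(site, robotsTxt, start, maxDepth) {
  const rules = parseDisallow(robotsTxt);
  const visited = new Set();
  const queue = [{ path: start, depth: 0 }];
  const order = [];
  while (queue.length > 0) {
    const { path, depth } = queue.shift();
    if (visited.has(path) || !isAllowed(rules, path)) continue;
    visited.add(path);
    order.push(path);
    if (depth >= maxDepth) continue; // do not follow links past maxDepth
    for (const link of site[path] || []) {
      queue.push({ path: link, depth: depth + 1 });
    }
  }
  return order;
}

// Hypothetical in-memory site and robots.txt used only for this example.
const site = {
  '/': ['/about', '/admin', '/blog'],
  '/about': ['/'],
  '/blog': ['/blog/post-1', '/admin/login'],
  '/blog/post-1': [],
};
const robotsTxt = 'User-agent: *\nDisallow: /admin\n';

console.log(crawl(site, robotsTxt, '/', 2));
// → [ '/', '/about', '/blog', '/blog/post-1' ]
```

Even this toy version needs a visited set, a depth limit and a rule check, which is the kind of plumbing a general-purpose crawler saves you from rewriting on top of Puppeteer.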