Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Colly | 20,810 | 81 | 273 | 4 days ago | 22 | March 08, 2022 | 172 | apache-2.0 | Go | |
Elegant Scraper and Crawler Framework for Golang | ||||||||||
Proxy_pool | 18,119 | 3 months ago | 254 | mit | Python | |||||
Python crawler proxy IP pool (proxy pool) | ||||||||||
Easyspider | 16,826 | 5 days ago | 5 | agpl-3.0 | JavaScript | |||||
A visual no-code web crawler/spider. EasySpider (易采集) is a visual crawler application for designing and running crawl tasks graphically, without writing code. | ||||||||||
Pyspider | 15,943 | 30 | 2 | 3 months ago | 17 | April 18, 2018 | 297 | apache-2.0 | Python | |
A Powerful Spider(Web Crawler) System in Python. | ||||||||||
Examples Of Web Crawlers | 11,050 | a year ago | 3 | mit | Python | |||||
Some very interesting Python crawler examples that are friendly to beginners, mainly crawling Taobao, Tmall, WeChat, WeChat Read, Douban, QQ and other sites. | ||||||||||
Crawlab | 10,070 | 2 months ago | 1 | March 03, 2019 | 39 | bsd-3-clause | Go | |||
Distributed web crawler admin platform for spider management, regardless of language or framework. | ||||||||||
Photon | 9,272 | 5 | 9 months ago | 18 | January 25, 2019 | 46 | gpl-3.0 | Python | ||
Incredibly fast crawler designed for OSINT. | ||||||||||
Avbook | 8,777 | 7 months ago | 85 | PHP | ||||||
Adult video management system; crawlers for avmoo, javbus and javlibrary; online adult video library; adult video magnet link database (Japanese Adult Video Library, Adult Video Magnet Links - Japanese Adult Video Database) | ||||||||||
Spider Flow | 8,075 | 4 months ago | 20 | mit | Java | |||||
A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code. | ||||||||||
Infospider | 6,649 | 4 months ago | 7 | gpl-3.0 | Python | |||||
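INFO-SPIDER is a crawler toolbox 🧰 that brings many data sources together, aiming to help users retrieve their own data safely and quickly. The tool's code is open source and its process is transparent. Supported data sources include GitHub, QQ Mail, NetEase Mail, Aliyun Mail, Sina Mail, Hotmail, Outlook, JD.com, Taobao, Alipay, China Mobile, China Unicom, China Telecom, Zhihu, Bilibili, NetEase Cloud Music, QQ friends, QQ groups, WeChat Moments album generation, browser history, 12306, Cnblogs, CSDN blogs, OSChina blogs and Jianshu. |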
Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer
Golang-based distributed web crawler management platform, supporting various languages including Python, NodeJS, Go, Java, PHP and various web crawler frameworks including Scrapy, Puppeteer, Selenium.
You can follow the installation guide.
Open a command prompt and run the commands below. Make sure docker-compose is installed in advance.
git clone https://github.com/crawlab-team/examples
cd examples/docker/basic
docker-compose up -d
Next, you can look into the docker-compose.yml (with detailed config params) and the Documentation for further information.
Use docker-compose to start everything with a single command. This way you don't even have to configure a MongoDB database yourself. Create a file named docker-compose.yml and enter the content below.
version: '3.3'
services:
  master:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_master
    environment:
      CRAWLAB_NODE_MASTER: "Y"
      CRAWLAB_MONGO_HOST: "mongo"
    volumes:
      - "./.crawlab/master:/root/.crawlab"
    ports:
      - "8080:8080"
    depends_on:
      - mongo
  worker01:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_worker01
    environment:
      CRAWLAB_NODE_MASTER: "N"
      CRAWLAB_GRPC_ADDRESS: "master"
      CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
    volumes:
      - "./.crawlab/worker01:/root/.crawlab"
    depends_on:
      - master
  worker02:
    image: crawlabteam/crawlab:latest
    container_name: crawlab_example_worker02
    environment:
      CRAWLAB_NODE_MASTER: "N"
      CRAWLAB_GRPC_ADDRESS: "master"
      CRAWLAB_FS_FILER_URL: "http://master:8080/api/filer"
    volumes:
      - "./.crawlab/worker02:/root/.crawlab"
    depends_on:
      - master
  mongo:
    image: mongo:4.2
    container_name: crawlab_example_mongo
    restart: always
Then run the command below to start the Crawlab master node, the worker nodes and MongoDB. Open http://localhost:8080 in a browser to see the UI.
docker-compose up -d
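The containers can take a little while to come up, so the UI may not answer immediately. As a rough sketch only, the helper below polls the URL from the step above until it responds. The function names `http_probe` and `wait_until_up`, the retry count and the delay are illustrative choices, not part of Crawlab.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=3) as response:
            return response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or a malformed reply.
        return False


def wait_until_up(url: str, attempts: int = 30, delay: float = 2.0,
                  probe=http_probe) -> bool:
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call (hits the network, so it is left commented out):
# wait_until_up("http://localhost:8080")
```

The `probe` parameter is injectable so that the retry logic can be exercised without a running server.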
For Docker deployment details, please refer to the relevant documentation.
The architecture of Crawlab consists of a master node, worker nodes, SeaweedFS (a distributed file system) and a MongoDB database.
The frontend app interacts with the master node, which communicates with the other components: MongoDB, SeaweedFS and the worker nodes. The master node and worker nodes communicate with each other via gRPC (an RPC framework). Tasks are scheduled by the task scheduler module in the master node and received by the task handler module in the worker nodes, which executes them in task runners. Task runners are processes running spider or crawler programs, and they can also send data through gRPC (integrated in the SDK) to other data sources, e.g. MongoDB.
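The flow from scheduler to handler can be pictured with a toy model. This is a minimal sketch under loud assumptions: the classes `Task`, `WorkerNode` and `MasterNode` are hypothetical, plain method calls stand in for gRPC, and round-robin assignment is an illustrative choice rather than Crawlab's actual scheduling policy.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class Task:
    task_id: str
    command: str  # shell command that would start the spider


@dataclass
class WorkerNode:
    name: str
    received: list = field(default_factory=list)

    def handle(self, task: Task) -> None:
        # Stand-in for the task handler: a real worker would start a
        # task runner process here instead of only recording the task ID.
        self.received.append(task.task_id)


class MasterNode:
    def __init__(self, workers):
        self.queue = deque()            # pending tasks, oldest first
        self._workers = cycle(workers)  # assumed round-robin order

    def submit(self, task: Task) -> None:
        self.queue.append(task)

    def dispatch_all(self) -> None:
        # Hand each queued task to the next worker in turn.
        while self.queue:
            next(self._workers).handle(self.queue.popleft())


workers = [WorkerNode("worker01"), WorkerNode("worker02")]
master = MasterNode(workers)
for i in range(5):
    master.submit(Task(task_id=f"task-{i}", command="scrapy crawl example"))
master.dispatch_all()

for worker in workers:
    print(worker.name, worker.received)
```

With five tasks and two workers, the tasks alternate between the two nodes, which is the horizontal distribution the architecture description refers to.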
The Master Node is the core of the Crawlab architecture and acts as its central control system.
The Master Node communicates with the frontend app and sends crawling tasks to Worker Nodes. Meanwhile, it uploads (deploys) spiders to the distributed file system SeaweedFS so that Worker Nodes can synchronize them.
The main job of the Worker Nodes is to execute crawling tasks, store results and logs, and communicate with the Master Node through gRPC. By increasing the number of Worker Nodes, Crawlab can scale horizontally, and different crawling tasks can be assigned to different nodes.
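In the docker-compose.yml shown earlier, worker01 and worker02 differ only in their names and volume paths, so adding a node amounts to repeating that pattern. The sketch below generates such service entries as Python dictionaries. The helper names `worker_service` and `worker_services` are hypothetical, and the values are copied from the compose file above rather than from any other Crawlab reference.

```python
import json


def worker_service(index: int) -> dict:
    """Build one worker service entry following the worker01/worker02 pattern."""
    name = f"worker{index:02d}"
    return {
        "image": "crawlabteam/crawlab:latest",
        "container_name": f"crawlab_example_{name}",
        "environment": {
            "CRAWLAB_NODE_MASTER": "N",
            "CRAWLAB_GRPC_ADDRESS": "master",
            "CRAWLAB_FS_FILER_URL": "http://master:8080/api/filer",
        },
        "volumes": [f"./.crawlab/{name}:/root/.crawlab"],
        "depends_on": ["master"],
    }


def worker_services(count: int) -> dict:
    """Return entries for worker01 .. worker<count>, keyed by service name."""
    return {f"worker{i:02d}": worker_service(i) for i in range(1, count + 1)}


services = worker_services(3)
print(json.dumps(services, indent=2))
```

The printed entries would still need to be merged into the `services` section of a compose file by hand or by a tool of your choice.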
MongoDB is the operational database of Crawlab. It stores data about nodes, spiders, tasks, schedules, etc. The task queue is also stored in MongoDB.
SeaweedFS is an open source distributed file system authored by Chris Lu. It can robustly store and share files across a distributed system. In Crawlab, SeaweedFS mainly serves as the file synchronization system and as the place where task log files are stored.
The frontend app is built on Element-Plus, a popular Vue 3-based UI framework. It interacts with the API hosted on the Master Node and thereby indirectly controls the Worker Nodes.
The Crawlab SDK provides helper methods that make it easier to integrate your spiders into Crawlab, e.g. for saving results.
In the settings.py of your Scrapy project, find the variable named ITEM_PIPELINES (a dict) and add the content below.
ITEM_PIPELINES = {
    'crawlab.scrapy.pipelines.CrawlabPipeline': 888,
}
Then start the Scrapy spider. After it finishes, you should be able to see the scraped results in Task Detail -> Data.
To save results from your own spider code, add the content below to your spider files.
# import result saving method
from crawlab import save_item
# this is a result record, must be dict type
result = {'name': 'crawlab'}
# call result saving method
save_item(result)
Then start the spider. After it finishes, you should be able to see the scraped results in Task Detail -> Data.
A crawling task is actually executed through a shell command. The Task ID is passed to the crawling task process as an environment variable named CRAWLAB_TASK_ID, so that the data can be related to a task.
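A spider process can read that variable itself and attach the ID to every record it produces. The following is a minimal sketch: the helper `tag_with_task_id` and the field name `task_id` are assumptions made for illustration, and the field name the Crawlab SDK uses internally may differ.

```python
import os


def tag_with_task_id(record: dict, field_name: str = "task_id") -> dict:
    """Return a copy of the record with the Crawlab task ID attached, if one is set."""
    task_id = os.environ.get("CRAWLAB_TASK_ID")
    tagged = dict(record)  # copy, so the caller's dict is not modified
    if task_id is not None:
        tagged[field_name] = task_id  # "task_id" is a hypothetical field name
    return tagged


# Outside Crawlab the variable is unset, so simulate it for a local run.
os.environ.setdefault("CRAWLAB_TASK_ID", "demo-task-id")

result = tag_with_task_id({"name": "crawlab"})
print(result)
```

Because the ID arrives through the environment, the same approach works for spiders written in any language that can read environment variables.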
There are existing spider management frameworks. So why use Crawlab?
The reason is that most existing platforms depend on Scrapyd, which restricts the choice to Python and Scrapy. Scrapy is certainly a great web crawling framework, but it cannot do everything.
Crawlab is easy to use and general enough to accommodate spiders written in any language and any framework. It also has a beautiful frontend interface that makes managing spiders much easier.
Framework | Technology | Pros | Cons | Github Stats |
---|---|---|---|---|
Crawlab | Golang + Vue | Not limited to Scrapy; available for all programming languages and frameworks. Beautiful UI. Native support for distributed spiders. Supports spider management, task management, cron jobs, result export, analytics, notifications, configurable spiders, an online code editor, etc. | Does not yet support spider versioning. | |
ScrapydWeb | Python Flask + Vue | Beautiful UI, built-in Scrapy log parser, stats and graphs for task execution; supports node management, cron jobs, mail notifications and mobile. A full-featured spider management platform. | Does not support spiders other than Scrapy. Limited performance because of the Python Flask backend. | |
Gerapy | Python Django + Vue | Built by web crawler guru Germey Cui. Simple installation and deployment. Beautiful UI. Supports node management, code editing, configurable crawl rules, etc. | Again, does not support spiders other than Scrapy. Many bugs in v1.0 according to user feedback; improvements are expected in v2.0. | |
SpiderKeeper | Python Flask | Open-source Scrapyhub. Concise and simple UI. Supports cron jobs. | Perhaps too simplified: no pagination, no node management, and no support for spiders other than Scrapy. | |
If you feel Crawlab could benefit your daily work or your company, add the author's WeChat account with the note "Crawlab" to join the discussion group.