Project Name | Description | Stars | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language
---|---|---|---|---|---|---|---|---
Webspider | Online demo: http://119.23.223.90:8000 | 353 | 2 months ago | | | 3 | mit | Python
Celerystalk | An asynchronous enumeration & vulnerability scanner. Run all the tools on all the hosts. | 294 | 3 years ago | | | 24 | mit | Python
Videospider | Scrapes TV series, movie, and anime actor data from Douban, Bilibili, and similar sites | 178 | 4 years ago | | | 2 | mit | Python
Spiders | A collection of crawlers, large and small | 152 | 3 years ago | | | 1 | | Python
Scrapyscript | Run a Scrapy spider programmatically from a script or a Celery task - no project required. | 92 | a year ago | 10 | December 11, 2021 | 6 | mit | Python
Ark | Distributed scanning framework | 58 | 7 years ago | | | | | Python
Spider_docker | Creates containers for crawler applications, including the modules scrapy, mongo, celery, and rabbitmq | 30 | 7 years ago | | | 1 | |
Deadpool | A crawler application built on Celery as its core framework; crawler tasks can be added flexibly and spiders for multiple sites run at the same time. All components natively support large-scale concurrency and distribution, and together with Celery's native distributed calls this enables large-scale concurrent crawling. | 22 | 3 years ago | | | | | Python
Netease_spider | Crawler for NetEase Yanxuan | 17 | 6 years ago | | | | | Python
Antitools | antitools | 17 | 3 years ago | | | 1 | | Python
Scrapyscript is a Python library you can use to run Scrapy spiders directly from your code. Scrapy is a great framework to use for scraping projects, but sometimes you don't need the whole framework, and just want to run a small spider from a script or a Celery job. That's where Scrapyscript comes in.
With Scrapyscript, you can:

- wrap a spider and its arguments in a `Job`
- collect one or more `Job(s)` in a `Processor`
- call `processor.run()` to execute them

... returning all results when the last job completes.
Let's see an example.
```python
import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)
print(results)
```

```
[{ "title": "Welcome to Python.org" }]
```
See the examples directory for more, including a complete Celery example.
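The bundled Celery example isn't reproduced here, but a minimal sketch of the idea looks like this, assuming a local Redis broker and a hypothetical `crawl` task name:

```python
import scrapy
from celery import Celery
from scrapyscript import Job, Processor

# Assumed broker URL, for illustration only
app = Celery("tasks", broker="redis://localhost:6379/0")

class TitleSpider(scrapy.spiders.Spider):
    name = "titlespider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        return {"title": response.xpath("//title/text()").extract_first()}

@app.task
def crawl(url):
    # Each call builds a fresh Processor, which runs the spider in its own
    # subprocess and blocks until the Job completes.
    return Processor(settings=None).run(Job(TitleSpider, url=url))
```

Application code would then call `crawl.delay('http://www.python.org')` and read the consolidated results from whatever Celery result backend you have configured.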
To install from PyPI:

```shell
pip install scrapyscript
```
A `Job` is a single request to call a spider, optionally passing in \*args or \*\*kwargs, which will be passed through to the spider constructor at runtime.

```python
# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')
```
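Positional arguments work the same way. As a sketch, a hypothetical spider that defines its own constructor receives whatever the `Job` was given:

```python
import scrapy
from scrapyscript import Job

class CategorySpider(scrapy.spiders.Spider):
    name = "categoryspider"

    def __init__(self, category, limit=10, **kwargs):
        # Arguments from Job(...) arrive here when the spider is instantiated
        super().__init__(**kwargs)
        self.category = category
        self.limit = limit

    def start_requests(self):
        # Placeholder URL; only illustrates that the arguments are available
        yield scrapy.Request(f"http://example.com/{self.category}?limit={self.limit}")

    def parse(self, response):
        return {"url": response.url}

# 'books' is passed positionally, limit as a keyword argument
myjob = Job(CategorySpider, "books", limit=5)
```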
A `Processor` creates a multiprocessing reactor for running spiders. Optionally provide a `scrapy.settings.Settings` object to configure the Scrapy runtime.

```python
settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)
```
`processor.run(jobs)` starts the Scrapy engine and executes one or more jobs. It blocks and returns consolidated results in a single list. `jobs` can be a single instance of `Job`, or a list.

```python
results = processor.run(myjob)
```

or

```python
results = processor.run([myjob1, myjob2, ...])
```
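For instance, a sketch that consolidates results from two jobs in one call (the URLs are placeholders):

```python
import scrapy
from scrapyscript import Job, Processor

class TitleSpider(scrapy.spiders.Spider):
    name = "titlespider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        return {"url": response.url,
                "title": response.xpath("//title/text()").extract_first()}

processor = Processor(settings=None)
jobs = [Job(TitleSpider, url="http://www.python.org"),
        Job(TitleSpider, url="https://scrapy.org")]

# Blocks until both spiders finish; dicts from both jobs come back in one list
results = processor.run(jobs)
print(results)
```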
As per the Scrapy docs, a `Spider` must return an iterable of `Request` and/or `dict` or `Item` objects.

Requests will be consumed by Scrapy inside the `Job`. `dict` or `scrapy.Item` objects will be queued and output together when all spiders are finished.
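For example, a single spider can yield both kinds of object; the follow-up requests are crawled, while the dicts end up in the results list (the link-following below is illustrative only):

```python
import scrapy

class LinkTitleSpider(scrapy.spiders.Spider):
    name = "linktitlespider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        # dicts are queued and returned by processor.run() when all spiders finish
        yield {"url": response.url,
               "title": response.xpath("//title/text()").extract_first()}

        # Requests are consumed by Scrapy inside the Job
        for href in response.xpath("//a/@href").extract()[:3]:
            yield response.follow(href, callback=self.parse)
```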
Due to the way billiard handles communication between processes, each `dict` or `Item` must be pickle-able using pickle protocol 0. It's generally best to output `dict` objects from your Spider.
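If in doubt, you can check an item up front; this is just a sanity check with the standard library, not part of the Scrapyscript API:

```python
import pickle

item = {"title": "Welcome to Python.org"}

# Raises an error (e.g. TypeError or pickle.PicklingError) if the item
# cannot be serialized with pickle protocol 0
pickle.dumps(item, protocol=0)
```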
Updates, additional features or bug fixes are always welcome.

To set up a development environment:

```shell
git clone git@github.com:jschnurr/scrapyscript.git
poetry install
```

To run the tests:

```shell
make test
```

or

```shell
make tox
```
See CHANGELOG.md
The MIT License (MIT). See LICENCE file for details.