Scrapyscript

Run a Scrapy spider programmatically from a script or a Celery task - no project required.
Alternatives To Scrapyscript
  • Webspider (353 stars; MIT; Python; last commit 2 months ago; 3 open issues): live demo at http://119.23.223.90:8000
  • Celerystalk (294 stars; MIT; Python; last commit 3 years ago; 24 open issues): An asynchronous enumeration & vulnerability scanner. Run all the tools on all the hosts.
  • Videospider (178 stars; MIT; Python; last commit 4 years ago; 2 open issues): Scrapes TV series, movie, and anime actor data from Douban, bilibili, and similar sites.
  • Spiders (152 stars; Python; last commit 3 years ago; 1 open issue): A collection of crawlers large and small.
  • Scrapyscript (92 stars; MIT; Python; last commit a year ago; 10 releases, latest December 11, 2021; 6 open issues): Run a Scrapy spider programmatically from a script or a Celery task - no project required.
  • Ark (58 stars; Python; last commit 7 years ago): A distributed scanning framework.
  • Spider_docker (30 stars; last commit 7 years ago; 1 open issue): Creates containers for crawler applications, bundling scrapy, mongo, celery, and rabbitmq.
  • Deadpool (22 stars; Python; last commit 3 years ago): A crawler application built on Celery that can flexibly add crawl tasks and run spiders for multiple sites at once; every component natively supports large-scale concurrency and distribution, which, combined with Celery's native distributed calls, enables crawling at scale.
  • Netease_spider (17 stars; Python; last commit 6 years ago): A crawler for NetEase Yanxuan.
  • Antitools (17 stars; Python; last commit 3 years ago; 1 open issue): antitools.


Scrapyscript

Embed Scrapy jobs directly in your code

What is Scrapyscript?

Scrapyscript is a Python library you can use to run Scrapy spiders directly from your code. Scrapy is a great framework for scraping projects, but sometimes you don't need the whole framework and just want to run a small spider from a script or a Celery job. That's where Scrapyscript comes in.

With Scrapyscript, you can:

  • wrap regular Scrapy Spiders in a Job
  • load the Job(s) in a Processor
  • call processor.run() to execute them

... returning all results when the last job completes.

Let's see an example.

import scrapy
from scrapyscript import Job, Processor

processor = Processor(settings=None)

class PythonSpider(scrapy.spiders.Spider):
    name = "myspider"

    def start_requests(self):
        yield scrapy.Request(self.url)

    def parse(self, response):
        data = response.xpath("//title/text()").extract_first()
        return {'title': data}

job = Job(PythonSpider, url="http://www.python.org")
results = processor.run(job)

print(results)
# [{ "title": "Welcome to Python.org" }]

See the examples directory for more, including a complete Celery example.
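For a rough idea of the Celery case, here is a minimal sketch (not the shipped example): the broker URL and the task name crawl are placeholders, and PythonSpider is the spider defined above.

from celery import Celery
from scrapyscript import Job, Processor

# Placeholder broker URL - point this at your own RabbitMQ or Redis instance.
app = Celery("tasks", broker="pyamqp://guest@localhost//")

@app.task
def crawl(url):
    # Build the Processor inside the task so each call gets a fresh reactor.
    job = Job(PythonSpider, url=url)
    return Processor(settings=None).run(job)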

Install

pip install scrapyscript

Requirements

  • Linux or macOS
  • Python 3.8+
  • Scrapy 2.5+

API

Job (spider, *args, **kwargs)

A single request to call a spider, optionally passing in *args or **kwargs, which will be passed through to the spider constructor at runtime.

# url will be available as self.url inside MySpider at runtime
myjob = Job(MySpider, url='http://www.github.com')

Processor (settings=None)

Create a multiprocessing reactor for running spiders. Optionally provide a scrapy.settings.Settings object to configure the Scrapy runtime.

settings = scrapy.settings.Settings(values={'LOG_LEVEL': 'WARNING'})
processor = Processor(settings=settings)

Processor.run(jobs)

Start the Scrapy engine, and execute one or more jobs. Blocks and returns consolidated results in a single list. jobs can be a single instance of Job, or a list.

results = processor.run(myjob)

or

results = processor.run([myjob1, myjob2, ...])

A word about Spider outputs

As per the Scrapy docs, a Spider must return an iterable of Request and/or dict or Item objects.

Requests will be consumed by Scrapy inside the Job. dict or scrapy.Item objects will be queued and output together when all spiders are finished.

Due to the way billiard handles communication between processes, each dict or Item must be pickle-able using pickle protocol 0. It's generally best to output dict objects from your Spider.
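Since items must survive that round trip, a quick sanity check is to pickle one yourself. A small sketch (the item here is just the sample output from above):

import pickle

item = {"title": "Welcome to Python.org"}

# pickle.dumps raises if the item can't cross the billiard process boundary.
pickle.dumps(item, protocol=0)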

Contributing

Updates, additional features or bug fixes are always welcome.

Setup

Tests

  • make test or make tox

Version History

See CHANGELOG.md

License

The MIT License (MIT). See the LICENCE file for details.
