Scrapy with Puppeteer


Scrapy middleware to handle JavaScript pages using Puppeteer.

⚠ IN ACTIVE DEVELOPMENT - READ BEFORE USING ⚠

This is an attempt to make Scrapy and Puppeteer work together to handle JavaScript-rendered pages. The design is strongly inspired by the Scrapy Splash plugin.

Scrapy and Puppeteer

The main issue when running Scrapy and Puppeteer together is that Scrapy uses Twisted, while Pyppeteer (the Python port of Puppeteer we are using) uses asyncio for its asynchronous code.

Luckily, we can use Twisted's asyncio reactor to make the two talk to each other.

That's why you cannot use the built-in scrapy command line (it installs the default reactor); you will have to use the scrapyp one, provided by this module.

If you are running your spiders from a script, you will have to make sure you install the asyncio reactor before importing scrapy or doing anything else:

import asyncio
from twisted.internet import asyncioreactor

asyncioreactor.install(asyncio.get_event_loop())
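
Because Twisted refuses to install a second reactor once twisted.internet.reactor has been imported, it can help to fail early with a clear message when the install happens too late. The following is only a sketch under that assumption: reactor_already_imported and install_asyncio_reactor are hypothetical helper names, not part of scrapy-puppeteer, and the event loop is created explicitly instead of being fetched with get_event_loop().

```python
import sys


def reactor_already_imported(modules=None):
    """Return True if a Twisted reactor module is already loaded."""
    modules = sys.modules if modules is None else modules
    return "twisted.internet.reactor" in modules


def install_asyncio_reactor():
    """Install Twisted's asyncio reactor, failing early if it is too late."""
    if reactor_already_imported():
        raise RuntimeError(
            "a Twisted reactor is already imported; install the asyncio "
            "reactor before importing scrapy"
        )
    # Imported lazily so that the check above runs first.
    import asyncio
    from twisted.internet import asyncioreactor

    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    asyncioreactor.install(loop)
```

Calling install_asyncio_reactor() as the first statement of a runner script, before any scrapy import, keeps the ordering requirement described above explicit.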

Installation

$ pip install scrapy-puppeteer

Configuration

Add the PuppeteerMiddleware to the downloader middlewares:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_puppeteer.PuppeteerMiddleware': 800
}
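
The value 800 is the middleware's order. As a rough illustration of where that places it, the sketch below sorts it alongside a few of Scrapy's default downloader middlewares. The default values are copied from DOWNLOADER_MIDDLEWARES_BASE as an assumption and may differ between Scrapy versions; request_order is a hypothetical helper, not a Scrapy function.

```python
# A few entries from Scrapy's DOWNLOADER_MIDDLEWARES_BASE (assumed values;
# check the settings reference for your Scrapy version).
SCRAPY_DEFAULTS = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_puppeteer.PuppeteerMiddleware': 800,
}


def request_order(*settings):
    """Merge middleware dicts and list names in process_request order."""
    merged = {}
    for mapping in settings:
        merged.update(mapping)
    # A value of None disables a middleware in Scrapy, so skip those entries.
    enabled = {name: order for name, order in merged.items() if order is not None}
    return sorted(enabled, key=enabled.get)


order = request_order(SCRAPY_DEFAULTS, DOWNLOADER_MIDDLEWARES)
```

With these values, the Puppeteer middleware sees an outgoing request after the proxy middleware and before the stats and HTTP cache middlewares.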

Usage

Use the scrapy_puppeteer.PuppeteerRequest instead of the Scrapy built-in Request like below:

from scrapy_puppeteer import PuppeteerRequest

def your_parse_method(self, response):
    # Your code...
    yield PuppeteerRequest('http://httpbin.org', self.parse_result)

The request will then be handled by Puppeteer.

The response's selector attribute works as usual (but contains the HTML processed by Puppeteer).

def parse_result(self, response):
    print(response.selector.xpath('//title/text()').get())

Additional arguments

scrapy_puppeteer.PuppeteerRequest accepts 3 additional arguments:

wait_until

Will be passed to the waitUntil parameter of Puppeteer. Defaults to domcontentloaded.

wait_for

Will be passed to Puppeteer's waitFor.
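
As a small sketch of how these two arguments might be prepared before building a request, the helper below validates wait_until against the four lifecycle events Puppeteer documents for waitUntil (load, domcontentloaded, networkidle0, networkidle2). The name puppeteer_kwargs and the #content selector are hypothetical and not part of scrapy-puppeteer; whether the middleware performs a similar check itself is not stated here.

```python
# Lifecycle events accepted by Puppeteer's waitUntil navigation option.
VALID_WAIT_UNTIL = ("load", "domcontentloaded", "networkidle0", "networkidle2")


def puppeteer_kwargs(wait_until="domcontentloaded", wait_for=None):
    """Build keyword arguments for PuppeteerRequest (hypothetical helper)."""
    if wait_until not in VALID_WAIT_UNTIL:
        raise ValueError(
            "wait_until must be one of %s, got %r"
            % (", ".join(VALID_WAIT_UNTIL), wait_until)
        )
    kwargs = {"wait_until": wait_until}
    # Only forward wait_for when the caller asked for it.
    if wait_for is not None:
        kwargs["wait_for"] = wait_for
    return kwargs


# Wait for the network to go quiet, then for a (hypothetical) #content element.
kwargs = puppeteer_kwargs(wait_until="networkidle2", wait_for="#content")
# In a spider: yield PuppeteerRequest(url, self.parse_result, **kwargs)
```

Rejecting an unknown wait_until value in the spider surfaces a typo immediately, rather than later inside the browser call.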

screenshot

When used, Puppeteer will take a screenshot of the page, and the binary data of the captured .png will be added to the response meta:

yield PuppeteerRequest(
    url,
    self.parse_result,
    screenshot=True
)

def parse_result(self, response):
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
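
The callback above writes every screenshot to image.png, so a crawl over several pages would keep overwriting one file. A possible variation, sketched below, derives the file name from the page URL and checks the PNG signature before writing. The helper name save_screenshot and the screenshots directory are assumptions made for illustration.

```python
import hashlib
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_screenshot(data, url, directory="screenshots"):
    """Write screenshot bytes to <directory>/<sha1 of url>.png and return the path."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("screenshot data does not look like a PNG")
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so each page gets a stable, filesystem-safe file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".png"
    path = target_dir / name
    path.write_bytes(data)
    return path
```

In a callback this could be called as save_screenshot(response.meta['screenshot'], response.url).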