Awesome Open Source

pastebin-scraper

This is a multithreaded scraping script for Pastebin. It scrapes the main site for new pastes, downloads their raw content, and processes them according to a user-defined output format.
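
As a rough illustration of that flow, here is a minimal sketch of a fixed pool of worker threads draining a queue of paste IDs. It is not the script's actual code: `run_workers`, `fetch_raw` and `handle_paste` are hypothetical names, and the download step is passed in as a plain function so the example runs without network access.

```python
import queue
import threading


def run_workers(paste_ids, fetch_raw, handle_paste, worker_count=2):
    """Download and process pastes with a fixed pool of worker threads.

    fetch_raw and handle_paste are hypothetical callables supplied by the
    caller; they stand in for the download and output steps.
    """
    jobs = queue.Queue()
    for paste_id in paste_ids:
        jobs.put(paste_id)

    def worker():
        while True:
            try:
                paste_id = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                handle_paste(paste_id, fetch_raw(paste_id))
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

A real scraper would keep feeding the queue as new pastes appear; here the list of IDs is fixed up front to keep the sketch short.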

WHY?

Fun.

Installation

The usual dance.

pip install -r requirements.txt

Define all required settings in settings.ini. If you choose a database output, make sure the respective connector is installed. At the moment, MySQL (via pymysql) and SQLite (via the sqlite3 module from the Python 3 standard library) are supported.

Also note that the file output creates a subdirectory named output and dumps every paste into it as a separate file.
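
To make the file output concrete, a minimal sketch of writing one paste per file into an output subdirectory could look like this. The function name `write_paste`, the `.txt` suffix and the use of the paste ID as the file name are assumptions for illustration, not the script's actual naming scheme.

```python
from pathlib import Path


def write_paste(paste_id, content, base_dir=".", encoding="utf-8"):
    """Write a single paste into <base_dir>/output/ as its own file.

    The file name pattern (<paste_id>.txt) is an assumption.
    """
    out_dir = Path(base_dir) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the subdirectory on first use
    target = out_dir / f"{paste_id}.txt"
    target.write_text(content, encoding=encoding)
    return target
```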

Settings

ini is a highly underrated file format. Here are some definitions of what the settings parameters actually do.

GENERAL

  • PasteLimit: Stop after having scraped n pastes. Set to 0 for indefinite scraping
  • PBLink: URL to Pastebin or another equivalent site
  • DownloadWorkers: Number of workers that download the raw paste content and further process it
  • NewPasteCheckInterval: Time to wait before checking the main site for new pastes again
  • IPBlockedWaitTime: Time to wait until checking the main site again after the scraper's IP has been blocked
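
For orientation, a GENERAL section can be read with the standard configparser module along these lines. The option names come from the list above; the sample values, the assumption that both wait times are whole seconds, and the helper name `load_general` are illustrative only.

```python
import configparser

# Sample values are assumptions; only the option names come from the README.
SAMPLE = """
[GENERAL]
PasteLimit = 0
PBLink = https://pastebin.com
DownloadWorkers = 2
NewPasteCheckInterval = 60
IPBlockedWaitTime = 3600
"""


def load_general(text):
    """Parse the GENERAL section from ini text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    general = parser["GENERAL"]
    return {
        "paste_limit": general.getint("PasteLimit"),
        "pb_link": general.get("PBLink"),
        "download_workers": general.getint("DownloadWorkers"),
        # Assumed to be seconds; the README does not state the unit.
        "check_interval": general.getint("NewPasteCheckInterval"),
        "blocked_wait": general.getint("IPBlockedWaitTime"),
    }
```

To read the real file instead of the sample string, `parser.read("settings.ini")` can replace `read_string`.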

LOGGING

  • RotationLog: Location of log file that contains debug output
  • MaxRotationSize: Size in bytes before another log file is created
  • RotationBackupCount: Maximum number of log files to keep
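
These three options map naturally onto the standard library's RotatingFileHandler. The sketch below shows one way to wire them up; whether the script uses this exact handler is an assumption, and `build_logger` and the logger name are hypothetical.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(rotation_log, max_rotation_size, rotation_backup_count):
    """Create a debug logger that rotates its file once it grows too large."""
    logger = logging.getLogger("pastebin-scraper-sketch")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(
        rotation_log,
        maxBytes=max_rotation_size,         # MaxRotationSize, in bytes
        backupCount=rotation_backup_count,  # RotationBackupCount
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With this handler, backupCount counts the rotated files kept in addition to the active log file.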

STDOUT / FILE

  • Enable: Enable formatted stdout output of paste data
  • ContentDisplayLimit: Maximum number of characters to show before content is cut off (0 to display all)
  • ShowName: Display the paste name
  • ShowLang: Display the paste language
  • ShowLink: Display the complete paste link
  • ShowData: Display the raw paste content
  • DataEncoding: Encoding of the raw paste data
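
How these switches might combine is sketched below. The paste dictionary layout, the label text, the link pattern and the function name `format_paste` are all assumptions made for the example; only the option semantics follow the list above.

```python
def format_paste(paste, show_name=True, show_lang=True, show_link=True,
                 show_data=True, content_display_limit=0,
                 data_encoding="utf-8", pb_link="https://pastebin.com"):
    """Build a printable text block for one paste.

    `paste` is a hypothetical dict with keys id, name, lang and data (bytes).
    """
    lines = []
    if show_name:
        lines.append(f"Name: {paste['name']}")
    if show_lang:
        lines.append(f"Lang: {paste['lang']}")
    if show_link:
        lines.append(f"Link: {pb_link}/{paste['id']}")
    if show_data:
        # Decode the raw bytes with the configured encoding.
        data = paste["data"].decode(data_encoding, errors="replace")
        if content_display_limit > 0:  # 0 means "display all"
            data = data[:content_display_limit]
        lines.append(data)
    return "\n".join(lines)
```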

MYSQL

  • Enable: Enable MySQL output
  • TableName: Main table name to insert data into
  • Host: MySQL server host
  • Port: MySQL server port
  • Username: MySQL server user
  • Password: User password

SQLITE

  • Enable: Enable SQLite output
  • Filename: Filename the db should be saved as (usually ends with .db)
  • TableName: Main table name to insert data into
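
A minimal sketch of the SQLite output using the built-in sqlite3 module follows. The table schema (name, lang, link, data columns) and the helper names `open_store` and `insert_paste` are assumptions; the script's real schema may differ.

```python
import sqlite3


def open_store(filename, table_name):
    """Open (or create) the database file and make sure the table exists."""
    if not table_name.isidentifier():
        # Table names cannot be bound as SQL parameters, so validate them.
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn = sqlite3.connect(filename)
    # Column layout is an assumption for illustration.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table_name} ("
        "id INTEGER PRIMARY KEY, name TEXT, lang TEXT, link TEXT, data TEXT)"
    )
    conn.commit()
    return conn


def insert_paste(conn, table_name, name, lang, link, data):
    """Insert one paste row; values are bound as parameters."""
    if not table_name.isidentifier():
        raise ValueError(f"unsafe table name: {table_name!r}")
    conn.execute(
        f"INSERT INTO {table_name} (name, lang, link, data) VALUES (?, ?, ?, ?)",
        (name, lang, link, data),
    )
    conn.commit()
```

Passing ":memory:" as the filename gives a throwaway database, which is handy for trying the output format before pointing it at a real .db file.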

If you use this thing for some cool data analysis or even research, let me know if I can help!

Inspiration for this scraper was taken from here.

