A web crawler is a bot program that fetches resources from the web to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that draws on recent advances in distributed computing and information retrieval by combining various Apache projects such as Spark, Kafka, Lucene/Solr, and Tika, along with pf4j. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.
To use Sparkler, install Docker and run the commands below:
    # Step 0. Get this script
    wget https://raw.githubusercontent.com/USCDataScience/sparkler/master/sparkler-core/bin/dockler.sh
    # Step 1. Run the script - it starts docker container and forwards ports to host
    bash dockler.sh
    # Step 2. Inject seed urls
    /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
    # Step 3. Start the crawl job
    /data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2  # id=1, top 100 URLs, do -i=2 iterations
1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
   a. emacs sparkler/bin/seed-urls.txt
   b. Copy and paste your URLs.
   c. Ctrl+x Ctrl+s to save.
   d. Ctrl+x Ctrl+c to quit the editor.
   [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]
   * Note: You can also use the Vim or Nano editors, or run this command instead:
     echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt
3. Inject the seed URLs using the following command (assuming you are in the sparkler/bin directory):
   $ bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job.
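As an alternative to editing the seed file by hand, it can be generated from a script. The sketch below is only an illustration: the URLs are placeholders, and the step that drops blank lines and duplicates is an assumption about what is convenient rather than something Sparkler requires.

```shell
#!/usr/bin/env bash
set -euo pipefail

# File name taken from the steps above; it is written to the current directory.
seed_file="seed-urls.txt"

# Write one URL per line. These URLs are placeholders, not real seeds.
cat > "$seed_file" <<'EOF'
http://example1.com
http://example2.com

http://example1.com
EOF

# Drop blank lines and duplicate URLs while keeping the original order.
awk 'NF && !seen[$0]++' "$seed_file" > "$seed_file.tmp"
mv "$seed_file.tmp" "$seed_file"

# Report how many URLs remain in the file.
url_count=$(wc -l < "$seed_file" | tr -d ' ')
echo "$url_count seed URLs ready in $seed_file"
```

The resulting file can then be passed to the inject command with -sf, as in step 3.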
To crawl until no new URLs remain, use -i -1. Example:
/data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
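To make the difference between a bounded and an unbounded crawl concrete, here is a small sketch of a helper that assembles the crawl command. The function name crawl_cmd is hypothetical, and it uses only the -id, -tn, and -i flags shown in the examples above. It prints the command instead of running it, so it can be tried outside the container.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Path to the script as it appears inside the container in the steps above.
SPARKLER="/data/sparkler/bin/sparkler.sh"

# crawl_cmd JOB_ID ITERATIONS [TOP_N]  (hypothetical helper)
# Prints the crawl command; -tn is added only when TOP_N is given.
crawl_cmd() {
  local job_id="$1" iterations="$2" top_n="${3:-}"
  local cmd="$SPARKLER crawl -id $job_id"
  if [ -n "$top_n" ]; then
    cmd="$cmd -tn $top_n"
  fi
  cmd="$cmd -i $iterations"
  echo "$cmd"
}

# Bounded crawl: job 1, two iterations, top 100 URLs.
bounded=$(crawl_cmd 1 2 100)
# Unbounded crawl: job 1, iterate until no new URLs remain.
unbounded=$(crawl_cmd 1 -1)

echo "$bounded"
echo "$unbounded"
```

Inside the container, the printed line can be pasted into the shell to start the job.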
Access the dashboard at http://localhost:8983/banana/ (forwarded from the Docker container). The dashboard should look like the one below: