A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that builds on recent advances in distributed computing and information retrieval by combining several Apache projects (Spark, Kafka, Lucene/Solr, and Tika) with the pf4j plugin framework. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.
To use Sparkler, install Docker and run the commands below:
# Step 0. Get the image
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main

# Step 1. Create a volume for elastic
docker volume create elastic

# Step 2. Inject seed urls
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'

# Step 3. Start the crawl job
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -tn 100 -i 2   # crawl job myid, top 100 URLs, 2 iterations
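If you have several seed URLs, you can mount a local seed file into the container and inject it with the -sf flag used by the local install below. This is a minimal sketch, assuming the containerized sparkler accepts the same -sf flag and that /tmp/seed-urls.txt is a usable mount target inside the container:

# Create a local seed file (one URL per line), then mount it into the container and inject.
echo -e "http://www.bbc.com/news\nhttp://example.com" > seed-urls.txt
docker run -v elastic:/elasticsearch-7.17.0/data \
    -v "$PWD/seed-urls.txt:/tmp/seed-urls.txt" \
    ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -sf /tmp/seed-urls.txt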
1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
   a. emacs sparkler/bin/seed-urls.txt
   b. paste your URLs, one per line
   c. Ctrl+x Ctrl+s to save
   d. Ctrl+x Ctrl+c to quit the editor
   [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]
   * Note: You can also use the Vim or Nano editors, or create the file in one step with: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt
3. Inject the seed URLs with the following command (assuming you are in the sparkler/bin directory):
   $ bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job, as shown in the example after this list.
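Step 4 uses the same crawl command as the Docker quick start; a minimal example, reusing the -tn and -i flags shown in Step 3 above (the values here are illustrative):

$ bash sparkler.sh crawl -id 1 -tn 100 -i 2   # crawl job id 1, top 100 URLs per iteration, 2 iterations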
To crawl until all new URLs are exhausted, use -i -1. Example:
/data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
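Because an open-ended crawl (-i -1) can run for a long time, you may want to detach it from your terminal and keep a log. A generic shell pattern (the log file name crawl-1.log is an arbitrary choice):

# Run the open-ended crawl in the background and capture all output to a log file.
nohup /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1 > crawl-1.log 2>&1 &
tail -f crawl-1.log   # follow progress; Ctrl+C stops tailing, not the crawl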