A web crawler is a bot that fetches resources from the web to build applications such as search engines and knowledge bases. Sparkler (a contraction of Spark-Crawler) is a new web crawler that builds on recent advances in distributed computing and information retrieval by combining several Apache projects (Spark, Kafka, Lucene/Solr, and Tika) with the pf4j plugin framework. Sparkler is an extensible, highly scalable, high-performance web crawler that evolved from Apache Nutch and runs on an Apache Spark cluster.
To use Sparkler, install Docker and run the commands below:
# Step 0. Get the image
docker pull ghcr.io/uscdatascience/sparkler/sparkler:main

# Step 1. Create a volume for elastic
docker volume create elastic

# Step 2. Inject seed urls
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'

# Step 3. Start the crawl job
docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main crawl -id myid -tn 100 -i 2   # crawl job myid, top 100 URLs, 2 iterations
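If you have several seed URLs, you can mount a local seed file into the container and inject it with the -sf flag used by the local install below. This is a minimal sketch, assuming the containerized sparkler accepts the same -sf flag and that /tmp/seed-urls.txt is a usable mount target inside the container:

# Create a local seed file (one URL per line), then mount it into the container and inject.
echo -e "http://www.bbc.com/news\nhttp://example.com" > seed-urls.txt
docker run -v elastic:/elasticsearch-7.17.0/data \
    -v "$PWD/seed-urls.txt:/tmp/seed-urls.txt" \
    ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -sf /tmp/seed-urls.txt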
1. Follow Steps 0-1.
2. Create a file named seed-urls.txt using the Emacs editor as follows:
   a. emacs sparkler/bin/seed-urls.txt
   b. paste your URLs, one per line
   c. Ctrl+x Ctrl+s to save
   d. Ctrl+x Ctrl+c to quit the editor
   [Reference: http://mally.stanford.edu/~sr/computing/emacs.html]
   * Note: You can also use the Vim or Nano editors, or create the file in one step with: echo -e "http://example1.com\nhttp://example2.com" >> seed-urls.txt
3. Inject the seed URLs with the following command (assuming you are in the sparkler/bin directory):
   $ bash sparkler.sh inject -id 1 -sf seed-urls.txt
4. Start the crawl job, as shown in the example after this list.
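Step 4 uses the same crawl command as the Docker quick start; a minimal example, reusing the -tn and -i flags shown in Step 3 above (the values here are illustrative):

$ bash sparkler.sh crawl -id 1 -tn 100 -i 2   # crawl job id 1, top 100 URLs per iteration, 2 iterations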
To crawl until all new URLs are exhausted, use -i -1. Example:
/data/sparkler/bin/sparkler.sh crawl -id 1 -i -1
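Because an open-ended crawl (-i -1) can run for a long time, you may want to detach it from your terminal and keep a log. A generic shell pattern (the log file name crawl-1.log is an arbitrary choice):

# Run the open-ended crawl in the background and capture all output to a log file.
nohup /data/sparkler/bin/sparkler.sh crawl -id 1 -i -1 > crawl-1.log 2>&1 &
tail -f crawl-1.log   # follow progress; Ctrl+C stops tailing, not the crawl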