Awesome Open Source


Search results for java crawler

186 search results found

Tinkertime ⭐ 26

NO LONGER SUPPORTED - The Ultimate KSP Mod Mechanic

Java web crawling library

Real_time_social_media_mining ⭐ 24

DevOps pipeline for Real Time Social/Web Mining

Myhttpclient ⭐ 23

Crawler framework that wraps tools such as HttpClient, HtmlUnit, and Selenium

Httpproxy ⭐ 23

An IP proxy pool implemented in Java, supporting both HTTP and HTTPS

Distributed, asynchronous web crawler

Gecco Redis ⭐ 23

Gecco crawler with distributed crawling support via Redis

Crawling Framework ⭐ 21

Easily crawl news portals or blog sites using Storm Crawler.

a biodiversity dataset tracker

A dataset for knowledge base population research using Common Crawl and DBpedia.

Ekşi Sözlük crawling, statistics, and API work

Httrack2warc ⭐ 20

Converts HTTrack crawls to WARC files

A simple, scalable, and highly efficient web crawler framework for Java.

Common Crawl Quick Hacks ⭐ 18

Common Crawl quick hack examples

Sitecrawler ⭐ 18

This is a Java library for crawling the content of web properties (for example, www.salesforce.com and blogs.salesforce.com). It supports dynamic scaling out of the box, depending on available machine power (CPU, RAM) and network capacity. It also has a plugin structure that allows others to write code (plugins) that acts on the crawled pages.

Kairos combines a focused crawler and an information extraction engine to convert a list of conference websites into an index filled with metadata fields that correspond to individual papers. Using event date metadata extracted from the conference website, Kairos proactively harvests metadata about the individual papers soon after they are made public. We use a Maximum Entropy classifier to classify uniform resource locators (URLs) as scientific conference websites and use Conditional Random

A simple and flexible web crawler framework for Java.

Webtoon Crawler ⭐ 17

Let's download webtoons while they are free!

Douyin Crawler ⭐ 16

Douyin crawler. Crawls users' posts and likes through a mobile proxy.

Cc Bill Tracker ⭐ 16

These map reduce functions use Common Crawl data to look at the spread of congressional legislation on the internet

Leetcodecrawler ⭐ 15

A tool for crawling problem descriptions and accepted submissions from the LeetCode and LeetCode-CN websites.

Tentacle ⭐ 15

An open-source spider written in Java

Paperwebcrawler ⭐ 15

Crawler for paper websites such as IEEE Xplore

Webhunger ⭐ 15

WebHunger is an extensible, full-scale crawler framework that supports distributed crawling. It aims to let users focus on web page parsing without worrying about the crawling process.

Googleplay Web Crawler ⭐ 15

Mapreduce project by Hadoop, Nutch, AWS EMR, Pig, Tez, Hive

Groundhog ⭐ 14

A framework for crawling GitHub projects and raw data and extracting metrics from them

Twitter Crawler ⭐ 14

REST and STREAMING crawlers of Twitter (java)

Brings 1.13 and 1.14 movement like swimming and crawling into 1.12.2! Based off https://github.com/pentantan/BetterSwiming

Nutch In Java ⭐ 14

How to use Apache Nutch without command line

Paper plugin 1.20 - /crash, /crawl, /lunar, /vanish, /sit - client detection

Springbootdc ⭐ 13

SpringBoot Developer Components

Lightshot ⭐ 13

Lightshot image grabber

Serritor ⭐ 13

Serritor is an open source web crawler framework built upon Selenium and written in Java. It can be used to crawl dynamic web pages that require JavaScript to render data.

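Roughly speaking, crawls popular content from sites blocked in mainland China, such as YouTube, and reposts it to sites accessible from the mainland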

Nio Crawler ⭐ 13

Simple Java Crawler using NIO

Robots.txt ⭐ 13

🤖 robots.txt as a service. Crawls robots.txt files, downloads and parses them to check rules through an API

Spotifydiscoverybot ⭐ 13

A Java-based bot that automatically crawls for new releases by your followed artists on Spotify. Never miss a release again!

Venom Tutorial ⭐ 12

A tutorial based on your preferred open source focused crawler for the deep web.

Torfiles ⭐ 12

An open-source torrent searching service

Fastcrawler ⭐ 12

A fast, simple, multithreaded web crawler framework

Pttcrawler ⭐ 12

PTTCrawler is a powerful PTT crawler written in Java

Pubsenti Finder ⭐ 12

Weibo comment sentiment analysis, crawler, text classification, web.

Crawler Framework ⭐ 12

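Distributed crawler framework: simulates user requests with WebDriver, passes messages through Kafka, stores pages in a distributed fashion with HBase, and includes an IP service and a number-verification service; the proxy page is accessed through H5 and web versions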

In One File Manager ⭐ 12

Desktop File Manager for Windows

Knowledge Distillation ⭐ 12

site crawler for knowledge graph

Warc Mapreduce ⭐ 11

warc and wet support for Hadoop's mapreduce api

Confluence Static Cache ⭐ 11

Generates static file cache for Confluence

Supermonkey ⭐ 11

A crawler for automated Android UI testing.

a groory spider .

Springbootcrawlerdb ⭐ 11

A Spring Boot web crawler setup/example with crawler4j, Jsoup, Spring Data JPA (Hibernate), PostgresDB.

Chatper15_net_io_img_crawler ⭐ 11

Chapter 15: Kotlin file I/O and multithreading

Ghost Login ⭐ 11

Designed for web crawlers that must log in to a website, by simulated means, before collecting web data.

Apk Crawler ⭐ 10

APK-Crawler is a tool for collecting apk files.

Instagram Crawler ⭐ 10

Spring Boot Integration Crawler Sample ⭐ 10

a spring boot + spring integration crawler sample.

Dwtc Extractor ⭐ 10

Extraction code used to create the Dresden Web Table Corpus

Nutch Crawler ⭐ 9

Apache Nutch fork tuned for web services and data discovery.

Weather Mrs ⭐ 9

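Weather data collected by a crawler, distributed in real time through Kafka, gathered by Flume and loaded into HBase, then mapped between Hive and HBase, with Superset used to analyze and display the data.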

Codechef Crawler ⭐ 9

[deprecated] A web crawler that can download all successful submissions of a user in Codechef. Just provide the username.

Dhtcrawler ⭐ 9

Crawl torrent in Java

Webmuncher ⭐ 9

A web scraper/crawler in Java

Bilibili Plugin ⭐ 9

Bilibili plugin

Quora Loader ⭐ 9

A realtime read-only locator and extraction library for Quora questions and answers.

Nutch Indexer Discovery ⭐ 9

Watson Discovery Service indexing plugin for Apache Nutch

Hubs is a content crawler application for Android. It provides APIs to crawl web content and display data.

Kaboomzhihu ⭐ 9

Batch follow and batch unfollow on Zhihu

Analysis Platform for Developer Learning Resources

Questionrecommendation ⭐ 8

Programming Questions Recommendation System (question recommendation system for Nowcoder)

Adaptive Crawler ⭐ 8

Twitter Adaptive Crawler

Dwtc Tools ⭐ 8

Dresden Web Table Corpus Java library

Agrotagger ⭐ 8

This application indexes web documents, creating RDF triples that link a web URL to URIs of a SKOS thesaurus

An Android Client for LeetCode

Zhihu Crawler ⭐ 8

A simple ZhiHu Crawler using WebMagic

Chronicrawl ⭐ 8

Experimental continuous web crawler for web archiving

Java implementation of the Internet Research Lab Web Crawler (IRLbot) as presented by Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov in their paper "IRLbot: Scaling to 6 Billion Pages and Beyond"

Docker Codesearch ⭐ 8

Code Search on Fess

Gecco is a lightweight, easy-to-use web crawler developed in Java. Gecco integrates jsoup, httpcl request. If you like this crawler framework, please star or fork!

Web archive collection manager

Ukwa Heritrix ⭐ 8

The UKWA Heritrix3 custom modules and Docker builder.

Leechcrawler ⭐ 8

Incremental crawling capabilities for Apache Tika. Crawl content out of e.g. file systems, http(s) sources (webcrawling) imap(s) servers or your own arbitrary data sources. LeechCrawler offers additional Tika parsers providing these crawling capabilities.

Atom Nuke ⭐ 8

Renren Analysis ⭐ 8

A project for crawling and data visualization on renren.com

Eksiseyler ⭐ 8

Sample MVP project uses jsoup-web-crawl like API

Rxcrawler ⭐ 8

A Java crawler based on rx-java

Baidu Search Result Crawler ⭐ 8

A crawler that retrieves Baidu search result content.

Cis555 Project ⭐ 7

The final project for CIS555 at the University of Pennsylvania.

Movie Showtimes ⭐ 7

Web Service & Android Application to look up Vietnam movie showtimes

Crawlerzwei ⭐ 7

CZwei Crawler for www.2ch.net

Fns_front ⭐ 7

🕶👨‍🎤👩‍🎤Fashion Network Service.

Crawl RSS - Heritrix 3 add-on

Vertx Crawler ⭐ 7

Web Crawler based on Vert.x

A simple Crawler-based search engine that demonstrates the main features of a search engine (web crawling, indexing and ranking) and the interaction between them using Java and a Web Interface.

Born2crawl ⭐ 7

A highly performant and versatile crawling engine, designed with scalability and extensibility in mind.

Fess Ds Atlassian ⭐ 7

DataStore Crawler for JIRA/Confluence

Java Learn ⭐ 7

🕷 a flexible web crawler framework

Ryanair crawler based on WebKit

Cloud Computing Search Engine ⭐ 7

A cloud-based web search engine running Hadoop MapReduce on Amazon EC2, consisting of a crawler, an indexer, and PageRank.

Httrack2arc ⭐ 7

HTTrack2Arc is a tool that converts crawls made by HTTrack to Internet Archive ARC files.


101-186 of 186 search results


Copyright 2018-2024 Awesome Open Source. All rights reserved.