Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Undetected Chromedriver | 5,335 | 31 | a day ago | 35 | March 16, 2022 | 668 | gpl-3.0 | Python | ||
Custom Selenium Chromedriver | Zero-Config | Passes ALL bot mitigation systems (like Distil / Imperva / DataDome / CloudFlare IUAM) | ||||||||||
Rod | 3,858 | 58 | 4 days ago | 385 | September 23, 2022 | 92 | mit | Go | ||
A Devtools driver for web automation and scraping | ||||||||||
Panther | 2,746 | 881 | 30 | 10 days ago | 21 | December 02, 2021 | 180 | mit | PHP | |
A browser testing and web crawling library for PHP and Symfony | ||||||||||
Oj | 856 | 4 | 7 | 5 months ago | 109 | September 13, 2021 | 24 | mit | Python | |
Tools for various online judges. Downloading sample cases, generating additional test cases, testing your code, and submitting it. | ||||||||||
Skrape.it | 623 | 3 | 3 months ago | 12 | February 23, 2022 | 26 | mit | Kotlin | ||
A Kotlin-based testing/scraping/parsing library providing the ability to analyze and extract data from HTML (server & client-side rendered). It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. It aims to be a testing lib, but can also be used to scrape websites in a convenient fashion. | ||||||||||
Spidermon | 458 | a month ago | 14 | December 23, 2021 | 52 | bsd-3-clause | Python | |||
Scrapy Extension for monitoring spiders execution. | ||||||||||
Morph | 454 | 8 months ago | 351 | agpl-3.0 | Ruby | |||||
Take the hassle out of web scraping | ||||||||||
Headlesschrome | 112 | 4 years ago | 1 | mit | Go | |||||
A Go package for working with headless Chrome. Run interactive JavaScript commands on web pages with Go and Chrome. | ||||||||||
Pittapi | 87 | 10 months ago | 9 | May 02, 2019 | 6 | gpl-2.0 | Python | |||
An API to easily get data from the University of Pittsburgh | ||||||||||
Script.module.openscrapers | 75 | 3 years ago | 1 | gpl-3.0 | Python | |||||
OpenScrapers Project |
skrape{it} is a Kotlin-based HTML/XML testing and web scraping library that can be used seamlessly in Spring-Boot, Ktor, Android or other Kotlin-JVM projects. The ability to analyze and extract HTML including client-side rendered DOM trees and all other XML-related markup specifications such as SVG, UML, RSS,... makes it unique. It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. First and foremost skrape{it} aims to be a testing tool (not tied to a particular test runner), but it can also be used to scrape websites in a convenient fashion.
In addition, extensions for well-known testing libraries are provided to extend them with the mentioned skrape{it} functionality. Currently available:
You'll always find the latest documentation, release notes and examples for official releases at https://docs.skrape.it. The README file you are reading right now provides examples related to the latest master. Use it if you don't want to wait for the latest changes to be released. If you don't want to read that much or just want a rough overview of how to use skrape{it}, have a look at the Documentation by Example section, which refers to the current master.
All our official/stable releases are published to Maven Central.
dependencies {
    implementation("it.skrape:skrapeit:1.2.2")
}
<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit</artifactId>
    <version>1.2.2</version>
</dependency>
We offer snapshot releases by publishing every successful build of a commit pushed to the master branch, so you can always install the latest implementation of skrape{it}. Be careful: these are non-official releases that may be unstable, and breaking changes can occur at any time.
repositories {
    maven { url = uri("https://oss.sonatype.org/content/repositories/snapshots/") }
}
dependencies {
    implementation("it.skrape:skrapeit:0-SNAPSHOT") { isChanging = true } // version number will stay - implementation may change ...
}
// optional
configurations.all {
    resolutionStrategy {
        cacheChangingModulesFor(0, "seconds")
    }
}
<repositories>
    <repository>
        <id>snapshot</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>it.skrape</groupId>
    <artifactId>skrapeit</artifactId>
    <version>0-SNAPSHOT</version>
</dependency>
You can find further examples in the project's integration tests.
We have a working Android sample using Jetpack Compose in our example projects as living documentation.
@Test
fun `can read and return html from String`() {
    htmlDocument("""
        <html>
            <body>
                <h1>welcome</h1>
                <div>
                    <p>first p-element</p>
                    <p class="foo">some p-element</p>
                    <p class="foo">last p-element</p>
                </div>
            </body>
        </html>""") {
        h1 {
            findFirst {
                text toBe "welcome"
            }
        }
        p {
            withClass = "foo"
            findFirst {
                text toBe "some p-element"
                className toBe "foo"
            }
        }
        p {
            findAll {
                text toContain "p-element"
            }
            findLast {
                text toBe "last p-element"
            }
        }
    }
}
data class MySimpleDataClass(
    val httpStatusCode: Int,
    val httpStatusMessage: String,
    val paragraph: String,
    val allParagraphs: List<String>,
    val allLinks: List<String>
)
class HtmlExtractionService {
    fun extract() {
        val extracted = skrape(HttpFetcher) {
            request {
                url = "http://localhost:8080"
            }
            response {
                MySimpleDataClass(
                    httpStatusCode = status { code },
                    httpStatusMessage = status { message },
                    allParagraphs = document.p { findAll { eachText } },
                    paragraph = document.p { findFirst { text } },
                    allLinks = document.a { findAll { eachHref } }
                )
            }
        }
        print(extracted)
        // will print:
        // MySimpleDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}
data class MyDataClass(
    var httpStatusCode: Int = 0,
    var httpStatusMessage: String = "",
    var paragraph: String = "",
    var allParagraphs: List<String> = emptyList(),
    var allLinks: List<String> = emptyList()
)
class HtmlExtractionService {
    fun extract() {
        val extracted = skrape(HttpFetcher) {
            request {
                url = "http://localhost:8080"
            }
            extractIt<MyDataClass> {
                it.httpStatusCode = statusCode
                it.httpStatusMessage = statusMessage.toString()
                htmlDocument {
                    it.allParagraphs = p { findAll { eachText } }
                    it.paragraph = p { findFirst { text } }
                    it.allLinks = a { findAll { eachHref } }
                }
            }
        }
        print(extracted)
        // will print:
        // MyDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}
@Test
fun `dsl can skrape by url`() {
    skrape(HttpFetcher) {
        request {
            url = "http://localhost:8080/example"
        }
        response {
            htmlDocument {
                // all official html and html5 elements are supported by the DSL
                div {
                    withClass = "foo" and "bar" and "fizz" and "buzz"
                    findFirst {
                        text toBe "div with class foo"
                        // it's possible to search for elements within former search results,
                        // e.g. search all matching span elements within the above div with class foo
                        span {
                            findAll {
                                // do something
                            }
                        }
                    }
                    findAll {
                        toBePresentExactlyTwice
                    }
                }
                // can handle custom tags as well
                "a-custom-tag" {
                    findFirst {
                        toBePresentExactlyOnce
                        text toBe "i'm a custom html5 tag"
                        text
                    }
                }
                // can handle custom tags written in css selector query syntax
                "div.foo.bar.fizz.buzz" {
                    findFirst {
                        text toBe "div with class foo"
                    }
                }
                // can handle custom tags and add selector specifics via the DSL
                "div.foo" {
                    withClass = "bar" and "fizz" and "buzz"
                    findFirst {
                        text toBe "div with class foo"
                    }
                }
            }
        }
    }
}
fun getDocumentByUrl(urlToScrape: String) = skrape(BrowserFetcher) { // <--- pass BrowserFetcher to include rendered JS
    request { url = urlToScrape }
    response { htmlDocument { this } }
}
fun main() {
    // do stuff with the document
    println(getDocumentByUrl("https://docs.skrape.it").eachLink)
}
AsyncFetcher provides coroutine support.
suspend fun getAllLinks(): Map<String, String> = skrape(AsyncFetcher) {
    request {
        url = "https://my-fancy.website"
    }
    response {
        htmlDocument { eachLink }
    }
}
class ExampleTest {
    val myPreConfiguredClient = skrape(HttpFetcher) {
        // url can be a plain url as string or built by #urlBuilder
        request {
            method = Method.POST // defaults to GET
            url = "" // you can either pass url as String (defaults to 'http://localhost:8080')
            url { // or build url (will respect value from url as String param)
                // thereby you can pass a url and just override or add parts
                protocol = UrlBuilder.Protocol.HTTPS // defaults to given scheme from url param (HTTP if not set)
                host = "skrape.it" // defaults to given host from url param (localhost if not set)
                port = 12345 // defaults to given port from url param (8080 if not set explicitly; no port if the given url param has none) - set to -1 to remove the port
                path = "/foo" // defaults to given path from url param (no path if not set)
                queryParam { // can handle adding query parameters of several types (defaults to none)
                    "foo" to "bar" // add query parameter foo=bar
                    "aaa" to false // add query parameter aaa=false
                    "bbb" to .4711 // add query parameter bbb=0.4711
                    "ccc" to 42 // add query parameter ccc=42
                    "ddd" to listOf("a", 1, null) // add query parameter ddd=a,1,null
                    +"xxx" // add query parameter xxx (just key, no value)
                }
            }
            timeout = 5000 // optional -> defaults to 5000ms
            followRedirects = true // optional -> defaults to true
            userAgent = "some custom user agent" // optional -> defaults to "Mozilla/5.0 skrape.it"
            cookies = mapOf("some-cookie-name" to "some-value") // optional
            headers = mapOf("some-custom-header" to "some-value") // optional
        }
    }
    @Test
    fun `can use preconfigured client`() {
        myPreConfiguredClient.response {
            status { code toBe 200 }
            // do more stuff
        }
        // slightly modify preconfigured client
        myPreConfiguredClient.apply {
            request {
                followRedirects = false
            }
        }.response {
            status { code toBe 301 }
            // do more stuff
        }
    }
}
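The comments in the queryParam block above describe how each value type ends up in the query string. The following is a minimal standalone sketch of that serialization using only the Kotlin standard library. The helper name `toQueryString` is hypothetical and is not part of skrape{it}; it only mirrors the behaviour the comments describe and does not percent-encode anything.

```kotlin
// Hypothetical helper (not skrape{it} API): joins key/value pairs the way the
// queryParam comments describe. A null value stands for a key-only parameter.
// Assumption: no percent-encoding is applied, so commas in lists stay as-is.
fun toQueryString(params: List<Pair<String, Any?>>): String =
    params.joinToString("&") { (key, value) ->
        when (value) {
            null -> key // key without value, like +"xxx"
            is Iterable<*> -> "$key=" + value.joinToString(",") { it.toString() }
            else -> "$key=$value" // strings, numbers and booleans via toString()
        }
    }
```

Called with the same pairs as in the example above (and `"xxx" to null` for the key-only entry), this would yield `foo=bar&aaa=false&bbb=0.4711&ccc=42&ddd=a,1,null&xxx`.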
skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.GET
        headers = mapOf("Content-Type" to "application/json")
        body = """{"foo":"bar"}"""
    }
    response {
        htmlDocument {
            ...
skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            data = "just a plain text" // content-type header will automatically be set to "text/plain"
            contentType = "your-custom/content" // can optionally override content-type
        }
    }
    response {
        htmlDocument {
            ...
skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            json("""{"foo":"bar"}""") // will automatically set content-type header to "application/json"
            // or
            xml("<foo>bar</foo>") // will automatically set content-type header to "text/xml"
            // or
            form("foo=bar") // will automatically set content-type header to "application/x-www-form-urlencoded"
        }
    }
    response {
        htmlDocument {
            ...
skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            // will automatically set content-type header to "application/json"
            // will create {"foo":"bar","xxx":{"a":"b","c":[1,"d"]}} as request body
            json {
                "foo" to "bar"
                "xxx" to json {
                    "a" to "b"
                    "c" to listOf(1, "d")
                }
            }
        }
    }
    response {
        htmlDocument {
            ...
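To give an idea of how a nested `json { "key" to value }` block can produce the request body shown in the comment above, here is a simplified standalone builder written with Kotlin lambdas with receiver. This is only a sketch of the general technique, not skrape{it}'s actual implementation; `JsonObjectBuilder` and `RawJson` are hypothetical names, and only backslashes and double quotes are escaped.

```kotlin
// Hypothetical, simplified builder; NOT skrape{it}'s implementation.
// Wrapper marking text that is already valid JSON (used for nested objects).
class RawJson(val text: String)

class JsonObjectBuilder {
    private val entries = mutableListOf<String>()

    // Member extension: inside the builder it takes precedence over the stdlib
    // `to`, so `"key" to value` records an entry instead of creating a Pair.
    infix fun String.to(value: Any?) {
        entries += "${render(this)}:${render(value)}"
    }

    // Nested `json { ... }` calls build a child object and embed it unquoted.
    fun json(init: JsonObjectBuilder.() -> Unit): RawJson =
        RawJson(JsonObjectBuilder().apply(init).build())

    fun build(): String = entries.joinToString(",", "{", "}")

    private fun render(value: Any?): String = when (value) {
        null -> "null"
        is RawJson -> value.text
        // Assumption: minimal escaping only (backslash and double quote).
        is String -> "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
        is Iterable<*> -> value.joinToString(",", "[", "]") { render(it) }
        else -> value.toString() // numbers and booleans
    }
}

// Top-level entry point that returns the serialized object.
fun json(init: JsonObjectBuilder.() -> Unit): String =
    JsonObjectBuilder().apply(init).build()
```

With this sketch, the same nested block as in the request above evaluates to the string `{"foo":"bar","xxx":{"a":"b","c":[1,"d"]}}`.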
skrape(HttpFetcher) {
    request {
        url = "https://www.my-fancy.url"
        method = Method.POST
        body {
            // will automatically set content-type header to "application/x-www-form-urlencoded"
            // will create foo=bar&xxx=1.5 as request body
            form {
                "foo" to "bar"
                "xxx" to 1.5
            }
        }
    }
    response {
        htmlDocument {
            ...
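For reference, an `application/x-www-form-urlencoded` body like the one in the comment above can be produced with the JDK alone. The sketch below uses `java.net.URLEncoder` and a hypothetical helper named `formEncode`. It illustrates the encoding format in general and makes no claim about how skrape{it} encodes form fields internally.

```kotlin
import java.net.URLEncoder

// Hypothetical helper (not skrape{it} API): encodes fields as
// application/x-www-form-urlencoded. URLEncoder turns spaces into '+'
// and reserved characters such as '&' into %XX escapes.
fun formEncode(vararg fields: Pair<String, Any?>): String =
    fields.joinToString("&") { (name, value) ->
        URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value.toString(), "UTF-8")
    }
```

For the fields used above, `formEncode("foo" to "bar", "xxx" to 1.5)` gives `foo=bar&xxx=1.5`. Values containing reserved characters get escaped, so `formEncode("q" to "a b&c")` gives `q=a+b%26c`.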
If you need help, have questions on how to use skrape{it}, or want to discuss features, please don't hesitate to use the project's discussions section on GitHub, or raise an issue if you found a bug.
skrape{it} is and always will be free and open-source. I try to reply to everyone needing help using these projects. Obviously, development and maintenance take time.
However, if you are using this project and are happy with it, or just want to encourage me to continue creating stuff or fund the caffeine and pizzas that fuel its development, there are a few ways you can do it: