|Project Name||Stars||Downloads||Repos Using This||Packages Using This||Most Recent Commit||Total Releases||Latest Release||Open Issues||License||Language|
|Spark||35,323||2,394||882||6 hours ago||46||May 09, 2021||216||apache-2.0||Scala|
|Apache Spark - A unified analytics engine for large-scale data processing|
|Sparkinternals||4,665||a year ago||27|
|Notes talking about the design and implementation of Apache Spark|
|Bigdl||4,179||10||5 hours ago||16||April 19, 2021||718||apache-2.0||Jupyter Notebook|
|Fast, distributed, secure AI for Big Data|
|Hudi||4,058||26||6 hours ago||13||August 16, 2022||619||apache-2.0||Java|
|Upserts, Deletes And Incremental Processing on Big Data.|
|Synapseml||3,951||1||3 days ago||5||January 12, 2022||281||mit||Scala|
|Simple and Distributed Machine Learning|
|Coolplayspark||3,333||10 months ago||35||Scala|
|Cool Play Spark: Spark source code analysis, Spark libraries, and more|
|Koalas||3,228||1||12||3 months ago||47||October 19, 2021||109||apache-2.0||Python|
|Koalas: pandas API on Apache Spark|
|Spark Nlp||3,159||2||2||6 hours ago||90||March 05, 2021||37||apache-2.0||Scala|
|State of the Art Natural Language Processing|
|Interactive and Reactive Data Science using Scala and Spark.|
|Deequ||2,717||4||3 days ago||31||February 15, 2022||124||apache-2.0||Scala|
|Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.|
Welcome to this project, which I started several years ago with a simple idea: let's use Apache Spark with Java and not learn all that complex stuff like Hadoop or Scala. I am not that smart anyway...
This project has evolved into a book, "Spark in Action, 2nd edition", published by Manning Publications. If you want to know more and be guided through your Spark learning process, I can only recommend reading the book. Find out more about Spark in Action, 2nd edition, on the Manning website. The book contains more examples and more explanation, and it is professionally written and edited.
Spark in Action, 2e covers using Spark with Java, Python (PySpark), and Scala.
All Spark in Action's examples are on GitHub. Here are the repos with the book examples:
Chapter 1 So, what is Spark, anyway? An introduction to Spark with a simple ingestion example.
Chapter 2 Architecture and flows Mental model around Spark and exporting data to PostgreSQL from Spark.
Chapter 3 The majestic role of the dataframe.
Chapter 4 Fundamentally lazy.
Chapter 5 Building a simple app for deployment and Chapter 6 Deploying your simple app.
Chapter 7 Ingestion from files.
Chapter 8 Ingestion from databases.
Chapter 9 Advanced ingestion: finding data sources & building your own.
Chapter 10 Ingestion through structured streaming.
Chapter 11 Working with Spark SQL.
Chapter 12 Transforming your data.
Chapter 13 Transforming entire documents.
Chapter 14 Extending transformations with user-defined functions (UDFs).
Chapter 15 Aggregating your data.
Chapter 16 Cache and checkpoint: enhancing Spark’s performances.
Chapter 17 Exporting data & building full data pipelines.
In the meantime, this project is still live, with rawer examples that may (or may not) work.
I keep adding experiments and answers to Stack Overflow questions. I try to keep the project up to date with the latest version of Spark, but I must admit I only check that the code compiles.
These labs rely on:
The master branch will always contain the latest version of Spark, currently v3.2.0.
A few labs around Apache Spark, exclusively in Java.
The labs are now organized in subpackages: