Net.jgp.labs.spark

Apache Spark examples exclusively in Java

Some Java examples for Apache Spark

Welcome

Welcome to this project, which I started several years ago with a simple idea: let's use Apache Spark with Java and not learn all that complex stuff like Hadoop or Scala. I am not that smart anyway...

A book!

This project has evolved into a book, "Spark in Action, 2nd edition", published by Manning Publications. If you want to know more and be guided through your Spark learning process, I can only recommend reading the book. Find out more about Spark in Action, 2nd edition, on the Manning website. The book contains more examples and more explanations, and it is professionally written and edited.

Spark in Action, 2e covers using Spark with Java, Python (PySpark), and Scala.

All Spark in Action's examples are on GitHub. Here are the repos with the book examples:

Chapter 1 So, what is Spark, anyway? An introduction to Spark with a simple ingestion example.

Chapter 2 Architecture and flows Mental model around Spark and exporting data to PostgreSQL from Spark.

Chapter 3 The majestic role of the dataframe.

Chapter 4 Fundamentally lazy.

Chapter 5 Building a simple app for deployment and Chapter 6 Deploying your simple app.

Chapter 7 Ingestion from files.

Chapter 8 Ingestion from databases.

Chapter 9 Advanced ingestion: finding data sources & building your own.

Chapter 10 Ingestion through structured streaming.

Chapter 11 Working with Spark SQL.

Chapter 12 Transforming your data.

Chapter 13 Transforming entire documents.

Chapter 14 Extending transformations with user-defined functions (UDFs).

Chapter 15 Aggregating your data.

Chapter 16 Cache and checkpoint: enhancing Spark’s performances.

Chapter 17 Exporting data & building full data pipelines.

In the meantime, this project is still live, with more raw-level examples that may (or may not) work.

This repo

This project is still live as I add experiments and answers to StackOverflow questions. I try to keep it up to date with the latest version of Spark, but I must admit I only validate that the code compiles.

Environment

These labs rely on:

  • Apache Spark v3.2.0 (based on Scala v2.12).
  • Java 8.

Notes on Branches

The master branch will always track the latest version of Spark, currently v3.2.0.

Labs

A few labs around Apache Spark, exclusively in Java.

The labs are now organized in subpackages:

  • l000_ingestion: Data ingestion from various sources.
  • l020_streaming: Data ingestion via streaming. Special note on Streaming.
  • l050_connection: Connect to Spark.
  • l100_checkpoint: Checkpoint introduced in Spark v2.1.0.
  • l150_udf: UDF (User Defined Functions).
  • l200_join: join examples.
  • l240_foreach: foreach() on a dataframe.
  • l250_map: map (in the context of mapping, not always linked to map/reduce).
  • l300_reduce: reduce.
  • l400_industry_formats: working with industry formats, limited, for now, to HL7 and FHIR.
  • l500_misc: other examples.
  • l600_ml: ML (Machine Learning).
  • l700_save: saving your results.
  • l800_concurrency: labs around concurrent access (work in progress).
  • l900_analytics: More complex examples of using Spark for Analytics.
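
The difference between the l250_map and l300_reduce labs can be sketched without Spark at all. The snippet below is only an analogy in plain Java 8 streams, not code taken from this repo: the class name, the method names, and the sample data are all hypothetical. Mapping produces one output element per input element, while reducing folds all elements into a single value. Spark applies the same ideas to distributed data, so the labs themselves look different.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; it does not exist in the repo.
class MapVersusReduceSketch {

  /** Mapping: each line is transformed into its length, one result per input. */
  static List<Integer> lineLengths(List<String> lines) {
    return lines.stream()
        .map(String::length)
        .collect(Collectors.toList());
  }

  /** Reducing: all line lengths are folded into a single total. */
  static int totalLength(List<String> lines) {
    return lines.stream()
        .map(String::length)
        .reduce(0, Integer::sum);
  }

  public static void main(String[] args) {
    // Made-up sample data, standing in for rows of a dataset.
    List<String> lines = Arrays.asList("spark", "java", "labs");
    System.out.println(lineLengths(lines)); // prints [5, 4, 4]
    System.out.println(totalLength(lines)); // prints 13
  }
}
```

The mapping step never needs to see more than one element at a time, which is why it can be useful on its own, outside of any map/reduce pairing.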

Contribute

  • If you would like to see more labs, send your request to jgp at jgp dot net or @jgperrin on Twitter.
  • Contact me as well if you want to add some of your examples to this repo (or simply do a pull request).