Spark Version: 1.0.2, Doc Version: 1.0.2.0
| Weibo Id | Name | Contribution |
|---|---|---|
| @JerryLead | Lijie Xu | Author of the original Chinese version, and English version update |
| @juhanlol | Han JU | English version and update (Chapters 0, 1, 3, 4, and 7) |
| @invkrh | Hao Ren | English version and update (Chapters 2, 5, and 6) |
| @AorJoa | Bhuridech Sudsee | Thai version |
This series discusses the design and implementation of Apache Spark, with a focus on its design principles, execution mechanisms, system architecture, and performance optimization. It also includes some comparisons with Hadoop MapReduce in terms of design and implementation. I'm reluctant to call this document a "code walkthrough", because the goal is not to analyze each piece of code in the project, but to understand the whole system in a systematic way (by analyzing the execution procedure of a Spark job, from its creation to completion).
There are many ways to discuss a computer system. Here, we've chosen a problem-driven approach: first a concrete problem is introduced, then it is analyzed step by step. We'll start from a typical Spark example job (sketched below) and then discuss all the related important system modules. I believe this approach is better than diving into each module right from the beginning.
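To give a taste of that starting point, here is a minimal sketch of such an example job, written against the Spark 1.x API that this documentation targets. The app name, master URL, and data are illustrative, not from the original text; the point is the flow the series traces, from RDD creation through the shuffle introduced by groupByKey() to the action that submits the job.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

object ExampleJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ExampleJob").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Build an RDD, transform it, then trigger execution with an action.
    val pairs = sc.parallelize(1 to 100, 4).map(x => (x % 10, x)) // key by last digit
    val grouped = pairs.groupByKey() // transformation with a shuffle dependency
    println(grouped.count())         // action: this is what submits the job

    sc.stop()
  }
}
```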
The target audience of this series is geeks who want a deeper understanding of Apache Spark and other distributed computing frameworks.
I'll try my best to keep this documentation up to date with Spark, since it's a fast-evolving project with an active community. The documentation's main version is in sync with Spark's version, and the additional number at the end represents the documentation's update version.
For a more academically oriented discussion, please check out Matei's PhD thesis and other related papers. You can also have a look at my blog (in Chinese).
It's been a while since I last wrote such complete documentation. The last time was about three years ago, when I was studying Andrew Ng's ML course, and I was really motivated at that time! This time I've spent 20+ days on this document, from the summer break till now (August 2014). Most of the time was spent on debugging, drawing diagrams, and thinking about how to present my ideas in the right way. I hope you find this series helpful.
We start from the creation of a Spark job, and then discuss its execution. Finally, we dive into some related system modules and features.
The Chinese version is at markdown/, and the Thai version is at markdown/thai.
The documentation is written in Markdown. A PDF version is also available here.
If you're on Mac OS X, I recommend MacDown with a GitHub theme for reading.
Thanks to @Yourtion for creating the GitBook version.
Online reading: http://spark-internals.books.yourtion.com/
While writing, I created some examples to debug the system; they are available under SparkLearning/src/internals.
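If you want to poke at the system the same way while reading, one handy starting point is printing an RDD's lineage with toDebugString, which is part of the public RDD API. The snippet below is a hedged sketch (the sample RDD and app name are illustrative, not taken from the SparkLearning examples):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits in Spark 1.x

// Print the lineage (chain of parent RDDs) behind a shuffled RDD,
// to cross-check the logical plans drawn in these chapters.
val sc = new SparkContext("local[2]", "LineageDemo")
val grouped = sc.parallelize(1 to 100, 4).map(x => (x % 10, x)).groupByKey()
println(grouped.toDebugString) // prints the RDD chain, ending in ParallelCollectionRDD
sc.stop()
```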
We have written a book named "The Design Principles and Implementation of Apache Spark", which talks about the system problems, design principles, and implementation strategies of Apache Spark, and also details the shuffle, fault tolerance, and memory management mechanisms. Currently, it is written in Chinese.
Book link: https://item.jd.com/12924768.html
Book preface: https://github.com/JerryLead/ApacheSparkBook/blob/master/Preface.pdf
I appreciate the help from the following people in providing solutions and ideas for some detailed issues:
@Andrew-Xia Participated in the discussion of BlockManager's implementation's impact on broadcast(rdd).
@CrazyJVM Participated in the discussion of BlockManager's implementation.
@ Participated in the discussion of BlockManager's implementation.
Thanks to the following for complementing the document:
| Weibo Id | Chapter | Content | Revision status |
|---|---|---|---|
| @OopsOutOfMemory | Overview | Relation between workers and executors, and Summary on Spark Executor Driver's Resource Management (in Chinese) | There is not yet a conclusion on this subject since its implementation is still changing; a link to the blog is added |
Thanks to the following for finding errors:
| Weibo Id | Chapter | Error/Issue | Revision status |
|---|---|---|---|
| @Joshuawangzj | Overview | When multiple applications are running, multiple Backend processes will be created | Corrected, but needs to be confirmed; no idea how to control the number of Backend processes |
| @_cs_cm | Overview | The latest groupByKey() has removed the mapValues() operation, so no MapValuesRDD is generated | Fixed the groupByKey()-related diagrams and text |
| @ | JobLogicalPlan | The N:N relation in FullDependency is a NarrowDependency | Modified the description of NarrowDependency into 3 different cases with detailed explanation, clearer than the previous 2-case explanation |
| @zzl0 | First four chapters | Lots of typos, such as "groupByKey has generated the 3 following RDDs" (should be 2); check the pull request | All fixed |
| @TEL | Cache and Broadcast chapter | Lots of typos | All fixed |
| @cloud-fan | JobLogicalPlan | Some arrows in the Cogroup() diagram should be colored red | All fixed |
| @CrazyJvm | Shuffle details | Starting from Spark 1.1, the default value of spark.shuffle.file.buffer.kb is 32k, not 100k (see the config sketch after this table) | All fixed |
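As a side note to that last row: the shuffle write buffer is an ordinary Spark configuration entry, so it can be overridden per application. A minimal sketch, assuming the Spark 1.1-era key spark.shuffle.file.buffer.kb (later releases renamed it) and an illustrative value:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Override the per-file shuffle write buffer (default 32 KB from Spark 1.1 on).
// The 64 KB value here is illustrative, not a recommendation.
val conf = new SparkConf()
  .setAppName("ShuffleBufferDemo")
  .setMaster("local[2]")
  .set("spark.shuffle.file.buffer.kb", "64")
val sc = new SparkContext(conf)
println(sc.getConf.get("spark.shuffle.file.buffer.kb")) // "64"
sc.stop()
```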
Special thanks to @Andy for his great support.
Special thanks to the rockers (including researchers, developers and users) who participate in the design, implementation and discussion of big data systems.