Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language
---|---|---|---|---|---|---|---|---|---|---
Synapseml<br>Simple and Distributed Machine Learning | 3,950 | | | 1 | 16 hours ago | 5 | January 12, 2022 | 280 | mit | Scala
Spark Nlp<br>State of the Art Natural Language Processing | 3,155 | | 2 | 2 | 16 hours ago | 90 | March 05, 2021 | 36 | apache-2.0 | Scala
Linkis<br>Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines. | 3,000 | | | | 16 hours ago | | | 243 | apache-2.0 | Java
Ibis<br>Expressive analytics in Python at any scale. | 2,553 | | 24 | 16 | 17 hours ago | 32 | April 28, 2022 | 70 | apache-2.0 | Python
Petastorm<br>Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code. | 1,593 | | | 6 | 5 days ago | 77 | February 19, 2022 | 169 | apache-2.0 | Python
Mleap<br>MLeap: Deploy ML Pipelines to Production | 1,434 | | 15 | 12 | a month ago | 26 | May 07, 2021 | 103 | apache-2.0 | Scala
Awesome Spark<br>A curated list of awesome Apache Spark packages and resources. | 1,427 | | | | a month ago | | | 19 | cc0-1.0 | Shell
Optimus<br>:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark | 1,345 | | | | 5 days ago | 27 | June 19, 2022 | 29 | apache-2.0 | Python
Spark Py Notebooks<br>Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks | 1,227 | | | | 6 years ago | | | 6 | other | Jupyter Notebook
Sparkmagic<br>Jupyter magics and kernels for working with remote Spark clusters | 1,199 | | 22 | 5 | a month ago | 47 | May 02, 2022 | 144 | other | Python
Apache Spark is one of the hottest new trends in the technology domain. It is the framework with probably the highest potential to realize the fruit of the marriage between Big Data and Machine Learning. It runs fast (up to 100x faster than traditional Hadoop MapReduce, thanks to in-memory operation), offers robust, distributed, fault-tolerant data objects (called RDDs), and integrates beautifully with the world of machine learning and graph analytics through supplementary packages like MLlib and GraphX.
Unlike most Python libraries, getting PySpark to start working properly is not as straightforward as `pip install ...` and `import ...`. Most of us with a Python-based data science and Jupyter/IPython background take this workflow for granted for all popular Python packages: we just head over to a CMD or Bash shell, type the pip install command, launch a Jupyter notebook, and import the library to start practicing.

But the PySpark+Jupyter combo needs a little bit more love :-)
```bash
# Check the Python 3 installation
python3 --version

# Install pip3 and Jupyter, and put the user-level bin directory on the PATH
sudo apt-get update
sudo apt install python3-pip
pip3 install jupyter
export PATH=$PATH:~/.local/bin

# Install Oracle Java 8 (required by Spark) and point JAVA_HOME/JRE_HOME at it
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=/usr/lib/jvm/java-8-oracle/jre

# Install Scala and the Py4J bridge that PySpark uses to talk to the JVM
sudo apt-get install scala
pip3 install py4j

# Unpack the Spark distribution and tell the shell where it lives
sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz
export SPARK_HOME='/home/tirtha/Spark/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH

# Make the pyspark launcher start a Jupyter notebook running Python 3
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

# Reload the shell configuration (assuming the exports above live in ~/.bashrc)
source ~/.bashrc
```
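With the exports above in place, running `pyspark` from the shell should open a Jupyter notebook with Spark already wired up. As a quick smoke test from a plain Python 3 session, a minimal sketch like the following can help; the `local[*]` master and the app name are assumptions for a single-machine setup, not part of the steps above.

```python
# Quick smoke test for the setup above: with PYTHONPATH pointing at
# $SPARK_HOME/python and py4j installed, pyspark should import directly.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")        # assumption: single machine, all cores
         .appName("setup-check")
         .getOrCreate())

print(spark.version)                # should match the unpacked Spark, e.g. 2.3.1
print(spark.range(5).count())       # trivial distributed job: prints 5

spark.stop()
```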
RDD
Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
Spark makes use of the concept of RDDs to achieve faster and more efficient MapReduce operations.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared filesystem, HDFS, or HBase.
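A minimal PySpark sketch of both approaches follows; the sample numbers and the `data.txt` path are placeholders for illustration, not anything from the text above.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-examples")

# 1) Parallelize an existing collection from the driver program
nums = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# 2) Reference a dataset in external storage ("data.txt" is a placeholder path)
lines = sc.textFile("data.txt")

# RDDs are operated on in parallel through transformations and actions,
# e.g. a classic map/reduce over the parallelized collection:
print(nums.map(lambda x: x * x).reduce(lambda a, b: a + b))  # 55

sc.stop()
```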
DataFrame
In Apache Spark, a DataFrame is a distributed collection of rows under named columns. It is conceptually equivalent to a table in a relational database, an Excel sheet with column headers, or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. A DataFrame also shares some common characteristics with an RDD: it is immutable, lazily evaluated, and distributed across the cluster.
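As a rough illustration, the sketch below builds a DataFrame from an existing RDD and hints at reading one from a structured file; the names, ages, and the `people.json` path are invented for illustration.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.master("local[*]").appName("dataframe-sketch").getOrCreate()

# From an existing RDD of Row objects
rdd = spark.sparkContext.parallelize(
    [Row(name="Alice", age=34), Row(name="Bob", age=45)]
)
people = spark.createDataFrame(rdd)

# From a structured data file ("people.json" is a placeholder path)
# people = spark.read.json("people.json")

# Like an RDD, a DataFrame is immutable, lazily evaluated, and distributed:
older = people.filter(people.age > 40)   # a transformation: nothing runs yet
older.show()                             # the action triggers distributed execution

spark.stop()
```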
Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!
To support a wide variety of diverse data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning. Essentially, Spark SQL leverages the power of Spark to perform distributed, robust, in-memory computations at massive scale on Big Data.
Spark SQL provides state-of-the-art SQL performance and also maintains compatibility with all existing structures and components supported by Apache Hive (a popular Big Data warehouse framework), including data formats, user-defined functions (UDFs), and the metastore. Besides this, it also supports ingesting a wide variety of data formats from Big Data sources and enterprise data warehouses (JSON, Hive, Parquet, and so on) and combining relational and procedural operations for more complex, advanced analytics.
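A small sketch of mixing the two styles, using a hypothetical employees DataFrame (the data and column names are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("spark-sql-sketch").getOrCreate()

employees = spark.createDataFrame(
    [("Alice", "HR", 5000), ("Bob", "IT", 7000), ("Carol", "IT", 6500)],
    ["name", "dept", "salary"],
)

# Relational style: register a temporary view and query it with plain SQL
employees.createOrReplaceTempView("employees")
spark.sql(
    "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept"
).show()

# Procedural style: the same aggregation through the DataFrame API
employees.groupBy("dept").avg("salary").show()

spark.stop()
```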
Spark SQL has been shown to be extremely fast, even comparable to C++-based engines such as Impala.
The following graph shows a benchmark of DataFrames vs. RDDs in different languages, which gives an interesting perspective on how optimized DataFrames can be.
Why is Spark SQL so fast and optimized? The reason is a new extensible optimizer, Catalyst, based on functional programming constructs in Scala.
Catalyst's extensible design has two purposes: first, to make it easy to add new optimization techniques and features to Spark SQL, especially for tackling problems around Big Data and semi-structured data; and second, to let external developers extend the optimizer, for example by adding data-source-specific rules or support for new data types.
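One easy way to see Catalyst at work is to ask Spark to print the query plans it produces. The sketch below reuses a small, invented DataFrame; `explain(True)` prints the logical plans and the physical plan Catalyst generates before execution.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("catalyst-sketch").getOrCreate()

df = spark.createDataFrame(
    [("Alice", "HR", 5000), ("Bob", "IT", 7000)], ["name", "dept", "salary"]
)

# explain(True) prints the parsed, analyzed, and optimized logical plans,
# plus the physical plan Catalyst produces before execution.
df.filter(df.salary > 6000).groupBy("dept").count().explain(True)

spark.stop()
```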