Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
Spark | 35,911 | 2,394 | 882 | 16 hours ago | 46 | May 09, 2021 | 274 | apache-2.0 | Scala | |
Apache Spark - A unified analytics engine for large-scale data processing | ||||||||||
Cookbook | 11,769 | 2 months ago | 110 | apache-2.0 | ||||||
The Data Engineering Cookbook | ||||||||||
God Of Bigdata | 7,992 | 2 months ago | 2 | |||||||
Focused on big data study and interview preparation; start your road to big data mastery. Flink/Spark/Hadoop/Hbase/Hive... | ||||||||||
Zeppelin | 6,060 | 32 | 23 | 4 days ago | 2 | June 21, 2017 | 141 | apache-2.0 | Java | |
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more. | ||||||||||
Sparkinternals | 4,665 | 2 years ago | 27 | |||||||
Notes talking about the design and implementation of Apache Spark | ||||||||||
Iceberg | 4,339 | 16 hours ago | 4 | May 23, 2022 | 1,356 | apache-2.0 | Java | |||
Apache Iceberg | ||||||||||
Bigdl | 4,222 | 10 | 21 hours ago | 16 | April 19, 2021 | 745 | apache-2.0 | Jupyter Notebook | ||
Fast, distributed, secure AI for Big Data | ||||||||||
Tensorflowonspark | 3,851 | 5 | 21 days ago | 32 | April 21, 2022 | 13 | apache-2.0 | Python | ||
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters. | ||||||||||
Spark Nlp | 3,272 | 2 | 2 | 3 days ago | 90 | March 05, 2021 | 38 | apache-2.0 | Scala | |
State of the Art Natural Language Processing | ||||||||||
Koalas | 3,228 | 1 | 12 | 6 months ago | 47 | October 19, 2021 | 109 | apache-2.0 | Python | |
Koalas: pandas API on Apache Spark |
TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters.
By combining salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers.
It enables both distributed TensorFlow training and inferencing on Spark clusters, with the goal of minimizing the code changes required to run existing TensorFlow programs on a shared grid. Its Spark-compatible API helps manage the TensorFlow cluster, including sending Spark RDD data to the TensorFlow nodes via the TFNode.DataFeed class. Note that we leverage the Hadoop Input/Output Format to access TFRecords on HDFS.
TensorFlowOnSpark was developed by Yahoo for large-scale distributed deep learning on our Hadoop clusters in Yahoo's private cloud.
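The TFNode.DataFeed mechanism mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's reference code: the helper names `feed_rows` and `run_on_spark`, the `args.batch_size` attribute, and the `data_rdd` argument are hypothetical, and the library calls (`TFCluster.run`, `cluster.train`, `cluster.shutdown`, `TFNode.DataFeed`, `should_stop`, `next_batch`) follow the pattern used in the project's published examples, so please check them against the API documentation for your installed version.

```python
def feed_rows(tf_feed, batch_size=1):
    """Yield rows from a DataFeed-like object until the feed is exhausted.

    `tf_feed` only needs `should_stop()` and `next_batch(n)`, so this
    generator can be exercised locally with a stub object.
    """
    while not tf_feed.should_stop():
        batch = tf_feed.next_batch(batch_size)
        if len(batch) == 0:
            return  # no more data from Spark
        for row in batch:
            yield row


def main_fun(args, ctx):
    """Runs on each Spark executor; `ctx` is supplied by TensorFlowOnSpark."""
    # Imported inside the function so the import happens on the executor.
    from tensorflowonspark import TFNode

    tf_feed = TFNode.DataFeed(ctx.mgr, train_mode=True)
    # Assumption: `args.batch_size` is an attribute you define yourself.
    for row in feed_rows(tf_feed, batch_size=args.batch_size):
        pass  # placeholder: hand `row` to your TensorFlow input pipeline


def run_on_spark(sc, data_rdd, args, num_executors, epochs=1):
    """Driver side: start the cluster, feed the RDD, then shut down."""
    from tensorflowonspark import TFCluster

    cluster = TFCluster.run(
        sc, main_fun, args, num_executors, num_ps=0,
        input_mode=TFCluster.InputMode.SPARK,
    )
    cluster.train(data_rdd, epochs)  # pushes RDD partitions to the feed
    cluster.shutdown()
```

Keeping the feed loop in a plain generator means the data-handling logic can be checked without a Spark cluster, while `main_fun` and `run_on_spark` stay thin wrappers around the library calls.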
TensorFlowOnSpark provides some important benefits (see our blog) over alternative deep learning solutions.
TensorFlowOnSpark is provided as a pip package, which can be installed on single machines via:
# for tensorflow>=2.0.0
pip install tensorflowonspark
# for tensorflow<2.0.0
pip install tensorflowonspark==1.4.4
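If the install step is scripted, the version split above can be captured in a small helper. This is only a sketch: the function name `tfos_requirement` is hypothetical, and it assumes the TensorFlow version is available as a dotted string such as "2.4.1".

```python
def tfos_requirement(tf_version: str) -> str:
    """Return the pip requirement matching the given TensorFlow version.

    Mirrors the two install commands above: the latest package for
    tensorflow>=2.0.0, and the 1.4.4 release for tensorflow<2.0.0.
    """
    major = int(tf_version.split(".")[0])
    if major >= 2:
        return "tensorflowonspark"
    return "tensorflowonspark==1.4.4"
```

The returned string can be passed to `pip install` in a provisioning script.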
For distributed clusters, please see our wiki site for detailed documentation for specific environments, such as our getting started guides for single-node Spark Standalone, YARN clusters and AWS EC2. Note: the Windows operating system is not currently supported due to this issue.
To use TensorFlowOnSpark with an existing TensorFlow application, follow our Conversion Guide, which describes the required changes. Additionally, our wiki site has pointers to some presentations that provide an overview of the platform.
Note: since TensorFlow 2.x breaks API compatibility with TensorFlow 1.x, the examples have been updated accordingly. If you are using TensorFlow 1.x, you will need to check out the v1.4.4 tag for compatible examples and instructions.
API Documentation is automatically generated from the code.
Please join the TensorFlowOnSpark user group for discussions and questions. If you have a question, please review our FAQ before posting.
Contributions are always welcome. For more information, please see our guide for getting involved.
The use and distribution terms for this software are covered by the Apache 2.0 license. See LICENSE file for terms.