Awesome Open Source
Search results for python spark
1,076 search results found
Apache Spark - A unified analytics engine for large-scale data processing
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Suite of tools for deploying and training deep learning models on the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch; a small, modular C++ library for running math code; and a Java-based math library on top of the core C++ library. Also includes SameDiff, a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
List of Data Science Cheatsheets to rule the world
An orchestration platform for the development, production, and observation of data assets.
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.
Fast, distributed, secure AI for Big Data
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python SQL Parser and Transpiler
Koalas: pandas API on Apache Spark
NumPy and Pandas interface to Big Data
The flexibility of Python with the scale and performance of modern SQL.
Python clone of Spark, a MapReduce-like framework in Python
Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
Spark Deep Learning
Deep Learning Pipelines for Apache Spark
A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
Distributed Deep learning with Keras & Spark
Spark Py Notebooks
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
MLeap: Deploy ML Pipelines to Production
Machine Learning Platform and Recommendation Engine built on Kubernetes
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Extended pickling support for Python objects
Aws Glue Samples
AWS Glue code samples
Jupyter magics and kernels for working with remote Spark clusters
Baidu Bigflow is an interface for writing distributed computing programs that provides simple, flexible, powerful APIs. With Bigflow, you can handle data of any scale. Bigflow processes 4P+ of data inside Baidu and runs about 10k jobs every day.
(Deprecated) Scikit-learn integration package for Apache Spark
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Python Helper library for Jupyter Notebooks
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
A Data Engineering & Machine Learning Knowledge Hub
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Spark Movie Lens
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
An open source framework for building data analytic applications.
Machine learning resources, including algorithms, papers, datasets, examples and so on.
Fast, accurate and scalable probabilistic data linkage using your choice of SQL backend
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
A command-line tool for launching Apache Spark clusters.
Elasticsearch Spark Recommender
Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Python Data Science Cheatsheet
This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
pyspark🍒🥭 is delicious, just eat it!😋😋
Aws Glue Libs
AWS Glue Libraries are additions and enhancements to Spark for ETL operations.
Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Data Science Learning Resources
A collection of data science and machine learning resources that I've found helpful (I only post what I've read!)
Engine for ML/Data tracking, visualization, explainability, drift detection, and dashboards for Polyaxon.
Monitor the stability of a Pandas or Spark dataframe ⚙︎
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Complete Life Cycle Of A Data Science Project
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs
Scripts used to set up a Spark cluster on EC2
Create clusters of VMs on the cloud and configure them with Ansible.
Spark Standalone Cluster On Docker
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡️
TensorFlow on Spark
Train and run PyTorch models on Apache Spark.
Code repository for Learning PySpark by Packt
Sparrow scheduling platform (U.C. Berkeley).
Easy-to-use library to bring TensorFlow to Apache Spark
Pandas and Spark DataFrame comparison for humans
Distributed scikit-learn meta-estimators in PySpark
Process Common Crawl data with Python and Spark
Tidb Docker Compose
Azure Event Hubs
☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs
Beginner data engineering project - batch edition
Pyspark Style Guide
This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]
A Spark library for Amazon SageMaker.
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
A tool for building feature stores.
Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).
A repository to keep track of all the code that I end up writing for my blog posts.
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Code to accompany Mastering Data Science from PT press
RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Spark Recommendation Engine
Joblib Apache Spark Backend
New Generation Open Source Data Stack Demo