Awesome Open Source
Search results for spark pyspark
533 search results found
Simple and Distributed Machine Learning
State of the Art Natural Language Processing
The flexibility of Python with the scale and performance of modern SQL.
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications and the underlying data engines.
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Spark Py Notebooks
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
MLeap: Deploy ML Pipelines to Production
A curated list of awesome Apache Spark packages and resources.
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Jupyter magics and kernels for working with remote Spark clusters
Baidu Bigflow is an interface for writing distributed computing programs that provides many simple, flexible, and powerful APIs. With Bigflow, you can easily handle data of any scale. Bigflow processes more than 4 PB of data inside Baidu and runs about 10k jobs every day.
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
PySpark-Tutorial provides basic algorithms using PySpark
Sparkling Water provides H2O functionality inside Spark cluster
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, and resource management, and intelligent diagnosis.
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
pyspark🍒🥭 is delicious, just eat it!😋😋
A comprehensive Spark guide collated from multiple sources that can be used to learn more about Spark or as an interview refresher.
Code base for the Learning PySpark book (in preparation)
This is a repo documenting the best practices in PySpark.
Includes notes on Apache Spark, Spark for Physics, Jupyter notebook examples for Spark, Oracle and other DB systems.
Pandas and Spark DataFrame comparison for humans and more!
Spark Standalone Cluster On Docker
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡️
Code repository for Learning PySpark by Packt
A Spark library for Amazon SageMaker.
Distributed scikit-learn meta-estimators in PySpark
Process Common Crawl data with Python and Spark
Spark Gotchas. A subjective compilation of the Apache Spark tips and tricks
A tool for building feature stores.
Pyspark Style Guide
This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
Spark Jupyter Aws
A guide on how to set up Jupyter with PySpark painlessly on AWS EC2 clusters, with S3 I/O support
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
A repository to keep track of all the code that I end up writing for my blog posts.
Big Data Processing Framework - Unified Data API or SQL on Any Storage
🐍 Quick reference guide to common patterns & functions in PySpark.
Joblib Apache Spark Backend
Gallery of Apache Zeppelin notebooks
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
Azure Cosmosdb Spark
Apache Spark Connector for Azure Cosmos DB
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Cloud Dataproc: Samples and Utils
Drunken Data Quality
Spark package for checking data quality
Apache Spark (PySpark) Practice on Real Data
Data Algorithms With Spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
GeoTrellis for PySpark
Isolation Forest on Spark
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Big Data Mapreduce Course
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
HandySpark - bringing pandas-like capabilities to Spark dataframes
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
A library that provides useful extensions to Apache Spark and PySpark.
Aliyun Emapreduce Demo
Mastering Big Data Analytics With Pyspark
Mastering Big Data Analytics with PySpark, Published by Packt
Apache (Py)Spark type annotations (stub files).
Spark Df Profiling
Create HTML profiling reports from Apache Spark DataFrames
Spark Knn Recommender
Item and User-based KNN recommendation algorithms using PySpark
Spark R Notebooks
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Repo for all the code from the articles I post on Medium
Movalytics Data Warehouse
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
Relation Extraction using Deep Learning (CNN)
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Pyspark Predictive Maintenance
Predictive Maintenance using PySpark
Phrase At Scale
Detect common phrases in large amounts of text using a data-driven approach. Discovered phrases can be of arbitrary size. Can be used with languages other than English.
Spark 2.0 Python Machine Learning examples
PySpark Cassandra brings back the fun in working with Cassandra data in PySpark.
Azure Databricks Nyc Taxi Workshop
An Azure Databricks workshop leveraging the New York Taxi and Limousine Commission Trip Records dataset
PySpark Cookbook, published by Packt
Python Spark Streaming
Learn By Examples
Real-world Spark pipeline examples
JupyterLab extension that enables monitoring launched Apache Spark jobs from within a notebook
pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible w/ Spark 2.0, 2.1, 2.2, 2.3 and 2.4
Jgit Spark Connector
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
Data exploration in PySpark made easy: Pyspark_dist_explore provides methods to get fast insights into your Spark DataFrames.
Python PMML scoring library
Pyspark Twitter Stream Mining
Real-time Machine Learning with Apache Spark on Twitter Public Stream
Word2Vec models with Twitter data using Spark. Blog:
Spark ML with pyspark
Apache Spark (Scala, PySpark, SparkR) Code, Tricks, and References
🌐 Interactive Workshop on GeoAnalysis using PySpark
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Helpers & syntactic sugar for PySpark.
Pyspark Setup Guide
A guide for setting up Spark + PySpark under Ubuntu Linux
Some class materials for a data processing course using PySpark
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
A library for Spark DataFrame using MinIO Select API
Mlflow Spark Summit 2019
MLflow Spark Summit 2019 Presentation
PySpark for Elastic Search
Repository used for Spark Trainings
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
Spark Hive Udf
Example project showing how to use Hive UDFs in Apache Spark
Spark SQL UDF examples
Datapipelines Essentials Python
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
Spark Modularized View
Spark Nba Analytics
Analyzing NBA data using Spark 2.1
Spark Dgraph Connector
A connector for Apache Spark and PySpark to Dgraph databases.