Awesome Open Source
Search results for python pyspark
277 search results found
Ibis
⭐
3,404
The flexibility of Python with the scale and performance of modern SQL.
Machine Learning
⭐
2,607
🌎 machine learning tutorials (mainly in Python3)
Petastorm
⭐
1,693
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Spark Py Notebooks
⭐
1,515
Apache Spark & Python (PySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Mleap
⭐
1,479
MLeap: Deploy ML Pipelines to Production
Sparkmagic
⭐
1,272
Jupyter magics and kernels for working with remote Spark clusters
Bigflow
⭐
1,122
Baidu Bigflow is an interface that allows for writing distributed computing programs and provides lots of simple, flexible, powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4P+ data inside Baidu and runs about 10k jobs every day.
Sparkit Learn
⭐
1,054
PySpark + Scikit-learn = Sparkit-learn
Hopsworks
⭐
1,041
Hopsworks - Data-Intensive AI platform with a Feature Store
Pyspark Example Project
⭐
1,034
Example project implementing best practices for PySpark ETL jobs and applications.
Pyspark Examples
⭐
778
PySpark RDD, DataFrame, and Dataset examples in Python
Devops Python Tools
⭐
709
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Kuwala
⭐
610
Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data into data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) High-resolution demograp
Eat_pyspark_in_10_days
⭐
534
pyspark🍒🥭 is delicious, just eat it!😋😋
Pandapy
⭐
483
PandaPy has the speed of NumPy and the usability of Pandas, 10x to 50x faster (by @firmai)
Chispa
⭐
443
PySpark test helper methods with beautiful error messages
Findspark
⭐
428
Gather Deployment
⭐
347
Gathers Python deployment, infrastructure and practices.
Datacompy
⭐
339
Pandas and Spark DataFrame comparison for humans and more!
Sparklingpandas
⭐
338
Sparkling Pandas
Tdigest
⭐
332
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Spark Standalone Cluster On Docker
⭐
311
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
Learning Pyspark
⭐
294
Code repository for Learning PySpark by Packt
Sagemaker Spark
⭐
285
A Spark library for Amazon SageMaker.
Sk Dist
⭐
283
Distributed scikit-learn meta-estimators in PySpark
Cc Pyspark
⭐
280
Process Common Crawl data with Python and Spark
Butterfree
⭐
269
A tool for building feature stores.
Pyspark Style Guide
⭐
264
This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
Pysparkling
⭐
253
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
Dbldatagen
⭐
234
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
Morphl Community Edition
⭐
233
MorphL Community Edition uses big data and machine learning to predict user behaviors in digital products and services with the end goal of increasing KPIs (click-through rates, conversion rates, etc.) through personalization
Learningapachespark
⭐
233
LearningApacheSpark
Data_science_blogs
⭐
232
A repository to keep track of all the code that I end up writing for my blog posts.
Gimel
⭐
230
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Joblib Spark
⭐
226
Joblib Apache Spark Backend
Sql Data Analysis And Visualization Projects
⭐
200
SQL data analysis & visualization projects using MySQL, PostgreSQL, SQLite, Tableau, Apache Spark and PySpark.
Mack
⭐
188
Delta Lake helper methods in PySpark
Spark Extension
⭐
152
A library that provides useful extensions to Apache Spark and PySpark.
Geopyspark
⭐
151
GeoTrellis for PySpark
Data Algorithms With Spark
⭐
151
O'Reilly Book: Data Algorithms with Spark by Mahmoud Parsian
Osci
⭐
140
Open Source Contributor Index
Pyspark Cheatsheet
⭐
140
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Handyspark
⭐
129
HandySpark - bringing pandas-like capabilities to Spark dataframes
Aut
⭐
128
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Pyspark Stubs
⭐
116
Apache (Py)Spark type annotations (stub files).
Movalytics Data Warehouse
⭐
116
Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow
Spark Df Profiling
⭐
115
Create HTML profiling reports from Apache Spark DataFrames
Spark Knn Recommender
⭐
113
Item and User-based KNN recommendation algorithms using PySpark
Replay
⭐
109
A Comprehensive Framework for Building End-to-End Recommendation Systems with State-of-the-Art Models
Machinelearning
⭐
106
Machine learning for beginners (data science enthusiasts)
Dataproc Templates
⭐
103
Dataproc templates and pipelines for solving simple in-cloud data tasks
Dataanalysiswithpythonandpyspark
⭐
102
Code repository for the "PySpark in Action" book
Dampr
⭐
101
Python Data Processing library
Spark With Python
⭐
98
Fundamentals of Spark with Python (using PySpark), code examples
Relation_extraction
⭐
93
Relation extraction using deep learning (CNN)
Big Data Engineering Coursera Yandex
⭐
91
Big Data for Data Engineers Coursera Specialization from Yandex
Pyspark Csv
⭐
87
An external PySpark module that works like R's read.csv or pandas' read_csv, with automatic type inference and null value handling. Parses CSV data into SchemaRDD. No installation required; simply include pyspark_csv.py via SparkContext.
Phrase At Scale
⭐
84
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English
Pyspark Cassandra
⭐
81
PySpark Cassandra brings back the fun in working with Cassandra data in PySpark.
Spark_python_ml_examples
⭐
81
Spark 2.0 Python Machine Learning examples
Anovos
⭐
78
Anovos - An Open Source Library for Scalable feature engineering Using Apache-Spark
Python Spark Streaming
⭐
73
Pyspark Cassandra
⭐
67
pyspark-cassandra is a Python port of the awesome @datastax Spark Cassandra connector. Compatible with Spark 2.0, 2.1, 2.2, 2.3 and 2.4
Jgit Spark Connector
⭐
67
jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis.
Mmtf Pyspark
⭐
64
Methods for the parallel and distributed analysis and mining of the Protein Data Bank using MMTF and Apache Spark.
Pyspark_dist_explore
⭐
64
Data Exploration in PySpark made easy - Pyspark_dist_explore provides methods to get fast insights in your Spark DataFrames.
Pypmml
⭐
64
Python PMML scoring library
Pyspark Twitter Stream Mining
⭐
63
Real-time Machine Learning with Apache Spark on Twitter Public Stream
Sparkly
⭐
60
Helpers & syntactic sugar for PySpark.
Apachespark
⭐
59
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. At the end of the course, we also cover a few case studies.
Cuallee
⭐
56
A data quality acceleration library to get data sets verified in a friendly interface
Data_processing_course
⭐
53
Some class materials for a data processing course using PySpark
Replay
⭐
53
RecSys Library
Towardsdataengineering
⭐
52
This repo contains commands that data engineers use in day to day work.
Spark Training
⭐
52
Repository used for Spark Trainings
Pyspark Elastic
⭐
52
PySpark for Elasticsearch
Soda Spark
⭐
49
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames
Apollo
⭐
48
Advanced similarity and duplicate source code proof of concept for our research efforts.
Stork
⭐
47
Make your libraries magically appear in Databricks.
Sparkora
⭐
46
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Terraform Emr Pyspark
⭐
46
Quickstart PySpark with Anaconda on AWS/EMR using Terraform
Datapipelines Essentials Python
⭐
45
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Cluster Pack
⭐
44
A library on top of either pex or conda-pack to make your Python code easily available on a cluster
Emr Bootstrap Pyspark
⭐
43
Quickstart PySpark with Anaconda on AWS/EMR
Smv
⭐
41
Spark Modularized View
Pydata_berlin2016_materials
⭐
39
Collection of pointers to slides and repositories from speakers at PyData Berlin 2016
Dsq
⭐
39
Distributed Streaming Quantiles (for PySpark)
Pytest Spark
⭐
38
pytest plugin for running tests with PySpark support
Azure Databricks
⭐
37
Azure Databricks - Advent of 2020 Blogposts
Spark_app_twitter
⭐
36
A data engineering project (Twitter monitor app)
Pyjaws
⭐
36
PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows
Pyspark Cassandra
⭐
35
Utilities and examples to assist in working with PySpark and Cassandra.
Dlsa
⭐
33
Distributed least squares approximation (dlsa) implemented with Apache Spark
Data Analytics Services
⭐
33
This repo collects the open-source work of the Analytics Service within NHS Digital Data Services
Spark Twitter Sentiment Analysis
⭐
33
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
Shparkley
⭐
33
Spark implementation for computing Shapley values using Monte Carlo approximation
Pyspark Algorithms
⭐
33
PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2
Luigi Sample
⭐
33
Sample repo for luigi tasks & config
Gmm
⭐
31
Gaussian Mixture Model implementation in PySpark
Check Engine
⭐
30
Data validation library for PySpark 3.0.0
Related Searches
Python Machine Learning (20,195)
Python Flask (17,643)
Python Dataset (14,792)
Python Docker (14,113)
Python Tensorflow (13,736)
Python Deep Learning (13,092)
Python Jupyter Notebook (12,976)
Python Html (10,924)
Python Algorithms (10,033)
Python Testing (9,479)
1-100 of 277 search results