Awesome Open Source
Search results for python spark
733 search results found
Spark
⭐
37,661
Apache Spark - A unified analytics engine for large-scale data processing
Data Science Ipython Notebooks
⭐
25,668
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Redash
⭐
24,479
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
Deeplearning4j
⭐
13,397
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learning using automatic differentiation.
Ds Cheatsheets
⭐
11,535
List of Data Science Cheatsheets to rule the world
Dagster
⭐
9,467
An orchestration platform for the development, production, and observation of data assets.
It_book
⭐
8,543
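This project collects more than a thousand good, commonly used books that I have read or heard of over the years; the book you are looking for may well be here. It covers the internet ...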
H2o 3
⭐
6,618
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Mage Ai
⭐
6,324
🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.
Dev Setup
⭐
5,802
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.
Technical Books
⭐
5,519
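😆 Which books have leading internet technology experts in China and abroad written: computer fundamentals, networking, front end, back end, databases, architecture, big data, deep learning.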
Bigdl
⭐
4,728
Accelerate LLM with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using bigdl-llm
Sqlglot
⭐
4,652
Python SQL Parser and Transpiler
Tensorflowonspark
⭐
3,851
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Ibis
⭐
3,404
The flexibility of Python with the scale and performance of modern SQL.
Koalas
⭐
3,291
Koalas: pandas API on Apache Spark
Dpark
⭐
2,637
Python clone of Spark, a MapReduce-like framework in Python
Analytics Zoo
⭐
2,592
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
Lakesoul
⭐
2,248
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
Spark Deep Learning
⭐
1,915
Deep Learning Pipelines for Apache Spark
Benchm Ml
⭐
1,839
A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
Fugue
⭐
1,821
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
.github
⭐
1,722
ApacheCN open source organization: announcements, introduction, members, activities, and contact channels
Petastorm
⭐
1,693
Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.
Elephas
⭐
1,548
Distributed Deep learning with Keras & Spark
Spark Py Notebooks
⭐
1,515
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Cloudpickle
⭐
1,514
Extended pickling support for Python objects
Mleap
⭐
1,479
MLeap: Deploy ML Pipelines to Production
Seldon Server
⭐
1,420
Machine Learning Platform and Recommendation Engine built on Kubernetes
Aws Glue Samples
⭐
1,334
AWS Glue code samples
Sparkmagic
⭐
1,272
Jupyter magics and kernels for working with remote Spark clusters
Bigflow
⭐
1,122
Baidu Bigflow is an interface that allows for writing distributed computing programs and provides lots of simple, flexible, powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4P+ data inside Baidu and runs about 10k jobs every day.
Machine Learning
⭐
1,046
Principles of machine learning
Spark Sklearn
⭐
1,039
(Deprecated) Scikit-learn integration package for Apache Spark
Pixiedust
⭐
1,035
Python Helper library for Jupyter Notebooks
Pyspark Example Project
⭐
1,034
Example project implementing best practices for PySpark ETL jobs and applications.
Adam
⭐
966
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Splink
⭐
939
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
Around Dataengineering
⭐
926
A Data Engineering & Machine Learning Knowledge Hub
Coding Now
⭐
925
Notes from my studies, along with eBooks I have read, video resources, and blogs I have collected that I consider good, ...
Incubator Livy
⭐
840
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Spark Movie Lens
⭐
757
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Cdap
⭐
735
An open source framework for building data analytic applications.
Devops Python Tools
⭐
709
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Machinelearning
⭐
684
Machine learning resources, including algorithms, papers, datasets, examples, and so on.
Flintrock
⭐
627
A command-line tool for launching Apache Spark clusters.
Pythondatascience Collections
⭐
615
The most complete collection of data analysis materials (including Python, web scraping, databases, big data, Tableau, statistics, and more)
Listenbrainz Server
⭐
613
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
Dist Keras
⭐
611
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Sparkmeasure
⭐
603
This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.
Elasticsearch Spark Recommender
⭐
603
Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch
Goodreads_etl_pipeline
⭐
593
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Python Data Science Cheatsheet
⭐
590
Python data science cheat sheets
Aws Glue Libs
⭐
568
AWS Glue Libraries are additions and enhancements to Spark for ETL operations.
Eat_pyspark_in_10_days
⭐
534
pyspark🍒🥭 is delicious, just eat it!😋😋
Lopq
⭐
512
Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Data Science Learning Resources
⭐
499
A collection of data science and machine learning resources that I've found helpful (I only post what I've read!)
Complete Life Cycle Of A Data Science Project
⭐
499
Complete-Life-Cycle-of-a-Data-Science-Project
Timliu Python
⭐
492
Collection of Python resources and open source hardware
Popmon
⭐
461
Monitor the stability of a Pandas or Spark dataframe ⚙︎
Agile_data_code_2
⭐
435
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Findspark
⭐
428
Recommendersystems
⭐
421
Recommender systems
Zat
⭐
414
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Azuredatabricksbestpractices
⭐
377
Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs
Spark Ec2
⭐
367
Scripts used to set up a Spark cluster on EC2
Learning Resource
⭐
351
A list of excellent learning resources for programmers
Datacompy
⭐
339
Pandas and Spark DataFrame comparison for humans and more!
Sparklingpandas
⭐
338
Sparkling Pandas
Elasticluster
⭐
334
Create clusters of VMs on the cloud and configure them with Ansible.
Spark Standalone Cluster On Docker
⭐
311
Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡
Tensorspark
⭐
302
TensorFlow on Spark
Sparkflow
⭐
301
Easy-to-use library to bring TensorFlow to Apache Spark
Sparktorch
⭐
297
Train and run Pytorch models on Apache Spark.
Learning Pyspark
⭐
294
Code repository for Learning PySpark by Packt
Sparrow
⭐
292
Sparrow scheduling platform (U.C. Berkeley).
Sagemaker Spark
⭐
285
A Spark library for Amazon SageMaker.
Sk Dist
⭐
283
Distributed scikit-learn meta-estimators in PySpark
Cc Pyspark
⭐
280
Process Common Crawl data with Python and Spark
Tidb Docker Compose
⭐
278
Azure Event Hubs
⭐
277
☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs
Beginner_de_project
⭐
276
Beginner data engineering project - batch edition
Spark Programming In Python
⭐
269
Apache Spark 3 - Spark Programming in Python for Beginners
Butterfree
⭐
269
A tool for building feature stores.
Raydp
⭐
265
RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.
Pyspark Style Guide
⭐
264
This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.
Openuba
⭐
264
A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]
Pysparkling
⭐
253
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
Bisheserver
⭐
242
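This system is my graduation project, titled "Design and Implementation of a Movie Recommendation System Based on User Profiles". It mainly uses Django as ...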
Dbldatagen
⭐
234
Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
Learningapachespark
⭐
233
LearningApacheSpark
Installations_mac_ubuntu_windows
⭐
233
Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).
Data_science_blogs
⭐
232
A repository to keep track of all the code that I end up writing for my blog posts.
Gimel
⭐
230
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Intro_ds
⭐
229
Code to accompany Mastering Data Science from PT press
Joblib Spark
⭐
226
Joblib Apache Spark Backend
Spark Recommendation Engine
⭐
222
Ngods Stocks
⭐
217
New Generation Opensource Data Stack Demo
Recommendationsystem
⭐
190
Book recommender system using collaborative filtering based on Spark
Mlflow Examples
⭐
179
Basic and advanced MLflow examples for many ML flavors
Related Searches
Python Django (28,897)
Python Flask (17,643)
Python Dataset (14,792)
Python Pytorch (14,667)
Python Tensorflow (14,376)
Python Docker (14,113)
Python Machine Learning (14,099)
Python Command Line (13,351)
Python Deep Learning (13,092)
Python Jupyter Notebook (12,976)
1-100 of 733 search results
Copyright 2018-2024 Awesome Open Source. All rights reserved.