Awesome Open Source

Search results for python spark

733 search results found

Spark ⭐ 37,661

Apache Spark - A unified analytics engine for large-scale data processing

Data Science Ipython Notebooks ⭐ 25,668

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Redash ⭐ 24,479

Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

Deeplearning4j ⭐ 13,397

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and onnx/pytorch, a modular and tiny c++ library for running math code and a java based math library on top of the core c++ library. Also includes samediff: a pytorch/tensorflow like library for running deep learning using automatic differentiation.

Ds Cheatsheets ⭐ 11,535

List of Data Science Cheatsheets to rule the world

Dagster ⭐ 9,467

An orchestration platform for the development, production, and observation of data assets.

It_book ⭐ 8,543

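This project collects more than a thousand good, commonly used books that I have read or heard about over the years; the book you are looking for may well be here. It covers the internet…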

H2o 3 ⭐ 6,618

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Mage Ai ⭐ 6,324

🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.

Dev Setup ⭐ 5,802

macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime Text, Bash, iTerm, Python data analysis, Spark, Hadoop MapReduce, AWS, Heroku, JavaScript web development, Android development, common data stores, and dev-based OS X defaults.

Technical Books ⭐ 5,519

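😆 Books written by leading internet technology experts in China and abroad: computer fundamentals, networking, front end, back end, databases, architecture, big data, deep learning.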

Bigdl ⭐ 4,728

Accelerate LLM with low-bit (FP4 / INT4 / FP8 / INT8) optimizations using bigdl-llm

Sqlglot ⭐ 4,652

Python SQL Parser and Transpiler

Tensorflowonspark ⭐ 3,851

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

The flexibility of Python with the scale and performance of modern SQL.

Koalas ⭐ 3,291

Koalas: pandas API on Apache Spark

Dpark ⭐ 2,637

Python clone of Spark, a MapReduce-like framework in Python

Analytics Zoo ⭐ 2,592

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

Lakesoul ⭐ 2,248

LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.

Spark Deep Learning ⭐ 1,915

Deep Learning Pipelines for Apache Spark

Benchm Ml ⭐ 1,839

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

Fugue ⭐ 1,821

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

.github ⭐ 1,722

ApacheCN open source organization: announcements, introduction, members, activities, and contact channels

Petastorm ⭐ 1,693

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Elephas ⭐ 1,548

Distributed Deep learning with Keras & Spark

Spark Py Notebooks ⭐ 1,515

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Cloudpickle ⭐ 1,514

Extended pickling support for Python objects

Mleap ⭐ 1,479

MLeap: Deploy ML Pipelines to Production

Seldon Server ⭐ 1,420

Machine Learning Platform and Recommendation Engine built on Kubernetes

Aws Glue Samples ⭐ 1,334

AWS Glue code samples

Sparkmagic ⭐ 1,272

Jupyter magics and kernels for working with remote Spark clusters

Bigflow ⭐ 1,122

Baidu Bigflow is an interface that allows for writing distributed computing programs and provides lots of simple, flexible, powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4P+ data inside Baidu and runs about 10k jobs every day.

Machine Learning ⭐ 1,046

Principles of machine learning

Spark Sklearn ⭐ 1,039

(Deprecated) Scikit-learn integration package for Apache Spark

Pixiedust ⭐ 1,035

Python Helper library for Jupyter Notebooks

Pyspark Example Project ⭐ 1,034

Example project implementing best practices for PySpark ETL jobs and applications.

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Around Dataengineering ⭐ 926

A Data Engineering & Machine Learning Knowledge Hub

Coding Now ⭐ 925

Study notes, along with some eBooks I have read, video resources, and blogs I have collected over time and consider good,…

Incubator Livy ⭐ 840

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

Spark Movie Lens ⭐ 757

An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset

An open source framework for building data analytic applications.

Devops Python Tools ⭐ 709

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Machinelearning ⭐ 684

Machine learning resources, including algorithms, papers, datasets, examples, and so on.

Flintrock ⭐ 627

A command-line tool for launching Apache Spark clusters.

Pythondatascience Collections ⭐ 615

The most complete collection of data analysis materials (including Python, web scraping, databases, big data, Tableau, statistics, and more)

Listenbrainz Server ⭐ 613

Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

Dist Keras ⭐ 611

Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.

Sparkmeasure ⭐ 603

This is the development repository for sparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task and stage metrics data.

Elasticsearch Spark Recommender ⭐ 603

Use Jupyter Notebooks to demonstrate how to build a Recommender with Apache Spark & Elasticsearch

Goodreads_etl_pipeline ⭐ 593

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Python Data Science Cheatsheet ⭐ 590

Python data science cheat sheets

Aws Glue Libs ⭐ 568

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Eat_pyspark_in_10_days ⭐ 534

pyspark🍒🥭 is delicious, just eat it!😋😋

Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.

Data Science Learning Resources ⭐ 499

A collection of data science and machine learning resources that I've found helpful (I only post what I've read!)

Complete Life Cycle Of A Data Science Project ⭐ 499

Complete-Life-Cycle-of-a-Data-Science-Project

Timliu Python ⭐ 492

Python resource collection and open source hardware

Monitor the stability of a Pandas or Spark dataframe ⚙︎

Agile_data_code_2 ⭐ 435

Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition

Findspark ⭐ 428

Recommendersystems ⭐ 421

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark

Azuredatabricksbestpractices ⭐ 377

Version 1 of Technical Best Practices of Azure Databricks based on real world Customer and Technical SME inputs

Spark Ec2 ⭐ 367

Scripts used to set up a Spark cluster on EC2

Learning Resource ⭐ 351

A list of excellent learning resources for programmers

Datacompy ⭐ 339

Pandas and Spark DataFrame comparison for humans and more!

Sparklingpandas ⭐ 338

Sparkling Pandas

Elasticluster ⭐ 334

Create clusters of VMs on the cloud and configure them with Ansible.

Spark Standalone Cluster On Docker ⭐ 311

Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. ⚡

Tensorspark ⭐ 302

TensorFlow on Spark

Sparkflow ⭐ 301

Easy to use library to bring Tensorflow on Apache Spark

Sparktorch ⭐ 297

Train and run Pytorch models on Apache Spark.

Learning Pyspark ⭐ 294

Code repository for Learning PySpark by Packt

Sparrow ⭐ 292

Sparrow scheduling platform (U.C. Berkeley).

Sagemaker Spark ⭐ 285

A Spark library for Amazon SageMaker.

Sk Dist ⭐ 283

Distributed scikit-learn meta-estimators in PySpark

Cc Pyspark ⭐ 280

Process Common Crawl data with Python and Spark

Tidb Docker Compose ⭐ 278

Azure Event Hubs ⭐ 277

☁️ Cloud-scale telemetry ingestion from any stream of data with Azure Event Hubs

Beginner_de_project ⭐ 276

Beginner data engineering project - batch edition

Spark Programming In Python ⭐ 269

Apache Spark 3 - Spark Programming in Python for Beginners

Butterfree ⭐ 269

A tool for building feature stores.

RayDP provides simple APIs for running Spark on Ray and integrating Spark with AI libraries.

Pyspark Style Guide ⭐ 264

This is a guide to PySpark code style presenting common situations and the associated best practices based on the most frequent recurring topics across the PySpark repos we've encountered.

Openuba ⭐ 264

A robust and flexible open source User & Entity Behavior Analytics (UEBA) framework used for Security Analytics. Developed with luv by Data Scientists & Security Analysts from the Cyber Security Industry. [PRE-ALPHA]

Pysparkling ⭐ 253

A pure Python implementation of Apache Spark's RDD and DStream interfaces.

Bisheserver ⭐ 242

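This system is my graduation project, titled "Design and Implementation of a Movie Recommendation System Based on User Profiles". It mainly uses Django as…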

Dbldatagen ⭐ 234

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines

Learningapachespark ⭐ 233

LearningApacheSpark

Installations_mac_ubuntu_windows ⭐ 233

Installations for Data Science. Anaconda, RStudio, Spark, TensorFlow, AWS (Amazon Web Services).

Data_science_blogs ⭐ 232

A repository to keep track of all the code that I end up writing for my blog posts.

Big Data Processing Framework - Unified Data API or SQL on Any Storage

Intro_ds ⭐ 229

Code to accompany Mastering Data Science from PT press

Joblib Spark ⭐ 226

Joblib Apache Spark Backend

Spark Recommendation Engine ⭐ 222

Ngods Stocks ⭐ 217

New Generation Opensource Data Stack Demo

Recommendationsystem ⭐ 190

Book recommender system using collaborative filtering based on Spark

Mlflow Examples ⭐ 179

Basic and advanced MLflow examples for many ML flavors

1-100 of 733 search results

Copyright 2018-2024 Awesome Open Source. All rights reserved.