Awesome Open Source
Search results for python big data
422 search results found
Spark
⭐
36,829
Apache Spark - A unified analytics engine for large-scale data processing
Data Science Ipython Notebooks
⭐
25,242
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Flink
⭐
22,014
Apache Flink
Cython
⭐
8,401
The most widely used Python to C compiler
Vaex
⭐
7,985
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Catboost
⭐
7,367
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Beam
⭐
7,155
Apache Beam is a unified programming model for Batch and Streaming data processing.
H2o 3
⭐
6,489
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Feast
⭐
4,791
Feature Store for Machine Learning
Arrow Datafusion
⭐
4,041
Apache Arrow DataFusion SQL Query Engine
Koalas
⭐
3,291
Koalas: pandas API on Apache Spark
Blaze
⭐
2,949
NumPy and Pandas interface to Big Data
Dpark
⭐
2,637
Python clone of Spark, a MapReduce-like framework in Python
Img2dataset
⭐
2,629
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20 hours on one machine.
Avro
⭐
2,581
Apache Avro is a data serialization system.
Nakedtensor
⭐
2,471
Bare bone examples of machine learning in TensorFlow
Root
⭐
2,232
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
Ambari
⭐
1,991
Apache Ambari simplifies provisioning, managing, and monitoring of Apache Hadoop clusters.
Spark Py Notebooks
⭐
1,515
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Lakesoul
⭐
1,496
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
Autocrawler
⭐
1,438
Google, Naver multiprocess image web crawler (Selenium)
Scikit Learn Intelex
⭐
1,047
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
Autodl
⭐
999
Automated Deep Learning without ANY human intervention. 1st solution for the AutoDL challenge @ NeurIPS.
Adam
⭐
955
ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.
Arrow Ballista
⭐
930
Apache Arrow Ballista Distributed Query Engine
Coding Now
⭐
925
Study notes, along with eBooks I have read, video resources, and blogs I have bookmarked as worthwhile, ...
Incubator Livy
⭐
819
Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.
Spark Movie Lens
⭐
757
An on-line movie recommender using Spark, Python Flask, and the MovieLens dataset
Visualpython
⭐
706
GUI-based Python code generator for data science, extension to Jupyter Lab, Jupyter Notebook and Google Colab.
Nipype
⭐
702
Workflows and interfaces for neuroimaging packages
Sdc
⭐
645
Numba extension for compiling Pandas data frames, Intel® Scalable Dataframe Compiler
Dataengineeringproject
⭐
644
Example end to end data engineering project.
Oio Sds
⭐
611
High Performance Software-Defined Object Storage for Big Data and AI that supports Amazon S3 and OpenStack Swift
Opendata.cern.ch
⭐
604
Source code for the CERN Open Data portal
Scanner
⭐
602
Efficient video analysis at scale
Courses
⭐
590
Answers for Quizzes & Assignments that I have taken
Listenbrainz Server
⭐
581
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
Eland
⭐
557
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Redislite
⭐
550
Redis in a python module.
Bigartm
⭐
537
Fast topic modeling platform
Bigtop
⭐
532
Bigtop is an Apache Foundation project for Infrastructure Engineers and Data Scientists looking for comprehensive packaging, testing, and configuration of the leading open source big data components.
Decentralized Internet
⭐
485
An SDK/library for decentralized web and distributed computing projects
Conjure Up
⭐
456
Deploying complex solutions, magically.
Datafaker
⭐
377
Datafaker is a large-scale tool for generating test data and flow test data. It fakes data and inserts it into a variety of data sources.
Ustore
⭐
375
Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
Arvados
⭐
351
An open source platform for managing and analyzing biomedical big data
Belajarpython.com
⭐
339
Open Source Indonesian Python Programming Tutorial Site
Uproot3
⭐
313
ROOT I/O in pure Python and NumPy.
100daysofmlcode
⭐
302
My journey to learn and grow in the domain of Machine Learning and Artificial Intelligence by performing the #100DaysofMLCode Challenge. Now supported by bright developers adding their learnings 👍
Video2dataset
⭐
300
Easily create large video datasets from video URLs
Baize
⭐
299
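Baize automated operations and maintenance system: configuration management, network probing, asset management, business management, CMDB, CD, DevOps, job orchestration, ...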
Lithops
⭐
280
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
Selinon
⭐
277
Advanced distributed task flow management on top of Celery
Flink Ml
⭐
265
Machine learning library of Apache Flink
Cc2dataset
⭐
264
Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
Gimel
⭐
230
Big Data Processing Framework - Unified Data API or SQL on Any Storage
Awkward 0.x
⭐
218
Manipulate arrays of complex data structures as easily as NumPy.
Amazon S3 Find And Forget
⭐
217
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Simple It English
⭐
212
Simple-IT-English: smart wordbook from community for community
Keyvi
⭐
210
Keyvi - the key value index. It is an in-memory FST-based data structure highly optimized for size and lookup performance.
Hadoop Attack Library
⭐
200
A collection of pentest tools and resources targeting Hadoop environments
Uproot5
⭐
199
ROOT I/O in pure Python and NumPy.
Predictionio Sdk Python
⭐
198
PredictionIO Python SDK
Aws Etl Orchestrator
⭐
185
A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda.
Athenacli
⭐
184
AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.
Pmaw
⭐
179
A multithreaded Pushshift.io API wrapper for reddit.com comment and submission searches.
Tipdm
⭐
178
TipDM modeling platform, an open-source data mining tool.
Idp
⭐
165
IDP is an open source AI IDE for data scientists and big data engineers.
Juicy Bigdata
⭐
162
🎉🎉🐳 Datawhale Introduction to Big Data Processing tutorial | An introductory course for the big data technology track 🎉🎉
Keyvi
⭐
161
Keyvi - a key value index that powers Cliqz search engine. It is an in-memory FST-based data structure highly optimized for size and lookup performance.
Datasciencevm
⭐
161
Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)
Bigdata Playground
⭐
154
A complete example of a big data application using: Kubernetes (kops/aws), Apache Spark SQL/Streaming/MLlib, Apache Flink, Scala, Python, Apache Kafka, Apache HBase, Apache Parquet, Apache Avro, Apache Storm, Twitter API, MongoDB, NodeJS, Angular, GraphQL
Data Algorithms With Spark
⭐
151
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Geopyspark
⭐
151
GeoTrellis for PySpark
Accelerator
⭐
150
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Bigdata_practice
⭐
140
Big data analysis and visualization practice
Notebook
⭐
140
✍ A record of the computer science knowledge I have learned along the way, aiming to build an AI & CS & SE knowledge system
Pyspark Cheatsheet
⭐
140
PySpark Cheat Sheet - example code to help you learn PySpark and develop apps faster
Verticapy
⭐
133
VerticaPy is a Python library that exposes scikit-like functionality to conduct data science projects on data stored in Vertica, thus taking advantage of Vertica’s speed and built-in analytics and machine learning capabilities.
Incubator Liminal
⭐
131
Apache Liminal's goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production. Liminal provides a Domain Specific Language to build ML workflows on top of Apache Airflow.
Griffon Vm
⭐
129
Griffon Data Science Virtual Machine
Aut
⭐
128
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Python Bigdata
⭐
128
Data science and Big Data with Python
Acousticbrainz Server
⭐
126
The server components for the AcousticBrainz project
Cloud Volume
⭐
117
Read and write Neuroglancer datasets programmatically.
Hazelcast Python Client
⭐
110
Hazelcast Python Client
Frank Kanes Taming Big Data With Apache Spark And Python
⭐
106
Frank Kane's Taming Big Data with Apache Spark and Python, published by Packt
Spark Website
⭐
105
Apache Spark Website
Merlin
⭐
100
Machine Learning for HPC Workflows
Covid19 Sir
⭐
100
CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
Spark With Python
⭐
98
Fundamentals of Spark with Python (using PySpark), code examples
Panoptes
⭐
95
A Global Scale Network Telemetry Ecosystem
Big Data Engineering Coursera Yandex
⭐
91
Big Data for Data Engineers Coursera Specialization from Yandex
Data Competitions
⭐
90
Data competition experience and solutions
Graph_sampling
⭐
89
Graph Sampling is a Python package containing various approaches that sample the original graph according to different sample sizes.
Tianchi Bigdata
⭐
84
A code repository for my Tianchi big data competition.
Clgen
⭐
83
Deep learning program generator
Ar Embeddings
⭐
82
Sentiment Analysis for Arabic Text (tweets, reviews, and standard Arabic) using word2vec
Anovos
⭐
78
Anovos - An open source library for scalable feature engineering using Apache Spark
Cqu_bigdata
⭐
77
Labs and slides (PPT) for the "Big Data Course Group" at the College of Computer Science, Chongqing University
1-100 of 422 search results
Copyright 2018-2023 Awesome Open Source. All rights reserved.