Awesome Python Data Science
Probably the best curated list of data science software in Python
Contents
Machine Learning
General Purpose Machine Learning
scikit-learn - Machine learning in Python.
Shogun - Machine learning toolbox.
xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
cuML - RAPIDS Machine Learning Library.
modAL - Modular active learning framework for Python3.
Sparkit-learn - PySpark + scikit-learn = Sparkit-learn.
mlpack - A scalable C++ machine learning library (Python bindings).
dlib - Toolkit for making real world machine learning and data analysis applications in C++ (Python bindings).
MLxtend - Extension and helper modules for Python's data analysis and machine learning libraries.
hyperlearn - 50%+ Faster, 50%+ less RAM usage, GPU support re-written Sklearn, Statsmodels.
Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans.
scikit-multilearn - Multi-label classification for python.
seqlearn - Sequence classification toolkit for Python.
pystruct - Simple structured learning framework for Python.
sklearn-expertsys - Highly interpretable classifiers for scikit learn.
RuleFit - Implementation of the RuleFit algorithm.
metric-learn - Metric learning algorithms in Python.
pyGAM - Generalized Additive Models in Python.
Karate Club - An unsupervised machine learning library for graph structured data.
Little Ball of Fur - A library for sampling graph structured data.
causalml - Uplift modeling and causal inference with machine learning algorithms.
Deepchecks - Validation & testing of ML models and data during model development, deployment, and production.
Automated Machine Learning
TPOT - Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
auto-sklearn - An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
MLBox - A powerful Automated Machine Learning python library.
Ensemble Methods
ML-Ensemble - High performance ensemble learning.
Stacking - Simple and useful stacking library, written in Python.
stacked_generalization - Library for machine learning stacking generalization.
vecstack - Python package for stacking (machine learning technique).
Imbalanced Datasets
imbalanced-learn - Module to perform under sampling and over sampling with various techniques.
imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data.
Random Forests
Extreme Learning Machine
Python-ELM - Extreme Learning Machine implementation in Python.
Python Extreme Learning Machine (ELM) - A machine learning technique used for classification/regression tasks.
hpelm - High performance implementation of Extreme Learning Machines (fast randomized neural networks).
Kernel Methods
pyFM - Factorization machines in python.
fastFM - A library for Factorization Machines.
tffm - TensorFlow implementation of an arbitrary order Factorization Machine.
liquidSVM - An implementation of SVMs.
scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API.
ThunderSVM - A fast SVM Library on GPUs and CPUs.
Gradient Boosting
XGBoost - Scalable, Portable and Distributed Gradient Boosting.
LightGBM - A fast, distributed, high performance gradient boosting.
CatBoost - An open-source gradient boosting on decision trees library.
ThunderGBM - Fast GBDTs and Random Forests on GPUs.
Deep Learning
PyTorch
PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration.
torchvision - Datasets, Transforms and Models specific to Computer Vision.
torchtext - Data loaders and abstractions for text and NLP.
torchaudio - An audio library for PyTorch.
ignite - High-level library to help with training neural networks in PyTorch.
PyToune - A Keras-like framework and utilities for PyTorch.
skorch - A scikit-learn compatible neural network library that wraps pytorch.
PyTorchNet - An abstraction to train neural networks.
pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch.
Catalyst - High-level utils for PyTorch DL & RL research.
pytorch_geometric_temporal - Temporal Extension Library for PyTorch Geometric.
TensorFlow
TensorFlow - Computation using data flow graphs for scalable machine learning by Google.
TensorLayer - Deep Learning and Reinforcement Learning Library for Researcher and Engineer.
TFLearn - Deep learning library featuring a higher-level API for TensorFlow.
Sonnet - TensorFlow-based neural network library.
tensorpack - A Neural Net Training Interface on TensorFlow.
Polyaxon - A platform that helps you build, manage and monitor deep learning models.
NeuPy - NeuPy is a Python library for Artificial Neural Networks and Deep Learning (previously: ).
tfdeploy - Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.
tensorflow-upstream - TensorFlow ROCm port.
TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow.
tensorlm - Wrapper library for text generation / language models at char and word level with RNN.
TensorLight - A high-level framework for TensorFlow.
Mesh TensorFlow - Model Parallelism Made Easier.
Ludwig - A toolbox for training and testing deep learning models without the need to write code.
Keras - A high-level neural networks API running on top of TensorFlow.
keras-contrib - Keras community contributions.
Hyperas - Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization.
Elephas - Distributed Deep learning with Keras & Spark.
Hera - Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.
Spektral - Deep learning on graphs.
qkeras - A quantization deep learning library.
MXNet
MXNet - Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
Gluon - A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet).
MXbox - Simple, efficient and flexible vision toolbox for mxnet framework.
gluon-cv - Provides implementations of the state-of-the-art deep learning models in computer vision.
gluon-nlp - NLP made easy.
Xfer - Transfer Learning library for Deep Neural Networks.
MXNet - HIP Port of MXNet.
Others
Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
autograd - Efficiently computes derivatives of numpy code.
Myia - Deep Learning framework (pre-alpha).
nnabla - Neural Network Libraries by Sony.
Caffe - A fast open framework for deep learning.
hipCaffe - The HIP port of Caffe.
DISCONTINUED PROJECTS
Web Scraping
BeautifulSoup - The easiest library for beginners to scrape static websites.
Scrapy - Fast and extensible scraping library. Lets you write rules and create a customized scraper without touching the core.
Selenium - Use the Selenium Python API to access all functionalities of Selenium WebDriver in an intuitive way, like a real user.
Pattern - High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
twitterscraper - Efficient library to scrape Twitter.
Data Manipulation
Data Containers
pandas - Powerful Python data analysis toolkit.
pandas_profiling - Create HTML profiling reports from pandas DataFrame objects
cuDF - GPU DataFrame Library.
blaze - NumPy and pandas interface to Big Data.
pandasql - Allows you to query pandas DataFrames using SQL syntax.
pandas-gbq - pandas Google Big Query.
xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces.
Arctic - High performance datastore for time series and tick data.
datatable - Data.table for Python.
koalas - pandas API on Apache Spark.
modin - Speed up your pandas workflows by changing a single line of code.
swifter - A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.
pandas_flavor - A package that lets you write your own flavor of pandas easily.
pandas-log - A package that provides feedback about basic pandas operations and helps find both business logic and performance issues.
vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
Pipelines
pdpipe - Easy pipelines for pandas DataFrames.
SSPipe - Python pipe (|) operator with support for DataFrames and Numpy and Pytorch.
pandas-ply - Functional data manipulation for pandas.
Dplython - Dplyr for Python.
sklearn-pandas - pandas integration with sklearn.
Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
pyjanitor - Clean APIs for data cleaning.
meza - A Python toolkit for processing tabular data.
Prodmodel - Build system for data science pipelines.
dopanda - Hints and tips for using pandas in an analysis environment.
CircleCI - Automates your software builds, tests, and deployments.
Feature Engineering
General
Featuretools - Automated feature engineering.
skl-groups - A scikit-learn addon to operate on set/"group"-based features.
Feature Forge - A set of tools for creating and testing machine learning features.
few - A feature engineering wrapper for sklearn.
scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.
tsfresh - Automatic extraction of relevant features from time series.
Feature Selection
scikit-feature - Feature selection repository in python.
boruta_py - Implementations of the Boruta all-relevant feature selection method.
BoostARoota - A fast xgboost feature selection algorithm.
scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
Visualization
General Purposes
Matplotlib - Plotting with Python.
seaborn - Statistical data visualization using matplotlib.
prettyplotlib - Painlessly create beautiful matplotlib plots.
python-ternary - Ternary plotting library for python with matplotlib.
missingno - Missing data visualization module for Python.
chartify - Python library that makes it easy for data scientists to create charts.
physt - Improved histograms.
Interactive plots
animatplot - A Python package for animating plots, built on matplotlib.
plotly - A Python library that makes interactive and publication-quality graphs.
Bokeh - Interactive Web Plotting for Python.
Altair - Declarative statistical visualization library for Python. Supports many data transformations directly in the chart code.
bqplot - Plotting library for IPython/Jupyter notebooks.
pyecharts - Migrated from Echarts, a charting and visualization library, to Python's interactive visual drawing library.
Map
folium - Makes it easy to visualize data on an interactive open street map
geemap - Python package for interactive mapping with Google Earth Engine (GEE)
Automatic Plotting
HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
AutoViz - Visualize data automatically with one line of code (ideal for machine learning).
SweetViz - Visualize and compare datasets, target values and associations, with one line of code.
NLP
pyLDAvis - Interactive topic model visualization.
Deployment
datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
binder - Enables sharing and executing Jupyter Notebooks.
fastapi - Modern, fast (high-performance) web framework for building APIs with Python.
streamlit - Makes it easy to deploy machine learning models.
Model Explanation
Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
Alibi - Algorithms for monitoring and explaining machine learning models.
anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
aequitas - Bias and Fairness Audit Toolkit.
Contrastive Explanation - Contrastive Explanation (Foil Trees).
yellowbrick - Visual analysis and diagnostic tools to facilitate machine learning model selection.
scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects.
shap - A unified approach to explain the output of any machine learning model.
ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
Lime - Explaining the predictions of any machine learning classifier.
FairML - A Python toolbox for auditing machine learning models for bias.
L2X - Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
PDPbox - Partial dependence plot toolbox.
pyBreakDown - Python implementation of R package breakDown.
PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
Skater - Python Library for Model Interpretation.
model-analysis - Model analysis tools for TensorFlow.
themis-ml - A library that implements fairness-aware machine learning algorithms.
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
AI Explainability 360 - Interpretability and explainability of data and machine learning models.
Auralisation - Auralisation of learned features in CNN (for audio).
CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
lucid - A collection of infrastructure and tools for research in neural network interpretability.
Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
FlashLight - Visualization Tool for your NeuralNetwork.
tensorboard-pytorch - Tensorboard for pytorch (and chainer, mxnet, numpy, ...).
mxboard - Logging MXNet data for visualization in TensorBoard.
Reinforcement Learning
OpenAI Gym - A toolkit for developing and comparing reinforcement learning algorithms.
Coach - Easy experimentation with state of the art Reinforcement Learning algorithms.
garage - A toolkit for reproducible reinforcement learning research.
OpenAI Baselines - High-quality implementations of reinforcement learning algorithms.
Stable Baselines - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
RLlib - Scalable Reinforcement Learning.
Horizon - A platform for Applied Reinforcement Learning.
TF-Agents - A library for Reinforcement Learning in TensorFlow.
TensorForce - A TensorFlow library for applied reinforcement learning.
TRFL - TensorFlow Reinforcement Learning.
Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
keras-rl - Deep Reinforcement Learning for Keras.
ChainerRL - A deep reinforcement learning library built on top of Chainer.
Probabilistic Methods
pomegranate - Probabilistic and graphical models for Python.
pyro - A flexible, scalable deep probabilistic programming library built on PyTorch.
ZhuSuan - Bayesian Deep Learning.
PyMC - Bayesian Stochastic Modelling in Python.
PyMC3 - Python package for Bayesian statistical modeling and Probabilistic Machine Learning.
sampled - Decorator for reusable models in PyMC3.
Edward - A library for probabilistic modeling, inference, and criticism.
InferPy - Deep Probabilistic Modelling Made Easy.
GPflow - Gaussian processes in TensorFlow.
PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API.
skggm - Estimation of general graphical models.
pgmpy - A python library for working with Probabilistic Graphical Models.
skpro - Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute.
Aboleth - A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation.
PtStat - Probabilistic Programming and Statistical Inference in PyTorch.
PyVarInf - Bayesian Deep Learning methods with Variational Inference for PyTorch.
emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
hsmmlearn - A library for hidden semi-Markov models with explicit durations.
pyhsmm - Bayesian inference in HSMMs and HMMs.
GPyTorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
MXFusion - Modular Probabilistic Programming on MXNet.
sklearn-crfsuite - A scikit-learn inspired API for CRFsuite.
Genetic Programming
gplearn - Genetic Programming in Python.
DEAP - Distributed Evolutionary Algorithms in Python.
karoo_gp - A Genetic Programming platform for Python with GPU support.
monkeys - A strongly-typed genetic programming framework for Python.
sklearn-genetic - Genetic feature selection module for scikit-learn.
Optimization
Spearmint - Bayesian optimization.
BoTorch - Bayesian optimization in PyTorch.
scikit-opt - Heuristic Algorithms for optimization.
SMAC3 - Sequential Model-based Algorithm Configuration.
Optunity - A library containing various optimizers for hyperparameter tuning.
hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
hyperopt-sklearn - Hyper-parameter optimization for sklearn.
sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn.
sigopt_sklearn - SigOpt wrappers for scikit-learn methods.
Bayesian Optimization - A Python implementation of global optimization with gaussian processes.
SafeOpt - Safe Bayesian Optimization.
scikit-optimize - Sequential model-based optimization with a scipy.optimize interface.
Solid - A comprehensive gradient-free optimization framework written in Python.
PySwarms - A research toolkit for particle swarm optimization in Python.
Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
GPflowOpt - Bayesian Optimization using GPflow.
POT - Python Optimal Transport library.
Talos - Hyperparameter Optimization for Keras Models.
nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).
Time Series
sktime - A unified framework for machine learning with time series.
tslearn - Machine learning toolkit dedicated to time-series data.
tick - Module for statistical learning, with a particular emphasis on time-dependent modelling.
Prophet - Automatic Forecasting Procedure.
PyFlux - Open source time series library for Python.
bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
luminol - Anomaly Detection and Correlation library.
dateutil - Powerful extensions to the standard datetime module.
maya - Makes it very easy to parse a string and change timezones.
Natural Language Processing
NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
CLTK - The Classical Language Toolkit.
gensim - Topic Modelling for Humans.
PSI-Toolkit - A natural language processing toolkit.
pyMorfologik - Python binding for Morfologik.
skift - Scikit-learn wrappers for Python fastText.
Phonemizer - Simple text to phonemes converter for multiple languages.
flair - Very simple framework for state-of-the-art NLP.
spaCy - Industrial-Strength Natural Language Processing.
Computer Audition
librosa - Python library for audio and music analysis.
Yaafe - Audio features extraction.
aubio - A library for audio and music analysis.
Essentia - Library for audio and music analysis, description and synthesis.
LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
Marsyas - Music Analysis, Retrieval and Synthesis for Audio Signals.
muda - A library for augmenting annotated audio data.
madmom - Python audio and music signal processing library.
Computer Vision
OpenCV - Open Source Computer Vision Library.
scikit-image - Image Processing SciKit (Toolbox for SciPy).
imgaug - Image augmentation for machine learning experiments.
imgaug_extension - Additional augmentations for imgaug.
Augmentor - Image augmentation library in Python for machine learning.
albumentations - Fast image augmentation library and easy to use wrapper around other libraries.
Statistics
pandas_summary - Extension to pandas dataframes describe function.
Pandas Profiling - Create HTML profiling reports from pandas DataFrame objects.
statsmodels - Statistical modeling and econometrics in Python.
stockstats - Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
weightedcalcs - A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
Alphalens - Performance analysis of predictive (alpha) stock factors.
Distributed Computing
Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
PySpark - Exposes the Spark programming model to Python.
Veles - Distributed machine learning platform.
Jubatus - Framework and Library for Distributed Online Machine Learning.
DMTK - Microsoft Distributed Machine Learning Toolkit.
PaddlePaddle - PArallel Distributed Deep LEarning.
dask-ml - Distributed and parallel machine learning.
Distributed - Distributed computation in Python.
Experimentation
Sacred - A tool to help you configure, organize, log and reproduce experiments.
Xcessiv - A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling.
Persimmon - A visual dataflow programming language for sklearn.
Ax - Adaptive Experimentation Platform.
Neptune - A lightweight ML experiment tracking, results visualization and management tool.
Evaluation
recmetrics - Library of useful metrics and plots for evaluating recommender systems.
Metrics - Machine learning evaluation metrics.
sklearn-evaluation - Model evaluation made easy: plots, tables and markdown reports.
AI Fairness 360 - Fairness metrics for datasets and ML models, explanations and algorithms to mitigate bias in datasets and models.
Computations
numpy - The fundamental package needed for scientific computing with Python.
Dask - Parallel computing with task scheduling.
bottleneck - Fast NumPy array functions written in C.
CuPy - NumPy-like API accelerated with CUDA.
scikit-tensor - Python library for multilinear algebra and tensor factorizations.
numdifftools - Solve automatic numerical differentiation problems in one or more variables.
quaternion - Add built-in support for quaternions to numpy.
adaptive - Tools for adaptive and parallel sampling of mathematical functions.
Spatial Analysis
GeoPandas - Python tools for geographic data.
PySal - Python Spatial Analysis Library.
Quantum Computing
PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
QML - A Python Toolkit for Quantum Machine Learning.
Conversion
sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
ONNX - Open Neural Network Exchange.
MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
Contributing
Contributions are welcome! 😎
Read the contribution guideline.
License
This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0