A curated list of awesome resources for practicing data science using Python, including not only libraries, but also links to tutorials, code snippets, blog posts and talks.

pandas - Data structures built on top of numpy.

scikit-learn - Core ML library.

matplotlib - Plotting library.

seaborn - Data visualization library based on matplotlib.

datatile - Basic statistics using `DataFrameSummary(df).summary()`.

pandas_profiling - Descriptive statistics using `ProfileReport`.

sklearn_pandas - Helpful `DataFrameMapper` class.

missingno - Missing data visualization.

rainbow-csv - Plugin to display .csv files with nice colors.

General Jupyter Tricks

Fixing environment: link

Python debugger (pdb) - blog post, video, cheatsheet

cookiecutter-data-science - Project template for data science projects.

nteract - Open Jupyter Notebooks with doubleclick.

papermill - Parameterize and execute Jupyter notebooks, tutorial.

nbdime - Diff two notebook files, Alternative GitHub App: ReviewNB.

RISE - Turn Jupyter notebooks into presentations.

qgrid - Pandas `DataFrame` sorting.

pivottablejs - Drag n drop Pivot Tables and Charts for jupyter notebooks.

itables - Interactive tables in Jupyter.

jupyter-datatables - Interactive tables in Jupyter.

debugger - Visual debugger for Jupyter.

nbcommands - View and search notebooks from terminal.

handcalcs - More convenient way of writing mathematical equations in Jupyter.

notebooker - Productionize and schedule Jupyter Notebooks.

bamboolib - Intuitive GUI for tables.

voila - Turn Jupyter notebooks into standalone web applications.

voila-gridstack - Voila grid layout.

Pandas Tricks

Using df.pipe() (video)

pandasvault - Large collection of pandas tricks.

modin - Parallelization library for faster pandas `DataFrame`.

vaex - Out-of-Core DataFrames.

pandarallel - Parallelize pandas operations.

xarray - Extends pandas to n-dimensional arrays.

swifter - Apply any function to a pandas dataframe faster.

pandas_flavor - Write custom accessors like `.str` and `.dt`.

pandas-log - Find business logic issues and performance issues in pandas.

pandapy - Additional features for pandas.

lux - Dataframe visualization within Jupyter.

dtale - View and analyze Pandas data structures, integrating with Jupyter.

polars - Multi-threaded alternative to pandas.

duckdb - Efficiently run SQL queries on pandas DataFrame.

scikit-learn-intelex - Intel extension for scikit-learn for speed.

drawdata - Quickly draw some points and export them as csv, website.

tqdm - Progress bars for for-loops. Also supports pandas apply().

icecream - Simple debugging output.

loguru - Python logging.

pyprojroot - Helpful `here()` command from R.

intake - Loading datasets made easier, talk.

textract - Extract text from any document.

camelot - Extract text from PDF.

spark - `DataFrame` for big data, cheatsheet, tutorial.

sparkit-learn, spark-deep-learning - ML frameworks for spark.

koalas - Pandas API on Apache Spark.

dask, dask-ml - Pandas `DataFrame` for big data and machine learning library, resources, talk1, talk2, notebooks, videos.

dask-gateway - Managing dask clusters.

turicreate - Helpful `SFrame` class for out-of-memory dataframes.

h2o - Helpful `H2OFrame` class for out-of-memory dataframes.

datatable - Data Table for big data support.

cuDF - GPU DataFrame Library, Intro.

ray - Flexible, high-performance distributed execution framework.

mars - Tensor-based unified framework for large-scale data computation.

bottleneck - Fast NumPy array functions written in C.

bcolz - A columnar data container that can be compressed.

cupy - NumPy-like API accelerated with CUDA.

petastorm - Data access library for parquet files by Uber.

zarr - Distributed numpy arrays.

NVTabular - Feature engineering and preprocessing library for tabular data by nvidia.

tensorstore - Reading and writing large multi-dimensional arrays (Google).

nextflow - Run scripts and workflow graphs in Docker image using Google Life Sciences, AWS Batch, Website.

dsub - Run batch computing tasks in Docker image in the Google Cloud.

ni - Command line tool for big data.

xsv - Command line tool for indexing, slicing, analyzing, splitting and joining CSV files.

csvkit - Another command line tool for CSV files.

csvsort - Sort large csv files.

tsv-utils - Tools for working with CSV files by ebay.

cheat - Make cheatsheets for command line commands.

phik - Correlation between categorical, ordinal and interval variables.

statsmodels - Statistical tests.

linearmodels - Instrumental variable and panel data models.

pingouin - Statistical tests. Pairwise correlation between columns of pandas DataFrame

scipy.stats - Statistical tests.

scikit-posthocs - Statistical post-hoc tests for pairwise multiple comparisons.

Bland-Altman Plot 1, 2 - Plot for agreement between two methods of measurement.

ANOVA, Tutorials: One-way, Two-way, Type 1,2,3 explained.

test_proportions_2indep - Proportion test.

G-Test - Alternative to chi-square test, power_divergence.

torch-two-sample - Friedman-Rafsky Test: Compare two populations based on a multivariate generalization of the runs test. Explanation, Application
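The G-test entry above can be made concrete with a short sketch. The statistic is G = 2 * sum(O * ln(O / E)) over all cells, and on the same counts it is usually close to Pearson's chi-square. The function names and the example counts below are illustrative assumptions, not taken from the linked resources. In practice `scipy.stats.power_divergence` with `lambda_="log-likelihood"` computes the same statistic together with a p-value.

```python
import math


def g_statistic(observed, expected):
    """Return the G statistic: 2 * sum(O * ln(O / E)) over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if e <= 0:
            raise ValueError("expected counts must be positive")
        if o > 0:  # a cell with zero observations contributes 0 in the limit
            total += o * math.log(o / e)
    return 2.0 * total


def pearson_chi2(observed, expected):
    """Return Pearson's chi-square statistic for comparison."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (same totals in both lists).
observed = [30, 50, 20]
expected = [25, 50, 25]

g = g_statistic(observed, expected)
chi2 = pearson_chi2(observed, expected)
print(f"G = {g:.4f}, chi-square = {chi2:.4f}")
```

Both statistics are compared against a chi-square distribution with the same degrees of freedom, so the choice mostly matters for small or very uneven counts.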

Sequential Analysis - Wikipedia.

Treatment Effects Monitoring - Design and Analysis of Clinical Trials PennState.

sequential - Exact Sequential Analysis for Poisson and Binomial Data (R package).

confseq - Uniform boundaries, confidence sequences, and always-valid p-values.

Great Overview over Visualizations

Dependent Probabilities

Null Hypothesis Significance Testing (NHST) and Sample Size Calculation

Correlation

Cohen's d

Confidence Interval

Equivalence, non-inferiority and superiority testing

Bayesian two-sample t test

Distribution of p-values when comparing two groups

Understanding the t-distribution and its normal approximation

Inverse Propensity Weighting

Dealing with Selection Bias By Propensity Based Feature Selection

Modes, Medians and Means: A Unifying Perspective

Using Norms to Understand Linear Regression

Verifying the Assumptions of Linear Models

Mediation and Moderation Intro

Montgomery et al. - How conditioning on post-treatment variables can ruin your experiment and what to do about it

Greenland - Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations

Blume - Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses

Lindeløv - Common statistical tests are linear models

Chatruc - The Central Limit Theorem and its misuse

Al-Saleh - Properties of the Standard Deviation that are Rarely Mentioned in Classrooms

Wainer - The Most Dangerous Equation

Gigerenzer - The Bias Bias in Behavioral Economics

Cook - Estimating the chances of something that hasn’t happened yet
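One well-known estimate for the situation named in the last entry is the "rule of three": if an event has not occurred in n independent trials, 3/n is an approximate 95% upper bound on its probability. Whether the linked post uses exactly this derivation is an assumption. The sketch below compares the approximation with the exact bound obtained from solving (1 - p)^n = 0.05; the function names are hypothetical.

```python
def exact_upper_bound(n, confidence=0.95):
    """Upper bound on p after zero events in n independent trials.

    Solves (1 - p) ** n = 1 - confidence for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def rule_of_three(n):
    """Approximate 95% upper bound: 3 / n, since -ln(0.05) is close to 3."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 3.0 / n


for n in (10, 100, 1000):
    print(n, round(exact_upper_bound(n), 5), round(rule_of_three(n), 5))
```

The approximation is slightly conservative and gets closer to the exact bound as n grows.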

R Epidemics Consortium - Large tool suite for working with epidemiological data (R packages). Github

incidence2 - Computation, handling, visualisation and simple modelling of incidence (R package).

EpiEstim - Estimate time varying instantaneous reproduction number R during epidemics (R package) paper.

researchpy - Helpful `summary_cont()` function for summary statistics (Table 1).

zEpid - Epidemiology analysis package, Tutorial.

tipr - Sensitivity analyses for unmeasured confounders (R package).

Checklist.

pandasgui - GUI for viewing, plotting and analyzing Pandas DataFrames.

janitor - Clean messy column names.

pandera - Data / Schema validation.

impyute - Imputations.

fancyimpute - Matrix completion and imputation algorithms.

imbalanced-learn - Resampling for imbalanced datasets.

tspreprocess - Time series preprocessing: Denoising, Compression, Resampling.

Kaggler - Utility functions (`OneHotEncoder(min_obs=100)`).

pyupset - Visualizing intersecting sets.

pyemd - Earth Mover's Distance / Wasserstein distance, similarity between histograms. OpenCV implementation, POT implementation

littleballoffur - Sampling from graphs.

cleanlab - Machine learning with noisy labels, finding mislabeled data, and uncertainty quantification. Also see awesome list below.

doubtlab - Find bad or noisy labels.

iterative-stratification - Stratification of multilabel data.

Talk

sklearn - Pipeline, examples.

pdpipe - Pipelines for DataFrames.

scikit-lego - Custom transformers for pipelines.

skoot - Pipeline helper functions.

categorical-encoding - Categorical encoding of variables, vtreat (R package).

dirty_cat - Encoding dirty categorical variables.

patsy - R-like syntax for statistical models.

mlxtend - LDA.

featuretools - Automated feature engineering, example.

tsfresh - Time series feature engineering.

pypeln - Concurrent data pipelines.

feature_engine - Encoders, transformers, etc.

NVTabular - Feature engineering and preprocessing library for tabular data by nvidia.

Fiji - General purpose tool. Image viewer and image processing package.

napari - Multi-dimensional image viewer.

fiftyone - Viewer and tool for building high-quality datasets and computer vision models.

DivNoising - Unsupervised denoising method.

aydin - Image denoising.

unprocessing - Image denoising by reverting the image processing pipeline.

jump-cellpainting - Cellpainting dataset.

MedMNIST - Datasets for 2D and 3D Biomedical Image Classification.

CytoImageNet - Huge diverse dataset like ImageNet but for cell images.

cellpose dataset - Cell images.

Haghighi - Gene Expression and Morphology Profiles.

broadinstitute/lincs-profiling-complementarity - Cellpainting vs. L1000 assay.

Awesome Cytodata

BD Spectrum Viewer - Calculate spectral overlap, bleed through for fluorescence microscopy dyes.

Tree of Microscopy - Review of cell segmentation algorithms, Paper.

cellpose - Cell segmentation. Paper, Dataset.

skimage - Illumination correction (CLAHE).

cidre - Illumination correction method for optical microscopy.

BaSiCPy - Background and Shading Correction of Optical Microscopy Images, BaSiC.

ashlar - Whole-slide microscopy image stitching and registration.

CSBDeep - Image denoising, restoration and object detection, Project page.

mcmicro - Multiple-choice microscopy pipeline, Paper.

UnMicst - Identifying Cells and Segmenting Tissue.

stardist - Object Detection with Star-convex Shapes.

nnUnet - 3D biomedical image segmentation.

atomai - Deep and Machine Learning for Microscopy.

allencell - Tools for the 3D segmentation of intracellular structures.

Tran - A benchmark of batch-effect correction methods for single-cell RNA sequencing data, Code.

R Tutorial on correcting batch effects.

harmonypy - Fuzzy k-means and locally linear adjustments.

pyliger - Batch-effect correction, Example, R package.

nimfa - Nonnegative matrix factorization.

scgen - Batch removal. Doc.

CORAL - Correcting for Batch Effects Using Wasserstein Distance, Code, Paper.

adapt - Awesome Domain Adaptation Python Toolbox.

pytorch-adapt - Various neural network models for domain adaptation.

skimage - Regionprops: area, eccentricity, extent.

mahotas - Zernike, Haralick, LBP, and TAS features.

pyradiomics - Radiomics features from medical imaging.

pyefd - Elliptical feature descriptor, approximating a contour with a Fourier series.

Overview Paper, Talk, Repo

Blog post series - 1, 2, 3, 4

Tutorials - 1, 2

sklearn - Feature selection.

eli5 - Feature selection using permutation importance.

scikit-feature - Feature selection algorithms.

stability-selection - Stability selection.

scikit-rebate - Relief-based feature selection algorithms.

scikit-genetic - Genetic feature selection.

boruta_py - Feature selection, explanation, example.

Boruta-Shap - Boruta feature selection algorithm + shapley values.

linselect - Feature selection package.

mlxtend - Exhaustive feature selection.

BoostARoota - Xgboost feature selection algorithm.

INVASE - Instance-wise Variable Selection using Neural Networks.

SubTab - Subsetting Features of Tabular Data for Self-Supervised Representation Learning, AstraZeneca.

mrmr - Maximum Relevance and Minimum Redundancy Feature Selection, Website.

arfs - All Relevant Feature Selection.

VSURF - Variable Selection Using Random Forests (R package) doc.

FeatureSelectionGA - Feature Selection using Genetic Algorithm.

apricot - Selecting subsets of data sets to train machine learning models quickly.

ducks - Index data for fast lookup by any combination of fields.

Check also the Clustering section and self-supervised learning section for ideas!

Review

PCA - link

Autoencoder - link

Isomaps - link

LLE - link

Force-directed graph drawing - link

MDS - link

Diffusion Maps - link

t-SNE - link

NeRV - link, paper

MDR - link

UMAP - link

Random Projection - link

Ivis - link

SimCLR - link

esvit - Vision Transformers for Representation Learning (Microsoft).

MCML - Semi-supervised dimensionality reduction of Multi-Class, Multi-Label data (sequencing data) paper.

Dangers of PCA (paper).

Talk, tsne intro.

sklearn.manifold and sklearn.decomposition - PCA, t-SNE, MDS, Isomaps and others.

Additional plots for PCA - Factor Loadings, Cumulative Variance Explained, Correlation Circle Plot, Tweet

sklearn.random_projection - Johnson-Lindenstrauss lemma, Gaussian random projection, Sparse random projection.

sklearn.cross_decomposition - Partial least squares, supervised estimators for dimensionality reduction and regression.

prince - Dimensionality reduction, factor analysis (PCA, MCA, CA, FAMD).

Faster t-SNE implementations: lvdmaaten, MulticoreTSNE, FIt-SNE

umap - Uniform Manifold Approximation and Projection, talk, explorer, explanation, parallel version.

humap - Hierarchical UMAP.

sleepwalk - Explore embeddings, interactive visualization (R package).

somoclu - Self-organizing map.

scikit-tda - Topological Data Analysis, paper, talk, talk, paper.

giotto-tda - Topological Data Analysis.

ivis - Dimensionality reduction using Siamese Networks.

trimap - Dimensionality reduction using triplets.

scanpy - Force-directed graph drawing, Diffusion Maps.

direpack - Projection pursuit, Sufficient dimension reduction, Robust M-estimators.

DBS - DatabionicSwarm (R package).

contrastive - Contrastive PCA.

scPCA - Sparse contrastive PCA (R package).

tmap - Visualization library for large, high-dimensional data sets.

lollipop - Linear Optimal Low Rank Projection.

linearsdr - Linear Sufficient Dimension Reduction (R package).

PHATE - Tool for visualizing high dimensional data.

iterative-stratification - Cross validators with stratification for multilabel data.

livelossplot - Live training loss plot in Jupyter Notebook.

All charts, Austrian monuments.

Better heatmaps and correlation plots.

Example notebooks for interactive visualizations (Plotly, Seaborn, Holoviz, Altair)

cufflinks - Dynamic visualization library, wrapper for plotly, medium, example.

physt - Better histograms, talk, notebook.

fast-histogram - Fast histograms.

matplotlib_venn - Venn diagrams, alternative.

joypy - Draw stacked density plots (=ridge plots), Ridge plots in seaborn.

mosaic plots - Categorical variable visualization, example.

scikit-plot - ROC curves and other visualizations for ML models.

yellowbrick - Visualizations for ML models (similar to scikit-plot).

bokeh - Interactive visualization library, Examples, Examples.

lets-plot - Plotting library.

animatplot - Animate plots built on matplotlib.

plotnine - ggplot for Python.

altair - Declarative statistical visualization library.

bqplot - Plotting library for IPython/Jupyter Notebooks.

hvplot - High-level plotting library built on top of holoviews.

dtreeviz - Decision tree visualization and model interpretation.

chartify - Generate charts.

VivaGraphJS - Graph visualization (JS package).

pm - Navigable 3D graph visualization (JS package), example.

python-ternary - Triangle plots.

falcon - Interactive visualizations for big data.

hiplot - High dimensional Interactive Plotting.

visdom - Live Visualizations.

mpl-scatter-density - Scatter density plots. Alternative to 2d-histograms.

ComplexHeatmap - Complex heatmaps for multidimensional genomic data (R package).

largeVis - Visualize embeddings (t-SNE etc.) (R package).

proplot - Matplotlib wrapper.

morpheus - Matrix visualization and analysis software by the Broad Institute. Source, Tutorial: 1, 2, Code.

palettable - Color palettes from colorbrewer2.

colorcet - Collection of perceptually uniform colormaps.

Named Colors Wheel - Color wheel for all named HTML colors.

superset - Dashboarding solution by Apache.

streamlit - Dashboarding solution. Resources, Gallery Components, bokeh-events.

mercury - Convert Python notebook to web app, Example.

dash - Dashboarding solution by plot.ly. Resources.

visdom - Dashboarding library by facebook.

panel - Dashboarding solution.

altair example - Video.

voila - Turn Jupyter notebooks into standalone web applications.

voila-gridstack - Voila grid layout.

gradio - Create UIs for your machine learning model.

samplics - Sampling techniques for complex survey designs.

folium - Plot geographical maps using the Leaflet.js library, jupyter plugin.

gmaps - Google Maps for Jupyter notebooks.

stadiamaps - Plot geographical maps.

datashader - Draw millions of points on a map.

sklearn - BallTree, Example.

pynndescent - Nearest neighbor descent for approximate nearest neighbors.

geocoder - Geocoding of addresses, IP addresses.

Conversion of different geo formats: talk, repo

geopandas - Tools for geographic data

Low Level Geospatial Tools (GEOS, GDAL/OGR, PROJ.4)

Vector Data (Shapely, Fiona, Pyproj)

Raster Data (Rasterio)

Plotting (Descartes, Cartopy)

Predict economic indicators from Open Street Map ipynb.

PySal - Python Spatial Analysis Library.

geography - Extract countries, regions and cities from a URL or text.

cartogram - Distorted maps based on population.

Examples: 1, 2, 2-ipynb, 3.

surprise - Recommender, talk.

turicreate - Recommender.

implicit - Fast Collaborative Filtering for Implicit Feedback Datasets.

spotlight - Deep recommender models using PyTorch.

lightfm - Recommendation algorithms for both implicit and explicit feedback.

funk-svd - Fast SVD.

pywFM - Factorization.

Intro to Decision Trees and Random Forests, Intro to Gradient Boosting 1, 2, Decision Tree Visualization

lightgbm - Gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, doc.

xgboost - Gradient boosting (GBDT, GBRT or GBM) library, doc, Methods for CIs: link1, link2.

catboost - Gradient boosting.

h2o - Gradient boosting and general machine learning framework.

snapml - Gradient boosting and general machine learning framework by IBM, for CPU and GPU. PyPI

pycaret - Wrapper for xgboost, lightgbm, catboost etc.

thundergbm - GBDTs and Random Forest.

h2o - Gradient boosting.

forestci - Confidence intervals for random forests.

scikit-garden - Quantile Regression.

grf - Generalized random forest.

dtreeviz - Decision tree visualization and model interpretation.

Nuance - Decision tree visualization.

rfpimp - Feature Importance for RandomForests using Permutation Importance.

Why the default feature importance for random forests is wrong: link

treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.

bartpy - Bayesian Additive Regression Trees.

infiniteboost - Combination of RFs and GBDTs.

merf - Mixed Effects Random Forest for Clustering, video

rrcf - Robust Random Cut Forest algorithm for anomaly detection on streams.

groot - Robust decision trees.

linear-tree - Trees with linear models at the leaves.

talk-nb, nb2, talk.

Text classification Intro, Preprocessing blog post.

gensim - NLP, doc2vec, word2vec, text processing, topic modelling (LSA, LDA), Example, Coherence Model for evaluation.

Embeddings - GloVe ([1], [2]), StarSpace, wikipedia2vec, visualization.

magnitude - Vector embedding utility package.

pyldavis - Visualization for topic modelling.

spaCy - NLP.

NLTK - NLP, helpful `KMeansClusterer` with `cosine_distance`.

pytext - NLP from Facebook.

fastText - Efficient text classification and representation learning.

annoy - Approximate nearest neighbor search.

faiss - Approximate nearest neighbor search.

pysparnn - Approximate nearest neighbor search.

infomap - Cluster (word-)vectors to find topics, example.

datasketch - Probabilistic data structures for large data (MinHash, HyperLogLog).

flair - NLP Framework by Zalando.

stanfordnlp - NLP Library.

Chatistics - Turn Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

textvec - Supervised text vectorization tool.

textdistance - Collection for comparing distances between two or more sequences.

MinCovDet - Robust estimator of covariance, RMPV, Paper, App1, App2.

winsorize - Simple adjustment of outliers.

moderated z-score - Weighted average of z-scores based on Spearman correlation.
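Winsorizing, mentioned in the entries above, replaces the most extreme values with the nearest value that is kept, instead of dropping them. SciPy ships a `winsorize` function in `scipy.stats.mstats`; the pure-Python sketch below only illustrates the idea and is not that implementation. The function signature and the example data are assumptions made for illustration.

```python
def winsorize(values, lower=0.05, upper=0.05):
    """Clamp the lowest and highest fractions of values to the nearest kept value.

    lower and upper are the fractions of observations to clamp on each side.
    """
    if lower < 0 or upper < 0 or lower + upper >= 1:
        raise ValueError("lower and upper must be >= 0 and sum to less than 1")
    n = len(values)
    if n == 0:
        return []
    ordered = sorted(values)
    k_low = int(lower * n)    # number of values clamped at the low end
    k_high = int(upper * n)   # number of values clamped at the high end
    low_cap = ordered[k_low]
    high_cap = ordered[n - 1 - k_high]
    # Preserve the original order of the input.
    return [min(max(v, low_cap), high_cap) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
clipped = winsorize(data, lower=0.1, upper=0.1)
print(clipped)
```

With ten values and 10% on each side, one value is clamped at each end, so the outlier 100 is pulled down to 9 while the sample size stays the same.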

Single cell tutorial.

cellxgene - Interactive explorer for single-cell transcriptomics data.

scanpy - Analyze single-cell gene expression data, tutorial.

besca - Beyond single-cell analysis.

janggu - Deep Learning for Genomics.

gdsctools - Drug responses in the context of the Genomics of Drug Sensitivity in Cancer project, ANOVA, IC50, MoBEM, doc.

See also Microscopy Section above.

Overview over cell segmentation algorithms

python_for_microscopists - Notebooks and associated youtube channel for a variety of image processing tasks.

mahotas - Image processing (Bioinformatics), example.

imagepy - Software package for bioimage analysis.

scimap - Spatial Single-Cell Analysis Toolkit.

CellProfiler - Biological image analysis.

imglyb - Viewer for large images, talk, slides.

microscopium - Unsupervised clustering of images + viewer, talk.

cytokit - Analyzing properties of cells in fluorescent microscopy datasets.

ZeroCostDL4Mic - Deep-Learning in Microscopy.

TDC - Drug Discovery and Development.

DeepPurpose - Deep Learning Based Molecular Modeling and Prediction Toolkit.

mit6874 - Computational Systems Biology: Deep Learning in the Life Sciences.

Talk

cv2 - OpenCV, classical algorithms: Gaussian Filter, Morphological Transformations.

scikit-image - Image processing.

Convolutional Neural Networks for Visual Recognition - Stanford CS class.

ConvNet Shape Calculator - Calculate output dimensions of Conv2D layer.

Great Gradient Descent Article.

Intro to semi-supervised learning.

fast.ai course - Lessons 1-7, Lessons 8-14

Tensorflow without a PhD - Neural Network course by Google.

Feature Visualization: Blog, PPT

Tensorflow Playground

Visualization of optimization algorithms, Another visualization
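The calculation behind the ConvNet Shape Calculator entry above can be sketched in a few lines. The formula used here is the common one for a 2D convolution with explicit padding: out = floor((in + 2 * padding - dilation * (kernel - 1) - 1) / stride) + 1. This is an assumption about conventions; frameworks with "same" padding modes compute the padding differently, and the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    if size + 2 * padding < effective_kernel:
        raise ValueError("kernel does not fit into the padded input")
    return (size + 2 * padding - effective_kernel) // stride + 1


def conv2d_output_shape(height, width, kernel, stride=1, padding=0, dilation=1):
    """Output (height, width) for a square kernel applied to a 2D input."""
    return (
        conv_output_size(height, kernel, stride, padding, dilation),
        conv_output_size(width, kernel, stride, padding, dilation),
    )


# A 7x7 kernel with stride 2 and padding 3 halves a 224x224 input.
print(conv2d_output_shape(224, 224, kernel=7, stride=2, padding=3))
# A 3x3 kernel with stride 1 and padding 1 keeps the size unchanged.
print(conv2d_output_shape(32, 32, kernel=3, stride=1, padding=1))
```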

cutouts-explorer - Image Viewer.

imgaug - More sophisticated image preprocessing.

Augmentor - Image augmentation library.

keras preprocessing - Preprocess images.

albumentations - Wrapper around imgaug and other libraries.

augmix - Image augmentation from Google.

kornia - Image augmentation, feature extraction and loss functions.

augly - Image, audio, text, video augmentation from Facebook.

SegLoss - List of loss functions for medical image segmentation.

rational_activations - Rational activation functions.

ktext - Utilities for pre-processing text for deep learning in Keras.

textgenrnn - Ready-to-use LSTM for text generation.

ctrl - Text generation.

OpenMMLab - Framework for segmentation, classification and lots of other computer vision tasks.

caffe - Deep learning framework, pretrained models.

mxnet - Deep learning framework, book.

keras - Neural Networks on top of tensorflow, examples.

keras-contrib - Keras community contributions.

keras-tuner - Hyperparameter tuning for Keras.

hyperas - Keras + Hyperopt: Convenient hyperparameter optimization wrapper.

elephas - Distributed Deep learning with Keras & Spark.

tflearn - Neural Networks on top of tensorflow.

tensorlayer - Neural Networks on top of tensorflow, tricks.

tensorforce - Tensorflow for applied reinforcement learning.

autokeras - AutoML for deep learning.

PlotNeuralNet - Plot neural networks.

lucid - Neural network interpretability, Activation Maps.

tcav - Interpretability method.

AdaBound - Optimizer that trains as fast as Adam and as good as SGD, alt.

foolbox - Adversarial examples that fool neural networks.

hiddenlayer - Training metrics.

imgclsmob - Pretrained models.

netron - Visualizer for deep learning and machine learning models.

ffcv - Fast dataloader.

Good Pytorch Introduction

skorch - Scikit-learn compatible neural network library that wraps pytorch, talk, slides.

fastai - Neural Networks in pytorch.

timm - Pytorch image models.

ignite - High-level library for pytorch.

torchcv - Deep Learning in Computer Vision.

pytorch-optimizer - Collection of optimizers for pytorch.

pytorch-lightning - Wrapper around PyTorch.

lightly - MoCo, SimCLR, SimSiam, Barlow Twins, BYOL, NNCLR.

MONAI - Deep learning in healthcare imaging.

kornia - Image transformations, epipolar geometry, depth estimation.

torchinfo - Nice model summary.

lovely-tensors - Inspect tensors, mean, std, inf values.

flexflow - Distributed TensorFlow Keras and PyTorch.

horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Awesome List.

netron - Viewer for neural networks.

visualkeras - Visualize Keras networks.

Good Yolo Explanation

segmentation_models - Segmentation models with pretrained backbones: Unet, FPN, Linknet, PSPNet.

yolact - Fully convolutional model for real-time instance segmentation.

EfficientDet Pytorch, EfficientDet Keras - Scalable and Efficient Object Detection.

detectron2 - Object Detection (Mask R-CNN) by Facebook.

simpledet - Object Detection and Instance Recognition.

CenterNet - Object detection.

FCOS - Fully Convolutional One-Stage Object Detection.

norfair - Real-time 2D object tracking.

Detic - Detector with image classes that can use image-level labels (facebookresearch).

EasyCV - Image segmentation, classification, metric-learning, object detection, pose estimation.

cvat - Image annotation tool.

pigeon - Create annotations from within a Jupyter notebook.

nfnets - Neural network.

efficientnet - Neural network.

pycls - Pytorch image classification networks: ResNet, ResNeXt, EfficientNet, and RegNet (by Facebook).

SPADE - Semantic Image Synthesis.

Entity Embeddings of Categorical Variables, code, kaggle

Image Super-Resolution - Super-scaling using a Residual Dense Network.

Cell Segmentation - Talk, Blog Posts: 1, 2

deeplearning-models - Deep learning models.

Variational Autoencoder Explanation Video

disentanglement_lib - BetaVAE, FactorVAE, BetaTCVAE, DIP-VAE.

ladder-vae-pytorch - Ladder Variational Autoencoders (LVAE).

benchmark_VAE - Unifying Generative Autoencoder implementations.

Awesome GAN Applications

The GAN Zoo - List of Generative Adversarial Networks.

CycleGAN and Pix2pix - Various image-to-image tasks.

Tensorflow GAN implementations

Pytorch GAN implementations

Pytorch GAN implementations

StudioGAN - Pytorch GAN implementations.

SegFormer - Simple and Efficient Design for Semantic Segmentation with Transformers.

esvit - Efficient self-supervised Vision Transformers.

nystromformer - More efficient transformer because of approximate self-attention.

Great overview for deep learning for tabular data

How to do Deep Learning on Graphs with Graph Convolutional Networks

Introduction To Graph Convolutional Networks

An attempt at demystifying graph deep learning

ogb - Open Graph Benchmark, Benchmark datasets.

networkx - Graph library.

cugraph - RAPIDS, Graph library on the GPU.

pytorch-geometric - Various methods for deep learning on graphs.

dgl - Deep Graph Library.

graph_nets - Build graph networks in Tensorflow, by deepmind.

hummingbird - Compile trained ML models into tensor computations (by Microsoft).

cuML - RAPIDS, Run traditional tabular ML tasks on GPUs, Intro.

thundergbm - GBDTs and Random Forest.

thundersvm - Support Vector Machines.

Legate Numpy - Distributed Numpy arrays using multiple GPUs by Nvidia (not released yet), video.

Understanding SVM Regression: slides, forum, paper

pyearth - Multivariate Adaptive Regression Splines (MARS), tutorial.

pygam - Generalized Additive Models (GAMs), Explanation.

GLRM - Generalized Low Rank Models.

tweedie - Specialized distribution for zero inflated targets, Talk.

MAPIE - Estimating prediction intervals.

Regressio - Regression and Spline models.

orthopy - Orthogonal polynomials in all shapes and sizes.

Talk, Notebook

Blog post: Probability Scoring

All classification metrics

DESlib - Dynamic classifier and ensemble selection.

human-learn - Create and tune classifier based on your rule set.

Contrastive Representation Learning

metric-learn - Supervised and weakly-supervised metric learning algorithms.

pytorch-metric-learning - Pytorch metric learning.

deep_metric_learning - Methods for deep metric learning.

ivis - Metric learning using siamese neural networks.

tensorflow similarity - Metric learning.

scipy.spatial - All kinds of distance metrics.

pyemd - Earth Mover's Distance / Wasserstein distance, similarity between histograms. OpenCV implementation, POT implementation

dcor - Distance correlation and related Energy statistics.

GeomLoss - Kernel norms, Hausdorff divergences, Debiased Sinkhorn divergences (=approximation of Wasserstein distance).

lightly - MoCo, SimCLR, SimSiam, Barlow Twins, BYOL, NNCLR.

vissl - Self-Supervised Learning with PyTorch: RotNet, Jigsaw, NPID, ClusterFit, PIRL, SimCLR, MoCo, DeepCluster, SwAV.

Overview of clustering algorithms applied to image data (= Deep Clustering).

Clustering with Deep Learning: Taxonomy and New Methods.

Hierarchical Cluster Analysis (R Tutorial) - Dendrogram, Tanglegram

hdbscan - Clustering algorithm, talk, blog.

pyclustering - All sorts of clustering algorithms.

FCPS - Fundamental Clustering Problems Suite (R package).

GaussianMixture - Generalized k-means clustering using a mixture of Gaussian distributions, video.

nmslib - Similarity search library and toolkit for evaluation of k-NN methods.

buckshotpp - Outlier-resistant and scalable clustering algorithm.

merf - Mixed Effects Random Forest for Clustering, video

tree-SNE - Hierarchical clustering algorithm based on t-SNE.

MiniSom - Pure Python implementation of the Self Organizing Maps.

distribution_clustering, paper, related paper, alt.

phenograph - Clustering by community detection.

FastPG - Clustering of single cell data (RNA). Improvement of phenograph, Paper.

HypHC - Hyperbolic Hierarchical Clustering.

BanditPAM - Improved k-Medoids Clustering.

dendextend - Comparing dendrograms (R package).

DeepDPM - Deep Clustering With An Unknown Number of Clusters.

Wagner, Wagner - Comparing Clusterings - An Overview

- Adjusted Rand Index
- Normalized Mutual Information
- Adjusted Mutual Information
- Fowlkes-Mallows Score
- Silhouette Coefficient
- Variation of Information, Julia
- Pair Confusion Matrix
- Consensus Score - The similarity of two sets of biclusters.
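Several of the measures listed above are built from counts of point pairs that two clusterings treat the same way. As an illustration, here is a minimal pure-Python sketch of the Adjusted Rand Index computed from the contingency table of two labelings. The function name is hypothetical and the handling of degenerate inputs is a design choice of this sketch; `sklearn.metrics.adjusted_rand_score` is the usual tool in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two points to form a pair")

    # Pairs placed together in both clusterings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, renamed labels
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # partitions that disagree
```

The score depends only on which points share a cluster, so renaming the labels does not change it.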

Assessing the quality of a clustering (video)

fpc - Various methods for clustering and cluster validation (R package).

- Minimum distance between any two clusters
- Distance between centroids
- p-separation index: Like minimum distance. Look at the average distance to nearest point in different cluster for p=10% "border" points in any cluster. Measuring density, measuring mountains vs valleys
- Estimate density by weighted count of close points

Other measures:

- Within-cluster average distance
- Mean of within-cluster average distance over nearest-cluster average distance (silhouette score)
- Within-cluster similarity measure to normal/uniform
- Within-cluster (squared) distance to centroid (this is the k-Means loss function)
- Correlation coefficient between the original distances and the distances induced by the clustering (Hubert's Gamma)
- Entropy of cluster sizes
- Average largest within-cluster gap
- Variation of clusterings on bootstrapped data
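Two of the measures above are simple enough to sketch without any library: the within-cluster squared distance to the centroid (the k-means loss) and the entropy of cluster sizes. The function names and the toy points are assumptions made for illustration; points are plain tuples of coordinates.

```python
import math
from collections import Counter, defaultdict


def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    sse = 0.0
    for members in clusters.values():
        dims = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        for p in members:
            sse += sum((p[d] - centroid[d]) ** 2 for d in range(dims))
    return sse


def cluster_size_entropy(labels):
    """Shannon entropy (natural log) of the cluster size distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


# Two tight, equally sized clusters that are far apart.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

sse = within_cluster_sse(points, labels)
entropy = cluster_size_entropy(labels)
print(sse, entropy)
```

A lower within-cluster sum of squares means tighter clusters, while the entropy is highest when all clusters have the same size and drops toward zero when one cluster dominates.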

scikit-multilearn - Multi-label classification, talk.

Stanford Lecture Series on Fourier Transformation, Youtube, Lecture Notes.

Visual Fourier explanation.

The Scientist & Engineer's Guide to Digital Signal Processing (1999).

Kalman Filter article.

Kalman Filter book - Focuses on intuition using Jupyter Notebooks. Includes Bayesian and various Kalman filters.

Interactive Tool for FIR and IIR filters, Examples.

filterpy - Kalman filtering and optimal estimation library.

geomstats - Computations and statistics on manifolds with geometric structures.

statsmodels - Time series analysis, seasonal decompose example, SARIMA, granger causality.

kats - Time series prediction library by Facebook.

prophet - Time series prediction library by Facebook.

neural_prophet - Time series prediction built on Pytorch.

pyramid, pmdarima - Wrapper for (Auto-) ARIMA.

modeltime - Time series forecasting framework (R package).

pyflux - Time series prediction algorithms (ARIMA, GARCH, GAS, Bayesian).

atspy - Automated Time Series Models.

pm-prophet - Time series prediction and decomposition library.

htsprophet - Hierarchical Time Series Forecasting using Prophet.

nupic - Hierarchical Temporal Memory (HTM) for Time Series Prediction and Anomaly Detection.

tensorflow - LSTM and others, examples: link, link, link, Explain LSTM, seq2seq: 1, 2, 3, 4

tspreprocess - Preprocessing: Denoising, Compression, Resampling.

tsfresh - Time series feature engineering.

tsfel - Time series feature extraction.

thunder - Data structures and algorithms for loading, processing, and analyzing time series data.

gatspy - General tools for Astronomical Time Series, talk.

gendis - shapelets, example.

tslearn - Time series clustering and classification, `TimeSeriesKMeans`.

pastas - Simulation of time series.

fastdtw - Dynamic Time Warp Distance.

fable - Time Series Forecasting (R package).

pydlm - Bayesian time series modeling (R package, Blog post)

PyAF - Automatic Time Series Forecasting.

luminol - Anomaly Detection and Correlation library from Linkedin.

matrixprofile-ts - Detecting patterns and anomalies, website, ppt, alternative.

stumpy - Another matrix profile library.

obspy - Seismology package. Useful `classic_sta_lta` function.

RobustSTL - Robust Seasonal-Trend Decomposition.

seglearn - Time Series library.

pyts - Time series transformation and classification, Imaging time series.

Turn time series into images and use Neural Nets: example, example.

sktime, sktime-dl - Toolbox for (deep) learning with time series.

adtk - Time Series Anomaly Detection.

rocket - Time Series classification using random convolutional kernels.

luminaire - Anomaly Detection for time series.

etna - Time Series library.

Chaos Genius - ML powered analytics engine for outlier/anomaly detection and root cause analysis.

TimeSeriesSplit - Sklearn time series split.

tscv - Evaluation with gap.
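The idea behind time series splits with a gap can be sketched in a few lines: training data always precedes the test fold, and a few samples between them are skipped so that lagged features do not leak into evaluation. The generator below is an illustrative stand-in with a hypothetical name, not the implementation used by either library above.

```python
def expanding_window_splits(n_samples, n_splits, test_size, gap=0):
    """Yield (train_indices, test_indices) pairs in time order.

    The training window grows with every split; `gap` samples directly
    before each test fold are left out of training.
    """
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("not enough samples for this many splits")
        train = list(range(train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


if __name__ == "__main__":
    for train, test in expanding_window_splits(10, n_splits=3, test_size=2, gap=1):
        print(train, test)
```

Shuffled cross-validation would mix future observations into the training folds, which is the failure mode this kind of split is meant to avoid.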

Tutorial on using cvxpy: 1, 2

pandas-datareader - Read stock data.

yfinance - Read stock data from Yahoo Finance.

findatapy - Read stock data from various sources.

ta - Technical analysis library.

backtrader - Backtesting for trading strategies.

surpriver - Find high moving stocks before they move using anomaly detection and machine learning.

ffn - Financial functions.

bt - Backtesting algorithms.

alpaca-trade-api-python - Commission-free trading through API.

eiten - Eigen portfolios, minimum variance portfolios and other algorithmic investing strategies.

tf-quant-finance - Quantitative finance tools in tensorflow, by Google.

quantstats - Portfolio management.

Riskfolio-Lib - Portfolio optimization and strategic asset allocation.

OpenBBTerminal - Terminal.

mplfinance - Financial markets data visualization.

pyfolio - Portfolio and risk analytics.

zipline - Algorithmic trading.

alphalens - Performance analysis of predictive stock factors.

empyrical - Financial risk metrics.

trading_calendars - Calendars for various securities exchanges.

Time-dependent Cox Model in R.

lifelines - Survival analysis, Cox PH Regression, talk, talk2.

scikit-survival - Survival analysis.

xgboost - `"objective": "survival:cox"`

NHANES example

survivalstan - Survival analysis, intro.

convoys - Analyze time lagged conversions.

RandomSurvivalForests (R packages: randomForestSRC, ggRandomForests).

pysurvival - Survival analysis.

DeepSurvivalMachines - Fully Parametric Survival Regression.

auton-survival - Regression, Counterfactual Estimation, Evaluation and Phenotyping with Censored Time-to-Events.

sklearn - Isolation Forest and others.

pyod - Outlier Detection / Anomaly Detection.

eif - Extended Isolation Forest.

AnomalyDetection - Anomaly detection (R package).

luminol - Anomaly Detection and Correlation library from Linkedin.

Distances for comparing histograms and detecting outliers - Talk: Kolmogorov-Smirnov, Wasserstein, Energy Distance (Cramer), Kullback-Leibler divergence.
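As a rough illustration of three of the distances named above, the following pure-Python sketch computes the two-sample Kolmogorov-Smirnov statistic, a one-dimensional Wasserstein distance, and the Kullback-Leibler divergence. The function names are made up for this example; the Wasserstein version assumes two samples of equal size, and the KL version assumes two aligned lists of probabilities.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, v) / len(a) - bisect_right(b, v) / len(b))
        for v in set(a) | set(b)
    )


def wasserstein_1d(sample_a, sample_b):
    """1-D Wasserstein distance for two samples of equal size."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles equal-size samples")
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(x - y) for x, y in pairs) / len(sample_a)


def kl_divergence(p, q):
    """KL(p || q) in nats for two aligned discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    print(ks_statistic([1, 2, 3], [4, 5, 6]))
    print(wasserstein_1d([0, 1], [2, 3]))
    print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
```

Unlike the other two, the KL divergence is not symmetric and becomes infinite when the second distribution has empty bins where the first does not, which matters when comparing sparse histograms.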

banpei - Anomaly detection library based on singular spectrum transformation.

telemanom - Detect anomalies in multivariate time series data using LSTMs.

luminaire - Anomaly Detection for time series.

TorchDrift - Drift Detection for PyTorch Models.

alibi-detect - Algorithms for outlier, adversarial and drift detection.

evidently - Evaluate and monitor ML models from validation to production.

Lipton et al. - Detecting and Correcting for Label Shift with Black Box Predictors.

Bu et al. - A pdf-Free Change Detection Test Based on Density Difference Estimation.

lightning - Large-scale linear classification, regression and ranking.

SLIM - Scoring systems for classification, Supersparse linear integer models.

CS 594 Causal Inference and Learning

Statistical Rethinking - Video Lecture Series, Bayesian Statistics, Causal Models, R, python, numpyro1, numpyro2, tensorflow-probability.

Python Causality Handbook

dowhy - Estimate causal effects.

CausalImpact - Causal Impact Analysis (R package).

causallib - Modular causal inference analysis and model evaluations by IBM, examples.

causalml - Causal inference by Uber.

upliftml - Causal inference by Booking.com.

EconML - Heterogeneous Treatment Effects Estimation by Microsoft.

causality - Causal analysis using observational datasets.

DoubleML - Machine Learning + Causal inference, Tweet, Presentation, Paper.

Bours - Confounding

Bours - Effect Modification and Interaction

Intro, Guide

PyMC3 - Bayesian modelling, intro

numpyro - Probabilistic programming with numpy, built on pyro.

pomegranate - Probabilistic modelling, talk.

pmlearn - Probabilistic machine learning.

arviz - Exploratory analysis of Bayesian models.

zhusuan - Bayesian deep learning, generative models.

edward - Probabilistic modeling, inference, and criticism, Mixture Density Networks (MDNs), MDN Explanation.

Pyro - Deep Universal Probabilistic Programming.

tensorflow probability - Deep learning and probabilistic modelling, talk1, notebook talk1, talk2, example.

bambi - High-level Bayesian model-building interface on top of PyMC3.

neural-tangents - Infinite Neural Networks.

bnlearn - Bayesian networks, parameter learning, inference and sampling methods.

Visualization, Article

GPyOpt - Gaussian process optimization.

GPflow - Gaussian processes (Tensorflow).

gpytorch - Gaussian processes (Pytorch).

Model Stacking Blog Post

mlxtend - `EnsembleVoteClassifier`, `StackingRegressor`, `StackingCVRegressor` for model stacking.

vecstack - Stacking ML models.

StackNet - Stacking ML models.

mlens - Ensemble learning.

combo - Combining ML models (stacking, ensembling).

pycm - Multi-class confusion matrix.

pandas_ml - Confusion matrix.

Plotting learning curve: link.

yellowbrick - Learning curve.

pyroc - Receiver Operating Characteristic (ROC) curves.

awesome-conformal-prediction - Uncertainty quantification.

uncertainty-toolbox - Predictive uncertainty quantification, calibration, metrics, and visualization.

skope-rules - Interpretable classifier, IF-THEN rules.

sklearn-expertsys - Interpretable classifiers, Bayesian Rule List classifier.

Princeton - Reproducibility Crisis in ML‑based Science

Book, Examples

shap - Explain predictions of machine learning models, talk, Good Shap intro.

treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.

lime - Explaining the predictions of any machine learning classifier, talk, Warning (Myth 7).

lime_xgboost - Create LIMEs for XGBoost.

eli5 - Inspecting machine learning classifiers and explaining their predictions.

lofo-importance - Leave One Feature Out Importance, talk, examples: 1, 2, 3.

pybreakdown - Generate feature contribution plots.

pycebox - Individual Conditional Expectation Plot Toolbox.

pdpbox - Partial dependence plot toolbox, example.

partial_dependence - Visualize and cluster partial dependence.

skater - Unified framework to enable model interpretation.

anchor - High-Precision Model-Agnostic Explanations for classifiers.

l2x - Instancewise feature selection as methodology for model interpretation.

contrastive_explanation - Contrastive explanations.

DrWhy - Collection of tools for explainable AI.

lucid - Neural network interpretability.

xai - An eXplainability toolbox for machine learning.

innvestigate - A toolbox to investigate neural network predictions.

dalex - Explanations for ML models (R package).

interpretml - Fit interpretable models, explain models.

shapash - Model interpretability.

imodels - Interpretable ML package.

captum - Model interpretability and understanding for PyTorch.

AdaNet - Automated machine learning based on tensorflow.

tpot - Automated machine learning tool, optimizes machine learning pipelines.

auto_ml - Automated machine learning for analytics & production.

autokeras - AutoML for deep learning.

nni - Toolkit for neural architecture search and hyper-parameter tuning by Microsoft.

automl-gs - Automated machine learning.

mljar - Automated machine learning.

automl_zero - Automatically discover computer programs that can solve machine learning tasks from Google.

AlphaPy - Automated Machine Learning using scikit-learn, xgboost, LightGBM and others.

Karate Club - Unsupervised learning on graphs.

Pytorch Geometric - Graph representation learning with PyTorch.

DLG - Graph representation learning with TensorFlow.

cvxpy - Modeling language for convex optimization problems. Tutorial: 1, 2

deap - Evolutionary computation framework (Genetic Algorithm, Evolution strategies).

evol - DSL for composable evolutionary algorithms, talk.

platypus - Multiobjective optimization.

autograd - Efficiently computes derivatives of numpy code.

nevergrad - Derivative-free optimization.

gplearn - Sklearn-like interface for genetic programming.

blackbox - Optimization of expensive black-box functions.

Optometrist algorithm - paper.

DeepSwarm - Neural architecture search.

evotorch - Evolutionary computation library built on Pytorch.

sklearn - GridSearchCV, RandomizedSearchCV.

sklearn-deap - Hyperparameter search using genetic algorithms.

hyperopt - Hyperparameter optimization.

hyperopt-sklearn - Hyperopt + sklearn.

optuna - Hyperparameter optimization, Talk.

skopt - `BayesSearchCV` for hyperparameter search.

tune - Hyperparameter search with a focus on deep learning and deep reinforcement learning.

hypergraph - Global optimization methods and hyperparameter optimization.

bbopt - Black box hyperparameter optimization.

dragonfly - Scalable Bayesian optimisation.

botorch - Bayesian optimization in PyTorch.

ax - Adaptive Experimentation Platform by Facebook.

lightning-hpo - Hyperparameter optimization based on optuna.

sklearn - PassiveAggressiveClassifier, PassiveAggressiveRegressor.

river - Online machine learning.

Kaggler - Online Learning algorithms.

onelearn - Online Random Forests.

Talk

modAL - Active learning framework.

YouTube, YouTube

Intro to Monte Carlo Tree Search (MCTS) - 1, 2, 3

AlphaZero methodology - 1, 2, 3, Cheat Sheet

RLLib - Library for reinforcement learning.

Horizon - Facebook RL framework.

airflow - Schedule and monitor workflows.

prefect - Python specific workflow scheduling.

dagster - Development, production and observation of data assets.

ploomber - Workflow orchestration.

kestra - Workflow orchestration.

cml - CI/CD for Machine Learning Projects.

rocketry - Task scheduling.

Reduce size of docker images (video)

Optimize Docker Image Size

cog - Facilitates building Docker images.

dephell - Dependency management.

poetry - Dependency management.

pyup - Dependency management.

pypi-timemachine - Install packages with pip as if you were in the past.

dvc - Version control for large files.

hangar - Version control for tensor data.

kedro - Build data pipelines.

feast - Feature store. Video.

pinecone - Database for vector search applications.

truss - Serve ML models.

milvus - Vector database for similarity search.

mlem - Version and deploy your ML models following GitOps principles.

m2cgen - Transpile trained ML models into other languages.

sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.

mlflow - Manage the machine learning lifecycle, including experimentation, reproducibility and deployment.

modelchimp - Experiment Tracking.

skll - Command-line utilities to make it easier to run machine learning experiments.

BentoML - Package and deploy machine learning models for serving in production.

dagster - Tool with focus on dependency graphs.

knockknock - Be notified when your training ends.

metaflow - Lifecycle Management Tool by Netflix.

cortex - Deploy machine learning models.

Neptune - Experiment tracking and model registry.

clearml - Experiment Manager, MLOps and Data-Management.

polyaxon - MLOps.

sematic - Deploy machine learning models.

zenml - MLOps.

All kinds of math and statistics resources

Gilbert Strang - Linear Algebra

Gilbert Strang - Matrix Methods in Data Analysis, Signal Processing, and Machine Learning

daft - Render probabilistic graphical models using matplotlib.

unyt - Working with units.

scrapy - Web scraping library.

VowpalWabbit - ML Toolkit from Microsoft.

Python Record Linkage Toolkit - link records in or between data sources.

more_itertools - Extension of itertools.

funcy - Fancy and practical functional tools.

dateparser - A better date parser.

jellyfish - Approximate string matching.

coloredlogs - Colored logging output.

Distill.pub - Blog.

Machine Learning Videos

Data Science Notebooks

Recommender Systems (Microsoft)

Datascience Cheatsheets

datasharing - Guide to data sharing.

Chan - Introduction to Probability for Data Science

Colonescu - Principles of Econometrics with R

Awesome Adversarial Machine Learning

Awesome AI Bookmarks

Awesome AI on Kubernetes

Awesome Big Data

Awesome Business Machine Learning

Awesome Causality

Awesome Community Detection

Awesome CSV

Awesome Cytodata

Awesome Data Science with Ruby

Awesome Dash

Awesome Decision Trees

Awesome Deep Learning

Awesome ETL

Awesome Financial Machine Learning

Awesome Fraud Detection

Awesome GAN Applications

Awesome Graph Classification

Awesome Industry Machine Learning

Awesome Gradient Boosting

Awesome Learning with Label Noise

Awesome Machine Learning

Awesome Machine Learning Books

Awesome Machine Learning Interpretability

Awesome Machine Learning Operations

Awesome Metric Learning

Awesome Monte Carlo Tree Search

Awesome Neural Network Visualization

Awesome Online Machine Learning

Awesome Pipeline

Awesome Public APIs

Awesome Python

Awesome Python Data Science

Awesome Python Data Science

Awesome Python Data Science

Awesome Pytorch

Awesome Quantitative Finance

Awesome Recommender Systems

Awesome Semantic Segmentation

Awesome Sentence Embedding

Awesome Time Series

Awesome Time Series Anomaly Detection

Awesome Visual Attentions

Awesome Visual Transformer

NYU Deep Learning SP21 - Youtube Playlist.

Color codes

Frequency codes for time series

Date parsing codes

Feature Calculators tsfresh

Do you know a package that should be on this list? Did you spot a package that is no longer maintained and should be removed from this list? Then feel free to read the contribution guidelines and submit your pull request or create a new issue.
