My ML-related material, including notebooks, code, and a curated list of useful resources such as books and software. Almost everything mentioned here is free (as in speech, not as in free food) or open source.
This document is an attempt to come up with a curated list of Machine Learning resources, including books, papers, software, libraries, notebooks, etc. Most of the libraries are for Python, though the rest of the materials here are generally suited for working with data.
Machine Learning for Business: Machine Learning for Business teaches you how to make your company more automated, productive, and competitive by mastering practical, implementable machine learning techniques and tools
Papers with code: A convenient repository of research papers published together with their code; you can find the code for many recent cutting-edge algorithms here
Twitter datasets: A list of datasets related to the social platform Twitter
Apache Zeppelin: A great notebook environment for data visualization and analytics; it can connect to many different databases and data management systems
Numpy: Linear algebra library for fast numerical computation
Scikit Learn: High-level Machine Learning library with tons of features, very easy to use and extensible
Bokeh: An interactive high-level data visualization library
Matplotlib: A compelling data visualization library, more low-level than other visualization libraries
Graph Tool: A fast and powerful library for working with graphs in Python. It's built on top of the Boost C++ libraries, so it's very efficient
NetworkX: A Python module for complex network modelling and analysis. Very easy to use, but can be slow at times because it is written in pure Python
TensorFlow: Low-level library for creating deep artificial neural networks; works on both CPU and GPU. Usually, you use TF together with a higher-level library, such as Keras, that exposes TF's functionality
Keras: "Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano" - Keras's website
NLTK: Swiss Army knife tool for text processing in Python
Pattern: Another good text processing library for Python
Orange: Orange is a general-purpose data mining and analysis tool and library that lets you build machine learning pipelines with just a few drag-and-drop actions
Dask: A fast data manipulation library with out-of-core handling of data, suited for distributed environments; its DataFrame API closely mirrors a large subset of the Pandas API
Scikit-learn Contrib/HDBSCAN: A high-performance implementation of HDBSCAN clustering. HDBSCAN is a robust and easy-to-use clustering algorithm with minimal parameters, ideal for exploratory data analysis; it works as an extension to Scikit-learn
Turi Create: A fast tool/library for simplifying various ML tasks
Scikit-learn-Contrib/Categorical-Encoding: An extension library for Scikit-learn that provides additional categorical feature encoding schemes (e.g., the LeaveOneOut scheme)
Optunity: A library for hyperparameter optimization
Pyro: "Pyro is a universal probabilistic programming language (PPL) written in Python and supported by PyTorch on the backend" - Pyro's website
GEM: A Python library that provides various graph embedding methods like 'node2vec' and 'locally linear embedding'
DynamicGEM: A library like GEM for embedding dynamic graphs
GraphSAGE: A graph embedding framework to generate low-dimensional vector representations for nodes, instrumental if you need to use deep learning on graph data
Horovod: A distributed training framework for TensorFlow, Keras, and PyTorch by Uber
NetLSD: Python implementation of NetLSD, a scalable graph embedding algorithm for representing a graph via a low-dimensional vector
SHAP: A tool for exploring and explaining the outcome of an arbitrary model
MLflow: A software toolbox to manage the workflow and life cycle of ML projects; it aims to make ML software projects easier to implement by providing various helper components for each step
pyGAM: A Python module for building Generalized Additive Models (GAMs)
ggplot: "ggplot is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. It is built for making professionally looking, plots quickly with minimal code" - ggplot's website
Linkpred: A Python package for link prediction on graphs
SparklingGraph: A Python library to process large scale graphs using Spark and GraphX in a distributed manner
Galry: A high-performance visualization library in Python
Dedupe: A Python library for fuzzy entity-resolution and record deduplication
PyText: A deep-learning-based NLP modelling framework built on top of PyTorch
flair: A state-of-the-art NLP framework in Python from Zalando
NearPy: "A Python framework for fast (approximated) nearest neighbour search in large, high-dimensional data sets using different locality-sensitive hashes" - NearPy's description
fastchunking: A (fast) text chunking algorithm implemented in C++ and Python
Vaex: Vaex is a data manipulation library, much like Pandas and Dask, with a lazy out-of-core approach to handling data, so you can work with huge tables
openTSNE: An extensible, parallel implementation of t-SNE
Active Semi-Supervised Clustering: An extension library for Scikit-learn that implements a collection of useful active semi-supervised clustering algorithms
TextDistance: A Python library for calculating and comparing the distance between two sequences (such as text documents) with many algorithms
Ray: A scalable, high-performance distributed execution framework for executing arbitrary Python functions on multiple machines, suitable for many ML workloads
Pyitlib: An open-source library for calculating a useful collection of information-theoretic measures (e.g., entropy) for discrete random variables
KDEpy: A collection of useful kernel density estimators in Python 3.5+
Tsfresh: A Python library for (automatic) feature extraction and engineering on time-dependent data
GPy: A Python library for working with Gaussian processes
Tslearn: A machine learning library dedicated to working with time-dependent data
Ludwig: "Ludwig is a toolbox that allows to train and test deep learning models without the need to write code" - Ludwig's website
PyJanitor: Python port of R's janitor package, for data cleansing and manipulation
FastText: A library for fast and efficient text embedding and classification
Mimesis: A fast and valuable fake data generation library
PyOD: A Python software toolbox for scalable Outlier Detection (aka Anomaly Detection)
Creme: A Python library for Online Learning and building incremental models
vg: A linear algebra library much like Numpy with a more human-friendly interface
GraphKernels: A fast library for calculating various graph kernels
GraKeL: A graph kernel calculation library that is using Scikit-learn's API so it can be used with other functionalities and routines already present in Scikit-learn without much hassle
Graphsim: A graph similarity extension library for NetworkX
Textract: A general text extraction tool from many file formats
Sacred: Sacred is a Python library to make an ML workflow easier to reproduce and manage for you!
TextDistance: TextDistance is a Python library for calculating and comparing the distance between two or more sequences of an arbitrary alphabet (e.g., words, DNA sequences); it provides over 30 distance algorithms
Py_stringmatching: Py_stringmatching is a Python library consisting of a comprehensive set of string tokenizers (such as alphabetical tokenizers, whitespace tokenizers) and string similarity measures (e.g., edit distance, Jaccard distance)
JGraph: JGraph is a WebGL graph drawing library for Python
Kedro: A Python library and tool for managing the data analysis workflow in your projects
PySAL: PySAL is a Python package for geolocation-based data analysis
k-Shape: A Python implementation of the k-Shape algorithm for clustering time series data
Pyforest: You could use Pyforest to import all Python data science-related libraries lazily as you need them in your code
ETE Toolkit: ETE Toolkit is a Python toolbox for the visualization and analysis of tree-structured data
Whoosh: Whoosh is a full-text indexing and search library for Python
Geoplot: Geoplot is a Python visualization library for geospatial plotting of geolocational records
GeoPandas: GeoPandas is a high-level library with an API similar to Pandas that makes working with geospatial datasets in Python much easier
Edward: "A library for probabilistic modelling, inference, and criticism" - its website
HyperTools: A Python library for high-dimensional data visualization and analysis
TextRank: TextRank algorithm implementation for Python 3
pymorton: A Python package for ordinal hashing of multidimensional points into a one-dimensional ordering
PySS3: A Python package implementing the SS3 text classifier with visualization tools for explainable artificial intelligence (XAI)
Lpproj: A Python implementation of Locality Preserving Projections (LPP) with Scikit-Learn compatible API
Multi-Rake: Multilingual rapid automatic keyword extraction (Multi-RAKE) is a Python library for automatic text summarization and keyword extraction of text in many different languages
ACME: A software framework for research on reinforcement learning
fastText: A fast text representation learning and classification library from Facebook
Distance: A useful library in pure Python to calculate the distance between arbitrary sequences
Texthero: "Text preprocessing, representation and visualization from zero to hero" -- Texthero's website
xLearn: "High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface." -- xLearn's description
TextBlob: A text processing library with a high-level API
PySurvival: PySurvival is a Python package for survival analysis of data
Scikit-survival: Scikit-survival is an extension to Scikit-learn that adds survival analysis capabilities to it
rfpimp: A Python package that brings permutation-based feature importance measure to Scikit-learn Random Forests learners
Jiant: Jiant is an NLP software toolkit with multitask and transfer learning capabilities
PyG: "PyG (PyTorch Geometric) is a library built upon PyTorch to easily write and train Graph Neural Networks (GNNs) for a wide range of applications related to structured data." - PyG's documentation
Nodevectors: A Python package with fast and scalable implementations of some popular vertex embedding algorithms
MALLET: "MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications to text." - MALLET's website
MLPack: A fast ML library written in C++ with bindings to Python
t-SNE: Implementations of the famous t-distributed stochastic neighbour embedding algorithm in various languages
Accord.NET: Accord.NET is a Machine Learning framework written in C# with an API available for .NET; it also comes with some audio and image processing libraries written entirely in C#
OpenNN: A C++ library to build complex neural network models
MOA: A tool for mining stream data, by the people who created Weka
MLPACK: C++ Machine Learning library for scalability, speed, and ease-of-use
MOSES: "Moses is a statistical machine translation system that allows you to train translation models for any language pair automatically." - Moses's website
Parallel Python: A Python module for parallel execution of code in SMP and cluster environments
BeautifulSoup: A handy Python library to digest almost anything from the World Wide Web
Wordbatch: A library for parallel feature extraction on textual data (and potentially other complex data types)
Smile: "Smile is a fast and comprehensive machine learning system"- Smile's website
Tablesaw: A dataframe and visualization library for Java
TensorFlow Models: A repository of models and examples built with TensorFlow
Curated list of graph embedding methods: A collection of paper-code pairs for state-of-the-art graph embedding (a.k.a. network representation learning) algorithms
Pegasus: An open-source system for analyzing huge graphs. It appears not to have been developed or maintained for a long time
Dataset: A handy tool to simplify the task of reading and writing to relational databases
Twython: A Twitter API library in pure Python with tonnes of features
Apache TinkerPop: A cool graph storage and computation framework; it can be used both as a graph analytics platform and as a graph database system. Love the little gremlins!
Graphexp: Graphexp is a visual graph explorer with D3.js for TinkerPop
Scilab: An open-source numerical computation language and environment, a great MATLAB alternative
Glow: A compiler that targets various neural network hardware accelerators
GraphJet: A real-time graph processing library in Java
GraphDrawing: A lovely graph analysis and drawing library in Java
Java Data Mining Package: An open-source Java package for mining massive datasets, implementing a vast collection of algorithms (e.g., clustering, regression, classification, and graphical models)
ScalaNLP: A numerical computation and Data Mining library suite written in Scala, with an emphasis on NLP
Vegas: A very flexible declarative data visualization library in Scala that works with Apache Spark right out of the box
DeepLearning.scala: A simple Scala library for creating complex artificial neural networks by ThoughtWorks
XAPIAN: An open-source search engine library with bindings to be used in many high-level programming languages, for example, Python, Java, and Lua!
DataMelt: "DataMelt is a free software for numeric computation, mathematics, statistics, symbolic calculations, data analysis and data visualization" - DataMelt's website
Luna: A functional programming language to create data processing friendly programs in a WYSIWYG way
NetLogo: A computational multi-agent development and simulation environment; a very cool tool for investigating complex phenomena by implementing simple computational rules for agents!
LabPlot: LabPlot is a lovely application for data analysis and plotting; it is part of the KDE project!
Meta Toolkit: A fast software toolkit implementing many useful ML algorithms; it is written in C++
Record Linkage Tools: A collection of useful resources for record deduplication and linkage
Gunrock: A GPU-based graph analytics and processing library; it works with CUDA
Papers on Graph Analytics: A thorough list of publications related to graphs covering many interesting topics
GraphIt: "A High-Performance Domain Specific Language for Graph Analytics" - GraphIt's website
SMORe: A handy tool and library for fast weighted graph embedding in C++
Warp-ctc: A fast parallel implementation of CTC, for both CPU and GPU
ZVTM: A handy graph visualization library for Java
mrJob: A Python library to create MapReduce jobs and run them on multiple machines (i.e., in a cluster)
Metanome: A collection of interesting materials (e.g., algorithms, code, articles) related to data profiling
Graphillion: Graphillion is a software library for working with many graphs in a parallel fashion
Awesome graph classification: A very comprehensive collection of graph embedding, classification and representation learning papers with the code!
VFML: Very Fast ML (hence the name VFML) is a fast C library for mining very massive data streams
Talisman: Talisman is a modular JavaScript library for NLP and Machine Learning activities
StyleGAN: StyleGAN is a TensorFlow implementation of a GAN architecture proposed by NVIDIA; you can use it to create photo-realistic pictures of people who don't exist!
Java String Similarity: A Java library implementing a collection of useful text similarity/distance measures
Label Studio: Label Studio is a handy tool with a nice UI for labelling your data (e.g., records and documents)
GraphML: GraphML is a graph representation and serialization file format based on XML that could store many different types of graphs with their attributes without loss of information
Taco: A compiler that translates general tensor algebra operations on sparse tensors into machine code for CPUs and GPUs
Libspatialindex: Libspatialindex contains many robust geolocational indexing algorithms like R*-tree and TPR-tree
NLP Best Practices: A collection of best practices and their examples in the NLP domain from Microsoft
Tulip: Tulip is a nice open-source data visualization and analysis software toolbox; it is especially good for working with graphs and graph datasets
Juno: Juno is an IDE based on Atom for the Julia programming language
BoofCV: A real-time machine vision and image processing library in Java
cuDF: cuDF is a library with an API similar to Pandas, built on the Apache Arrow columnar memory format; cuDF uses GPU routines for loading, joining, aggregating, filtering, and otherwise manipulating data
LASER toolkit: LASER (Language-Agnostic SEntence Representations) is a software toolkit for sentence embedding for about 100 different languages
Idyll: "A toolkit for creating data-driven stories and explorable explanations" - Idyll's website
DeepLearning4J: A Java-based software toolbox for building and training deep artificial neural networks
NeMo: NeMo is a software toolkit for building AI applications
TRAINS Agent: TRAINS Agent is a DevOps tool for setting up and running an AI experiment on a cluster computing environment
TensorFlow Hub: TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of deep learning models
AIX360: An explainable AI (XAI) toolkit to interpret Machine Learning models
Catalyst: Catalyst is a tool for making Deep Learning experiments on PyTorch reproducible
TensorFlowJS: TensorFlowJS is a JavaScript library to use TensorFlow models in web applications in the browser
Kst: Kst is a handy data visualization tool from the KDE project
AMIDST: AMIDST is a Java software toolbox for probabilistic modelling of data
LIBFFM: "LIBFFM is an open-source tool for field-aware factorization machines (FFM)"; it has been used to win a few real-world data science challenges on Kaggle
jLDADMM: A Java package for LDA and DMM topic modelling
Stan: "Stan is a state-of-the-art platform for statistical modelling and high-performance statistical computation." - Stan's website
DEAP: Distributed Evolutionary Algorithms in Python
DynaML: "DynaML is a Scala & JVM Machine Learning toolbox for research, education & industry." -- Its website
ExecuteMulan: A Java utility to run multi-label classification methods from Mulan more easily
GTN: "GTN is an open-source framework for automatic differentiation with a powerful, expressive type of graph called weighted finite-state transducers (WFSTs). Just as PyTorch provides a framework for automatic differentiation with tensors, GTN provides such a framework for WFSTs. AI researchers and engineers can use GTN to train graph-based machine learning models more effectively." -- Facebook
Tribuo: An open-source machine learning library in Java from Oracle
Libbow: "Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modelling and information retrieval programs." -- Its website
Readings in Database Systems (The Red Book): An enjoyable read. I found it a little hard to follow at first, but many resources are mentioned at the end of each chapter, and it gives great insight into the history, trends, and future of DBMSs and data processing platforms
Machine Learning Meets Databases: A very informative and easy-to-follow article that includes a short introduction to Machine Learning and describes its relation to Data Mining and Databases