Source Code: https://github.com/mljar/mljar-supervised
Community chat: Slack channel
mljar-supervised is an Automated Machine Learning Python package that works with tabular data. It is designed to save time for a data scientist 😎. It abstracts the common way to preprocess the data, construct the machine learning models, and perform hyperparameter tuning to find the best model 🏆. It is not a black box, as you can see exactly how the ML pipeline is constructed (with a detailed Markdown report for each ML model).
mljar-supervised will help you with:
It has three built-in modes of work:
Explain mode, which is ideal for explaining and understanding the data, with many data explanations, like decision tree visualization, display of linear model coefficients, permutation importances, and SHAP explanations of data,
Perform for building ML pipelines to use in production,
Compete mode, which trains highly-tuned ML models with ensembling and stacking, for use in ML competitions.
Of course, you can further customize the details of each mode to meet your requirements.
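As a hedged illustration of customizing a mode, the sketch below collects a few constructor arguments in a dictionary and passes them to AutoML. The `mode` argument comes from the examples in this README; `total_time_limit`, `algorithms`, and `eval_metric` are assumed to be accepted by the installed version, and `custom_settings` and `build_automl` are hypothetical names, so check the docs before relying on them.

```python
# Hypothetical settings. Argument names other than `mode` are assumptions
# that should be verified against the mljar-supervised docs.
custom_settings = {
    "mode": "Perform",
    "total_time_limit": 600,  # assumed: total training budget in seconds
    "algorithms": ["Linear", "Random Forest", "Xgboost"],
    "eval_metric": "logloss",
}


def build_automl(settings):
    """Create an AutoML object from a settings dictionary."""
    # Imported lazily so the settings can be inspected without the package.
    from supervised.automl import AutoML

    return AutoML(**settings)
```

Calling `build_automl(custom_settings)` would then return an object that can be used with `fit` and `predict` as in the examples further down.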
Neural Networks, and
not-so-random-search algorithm (random search over a defined set of values) and hill climbing to fine-tune final models.
A Baseline is computed for your data, so you will know whether you need Machine Learning or not! You will also know how good your ML models are compared to the Baseline. The Baseline is computed from the prior class distribution for classification, and from the simple mean for regression.
Decision Trees are trained with max_depth <= 5, so you can easily visualize them with the amazing dtreeviz to better understand your data.
mljar-supervised uses simple linear regression and includes its coefficients in the summary report, so you can check which features are used the most in the linear model.
Compete mode or after setting
mljar-supervised creates Markdown reports from AutoML training, full of ML details and charts.
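The Baseline idea in the list above can be sketched in a few lines of plain Python. This is a minimal illustration of "prior class distribution" and "simple mean" predictors, not the package's own implementation; the function names and the toy data are hypothetical, and for brevity the priors and the mean are computed on the same data they are scored on.

```python
import math
from collections import Counter


def classification_baseline_logloss(y):
    """Logloss of always predicting the class frequencies observed in y."""
    n = len(y)
    priors = {label: count / n for label, count in Counter(y).items()}
    # Every sample receives the same predicted distribution (the priors).
    return -sum(math.log(priors[label]) for label in y) / n


def regression_baseline_mse(y):
    """Mean squared error of always predicting the mean of y."""
    mean = sum(y) / len(y)
    return sum((value - mean) ** 2 for value in y) / len(y)


labels = [0, 0, 0, 1]            # toy classification target
targets = [1.0, 2.0, 3.0, 4.0]   # toy regression target

print("Baseline logloss:", classification_baseline_logloss(labels))
print("Baseline MSE:", regression_baseline_mse(targets))
```

A trained model is only worth keeping if its score is clearly better than these numbers.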
In the docs you can find details about the AutoML modes, presented in a table.
automl = AutoML(mode="Explain")
It is aimed to be used when the user wants to explain and understand the data.
Neural Network algorithms and ensemble.
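One of the explanations mentioned for the Explain mode is permutation importance. The sketch below shows the general technique in plain Python, assuming mean squared error as the metric: shuffle one feature column at a time and measure how much the error grows. The toy model and data are hypothetical, and this is not the package's own code.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base_error = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)
        # Copy the rows, replacing only feature j with its shuffled values.
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        perm_error = mse(y, [predict(row) for row in X_perm])
        importances.append(perm_error - base_error)
    return importances


def toy_predict(row):
    """Hypothetical model that uses feature 0 only and ignores feature 1."""
    return 2.0 * row[0]


X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 7.0], [6.0, 2.0]]
y = [2.0 * row[0] for row in X]

importances = permutation_importance(toy_predict, X, y, n_features=2)
print(importances)
```

Shuffling the ignored feature leaves the predictions unchanged, so its importance is zero, while shuffling the feature the model depends on can only increase the error.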
automl = AutoML(mode="Perform")
It should be used when the user wants to train a model that will be used in real-life use cases.
Neural Network. It uses ensembling.
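To give an intuition for why ensembling can help, here is a minimal sketch that averages the predicted probabilities of two hypothetical models and compares logloss values. Plain averaging is an assumption made for illustration; the ensembling method used by the package may differ.

```python
import math


def logloss(y_true, p_pred):
    """Binary logloss with clipping to avoid log(0)."""
    eps = 1e-15
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def average_ensemble(predictions):
    """Element-wise mean of several lists of predicted probabilities."""
    return [sum(ps) / len(ps) for ps in zip(*predictions)]


y_true = [1, 0, 1, 0]
model_a = [0.9, 0.4, 0.6, 0.1]  # hypothetical predictions of model A
model_b = [0.6, 0.1, 0.9, 0.4]  # hypothetical predictions of model B
ensemble = average_ensemble([model_a, model_b])

print("Model A logloss: ", logloss(y_true, model_a))
print("Model B logloss: ", logloss(y_true, model_b))
print("Ensemble logloss:", logloss(y_true, ensemble))
```

In this toy case the two models make their mistakes on different samples, so the averaged prediction scores better than either model alone.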
automl = AutoML(mode="Compete")
It should be used for machine learning competitions.
Nearest Neighbors. It uses ensemble and stacking.
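The tuning approach described earlier (random search over a defined set of values, followed by hill climbing) can be sketched as follows. The search space, the `toy_score` function, and all names are hypothetical stand-ins; a real run would train and validate a model for each candidate, and the package's own procedure may differ in its details.

```python
import random

# Hypothetical search space: each parameter has a defined set of values.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
}


def toy_score(params):
    """Stand-in for a validation loss (lower is better); not a real model."""
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2


def random_search(n_trials, rng):
    """Sample parameter sets from the defined values and keep the best one."""
    candidates = [
        {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        for _ in range(n_trials)
    ]
    return min(candidates, key=toy_score)


def hill_climb(params):
    """Step to a neighboring value for as long as that lowers the score."""
    best = dict(params)
    improved = True
    while improved:
        improved = False
        for name, values in SEARCH_SPACE.items():
            for step in (-1, 1):
                neighbor = values.index(best[name]) + step
                if 0 <= neighbor < len(values):
                    candidate = dict(best, **{name: values[neighbor]})
                    if toy_score(candidate) < toy_score(best):
                        best = candidate
                        improved = True
    return best


start = random_search(n_trials=5, rng=random.Random(42))
best = hill_climb(start)
print("Random search result:", start)
print("After hill climbing: ", best)
```

Random search gives a reasonable starting point cheaply, and hill climbing then refines it by trying only adjacent values.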
There is a simple interface available with fit and predict methods.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from supervised.automl import AutoML

    df = pd.read_csv(
        "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
        skipinitialspace=True,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        df[df.columns[:-1]], df["income"], test_size=0.25
    )

    automl = AutoML()
    automl.fit(X_train, y_train)

    predictions = automl.predict(X_test)
fit will print:
    Create directory AutoML_1
    AutoML task to be solved: binary_classification
    AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
    AutoML will optimize for metric: logloss
    1_Baseline final logloss 0.5519845471086654 time 0.08 seconds
    2_DecisionTree final logloss 0.3655910192804364 time 10.28 seconds
    3_Linear final logloss 0.38139916864708445 time 3.19 seconds
    4_Default_RandomForest final logloss 0.2975204390214936 time 79.19 seconds
    5_Default_Xgboost final logloss 0.2731086827200411 time 5.17 seconds
    6_Default_NeuralNetwork final logloss 0.319812276905242 time 21.19 seconds
    Ensemble final logloss 0.2731086821194617 time 1.43 seconds
Example code for classification on the optical recognition of handwritten digits dataset. Running this code takes less than 30 minutes and results in a test accuracy of ~98%.
    import pandas as pd

    # scikit-learn utilities
    from sklearn.datasets import load_digits
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # mljar-supervised package
    from supervised.automl import AutoML

    # load the data
    digits = load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        pd.DataFrame(digits.data),
        digits.target,
        stratify=digits.target,
        test_size=0.25,
        random_state=123,
    )

    # train models with AutoML
    automl = AutoML(mode="Perform")
    automl.fit(X_train, y_train)

    # compute the accuracy on test data
    predictions = automl.predict_all(X_test)
    print(predictions.head())
    print("Test accuracy:", accuracy_score(y_test, predictions["label"].astype(int)))
Regression example on Boston house prices data. On test data it scores ~ 10.85 mean squared error (MSE).
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from supervised.automl import AutoML  # mljar-supervised

    # Load the data
    # Note: load_boston was removed in scikit-learn 1.2,
    # so this example needs an older scikit-learn version.
    housing = load_boston()
    X_train, X_test, y_train, y_test = train_test_split(
        pd.DataFrame(housing.data, columns=housing.feature_names),
        housing.target,
        test_size=0.25,
        random_state=123,
    )

    # train models with AutoML
    automl = AutoML(mode="Explain")
    automl.fit(X_train, y_train)

    # compute the MSE on test data
    predictions = automl.predict(X_test)
    print("Test MSE:", mean_squared_error(y_test, predictions))
For details please check mljar-supervised docs.
If you need help: submit an issue or join our Slack channel.
The report from running AutoML contains a table with information about each model's score and the time needed to train it. For each model there is a link that you can click to see the model's details. The performance of all ML models is presented as scatter and box plots, so you can visually inspect which algorithms perform best 🏆.
The example for Decision Tree summary with trees visualization. For classification tasks additional metrics are provided:
The example for
From the PyPI repository:
pip install mljar-supervised
From source code:
    git clone https://github.com/mljar/mljar-supervised.git
    cd mljar-supervised
    python setup.py install
Installation for development
    git clone https://github.com/mljar/mljar-supervised.git
    virtualenv venv --python=python3.6
    source venv/bin/activate
    pip install -r requirements.txt
    pip install -r requirements_dev.txt
Running in Docker:
    FROM python:3.7-slim-buster
    RUN apt-get update && apt-get -y update
    RUN apt-get install -y build-essential python3-pip python3-dev
    RUN pip3 -q install pip --upgrade
    RUN pip3 install mljar-supervised jupyter
    CMD ["jupyter", "notebook", "--port=8888", "--no-browser", "--ip=0.0.0.0", "--allow-root"]
To get started take a look at our Contribution Guide for information about our process and where you can fit in!
mljar-supervised is provided under the MIT license.
mljar-supervised is an open-source project created by MLJAR. We care about ease of use in Machine Learning.
mljar.com provides a beautiful and simple user interface for building machine learning models.