LightAutoML (LAMA) is an AutoML framework by Sber AI Lab.
It provides automatic model creation for the following tasks:
Current version of the package handles datasets that have independent samples in each row. I.e. each row is an object with its specific features and target. Multitable datasets and sequences are a work in progress :)
Note: we use AutoWoE
library to automatically create interpretable models.
Authors: Alexander Ryzhkov, Anton Vakhrushev, Dmitry Simakov, Vasilii Bunakov, Rinchin Damdinov, Pavel Shvets, Alexander Kirilin.
Documentation of LightAutoML is available here, you can also generate it.
Full GPU pipeline for LightAutoML currently available for developers testing (still in progress). The code and tutorials available here
To install LAMA framework on your machine from PyPI, execute following commands:
# Install base functionality:
pip install -U lightautoml
# For partial installation use corresponding option.
# Extra dependecies: [nlp, cv, report]
# Or you can use 'all' to install everything
pip install -U lightautoml[nlp]
Additionaly, run following commands to enable pdf report generation:
# MacOS
brew install cairo pango gdk-pixbuf libffi
# Debian / Ubuntu
sudo apt-get install build-essential libcairo2 libpango-1.0-0 libpangocairo-1.0-0 libgdk-pixbuf2.0-0 libffi-dev shared-mime-info
# Fedora
sudo yum install redhat-rpm-config libffi-devel cairo pango gdk-pixbuf2
# Windows
# follow this tutorial https://weasyprint.readthedocs.io/en/stable/install.html#windows
Let's solve the popular Kaggle Titanic competition below. There are two main ways to solve machine learning problems using LightAutoML:
import pandas as pd
from sklearn.metrics import f1_score
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
automl = TabularAutoML(
task = Task(
name = 'binary',
metric = lambda y_true, y_pred: f1_score(y_true, (y_pred > 0.5)*1))
)
oof_pred = automl.fit_predict(
df_train,
roles = {'target': 'Survived', 'drop': ['PassengerId']}
)
test_pred = automl.predict(df_test)
pd.DataFrame({
'PassengerId':df_test.PassengerId,
'Survived': (test_pred.data[:, 0] > 0.5)*1
}).to_csv('submit.csv', index = False)
LighAutoML framework has a lot of ready-to-use parts and extensive customization options, to learn more check out the resources section.
Tutorial_1_basics.ipynb
Tutorial_2_WhiteBox_AutoWoE.ipynb
Tutorial_3_sql_data_source.ipynb
Tutorial_4_NLP_Interpretation.ipynb
Tutorial_5_uplift.ipynb
Tutorial_6_custom_pipeline.ipynb
Tutorial_7_ICE_and_PDP_interpretation.ipynb
Note 1: for production you have no need to use profiler (which increase work time and memory consomption), so please do not turn it on - it is in off state by default
Note 2: to take a look at this report after the run, please comment last line of demo with report deletion command.
LightAutoML crash courses:
Video guides:
Papers:
Articles about LightAutoML:
If you are interested in contributing to LightAutoML, please read the Contributing Guide to get started.
This project is licensed under the Apache License, Version 2.0. See LICENSE file for more details.
First of all you need to install git and poetry.
# Load LAMA source code
git clone https://github.com/sberbank-ai-lab/LightAutoML.git
cd LightAutoML/
# !!!Choose only one item!!!
# 1. Global installation: Don't create virtual environment
poetry config virtualenvs.create false --local
# 2. Recommended: Create virtual environment inside your project directory
poetry config virtualenvs.in-project true
# For more information read poetry docs
# Install LAMA
poetry lock
poetry install
import pandas as pd
from sklearn.metrics import f1_score
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
df_train = pd.read_csv('../input/titanic/train.csv')
df_test = pd.read_csv('../input/titanic/test.csv')
# define that machine learning problem is binary classification
task = Task("binary")
reader = PandasToPandasReader(task, cv=N_FOLDS, random_state=RANDOM_STATE)
# create a feature selector
model0 = BoostLGBM(
default_params={'learning_rate': 0.05, 'num_leaves': 64,
'seed': 42, 'num_threads': N_THREADS}
)
pipe0 = LGBSimpleFeatures()
mbie = ModelBasedImportanceEstimator()
selector = ImportanceCutoffSelector(pipe0, model0, mbie, cutoff=0)
# build first level pipeline for AutoML
pipe = LGBSimpleFeatures()
# stop after 20 iterations or after 30 seconds
params_tuner1 = OptunaTuner(n_trials=20, timeout=30)
model1 = BoostLGBM(
default_params={'learning_rate': 0.05, 'num_leaves': 128,
'seed': 1, 'num_threads': N_THREADS}
)
model2 = BoostLGBM(
default_params={'learning_rate': 0.025, 'num_leaves': 64,
'seed': 2, 'num_threads': N_THREADS}
)
pipeline_lvl1 = MLPipeline([
(model1, params_tuner1),
model2
], pre_selection=selector, features_pipeline=pipe, post_selection=None)
# build second level pipeline for AutoML
pipe1 = LGBSimpleFeatures()
model = BoostLGBM(
default_params={'learning_rate': 0.05, 'num_leaves': 64,
'max_bin': 1024, 'seed': 3, 'num_threads': N_THREADS},
freeze_defaults=True
)
pipeline_lvl2 = MLPipeline([model], pre_selection=None, features_pipeline=pipe1,
post_selection=None)
# build AutoML pipeline
automl = AutoML(reader, [
[pipeline_lvl1],
[pipeline_lvl2],
], skip_conn=False)
# train AutoML and get predictions
oof_pred = automl.fit_predict(df_train, roles = {'target': 'Survived', 'drop': ['PassengerId']})
test_pred = automl.predict(df_test)
pd.DataFrame({
'PassengerId':df_test.PassengerId,
'Survived': (test_pred.data[:, 0] > 0.5)*1
}).to_csv('submit.csv', index = False)
Seek prompt advice at Slack community or Telegram group.
Open bug reports and feature requests on GitHub issues.