

mlmachine

"mlmachine is a Python library that organizes and accelerates notebook-based machine learning experiments."

Table of Contents

  • Novel Functionality
  • Example Notebooks
  • Articles on Medium
  • Installation
  • Feedback
  • Acknowledgments

Novel Functionality

Easy, Elegant EDA

mlmachine creates beautiful and informative EDA panels with ease:

# create EDA panel for all "category" features
for feature in mlmachine_titanic_machine.data.mlm_dtypes["category"]:
    mlmachine_titanic_machine.eda_cat_target_cat_feat(
        feature=feature,
        legend_labels=["Died", "Survived"],
    )


Pandas-in / Pandas-out Pipelines

mlmachine makes Scikit-learn transformers Pandas-friendly.

Here's an example. See how simply wrapping the mlmachine utility PandasTransformer() around OneHotEncoder() maintains our DataFrame:
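Here is a minimal sketch of that comparison. The toy column data is hypothetical, and the exact import path for PandasTransformer is assumed rather than confirmed:

# minimal sketch - one-hot encode a toy categorical column with and without PandasTransformer
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from mlmachine.features.preprocessing import PandasTransformer  # hypothetical import path

# toy data (illustrative only)
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# plain scikit-learn: returns an array, so the column names are lost
encoded_array = OneHotEncoder(sparse=False).fit_transform(df)

# wrapped in PandasTransformer: the output remains a labeled DataFrame
encoded_df = PandasTransformer(OneHotEncoder(sparse=False)).fit_transform(df)
print(encoded_df.head())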


KFold Target Encoding

mlmachine includes a utility called KFoldEncoder, which applies target encoding on categorical features and leverages out-of-fold encoding to prevent target leakage:

# perform 5-fold target encoding with TargetEncoder from the category_encoders library
encoder = KFoldEncoder(
    target=mlmachine_titanic_machine.training_target,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    encoder=TargetEncoder,
)
encoder.fit_transform(mlmachine_titanic_machine.training_features[["Pclass"]])

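To build intuition for why out-of-fold encoding prevents leakage, here is a rough conceptual sketch written with plain pandas and scikit-learn, not mlmachine's own API:

# conceptual sketch of out-of-fold target encoding: each row is encoded using
# target means learned from the other folds, never from its own fold
import pandas as pd
from sklearn.model_selection import KFold

def oof_mean_encode(feature, target, n_splits=5, seed=0):
    encoded = pd.Series(index=feature.index, dtype=float)
    for fit_idx, transform_idx in KFold(n_splits, shuffle=True, random_state=seed).split(feature):
        # category -> mean target value, learned on the fit folds only
        fold_means = target.iloc[fit_idx].groupby(feature.iloc[fit_idx]).mean()
        # applied to the held-out fold
        encoded.iloc[transform_idx] = feature.iloc[transform_idx].map(fold_means).values
    return encoded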

Crowd-sourced Feature Importance & Exhaustive Feature Selection

mlmachine employs a robust approach to estimating feature importance by using a variety of techniques:

  • Tree-based Feature Importance
  • Recursive Feature Elimination
  • Sequential Forward Selection
  • Sequential Backward Selection
  • F-value / p-value
  • Variance 
  • Target Correlation

All of this happens in one simple execution, operating on multiple estimators and/or custom models, with one or more scoring metrics:

# instantiate custom models
rf2 = RandomForestClassifier(max_depth=2)
rf4 = RandomForestClassifier(max_depth=4)
rf6 = RandomForestClassifier(max_depth=6)

# estimator list - default XGBClassifier, default
# RandomForestClassifier and three custom models
estimators = [
    XGBClassifier,
    RandomForestClassifier,
    rf2,
    rf4,
    rf6,
]

# instantiate FeatureSelector object
fs = mlmachine_titanic_machine.FeatureSelector(
    data=mlmachine_titanic_machine.training_features,
    target=mlmachine_titanic_machine.training_target,
    estimators=estimators,
)

# run feature importance techniques, use ROC AUC and
# accuracy score metrics and 0 CV folds (where applicable)
feature_selector_summary = fs.feature_selector_suite(
    sequential_scoring=["roc_auc","accuracy_score"],
    sequential_n_folds=0,
    save_to_csv=True,
)

Then the features are winnowed away, from least important to most important, through an exhaustive cross-validation procedure in search of an optimum feature subset:

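For intuition, here is a rough conceptual sketch of that winnowing procedure, written with plain scikit-learn rather than mlmachine's own API (the importance ranking is assumed to come from the feature_selector_summary above):

# conceptual sketch: drop features from least to most important and
# cross-validate each remaining subset, keeping the best-scoring one
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def winnow_features(X, y, ranked_features, estimator=None, cv=5):
    # ranked_features: column names ordered from most to least important
    estimator = estimator if estimator is not None else RandomForestClassifier()
    best_subset, best_score = None, -np.inf
    for k in range(len(ranked_features), 0, -1):
        subset = ranked_features[:k]
        score = cross_val_score(estimator, X[subset], y, cv=cv, scoring="accuracy").mean()
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score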



Hyperparameter Tuning with Bayesian Optimization

mlmachine can perform Bayesian optimization on multiple estimators in one shot, and includes functionality for visualizing model performance and parameter selections:

# generate parameter selection panels for each parameter
mlmachine_titanic_machine.model_param_plot(
    bayes_optim_summary=bayes_optim_summary,
    estimator_class="KNeighborsClassifier",
    estimator_parameter_space=estimator_parameter_space,
    n_iter=100,
)

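As a rough sketch of the multi-estimator idea, here is what a single Bayesian optimization run spanning two estimators might look like in plain hyperopt (not mlmachine's own API; X_train and y_train are assumed placeholders):

# conceptual sketch: one hyperopt search space spanning two estimators and their parameters
from hyperopt import fmin, tpe, hp, Trials
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

space = hp.choice("estimator", [
    {"model": RandomForestClassifier, "params": {"max_depth": hp.quniform("rf_max_depth", 2, 10, 1)}},
    {"model": KNeighborsClassifier, "params": {"n_neighbors": hp.quniform("knn_n_neighbors", 3, 25, 1)}},
])

def objective(config):
    # quniform returns floats, so cast parameter values to int before instantiating
    params = {name: int(value) for name, value in config["params"].items()}
    model = config["model"](**params)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()
    return -score  # hyperopt minimizes, so negate the score

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=100, trials=Trials())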

Example Notebooks

All examples can be viewed here

Example Notebook 1 - Learn the basics of mlmachine, how to create EDA panels, and how to execute Pandas-friendly Scikit-learn transformations and pipelines.

Example Notebook 2 - Learn how to use mlmachine to assess a dataset's pre-processing needs. See examples of how to use novel functionality, such as GroupbyImputer(), KFoldEncoder(), and DualTransformer().

Example Notebook 3 - Learn how to perform thorough feature importance estimation, followed by an exhaustive, cross-validation-driven feature selection process.

Example Notebook 4 - Learn how to execute hyperparameter tuning with Bayesian optimization for multiple models and multiple parameter spaces in one simple execution.

Articles on Medium

mlmachine - Clean ML Experiments, Elegant EDA & Pandas Pipelines - Published 4/3/2020

mlmachine - GroupbyImputer, KFoldEncoder, and Skew Correction - Published 4/13/2020

Installation

Python Requirements: 3.6, 3.7

mlmachine uses the latest, or nearly the latest, versions of all of its dependencies. It is therefore highly recommended that you install mlmachine in a virtual environment.

pyenv

Create a new virtual environment:

$ pyenv virtualenv 3.7.5 mlmachine-env

Activate your new virtual environment:

$ pyenv activate mlmachine-env

Use pip to install mlmachine and all of its dependencies:

$ pip install mlmachine

anaconda

Create a new virtual environment:

$ conda create --name mlmachine-env python=3.7

Activate your new virtual environment:

$ conda activate mlmachine-env

Use pip to install mlmachine and all of its dependencies:

$ pip install mlmachine

Feedback

Any and all feedback is welcome. Please send me an email at [email protected]

Acknowledgments

mlmachine stands on the shoulders of many great Python packages:

catboost | category_encoders | eif | hyperopt | imbalanced-learn | jupyter | lightgbm | matplotlib | numpy | pandas | prettierplot | scikit-learn | scipy | seaborn | shap | statsmodels | xgboost
