Give an input CSV file and a target field you want to predict to automl-gs, and get a trained high-performing machine learning or deep learning model plus native Python code pipelines allowing you to integrate that model into any prediction workflow. No black box: you can see exactly how the data is processed, how the model is constructed, and you can make tweaks as necessary.
automl-gs is an AutoML tool which, unlike Microsoft's NNI, Uber's Ludwig, and TPOT, offers a zero code/model definition interface to getting an optimized model and data transformation pipeline in multiple popular ML/DL frameworks, with minimal Python dependencies (pandas + scikit-learn + your framework of choice). automl-gs is designed for citizen data scientists and engineers without a deep statistical background under the philosophy that you don't need to know any modern data preprocessing and machine learning engineering techniques to create a powerful prediction workflow.
Nowadays, the cost of computing many different models and hyperparameters is much lower than the opportunity cost of a data scientist's time. automl-gs is a Python 3 module designed to abstract away the common approaches to transforming tabular data, architecting machine learning/deep learning models, and performing random hyperparameter searches to identify the best-performing model. This allows data scientists and researchers to better utilize their time on model performance optimization.
The models generated by automl-gs are intended to give a very strong baseline for solving a given problem; they're not the end-all-be-all that often accompanies the AutoML hype, but the resulting code is easily tweakable to improve from the baseline.
You can view the hyperparameters and their values here, and the metrics that can be optimized here. Some of the more controversial design decisions for the generated models are noted in DESIGN.md.
Currently automl-gs supports the generation of models for regression and classification problems using the following Python frameworks:
- TensorFlow (via tf.keras) | tensorflow
- xgboost
To be implemented:
- catboost
- lightgbm
automl-gs can be installed via pip:
pip3 install automl_gs
You will also need to install the corresponding ML/DL framework (e.g. tensorflow/tensorflow-gpu for TensorFlow, xgboost for xgboost, etc.).
After that, you can run it directly from the command line. For example, with the famous Titanic dataset:
automl_gs titanic.csv Survived
If you want to use a different framework or configure the training, you can do it with flags:
automl_gs titanic.csv Survived --framework xgboost --num_trials 1000
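If you script the CLI call, the flags shown above can be assembled programmatically. The following is a minimal sketch: `build_automl_gs_command` is a hypothetical helper (not part of automl-gs), and it emits only the `--framework` and `--num_trials` flags that appear in the example above.

```python
import shlex


def build_automl_gs_command(csv_path, target_field, framework=None, num_trials=None):
    """Assemble an automl_gs command line as a list of arguments.

    Hypothetical helper for illustration; it is not part of automl-gs.
    Only the flags shown in the example above are emitted.
    """
    command = ["automl_gs", csv_path, target_field]
    if framework is not None:
        command += ["--framework", framework]
    if num_trials is not None:
        command += ["--num_trials", str(num_trials)]
    return command


command = build_automl_gs_command(
    "titanic.csv", "Survived", framework="xgboost", num_trials=1000
)
# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` if automl-gs is installed and on your PATH.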
You may also invoke automl-gs directly from Python (e.g. via a Jupyter Notebook):
from automl_gs import automl_grid_search
automl_grid_search('titanic.csv', 'Survived')
The output of the automl-gs training is:
- A timestamped folder (e.g. automl_tensorflow_20190317_020434) which contains:
  - model.py: The generated model file.
  - pipeline.py: The generated pipeline file.
  - requirements.txt: The generated requirements file.
  - /encoders: A folder containing JSON-serialized encoder files.
  - /metadata: A folder containing training statistics + other cool stuff not yet implemented!
- automl_results.csv: A CSV containing the training results after each epoch and the hyperparameters used to train at that time.
Once the training is done, you can run the generated files from the command line within the generated folder above.
To predict:
python3 model.py -d data.csv -m predict
To retrain the model on new data:
python3 model.py -d data.csv -m train
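The two commands above can also be launched from another Python script. This is a minimal sketch under stated assumptions: `build_model_command` and `run_generated_model` are hypothetical helpers, and the only flags used are the `-d` and `-m` options shown above.

```python
import subprocess
import sys

# The two modes shown in the commands above.
VALID_MODES = ("predict", "train")


def build_model_command(data_path, mode, python=sys.executable):
    """Build the argument list for running the generated model.py."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [python, "model.py", "-d", data_path, "-m", mode]


def run_generated_model(model_folder, data_path, mode):
    """Run model.py from inside the generated folder.

    data_path is resolved relative to model_folder, because the
    subprocess uses that folder as its working directory.
    """
    return subprocess.run(
        build_model_command(data_path, mode), cwd=model_folder, check=True
    )


print(build_model_command("data.csv", "predict", python="python3"))
```

Calling `run_generated_model("automl_tensorflow_20190317_020434", "data.csv", "predict")` would then run the prediction inside that folder, assuming the folder and data file exist.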
You can view these arguments at any time by running automl_gs -h in the command line.
- csv_path: Path to the CSV file (must be in the current directory) [Required]
- target_field: Target field to predict [Required]
- target_metric: Target metric to optimize [Default: Automatically determined depending on problem type]
- framework: Machine learning framework to use [Default: 'tensorflow']
- model_name: Name of the model (if you want to train models with different names) [Default: 'automl']
- num_trials: Number of trials / different hyperparameter combos to test. [Default: 100]
- split: Train-validation split when training the models [Default: 0.7]
- num_epochs: Number of epochs / passes through the data when training the models. [Default: 20]
- col_types: Dictionary of fields:data types to use to override automl-gs's guesses. (only when using in Python) [Default: {}]
- gpu: For non-TensorFlow frameworks and Pascal-or-later GPUs, boolean to determine whether to use GPU-optimized training methods (TensorFlow can detect it automatically) [Default: False]
- tpu_address: For TensorFlow, hardware address of the TPU on the system. [Default: None]
For a quick Hello World on how to use automl-gs, see this Jupyter Notebook.
Due to the size of some examples with generated code and accompanying data visualizations, they are maintained in a separate repository. (The examples also explain why there are two distinct "levels" in the example viz above!)
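To make the documented argument defaults concrete, here is a small `argparse` sketch that mirrors them. It is not the parser automl-gs ships: only `--framework` and `--num_trials` appear as flags in this README, so the other flag spellings are assumptions that follow the same pattern, and `col_types` is left out because it is documented as Python-only.

```python
import argparse


def make_parser():
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="automl_gs_sketch")
    parser.add_argument("csv_path", help="Path to the CSV file")
    parser.add_argument("target_field", help="Target field to predict")
    parser.add_argument("--target_metric", default=None)
    parser.add_argument("--framework", default="tensorflow")
    parser.add_argument("--model_name", default="automl")
    parser.add_argument("--num_trials", type=int, default=100)
    parser.add_argument("--split", type=float, default=0.7)
    parser.add_argument("--num_epochs", type=int, default=20)
    parser.add_argument("--gpu", action="store_true")  # defaults to False
    parser.add_argument("--tpu_address", default=None)
    return parser


args = make_parser().parse_args(
    ["titanic.csv", "Survived", "--framework", "xgboost", "--num_trials", "1000"]
)
# Overridden values come from the command line; the rest fall back to defaults.
print(args.framework, args.num_trials, args.split, args.num_epochs)
```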
TL;DR: automl-gs generates raw Python code using Jinja templates and trains a model using the generated code in a subprocess; it repeats this with different hyperparameters until done and saves the best model.
automl-gs loads a given CSV and infers the data type of each column to be fed into the model. Then it tries an ETL strategy for each column field as determined by the hyperparameters; for example, a Datetime field has its hour and dayofweek binary-encoded by default, but hyperparameters may dictate the encoding of month and year as additional model fields. ETL strategies are optimized for frameworks; TensorFlow, for example, will use text embeddings, while other frameworks will use CountVectorizers to encode text (when training, TensorFlow will also use a shared text encoder via Keras's functional API). automl-gs then creates a statistical model with the specified framework. Both the model ETL functions and model construction functions are saved as a generated Python script.
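The Datetime example can be illustrated with a short standard-library sketch. This is not the code automl-gs generates: `encode_datetime` and `one_hot` are hypothetical names, one-hot vectors are used as one plausible reading of "binary-encoded", and the `use_month`/`use_year` switches stand in for hyperparameter choices.

```python
from datetime import datetime


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


def encode_datetime(value, use_month=False, use_year=False):
    """Encode a datetime into model features.

    hour and dayofweek are always encoded; month and year are optional,
    standing in for hyperparameter-controlled choices.
    """
    features = {
        "hour": one_hot(value.hour, 24),
        "dayofweek": one_hot(value.weekday(), 7),  # Monday == 0
    }
    if use_month:
        features["month"] = one_hot(value.month - 1, 12)
    if use_year:
        features["year"] = value.year  # kept as a raw number in this sketch
    return features


features = encode_datetime(datetime(2019, 3, 17, 2, 4, 34), use_month=True)
# Print which position is "hot" for each encoded field.
print({name: vector.index(1) for name, vector in features.items()})
```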
automl-gs then runs the generated training script as a typical user would. Once the model is trained, automl-gs saves the training results in its own CSV, along with all the hyperparameters used to train the model. automl-gs then repeats the task with another set of hyperparameters, until the specified number of trials is hit or the user kills the script.
The best model Python script is kept after each trial, which can then easily be integrated into other scripts, or run directly to get the prediction results on a new dataset.
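The generate, run, and keep-the-best loop described above can be sketched with the standard library alone. This is an illustration under loud assumptions, not automl-gs's implementation: `string.Template` stands in for Jinja, the generated script prints a toy score instead of training a model, and `SEARCH_SPACE`, `SCRIPT_TEMPLATE`, and `run_search` are hypothetical names.

```python
import random
import subprocess
import sys
import tempfile
from pathlib import Path
from string import Template

# Stand-in for a Jinja template: the generated "training" script only
# prints a toy score computed from its hyperparameters.
SCRIPT_TEMPLATE = Template(
    "learning_rate = $learning_rate\n"
    "num_layers = $num_layers\n"
    "score = 1.0 - abs(learning_rate - 0.01) * 10 - abs(num_layers - 3) * 0.1\n"
    "print(score)\n"
)

# Hypothetical search space, chosen only for this sketch.
SEARCH_SPACE = {"learning_rate": [0.001, 0.01, 0.1], "num_layers": [1, 2, 3, 4]}


def run_search(num_trials, seed=0):
    """Random search: generate a script, run it in a subprocess, keep the best."""
    rng = random.Random(seed)
    results = []
    best = None
    with tempfile.TemporaryDirectory() as workdir:
        script_path = Path(workdir) / "model.py"
        for trial in range(num_trials):
            # Sample one random hyperparameter combination.
            params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            source = SCRIPT_TEMPLATE.substitute(params)
            script_path.write_text(source)
            # Run the generated script in a separate Python process.
            completed = subprocess.run(
                [sys.executable, str(script_path)],
                capture_output=True, text=True, check=True,
            )
            score = float(completed.stdout.strip())
            results.append({"trial": trial, "score": score, **params})
            # Keep the source of the best-scoring script seen so far.
            if best is None or score > best["score"]:
                best = {"score": score, "params": params, "source": source}
    return best, results


best, results = run_search(num_trials=5)
print(best["params"], best["score"])
```

In this sketch the per-trial rows in `results` play the role of the results CSV, and `best["source"]` plays the role of the kept model script.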
xgboost. The results may surprise you!
Feature development will continue on automl-gs as long as there is interest in the package.
plotnine
Max Woolf (@minimaxir)
Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.
MIT
The code generated by automl-gs is unlicensed; the owner of the generated code can decide the license.