Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language |
---|---|---|---|---|---|---|---|---|---|---|
H2o 3 | 6,572 | 62 | 33 | a day ago | 49 | August 09, 2023 | 2,718 | apache-2.0 | Jupyter Notebook | |
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc. | ||||||||||
Transmogrifai | 2,099 | 3 | 2 years ago | 9 | June 11, 2020 | 44 | bsd-3-clause | Scala | ||
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning | ||||||||||
Cube Studio | 1,558 | a month ago | 1 | October 13, 2022 | 76 | other | Jupyter Notebook | |||
cube studio开源云原生一站式机器学习/深度学习AI平台,支持sso登录,多租户/多项目组,数据资产对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式算法训练,超参搜索,推理服务VGPU,多集群调度,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型一键微调,llmops,私有知识库,AI应用商店,支持模型一键开发/推理/微调,私有化部署,支持国产cpu/gpu/npu芯片,支持RDMA,支持pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano分布式 | ||||||||||
Maggy | 88 | 17 days ago | 21 | June 03, 2022 | 8 | apache-2.0 | Python | |||
Distribution transparent Machine Learning experiments on Apache Spark | ||||||||||
E2eaiok | 28 | a day ago | 154 | December 02, 2023 | 21 | apache-2.0 | Jupyter Notebook | |||
Intel® End-to-End AI Optimization Kit | ||||||||||
Featuretoolsonspark | 16 | 4 years ago | 4 | June 13, 2019 | 1 | mit | Python | |||
A simplified version of featuretools for Spark | ||||||||||
Open Ml Libraries | 5 | 8 months ago | mit | |||||||
A useful machine learning repository, for research and business use. This repository contains hundreds of open-source framework for artificial intelligence, you can use those for research and also can use in your business | ||||||||||
Microsoft Big Data Scientist And Ai | 5 | 3 years ago | ||||||||
Microsoft Big Data, Data Scientist, and AI | ||||||||||
Data Science | 2 | 2 years ago | 1 | mit | Jupyter Notebook | |||||
A collection of neat and practical data science and machine learning projects | ||||||||||
Project_lost_saturn | 2 | a year ago | Jupyter Notebook | |||||||
This is the core of project lost_saturn . The project lost_saturn project is a modern approach to datascience, focus on enabling DataScience on containerised environments everywhere. Built first with a local setup and transformed into a container solution. It has tools centralized in Jupyter , with Spark and AutoML H2O.ai . Ideal to run Notebooks in Jupyter in WSL (Windows Subsystem Linux), or Docker containers with Ubunto 18.4 LTS |
TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library written in Scala that runs on top of Apache Spark. It was developed with a focus on accelerating machine learning developer productivity through machine learning automation, and an API that enforces compile-time type-safety, modularity, and reuse. Through automation, it achieves accuracies close to hand-tuned models with almost 100x reduction in time.
Use TransmogrifAI if you need a machine learning library to:
To understand the motivation behind TransmogrifAI check out these:
Skip to Quick Start and Documentation.
The Titanic dataset is an often-cited dataset in the machine learning community. The goal is to build a machine learnt model that will predict survivors from the Titanic passenger manifest. Here is how you would build the model using TransmogrifAI:
import com.salesforce.op._
import com.salesforce.op.readers._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.stages.impl.classification._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
implicit val spark = SparkSession.builder.config(new SparkConf()).getOrCreate()
import spark.implicits._
// Read Titanic data as a DataFrame
val passengersData = DataReaders.Simple.csvCase[Passenger](path = pathToData).readDataset().toDF()
// Extract response and predictor Features
val (survived, predictors) = FeatureBuilder.fromDataFrame[RealNN](passengersData, response = "survived")
// Automated feature engineering
val featureVector = predictors.transmogrify()
// Automated feature validation and selection
val checkedFeatures = survived.sanityCheck(featureVector, removeBadFeatures = true)
// Automated model selection
val pred = BinaryClassificationModelSelector().setInput(survived, checkedFeatures).getOutput()
// Setting up a TransmogrifAI workflow and training the model
val model = new OpWorkflow().setInputDataset(passengersData).setResultFeatures(pred).train()
println("Model summary:\n" + model.summaryPretty())
Model summary:
Evaluated Logistic Regression, Random Forest models with 3 folds and AuPR metric.
Evaluated 3 Logistic Regression models with AuPR between [0.6751930383321765, 0.7768725281794376]
Evaluated 16 Random Forest models with AuPR between [0.7781671467343991, 0.8104798040316159]
Selected model Random Forest classifier with parameters:
|-----------------------|--------------|
| Model Param | Value |
|-----------------------|--------------|
| modelType | RandomForest |
| featureSubsetStrategy | auto |
| impurity | gini |
| maxBins | 32 |
| maxDepth | 12 |
| minInfoGain | 0.001 |
| minInstancesPerNode | 10 |
| numTrees | 50 |
| subsamplingRate | 1.0 |
|-----------------------|--------------|
Model evaluation metrics:
|-------------|--------------------|---------------------|
| Metric Name | Hold Out Set Value | Training Set Value |
|-------------|--------------------|---------------------|
| Precision | 0.85 | 0.773851590106007 |
| Recall | 0.6538461538461539 | 0.6930379746835443 |
| F1 | 0.7391304347826088 | 0.7312186978297163 |
| AuROC | 0.8821603927986905 | 0.8766642291593114 |
| AuPR | 0.8225075757571668 | 0.850331080886535 |
| Error | 0.1643835616438356 | 0.19682151589242053 |
| TP | 17.0 | 219.0 |
| TN | 44.0 | 438.0 |
| FP | 3.0 | 64.0 |
| FN | 9.0 | 97.0 |
|-------------|--------------------|---------------------|
Top model insights computed using correlation:
|-----------------------|----------------------|
| Top Positive Insights | Correlation |
|-----------------------|----------------------|
| sex = "female" | 0.5177801026737666 |
| cabin = "OTHER" | 0.3331391338844782 |
| pClass = 1 | 0.3059642953159715 |
|-----------------------|----------------------|
| Top Negative Insights | Correlation |
|-----------------------|----------------------|
| sex = "male" | -0.5100301587292186 |
| pClass = 3 | -0.5075774968534326 |
| cabin = null | -0.31463114463832633 |
|-----------------------|----------------------|
Top model insights computed using CramersV:
|-----------------------|----------------------|
| Top Insights | CramersV |
|-----------------------|----------------------|
| sex | 0.525557139885501 |
| embarked | 0.31582347194683386 |
| age | 0.21582347194683386 |
|-----------------------|----------------------|
While this may seem a bit too magical, for those who want more control, TransmogrifAI also provides the flexibility to completely specify all the features being extracted and all the algorithms being applied in your ML pipeline. Visit our docs site for full documentation, getting started, examples, faq and other information.
You can simply add TransmogrifAI as a regular dependency to an existing project. Start by picking TransmogrifAI version to match your project dependencies from the version matrix below (if not sure - take the stable version):
TransmogrifAI Version | Spark Version | Scala Version | Java Version |
---|---|---|---|
0.7.1 (unreleased, master), 0.7.0 (stable) | 2.4 | 2.11 | 1.8 |
0.6.1, 0.6.0, 0.5.3, 0.5.2, 0.5.1, 0.5.0 | 2.3 | 2.11 | 1.8 |
0.4.0, 0.3.4 | 2.2 | 2.11 | 1.8 |
For Gradle in build.gradle
add:
repositories {
jcenter()
mavenCentral()
}
dependencies {
// TransmogrifAI core dependency
compile 'com.salesforce.transmogrifai:transmogrifai-core_2.11:0.7.0'
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// compile 'com.salesforce.transmogrifai:transmogrifai-models_2.11:0.7.0'
}
For SBT in build.sbt
add:
scalaVersion := "2.11.12"
resolvers += Resolver.jcenterRepo
// TransmogrifAI core dependency
libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-core" % "0.7.0"
// TransmogrifAI pretrained models, e.g. OpenNLP POS/NER models etc. (optional)
// libraryDependencies += "com.salesforce.transmogrifai" %% "transmogrifai-models" % "0.7.0"
Then import TransmogrifAI into your code:
// TransmogrifAI functionality: feature types, feature builders, feature dsl, readers, aggregators etc.
import com.salesforce.op._
import com.salesforce.op.aggregators._
import com.salesforce.op.features._
import com.salesforce.op.features.types._
import com.salesforce.op.readers._
// Spark enrichments (optional)
import com.salesforce.op.utils.spark.RichDataset._
import com.salesforce.op.utils.spark.RichRDD._
import com.salesforce.op.utils.spark.RichRow._
import com.salesforce.op.utils.spark.RichMetadata._
import com.salesforce.op.utils.spark.RichStructType._
Visit our docs site for full documentation, getting started, examples, faq and other information.
See scaladoc for the programming API.
BSD 3-Clause © Salesforce.com, Inc.