The Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.
This repository has been marked for archival and is no longer maintained.
Install AZTK with pip:

```sh
pip install aztk
```

Initialize your AZTK environment in the directory from which you want to work:

```sh
aztk spark init
```

Run the account setup script to create and configure the Azure resources AZTK needs and to fill in your `.aztk/secrets.yaml` file:

```sh
wget -q https://raw.githubusercontent.com/Azure/aztk/v0.10.3/account_setup.sh -O account_setup.sh && chmod 755 account_setup.sh && /bin/bash account_setup.sh
```

For more information, see Getting Started Scripts.
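When the script finishes, `.aztk/secrets.yaml` holds the credentials AZTK uses to authenticate with Azure. As a rough sketch of its shape (the exact keys can vary by AZTK version, and every value below is a placeholder):

```yaml
# .aztk/secrets.yaml -- illustrative sketch; values are placeholders
service_principal:
    tenant_id: <tenant-id>
    client_id: <application-id>
    credential: <application-secret>
    batch_account_resource_id: <batch-account-resource-id>
    storage_account_resource_id: <storage-account-resource-id>
```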
The core experience of this package is centered around a few commands.
```sh
# create your cluster
aztk spark cluster create
aztk spark cluster add-user

# monitor and manage your clusters
aztk spark cluster get
aztk spark cluster list
aztk spark cluster delete

# login and submit applications to your cluster
aztk spark cluster ssh
aztk spark cluster submit
```
First, create your cluster:
```sh
aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
```
- The `--vm-size` argument must be the official SKU name, which usually comes in the form "standard_d2_v2".
- The cluster id (`--id`) can only contain alphanumeric characters, including hyphens and underscores, and cannot contain more than 64 characters.
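If you did not configure a default user during `aztk spark init`, you can grant yourself access with `add-user`. A minimal sketch (the username and SSH key path are placeholders, and flag spellings may differ across AZTK versions):

```sh
# add a user account to the cluster for SSH access
# (username and key path are illustrative)
aztk spark cluster add-user --id my_cluster --username spark --ssh-key ~/.ssh/id_rsa.pub
```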
More information on working with a cluster can be found in the cluster documentation.
To check your cluster's status, use the `get` command:

```sh
aztk spark cluster get --id my_cluster
```
When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of `spark-submit` will be streamed to your local console. Run this command from the cloned AZTK repo:
```sh
# submit a java application
aztk spark cluster submit \
    --id my_cluster \
    --name my_java_job \
    --class org.apache.spark.examples.SparkPi \
    --executor-memory 20G \
    path/to/examples.jar 1000

# submit a python application
aztk spark cluster submit \
    --id my_cluster \
    --name my_python_job \
    --executor-memory 20G \
    path/to/pi.py 1000
```
The `aztk spark cluster submit` command takes the same parameters as the standard `spark-submit` command, except that instead of specifying `--master`, AZTK requires that you specify your cluster `--id` and a unique job `--name`:

- The job name (`--name`) argument must be at least 3 characters long.
- Use the `--no-wait` option for your command to return immediately.
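For example, a fire-and-forget submission using the options above might look like this (the application path and job name are placeholders):

```sh
# submit and return immediately instead of streaming the job's output
aztk spark cluster submit \
    --id my_cluster \
    --name my-async-job \
    --no-wait \
    path/to/pi.py 1000
```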
Learn more about the `aztk spark cluster submit` command here.
Most users will want to work interactively with their Spark clusters. With the `aztk spark cluster ssh` command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:
```sh
aztk spark cluster ssh --id my_cluster --user spark
```
By default, we port-forward the Spark Web UI to `localhost:8080`, the Spark Jobs UI to `localhost:4040`, and the Spark History Server to `localhost:18080`. You can configure these settings in the `.aztk/ssh.yaml` file.
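As a rough sketch of what that file can contain (the key names below are assumptions based on the defaults above and may differ between AZTK versions):

```yaml
# .aztk/ssh.yaml -- illustrative sketch; key names are assumptions
username: spark          # user to SSH in as
web_ui_port: 8080        # local port forwarded to the Spark Web UI
job_ui_port: 4040        # local port forwarded to the Spark Jobs UI
history_ui_port: 18080   # local port forwarded to the Spark History Server
```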
NOTE: When working interactively, you may want to use tools like Jupyter or RStudio Server. To do so, you need to set up your cluster with the appropriate Docker image and plugin. See Plugins for more information.
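For instance, provisioning a cluster against an alternate Docker image might look like the sketch below; the `--docker-repo` flag and the image tag here are assumptions, so check the AZTK Docker image documentation for the images your version supports:

```sh
# create a cluster from a non-default Docker image
# (flag name and image tag are illustrative assumptions)
aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2 \
    --docker-repo aztk/base:spark2.2.0
```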
You can also see your clusters from the CLI:
```sh
aztk spark cluster list
```
And get the state of any specified cluster:
```sh
aztk spark cluster get --id <my_cluster_id>
```
Finally, you can delete any specified cluster:
```sh
aztk spark cluster delete --id <my_cluster_id>
```
You can find more documentation here.