TensorHive is an open source system for monitoring and managing computing resources across multiple hosts. It solves the most common problems and nightmares of accessing and sharing your AI-oriented infrastructure among multiple, often competing users.
It is designed with simplicity, flexibility and configuration-friendliness in mind.
Our goal is to provide solutions to the painful problems ML engineers often struggle with when using remote machines to run neural network trainings.
0️⃣ Dead-simple one-machine installation and configuration, no external dependencies
1️⃣ Users can make GPU reservations for a specific time range in advance via the reservation mechanism
➡️ no more frustration caused by rules: "first come, first served" or "the law of the jungle".
2️⃣ Users can prepare and schedule custom tasks (commands) to be run on selected GPUs and hosts
➡️ automate and simplify distributed trainings - "one button to rule them all"
3️⃣ Gather all useful GPU metrics from all configured hosts in one dashboard
➡️ no more manual logging in to each individual machine to check whether a GPU is currently in use
For more details, check out the full list of features.
```shell
pip install tensorhive
```
(optional) For development purposes, we encourage separation from your current Python packages using e.g. virtualenv or Anaconda.

```shell
git clone https://github.com/roscisz/TensorHive.git && cd TensorHive
pip install -e .
```
TensorHive already ships with the newest web app build, but in case you modify the source, you can rebuild it with `make app` (currently on the `master` branch). For more useful commands, see our Makefile. Build tested with Node v10.15.2.
The `init` command will guide you through the basic configuration process:
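A minimal run, assuming the `tensorhive` console entry point that pip installs:

```shell
tensorhive init
```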
You can check connectivity with the configured hosts using the `test` command:
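Again assuming the `tensorhive` entry point:

```shell
tensorhive test
```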
(optional) If you want to allow your UNIX users to set up their TensorHive accounts on their own and run distributed tasks via the `Task execution` plugin, use the `key` command to generate the SSH key for TensorHive:
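Using the same assumed entry point:

```shell
tensorhive key
```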
Now you should be ready to launch a TensorHive instance:
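With the assumed entry point, launching is just:

```shell
tensorhive
```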
The web application and API documentation can be accessed via the URLs highlighted in green (Ctrl + click to open in a browser).
You can fully customize TensorHive behaviours via INI configuration files (which will be created automatically after the first launch):

```
~/.config/TensorHive/main_config.ini
~/.config/TensorHive/mailbot_config.ini
~/.config/TensorHive/hosts_config.ini
```
Accessible infrastructure can be monitored in the Nodes overview tab. [Sample screenshot] Here you can add new watches, select metrics, and monitor ongoing GPU processes and their owners.
Each column represents all reservation events for a GPU on a given day. To make a new reservation, simply click and drag with your mouse, select GPU(s), add a meaningful title, and optionally adjust the time range.
From now on, only your processes are eligible to run on the reserved GPU(s). TensorHive periodically checks whether any other user has violated the reservation. The violator will be spammed with warnings on all their PTYs and emailed every once in a while; additionally, the admin will be notified (it all depends on the configuration).
[Screenshots: terminal warning and email warning]
Thanks to the `Task execution` module, you can define commands for tasks you want to run on any configured nodes. You can manage them manually or set spawn/terminate dates. Commands are run within a `screen` session, so attaching to one while it is running is a piece of cake.
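For instance, on the node where a task is running (the session name is a placeholder; pick the real one from `screen -ls`):

```shell
screen -ls           # list running screen sessions on this node
screen -r <session>  # attach to the task's session; detach again with Ctrl+A, D
```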
It provides a simple but flexible (framework-agnostic) command templating mechanism that will help you automate multi-node trainings. Additionally, specialized templates help you conveniently set the proper parameters for chosen well-known frameworks.
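As a purely illustrative sketch (this is not TensorHive's actual template syntax; the hosts, flags and port are hypothetical), a rendered per-node command for a multi-node training run might look like:

```shell
# Hypothetical command rendered for one node of a 2-node run; TensorHive's
# templates fill in per-host values such as the GPU id, rank and master address.
CUDA_VISIBLE_DEVICES=0 python train.py --node_rank=0 --master_addr=node1 --master_port=29500
```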
In the `examples` directory, you will find sample scenarios of using the `Task execution` module for various frameworks and computing environments.
TensorHive requires users who want to use this feature to append TensorHive's public key to their `~/.ssh/authorized_keys` on all nodes they want to connect to.
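For example (the key filename is illustrative; use the public key printed by the `key` command):

```shell
# On each target node, append TensorHive's public key to the user's authorized keys
cat tensorhive.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys  # keep the expected permissions
```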
| Organization | Infrastructure | Users |
|---|---|---|
| Gdansk University of Technology | NVIDIA DGX Station (4x Tesla V100) + NVIDIA DGX-1 (8x Tesla V100) | 30+ |
| Lab at GUT | 20 machines with GTX 1060 each | 20+ |
| Gradient PG | A server with two GPUs shared by the Gradient science club at GUT | 30+ |
| VoiceLab - Conversational Intelligence | 30+ GTX and RTX GPUs | 10+ |
This diagram will help you grasp the rough concept of the system.
We'd ❤️ to collect your observations, issues and pull requests!
Feel free to report any configuration problems; we will help you.
Currently we are working on user groups for differentiated GPU access control, grouping tasks into jobs, and a process-killing reservation-violation handler (deadline: July 2020), so stay tuned!
TensorHive has been greatly supported within a joint project between VoiceLab.ai and Gdańsk University of Technology titled: "Exploration and selection of methods for parallelization of neural network training using multiple GPUs".
Project created and maintained by: