# ProGraML: Program Graphs for Machine Learning
| | |
|--|--|
| OS | GNU/Linux, macOS ≥ 10.15 |
| Python Versions | 3.6, 3.7, 3.8, 3.9 |
ProGraML is a representation for programs as input to a machine learning model.
Key features are:
To get stuck in and play around with our graph representation, visit:
Or if papers are more your ☕, have a read of ours:
Unpack the release archive to `~/.local/opt/programl` (or a directory of your choice) using:

$ mkdir -p ~/.local/opt/programl
$ tar xjvf ~/Downloads/programl-*.tar.bz2 -C ~/.local/opt/programl

Then append the following to your shell's startup file:

export PATH=$HOME/.local/opt/programl/bin:$PATH
export LD_LIBRARY_PATH=$HOME/.local/opt/programl/lib:$LD_LIBRARY_PATH
Install the python dependencies using:
$ python -m pip install -r requirements.txt
Once you have the above requirements installed, test that everything is working by building and running the full test suite:
$ bazel test //...
Build and install the command line tools to `~/.local` (or a directory of your choice) using:
$ bazel run -c opt //:install -- ~/.local
Then to use them, append the following to your shell's startup file:

export PATH=~/.local/opt/programl/bin:$PATH
export LD_LIBRARY_PATH=~/.local/opt/programl/lib:$LD_LIBRARY_PATH
Please see this doc for download links for our publicly available datasets of LLVM-IRs, ProGraML graphs, and data flow analysis labels.
If you are using bazel you can add ProGraML as an external dependency. Add to your WORKSPACE file:
```
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "programl",
    strip_prefix = "ProGraML-<stable-commit>",
    urls = ["https://github.com/ChrisCummins/ProGraML/archive/<stable-commit>.tar.gz"],
)

# ----------------- Begin ProGraML dependencies -----------------
<WORKSPACE dependencies>
# ----------------- End ProGraML dependencies -----------------
```
Then in your BUILD file:
```
cc_library(
    name = "mylib",
    srcs = ["mylib.cc"],
    deps = [
        "@programl//programl/ir/llvm",
    ],
)

py_binary(
    name = "myscript",
    srcs = ["myscript.py"],
    deps = [
        "@programl//programl/ir/llvm/py:llvm",
    ],
)
```
The ProGraML representation is constructed in multiple stages. Here we describe the process for a simple recursive Fibonacci implementation in C. For instructions on how to run this process, see Usage below.
We start by lowering the program to a compiler IR; in this case, LLVM-IR. This can be done using: `clang -emit-llvm -S -O3 fib.c`.
We begin building a graph by constructing a full-flow graph of the program. In a
full-flow graph, every instruction is a node and the edges are control-flow.
Note that edges are positional, so that we can differentiate between the
multiple outgoing control-flow edges of a branching instruction.
Then we add a graph node for every variable and constant. In the drawing above, the diamonds are constants and the ovals are variables. We add data-flow edges to describe the relations between constants and the instructions that use them, and between variables and the instructions which define/use them. Like control edges, data edges have positions. In the case of data edges, the position encodes the order of a data element in the list of instruction operands.
Finally, we add call edges (green) from callsites to the function entry
instruction, and return edges from function exits to the callsite. Since this is
a graph of a recursive function, the callsites refer back to the entry of the
same function. An external node is used to represent a call from an
external site.
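The three construction stages above can be sketched with plain Python data structures. This is a toy model for intuition only, not the actual ProGraML protocol buffer schema or API, and the instruction texts are illustrative:

```python
# A toy sketch of the three construction stages using plain Python
# data structures -- NOT the actual ProGraML schema.
from dataclasses import dataclass, field


@dataclass
class Graph:
    nodes: list = field(default_factory=list)   # node label text
    edges: list = field(default_factory=list)   # (flow, src, dst, position)

    def add_node(self, text):
        self.nodes.append(text)
        return len(self.nodes) - 1

    def add_edge(self, flow, src, dst, position=0):
        self.edges.append((flow, src, dst, position))


g = Graph()

# Stage 1: full-flow graph. One node per instruction; positional
# control edges distinguish the branch targets.
br = g.add_node("br %cond, %then, %else")
then_ = g.add_node("%a = add %x, 1")
else_ = g.add_node("call @fib")
g.add_edge("control", br, then_, position=0)   # true target
g.add_edge("control", br, else_, position=1)   # false target

# Stage 2: data nodes for variables (ovals) and constants (diamonds);
# the position on a data edge encodes the operand order.
var_a = g.add_node("%a")
const1 = g.add_node("i32 1")
g.add_edge("data", const1, then_, position=1)  # second operand of the add
g.add_edge("data", then_, var_a)               # the add defines %a

# Stage 3: call edges from the callsite to the function entry, and a
# return edge back. A recursive call refers back to its own entry.
entry = g.add_node("fib entry")
exit_ = g.add_node("ret %r")
g.add_edge("call", else_, entry)
g.add_edge("call", exit_, else_)               # return edge
```

The key design point this illustrates is that all three relation types (control, data, call) share a single graph, with edge positions preserving the ordering information that a plain adjacency structure would lose.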
In the manner of Unix Zen, creating and manipulating ProGraML graphs is done using command-line tools which act as filters, reading in graphs from stdin and emitting graphs to stdout. The structure for graphs is described through a series of protocol buffers.
This section provides an example step-by-step guide for generating a program graph for a C++ application.
First, lower the application to LLVM-IR using clang's `-emit-llvm -S` flags. For a single-source application, the command line invocation would be:
$ clang-10 -emit-llvm -S -c my_app.cpp -o my_app.ll
For a multi-source application, you can compile each file to LLVM-IR separately and then link the results. For example:
$ clang-10 -emit-llvm -S -c foo.cpp -o foo.ll
$ clang-10 -emit-llvm -S -c bar.cpp -o bar.ll
$ llvm-link foo.ll bar.ll -S -o my_app.ll
$ llvm2graph < my_app.ll > my_app.pbtxt
The generated file
my_app.pbtxt uses a human-readable
format which you can inspect using a text editor. In this case, we will
render it to an image file using Graphviz.
$ graph2dot < my_app.pbtxt > my_app.dot
$ dot -Tpng my_app.dot -o my_app.png
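As a quick sanity check, you can also count the node and edge messages in the generated `.pbtxt` with a few lines of Python. This sketch assumes the default protocol buffer text format, where each graph element is printed as a `node { ... }` or `edge { ... }` block; the field contents below are illustrative:

```python
# Count node/edge messages in a ProgramGraph .pbtxt by scanning for the
# opening lines of each message block (a rough check, not a real parser).
def summarize(pbtxt: str):
    lines = [line.strip() for line in pbtxt.splitlines()]
    nodes = sum(1 for line in lines if line.startswith("node {"))
    edges = sum(1 for line in lines if line.startswith("edge {"))
    return nodes, edges


# Illustrative fragment in protobuf text format.
example = """\
node {
  text: "[external]"
}
node {
  text: "ret"
}
edge {
  target: 1
}
"""

print(summarize(example))  # -> (2, 1)
```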
$ bazel run -c opt //tasks/dataflow:train_ggnn -- \
    --analysis reachability \
    --path=$HOME/programl
where `--analysis` is the name of the analysis you want to evaluate, and
`--path` is the root of the unpacked dataset. There are many options that
you can use to control the behavior of the experiment; see `--helpfull` for a
full list. Some useful ones include:
* `--batch_size` controls the number of nodes in each batch of graphs.
* `--layer_timesteps` defines the layers of the GGNN model, and the number of timesteps used for each.
* `--learning_rate` sets the initial learning rate of the optimizer.
* `--lr_decay_rate` sets the rate at which the learning rate decays.
* `--lr_decay_steps` sets the number of gradient steps until the learning rate is decayed.
* `--train_graph_counts` lists the number of graphs to train on between runs of the validation set.
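To make the decay flags concrete, here is one plausible interpretation of how they could interact, assuming a simple step-wise exponential schedule; the trainer's actual schedule is not specified here and may differ:

```python
# Hypothetical step-wise exponential decay combining the three flags.
# The real --lr_* flag semantics in train_ggnn are assumptions here.
def learning_rate(step, initial_lr=0.001, lr_decay_rate=0.95, lr_decay_steps=10000):
    """Decay initial_lr by lr_decay_rate once every lr_decay_steps steps."""
    return initial_lr * lr_decay_rate ** (step // lr_decay_steps)


print(learning_rate(0))      # initial rate: 0.001
print(learning_rate(25000))  # decayed twice: 0.001 * 0.95 ** 2
```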
🏗️ Under construction We are in the process of refactoring the dataflow experiments with a revamped API. There are currently bugs in the data loader which may affect training jobs, see #147.
Funding sources: HiPEAC Travel Grant.