Scientific Luigi (SciLuigi for short) is a light-weight wrapper library around Spotify's Luigi workflow system that aims to make writing scientific workflows more fluent, flexible and modular.
Luigi is a flexible and fun-to-use library. It has turned out, though, that its default way of defining dependencies, by hard coding them in each task's requires() function, is not optimal for some types of workflows common e.g. in bioinformatics, where multiple inputs and outputs, complex dependencies, and the need to quickly try different workflow connectivity in an explorative fashion are central to the way of working.
SciLuigi was designed to solve some of these problems, by providing the following "features" over vanilla Luigi:
Because of Luigi's easy-to-use API, these changes have been implemented as a very thin layer on top of Luigi's own API, with no changes at all to the Luigi core. This means that you can continue leveraging the work already being put into maintaining and further developing Luigi by the team at Spotify and others.
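The difference between hard-coded and externally wired dependencies can be illustrated with a small, library-free sketch. The classes below are hypothetical stand-ins written only for this illustration (they are not Luigi or SciLuigi classes); only the idea of a requires() method, and of in_*/out_* members, is taken from the text above.

```python
class FooWriter:
    """Stand-in upstream task that produces the text 'foo'."""
    def out_text(self):
        return 'foo\n'


class BazWriter:
    """A second stand-in upstream task, used to show re-wiring."""
    def out_text(self):
        return 'baz foo\n'


class HardCodedReplacer:
    """Luigi-style: the upstream task is named inside the task itself."""
    def requires(self):
        return FooWriter()  # changing the upstream means editing this class

    def run(self):
        return self.requires().out_text().replace('foo', 'bar')


class WiredReplacer:
    """SciLuigi-style: the input is a field that is set from outside."""
    in_text = None  # assigned by whoever defines the workflow

    def run(self):
        return self.in_text().replace('foo', 'bar')


# The workflow definition decides what is connected to what.
replacer = WiredReplacer()
replacer.in_text = FooWriter().out_text
first_result = replacer.run()

# Trying a different connectivity needs no change to WiredReplacer.
replacer.in_text = BazWriter().out_text
second_result = replacer.run()

hard_coded_result = HardCodedReplacer().run()
```

In the wired variant, the task class knows nothing about its upstream, which is what makes it easy to try out alternative connections in an explorative way.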
For a brief 10 minute screencast going through the basics below, see this link
Just to give a quick feel for what a workflow definition might look like in SciLuigi, check this code example (the implementation of the tasks is hidden here for brevity; see the Usage section further below for more details):
import sciluigi as sl

class MyWorkflow(sl.WorkflowTask):

    def workflow(self):
        # Initialize tasks:
        foowrt = self.new_task('foowriter', MyFooWriter)
        foorpl = self.new_task('fooreplacer', MyFooReplacer,
            replacement='bar')

        # Here we do the *magic*: Connecting outputs to inputs:
        foorpl.in_foo = foowrt.out_foo

        # Return the last task(s) in the workflow chain.
        return foorpl
That's it! And again, see the "usage" section just below for a more detailed description of getting to this!
Please use the issue queue for any support questions, rather than mailing the author(s) directly, as the solutions can then help others who face similar issues (we are a very small team with very limited time, so this is important).
Install SciLuigi, including its dependencies (luigi etc), through PyPI:
pip install sciluigi
Now you can use the library by just importing it in your python script, like so:
import sciluigi
Note that you can alias it to a shorter name, for brevity and to save keystrokes:
import sciluigi as sl
Creating workflows in SciLuigi differs slightly from how it is done in vanilla Luigi. Very briefly, it is done in these main steps:
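1. Create a workflow task class.
2. Create task classes.
3. Define the workflow in the workflow class's workflow() method.
4. Add a run method at the end of the script.
5. Run the script.

The first thing to do when creating a workflow is to define a workflow task.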
You do this by:
1. Creating a subclass of sciluigi.WorkflowTask
2. Implementing the workflow() method.

import sciluigi

class MyWorkflow(sciluigi.WorkflowTask):
    def workflow(self):
        pass # TODO: Implement workflow here later!
Then, you need to define some tasks that can be done in this workflow.
This is done by:
1. Creating a subclass of sciluigi.Task (or sciluigi.SlurmTask if you want Slurm support).
2. Adding fields named in_<yournamehere> for each input, in the new task class.
3. Defining methods named out_<yournamehere>() for each output, that return sciluigi.TargetInfo objects. (sciluigi.TargetInfo is initialized with a reference to the task object itself, typically self, and a path name, where upstream tasks' paths can be used.)
4. Implementing the run() method of the task.

Let's define a simple task that just writes "foo" to a file named foo.txt:
class MyFooWriter(sciluigi.Task):
    # We have no inputs here
    # Define outputs:
    def out_foo(self):
        return sciluigi.TargetInfo(self, 'foo.txt')

    def run(self):
        with self.out_foo().open('w') as foofile:
            foofile.write('foo\n')
Then, let's create a task that replaces "foo" with "bar":
class MyFooReplacer(sciluigi.Task):
    replacement = sciluigi.Parameter() # Here, we take as a parameter
                                       # what to replace foo with.
    # Here we have one input, a "foo file":
    in_foo = None

    # ... and an output, a "bar file":
    def out_replaced(self):
        # As the path to the returned target(info), we
        # use the path of the foo file:
        return sciluigi.TargetInfo(self, self.in_foo().path + '.bar.txt')

    def run(self):
        with self.in_foo().open() as in_f:
            with self.out_replaced().open('w') as out_f:
                # Here we see that we use the parameter self.replacement:
                out_f.write(in_f.read().replace('foo', self.replacement))
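To make the path handling in this task easier to follow, here is a library-free sketch of how an in_* field that holds an upstream out_* method resolves to a target only when it is called. FakeTargetInfo, Writer and Replacer are hypothetical stand-ins made up for this illustration; this is not SciLuigi's implementation.

```python
class FakeTargetInfo:
    """Hypothetical stand-in: remembers the producing task and a path."""
    def __init__(self, task, path):
        self.task = task
        self.path = path


class Writer:
    def out_foo(self):
        return FakeTargetInfo(self, 'foo.txt')


class Replacer:
    in_foo = None  # will hold a reference to an upstream out_* method

    def out_replaced(self):
        # Calling self.in_foo() yields the upstream target, whose path
        # is extended to form this task's own output path.
        return FakeTargetInfo(self, self.in_foo().path + '.bar.txt')


writer = Writer()
replacer = Replacer()
replacer.in_foo = writer.out_foo  # no parentheses: we pass the method itself

upstream = replacer.in_foo()
downstream = replacer.out_replaced()
print(upstream.path)    # foo.txt
print(downstream.path)  # foo.txt.bar.txt
```

Because the field stores the method rather than its result, the upstream path is looked up at call time, so the downstream file name follows whatever task happens to be connected.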
We could instead have written the last lines using the command-line sed utility, available on Linux, by calling it through the built-in ex() method:
    def run(self):
        # Here, we use the in-built self.ex() method, to execute commands:
        self.ex("sed 's/foo/{repl}/g' {inpath} > {outpath}".format(
            repl=self.replacement,
            inpath=self.in_foo().path,
            outpath=self.out_replaced().path))
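To see what command string the format() call above produces, the template can be filled in outside of any task. The values below are example values chosen for illustration, and wrapping the paths in shlex.quote is an optional addition of this sketch (not something the original code does), meant to guard against spaces or shell metacharacters in file names.

```python
import shlex

template = "sed 's/foo/{repl}/g' {inpath} > {outpath}"

# Example values, standing in for self.replacement and the target paths.
replacement = 'bar'
inpath = 'foo.txt'
outpath = 'foo.txt.bar.txt'

plain_cmd = template.format(repl=replacement, inpath=inpath, outpath=outpath)
print(plain_cmd)

# Optional hardening: quote the paths so unusual file names stay one argument.
odd_inpath = 'my foo.txt'
quoted_cmd = template.format(
    repl=replacement,
    inpath=shlex.quote(odd_inpath),
    outpath=shlex.quote(odd_inpath + '.bar.txt'))
print(quoted_cmd)
```

Note that the replacement string is inserted into the sed expression as-is, so a replacement containing a slash would need escaping before being used this way.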
Now we can use the two tasks we created to build a simple workflow, in the workflow class that we also created above.
We do this by:
1. Instantiating the tasks, using the self.new_task(<unique_taskname>, <task_class>, *args, **kwargs) method of the workflow task.
2. Connecting the tasks together, by pointing the right out_* method to the right in_* field.
3. Returning the last task in the chain from the workflow() method.

import sciluigi

class MyWorkflow(sciluigi.WorkflowTask):
    def workflow(self):
        foowriter = self.new_task('foowriter', MyFooWriter)
        fooreplacer = self.new_task('fooreplacer', MyFooReplacer,
            replacement='bar')

        # Here we do the *magic*: Connecting outputs to inputs:
        fooreplacer.in_foo = foowriter.out_foo

        # Return the last task(s) in the workflow chain.
        return fooreplacer
Now, the only thing that remains is adding a run method to the end of the script.
You can use Luigi's own luigi.run(), or our own two methods:
- sciluigi.run()
- sciluigi.run_local()
The run_local() variant is handy if you don't want to run a central scheduler daemon, but just want to run the workflow as a script.
Both of the above take the same options as luigi.run(), so you can for example set the main class to use (our workflow task):
# End of script ....
if __name__ == '__main__':
    sciluigi.run_local(main_task_cls=MyWorkflow)
Now, you should be able to run the workflow as simply as:
python myworkflow.py
... provided of course, that the workflow is saved in a file named myworkflow.py.
See the examples folder for more detailed examples!
The basic idea behind SciLuigi, and a preceding solution to it, was presented in a workshop talk (e-Infra MPS 2015):
See also this collection of links to more of our reported experiences using Luigi, which led up to the creation of SciLuigi.
Both of the limitations are due to the fact that Luigi does scheduling and execution separately (with the exception of Luigi's dynamic dependencies, but they work only for upstream tasks, not downstream tasks, which we would need).
If you run into any of these problems, you might be interested in a new workflow engine we are developing to overcome these limitations: SciPipe.
This work has been supported by:
Many ideas and much of the inspiration for the API are taken from:
Below is an incomplete list of publications using SciLuigi for computational analysis. If you are using SciLuigi in a publication, please consider adding your own here.
Schulz W, Durant T, Siddon A, Torres R. Use of application containers and workflows for genomic data analysis. J Pathol Inform. 2016;7(1):53. DOI: 10.4103/2153-3539.197197
If you find yourself needing more advanced scheduling features such as dynamic scheduling, or run into performance problems with Python/Luigi/SciLuigi, you might be interested in checking out a new workflow engine we are developing in the Go programming language, to cope with some of the limitations we have still faced with Python/Luigi/SciLuigi: SciPipe.
SciPipe leverages some of the successful parts of Luigi's API, such as the flexible file name formatting, but replaces the Luigi scheduler with a custom, novel and very light-weight implicit dataflow scheduler written in Go. We find that it makes life much easier for complex workflow constructs, such as those involving cross validation and/or nested parameter sweeps.