This repository contains the next generation of the Genome Analysis Toolkit (GATK). The contents of this repository are 100% open source and released under the Apache 2.0 license (see LICENSE.TXT).
GATK4 aims to bring together well-established tools from the GATK and Picard codebases under a streamlined framework, and to enable selected tools to be run in a massively parallel way on local clusters or in the cloud using Apache Spark. It also contains many newly developed tools not present in earlier releases of the toolkit.
Run `git lfs install` after downloading, followed by `git lfs pull` from the root of your git clone to download all of the large files, including those required to run the test suite. The full download is approximately 5 gigabytes. Alternatively, if you are just building GATK and not running the test suite, you can skip this step: the build itself will use git-lfs to download only the minimal set of large resource files required to complete the build. The test resources will not be downloaded, which greatly reduces the size of the download.
Builds use the included `./gradlew` script, which will download and use an appropriate Gradle version automatically (see examples below).
The `gatk` conda environment requires hardware with AVX support for tools that depend on TensorFlow (e.g. CNNScoreVariants). The GATK Docker image comes with the `gatk` conda environment pre-configured.
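If you're unsure whether your machine supports AVX, a quick Linux-only check (it reads `/proc/cpuinfo`, so it won't work on macOS) is:

```shell
# Linux-only sketch: report whether the CPU advertises the AVX flag
if grep -qw avx /proc/cpuinfo; then
  echo "AVX supported"
else
  echo "AVX not detected"
fi
```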
Run `conda env create -f gatkcondaenv.yml` to create the `gatk` conda environment.
Run `./gradlew localDevCondaEnv`. This generates the Python package archive and conda yml dependency file(s) in the build directory, and also creates (or updates) the local `gatk` conda environment.
Run `source activate gatk` to activate the `gatk` environment.
./gatk PrintReads -I src/test/resources/NA12878.chr17_69k_70k.dictFix.bam -O output.bam
./gatk PrintReads --help
You can download and run pre-built versions of GATK4 from the following places:
A zip archive with everything you need to run GATK4 can be downloaded for each release from the github releases page. We also host unstable archives generated nightly in the Google bucket gs://gatk-nightly-builds.
To do a full build of GATK4, first clone the GATK repository using "git clone", then run `./gradlew bundle` from the root of your clone. Equivalently, you can just type `./gradlew`.
This creates a zip archive in the `build/` directory with a name like `gatk-VERSION.zip`, containing a complete standalone GATK distribution, including our launcher `gatk`, both the local and spark jars, and this README.
Other ways to build:
The GATK local jar is built into `build/libs` with a name like `gatk-package-VERSION-local.jar`, and can be used outside of your git clone.
The GATK spark jar is built into `build/libs` with a name like `gatk-package-VERSION-spark.jar`, and can be used outside of your git clone.
To remove previous builds, run `./gradlew clean`.
For faster gradle operations, add `org.gradle.daemon=true` to your `~/.gradle/gradle.properties` file. This will keep a gradle daemon running in the background and avoid the ~6s gradle startup time on every command.
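Assuming Gradle's default properties location, the entry lives in `~/.gradle/gradle.properties`:

```
# ~/.gradle/gradle.properties
org.gradle.daemon=true
```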
Gradle keeps a cache of dependencies used to build GATK. By default this goes in
~/.gradle. If there is insufficient free space in your home directory, you can change the location of the cache by setting the
GRADLE_USER_HOME environment variable.
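For example, to point the cache at a directory with more free space (the directory name here is just an illustration):

```shell
# Relocate Gradle's dependency cache; the directory name is an example
export GRADLE_USER_HOME="$HOME/gradle-cache"
mkdir -p "$GRADLE_USER_HOME"
```

Subsequent `./gradlew` invocations in the same shell session will use this location.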
The version number is automatically derived from the git history using `git describe`; you can override it by setting the `versionOverride` property (e.g. `./gradlew -DversionOverride=my_weird_version printVersion`).
The standard way to run GATK4 tools is via the
gatk wrapper script located in the root directory of a clone of this repository.
`gatk` can be run:
- by extracting the zip archive produced by `./gradlew bundle` to a directory, and running the `gatk` script within the same directory as the fully-packaged GATK jars
- by defining the environment variables `GATK_LOCAL_JAR` and `GATK_SPARK_JAR`, and setting them to the paths to the GATK jars produced by your build
`gatk` can run non-Spark tools as well as Spark tools, and can run Spark tools locally, on a Spark cluster, or on Google Cloud Dataproc.
Note: running with `java -jar` directly and bypassing `gatk` causes several important system properties to not get set, including the htsjdk compression level!
For help on using `gatk` itself, run `./gatk --help`.
To print a list of available tools, run `./gatk --list`. Spark-based tools have "Spark" in their names (e.g. BaseRecalibratorSpark). Most other tools are non-Spark-based.
To print help for a particular tool, run
./gatk ToolName --help.
To run a non-Spark tool, or to run a Spark tool locally, the syntax is:
./gatk ToolName toolArguments.
Tool arguments that allow multiple values, such as -I, can be supplied on the command line using a file with the extension ".args". Each line of the file should contain a single value for the argument.
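For instance, a hypothetical `inputs.args` file (the BAM names are made up) would look like this, with one value per line:

```shell
# Write an example .args file: one argument value per line
cat > inputs.args <<'EOF'
sample1.bam
sample2.bam
EOF
# It could then be supplied as: ./gatk PrintReads -I inputs.args -O output.bam
```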
./gatk PrintReads -I input.bam -O output.bam
./gatk PrintReadsSpark -I input.bam -O output.bam
To pass JVM arguments to GATK, run `gatk` with the `--java-options` argument:
./gatk --java-options "-Xmx4G" <rest of command>
./gatk --java-options "-Xmx4G -XX:+PrintGCDetails" <rest of command>
To pass a configuration file to GATK, run `gatk` with the `--gatk-config-file` argument:
./gatk --gatk-config-file GATKProperties.config <rest of command>
An example GATK configuration file is packaged with each release; it contains all current options that are used by GATK and their default values.
./gatk PrintReads -I gs://mybucket/path/to/my.bam -L 1:10000-20000 -O output.bam
gcloud auth application-default login
gcloud auth activate-service-account --key-file "$PATH_TO_THE_KEY_FILE"
Alternatively, set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to the service account key file.
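For example, in a shell session (the key file path is hypothetical):

```shell
# Tell Google client libraries where the service-account key lives (example path)
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/my-service-account.json"
```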
GATK4 Spark tools can be run in local mode (without a cluster). In this mode, Spark will run the tool
in multiple parallel execution threads using the cores in your CPU. You can control how many threads
Spark will use via the `--spark-master` argument. For example, to run PrintReadsSpark with 4 threads on your local machine:
./gatk PrintReadsSpark -I src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam \
  -- \
  --spark-runner LOCAL --spark-master 'local[4]'
To run PrintReadsSpark with as many worker threads as there are logical cores on your local machine:
./gatk PrintReadsSpark -I src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam -O output.bam \
  -- \
  --spark-runner LOCAL --spark-master 'local[*]'
Note that the Spark-specific arguments are separated from the tool-specific arguments by a `--`.
./gatk ToolName toolArguments -- --spark-runner SPARK --spark-master <master_url> additionalSparkArguments
./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
  -- \
  --spark-runner SPARK --spark-master <master_url>
./gatk PrintReadsSpark -I hdfs://path/to/input.bam -O hdfs://path/to/output.bam \
  -- \
  --spark-runner SPARK --spark-master <master_url> \
  --num-executors 5 --executor-cores 2 --executor-memory 4g \
  --conf spark.executor.memoryOverhead=600
You can also omit the "--num-executors" argument to enable dynamic allocation if you configure the cluster properly (see the Spark website for instructions).
Note that the Spark-specific arguments are separated from the tool-specific arguments by a `--`.
Running a Spark tool on a cluster requires Spark to have been installed from http://spark.apache.org/, since `gatk` invokes the `spark-submit` tool behind the scenes.
Note that the examples above use YARN but we have successfully run GATK4 on Mesos as well.
Running on Dataproc requires the Google Cloud SDK, since `gatk` invokes the `gcloud` tool behind the scenes. As part of the installation, be sure that you follow the `gcloud` setup instructions here. As the SDK is frequently updated by Google, we recommend updating your copy regularly to avoid any version-related difficulties.
Once you're set up, you can run a Spark tool on your Dataproc cluster using a command of the form:
./gatk ToolName toolArguments -- --spark-runner GCS --cluster myGCSCluster additionalSparkArguments
./gatk PrintReadsSpark \
  -I gs://my-gcs-bucket/path/to/input.bam \
  -O gs://my-gcs-bucket/path/to/output.bam \
  -- \
  --spark-runner GCS --cluster myGCSCluster
./gatk PrintReadsSpark \
  -I gs://my-gcs-bucket/path/to/input.bam \
  -O gs://my-gcs-bucket/path/to/output.bam \
  -- \
  --spark-runner GCS --cluster myGCSCluster \
  --num-executors 5 --executor-cores 2 --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=600
When using Dataproc you can access the web interfaces for YARN, Hadoop and HDFS by opening an SSH tunnel and connecting with your browser. This can be done easily using the included script, or see these instructions for more details.
Note that the Spark-specific arguments are separated from the tool-specific arguments by a `--`.
If you want to avoid uploading the GATK jar to GCS on every run, set the staging environment variable to a gs:// bucket path you have write access to.
Dataproc Spark clusters are configured with dynamic allocation so you can omit the "--num-executors" argument and let YARN handle it automatically.
Certain GATK tools may optionally generate plots using the R installation provided within the conda environment. Even if you are uninterested in plotting, R is still required by several of the unit tests. Plotting is currently untested and should be viewed as a convenience rather than a primary output.
A tab completion bootstrap file for the bash shell is now included in releases. This file allows the command-line shell to complete GATK run options in a manner equivalent to built-in command-line tools (e.g. grep).
This tab completion functionality has only been tested in the bash shell, and is released as a beta feature.
To enable tab completion for the GATK, open a terminal window and source the included tab completion script:
Sourcing this file will allow you to press the tab key twice to get a list of options available to add to your current GATK command. By default you will have to source this file once in each command-line session, then for the rest of the session the GATK tab completion functionality will be available. GATK tab completion will be available in that current command-line session only.
Note that you must have already started typing an invocation of the GATK (using gatk) for tab completion to initiate. To source the completion script automatically in every command-line session, add it to your ~/.bashrc:
echo "source <PATH_TO>/gatk-completion.sh" >> ~/.bashrc
where `<PATH_TO>` is the fully qualified path to the `gatk-completion.sh` script.
Do not put private or restricted data into the repo.
Try to keep datafiles under 100kb in size. Larger test files should go into
src/test/resources/large (and subdirectories) so that they'll be stored and tracked by git-lfs as described above.
GATK4 is Apache 2.0 licensed. The license is in the top level LICENSE.TXT file. Do not add any additional license text or accept files with a license included in them.
Each tool should have at least one good end-to-end integration test with a check for expected output, plus high-quality unit tests for all non-trivial utility methods/classes used by the tool. Although we have no specific coverage target, coverage should be extensive enough that if tests pass, the tool is guaranteed to be in a usable state.
All newly written code must have good test coverage (>90%).
All bug fixes must be accompanied by a regression test.
All pull requests must be reviewed before merging to master (even documentation changes).
Don't issue or accept pull requests that introduce warnings. Warnings must be addressed or suppressed.
Don't issue or accept pull requests that significantly decrease coverage (less than 1% decrease is sort of tolerable).
Don't use `toString()` for anything other than human consumption (i.e. don't base the logic of your code on results of `toString()`).
Don't override `clone()` unless you really know what you're doing. If you do override it, document thoroughly. Otherwise, prefer other means of making copies of objects.
For logging, use org.apache.logging.log4j.Logger
We mostly follow the Google Java Style guide
Git: Don't push directly to master - make a pull request instead.
Git: Rebase and squash commits when merging.
If you push to master or mess up the commit history, you owe us 1 growler or tasty snacks at happy hour. If you break the master build, you owe 3 growlers (or lots of tasty snacks). Beer may be replaced by wine (in the color and vintage of buyer's choosing) in proportions of 1 growler = 1 bottle.
Before running the test suite, be sure that you've installed
git lfs and downloaded the large test data, following the git lfs setup instructions
To run the test suite, run `./gradlew test`.
cloud, unit, integration, conda, spark: run only the cloud, unit, integration, conda (python + R), or Spark tests
all: run the entire test suite
Cloud tests require being logged into `gcloud` and authenticated with a project that has access to the cloud test data. They also require setting several environment variables.
HELLBENDER_JSON_SERVICE_ACCOUNT_KEY: path to a local JSON file with service account credentials
HELLBENDER_TEST_PROJECT: your google cloud project
HELLBENDER_TEST_STAGING: a gs:// path to a writable location
HELLBENDER_TEST_INPUTS: path to cloud test data, ex: gs://hellbender/test/resources/
Setting `TEST_VERBOSITY=minimal` will produce much less output from the test suite.
To run a subset of tests, use gradle's test filtering (see gradle doc):
Use `--tests` with a wildcard to run a specific test class or method, or to select multiple test classes:
./gradlew test --tests *SomeSpecificTestClass
./gradlew test --tests *SomeTest.someSpecificTestMethod
./gradlew test --tests all.in.specific.package*
To run tests and compute coverage reports, run `./gradlew jacocoTestReport`. The report is then generated in the build directory. (IntelliJ has a good coverage tool that is preferable for development.)
We use Travis-CI as our continuous integration provider.
See the test report generated in the build directory; if TestNG itself crashes, there will be no report generated.
We use Broad Jenkins for our long-running tests and performance tests.
To output stack traces for `UserException`, set the environment variable `GATK_STACKTRACE_ON_USER_EXCEPTION` to `true`.
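For example, to turn the stack traces on for the current shell session (GATK reads the `GATK_STACKTRACE_ON_USER_EXCEPTION` variable):

```shell
# Print full stack traces for UserExceptions in subsequent ./gatk runs
export GATK_STACKTRACE_ON_USER_EXCEPTION=true
```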
We use git-lfs to version and distribute test data that is too large to check into our repository directly. You must install and configure it in order to be able to run our test suite.
After installing git-lfs, run
git lfs install
To manually retrieve the large test data, run
git lfs pull from the root of your GATK git clone.
To add a new large file to be tracked by git-lfs, simply:
1. Put the new file(s) into `src/test/resources/large` (or a subdirectory)
2. `git add` the file(s), then `git commit -a`
That's it! You do not need to run `git lfs track` on the files manually: all files in `src/test/resources/large` are tracked by git-lfs automatically.
Ensure that you have
gradle and the Java 8 JDK installed
You may need to install the TestNG and Gradle plugins (in preferences)
Clone the GATK repository using git
In IntelliJ, click on "Import Project" in the home screen or go to File -> New... -> Project From Existing Sources...
Select the root directory of your GATK clone, then click on "OK"
Select "Import project from external model", then "Gradle", then click on "Next"
Ensure that "Gradle project" points to the build.gradle file in the root of your GATK clone
Select "Use auto-import" and "Use default gradle wrapper".
Make sure the Gradle JVM points to Java 1.8. You may need to set this manually after creating the project: find the Gradle settings by clicking the wrench icon in the Gradle tab on the right-hand bar, then edit the "Gradle JVM" setting to point to Java 1.8.
After downloading project dependencies, IntelliJ should open a new window with your GATK project
Make sure that the Java version is set correctly by going to File -> "Project Structure" -> "Project". Check that the "Project SDK" is set to your Java 1.8 JDK, and "Project language level" to 8 (you may need to add your Java 8 JDK under "Platform Settings" -> SDKs if it isn't there already). Then click "Apply"/"Ok".
Follow the instructions above for creating an IntelliJ project for GATK
Go to Run -> "Edit Configurations", then click "+" and add a new "Application" configuration
Set the name of the new configuration to something like "GATK debug"
For "Main class", enter `org.broadinstitute.hellbender.Main`
Ensure that "Use classpath of module:" is set to use the "gatk" module's classpath
Enter the arguments for the command you want to debug in "Program Arguments"
Set breakpoints, etc., as desired, then select "Run" -> "Debug" -> "GATK debug" to start your debugging session
In future debugging sessions, you can simply adjust the "Program Arguments" in the "GATK debug" configuration as needed
If there are dependency changes in `build.gradle`, it is necessary to refresh the gradle project. This is easily done by refreshing the Gradle project in IntelliJ.
Running JProfiler standalone:
Use `~/gatk/build/libs/gatk-package-4.alpha-196-gb542813-SNAPSHOT-local.jar` (substituting your current build's local jar) for "Main class or executable JAR", and enter the right "Arguments"
Running JProfiler from within IntelliJ:
To upload snapshots to Sonatype you'll need the following:
You must have a registered account on the sonatype JIRA (and be approved as a gatk uploader)
You need to configure several additional properties in your `~/.gradle/gradle.properties` file:
If you want to upload a release instead of a snapshot you will additionally need to have access to the gatk signing key and password
#needed for snapshot upload
sonatypeUsername=<your sonatype username>
sonatypePassword=<your sonatype password>
#needed for signing a release
signing.keyId=<gatk key id>
signing.password=<gatk key password>
signing.secretKeyRingFile=/Users/<username>/.gnupg/secring.gpg
To perform an upload, run the Gradle upload task with the properties above configured.
Builds are considered snapshots by default. You can mark a build as a release build by setting `-Drelease=true`. The archive name is based off of `git describe`.
Please see the Docker README in `scripts/docker`. This has instructions for the Dockerfile in the root directory.
Please see the How to release GATK4 wiki article for instructions on releasing GATK4.
To generate GATK documentation, run `./gradlew gatkDoc`.
A WDL wrapper can be generated for any GATK4 tool that is annotated for WDL generation (see the wiki article How to Prepare a GATK tool for WDL Auto Generation to learn more about WDL annotations).
To generate the WDL Wrappers, run
./gradlew gatkWDLGen. The generated WDLs and accompanying JSON input files can
be found in the
To generate WDL wrappers and validate the resulting outputs, run `./gradlew gatkWDLGenValidation`. Running this task requires a local cromwell installation, and the environment variables `CROMWELL_JAR` and `WOMTOOL_JAR` to be set to the full pathnames of the cromwell and womtool jar files.
If no local install is available, this task will run automatically on Travis in a separate job whenever a PR is submitted.
WDL wrappers for each GATK release are published to the gatk-tool-wdls repository. Only tools that have been annotated for WDL generation will show up there.
We use Zenhub to organize and track github issues.
To add Zenhub to github, go to the Zenhub home page while logged in to github, and click "Add Zenhub to Github"
Zenhub allows the GATK development team to assign time estimates to issues, and to mark issues as Triaged/In Progress/In Review/Blocked/etc.
Apache Spark is a fast and general engine for large-scale data processing. GATK4 can run on any Spark cluster, such as an on-premise Hadoop cluster with HDFS storage and the Spark runtime, as well as on the cloud using Google Dataproc.
In a cluster scenario, your input and output files reside on HDFS, and Spark will run in a distributed fashion on the cluster. The Spark documentation has a good overview of the architecture.
Note that if you don't have a dedicated cluster you can run Spark in standalone mode on a single machine, which exercises the distributed code paths, albeit on a single node.
While your Spark job is running, the Spark UI is an excellent place to monitor the progress.
Additionally, if you're running tests, then by adding
-Dgatk.spark.debug=true you can run a single Spark test and
look at the Spark UI (on http://localhost:4040/) as it runs.
You can find more information about tuning Spark and choosing good values for important settings such as the number of executors and memory settings at the following:
(Note: section inspired by, and some text copied from, Apache Parquet)
We welcome all contributions to the GATK project. The contribution can be an issue report or a pull request. If you're not a committer, you will need to make a fork of the gatk repository and issue a pull request from your fork.
For ideas on what to contribute, check issues labeled "Help wanted (Community)". Comment on the issue to indicate you're interested in contributing code and to share your questions and ideas.
To contribute a patch:
Make sure the tests pass by running `./gradlew test` in the root directory.
We tend to do fairly close readings of pull requests, and you may get a lot of comments. Some things to consider:
Use `final` unless there is a strong reason not to.
Put spaces after commas and around binary operators: write `a + b` and not `a+b`; write `foo(int a, int b)` and not `foo(int a,int b)`.
Thank you for getting involved!
Licensed under the Apache 2.0 License. See the LICENSE.txt file.