Project Name | Stars | License | Language | Description
---|---|---|---|---
Metaflow | 6,686 | apache-2.0 | Python | :rocket: Build and manage real-life data science projects with ease!
Knowledge Repo | 5,314 | apache-2.0 | Python | A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Drake | 1,329 | gpl-3.0 | R | An R-focused pipeline toolkit for reproducibility and high-performance computing
Targets | 736 | other | R | Function-oriented Make-like declarative workflows for R
Canvasxpress | 273 | | R | CanvasXpress: A JavaScript Library for Data Analytics with Full Audit Trail Capabilities.
Steppy | 135 | mit | Python | Lightweight, Python library for fast and reproducible experimentation :microscope:
Jupyter Guide | 90 | mit | Jupyter Notebook | Guide for Reproducible Research and Data Science in Jupyter Notebooks
Openml R | 90 | other | Jupyter Notebook | R package to interface with OpenML
Targets Tutorial | 66 | other | R | Short course on the targets R package
Gittargets | 62 | other | R | Data version control for reproducible analysis pipelines in R with {targets}.

In computationally demanding data analysis pipelines, the `targets` R package maintains an up-to-date set of results while skipping tasks that do not need to rerun. This process increases both speed and trust in the final product. However, it also overwrites old output with new output, and past results disappear by default. To preserve historical output, the `gittargets` package captures version-controlled snapshots of the data store, and each snapshot links to the underlying commit of the source code. That way, when the user rolls back the code to a previous branch or commit, `gittargets` can recover the data contemporaneous with that commit so that all targets remain up to date.

`gittargets` assumes familiarity with `targets`, which has resources on the documentation website, and with the `targets` data store.

The package is available to install from any of the following sources.

Type | Source | Command |
---|---|---|
Release | CRAN | install.packages("gittargets") |
Development | GitHub | remotes::install_github("ropensci/gittargets") |
Development | rOpenSci | install.packages("gittargets", repos = "https://ropensci.r-universe.dev") |

You will also need command line Git, available at https://git-scm.com/downloads.[^1] Please make sure Git is reachable from your system path environment variables. To control which Git executable `gittargets` uses, you may set the `TAR_GIT` environment variable with `usethis::edit_r_environ()` or `Sys.setenv()`. You will also need to configure your user name and user email at the global level using the instructions at https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup (or `gert::git_config_global_set()`). Run `tar_git_ok()` to check installation and configuration.

```r
tar_git_ok()
#> Git binary: /path/to/git
#> Git config global user name: your_user_name
#> Git config global user email: your_user_email
#> [1] TRUE
```

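If Git is installed but not yet configured, a minimal setup might look like the sketch below. The Git path, user name, and email are placeholders, and setting `TAR_GIT` is only necessary if you want `gittargets` to use a non-default Git executable.

```r
# Optional: point gittargets to a specific Git executable (placeholder path).
Sys.setenv(TAR_GIT = "/path/to/git")

# Configure your global Git identity (placeholder values).
gert::git_config_global_set("user.name", "your_user_name")
gert::git_config_global_set("user.email", "your_user_email")

# Verify that gittargets can find and use Git.
tar_git_ok()
```
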
There are also backend-specific installation requirements and recommendations in the package vignettes.

Consider an example pipeline with source code in `_targets.R` and output in the data store.

```r
# _targets.R
library(targets)
list(
  tar_target(data, airquality),
  tar_target(model, lm(Ozone ~ Wind, data = data)) # Regress on wind speed.
)
```

Suppose you run the pipeline and confirm that all targets are up to date.
```r
tar_make()
#> start target data
#> built target data
#> start target model
#> built target model
#> end pipeline
tar_outdated()
#> character(0)
```

It is good practice to track the source code in a version control repository so you can revert to previous commits or branches. However, the data store is usually too large to keep in the same repository as the code, which typically lives in a cloud platform like GitHub where space and bandwidth are pricey. So when you check out an old commit or branch, you revert the code, but not the data. In other words, your targets are out of sync and out of date.
```r
gert::git_branch_checkout(branch = "other-model")
```

```r
# _targets.R
library(targets)
list(
  tar_target(data, airquality),
  tar_target(model, lm(Ozone ~ Temp, data = data)) # Regress on temperature.
)
```

```r
tar_outdated()
#> [1] "model"
```

With `gittargets`, you can keep your targets up to date even as you check out code from different commits or branches. The specific steps depend on the data backend you choose, and each supported backend has a package vignette with a walkthrough. For example, the most important steps of the Git data backend are as follows; a condensed sketch appears after the list.

1. `tar_git_init()`: initialize a Git/Git LFS repository for the data store.
2. Run the pipeline (`tar_make()`) and commit any changes to the source code.
3. `tar_git_snapshot()`: create a data snapshot for the current code commit.
4. `tar_git_checkout()`: revert the data to the appropriate prior snapshot.

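The following sketch condenses those steps. The commit message and branch name are placeholders, the code commits use `gert` purely for illustration (command line Git works just as well), and arguments to `tar_git_snapshot()` and `tar_git_checkout()` are omitted for brevity; see the Git backend vignette for the exact interface.

```r
library(targets)
library(gittargets)

# One-time setup: create a Git/Git LFS repository for the data store.
tar_git_init()

# Run the pipeline, then commit the source code as usual.
tar_make()
gert::git_add("_targets.R")
gert::git_commit("Fit the wind speed model") # placeholder message

# Snapshot the data store and associate it with the current code commit.
tar_git_snapshot()

# Later, after reverting the code to an earlier branch or commit...
gert::git_branch_checkout(branch = "other-model") # placeholder branch
# ...restore the data snapshot that matches the checked-out code.
tar_git_checkout()
```
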
`targets` generates a large amount of data in `_targets/objects/`, and data snapshots and checkouts may take a long time. To work around performance limitations, you may wish to snapshot the data only at the most important milestones of your project. Please refer to the package vignettes for specific recommendations on optimizing performance.

The first data versioning system in `gittargets` uses Git, which is designed for source code and may not scale to enormous amounts of compressed data. Future releases of `gittargets` may explore alternative data backends more powerful than Git LFS.

Newer versions of the `targets` package (>= 0.9.0) support continuous data versioning through cloud storage, e.g. Amazon Web Services S3 buckets with versioning enabled.

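As a rough illustration, a pipeline might opt into versioned cloud storage with a configuration sketch like the one below. It assumes a recent version of `targets` with the AWS S3 integration; the bucket name and prefix are placeholders, and the bucket itself must already exist with versioning turned on.

```r
# _targets.R (sketch)
library(targets)
tar_option_set(
  repository = "aws", # store target output in S3 instead of _targets/objects/
  resources = tar_resources(
    aws = tar_resources_aws(bucket = "my-versioned-bucket", prefix = "project")
  )
)
list(
  tar_target(data, airquality),
  tar_target(model, lm(Ozone ~ Wind, data = data))
)
```
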
In this approach, `targets` tracks the version ID of each cloud-backed target. That way, when the metadata file reverts to a prior version, the pipeline automatically uses the prior versions of targets that were up to date at the time the metadata was written. This approach has distinct advantages over `gittargets`.

However, not all users have access to cloud services like AWS, not everyone is able or willing to pay the monetary costs of cloud storage for every single version of every single target, and uploads and downloads to and from the cloud may bottleneck some pipelines. `gittargets` fills this niche with a data versioning system that is free, fully local, and independent of cloud services.

Please note that the `gittargets` project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

citation("gittargets")
#>
#> To cite gittargets in publications use:
#>
#> William Michael Landau (2021). gittargets: Version Control for the
#> targets Package. https://docs.ropensci.org/gittargets/,
#> https://github.com/ropensci/gittargets.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {gittargets: Version Control for the Targets Package},
#> author = {William Michael Landau},
#> note = {https://docs.ropensci.org/gittargets/, https://github.com/ropensci/gittargets},
#> year = {2021},
#> }
[^1]: `gert` does not have these requirements, but `gittargets` does not exclusively rely on `gert` because `libgit2` does not automatically work with Git LFS.