This package provides unified methods for accessing popular datasets used in cancer research.
pip install cancer_data
The raw downloaded files occupy approximately 15 GB, and the processed HDFs take up about 10 GB. On a relatively recent machine with a fast SSD, processing all of the files after download takes about 3-4 hours. At least 16 GB of RAM is recommended for handling the large splicing tables.
A complete description of the datasets may be found in schema.csv.
|Collection|Datasets|Portal|
|---|---|---|
|Cancer Cell Line Encyclopedia (CCLE)|Many (see portal)|https://portals.broadinstitute.org/ccle/data (registration required)|
|Cancer Dependency Map (DepMap)|Genome-wide CRISPR-Cas9 and RNAi screens, gene expression, mutations, and copy number|https://depmap.org/portal/download/|
|The Cancer Genome Atlas (TCGA)|Mutations, RNAseq expression and splicing, and copy number|https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443|
|The Genotype-Tissue Expression (GTEx) Project|RNAseq expression and splicing|https://gtexportal.org/home/datasets|
The goal of this package is to make statistical analysis and coordination of these datasets easier. To that end, it provides the following features:
The schema serves as the reference point for all datasets used. Each dataset is identified by a unique value in the
id column, which also serves as its access identifier.
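As a minimal sketch of how a schema keyed by id can be read, the snippet below parses a small in-memory CSV with the standard library. The column names id, type, download_url, and dependencies come from this README; the rows, the example.org URLs, and the load_schema helper are hypothetical and are not part of the package.

```python
import csv
import io

# Hypothetical excerpt: the real schema.csv has more rows and columns,
# and the type values and URLs here are placeholders.
SAMPLE_SCHEMA = """id,type,download_url,dependencies
ccle_annotations,primary_dataset,https://example.org/ccle_annotations.txt,
ccle_proteomics,primary_dataset,https://example.org/ccle_proteomics.csv,ccle_annotations
"""


def load_schema(handle):
    """Index schema rows by id, rejecting duplicate ids."""
    schema = {}
    for row in csv.DictReader(handle):
        if row["id"] in schema:
            raise ValueError(f"duplicate dataset id: {row['id']}")
        schema[row["id"]] = row
    return schema


schema = load_schema(io.StringIO(SAMPLE_SCHEMA))
entry = schema["ccle_proteomics"]  # look a dataset up by its id
```

Keying the rows by id makes the uniqueness requirement explicit: a repeated id fails at load time instead of silently shadowing an earlier row.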
Datasets are downloaded from the location specified in
download_url, after which they are checked against the provided
The next steps depend on the
type of the dataset:
reference datasets, such as the hg19 FASTA files, are left as-is.
primary_dataset objects are preprocessed and converted into HDF5 format.
secondary_dataset objects are derived from
primary_dataset objects. These are also processed and converted into HDF5 format.
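The download check and the type-dependent steps above can be sketched roughly as follows. This assumes the check is a comparison against an MD5 digest, which is a guess and not something stated here. The function names md5_of, verify_download, and next_step are hypothetical and do not describe the package's actual API.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file matches the expected digest (assumed MD5)."""
    return md5_of(path) == expected_md5.lower()


def next_step(dataset_type):
    """Map a dataset type to the handling described in the list above."""
    if dataset_type == "reference":
        return "keep as-is"
    if dataset_type in ("primary_dataset", "secondary_dataset"):
        return "process and convert to HDF5"
    raise ValueError(f"unknown dataset type: {dataset_type}")
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte downloads.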
To keep track of which datasets are needed to produce another, the
dependencies column lists the
ids of the datasets that must be available first. For instance, the
ccle_proteomics dataset depends on the
ccle_annotations dataset for converting cell line names to Achilles IDs. When running the processing pipeline, the package automatically checks that dependencies are met and raises an error if any are not found.
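A dependency check of this kind might look like the sketch below. The mapping layout, the MissingDependencyError class, and the check_dependencies function are illustrative assumptions. Only the ccle_proteomics and ccle_annotations relationship is taken from the text.

```python
class MissingDependencyError(Exception):
    """Raised when a dataset's required inputs are not available."""


def check_dependencies(dataset_id, dependencies, available):
    """Raise if any dependency of dataset_id is not in `available`.

    dependencies: dict mapping a dataset id to the ids it requires.
    available: set of dataset ids that have already been processed.
    """
    missing = [
        dep for dep in dependencies.get(dataset_id, [])
        if dep not in available
    ]
    if missing:
        raise MissingDependencyError(
            f"{dataset_id} requires missing datasets: {', '.join(missing)}"
        )


# Hypothetical dependency map built from the schema's dependencies column.
dependencies = {"ccle_proteomics": ["ccle_annotations"]}

# Passes silently because the annotation table is already available.
check_dependencies("ccle_proteomics", dependencies, {"ccle_annotations"})
```

Collecting every missing id before raising means a single error message reports all unmet dependencies at once.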
Some datasets have filtering applied to reduce their size. These are listed below:
depmap_hotspot), a minimum mutation frequency is used to remove especially rare mutations (those present in fewer than four samples).
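A minimum-frequency filter of this sort can be sketched as below, assuming mutation calls are available as (sample, mutation) pairs. That input layout, the sample and mutation labels, and the filter_rare_mutations name are hypothetical. The threshold of four samples is the one given above.

```python
from collections import Counter


def filter_rare_mutations(calls, min_samples=4):
    """Keep calls whose mutation occurs in at least min_samples samples.

    calls: iterable of (sample_id, mutation_id) pairs.
    """
    calls = list(calls)
    # Count distinct samples per mutation so repeated calls in one
    # sample are not counted twice.
    counts = Counter(mutation for _, mutation in set(calls))
    return [
        (sample, mutation)
        for sample, mutation in calls
        if counts[mutation] >= min_samples
    ]


# Illustrative input: mut_A appears in four samples, mut_B in three.
calls = [(f"sample_{i}", "mut_A") for i in range(4)]
calls += [(f"sample_{i}", "mut_B") for i in range(3)]
kept = filter_rare_mutations(calls)
```

Here only the mut_A calls remain, since mut_B is present in fewer than four samples.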