ADAM is a library and command line tool that enables the use of Apache Spark to parallelize genomic data analysis across cluster/cloud computing environments. ADAM uses a set of schemas to describe genomic sequences, reads, variants/genotypes, and features, and can be used with data in legacy genomic file formats such as SAM/BAM/CRAM, BED/GFF3/GTF, and VCF, as well as data stored in the columnar Apache Parquet format. On a single node, ADAM provides competitive performance to optimized multi-threaded tools, while enabling scale out to clusters with more than a thousand cores. ADAM's APIs can be used from Scala, Java, Python, R, and SQL.
Over the last decade, DNA and RNA sequencing has evolved from an expensive, labor-intensive method to a cheap commodity. As a consequence, massive amounts of genomic and transcriptomic data are being generated. Typically, tools to process and interpret these data are developed with a focus on the quality of the results, not on scalability and interoperability. A typical sequencing workflow consists of a suite of tools covering quality control, mapping, mapped read preprocessing, and variant calling or quantification, depending on the application at hand. Concretely, this usually means that such a workflow is implemented as tools glued together by scripts or workflow descriptions, with data written to files at each step. This approach entails three main bottlenecks:
We propose here a transformative solution to these problems: replacing ad-hoc workflows with the ADAM framework, developed in the Apache Spark ecosystem.
ADAM brings the high-performance in-memory cluster computing functionality of Apache Spark to genomic data, ensuring efficient and fault-tolerant distribution based on data parallelism, without the intermediate disk operations required in traditional distributed approaches.
Furthermore, the ADAM and Apache Spark approach comes with an additional benefit. Typically, the endpoint of a sequencing pipeline is a file with processed data for a single sample: e.g. variants for DNA sequencing, read counts for RNA sequencing, etc. The real endpoint, however, of a sequencing experiment initiated by an investigator is interpretation of these data in a certain context. This usually translates into (statistical) analysis of multiple samples, connection with (clinical) metadata, and interactive visualization, using data science tools such as R, Python, Tableau and Spotfire. In addition to scalable distributed processing, Apache Spark also allows interactive data analysis in the form of analysis notebooks (Spark Notebook, Jupyter, or Zeppelin), or direct connection to the data in R and Python.
ADAM is available in Conda via Bioconda, https://bioconda.github.io
$ conda install adam
ADAM is available in Homebrew via Brewsci/bio, brewsci/homebrew-bio
$ brew install brewsci/bio/adam
ADAM is available in Docker via BioContainers, https://biocontainers.pro
$ docker pull quay.io/biocontainers/adam:{tag}
Find {tag} on the tag search page, https://quay.io/repository/biocontainers/adam?tab=tags
You will need to have Apache Maven version 3.3.9 or later installed in order to build ADAM.
$ git clone https://github.com/bigdatagenomics/adam.git
$ cd adam
$ mvn install
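Because the build requires Maven 3.3.9 or later, a small preflight check can catch an outdated installation before the build starts. The following is a minimal sketch and not part of ADAM: the function names `parse_maven_version`, `version_at_least`, and `check_maven` are hypothetical, it assumes `mvn -v` prints a line of the form `Apache Maven <version> ...`, and it relies on `sort -V` (version sort) as provided by GNU coreutils.

```shell
# Sketch only: these helper names are hypothetical and not part of ADAM.

# Read `mvn -v` output on stdin and print the version from the "Apache Maven" line.
parse_maven_version() {
  awk '/^Apache Maven/ { print $3; exit }'
}

# Succeed when dotted version $1 is at least $2.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Succeed when the Maven found on PATH meets the 3.3.9 minimum.
check_maven() {
  found=$(mvn -v 2>/dev/null | parse_maven_version)
  if [ -z "$found" ]; then
    echo "Maven not found on PATH" >&2
    return 1
  fi
  if ! version_at_least "$found" 3.3.9; then
    echo "Maven $found is older than 3.3.9" >&2
    return 1
  fi
  echo "Maven $found is new enough"
}

# Example: check_maven && mvn install
```

Keeping the parsing and the comparison in separate functions makes each one easy to check on its own, without Maven installed.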
You'll need to have a Spark release on your system and the $SPARK_HOME environment variable pointing at it; prebuilt binaries can be downloaded from the Spark website.
As of ADAM version 0.37.0, Spark version 3.2.0 or later is required.
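A quick way to confirm that $SPARK_HOME points at a usable Spark release is to look for the `spark-submit` launcher that Spark distributions ship under `bin/`. The sketch below is illustrative and not part of ADAM: `check_spark_home` is a hypothetical helper, and an executable `bin/spark-submit` is only a rough signal of a valid installation. It does not verify the Spark version.

```shell
# Sketch only: check_spark_home is a hypothetical helper, not part of ADAM.

# Print the Spark directory and succeed when it contains an executable
# bin/spark-submit. Uses the first argument if given, else $SPARK_HOME.
check_spark_home() {
  dir=${1:-${SPARK_HOME:-}}
  if [ -z "$dir" ]; then
    echo "SPARK_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$dir/bin/spark-submit" ]; then
    echo "no executable bin/spark-submit under $dir" >&2
    return 1
  fi
  echo "$dir"
}

# Example: check_spark_home >/dev/null && echo "Spark found"
```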
ADAM's documentation is available at http://adam.readthedocs.io.
ADAM's core API documentation is available at http://javadoc.io/doc/org.bdgenomics.adam/adam-core-spark3_2.12.
ADAM builds upon the open source Apache Spark, Apache Avro, and Apache Parquet projects. Additionally, ADAM can be deployed for both interactive and production workflows using a variety of platforms.
There are a number of tools built using ADAM's core APIs:
For more, please see our awesome list of applications that extend ADAM.
The best way to reach the ADAM team is to post in our Gitter channel or to open an issue on our GitHub repository. For more contact methods, please see our support page.
ADAM is released under the Apache License, Version 2.0.
ADAM has been described in two manuscripts. The first, a tech report, came out in 2013; it described the rationale behind using schemas for genomics and presented an early implementation of some of the preprocessing algorithms. To cite this paper, please use:
@techreport{massie13,
  title={{ADAM}: Genomics Formats and Processing Patterns for Cloud Scale Computing},
  author={Massie, Matt and Nothaft, Frank and Hartl, Christopher and Kozanitis, Christos and Schumacher, Andr{\'e} and Joseph, Anthony D and Patterson, David A},
  year={2013},
  institution={UCB/EECS-2013-207, EECS Department, University of California, Berkeley}
}
The second, a conference paper, appeared in the SIGMOD 2015 Industrial Track. This paper described how ADAM's design was influenced by database systems, expanded upon the concept of a stack architecture for scientific analyses, presented more results comparing ADAM to state-of-the-art single node genomics tools, and demonstrated how the architecture generalized beyond genomics. To cite this paper, please use:
@inproceedings{nothaft15,
  title={Rethinking Data-Intensive Science Using Scalable Analytics Systems},
  author={Nothaft, Frank A and Massie, Matt and Danford, Timothy and Zhang, Zhao and Laserson, Uri and Yeksigian, Carl and Kottalam, Jey and Ahuja, Arun and Hammerbacher, Jeff and Linderman, Michael and Franklin, Michael and Joseph, Anthony D. and Patterson, David A.},
  booktitle={Proceedings of the 2015 International Conference on Management of Data (SIGMOD '15)},
  year={2015},
  organization={ACM}
}
We prefer that you cite both papers, but if you can only cite one paper, we prefer that you cite the SIGMOD 2015 manuscript.