Broad releases open source version of genomic analysis software
The Broad Institute of MIT and Harvard is planning to release the most recent version of its Genome Analysis Toolkit under an open source software license.
The software package, designated GATK4, contains new tools and a rebuilt architecture. It is currently available as an alpha preview on the Broad Institute’s GATK website, with a beta release expected in mid-June.
The announcement was one of several that the Broad Institute made in conjunction with Intel, its partner in genomics research, with the aim of making that research more accessible and affordable.
The new version of Broad’s software is built on a new architecture, enabling significant streamlining of individual tools and support for performance-enhancing technologies, such as Apache Spark. The framework brings improvements to parallelization, capitalizing on cloud deployment and making the process of analyzing vast amounts of genomic data easier, faster, and more efficient, researchers say.
“Thanks to the rapid adoption of cloud computing, researchers can finally do away with many of the infrastructure-related complications that have hampered progress, especially at smaller institutions and startups,” said Eric Banks, senior director of data sciences and data engineering at Broad and a creator of the original GATK software package. More than 45,000 academic and commercial users worldwide rely on GATK.
GATK4 will be released as a fully open source product, thanks in part to a collaboration between Broad and Intel to advance high-performance analytics so researchers can study massive amounts of genomic data from diverse sources worldwide.
At the Intel-Broad Center for Genomic Data Engineering, software engineers and researchers have worked together on building, optimizing, and widely sharing new tools and infrastructure to help scientists integrate and process genomic data. GATK4 has benefited from this collaboration, which has helped engineers optimize hardware and software best practices for genome analytics, making it possible to combine and use research data sets that reside on private, public, and hybrid clouds.
“The GATK tools are crucial for both germline and cancer analyses,” said Robert Grossman, an expert in biomedical informatics at the University of Chicago Department of Medicine.
“Open source code is a foundation of efficient biomedical research,” said Brad Chapman, a research scientist at the Harvard T.H. Chan School of Public Health. “It enables reproducibility, reuse and remixing by removing barriers for sharing and distributing analyses.”
In addition, Broad and Intel are introducing a new Genomics Stack, a hardware technology advance that is five times faster than previous versions and supports larger data volumes with easier deployment.
The Genomics Stack is the first deliverable from the five-year, $25 million collaboration between Intel and Broad intended to make genomics accessible and affordable for academic and non-profit organizations as well as commercial users.