Open-source data-sharing software takes aim at cancer
Researchers collaborating in Pittsburgh have developed an open-source software resource that can better enable investigators studying cancer to process large amounts of genomic cancer data.
The new resource, developed by researchers from the University of Pittsburgh and the Pittsburgh Supercomputing Center, can assist investigators in sorting through genomic cancer data to determine better methods of cancer prevention, diagnosis and treatment.
The open-source software, called TCGA Expedition, processes data generated by The Cancer Genome Atlas (TCGA) project and is described in an article in the journal PLOS ONE.
“Starting with TCGA, our goal is to make large data sets available to the average researcher who would not otherwise be able to access this information,” said lead author Rebecca Jacobson, MD, professor of biomedical informatics at the University of Pittsburgh’s School of Medicine, as well as its chief information officer.
“There’s a growing understanding that further advances in healthcare are going to require a previously unseen level of data-sharing, which will require new tools,” Jacobson added. “That’s particularly true in cancer research, as recognized by the major focus on data sharing in Vice President Joseph Biden’s recently announced Cancer Moonshot initiative.”
Funding for the new software was provided by the Institute for Precision Medicine (IPM) and the University of Pittsburgh Cancer Institute (UPCI), a partner with UPMC CancerCenter.
“This work is about enabling and speeding up science,” said Adrian Lee, director of IPM and of UPCI’s Women’s Cancer Research Center, and a co-author on the new paper. “Resources such as this will be key in our move to precision cancer genomic medicine.”
Examining a cancer’s complete set of DNA, or genome, can provide insights into many aspects of tumor biology. The goal of TCGA, a collaborative effort of the National Cancer Institute and the National Human Genome Research Institute, is to collect and share genomic data from cancers with poor prognoses and the greatest impacts on public health. To date, the project has profiled 33 different cancers from more than 11,000 patients, and the resulting data has been used in more than 1,000 cancer studies.
“These very large data sets are incredibly hard to work with because they are enormous, not only in terms of the amount of digital storage space they need, but also in terms of the complexity of software and computational processing power that they require,” Jacobson said. “Right now, our institutions are choking on data.”
The new software continuously downloads, processes and manages the TCGA data, enabling researchers to take the tools that they need and apply them to making cancer discoveries. The team then put the new software to work, creating an information technology framework called the Pittsburgh Genome Resource Repository (PGRR) to enable approved Pitt researchers to use the TCGA data much more effectively.
While initially designed for TCGA data, the new software can also be used with other large data sets, and is already a key part of several other big data projects PGRR supports, such as the National Institutes of Health’s Big Data to Knowledge initiative and Pennsylvania’s Commonwealth Universal Research Enhancement program.
“The fact that we made our software open source and freely available demonstrates our commitment to taking the advances in using big data sets and data-sharing that we make here and helping other institutions make their own advances,” Jacobson said.