St. Jude launches largest public repository of pediatric cancer genomics data
St. Jude Children’s Research Hospital has launched a new publicly available data-sharing and collaboration platform in the cloud that it contends is the world’s largest public repository of pediatric cancer genomics data.
Called St. Jude Cloud, the online resource—which includes analysis tools and visualizations—is available to the global research community to help advance medical breakthroughs in childhood cancer.
Developed by scientists at St. Jude along with Microsoft and cloud-based genome informatics and data management vendor DNAnexus, the platform provides researchers with access to more than 5,000 whole genome, 5,000 whole exome and 1,200 RNA-Seq datasets from more than 5,000 pediatric cancer patients and survivors.
“We started working on it about two years ago,” says Scott Newman, group lead for bioinformatics analysis in St. Jude’s Department of Computational Biology. “These whole genomes are huge—about 100 gigabytes per person. We’re generating data on thousands of samples. And our institution is committed to sharing data with the research community.”
The goal is to make 10,000 whole genome sequences available on St. Jude Cloud by 2019, according to Newman, who adds that secure sharing and collaborative analysis of such massive datasets are critical for discovering cures for pediatric cancer.
“For the longest time, the state-of-the-art has been to house these things in central repositories either in the U.S. or Europe, and then physically download the data onto your local computer,” adds Newman. “The idea now is that we host this data in the cloud and you as a scientist or collaborator can come to the data; you don’t have to download anything and you can run your analysis in the cloud instead.”
On Sunday, Newman discussed the newly launched St. Jude Cloud in a presentation at the American Association for Cancer Research annual meeting in Chicago. Researchers who want to use the data-sharing and collaboration platform must apply for access.
“We want to share our data—that’s a no-brainer—but we also want to share our computational methods,” says Newman. “We’ve got very strong skills in genome analysis. Our department chair, Jinghui Zhang, is one of the pioneers of genomic sequencing algorithm development.”
He notes that these powerful algorithms have been thoroughly validated using real-world data and have been made available on St. Jude Cloud “through a point-and-click interface” so that the analysis tools are easy to use and can be leveraged by “somebody who is not a computational scientist” with “just a few mouse clicks.”
Newman emphasizes that the data is de-identified to ensure patient anonymity.
“You agree to preserve at all times the confidentiality of information and data pertaining to data subjects,” states the St. Jude Cloud data access agreement. “In particular, you will not use or attempt to use the data to compromise or otherwise infringe the confidentiality of information on or about data subjects and their right to privacy, and you will not attempt in any way to identify or re-identify data subjects in any manner.”
Newman says the hope is that, in the future, researchers with their own pediatric cancer genomics datasets will consider contributing their data to St. Jude Cloud, “adding to this huge body of knowledge.”