Genome sequencing

Protecting the privacy of genomic databases by mixing in ‘noise’

As precision medicine grows, researchers from MIT and Indiana University are suggesting a new approach for preserving privacy while enabling wide access to genomic databases.

Aug 17, 2016 · 3 min read

Greg Slabodkin

Managing Editor, Health Data Management

Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory and Indiana University at Bloomington have developed a system for protecting the privacy of patient information in genomic databases used for medical research.

A cryptographic technique called differential privacy permits database queries for genome-wide association studies (GWAS), which try to find correlations between particular genetic variations and disease diagnoses, while reducing the chances of compromising personal health information to almost zero, according to researchers writing recently in the journal Cell Systems.

By adding a little bit of “noise” or random variation to the results of database searches, the technique is designed to confound algorithms that would seek to extract private information from the results of multiple customized sequential searches. Their approach takes on particular importance as the healthcare industry starts to ramp up precision or personalized medicine initiatives that tap into valuable genomic databases.
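To make the idea of adding noise concrete, here is a minimal sketch of one standard way to do it, the Laplace mechanism, applied to a simple counting query. It is an illustration only and is not taken from the Cell Systems paper; the function name `noisy_count`, the privacy parameter `epsilon`, and the toy cohort are all assumptions introduced for this example.

```python
import random

def noisy_count(records, predicate, epsilon, rng=random):
    """Return the number of matching records plus Laplace noise.

    Assumption: adding or removing one person changes a count by at most 1,
    so the query's sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Hypothetical toy cohort: each record is (carries_variant, has_disease).
cohort = [(True, True), (True, False), (False, False), (True, True)]

# The caller sees only the noisy answer, never the exact count.
answer = noisy_count(cohort, lambda r: r[0] and r[1], epsilon=0.5)
```

Because the noise is drawn fresh for every query, an answer no longer reveals with certainty whether any one person's record is in the database, while averages over a large cohort stay close to the truth.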

“Numerous works have shown that aggregate genomic data, including GWAS statistics, can leak private information about participants,” say the authors of the study. “These findings have led the NIH, among others, to place much of its aggregate genomic data into repositories and require researchers to apply for access. Recent work has also shown that a popular method for sharing genomic data, genomic data-sharing beacons, leaks potentially private information about participants. These results illustrate the need for new methods that allow privacy-preserving access to genomic data.”

Sean Simmons, an MIT postdoctoral researcher in mathematics and lead author of the paper, contends that genomic databases that contain individuals’ medical histories have inherent privacy risks. He warns that an attacker armed with genetic information about someone could query a database for that person’s medical data and, if permitted to make repeated queries, each informed by the results of the last, could potentially extract private data from the database.

“If there’s someone with nefarious intent and they ask questions through these queries in the right way, they might be able to gain access to private information about individuals in studies, such as their disease status,” says Simmons. “Our work has focused on methods that would enable databases like these to return useful results to users while still preserving privacy.”

He suggests that the amount of noise added to the results of database searches depends on how robust the required privacy safeguards are, as well as the type and volume of data. “You can choose the level of privacy you want.”

Setting the noise correctly is both an art and a science, adds Simmons, who argues that there is some tradeoff between privacy and accuracy. His hope is that even a search that returns slightly inaccurate information would still make biomedical research much more efficient.
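The tradeoff can be put in rough numbers. The sketch below assumes Laplace noise and the basic sequential composition rule of differential privacy, under which the privacy parameters of successive queries add up; neither detail is stated in the article, and the function names and the parameters `epsilon`, `sensitivity`, and `total_epsilon` are hypothetical.

```python
def expected_abs_error(epsilon, sensitivity=1.0):
    """Average absolute error of Laplace noise with scale sensitivity / epsilon.

    A smaller epsilon means stronger privacy and a larger typical error.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

def per_query_error(total_epsilon, num_queries, sensitivity=1.0):
    """Typical error per query when a fixed privacy budget is split evenly.

    Assumption: basic sequential composition, so each of the num_queries
    queries may spend only total_epsilon / num_queries.
    """
    if total_epsilon <= 0 or num_queries < 1:
        raise ValueError("budget must be positive and num_queries at least 1")
    return sensitivity * num_queries / total_epsilon

# Stronger privacy (smaller epsilon) costs accuracy on a single query.
for eps in (0.1, 0.5, 1.0, 2.0):
    print(f"epsilon={eps}: typical error about {expected_abs_error(eps):.1f}")

# Allowing more queries under the same total budget makes each answer noisier.
for k in (1, 10, 100):
    print(f"{k} queries: typical error about {per_query_error(1.0, k):.1f} each")
```

Under these assumptions, a database operator who wants to permit many sequential queries must either accept noisier answers or relax the overall privacy guarantee.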

“Differential privacy provides us with the possibility of granting wider access to genomic data now, with immediate benefits for the research community,” conclude the researchers, who assert that “understanding exactly where our method is most useful will require tests on a large variety of datasets in numerous application domains.”

However, they predict that “in the long term, it is possible that differential privacy techniques will no longer be needed as we come to understand exactly how much privacy is lost after releasing aggregate genomic data. Currently, we are far from this understanding.”
