Algorithm Optimizes Big Data Clusters for Medical Breakthroughs

Researchers at Rice University have developed a big data technique that could have a significant impact on healthcare through “clustering” and the ability to reveal information in complex sets of data like electronic health records.

“Health records include a person’s age, white blood cell count, weight, and they are starting to have proteomics data or gene mutation data. Those are all features of a person and if you want to group a bunch of people together to figure out what therapy or drug to give them you need to do that in a systematic way. You can’t just assign them to whatever group the doctor feels like assigning them to,” says Amina Qutub, an assistant professor of bioengineering at Rice.

Traditionally, a major challenge in cluster analysis has been finding the optimal number of data clusters. The Rice algorithm addresses this by extracting characteristics about patients from a data set and mixing and matching them randomly to create artificial populations.
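
The mixing-and-matching step can be illustrated with a short Python sketch. It is not the Rice team's code: the function name, the NumPy-based approach, and the sample values are assumptions made for illustration, and it shows only how artificial patients might be assembled from real feature values, not how the resulting populations are used to choose the number of clusters.

```python
import numpy as np

def make_artificial_population(patients, size, rng):
    """Build artificial patients by mixing feature values from real ones.

    patients: 2-D array, one row per patient, one column per feature.
    Every feature of an artificial patient is copied from an independently
    chosen real patient, so feature combinations are reshuffled.
    (Hypothetical helper, not taken from the published method.)
    """
    patients = np.asarray(patients)
    n_patients, n_features = patients.shape
    # For every artificial patient and feature, pick a donor row at random.
    donors = rng.integers(0, n_patients, size=(size, n_features))
    # Pair each donor row with its own column index.
    return patients[donors, np.arange(n_features)]

rng = np.random.default_rng(0)
# Made-up records: age, white blood cell count, weight (kg).
group = np.array([[34, 5.1, 70.0], [41, 6.3, 82.5], [29, 4.8, 64.0]])
artificial = make_artificial_population(group, size=5, rng=rng)
print(artificial.shape)  # (5, 3)
```

Each artificial patient contains only values that appear in the real records, but in new combinations, which is why no real record needs to be reused whole.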

The data analysis tool was developed by Qutub and graduate student Wendy Hu in a lab at Rice’s BioScience Research Collaborative. Called “progeny” clustering, the new technique looks at the individual features of each patient and identifies similarities across different groups of people. Consequently, progeny clustering ensures that the number of clusters is as accurate as possible—the more accurate the clusters, the more personalized the treatment can be—and has the potential to help clinicians obtain meaningful patient groupings when designing trials for the treatment of diseases.

According to Qutub, Texas Children's Hospital in Houston is using the algorithm to design a childhood leukemia clinical trial aimed at enrolling 1,150 patients in the U.S., Australia, Canada, and New Zealand, to identify how to group pediatric patients and which treatments they should be given.

“Progeny clustering allowed them to design a robust clinical trial, even though that trial did not involve a large number of children,” said Qutub, who added that a particular strength of their computing technique is that it allows researchers to determine the ideal number of clusters in small patient populations.

According to Hu, progeny clustering is just as reliable as other clustering evaluation algorithms, but at a fraction of the computational cost. She added that the technique avoids reusing the old data, a common practice among other sampling methods that makes them less computationally efficient.

An article published this month in Nature's online journal Scientific Reports, co-authored by Hu and Qutub, found not only that progeny clustering compared favorably to other algorithms but also that it was the only method to successfully discover clinically meaningful groupings in an acute myeloid leukemia reverse phase protein array data set.
