Machine learning enables physical activity data to be re-identified


Machine learning algorithms can re-identify physical activity data collected from wearable devices, even after protected health information has been removed from that data.

That’s the contention of Anil Aswani, assistant professor in industrial engineering and operations research at the University of California at Berkeley.

“This type of data that’s collected by activity trackers is very useful for improving treatments and wellness programs but there’s also hidden privacy risks,” says Aswani.

In a cross-sectional study of national physical activity data from 14,451 individuals, Aswani and his colleagues applied linear support vector machine (SVM) and random forest methods from machine learning to re-identify the 20-minute-level physical activity data of about 80 percent of children and 95 percent of adults.

“The findings of this study suggest that current practices for de-identifying physical activity data are insufficient for privacy and that de-identification should aggregate the physical activity data of many people to ensure individuals’ privacy,” according to an article in the latest issue of JAMA Network Open.
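The recommendation to aggregate across many people can be illustrated with a minimal sketch. The article does not say which aggregation scheme the authors have in mind, so the function name, the fixed group size, and the choice of a simple per-bin average are all assumptions made for illustration, and this sketch offers no formal privacy guarantee.

```python
def aggregate_across_people(records, group_size):
    """Average the per-bin activity of `group_size` people at a time,
    so each released row describes a group rather than one individual.

    `records` is a list of equal-length lists, one per person, holding
    that person's activity values (for example, 20-minute step counts).
    Both the name and the grouping scheme are hypothetical.
    """
    released = []
    # Drop a trailing partial group: a smaller group would protect less.
    for start in range(0, len(records) - group_size + 1, group_size):
        group = records[start:start + group_size]
        # zip(*group) walks the same time bin across everyone in the group.
        released.append([sum(col) / group_size for col in zip(*group)])
    return released


if __name__ == "__main__":
    people = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 9]]
    print(aggregate_across_people(people, group_size=2))
```

In this sketch no released row corresponds to a single person, which is the property the authors' recommendation points toward. Larger groups blur individual patterns more but also discard more detail.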


“Despite data aggregation and removal of protected health information, there is concern that de-identified physical activity data collected from wearable devices can be re-identified,” the authors add.

While protected health information was removed from the physical activity data in the study, researchers used random forest and linear SVM algorithms to match demographic and 20-minute aggregated data to individual-specific medical record numbers.
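The matching idea can be sketched on synthetic data. This is not the authors' method: it swaps in a plain nearest-neighbor match for the linear SVM and random forest models so that it runs with the Python standard library alone. The data generator, the noise level, the two age groups, and every function name are assumptions made for illustration, and the accuracy it prints reflects the synthetic data, not the study's results.

```python
import random

BINS_PER_DAY = 72  # 24 hours x three 20-minute bins per hour


def make_person(rng):
    """Hypothetical person: a demographic label plus a habitual daily profile."""
    age_group = rng.choice(["child", "adult"])
    profile = [rng.uniform(0, 300) for _ in range(BINS_PER_DAY)]
    return age_group, profile


def observe_day(profile, rng, noise=30.0):
    """One day of 20-minute step counts: habitual profile plus daily noise."""
    return [max(0.0, x + rng.gauss(0, noise)) for x in profile]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reidentify(query_age, query_day, labeled):
    """Return the record number whose labeled day is closest to the query,
    considering only records that share the query's demographic label."""
    candidates = [(rec, day) for rec, (age, day) in labeled.items()
                  if age == query_age]
    return min(candidates, key=lambda c: squared_distance(query_day, c[1]))[0]


def run_experiment(n_people=200, seed=0):
    """Fraction of de-identified days matched back to the right record number."""
    rng = random.Random(seed)
    people = {rec: make_person(rng) for rec in range(n_people)}
    # Data set 1: one day per person, still linked to a record number.
    labeled = {rec: (age, observe_day(p, rng)) for rec, (age, p) in people.items()}
    hits = 0
    for rec, (age, profile) in people.items():
        # Data set 2: a different day with the record number stripped.
        deidentified_day = observe_day(profile, rng)
        if reidentify(age, deidentified_day, labeled) == rec:
            hits += 1
    return hits / n_people


if __name__ == "__main__":
    print(f"re-identified fraction: {run_experiment():.2f}")
```

The sketch shows why removing identifiers alone may not be enough: if a person's activity pattern is stable from day to day and distinct from other people's, the pattern itself acts as an identifier.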

“In the study, we did not attempt to match the data to actual names,” observes Aswani. “However, mathematically, it makes no difference to the machine learning algorithm if you re-identify record numbers or names.”

Aswani notes that the two algorithms—linear support vector machine and random forest—were selected for the study precisely because they are “fairly standard” and “not state-of-the-art methods.”

According to the authors, theirs is the first published study to demonstrate either the possibility or the impossibility of re-identifying such activity data.

“Our study raises red flags,” concludes Aswani. “We need to be careful in terms of who we share the data with and what we share to help protect patient privacy.”

At the same time, he believes HIPAA needs to be updated to address the fact that—with the advent of machine learning—de-identified data can be re-identified, putting patient privacy at risk.

“HIPAA’s regulations were set in an era before machine learning and artificial intelligence grew in capabilities and pervasiveness,” Aswani adds. “HIPAA requires a reevaluation in light of these advances in algorithms.”
