AI – data acquisition challenges

A data normalization process can require significant resources to initiate and maintain.


More healthcare organizations are trying to generate accurate and reliable artificial intelligence algorithms to assist with improving healthcare delivery while reducing costs.

These organizations can control the quality of their own data. But they also must use large volumes of data from third-party resources or other organizations to develop accurate and defensible AI algorithms. And using that data requires a formidable data normalization process that can involve significant resources to initiate and maintain.

Data scientists spend 45 percent of their time on data preparation used to inform their algorithms, according to one report. Data normalization challenges include the identification and ingestion of data sources, mapping data so it’s searchable and building algorithms for analytics and AI.

While healthcare has codified several data elements (e.g., DRGs, ICD-10, CPT-4, and HCPCS), a significant portion of healthcare documentation is still captured as unstructured text. Using natural language processing engines to extract and map unstructured text into coded data elements is improving the quality of organizations’ internal data, but the task becomes harder when externally acquired data is incorporated into larger data sets.

Some healthcare organizations, however, are creating solutions to help alleviate these data acquisition challenges so they can create accurate and reliable data environments that improve analytics and AI accuracy and value.

Using a “black box”

A formidable challenge for healthcare organizations is how to share data with others without violating HIPAA’s privacy and security regulations.

BeeKeeperAI, a spinout of UCSF’s Center for Digital Health Innovation, has developed a solution called EscrowAI™ to help address this issue. EscrowAI is a confidential, sightless computing and zero-trust platform that changes the paradigm from sharing the data to enabling secure computing against the data without sharing it. Within the platform, which runs in the Microsoft Azure confidential computing environment, protected health information (PHI) never leaves an organization’s secure environment: the data is neither seen nor shared, and the algorithm’s intellectual property is protected. This enables AI developers to validate, deploy and monitor the performance of algorithms without identifying patients or data sources.

Since its spinout, BeeKeeperAI has worked with a global biopharma company to validate a rare disease detection model, AI research groups within health systems to accelerate industry collaborations, and startups that needed access to PHI to validate algorithms.

Absent BeeKeeperAI’s platform, researchers’ and third-party algorithm developers’ access to PHI is extremely limited, requiring either an IRB-approved clinical trial or a healthcare organization’s de-identification of the data prior to sharing it. Both can cause extensive delays, add substantial costs and introduce other practical limits on the amount of data accessible (for example, having limited resources to anonymize data).

Other efforts currently underway to create repositories of de-identified data that can be used to improve data analytics and AI algorithms include Truveta, a consortium of 20 U.S. healthcare organizations, and Project Nightingale, a collaboration between Google and Ascension.

Value-based care

Value-based care will require the use of an AI infrastructure to screen high-risk patient populations to ensure patients are receiving appropriate care in modalities best suited for their treatments.

Acquiring data sets for AI use from comparable healthcare organizations will enable the development of more accurate algorithms to support value-based care initiatives.

Any organization evaluating healthcare data collaborations for value-based care or other initiatives should first evaluate the data to ensure it is de-identified and representative of patient populations similar to those the organization manages.

After selecting the data source, the organization must evaluate its ability to acquire and download the data in a timely manner (e.g., through FHIR APIs) and confirm that data mapping can be implemented so the data correctly populates the organization’s databases.
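As a rough illustration of the acquire-and-map step, the sketch below flattens Patient resources from a FHIR-style Bundle into rows for a local table. The sample payload, the `flatten_patients` helper and the target column names are all hypothetical; a real integration would page through a live FHIR endpoint and cover many more resource types and fields.

```python
import json

# Hypothetical response body, shaped like a FHIR R4 searchset Bundle.
SAMPLE_BUNDLE = json.loads("""
{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "a1",
                  "gender": "female", "birthDate": "1980-04-02"}},
    {"resource": {"resourceType": "Patient", "id": "b2", "gender": "male"}},
    {"resource": {"resourceType": "Observation", "id": "c3", "status": "final"}}
  ]
}
""")


def flatten_patients(bundle):
    """Map Patient resources in a Bundle to flat rows for a (hypothetical) local table."""
    rows = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Patient":
            continue  # skip resource types this mapping does not cover
        birth_date = resource.get("birthDate")  # optional field, so it may be absent
        rows.append({
            "source_id": resource.get("id"),
            "sex": resource.get("gender"),
            "birth_year": int(birth_date[:4]) if birth_date else None,
        })
    return rows


for row in flatten_patients(SAMPLE_BUNDLE):
    print(row)
```

Missing optional fields are kept as empty values rather than dropped, so gaps in the source data stay visible during testing.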

The organization should use its innovation centers and/or data informatics groups to provide thorough testing of the data acquisition and normalization process before it is made available for general use.

Key challenges

The use of real-world PHI is fundamental for the validation, deployment, and ongoing monitoring of algorithms. However, under the old paradigm of data sharing, real-world data is accessible only within the context of clinical trials. The lack of access to PHI is the largest barrier to healthcare AI validation and implementation. But with resources such as BeeKeeperAI, an algorithm can compute against real-world data within the data steward’s secure environment, reducing privacy and security risks and enabling much more efficient validation and implementation of healthcare AI at a fraction of the cost.

De-identification of the patient data used in ongoing collaboration projects is a key challenge for the future of healthcare AI. De-identified healthcare data can be re-identified when combined with other data available on the internet and social media. Also, there are concerns that AI algorithms developed on de-identified data may not perform as expected when exposed to real-world data because of changes introduced during the de-identification process. Synthetic data generators can play a key role in helping to develop inferences during early algorithm development, but when the algorithm is ready for market or regulatory validation, real-world data is required.
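One common way to gauge re-identification risk, offered here only as an illustration and not as a method the article names, is a k-anonymity check: count how many records share each combination of quasi-identifiers and flag combinations that occur fewer than k times. The column names and sample records below are assumptions made for the sketch.

```python
from collections import Counter

# Assumed quasi-identifier columns; a real data set would need its own review.
QUASI_IDENTIFIERS = ("zip3", "birth_year", "sex")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(record[key] for key in keys) for record in records)


def k_anonymity(records, keys=QUASI_IDENTIFIERS):
    """Return the size of the smallest group (0 for an empty data set)."""
    sizes = group_sizes(records, keys)
    return min(sizes.values()) if sizes else 0


def risky_groups(records, k, keys=QUASI_IDENTIFIERS):
    """List value combinations shared by fewer than k records."""
    return sorted(group for group, n in group_sizes(records, keys).items() if n < k)


# Hypothetical de-identified records.
records = [
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "941", "birth_year": 1980, "sex": "F"},
    {"zip3": "100", "birth_year": 1955, "sex": "M"},
]

print(k_anonymity(records))
print(risky_groups(records, k=2))
```

A group of size one means a single person carries that combination of values, which is the kind of record that outside data sources could help re-identify.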

The other challenge for data acquisition is developing the ability to easily curate, transform and make the data securely accessible for third-party computing. Data normalization and the mapping of normalized data to an organization’s database models require skilled informaticists to ensure optimum efficiency. If an organization lacks qualified staff members, it may need to turn to consultants for expertise.
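A minimal sketch of that mapping work, assuming a hand-built crosswalk from a partner’s local diagnosis labels to ICD-10-CM codes: the local labels, the `normalize` helper and the record layout are hypothetical, and a production crosswalk would be maintained by informaticists rather than hard-coded.

```python
# Hypothetical crosswalk from a partner's local labels to ICD-10-CM codes.
CROSSWALK = {
    "DM2": "E11.9",  # type 2 diabetes mellitus without complications
    "HTN": "I10",    # essential (primary) hypertension
}


def normalize(records, crosswalk=CROSSWALK):
    """Split records into mapped rows and rows that need manual review."""
    mapped, unmapped = [], []
    for record in records:
        local_code = record["local_code"].strip().upper()  # tolerate case and spacing
        standard_code = crosswalk.get(local_code)
        if standard_code is None:
            unmapped.append(record)  # keep for review instead of guessing a code
        else:
            mapped.append({**record, "icd10cm": standard_code})
    return mapped, unmapped


# Hypothetical incoming records.
incoming = [
    {"patient": "p1", "local_code": "dm2 "},
    {"patient": "p2", "local_code": "HTN"},
    {"patient": "p3", "local_code": "XYZ"},
]

mapped, unmapped = normalize(incoming)
print(len(mapped), "mapped;", len(unmapped), "need review")
```

Routing unmapped codes to a review queue, rather than discarding them, gives staff a concrete measure of how much mapping work a new data source will require.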

Mike Davis is an analyst for KLAS Research.

This article was originally published on the KLAS Research site.
