HIT Think

Why data integrity is the key to machine learning in healthcare


As innovative technologies designed to support value-based care models continue to flood the market, and as patient demand for transparency into the quality and cost of care grows, patient-centric, data-driven technologies will rule the roost.

These technology platforms address the need for timely, actionable patient-specific insights and leverage data to meaningfully improve quality, drive efficiencies and lower costs.

Leading the way in delivering differentiated value, artificial intelligence (AI) technologies, such as machine learning (ML) and natural language processing (NLP), are entering the mainstream with the potential to transform healthcare delivery and better meet patient expectations.

ML, for instance, has wide-ranging applications, from helping clinicians refine or customize care plans for individual patients to increasing the speed at which pharma and medical device companies can develop therapies that improve patient outcomes. These technologies can enable a healthcare system to continuously "learn" from the vast amounts of clinical data that have for years remained siloed and untapped and, in a growing number of use cases, to support clinical decision making at the point of care.

We’ve all heard the phrase “garbage in, garbage out.” As the healthcare industry seeks to extract meaningful insights from patient-reported, clinical or claims-based data and leverage those insights to improve patient care, several factors regarding the integrity of that data must be considered.

Approximately 100 million medical records are extracted each year for patients who, as they age, often receive care from an average of six different providers. Applying ML could significantly reduce the time and effort required to review that data and identify indicators that could impact a patient’s care and outcome. To unlock its full potential, however, data quality must be verified – insights generated by algorithms using inaccurate data could be seriously flawed. The first step in generating meaningful output is to ensure quality input.

Data quality can be improved by checking for errors that occur during data entry and at other points in the process, but doing this manually is tedious and, at scale, can be impossible. Algorithms that compare various data dimensions to known benchmarks can help identify outliers. ML can also help healthcare organizations overcome these limitations, as models can be trained to recognize discrepancies and discern logical patterns, effectively learning how to see and account for errors in data.
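The benchmark comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a clinical tool: the field names, the plausibility ranges in `BENCHMARKS` and the `find_outliers` helper are all assumptions made for the example, and real benchmarks would come from validated clinical reference data.

```python
# Hypothetical plausibility ranges, for illustration only.
# Real benchmarks would come from validated clinical reference data.
BENCHMARKS = {
    "heart_rate_bpm": (20, 250),
    "body_temp_c": (30.0, 43.0),
    "systolic_bp_mmhg": (50, 260),
}


def find_outliers(record, benchmarks=BENCHMARKS):
    """Return the fields whose values are missing, non-numeric or out of range."""
    flagged = {}
    for field, (low, high) in benchmarks.items():
        value = record.get(field)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not (low <= value <= high):
            flagged[field] = value
    return flagged


# A temperature typed in Fahrenheit into a Celsius field is a typical entry error.
record = {"heart_rate_bpm": 72, "body_temp_c": 98.6, "systolic_bp_mmhg": 120}
print(find_outliers(record))  # {'body_temp_c': 98.6}
```

A fixed range check like this only catches implausible values. The learned approach the article describes would aim to also catch values that are plausible in isolation but inconsistent with the rest of the record.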

The increasing use of clinical data derived from medical records also poses a threat to data quality, as more than 75 percent of healthcare data exists in unstructured form. While NLP tools can help clean data and reduce manual review, establishing a standardized format for clinical information can further streamline the processing of datasets ingested by ML models.

Bias, even when it’s unintentional, can affect data integrity, so it must be anticipated and accounted for in any analytics process. In healthcare, bias can occur when there is insufficient data about a condition. Amputation, for example, is relatively uncommon, so it may not be documented in the form required by the algorithm. In other words, there isn’t enough tagged clinical data to develop a classifier for it. A model deployed without such a classifier would likely not have sufficient information to produce a clinical recommendation regarding an amputation. As more data becomes available, however, results become more precise and reliable.

Skewed data may also introduce bias. If, for example, 90 percent of medical records are negative for a disease diagnosis, an ML model trained on that dataset is likely to be skewed toward a negative result. To address this and other forms of bias, the development and testing of processes must be as rigorous as possible, involving data and ML experts at key junctures.
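The 90 percent example can be made concrete with a small sketch. The dataset below is synthetic, and the helper names (`accuracy`, `recall`, `balanced_class_weights`) are hypothetical. The inverse-frequency weighting shown at the end is one common mitigation and is an assumption here, not a method the article prescribes.

```python
from collections import Counter

# Synthetic labels for illustration: 90 negative records (0), 10 positive (1).
labels = [0] * 90 + [1] * 10

# A degenerate "model" that always predicts the majority (negative) class.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of records labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive records that the model found."""
    found = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return found / sum(t == positive for t in y_true)


def balanced_class_weights(y_true):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y_true)
    return {c: len(y_true) / (len(counts) * n) for c, n in counts.items()}


print(accuracy(labels, predictions))  # 0.9
print(recall(labels, predictions))    # 0.0
weights = balanced_class_weights(labels)
print(weights[1])                     # 5.0
```

The always-negative model looks 90 percent accurate while missing every positive case, which is why accuracy alone is a poor check on skewed data. Weighting the rare class more heavily during training, as the last function suggests, is one way to counteract the skew.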

Finally, humans play important roles at the start of developing any ML algorithm, and they bring their own inherent biases: about the problem to be solved, including what data is needed to train the model; about the validity and application of the results; and about opportunities to improve the model.

Even with good data, machine training and validated algorithms, ML will fail without trust, and transparency is critical to establishing trust. There must be reasonable visibility into the types of data from which the machine learned, and users must understand contextually how the model was trained, as clinicians will require a clear and complete understanding of and control over what they’ll be presenting as a foundation of their diagnosis.

Clinicians will also want assurance of the reliability of the algorithm. Studies, such as one showing that medical record reviews with NLP enhancements were completed in 30 to 40 percent of the time required for traditional medical reviews, can provide important proof points and help accelerate acceptance and adoption of ML and other AI technologies.

Amidst a massive data-driven revolution in healthcare, we’re increasingly able to improve patient care and outcomes with data, but challenges remain. Data must be collected, prepared, analyzed and applied with discipline and consistency. Otherwise, stakeholders, particularly those at the point of care, will be resistant to change. As AI technologies such as ML and NLP mature to a point where quality can be improved enough to earn trust, we are that much closer to the promise of data-driven healthcare.
