Free text in radiology reports complicates automated diagnosis

Inconsistencies in free-text reporting lead to imperfect machine learning performance, which structured templates could ameliorate.


Allowing radiologists to report their findings in free text rather than in structured templates increases the variability of reports in both language and length, making the reports harder to use and making it more difficult for machine learning to predict diagnoses from them.

That finding, from a new study in the Journal of the American College of Radiology, suggests that structured templates for radiology reports could improve diagnostics, make results easier to understand, enhance billing and assist in population health.

However, the study authors, from Milton S. Hershey Medical Center in Hershey, Pa., noted that templates can be a burden on the radiologists using them.

The researchers looked at the extent of variation in free-text radiology reporting to determine the need for templates. They used commercial text analytics and natural language processing software to parse more than 1,100 emergency department chest CT imaging reports from 2016, documenting the extent of variation. They then set up a machine learning task in a commercial software program to decide whether the text of a report signifies the presence or absence of a finding, in this case pulmonary embolism.

If unstructured reports contain clear and unambiguous diagnostic statements, the machine learning software will devise a small number of accurate rules that map the diagnostic statements to the presence or absence of pulmonary embolism. If not, more rules are needed, and they are less accurate.
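
The study used commercial software for this step, but the underlying setup is a standard interpretable text classification task. As an illustration only, the following minimal sketch uses an open-source analogue (scikit-learn, with a shallow decision tree standing in for the rule learner); the report texts and labels are hypothetical placeholders, not data from the study.

```python
# Illustrative sketch only: an open-source analogue of the rule-learning
# task described in the study, which used commercial software.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-ins for ED chest CT report text.
reports = [
    "No evidence of pulmonary embolism.",
    "Filling defect consistent with acute pulmonary embolism.",
    "Negative for pulmonary embolism. Small pleural effusion.",
    "Acute pulmonary emboli within segmental branches bilaterally.",
]
labels = [0, 1, 0, 1]  # 0 = PE absent, 1 = PE present

# Bag-of-words features over the reports' vocabulary.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reports)

# A shallow decision tree acts as an interpretable rule learner: each
# root-to-leaf path is one if-then rule over word occurrences.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)

# If the reports use clear, consistent language, few rules suffice.
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))
```

On consistent language like these toy examples, the learner needs only a single split; highly variable wording forces deeper and less reliable rule sets.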

The researchers were looking to generate a suite of machine learning rules that could predict the “gold standard” of radiologic diagnosis of pulmonary embolism.

However, there was “extensive variation” in the language used in the reports. For instance, more than 2,200 unique words were used, even though every review had the same purpose: to rule out pulmonary embolism. Moreover, neither “embolism” nor “emboli” was the most common term in the findings section, and it was only the second most common word in the impressions section.
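
A unique-word tally of that kind is straightforward to reproduce. Here is a minimal sketch, assuming simple lowercase alphabetic tokenization (the study’s actual text-analytics pipeline is not described), with hypothetical report texts:

```python
# Minimal sketch, assuming lowercase alphabetic tokenization; the
# study's commercial pipeline may tokenize differently.
import re
from collections import Counter

# Hypothetical report texts; the study analyzed more than 1,100 reports.
reports = [
    "No evidence of pulmonary embolism.",
    "Acute pulmonary emboli within segmental branches bilaterally.",
]

word_counts = Counter(
    token
    for report in reports
    for token in re.findall(r"[a-z]+", report.lower())
)
print("unique words:", len(word_counts))           # >2,200 in the study
print("most common:", word_counts.most_common(5))  # rank of 'embolism'
```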

There was also substantial variability in the length of reports, which the researchers found surprising.

The variability and nuance in the text impeded machine learning: the model achieved a positive predictive rate of only 73 percent, along with a misclassification rate of 3 percent.
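
Both figures come directly from the classifier’s confusion matrix. A minimal sketch with hypothetical counts (the study’s raw counts are not reported in the article) shows the arithmetic:

```python
# Hypothetical confusion-matrix counts chosen only to reproduce the
# reported figures; the study's actual counts are not given.
tp, fp, fn, tn = 73, 27, 5, 995

ppv = tp / (tp + fp)                             # positive predictive value
misclassification = (fp + fn) / (tp + fp + fn + tn)

print(f"positive predictive value: {ppv:.0%}")              # 73%
print(f"misclassification rate: {misclassification:.1%}")   # ~2.9%
```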

“Large-scale variation in terms employed and hedging in use of such terms complicates the machine learning task and leads to imperfect performance in duplicating the correct diagnosis. …. [T]his study’s results suggest that too many, and too many different, words are being used than actually needed to report on the findings in a chest CT with contrast coming from the ED for a PE rule-out. It is not obvious what these words offer in incremental information beyond a smaller set,” the study authors wrote.

The researchers predicted that, had templates been used, a maximum of 50 to 100 words would have appeared across all reports’ findings and impressions sections, requiring only a single machine learning rule to predict the presence or absence of pulmonary embolism, with 100 percent accuracy.
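
In a fully templated report with a mandatory discrete field, that single rule could be as simple as a string check. The following is a hypothetical sketch, assuming a field such as “PE: present” or “PE: absent” (the study does not specify a template design):

```python
# Hypothetical sketch of the single rule a templated report would permit,
# assuming a mandatory discrete field like "PE: present" / "PE: absent".
def pe_present(report: str) -> bool:
    # With a fixed vocabulary, one rule classifies every report.
    return "PE: present" in report

print(pe_present("Impression: PE: present"))  # True
print(pe_present("Impression: PE: absent"))   # False
```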

“Interpretation of the free text was a difficult machine learning task and suggests potential difficulty for human recipients in fully understanding such reports. These results support the prospective assessment of the impact of a fully structured report template with at least some mandatory discrete fields on ease of use of reports and their understanding,” said the study authors.
