Differences in equipment and procedures complicate machine learning
Differences in imaging equipment, procedures and protocols can dramatically affect the performance of deep learning models that analyze brain tumors, according to a new study in Medical Physics.
Automatic brain tumor segmentation from MRI data using deep learning methodologies has gained steam in recent years. Convolutional neural networks (CNNs), a type of deep learning algorithm, are commonly used for segmentation of brain tumors, and provider organizations have recently begun sharing images to increase the amount of data available for training.
However, providers often use different imaging equipment, image acquisition parameters and contrast injection protocols, which could cause institutional bias; a CNN model trained on MRI data from one organization may stumble when tested on MRI data from another.
The researchers, from the Radiology Department at Duke University School of Medicine, used MRI data of 22 glioblastoma patients from MD Anderson Cancer Center and 22 glioblastoma patients from Henry Ford Hospital to assess how CNN models worked with their own and each other’s MRI data. They evaluated three different scenarios:
• Data used for training a model coming from the same institution as the data used for testing the model
• Data used for training a model coming from a different institution than the data used for testing the model
• Data used for training a model that comes from the same institution as the data used for testing, but the data is enriched by additional data from the other institution, resulting in a bigger size of the training set
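The three scenarios above can be sketched as train/test splits. This is a minimal illustration, not the study's code: the case IDs, the `build_scenarios` helper and the choice of six held-out test cases are all assumptions, since the article does not describe how the 22 cases per institution were divided.

```python
# Hypothetical sketch: case IDs, helper name and split size are illustrative only.

def build_scenarios(site_a, site_b, n_test):
    """Return (train, test) case lists for the three setups, testing on site A."""
    test = site_a[:n_test]        # held-out cases from site A
    train_a = site_a[n_test:]     # remaining site A cases, used for training
    return {
        # 1. Train and test on the same institution
        "same_institution": (train_a, test),
        # 2. Train on the other institution, test on site A
        "different_institution": (list(site_b), test),
        # 3. Train on site A enriched with site B, test on site A
        "same_plus_other": (train_a + list(site_b), test),
    }

site_a = [f"A{i:02d}" for i in range(22)]  # 22 cases per institution, as in the study
site_b = [f"B{i:02d}" for i in range(22)]

scenarios = build_scenarios(site_a, site_b, n_test=6)  # n_test=6 is an assumption
for name, (train, test) in scenarios.items():
    print(name, "train:", len(train), "test:", len(test))
```

In every scenario the test cases stay fixed, so any change in performance can be traced to where the training data came from.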
They found that a model’s performance in segmenting one provider’s brain tumor images deteriorated “dramatically” when the model had been trained on data from the other provider. Adding data from the other provider to the training set can improve the model’s performance, but not for segmentation of the entire tumor.
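The article does not say how segmentation performance was scored. One widely used overlap measure for segmentation is the Dice coefficient, sketched here on the assumption that masks are flat lists of 0s and 1s; the example masks are made up.

```python
# Hypothetical sketch: Dice is a common segmentation metric, but the article
# does not confirm that the study used it.

def dice(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly by convention
    return 1.0 if total == 0 else 2.0 * intersection / total

truth = [0, 1, 1, 1, 0, 0]  # made-up reference mask
pred = [0, 1, 1, 0, 1, 0]   # made-up model output
score = dice(pred, truth)
print(round(score, 3))
```

A score of 1 means the predicted and reference masks match exactly, and 0 means they do not overlap at all.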
The study authors hypothesize that differences in imaging between the two providers are the reason for the reduced performance. For example, MD Anderson’s magnetic field strength for most cases is 1.5 Tesla and slice thickness is 5 millimeters. At Henry Ford, magnetic field strength is 3 Tesla and slice thickness varies between 3 and 5 millimeters.
The results don’t necessarily mean that health care providers shouldn’t try to use other providers’ images for deep learning, says Maciej A. Mazurowski, PhD, one of the study’s authors. “We should try. There might not be enough data at each institution [to rely on individually],” he says.
There are several ways to overcome this problem, Mazurowski contends. It would be optimal if every image were acquired using the same protocol and the scanners were calibrated to each other. However, this is unlikely to happen anytime soon, he adds.
Another option is to standardize images after they’ve been acquired, and efforts are being made in that direction. A third, less explored avenue would be to train on data from different organizations and let the learning algorithm reconcile the images internally, akin to how a radiologist can internally normalize images from different scanners.
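As a rough illustration of post-acquisition standardization, the sketch below rescales each scan to zero mean and unit variance (z-score normalization). The article does not say which standardization methods are being pursued, so this is only one simple, common option, and the intensity values are made up.

```python
# Hypothetical sketch: z-score normalization is an assumed example technique,
# not a method named in the article.
from statistics import mean, pstdev

def zscore_normalize(voxels):
    """Rescale one scan's intensities to zero mean and unit variance."""
    mu = mean(voxels)
    sigma = pstdev(voxels)
    if sigma == 0:
        # A constant image carries no contrast; map every voxel to 0
        return [0.0 for _ in voxels]
    return [(v - mu) / sigma for v in voxels]

scan_site_a = [100.0, 120.0, 140.0, 160.0]      # made-up intensities
scan_site_b = [1000.0, 1200.0, 1400.0, 1600.0]  # same pattern, different scale

norm_a = zscore_normalize(scan_site_a)
norm_b = zscore_normalize(scan_site_b)
print(all(abs(a - b) < 1e-9 for a, b in zip(norm_a, norm_b)))
```

After normalization the two scans share the same intensity scale, although a simple rescaling like this does not address differences such as slice thickness or field strength.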
While the study highlights a drawback in testing deep learning models on data they weren’t trained on, that does not diminish the strides that deep learning has made in imaging, Mazurowski says. “Deep learning is making a big difference in machine learning and artificial intelligence in general and is something to watch in radiology. The study does not show that this doesn’t work, but that you need to be very careful.”