Breast cancer study gets new insights from cloud-based approach
The American Cancer Society is using Google Cloud to empower its study of pathology images to find contributing factors for breast cancer and determine how best to prevent it.
The cancer organization is using the cloud in several ways: to store the massive high-resolution digital pathology images; to convert them to a usable format; to standardize pre-processing and normalize colors; and then to apply machine learning to analyze the images.
The use of the cloud facilitates the research using the large digital files from the Cancer Prevention Study-II (CPS-II) nutrition cohort, a prospective study of more than 184,000 American men and women.
The use of Google Cloud and associated tools to hold and manipulate the ACS database offers a glimpse of how advanced technologies can sort through massive amounts of clinical data to provide insights that advance medical practice. The research looks to increase understanding of how lifestyle factors relate to the diagnosis and treatment of specific subtypes of cancer.
In this study, the American Cancer Society obtained the medical records and surgical tissue samples for 1,700 CPS-II study participants who had been diagnosed with breast cancer. As part of the study, ACS researchers wanted to look at high-resolution images of the tumor tissue to find linkages between lifestyle, medical and genetic factors and molecular subtypes of breast cancer, with the hope of finding whether different features in the breast cancer tissue can be linked to a better prognosis.
There are technical challenges in such a study, says Mia M. Gaudet, scientific director of epidemiology research at ACS. The 1,700 digital pathology images were high-resolution and captured in an uncompressed, proprietary format; each image is about 10 gigabytes in size. The images are hematoxylin and eosin-stained whole-section slides of breast tumor tissue from women who were diagnosed with breast cancer. Such slides are the primary tissue specimen used to diagnose solid tumors.
Even if images were converted to a usable format—a costly and time-consuming process—individual analysis by pathologists would be slow, taking as much as three years, Gaudet estimates.
That’s where Google and its machine-learning capabilities came in. ACS teamed up with Slalom, a partner of the Google Cloud Machine Learning Engine, a managed service that enables developers and data scientists to build and bring machine learning models to production. Google’s Cloud ML Engine offers training and prediction services, which can be used together or individually.
First, cloud computing capabilities took aim at standardizing the digital pathology images, because study results depended on having the images translated consistently and colors normalized.
“Our team, including a board-certified pathologist, chose a set of images that had minimal fading and artifacts, referred to as ‘template images,’” Gaudet says. “The distributions of colors in these images were considered as close to ‘ideal’ as possible, given the set of images available. The color distributions of all remaining images were coerced to align with the color distributions in this template image set. This was done via clustering techniques using the Python programming language and associated image processing libraries.”
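The idea of coercing an image's color distribution toward a template can be illustrated with a short sketch. The article says the team used clustering techniques; the example below swaps in plain per-channel histogram matching with NumPy, a simpler related technique, purely for illustration. The function names `match_channel` and `match_to_template` and the synthetic arrays are assumptions, not code from ACS or Slalom.

```python
import numpy as np

def match_channel(source, template):
    """Remap one color channel so its value distribution follows the template's."""
    src_values, src_index, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    tmpl_values, tmpl_counts = np.unique(template.ravel(), return_counts=True)
    # Empirical cumulative distributions of both channels.
    src_cdf = np.cumsum(src_counts) / source.size
    tmpl_cdf = np.cumsum(tmpl_counts) / template.size
    # For each source quantile, look up the template value at the same quantile.
    mapped = np.interp(src_cdf, tmpl_cdf, tmpl_values)
    return mapped[src_index].reshape(source.shape)

def match_to_template(image, template):
    """Histogram-match each channel of an H x W x C image independently."""
    channels = [match_channel(image[..., c], template[..., c])
                for c in range(image.shape[-1])]
    return np.stack(channels, axis=-1)

# Synthetic stand-ins: a well-stained "template" and a faded image.
rng = np.random.default_rng(0)
template = rng.normal(180, 20, size=(64, 64, 3)).clip(0, 255)
faded = rng.normal(120, 10, size=(64, 64, 3)).clip(0, 255)
normalized = match_to_template(faded, template)
```

After matching, the faded image keeps its spatial structure (the relative ordering of pixel intensities within each channel) but takes on the template's overall color distribution.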
Slalom used the Google Cloud Platform to build a machine learning “pipeline” that included this image preprocessing, feature engineering and clustering that enabled interpretation of the images—stored in the Google Cloud—via machine learning.
Using Keras—a relatively easy-to-use sequential application programming interface—with a TensorFlow backend for prototyping, Slalom created an auto-encoder model. It then used distributed training on Cloud ML Engine to convert the images into feature vectors that represent patterns in the images as a sequence of numbers.
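To make "feature vectors" concrete, here is a minimal sketch of what an auto-encoder does: it learns to compress each input into a few numbers and to reconstruct the input from them. This is not Slalom's Keras/TensorFlow model. It is a tiny linear auto-encoder written in plain NumPy and trained by gradient descent on synthetic data, and every name and size in it (`W_enc`, `W_dec`, 64-pixel tiles, 4 features) is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 flattened 8x8 "tiles" whose variation
# really lies along only 4 underlying directions, plus a little noise.
latent = rng.normal(size=(200, 4))
mixing = rng.normal(size=(4, 64))
tiles = latent @ mixing + 0.01 * rng.normal(size=(200, 64))

n_features = 4
W_enc = 0.1 * rng.normal(size=(64, n_features))  # encoder weights
W_dec = 0.1 * rng.normal(size=(n_features, 64))  # decoder weights

def encode(x):
    """Compress each tile into a short feature vector."""
    return x @ W_enc

def reconstruction_error(x):
    """Mean squared difference between tiles and their reconstructions."""
    return float(np.mean((encode(x) @ W_dec - x) ** 2))

initial_error = reconstruction_error(tiles)

learning_rate = 0.002
for _ in range(3000):
    codes = encode(tiles)
    residual = codes @ W_dec - tiles  # reconstruction minus input
    # Gradients of half the per-tile squared reconstruction error.
    grad_dec = codes.T @ residual / len(tiles)
    grad_enc = tiles.T @ (residual @ W_dec.T) / len(tiles)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(tiles)
feature_vectors = encode(tiles)  # one 4-number vector per tile
```

A production model would use nonlinear, typically convolutional, layers and far larger inputs, but the output has the same shape of meaning: each image region becomes a short sequence of numbers that later steps can cluster.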
With this automated machine learning approach, analysis of the images was completed in only three months, and with a higher degree of accuracy and consistency, Gaudet says. “By leveraging Cloud ML Engine to analyze cancer images, we’re gaining more understanding of the complexity of breast tumor tissue and how known risk factors lead to certain patterns,” she adds.
The machine learning algorithm clustered tiles of the whole-section tissue specimens into 11 clusters and as many as 18 sub-clusters. “We then merged the clustering data with the long-term survey, clinical and mortality data from the 1,700 women diagnosed with breast cancer in the Cancer Prevention Study-II cohort,” Gaudet says.
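The two steps described here, clustering tiles and then merging the result with per-participant records, can be sketched as follows. The article does not name the clustering algorithm, so the example assumes k-means, implemented in NumPy with a simple farthest-point initialization. The slide IDs, the `tumor_grade` field and the two-dimensional synthetic features are all hypothetical.

```python
import numpy as np

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[nearest.argmax()])
    return np.array(centers, dtype=float)

def kmeans(points, k, n_iter=20):
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return assign(points, centers), centers

# Synthetic tile features drawn around three well-separated patterns.
rng = np.random.default_rng(0)
pattern_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

def fake_tiles(pattern, n):
    return pattern_centers[pattern] + 0.5 * rng.normal(size=(n, 2))

features = np.vstack([fake_tiles(0, 30), fake_tiles(1, 10),   # slide S1
                      fake_tiles(1, 20), fake_tiles(2, 20)])  # slide S2
slide_ids = np.array(["S1"] * 40 + ["S2"] * 40)

k = 3
labels, _ = kmeans(features, k)

# Hypothetical clinical records keyed by slide ID.
clinical = {"S1": {"tumor_grade": 1}, "S2": {"tumor_grade": 3}}

# Merge: summarize each slide by the share of its tiles in each cluster.
merged = {}
for slide_id, record in clinical.items():
    slide_labels = labels[slide_ids == slide_id]
    proportions = np.bincount(slide_labels, minlength=k) / len(slide_labels)
    merged[slide_id] = {**record, "cluster_proportions": proportions.tolist()}
```

Summarizing each slide as cluster proportions is one possible design choice. It yields a fixed-length row per participant, which is what makes a join with survey, clinical and mortality data straightforward.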
The approach identified about 100 different patterns, picking up on aspects of the tumor tissue that were already known, such as different DNA arrangements in cancer vs. non-cancer cells. It also found patterns that researchers couldn’t link with any standard clinical terms, thus identifying things “that we had no idea exists, and this is the benefit of this approach,” she adds.
Some of the clusters and sub-clusters identified in the research had strong relationships to known clinical factors, such as tumor grades, but the research also revealed clusters that were not previously known to exist, for example clusters related to long-term survival. “That is the benefit of this approach: it identified patterns that we did not previously know existed,” Gaudet adds.
ACS researchers plan to analyze whether the clusters are related to breast cancer mortality and to known breast cancer risk factors, for example whether obesity is associated with certain tissue clusters in breast tumors.