Breast MRI dataset aims to support research, machine learning

Studies of 922 patients include images and supporting documentation that help provide a full picture of diagnosis, treatment and outcomes.


Machine learning and artificial intelligence initiatives depend on the quality and consistency of the information fed into the advanced computing systems to help them “learn.”

To advance this process for breast cancer, a wide range of imaging and supporting clinical documentation from magnetic resonance imaging (MRI) studies is being made available to improve understanding of the disease.

A dataset of 922 breast cancer patients treated at Duke Hospital has been made publicly available for machine learning and clinical research, thanks to The Cancer Imaging Archive (TCIA), a service that de-identifies and hosts a large archive of medical images of cancer accessible for public download.

TCIA was created and originally hosted by Washington University in Saint Louis. In 2015, it was relocated from the university’s Mallinckrodt Institute of Radiology to the Department of Biomedical Informatics at the University of Arkansas for Medical Sciences (UAMS).

The availability of the breast MRI dataset is expected to give research a significant boost because the imaging modality may help in determining patients’ short- and long-term prognosis and in improving predictions of the pathological and genomic features of tumors. But to make progress in the field, large, well-annotated datasets are essential.

UAMS researchers say the breast MRI dataset comes from a single institution. The retrospective collection covers 922 patients with biopsy-confirmed invasive breast cancer and includes:

  • Demographic, clinical, pathology, treatment, outcomes and genomic data.
  • Pre-operative dynamic contrast-enhanced MRI images, downloaded from PACS systems and de-identified for release by TCIA and shared in DICOM standard format.
  • Location of lesions in MRI images annotated by radiologists.
  • Imaging features from MRI images, extracted by software.

Researchers note that a set of 529 radiomics features from the images has been extracted to a tabular format for research organizations that prefer not to work with the images themselves. Highly curated additional data includes nearly 100 columns with variables such as recurrence-free survival, and information on pathology, chemotherapy treatment, tumor response and more, according to Maciej Mazurowski, associate professor of radiology, electrical and computer engineering, and biostatistics and bioinformatics at Duke University.
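
For teams planning to work only with the tabular data, a first step would likely be linking each patient's extracted imaging features to the matching clinical record. The short Python sketch below illustrates that idea under stated assumptions: the column names (patient_id, feature_1, recurrence_free_months), the patient identifiers and every value are invented placeholders, not the dataset's actual schema or contents, and the real files would be read from disk rather than from in-memory text.

```python
import csv
import io

# Hypothetical stand-ins for the two tables. The column names and values
# below are invented for illustration and are NOT taken from the dataset.
FEATURES_CSV = """patient_id,feature_1,feature_2
P001,0.12,3.4
P002,0.56,1.8
P003,0.91,2.2
"""

CLINICAL_CSV = """patient_id,recurrence_free_months
P001,48
P002,12
"""


def read_table(text, key="patient_id"):
    """Parse CSV text into a dict of rows keyed by patient identifier."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def join_tables(features, clinical):
    """Inner-join two keyed tables, keeping only patients present in both."""
    joined = {}
    for pid, feature_row in features.items():
        if pid in clinical:
            # Merge the feature columns and clinical columns into one record.
            joined[pid] = {**feature_row, **clinical[pid]}
    return joined


features = read_table(FEATURES_CSV)
clinical = read_table(CLINICAL_CSV)
merged = join_tables(features, clinical)

for pid, row in merged.items():
    print(pid, row["feature_1"], row["recurrence_free_months"])
```

In this sketch, a patient who appears in only one table (the placeholder P003) is dropped from the merged result. That is a deliberate choice for the illustration; a real analysis might prefer to keep such patients and flag the missing fields.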

A detailed description of the dataset and all its components can be found here. The dataset can be accessed here. An article in the British Journal of Cancer is the primary publication describing the dataset and can be found here.
