How unstructured data will impact precision medicine

The ability of information systems to manage increasingly larger datasets will dictate the success of personalized medicine efforts.


Unstructured data governs precision medicine. Precision medicine, defined as seeking “to improve stratification and timing of healthcare by utilizing biological information and biomarkers on the level of molecular disease pathways, genetics, proteomics as well as metabolomics,” is the future of medicine.

As a result, understanding unstructured data and its architectures is critical to the future of healthcare. Since the mid-1990s, there has been a popular, yet unverified, meme that “laboratory medicine (pathology) influences 70 percent of all clinical decision-making.” If one combines all unstructured data (pathology, radiology and genomics), this meme may hold true.



Making clinical research useful and translating unstructured biomedical data into the continuing patient-physician interface remain daunting challenges. Modern information infrastructures are poised to revolutionize the future of medicine. Amara’s Law states that “we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run.” This article summarizes the challenges and issues to consider for distributed information infrastructures.

The chart associated with this column compares data sizes against the number of yearly studies for various clinical unstructured datasets, and sets them against the structured data typically found in electronic medical records.

Future growth lies in the early- and companion-diagnostics technologies that use various ‘omics (including genomics, proteomics, microbiomics, metabolomics and transcriptomics) as well as pathology (digital pathology, confocal and optical microscopy and flow cytometry). The dotted gray lines around genomics and digital pathology in the chart indicate their size ranges. Radiation planning, together with the continued growth in resolution and combined modalities in radiology and cardiology, is already driving this unstructured data growth in the number of studies per year, shown in the chart for a typical hospital of about 250 beds.

The U.S. National Institutes of Health (NIH) cumulative budget since 1938 is approximately $670 billion. The cumulative budgets between 2012 and 2015 for NIH alone total approximately $121 billion, almost double the combined $61 billion budget of the Centers for Disease Control (CDC), Food and Drug Administration (FDA) and Health and Human Services (HHS).

This 2:1 ratio of research to clinical budgets needs to change. Information infrastructures and information services need to resolve the serious translation gaps between research and treatment, as shown in an essay published recently by Stanford professor John Ioannidis. In his essay, “Why Most Clinical Research Is Not Useful,” Ioannidis analyzed how often studies published in major general medical journals, and clinical research overall, adequately address certain utility features. He contends that studies rarely achieve utility.

The key aspects of the essay that affect information infrastructures are context placement, information gain and transparency. Alternative hybrid solutions must be explored to address the complexity that arises from the multiple parties involved.

Several clinical challenges remain that are directly or indirectly related to the type of information infrastructure. Examples include:

• "Clarity of meaning" in genome (and other ‘omics) data as it is filtered to the physician and to the patient, which means integrating ‘omics data with Clinical Decision Support (CDS) systems.

• A distributed system with easy data transfer (bringing the compute to where the data resides), both while the patient is being treated and in a home-care setting.

• Accuracy and reliability (data and sample mixture for integrity, especially when released for “open” use).

• Workflow and API design for integrating and visualizing genomic data with phenotypic data.

• Privacy and data flow, especially for pediatric and elderly patients.

• Inevitability of change -- people change their minds about opting in/out, technology changes yearly, governments and regulatory bodies are becoming more aware and knowledgeable, therapies change based on ‘omics knowledge. Building change management in distributed, WAN-separated systems will remain a “continuous integration” challenge.

• Costs of these new infrastructures to the healthcare system, insurers and the patient.
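The "inevitability of change" item above, in particular patients changing their minds about opting in or out, can be illustrated with a minimal sketch. It assumes an append-only consent log in which decisions are never edited, only superseded by later ones. The class names, field names, patient identifier and scope label are all hypothetical and do not come from any real system or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class ConsentEvent:
    """One immutable consent decision (all field names are hypothetical)."""
    patient_id: str
    scope: str            # hypothetical label, e.g. "research-genomics"
    opted_in: bool
    recorded_at: datetime


class ConsentLog:
    """Append-only log: earlier decisions are kept and superseded, never edited."""

    def __init__(self) -> None:
        self._events: List[ConsentEvent] = []

    def record(self, patient_id: str, scope: str, opted_in: bool,
               recorded_at: datetime) -> None:
        self._events.append(ConsentEvent(patient_id, scope, opted_in, recorded_at))

    def is_permitted(self, patient_id: str, scope: str, as_of: datetime) -> bool:
        """Return the latest decision recorded at or before `as_of`.

        With no recorded decision, access is denied by default.
        """
        relevant = [
            e for e in self._events
            if e.patient_id == patient_id
            and e.scope == scope
            and e.recorded_at <= as_of
        ]
        if not relevant:
            return False
        return max(relevant, key=lambda e: e.recorded_at).opted_in


# Usage: a patient opts in, then later opts out of the same scope.
log = ConsentLog()
log.record("patient-001", "research-genomics", True,
           datetime(2016, 1, 10, tzinfo=timezone.utc))
log.record("patient-001", "research-genomics", False,
           datetime(2016, 6, 1, tzinfo=timezone.utc))

march = datetime(2016, 3, 1, tzinfo=timezone.utc)
july = datetime(2016, 7, 1, tzinfo=timezone.utc)
log.is_permitted("patient-001", "research-genomics", march)  # True
log.is_permitted("patient-001", "research-genomics", july)   # False
```

Keeping every decision, rather than overwriting a single flag, lets a WAN-separated site answer the question "was this use permitted at the time?" after the fact. A production system would also need replication and conflict handling between sites, which this sketch leaves out.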

The cloud is a common theme in contemporary technology. Almost all of the publications on the cloud in relation to precision medicine mention "cloud computing" and not "cloud storage." The issues to consider for a flexible, resilient, scalable, cost-effective and secure storage infrastructure in the cloud include:

• Downtime (delays, errors, WAN speeds).

• Batch versus stream: the data movement required by each, especially for large ‘omics files, must be considered. Integrating both data types (commonly referred to as slow data and fast data) in a distributed and scalable infrastructure is important. One approach is the Lambda Architecture.

• Cost of retaining large files and datasets on the cloud for long periods.

• “Perpetual betas”: the cloud attracts early research and beta code, but private and hybrid infrastructure as a service (IaaS) remains the production/enterprise-grade infrastructure, especially for clinical-grade workflows.

• Privacy and security, vulnerability – user and data access, risk cost, multiple app access to same dataset.

• Federal and state legal requirements, data controls and data integrity required by regulatory agencies.

• Cloud vendor app and protocol flexibility and interoperability.

• Consent management in the cloud for both single and multiple parties.

• Special data and regulatory considerations for patients who are minors (younger than 18) or seniors (older than 65).
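The "batch versus stream" item above names the Lambda Architecture as one way to combine slow and fast data. The following is a minimal single-process sketch of its three layers: a batch layer that recomputes a view from the full master dataset, a speed layer that incrementally tracks recent records, and a serving step that merges the two at query time. The record layout (study type and size in gigabytes), the function names and the sample values are illustrative assumptions, not measurements or a reference implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# A record is (study_type, size_in_gb). Layout and values are illustrative only.
Record = Tuple[str, int]


def batch_view(master_dataset: Iterable[Record]) -> Dict[str, int]:
    """Batch layer: recompute totals from the full, immutable master dataset."""
    totals: Counter = Counter()
    for study_type, size_gb in master_dataset:
        totals[study_type] += size_gb
    return dict(totals)


class SpeedLayer:
    """Speed layer: incrementally tracks records not yet covered by a batch run."""

    def __init__(self) -> None:
        self._recent: Counter = Counter()

    def ingest(self, record: Record) -> None:
        study_type, size_gb = record
        self._recent[study_type] += size_gb

    def view(self) -> Dict[str, int]:
        return dict(self._recent)

    def reset(self) -> None:
        """Discard the real-time view once a batch run has caught up."""
        self._recent.clear()


def serve(batch: Dict[str, int], realtime: Dict[str, int]) -> Dict[str, int]:
    """Serving step: answer a query by summing the batch and real-time views."""
    merged: Counter = Counter(batch)
    merged.update(realtime)  # Counter.update adds values for matching keys
    return dict(merged)


# Usage: two archived studies (slow data) plus one newly arrived study (fast data).
archived = [("genomics", 100), ("radiology", 2)]
speed = SpeedLayer()
speed.ingest(("genomics", 120))

totals = serve(batch_view(archived), speed.view())
# totals == {"genomics": 220, "radiology": 2}
```

The design choice worth noting is that the batch view is always rebuilt from the master dataset, so errors in the fast path are corrected by the next batch run. In a distributed deployment, each layer would typically be a separate system, with the large ‘omics files staying in the batch path.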

The understanding of health and disease at the cellular and molecular level, which is the basis of precision medicine, is in its early days. It is a long, tough slog to translate the research (academic and otherwise) into treatment and care for the patient. To understand this research (and its translation) at population scale, infrastructures and services need to carefully consider current technologies as they integrate with the cloud, open data and data ownership, data sizing, system implementation and networking for data transfer.

This article presented a “systems-level” view of healthcare data sizes, the complexity of translating research into clinical use, and the “last-mile” challenge of implementing infrastructures that benefit the physician and the patient, along with the information infrastructures that will shape the future of precision medicine.

The next part will focus on security and privacy details of these infrastructures and some detail on the hybrid cloud infrastructure design for healthcare.

Sanjay Joshi is chief technology officer of healthcare and life sciences at EMC's Emerging Technologies Division.
