Will synthetic data revolutionize data sharing and AI?

Emerging de-identification tools could help ensure data aggregated for research projects remains private.


Several projects are attempting to build large data repositories of de-identified patient data to advance healthcare data analytics and associated artificial intelligence deep learning models and algorithms.

These initiatives could provide beneficial insights into patient treatments, outcomes, medication efficacy and protocols for treating disease and chronic illness.

Examples of repository projects include Truveta, the Health Data Collaborative and Google’s projects with Ascension and Mayo Clinic.

The challenge in all these efforts is to create an effective de-identification process for patient data that ensures compliance with patient privacy regulations, including HIPAA in the U.S. and the General Data Protection Regulation (GDPR) in Europe.

Several de-identification tools are available to process structured data, unstructured data and images. Many of these tools, however, have been developed by medical universities; these are not commercial, off-the-shelf applications that are supported and frequently updated with new capabilities. As a result, using these tools will likely create a significant amount of overhead for the data informaticists or scientists who are working to create large healthcare data lakes that can be supported and maintained for long periods of time.

But some new commercial solutions for de-identifying data are emerging, and these new solutions could dramatically increase the value of healthcare data analytics and AI.

Synthetic data solutions

Synthetic data solutions can assist data scientists with de-identifying data that can then be used to create large aggregate databases to generate more accurate data analytics, AI models and AI algorithms.

Synthetic data is annotated information that computer simulations or algorithms generate as an alternative to real-world data. Synthetic data may be artificial, but it mathematically or statistically reflects real-world data. Several studies attest to the benefits of using synthetic data for AI models. For example:

  • A team at Deloitte generated an AI training model with 80 percent synthetic data that reportedly provided the same level of accuracy that a model using real data would have provided.
  • An American University of Beirut 2020 study showed that using synthetic data improved machine learning model performance up to 20 percent while categorizing actions in videos.
  • Research published by De Gruyter demonstrated the ability to identify drivers of cars with 87 percent accuracy by analyzing synthesized sensor data generated by vehicles.
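
To make the idea of data that "statistically reflects real-world data" concrete, here is a minimal sketch in Python. It is not how any vendor or study named in this article works. It assumes a hypothetical two-column data set (age and systolic blood pressure, both invented for illustration), fits a simple bivariate normal distribution to it, and then draws new records from that fitted distribution. The function names fit_gaussian_2d and sample_synthetic are illustrative only. A model this simple captures means, spreads and one correlation; it does not by itself amount to de-identification under any regulation.

```python
import math
import random
import statistics


def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, rng):
    """Draw n artificial records from the fitted bivariate normal."""
    mx, my, sx, sy, rho = params
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into the second column so it keeps the fitted correlation.
        y = my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        out.append((x, y))
    return out


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for real records: (age, systolic blood pressure).
real = []
for _ in range(2000):
    age = rng.gauss(55, 12)
    real.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))

real_params = fit_gaussian_2d(real)
synthetic = sample_synthetic(real_params, 2000, rng)
synth_params = fit_gaussian_2d(synthetic)

labels = ("mean age", "mean bp", "sd age", "sd bp", "corr")
for name, r, s in zip(labels, real_params, synth_params):
    print(f"{name:8s} real={r:8.3f} synthetic={s:8.3f}")
```

No synthetic row is copied from a real one, yet the summary statistics printed for the two sets should be close. Commercial tools use far richer models, but the underlying goal is the same.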

Synthetic data will become increasingly valuable to supporting deep-learning AI models. Deep learning that supports neural programming, bioinformatics and natural language processing will benefit from large-volume synthetic data sets.

Gartner estimates that by 2024, 60 percent of the data used for the development of AI and analytics projects will be synthetically generated. This suggests that the market is about to engage in a rapid uptake and utilization of synthetic data. But healthcare providers tend to lag behind the adoption curves of other industries. Emerging commercial synthetic data solutions could help drive higher adoption.

Generating large data sets

Synthetic data solutions could enable healthcare organizations to generate the large data sets needed to produce more accurate analytics and AI models, and to produce algorithms that continue to improve the output. These solutions could perform these functions while protecting the confidentiality of the underlying patient data and the identities of the patients whose data is synthesized.

This approach for creating large patient data sets could also help protect consumers from unauthorized use of their data by large technology companies (e.g., Google, Microsoft and Amazon). Some of the existing data collaboration projects could convert to using synthetic data. And synthetic data could be a catalyst for high success rates with AI projects.

Emerging commercial synthetic data companies

While synthetic data solutions have been developed by universities to meet their needs, the healthcare provider market will require commercial solutions to provide the necessary support functions. Among the emerging vendors are:

  • Replica Analytics: Supports EHR data sharing between collaborators.
  • MDClone: Enables collaboration across teams, organizations and external third parties with the use of synthetic data.
  • Statice: Works to minimize privacy risk for patient data analysis.

Success factors

Provider organizations that want to expand or improve their AI capabilities should first identify similar organizations with which they can collaborate to create large healthcare data sets for modeling.

Once collaboration partners have been identified, provider organizations should test the synthetic data solutions for generating de-identified patient data against deep learning AI algorithms that will improve healthcare delivery. This testing should be conducted in well-controlled environments, such as innovation centers.
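
The article does not prescribe a test procedure. One common way to frame such a test is "train on synthetic, test on real": fit the same model once on real records and once on synthetic records, then score both on real records that were held back. The sketch below assumes the same hypothetical age and blood pressure columns, uses a simple straight-line model in place of a deep learning algorithm, and uses a bivariate normal sampler as a stand-in for a vendor's synthetic output. All names and numbers are invented for illustration.

```python
import math
import random


def fit_line(pairs):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    b = sxy / sxx
    return my - b * mx, b


def rmse(model, pairs):
    """Root mean squared prediction error of a fitted line on pairs."""
    a, b = model
    return math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in pairs) / len(pairs))


def make_real(n, rng):
    """Hypothetical 'real' records: (age, systolic blood pressure)."""
    rows = []
    for _ in range(n):
        age = rng.gauss(55, 12)
        rows.append((age, 90 + 0.6 * age + rng.gauss(0, 8)))
    return rows


def make_synthetic(real, n, rng):
    """Stand-in generator: sample from a bivariate normal fitted to real."""
    m = len(real)
    mx = sum(x for x, _ in real) / m
    my = sum(y for _, y in real) / m
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in real) / (m - 1))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in real) / (m - 1))
    rho = sum((x - mx) * (y - my) for x, y in real) / ((m - 1) * sx * sy)
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mx + sx * z1,
                     my + sy * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)))
    return rows


rng = random.Random(1)  # fixed seed so the sketch is repeatable
real_train = make_real(1500, rng)
real_test = make_real(500, rng)   # held back; never shown to either model
synthetic_train = make_synthetic(real_train, 1500, rng)

rmse_real_model = rmse(fit_line(real_train), real_test)
rmse_synth_model = rmse(fit_line(synthetic_train), real_test)

print(f"trained on real data:      RMSE on real test set = {rmse_real_model:.3f}")
print(f"trained on synthetic data: RMSE on real test set = {rmse_synth_model:.3f}")
```

If the two error figures are close, the synthetic set has preserved the relationship the model depends on. A large gap would suggest the synthetic data lost information that matters for that use case.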

Once new AI algorithms have been validated, the collaboration can expand the creation and testing of new algorithms that benefit the collaborative partners.

Overcoming challenges

Using large patient data sets while complying with patient privacy regulations creates significant challenges for many organizations that want to implement and expand AI projects. Synthetic data could help overcome these challenges.

Larger healthcare organizations and medical centers, which can afford to hire skilled programmers and informaticists, likely can create custom synthetic data solutions. But most healthcare provider organizations do not have the budgets to recruit and retain skilled staff members to support AI deep learning projects.

Emerging commercial synthetic data vendors could help many of these healthcare organizations recruit data collaboration partners for sharing synthesized patient data to drive higher levels of AI success. Synthetic data solutions could drive data sharing for healthcare provider organizations, and that could, in turn, result in achieving AI benefits.

Mike Davis is an analyst for KLAS Research.
