Four rules for achieving scalable data unification

This game plan can help healthcare organizations that increasingly face the challenge of combining dozens of large data sources to achieve benefits of scale.


The right approach for unifying data is, by and large, determined by the scale of the challenge. If your problem is unifying three data sources with 10 records each, it doesn’t matter what type of tool you use. A whiteboard or paper and pencil is probably the best approach.

If the problem is integrating five data sources with 100,000 records each, you can likely use traditional rules-based approaches and well-known technologies, although the work may well be painful.

But for organizations the size of healthcare systems, neither approach suffices. They typically need to combine tens or hundreds of separate data sources, each with perhaps millions of records. There are, however, tactics such organizations can use to successfully perform unification at scale.

Data unification is the process of ingesting, transforming, mapping, deduplicating and exporting data from multiple sources. Two types of products are routinely used to accomplish this task—Extract, Transform and Load (ETL) tools and Master Data Management (MDM) tools.

These processes require that a human construct a global schema upfront; discover and convert local schemas into that global schema; write data-cleaning and transformation routines; and write a collection of rules for matching and merging data. It routinely takes three to six months to do this for each data source.
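
To make that burden concrete, here is a minimal sketch of the kind of hand-written cleaning and match rule a traditional ETL or MDM pipeline accumulates; the field names and normalization choices are illustrative assumptions, not any particular vendor’s API.

```python
# A hand-coded cleaning and matching rule of the kind traditional ETL/MDM
# pipelines accumulate by the hundreds. Field names are illustrative assumptions.

def normalize(name: str) -> str:
    """Apply hand-written cleaning rules before matching."""
    name = name.lower().strip()
    for suffix in (" inc.", " inc", " corp.", " corp", " llc"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name

def is_same_supplier(rec_a: dict, rec_b: dict) -> bool:
    """One of potentially thousands of match rules a human must write and maintain."""
    return (
        normalize(rec_a["supplier_name"]) == normalize(rec_b["supplier_name"])
        and rec_a["postal_code"] == rec_b["postal_code"]
    )

print(is_same_supplier(
    {"supplier_name": "Staples Inc.", "postal_code": "01701"},
    {"supplier_name": "Staples", "postal_code": "01701"},
))  # True under these rules
```

Every new data source typically forces more schema mappings and more rules like these, which is why each source takes months of human effort.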

At General Electric, which has 80 procurement systems containing information about its global suppliers, this approach would therefore take 20 to 40 person-years. Even with human parallelism, it would be a multi-year project costing millions of dollars. GE is certainly not alone in confronting tasks of this magnitude.

This raises two questions: Why does an enterprise have so many data sources? And why would an enterprise want to unify its data sources?

To answer the second question first, there is a huge upside for GE in unifying its 80 supplier databases. A procurement officer purchasing paperclips from Staples can see only the information in her own database about her relationship with Staples. When the Staples contract comes up for renewal, she would love to know the terms and conditions negotiated with Staples by other business units, so that she can demand “most favored nation” status.

GE estimates that accomplishing this task with all of its vendors would save the company around $1 billion per year. Needless to say, GE would prefer to be on a single procurement system, but every time the corporation acquires a company, it also acquires that company’s procurement system. Precisely because of the limitations of traditional data integration systems, GE has historically been unable to create a single view of its supplier base. It simply requires too much human work.

Any reasonable shot at solving this problem must be largely automated, with humans reviewing only a small fraction of the unification operations. If GE can automate 95 percent of its unification operations, the human labor requirement shrinks twentyfold, to one or two person-years rather than 20 to 40.

This leads us to the first rule of scaling your data unification problem.

Rule 1: Scalable data unification systems must be mostly automated

The next issue for scalable data unification is coping with very large numbers of data sources. For example, Novartis has about 10,000 bench scientists, each recording data in a personal electronic lab notebook. Novartis would gain substantial productivity advantages from understanding which scientists are producing the same results using different reagents, or different results using the same reagents.

Because each scientist records results independently, the number of attributes across the company’s collective 10,000 sources is very large, and any attempt to define a global schema upfront would be hopeless. Even in less extreme cases, upfront schema development is usually a fool’s errand. Enterprises tried constructing upfront, enterprise-wide schemas in the 1990s, and those projects all failed: the schemas were out of date on day one of the project, let alone by the time the project was completed.

The only feasible solution is to build a schema “bottom-up” from the local data sources by discovering a global schema from the source attributes. In other words, the global schema is produced “last.”
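
As an illustration of what “schema last” can look like, here is a minimal sketch that clusters attribute names observed across sources into candidate global attributes. The example sources and the similarity threshold are assumptions; a real system would also compare value distributions and rely on machine learning rather than name similarity alone.

```python
from difflib import SequenceMatcher

# Attribute names observed in two hypothetical lab-notebook sources.
sources = {
    "lab_notebook_17": ["reagent", "concentration_mM", "result"],
    "lab_notebook_42": ["Reagent Name", "conc (mM)", "outcome"],
}

def similar(a: str, b: str) -> float:
    """Crude name similarity; a production system would also examine the values."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Greedily group look-alike attributes into candidate global attributes,
# so the global schema emerges from the sources instead of being designed up front.
global_schema = []  # each entry: {"names": [...], "members": [...]}
for src, attrs in sources.items():
    for attr in attrs:
        for group in global_schema:
            if any(similar(attr, name) > 0.6 for name in group["names"]):
                group["names"].append(attr)
                group["members"].append(f"{src}.{attr}")
                break
        else:
            global_schema.append({"names": [attr], "members": [f"{src}.{attr}"]})

for group in global_schema:
    print(group["members"])
```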

Therefore, the second rule of scalable data unification is:

Rule 2: Scalable data unification systems must be “schema last”

Because Rule 1 still applies, the majority of this schema building must itself be automated.

One of the most insidious problems in traditional data unification using ETL or MDM is the starkly perverse division of labor. All of the work is foisted onto professional computer scientists, except for some collaboration with business experts in understanding business requirements.

The professionals who are responsible for building data structures and pipelines cannot be expected to understand the nuances of the data itself. Consider for a moment two supplier names, “Cessna Textron Av” and “Textron Aviation.” A computer scientist has no idea whether they refer to the same supplier or to different ones. However, a procurement officer in GE’s aerospace division almost certainly knows. Scalable data unification systems must therefore resolve such ambiguous cases by soliciting information from domain experts, in addition to interfacing with data architects and computer scientists.
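
A minimal sketch of such a collaborative workflow, under assumed thresholds: a string-similarity score stands in for whatever matching model a real system would use, confident pairs are decided automatically, and ambiguous pairs like the one above are queued for a domain expert.

```python
from difflib import SequenceMatcher

def score(a: str, b: str) -> float:
    """Stand-in for a learned match score between two supplier names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def route_pair(name_a: str, name_b: str) -> str:
    """Decide confident pairs automatically; escalate ambiguous ones to an expert."""
    s = score(name_a, name_b)
    if s > 0.9:
        return "auto-merge"        # system is confident: same supplier
    if s < 0.4:
        return "auto-distinct"     # system is confident: different suppliers
    return "ask domain expert"     # ambiguous: a procurement officer decides

print(route_pair("Cessna Textron Av", "Textron Aviation"))  # lands in the expert queue
```

Routing only the ambiguous middle band to people is what keeps the human workload small enough to satisfy Rule 1.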

This is called a collaborative system, and the third rule of scalable data unification systems is:

Rule 3: When domain-specific information is required, only collaborative data unification systems will scale

Traditional ETL and MDM systems rely on rules to match, merge and classify records. GE, for example, might instruct that “any transaction with Microsoft is classified as a computer equipment/software purchase.” This one rule might classify a few thousand transactions. To classify all of GE’s 80 million transactions would require thousands of rules—way beyond the number of rules a human can comprehend.

Moreover, I have never seen an implementation with thousands of rules. In short, rule systems don’t scale. In contrast, matching, merging and classifying at scale can be solved with machine learning. Rules remain useful as one way to generate training data when it is not available some other way; the machine learning system then handles the scale problem.
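
As a rough illustration of that division of labor, the sketch below uses a couple of hand-written rules to label a handful of transactions and then trains a simple text classifier to generalize to the rest. The categories, rules and scikit-learn pipeline are assumptions for illustration, not GE’s actual system.

```python
from typing import Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

transactions = [
    "Microsoft Office 365 annual license",
    "Staples paperclips bulk order",
    "Dell PowerEdge servers for data center",
    "HP laser printer toner cartridges",
]

def rule_label(text: str) -> Optional[str]:
    """A few hand-written rules; they cover only a fraction of the data."""
    lowered = text.lower()
    if "microsoft" in lowered:
        return "computer equipment/software"
    if "staples" in lowered:
        return "office supplies"
    return None

# The rules supply the training data ...
labeled = [(t, rule_label(t)) for t in transactions if rule_label(t) is not None]
texts, labels = zip(*labeled)

# ... and the machine-learning classifier generalizes to the millions of
# transactions the rules never mention.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(list(texts), list(labels))
print(model.predict(["Microsoft Azure subscription renewal"]))
```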

The final rule for scalable data unification is:

Rule 4: Scalable data unification must rely on machine learning, not rules

Taken together, these four rules—that scalable data unification systems must be mostly automated, schema-last, collaborative and rely on machine learning—point us toward a path that can unify large numbers of data sources and avoid the scalability failures of ETL, MDM and whiteboards. If an enterprise is tasked with unifying tens or hundreds of data sources, it will have to follow these four rules to succeed.
