Why matching patient IDs is a critical challenge

One review of supposedly good identity data showed that it was worse than anyone thought, underscoring the battle that healthcare faces.


The ability to consolidate and harmonize patient identities within and across health systems is at the core of the future of healthcare—for care quality, meaningful use, population health, precision medicine and cost management.

In fact, it is so critical that the Office of the National Coordinator for Health Information Technology (ONC) has set a new milestone for “all organizations that match electronic health information to have an internal duplicate record rate of no more than 0.5% at the end of 2020.”

But the average duplicate record rate for healthcare organizations ranges from 10 percent to 20 percent, and it’s been that high for years. Clearly, current state-of-the-art technologies and approaches are not adequate to hit this milestone.

Matching engines—like those found in master patient index tools—are the foundational technology responsible for linking and de-duplicating patient records. These engines use patients’ identity data as the key to making a match. For example, if a hospital has two records that both contain the data [Name: John Smith, Address: 123 Main Street, Birthdate: 11/23/1981], then the hospital’s matching engine will determine that both of those records belong to the same person.
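To make the example concrete, here is a deliberately simplified sketch of an exact-match rule. It is an illustration only, not how any particular master patient index works (commercial engines typically use probabilistic scoring). The `PatientRecord` type and the `normalize`, `match_key` and `same_person` helpers are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    record_id: str
    name: str
    address: str
    birthdate: str  # kept as MM/DD/YYYY text for simplicity


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(value.lower().split())


def match_key(record: PatientRecord) -> tuple:
    """Build the comparison key from the three identity attributes."""
    return (
        normalize(record.name),
        normalize(record.address),
        normalize(record.birthdate),
    )


def same_person(a: PatientRecord, b: PatientRecord) -> bool:
    """Assumed rule: two records match only if all three attributes agree."""
    return match_key(a) == match_key(b)


rec_a = PatientRecord("A-1", "John Smith", "123 Main Street", "11/23/1981")
rec_b = PatientRecord("B-7", "john  smith", "123 Main Street", "11/23/1981")
rec_c = PatientRecord("C-3", "Jon Smith", "456 Oak Avenue", "")

print(same_person(rec_a, rec_b))  # identical data apart from case and spacing
print(same_person(rec_a, rec_c))  # variant name, different address, missing birthdate
```

The first pair links because the data agrees once case and spacing are normalized. The second pair does not, even if both records describe the same patient, because the rule has nothing but the recorded values to go on.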

But matching engines are only as accurate as the data they are using, and patient identity data is notoriously inaccurate—it is mistyped, misspelled, mis-transcribed, out-of-date, incomplete, and rife with default entries for birthdates and SSNs. Even the best probabilistic matching algorithms on the market cannot overcome a misspelled name, an old address and a missing birthdate to link two records together.

In fact, matching engines typically achieve native match rates of only 70 percent unless organizations implement major data quality and data governance initiatives. And those rates drop when organizations try to harmonize identities across disparate healthcare institutions, because of the coordination and cooperation required.

We gained new insight into this problem after performing a comprehensive, massive-scale study of identity data from credit, telecommunications and government records. These records spanned nearly the entire adult population of the United States and contained historical data going back decades, including address and name history.

This dataset is significant because these industries have strict governance standards around data collection and strong incentives for individuals to correctly report their own data. To illustrate: if a patient’s identity data is incorrect on their electronic health record, they will still be seen by a doctor; but if a person’s identity data is wrong on their mortgage application, they won’t be able to buy a house. For this reason, we believe the results of this study represent a high ceiling on the quality of identity data across industries.

What we discovered about the state of identity data was worse than we anticipated—the data we analyzed was rife with errors and ambiguities. Here are a few highlights:
  • Over 120,000 people in the US were supposedly born in the 1800s. This is likely due to the mistyping of birthdates (for example, entering “11/23/1891” instead of “11/23/1981”).
  • Nearly five times as many birthdates are recorded as being on January 1 as on any other day of the year. This is likely due to the entry of default values (for example, unless otherwise stated, a birthdate is recorded as being on 1/1/1900). Similarly, more than twice as many birthdates are recorded as being on the first of a month as on any other day of the month.
  • More than 35,000 people have been recorded as having both the names Sarah and Sara, and more than 16,000 have been recorded as having both the names John and Jon. Similarly, more than 19,000 people have been recorded as having both the names Brian and Brain. While the first two represent name ambiguities, the third is a simple but common misspelling.
  • The average US adult has supposedly lived at 3.84 addresses. This number is disconcertingly low, because, in reality, the average person will live at nine different addresses throughout their adulthood.
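Anomalies like the ones listed above can be surfaced with simple audit checks. The following is a minimal sketch, assuming birthdates are already parsed into dates. The sample values are invented for illustration and are not data from the study, and the function names and the 1900 cutoff are assumptions of this sketch.

```python
from datetime import date

# Invented sample birthdates for illustration; not data from the study.
SAMPLE_BIRTHDATES = [
    date(1891, 11, 23),  # possibly a mistyped 1981
    date(1981, 11, 23),
    date(1900, 1, 1),    # looks like a default entry
    date(1975, 1, 1),
    date(1990, 6, 1),
    date(1968, 3, 14),
]


def implausible_birthdates(dates, earliest_year=1900):
    """Return birthdates earlier than a chosen cutoff year (the cutoff is an assumption)."""
    return [d for d in dates if d.year < earliest_year]


def share_matching(dates, predicate):
    """Fraction of dates for which predicate(date) is true; 0.0 for an empty list."""
    if not dates:
        return 0.0
    return sum(1 for d in dates if predicate(d)) / len(dates)


def is_adjacent_transposition(a, b):
    """True if b is a with exactly one pair of neighboring letters swapped."""
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )


jan1_share = share_matching(SAMPLE_BIRTHDATES, lambda d: (d.month, d.day) == (1, 1))
first_of_month_share = share_matching(SAMPLE_BIRTHDATES, lambda d: d.day == 1)

print(implausible_birthdates(SAMPLE_BIRTHDATES))
print(jan1_share, first_of_month_share)
print(is_adjacent_transposition("brian", "brain"))
```

If birthdates were spread evenly across the year, the January 1 share would be close to 1 in 365, so a much larger share is a hint that default values are being entered. The transposition check flags pairs like Brian and Brain as probable typos, while pairs like Sarah and Sara differ in length and need a different test.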

Some healthcare organizations will be able to achieve the ONC milestone of a 0.5 percent duplicate record rate by 2020. But to do so, these organizations will have to invest an enormous amount of time, energy and money to improve the quality of their patient data and to enforce strict data governance standards. Moreover, they will have to invest this time, energy and money on an ongoing basis to prevent new duplicate records from being created and to account for the fact that patient identity data becomes incorrect at a rate of 1 percent per month.
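To see why the effort has to be ongoing, it helps to work through the 1 percent monthly figure. The sketch below assumes each record independently has a 1 percent chance per month of going out of date and that the effect compounds. That is one reading of the figure, not something the source specifies, and `fraction_stale` is a hypothetical helper.

```python
def fraction_stale(monthly_error_rate: float, months: int) -> float:
    """Share of records expected to be incorrect after a number of months with no cleanup.

    Assumes each record independently goes stale with the same probability every month.
    """
    return 1 - (1 - monthly_error_rate) ** months


one_year = fraction_stale(0.01, 12)
print(f"{one_year:.1%} of records stale after 12 months")
```

Under that assumption, roughly 11 percent of a perfectly clean patient index would contain incorrect identity data after a single year without maintenance.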

For organizations not willing or able to go through these heroic efforts, there is a way to link patient records with high levels of accuracy despite incorrect, out-of-date and incomplete patient identity data. It is called referential matching, and it involves leveraging a highly curated third-party database of identity data as a reference during the matching process. Importantly, the databases used by referential matching technologies have pieced together fragments of identity data into cohesive “identities” for everyone in the United States. These fragments include historical data (like old addresses and maiden names), incorrect data (like common misspellings of names), and correct data (which is constantly updated to ensure currency).

All of this data differentiates referential matching from traditional approaches in two key ways:

Ability to use historical data to match. While The Sequoia Project and others consider address data less than ideal for matching because of its low stability over time, referential matching can use historical data to make a match. For example, a patient record with an old address and a maiden name will match to the same identity in the reference database as a patient record with a new address and a married name.

Ability to match based on the uniqueness of a patient’s identity data. The Sequoia Project discovered that the combination of a patient’s name and birthdate is unique 95.7 percent of the time. This means that a matching engine should be able to use only those two attributes to link the records of 95.7 percent of patients. However, there is no way for traditional matching engines to know which 95.7 percent of patients are uniquely identifiable from just their name and birthdate. Because of this, traditional matching engines will not make a match using only a few pieces of data. Referential matching engines, on the other hand, can use their comprehensive store of identity data to know whether very sparse data uniquely identifies a patient. Therefore, in theory, a referential matching engine could link a patient record with just a name and address to another with just a name and birthdate.
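Both ideas can be sketched in a few lines. This is a toy illustration under stated assumptions, not any vendor's implementation: the two-entry `REFERENCE_DB`, the `ReferenceIdentity` type and the `resolve` and `same_person` functions are all hypothetical, and a real reference database would cover an entire population and use far more tolerant comparisons.

```python
from dataclasses import dataclass


@dataclass
class ReferenceIdentity:
    identity_id: str
    names: set       # current, former and commonly misspelled names (lowercase)
    addresses: set   # current and historical addresses (lowercase)
    birthdate: str


# Hypothetical reference data: ID-001 has a maiden name and an old address on file.
REFERENCE_DB = [
    ReferenceIdentity("ID-001", {"jane doe", "jane smith"},
                      {"123 main street", "456 oak avenue"}, "11/23/1981"),
    ReferenceIdentity("ID-002", {"jane smith"},
                      {"789 pine road"}, "02/14/1990"),
]


def resolve(name=None, address=None, birthdate=None):
    """Return an identity ID only if the supplied attributes fit exactly one reference identity."""
    candidates = []
    for ident in REFERENCE_DB:
        if name is not None and name.lower() not in ident.names:
            continue
        if address is not None and address.lower() not in ident.addresses:
            continue
        if birthdate is not None and birthdate != ident.birthdate:
            continue
        candidates.append(ident.identity_id)
    # Sparse data is trusted only when it is unique within the reference database.
    return candidates[0] if len(candidates) == 1 else None


def same_person(rec_a: dict, rec_b: dict) -> bool:
    """Link two patient records if both resolve to the same reference identity."""
    id_a = resolve(**rec_a)
    id_b = resolve(**rec_b)
    return id_a is not None and id_a == id_b


old_record = {"name": "Jane Smith", "address": "123 Main Street"}  # maiden name, old address
new_record = {"name": "Jane Doe", "birthdate": "11/23/1981"}       # married name, no address

print(same_person(old_record, new_record))
print(resolve(name="Jane Smith"))  # a name alone is ambiguous in this toy database
```

The two records share no attribute values, yet they link because each resolves to the same reference identity. The name "Jane Smith" on its own fits two identities, so the sketch declines to match on it, which is the uniqueness test a traditional engine has no way to perform.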

Healthcare organizations have been trying to solve the problem of consolidating and harmonizing patient identities within and across systems for decades, but as many as one in five of their records is still a duplicate. Even the most sophisticated matching algorithms on the market are limited by the quality of the patient identity data they use to match. And identity data has proven time and again to remain stubbornly inaccurate, out-of-date, and incomplete. It’s perhaps time to consider new approaches.
