Correcting Anonymized Real World Data at Scale

As the volume and variety of healthcare data expand, so do the potential applications. Layered on this explosive growth is the need to protect patient privacy. So how do companies reconcile using data to solve clinical and business problems with the need to safeguard individuals’ protected health information (PHI)? Data de-identification is a process that replaces PHI with tokens, hashes, or unique identifiers so that privacy is maintained. By de-identifying and connecting data sources, companies can build more detailed longitudinal views of patients, and a more complete view of the patient journey, without exposing identities.

However, the process has multiple phases and is not without challenges.

Linking Records

The real value of healthcare data is gained by linking data sets. Linking provides a longitudinal view across data sources, known as the “patient journey.” A patient journey typically shows a patient’s interactions with the healthcare system, from encounters with specific providers, diagnoses, treatments, and healthcare decision-making through to outcomes, and so provides insights for clinical and business decision-making. Creating and maintaining the links is challenging when data comes from disparate sources and vendors, in different formats and with varying degrees of completeness. Expertise in transforming healthcare data is needed to match and link de-identified data efficiently so that it is consistent and ready to use. At Kythera Labs, we’ve partnered with Datavant, the industry leader in securely de-identifying and connecting patient data. We then apply data science and machine learning across our data sources and event assets to correct for instability in source transactions, such as token defects, and make de-identified data more accessible and accurate at scale.

A Bit More About De-identification

De-identification occurs by applying advanced statistical methods that remove PHI and replace this information with tokens. The data must then be Expertly Determined, a process that ensures an individual’s identity cannot be “reverse-engineered,” so that their privacy is maintained. For this post, we focus on the creation of tokens. Once de-identification has been accomplished, data sets can be joined using the patient tokens.
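To make the idea of tokens concrete, here is a minimal sketch of how two sources could be tokenized and then joined on the token alone. It is not the tokenization that Datavant or Kythera Labs uses; a keyed hash (HMAC-SHA256) is swapped in as a stand-in, and the key, field names, and sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; real systems manage keys securely.
SECRET_KEY = b"example-key-not-for-production"

def make_token(first_name: str, last_name: str, dob: str, sex: str) -> str:
    """Return a keyed hash of the normalized identifying fields."""
    normalized = "|".join(
        part.strip().upper() for part in (first_name, last_name, dob, sex)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

def tokenize(records, payload_field):
    """Drop the PHI fields, keeping only the token and one clinical field."""
    return [
        {
            "token": make_token(r["first"], r["last"], r["dob"], r["sex"]),
            payload_field: r[payload_field],
        }
        for r in records
    ]

# Two made-up sources holding records for the same person.
pharmacy_claims = [
    {"first": "Kathleen", "last": "Lee", "dob": "1980-04-12", "sex": "F", "drug": "metformin"}
]
medical_claims = [
    {"first": "kathleen ", "last": "LEE", "dob": "1980-04-12", "sex": "F", "dx": "E11.9"}
]

deid_pharmacy = tokenize(pharmacy_claims, "drug")
deid_medical = tokenize(medical_claims, "dx")

# Join the de-identified sets using nothing but the token.
journey = {}
for rec in deid_pharmacy + deid_medical:
    payload = {k: v for k, v in rec.items() if k != "token"}
    journey.setdefault(rec["token"], {}).update(payload)

print(len(journey), "token(s) after joining")
```

Because the inputs are normalized before hashing, the differences in case and whitespace between the two sources do not matter, and both records land under one token. Any difference that normalization does not absorb, such as a different spelling of the name, would produce a different token.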

The Gray Area: Possible Missed Matches

Once PHI is replaced with tokens, how do we confidently know that the records we join relate to the same person? In an ideal world, pairs of records would be compared, and there would be a simple, “Yes, they match” or “No, they do not match” output. In actuality, many factors create a gray area of possible matches.

Two central problems can occur when linking data sets. The first is an individual being counted as more than one person in the data set (a False Negative match). For example, Kathleen Lee’s records are associated with both Kathleen Lee and Kate Lee because of an input error on the patient name, so it appears that there are two patients when only one exists. These errors can leave a patient’s history disjointed, appearing as two or more patients, each with only part of the actual journey. This directly impairs the ability to build longitudinal patient behavior models and inflates patient and encounter counts. When using anonymized data, it is difficult to know on the face of it where the data linkage went astray.

On the opposite end of the spectrum, errors in linking can occur when two or more different individuals are assigned the same token or ID (a False Positive match). An example of a False Positive match is when records for Ann Garcia and Ann Garcia (different people born on the same day) are inaccurately matched and linked. When this problem occurs, how can you know that the longitudinal path is correct if it includes events from two or more different individuals? Both kinds of error result in missed or erroneous insights because the longitudinal journey is misrepresented. (For a more detailed discussion of False Positives and False Negatives and how they are used in a Confusion Matrix, click here for our Guide to Machine Learning.)
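The two error types can be counted pairwise when a labeled sample is available. The sketch below assumes such a sample exists, which is rarely the case with anonymized data; the record IDs, person labels, and token names are invented to mirror the Kathleen Lee and Ann Garcia examples.

```python
from itertools import combinations

# Hypothetical labeled sample: (record_id, true_person, assigned_token).
records = [
    ("r1", "kathleen_lee", "tok_A"),  # entered as "Kathleen Lee"
    ("r2", "kathleen_lee", "tok_B"),  # entered as "Kate Lee": identity is split
    ("r3", "ann_garcia_1", "tok_C"),
    ("r4", "ann_garcia_2", "tok_C"),  # a different Ann Garcia: identities merged
    ("r5", "ann_garcia_1", "tok_C"),
]

def pairwise_link_errors(rows):
    """Compare every pair of records and tally the linkage outcome."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for (_, person_a, token_a), (_, person_b, token_b) in combinations(rows, 2):
        same_person = person_a == person_b
        same_token = token_a == token_b
        if same_token and same_person:
            counts["tp"] += 1  # correctly linked
        elif same_token:
            counts["fp"] += 1  # different people share a token
        elif same_person:
            counts["fn"] += 1  # one person split across tokens
        else:
            counts["tn"] += 1  # correctly kept apart
    return counts

counts = pairwise_link_errors(records)
print(counts)  # {'tp': 1, 'fp': 2, 'fn': 1, 'tn': 6}
```

The single False Negative is the Kathleen/Kate pair, and the two False Positives are the pairs that join the second Ann Garcia to the first one's records.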

Resolving Erroneous Tokens

It is a known industry problem that de-identified claims data will contain more patient tokens than there are people in a defined geographic population. Using several metrics, such as birth and death rates, patient migration patterns, and other measures, Kythera Labs can assess the rate of patient tokens that should be expected in a data set. Even when accounting for these metrics, the number of patient tokens present in the data, and the number of new patient tokens created with each incremental delivery of new claims, outpace what would be expected.

For example, one collection of de-identified U.S. healthcare claims data that we have seen has 410M+ patient tokens across a 7-year history. This is 23% higher than the entire estimated population of the United States during that period, not even accounting for what one would expect for an insured patient population.
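As a quick back-of-the-envelope check, the figures above imply roughly how many tokens are surplus. This sketch reads “23% higher” as tokens = 1.23 × population, which is an interpretation rather than something stated outright, and it treats “410M+” as a lower bound.

```python
tokens = 410_000_000   # "410M+" in the text, so a lower bound
excess_ratio = 0.23    # tokens are 23% above the estimated population

# Population implied by the two figures, and the surplus over it.
implied_population = tokens / (1 + excess_ratio)
excess_tokens = tokens - implied_population

print(f"Implied population: {implied_population / 1e6:.1f}M")  # 333.3M
print(f"Tokens in excess:   {excess_tokens / 1e6:.1f}M")       # 76.7M
```

Since not everyone in the population is insured or files a claim, the true surplus relative to actual patients would be larger than this estimate.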

The Kythera Labs Team works to address token problems by identifying which patient tokens are most likely representations of patients already in our claims repository. This process begins with leveraging our entire claims repository to define where every diagnosis and procedure code pairing would likely occur in a patient’s claim history. Additionally, we assess patient records based on the history of healthcare encounters, considering that some patients may come into view abruptly due to a valid reason, such as recently acquiring insurance.
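One simple way to picture this idea is sketched below. It is only an illustration under stated assumptions, not Kythera Labs’ actual method: it learns where each diagnosis and procedure pairing typically falls in a history, then flags tokens whose history opens with a pairing that normally appears late. The repository contents, the placeholder code labels, the `late_threshold` value, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical repository: token -> ordered (diagnosis, procedure) pairings.
repository = {
    "tok_1": [("chest_pain", "office_visit"), ("cad", "angiography"),
              ("cad", "stent_placement"), ("stent_followup", "office_visit")],
    "tok_2": [("chest_pain", "office_visit"), ("cad", "angiography"),
              ("stent_followup", "office_visit")],
    "tok_3": [("chest_pain", "office_visit"), ("cad", "stent_placement"),
              ("stent_followup", "office_visit"), ("stent_followup", "office_visit")],
}

def typical_positions(repo):
    """Mean relative position of each pairing (0 = start of history, 1 = end)."""
    positions = defaultdict(list)
    for history in repo.values():
        if len(history) < 2:
            continue  # a single claim has no meaningful relative position
        last = len(history) - 1
        for i, pair in enumerate(history):
            positions[pair].append(i / last)
    return {pair: mean(vals) for pair, vals in positions.items()}

def is_suspect(history, typical, late_threshold=0.6):
    """Flag a history whose first pairing usually shows up late in a journey."""
    if not history:
        return False
    return typical.get(history[0], 0.0) >= late_threshold

typical = typical_positions(repository)

# Two made-up tokens that appeared in a new delivery of claims.
new_tokens = {
    "tok_new_a": [("stent_followup", "office_visit"), ("stent_followup", "office_visit")],
    "tok_new_b": [("chest_pain", "office_visit"), ("cad", "angiography")],
}
suspects = [tok for tok, hist in new_tokens.items() if is_suspect(hist, typical)]
print(suspects)  # ['tok_new_a']
```

A flag here only marks a candidate for review. A patient whose history starts with a follow-up visit may have a valid reason for appearing abruptly, such as recently acquiring insurance, so a real process would need further evidence before merging tokens.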

Once we have completed our processes, we can identify suspect tokens and the number of claims associated with those tokens. Through machine learning algorithms, we can correct suspect tokens at scale to reduce the number of False Positives and False Negatives.

Be a Smarter Data User

Life sciences companies, researchers, and healthcare providers link de-identified real-world data from various sources. While there are many benefits, it is important to be aware of the problems associated with improper linking techniques. False Negatives or duplicates can be successfully minimized, while False Positives (when two or more different individuals get assigned the same token) can create a false narrative. Kythera has spent years creating more accurate and complete healthcare data. Get in touch to learn more about our process. A more informed healthcare user leads to better outcomes for all.

Published: August 30, 2022

Ryan Leurck

Chief Analytics Officer

Ryan leads the Analytics and Products teams at Kythera Labs. He is an engineer and data scientist with over 13 years of experience in operations research, system-of-system design, and research and development portfolio valuation and analysis. Ryan got his start on the research faculty at The Georgia Institute of Technology Aerospace System Design Lab, where he led researchers in the application of machine learning and big data technologies.
