This is especially true when working with healthcare claims. Specific fields in information sources may not be available to users for a variety of reasons. Whatever the cause, this “missingness” may introduce biases into your analyses, threaten the validity of your insights, and ultimately erode trust in your data and conclusions if it is not heeded and addressed. Missingness in healthcare data may be as subtle as having only 9 digits of a 10-digit NPI or as critical as missing an entire segment of healthcare transactions. While many data scientists address missingness by simply omitting bad data and using clean-up methods such as imputation, doing so is a very risky proposition. Why? Because missingness not only affects the integrity of your data, it also affects the utility of the conclusions you can draw from it. Ultimately, the problem with missingness comes down to not knowing what you don’t know.
Here is a rather simple example to that point. Many boxes that house fragile goods have wording on the top that says “FRAGILE - THIS END UP.” This information helps the consumer make sure the contents of the package are not damaged when the box is opened. Yet those same words are only half as effective if the words “OTHER SIDE UP” are not on the bottom of the box. In other words, two boxes with the same words on the top provide different levels of utility to the user depending on what is printed on the opposite side of the box.
In the healthcare space, let’s say the planning team at a national health system wants to understand its market strength for performing knee replacements in a state on the other side of the country. The team decides to do this by purchasing claims from the leading claims consolidators, without understanding that these claims are only those that each consolidator has access to. Providers that bill payers directly, for example, will not be in the data sets. If this team were to analyze the claims under the assumption that it was looking at the entire market, it would risk undercounting several leading providers.
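To make the size of that error concrete, here is a small arithmetic sketch. The provider names and procedure counts are invented purely for illustration, and `market_share` is a hypothetical helper, not part of any product.

```python
def market_share(procedure_counts, provider):
    """Share of all procedures in the data set performed by one provider."""
    total = sum(procedure_counts.values())
    return procedure_counts[provider] / total if total else 0.0

# Hypothetical knee replacement counts for the whole market (illustrative only).
full_market = {"Our System": 200, "Provider A": 500, "Provider B": 300}

# Assume Provider A bills payers directly, so its claims never reach the
# consolidators and are absent from the purchased data.
purchased = {name: n for name, n in full_market.items() if name != "Provider A"}

true_share = market_share(full_market, "Our System")    # 200 / 1000
apparent_share = market_share(purchased, "Our System")  # 200 / 500

print(f"True share: {true_share:.0%}, apparent share: {apparent_share:.0%}")
```

In this toy case the purchased data makes the health system look twice as strong as it really is, and nothing in the data itself signals that a provider is missing.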
So how do you maximize the utility of your healthcare data in spite of missingness? One thing you can do is work with your data providers to truly understand the contents and context of your data. You would be surprised how often data vendors leave out key details about their data during the sales cycle. Some vendors omit information because of contractual restrictions, and unless you know about these omissions in advance, you may not be getting the data you want or expect.
One way we check for missingness is by combining a candidate data set with one or more vetted sources of data to test its integrity. This “truth data” is published at the state and federal level and is readily available online, either for free or for purchase. You’d also be surprised how often your own in-house data can be used to verify the completeness of one or more facets of a third-party data set.
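As a rough illustration of that kind of comparison, the sketch below checks which providers on a reference list never appear in a vendor claims extract. It is a minimal sketch under stated assumptions, not a description of how any particular platform works: the `billing_npi` field, the `coverage_report` function, and the sample NPIs are all hypothetical, and the only validation applied is the 10-digit format check mentioned earlier.

```python
from collections import Counter

def is_well_formed_npi(value):
    """Format check only: an NPI should be a string of exactly 10 digits."""
    return (
        isinstance(value, str)
        and len(value) == 10
        and value.isascii()
        and value.isdigit()
    )

def coverage_report(vendor_claims, reference_npis):
    """Compare providers seen in vendor claims against a vetted reference list.

    vendor_claims: iterable of dicts with a hypothetical 'billing_npi' key.
    reference_npis: set of NPIs taken from a trusted ("truth") source.
    """
    malformed = 0
    claim_counts = Counter()
    for claim in vendor_claims:
        npi = claim.get("billing_npi")
        if not is_well_formed_npi(npi):
            malformed += 1  # missing or truncated identifier
            continue
        claim_counts[npi] += 1

    seen = set(claim_counts)
    matched = reference_npis & seen
    coverage = len(matched) / len(reference_npis) if reference_npis else 0.0
    return {
        "malformed_claims": malformed,
        "missing_providers": sorted(reference_npis - seen),
        "provider_coverage": coverage,
    }

# Made-up sample data for illustration only.
claims = [
    {"billing_npi": "1111111111"},
    {"billing_npi": "1111111111"},
    {"billing_npi": "222222222"},   # only 9 digits
    {"billing_npi": None},          # identifier missing entirely
    {"billing_npi": "3333333333"},
]
reference = {"1111111111", "2222222222", "3333333333", "4444444444"}

report = coverage_report(claims, reference)
print(report)
```

A low `provider_coverage` or a long `missing_providers` list does not say why the data is incomplete, but it does tell you which questions to take back to the vendor before drawing market-level conclusions.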
One of the major difficulties of merging vendor data with truth data, whether your own, purchased, or publicly available, is doing so in a manner that preserves the integrity of the data without violating HIPAA regulations. Our Wayfinder platform and its tokenization process provide a readily available solution to that problem. In addition, Wayfinder has pre-configured pipelines that apply machine learning algorithms to correct commonly known missing values. It also includes mastered directories that help keep the data complete and accurate. This gives the user a head start in correcting the issues of incomplete data. Once it is installed, you can begin answering those important business questions in minutes.
To learn more about how we apply machine learning to improve data, get in touch.