It took a pandemic, but the US finally has (some) centralized medical data
Throughout the pandemic, there has been serious tension between what the public wants to know, and what scientists have been able to say for certain.
Scientists have learned more about covid, faster, than about any other disease in history—but at the same time, the public has been shocked when doctors can’t answer seemingly basic questions: What are the symptoms of covid-19? How does it spread? Who’s most susceptible? What’s the best way to treat it?
Nowhere has this conflict been more clear than in the US, which spends nearly a fifth of its gross domestic product on healthcare, but achieves worse outcomes than any other wealthy country. Finding the answers has been complicated not just because the science is hard, but because American healthcare is built on a patchwork of incompatible, archaic systems.
Across the nation, federal, state, and local privacy laws overlap and sometimes contradict one another. Medical records, meanwhile, are messy, fragmented, and intensely siloed by the institutions that own them—both for privacy reasons and because selling de-identified medical data is incredibly profitable.
But accessing data trapped in these silos is the only way to answer questions about covid. That’s why so much vital research has been done abroad, in countries with national healthcare systems, despite the US having a huge number of both covid patients and research institutions. Some of the strongest data on risk factors for covid mortality and features of long covid have come from the UK, for example, where public health researchers have access to data from 56 million NHS patients’ medical records.
At the beginning of the pandemic, a group of researchers funded by the US National Institutes of Health, or NIH, realized that many questions about covid-19 would be impossible to answer without breaking down barriers to data sharing. So they developed a framework for combining actual patient records across institutions in a way that could be both private and useful.
The result is the National COVID Cohort Collaborative (N3C), which collects medical records from millions of patients around the country, cleans them, and then grants access to groups studying everything from when to use a ventilator to how covid affects women’s periods.
“It’s just shocking that we had no harmonized, aggregate health data for research in the face of a pandemic,” says Melissa Haendel, a professor of medical informatics at Oregon Health & Science University and one of the co-leads of N3C. “We never would have gotten everyone to give us this degree of data outside the context of a pandemic, but now that we’ve done it, it’s a demonstration that clinical data can be harmonized and shared broadly in a secure way, and a transparent way.”
The database is now one of the largest collections of covid records in the world, with 6.3 million patient records from 56 institutions and counting, including records from 2.1 million patients with the virus. Most records go back to 2018, and contributing organizations have pledged to keep updating them for five years. That makes N3C not just one of the most useful resources for studying the disease today, but one of the most promising ways to study long covid.
Institutions sending records, in bulk, to a centralized federal repository is an anomaly in American healthcare. Put to good use, the database has the potential to answer detailed questions long after the pandemic. And it may even serve as proof of concept for similar efforts in the future.
Open-source data
To contribute information to the database, participating providers first pick two groups of patients: people who have tested positive for covid, and those who will serve as a control group. They then strip out everything that makes the data personally identifiable, except zip code and dates of service, and transmit it securely to N3C. There, technicians clean the data—not always an easy task—and put it into the database.
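To make that step concrete, here is a minimal Python sketch of the kind of de-identification pass described above. The field names, record layout, and identifier list are hypothetical illustrations, not N3C’s actual schema or pipeline.

```python
# Hypothetical sketch of a de-identification pass: drop direct identifiers,
# keep zip code and dates of service. Field names are illustrative only.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email", "ssn", "mrn"}

def deidentify(record: dict) -> dict:
    """Strip direct identifiers while keeping zip code and dates of service."""
    return {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS
    }

raw = {
    "mrn": "000123",                # medical record number: dropped
    "name": "Jane Doe",             # dropped
    "zip": "97239",                 # kept, allows geographic analyses
    "service_date": "2021-03-14",   # kept, allows studying disease course
    "covid_pcr_result": "positive",
}

print(deidentify(raw))
# {'zip': '97239', 'service_date': '2021-03-14', 'covid_pcr_result': 'positive'}
```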
Anyone can submit a research proposal through N3C’s dashboard, whether or not they’re affiliated with a submitting institution. Even citizen scientists can request access to an anonymized version of the dataset.
A committee at Johns Hopkins reviews each proposal, and decides what version of the data researchers will be able to access. There are several tiers of information: a limited dataset, a second level of real records with zip codes and dates obscured, and a third made of computer-generated “synthetic” records, which attempt to keep the same attributes of the records without containing any real patient data. Everyone has to go through data security training before they can access it.
So far 215 research projects have been approved, including studies to track outcomes for patients who received different covid vaccines and to examine the complication rates of elective surgeries in non-covid patients during the pandemic. The first publication from the collaborative was an analysis of mortality risk factors in cancer patients who contract SARS-CoV-2, and several preprints have been released, including analyses of covid outcomes in liver disease patients and people with HIV.
More accountability, better science
Clean, accurate data is vital to such studies, but it’s been tough to come by in the chaos of the pandemic. Last June, two major journals, the New England Journal of Medicine and The Lancet, retracted papers based on ‘data’ from Surgisphere, a little-known medical data company with a handful of employees. It claimed to have access to real-time patient records of nearly 100,000 covid patients in 700 hospitals around the world–sometimes including more patients than had actually been diagnosed in a given country.
Before being retracted, the papers were used to halt clinical trials and alter medical practices. But when researchers became suspicious—particularly given that even a single medical data transfer agreement takes enormous time and labor—the company refused to let anyone audit the data. In fact, there’s no proof the database ever existed.
N3C, on the other hand, is auditable by, and accountable to, thousands of researchers at hundreds of participating institutions, with a strong focus on transparency and reproducibility. Everything users do within the interface, which uses Palantir’s GovCloud platform, is carefully preserved, so anyone with access can retrace their steps.
“This isn’t rocket science, and it isn’t really new. It’s just hard work. It’s tedious, it has to be done carefully, and we have to validate every step,” says Christopher Chute, a professor of medicine at Johns Hopkins who also co-leads N3C. “The worst thing we could do is methodically transform data into garbage that would give us wrong answers.”
Brute force
Haendel points out that these efforts haven’t come easy. “The diversity in expertise that it took to make this happen, the perseverance, dedication, and, frankly, brute force, is just unprecedented,” she says.
That brute force has come from many different fields, many of them not traditionally part of medical research.
“Having everyone on board from all aspects of science really helped. During covid people were much more willing to collaborate,” says Mary Boland, a professor of informatics at the University of Pennsylvania. “You could have engineers, you could have computer scientists, physicists, all these people who might not normally participate in public health research.”
Boland is part of a group using the N3C data to investigate whether covid increases irregular bleeding in women with polycystic ovarian syndrome. Outside of covid, most researchers have to use insurance claims data to get a large enough database for population-level analyses, she says.
Claims data can answer some questions about how well drugs work in the real world, for instance. But those databases are missing huge amounts of information, including lab results, what symptoms people are reporting, and even whether patients die.
Collecting and cleaning
Outside of insurance claims databases, most health data collaboratives in the US use a federated model. Participants in these studies all agree to convert their own datasets into a common format, and then run queries on behalf of the collective, such as the proportion of serious covid cases by age group. Several international covid research collectives, including the Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”), operate this way, avoiding the legal and political problems of moving patient data across borders.
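As a rough illustration of that model, here is a minimal Python sketch of a federated count query, assuming every site has already mapped its records into the shared schema. The schema, age bins, and example records are invented, and real networks such as OHDSI use far more elaborate tooling; the point is that only summary counts, never patient-level rows, leave each site.

```python
# Minimal sketch of a federated aggregate query: each site runs the same
# counting logic locally and shares only summary counts with the network.
from collections import Counter

def local_counts(site_records, age_bins=(0, 18, 50, 65, 120)):
    """Count severe covid cases per age group at a single site."""
    counts = Counter()
    for rec in site_records:
        if rec["severe_covid"]:
            for lo, hi in zip(age_bins, age_bins[1:]):
                if lo <= rec["age"] < hi:
                    counts[f"{lo}-{hi - 1}"] += 1
                    break
    return counts

# The coordinating center only ever sees and sums the per-site summaries.
site_a = [{"age": 34, "severe_covid": True}, {"age": 70, "severe_covid": True}]
site_b = [{"age": 67, "severe_covid": True}, {"age": 12, "severe_covid": False}]
print(local_counts(site_a) + local_counts(site_b))
# Counter({'65-119': 2, '18-49': 1})
```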
OHDSI, which was founded in 2014, has researchers from 30 countries, who together hold records for 600 million patients.
“That allows each institution to keep their data behind their own firewalls, with their own data protections in place. It doesn’t require any patient data to move back and forth,” says Boland. “That’s comforting for a lot of places, especially with all the hacking that’s been going on lately.”
But relying on each institution to do a good job getting their data into a common format, without any central body to vet how clean the data is, carries a lot of risks.
“Getting data into a common data format is the biggest challenge, because even medication names—you’d think that would be standardized across the US, but it’s really not,” says Boland. “Pharmacies will often have their generic drug and it might have slightly different ingredients because of patent laws. Each of those is its own drug name.”
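Harmonization of that kind typically relies on mapping each site’s local strings onto a standard vocabulary (RxNorm is one widely used standard for medications). Below is an illustrative Python sketch of the idea; the local names and mapping table are invented for demonstration, not drawn from any real terminology service.

```python
# Illustrative sketch of medication-name harmonization: map each site's
# local drug strings onto a single standard concept. The table is made up.
LOCAL_TO_STANDARD = {
    "acetaminophen 500 mg tab": "Acetaminophen 500 MG Oral Tablet",
    "tylenol 500mg": "Acetaminophen 500 MG Oral Tablet",
    "apap 500": "Acetaminophen 500 MG Oral Tablet",
}

def harmonize_drug(local_name):
    """Return the standard concept for a site's local drug string, if known."""
    key = local_name.strip().lower()
    # Unmapped strings are flagged for manual curation rather than guessed at.
    return LOCAL_TO_STANDARD.get(key)

print(harmonize_drug("Tylenol 500mg"))    # Acetaminophen 500 MG Oral Tablet
print(harmonize_drug("ibuprofen 200mg"))  # None -> needs curation
```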
N3C, on the other hand, asks participants to send their raw, messy records to one place and let the central body clean them up and standardize them. While there are many obvious benefits, there are significant legal and social obstacles, both in America and internationally; many institutions, for instance, can’t contribute to N3C because of privacy laws in their states.
It’s also technologically challenging. Combining even two sets of electronic medical records is extremely difficult and labor-intensive; the quality of data is often low, and there’s little standardization. In multi-site healthcare organizations, as many as 1 in 5 medical records are duplicate files, mostly due to data entry screw-ups during appointments or check-ins, according to a 2018 Pew paper.
Those defending federated models often claim they do their own quality control behind their firewall. But N3C researchers were shocked to find out just how messy the data was.
“There was a certain amount of skepticism from sites, like, ‘We don’t really need this kind of data quality framework, we already do that at our own sites confidentially, behind our firewall, we don’t need your stinking harmonization tools,’” says Haendel. “But we learned those quality measures are insufficient when you look at data in aggregate.”
Some of the data quality problems have bordered on the absurd.
“In some cases, organizations have failed to put in units of measure. So there was a weight, but there was no unit, like we were just supposed to know,” says Chute. But having such a huge number of records gave them an advantage, and let them save many datapoints that otherwise would have been thrown out.
“We were able to look at the distributions of data for which we did have units, and see where the mystery data fit. You can just eyeball it—oh, this is obviously pounds, or kilograms.”
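A rough Python sketch of that distribution-matching idea is below, assuming a batch of unit-less weights all came from the same site and field. The reference values and the simple median comparison are illustrative only, not N3C’s actual method.

```python
# Minimal sketch of inferring a missing unit by comparing a batch of
# unlabeled weights against values that did arrive with units.
from statistics import median

KNOWN = {
    "kg": [58.0, 71.2, 80.5, 95.0],      # weights that arrived labeled "kg"
    "lb": [128.0, 157.0, 177.5, 210.0],  # weights that arrived labeled "lb"
}

def infer_unit(unlabeled):
    """Assign the unit whose labeled distribution sits closest (by median)."""
    m = median(unlabeled)
    return min(KNOWN, key=lambda unit: abs(m - median(KNOWN[unit])))

print(infer_unit([165.0, 140.2, 188.9]))  # 'lb': the batch sits near the pound values
```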
A big fish in a much bigger ocean
As extensive as it is, the N3C database is dwarfed by the scale of data collected and maintained elsewhere in the US healthcare system. Government agencies, hospitals, testing labs, insurers, and others all keep their own fragments of the healthcare ecosystem: The Department of Health and Human Services tracks more than 2,000 health-related datasets from federal, state, and local agencies alone.
The usefulness of each is hobbled by compulsive data siloing: it’s essentially impossible for researchers, on their own, to connect Medicare claims, records from vaccine registries, states’ race and ethnicity data for vaccinations, or databases of covid-19 variants sequenced from patient samples around the country. Indeed, turning raw records into useful information is so challenging it’s become a thriving private industry, where data brokers buy de-identified records in bulk, analyze correlations between variables, and sell their analyses—or the data itself—to researchers and governments.
“We’re willing to give all our data to a commercial entity and let them sell it back to us, but we’re unwilling to pay for the most basic public health infrastructure,” says Haendel. “This volunteer effort in the face of a pandemic is amazing, but it’s not a sustainable long-term solution for dealing with future pandemics. Or just healthcare in general.”
The N3C approach steers away from some of those problems, but there are significant holes in its data, notably information on vaccinations. Most vaccine doses are being administered at community sites, while the collaborative’s records come from primary care visits and hospitalizations, which means that just 245,000 Pfizer and 104,000 Moderna doses have been captured in the records. A healthcare analytics company is building a tool to securely integrate patient records from multiple sources, but it won’t be available for at least a few months.
Even with those gaps, though, N3C’s enormous database offers one of the best resources for researchers looking to answer the many unsolved questions about covid.
“That’s kind of where we’re stuck now,” says Haendel. “We really need domain experts in all different aspects of clinical care and the science behind them, to help us find all the needles in haystacks.”
This story is part of the Pandemic Technology Project, supported by The Rockefeller Foundation.