Lessons from the pandemic’s superstar data scientist, Youyang Gu

The data scientist Youyang Gu thinks of himself as a realist—he declares it in his Twitter profile: “Presenter of unbiased takes. Realist.”

When he noticed the scattershot covid-19 projections last spring—one model projected 2 million US deaths by the summer, another predicted 60,000—Gu questioned whether that was as good as the modeling could be. He decided to take a shot at making a covid-19 model himself. “My whole entire goal was to produce the most accurate model possible,” Gu says, from his apartment in Manhattan. “No ‘if this’ or ‘if that.’ Basically, no ‘ifs.’ It doesn’t really matter what the scenarios are. I just wanted to lay it out: ‘This is the most likely or realistic forecast for what’s going to happen.’”

Within a week, he’d built a machine-learning model and launched his COVID-19 Projections website. He ran the model every day—it only took one hour on his laptop—and posted covid-19 death projections for 50 US states, 34 counties, and 71 countries.

By the end of April, he was attracting attention—ultimately, millions checked his website daily. Carl Bergstrom, a professor of biology at the University of Washington, took notice and commented on Twitter that Gu’s model was “making predictions that seem as good as any I’ve seen.”

“I can be a bit of an ML skeptic. But in this case, don’t let the ‘machine learning’ text fool you into thinking this is snake oil,” Bergstrom tweeted.

An MIT grad with a master’s degree in electrical engineering and computer science (plus a degree in math), Gu, 27, had been working on a sports analytics startup when the pandemic hit. But he put that venture on pause as major league sports shut down. And then, by simply googling “epidemiology,” he began his foray into covid-19 modeling.

“I had zero background in infectious-disease modeling,” he says. But he did have a few years’ experience as a data scientist in finance, working with statistical models—models that, based on certain statistical assumptions, analyze data and make projections about, say, where the price of a stock will be in the future.

“It turns out that a lot of infectious-disease modeling is basically statistical modeling,” says Gu. And the finance industry’s profit-driven goal for accuracy served him well in the epidemiological domain. “If you can’t make an accurate model in finance, you won’t have a job anymore,” he says. By contrast, the goal in academia—from Gu’s perspective, at least—is not so much to make accurate models, but rather to publish papers and inform public policy. “That’s not to say they don’t make accurate models—just that they don’t optimize specifically for accuracy,” he says.

Gu’s model combines machine learning with a classic infectious-disease simulator called an SEIR model (factoring in individuals in the population who are susceptible, exposed, infectious, recovered, or removed due to death).

The SEIR component uses as input a simulated set of parameters—a best-guess range for variables such as the basic reproduction number (the average number of new infections each case causes in an entirely susceptible population at the start of an outbreak, before interventions or immunity), infection rate, lockdown date, reopening date, and effective reproduction number (the same average after some interventions). In terms of outputs, the SEIR simulator first computes the infections over time, and then computes the deaths (multiplying infections by the infection fatality rate).
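
To make the two-step output concrete, here is a minimal sketch of a daily-step SEIR simulator that first computes infections and then derives deaths from them. It is not Gu's code. The function name, the default incubation and infectious periods, the 21-day lag between infection and death, and the single switch from the basic to the effective reproduction number on a lockdown day are all assumptions made for illustration.

```python
def simulate_seir(population, r0, r_eff, lockdown_day, days,
                  ifr=0.01, incubation_days=5.0, infectious_days=7.0,
                  death_lag_days=21, initial_exposed=100.0):
    """Toy SEIR run. Returns (infections, deaths), one value per day.

    All defaults are illustrative assumptions, not values from the article.
    """
    sigma = 1.0 / incubation_days    # daily rate of moving from exposed to infectious
    gamma = 1.0 / infectious_days    # daily rate of moving from infectious to removed
    s = population - initial_exposed
    e = initial_exposed
    i = 0.0
    r = 0.0
    infections = []
    for day in range(days):
        # Use the basic reproduction number before the lockdown date,
        # and the effective reproduction number from that date on.
        r_t = r0 if day < lockdown_day else r_eff
        beta = r_t * gamma
        newly_exposed = beta * s * i / population
        newly_infectious = sigma * e
        newly_removed = gamma * i
        s -= newly_exposed
        e += newly_exposed - newly_infectious
        i += newly_infectious - newly_removed
        r += newly_removed
        infections.append(newly_exposed)
    # Step two: deaths are past infections multiplied by the infection
    # fatality rate, shifted by an assumed infection-to-death lag.
    deaths = ([0.0] * death_lag_days + [ifr * x for x in infections])[:days]
    return infections, deaths


infections, deaths = simulate_seir(1_000_000, r0=2.5, r_eff=0.9,
                                   lockdown_day=30, days=120)
_, deaths_no_lockdown = simulate_seir(1_000_000, r0=2.5, r_eff=2.5,
                                      lockdown_day=30, days=120)
```

In this toy version, keeping the reproduction number at its pre-lockdown value yields more cumulative deaths over the same window than the run in which it drops to 0.9, which is the kind of sensitivity the parameter set is meant to capture.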

Gu’s machine-learning layer then generates thousands of different combinations for those parameter sets in trying to find the real-life parameters for each geographical region. It learns which parameters generate the most accurate death projections by comparing the SEIR predictions with real data on daily deaths from Johns Hopkins University. “It tries to learn what parameter sets generate deaths that most closely match the actual observed data, looking back,” says Gu. “And then it uses those parameters to forecast and make projections about deaths into the future.”
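
The fit-then-forecast loop Gu describes can be sketched in a few lines. The article does not say which search method he used, so the random search below is a stand-in, and the "observed" deaths are synthetic numbers generated from known parameters rather than Johns Hopkins data. The function names, parameter ranges, and the compact simulator are hypothetical.

```python
import random


def simulate_deaths(r0, r_eff, ifr, days, population=1_000_000,
                    lockdown_day=30, death_lag_days=21):
    """Toy daily-step SEIR run; returns expected deaths for each day."""
    sigma, gamma = 1 / 5.0, 1 / 7.0  # assumed incubation and infectious periods
    s, e, i = population - 100.0, 100.0, 0.0
    infections = []
    for day in range(days):
        beta = (r0 if day < lockdown_day else r_eff) * gamma
        newly_exposed = beta * s * i / population
        newly_infectious = sigma * e
        s -= newly_exposed
        e += newly_exposed - newly_infectious
        i += newly_infectious - gamma * i
        infections.append(newly_exposed)
    # Deaths are lagged infections scaled by the infection fatality rate.
    return ([0.0] * death_lag_days + [ifr * x for x in infections])[:days]


def squared_error(predicted, observed):
    """Sum of squared differences between two daily death series."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def fit_by_random_search(observed_deaths, n_trials=2000, seed=0):
    """Try many parameter sets and keep the one that best matches past deaths."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "r0": rng.uniform(1.0, 4.0),      # assumed plausible ranges
            "r_eff": rng.uniform(0.5, 1.5),
            "ifr": rng.uniform(0.002, 0.02),
        }
        predicted = simulate_deaths(days=len(observed_deaths), **params)
        loss = squared_error(predicted, observed_deaths)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# Synthetic "observed" data stands in for real daily death counts.
true_params = {"r0": 2.8, "r_eff": 0.9, "ifr": 0.01}
observed = simulate_deaths(days=90, **true_params)

# Look back: learn the parameters that best reproduce the past 90 days.
best_params, best_loss = fit_by_random_search(observed)

# Look forward: reuse those parameters to project the next 30 days.
forecast = simulate_deaths(days=120, **best_params)[90:]
```

Only past deaths enter the fit; everything else is a parameter the search is free to adjust. A production model would also need a way to express uncertainty, which this sketch leaves out.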

The forecasts proved remarkably accurate. For instance, on May 3, he made an appearance on CNN Tonight and shared his model’s projections that the US would reach 70,000 deaths on May 5, 80,000 deaths on May 11, 90,000 deaths on May 18, and 100,000 deaths on May 27. On May 28, he tweeted, “covid19-projections.com got all 4 dates exactly correct.” With some rounding, that was true.

The model wasn’t perfect, of course, but it impressed Nicholas Reich, a biostatistician and infectious-disease researcher at the University of Massachusetts, Amherst, whose lab, in collaboration with the US Centers for Disease Control and Prevention, aggregates results from about 100 international modeling teams. Among all the aggregated models, Reich observed, Gu’s model was “consistently among the top.”

On October 6, Gu posted his final death forecast, just before the fall wave. The model projected there would be 231,000 deaths in the US by November 1. The total recorded by that date: 230,995.

Gu shut down his first model in early October because by then there were lots of teams doing good death forecasts. He turned instead to modeling true infections versus reported infections. And then in December he started tracking vaccine rollout and the elusive “path to herd immunity”—which in early 2021 he revised to “path to normality.” Whereas herd immunity is achieved when a sufficient portion of a population is immune to the virus, thus curtailing further spread, Gu defines normality as “the lifting of all covid-19-related restrictions for the majority of US states.”

“It became clear that we’re not going to reach herd immunity in 2021, at least definitely not across the whole country,” he says. “And I think it’s important, especially if you’re trying to instill confidence, that we make sensible paths to when we can go back to normal. We shouldn’t be pegging that on an unrealistic goal like reaching herd immunity. I’m still cautiously optimistic that my original forecast in February, for a return to normal in the summer, will be valid.”

In early March, he packed up shop entirely—he figured he’d made what contribution he could. “I wanted to step back and let the other modelers and experts do their work,” he says. “I don’t want to muddle the space.”

He’s still keeping an eye on the data, doing research and analysis—on the variants, the vaccine rollout, and the fourth wave. “If I see anything that’s particularly troubling or worrisome that I think people aren’t talking about, I’ll definitely post it,” he says. But for the time being he is focusing on other projects, such as “YOLO Stocks,” a stock ticker analytics platform. His main pandemic work is as a member of the World Health Organization’s technical advisory group on covid-19 mortality assessment, where he shares his outsider’s expertise.

“I’ve definitely learned a lot this past year,” Gu says. “It was very eye-opening.”

Lesson #1: Focus on fundamentals

“From the data science perspective, my models have shown the importance of simplicity, which is often undervalued,” says Gu. His death forecasting model was simple in not only its design—the SEIR component with a machine-learning layer—but also its very pared-down, “bottom-up” approach regarding input data. Bottom-up means “start from the bare-bones minimum and add complexity as needed,” he says. “My model only uses past deaths to predict future deaths. It doesn’t use any other real data source.”

Gu noticed that other models drew on an eclectic variety of data about cases, hospitalizations, testing, mobility, mask use, comorbidities, age distribution, demographics, pneumonia seasonality, annual pneumonia death rate, population density, air pollution, altitude, smoking data, self-reported contacts, airline passenger traffic, point of care, smart thermometers, Facebook posts, Google searches, and more.

“There is this belief that if you add more data to the model, or make it more sophisticated, then the model will do better,” he says. “But in real-world situations like the pandemic, where data is so noisy, you want to keep things as simple as possible.”

“I decided early on that past deaths are the best predictor of future deaths. It’s very simple: input, output. Adding more data sources will just make it more difficult to extract the signal from the noise.”

Lesson #2: Minimize assumptions

Gu considers that he had an advantage in approaching the problem with a blank slate. “My goal was to just follow the data on covid to learn about covid,” he says. “That’s one of the main benefits of an outsider’s perspective.”

But not being an epidemiologist, Gu also had to be sure that he wasn’t making incorrect or inaccurate assumptions. “My role is to design the model such that it can learn the assumptions for me,” he says.

“When new data comes along that goes against our beliefs, sometimes we tend to overlook that new data or ignore it, and that can cause repercussions down the road,” he notes. “I certainly found myself falling victim to that, and I know that lots of other people have as well.”

“So being aware of the potential bias that we have and recognizing it, and being able to adjust our priors—adjusting our beliefs if new data disproves them—is really important, especially in a fast-moving environment like what we’ve seen with covid.”

Lesson #3: Test the hypothesis

“What I’ve seen over the last few months is that anyone can make claims or manipulate data to fit the narrative of what they want to believe in,” Gu says. This highlights the importance of simply making testable hypotheses.

“For me, that is the whole basis of my projections and forecasts. I have a set of assumptions, and if those assumptions are true, then this is what we predict will happen in the future,” he says. “And if the assumptions end up being wrong, then of course we have to admit that the assumptions we make are not true and adjust accordingly. If you don’t make testable hypotheses, then there is no way to show whether you are actually right or wrong.”

Lesson #4: Learn from mistakes

“Not all the projections that I made were correct,” Gu says. In May 2020, he projected 180,000 deaths in the US by August. “That is much higher than we saw,” he recalls. His testable hypothesis proved incorrect—“and that forced me to adjust my assumptions.”

At the time, Gu was using a fixed infection fatality rate of approximately 1% as a constant in the SEIR simulator. When in the summer he lowered the infection fatality rate to about 0.4% (and later to about 0.7%), his projections returned to a more realistic range.

Lesson #5: Engage critics

“Not everyone will agree with my ideas, and I welcome that,” says Gu, who used Twitter to post his projections and analysis. “I try to respond to people as much as I can, and defend my position, and debate with people. It forces you to think about what your assumptions are and why you think they are correct.”

“It goes back to confirmation bias,” he says. “If I am not able to properly defend my position, then is it really the right claim, and should I be making these claims? It helps me understand, by engaging with other people, how to think about these problems. When other people present evidence that counters my positions, I have to be able to acknowledge when I may be incorrect in some of my assumptions. And that has actually helped me tremendously in improving my model.”

Lesson #6: Exercise healthy skepticism

“I am now much more skeptical of science—and it’s not a bad thing,” Gu says. “I think it’s important to always question results, but in a healthy way. It’s a fine line. Because a lot of people just flat-out reject science, and that’s not the way to go about it either.”

“But I think it’s also important to not just blindly trust science,” he continues. “Scientists aren’t perfect.” It is appropriate, he says, if something doesn’t seem right, to ask questions and find explanations. “It’s important to have different perspectives. If there is anything we’ve learned over the past year, it’s that no one is 100% right all the time.”

“I can’t speak for all scientists, but my job is to cut through all the noise and get to the truth,” he says. “I’m not saying I’ve been perfect over this past year. I’ve been wrong many times. But I think we can all learn to approach science as a method of finding the truth, rather than the truth itself.”