Why Google’s AI Overviews gets things wrong
When Google announced it was rolling out its artificial intelligence-powered search feature earlier this month, the company promised that “Google will do the googling for you.” The new feature, called AI Overviews, provides brief, AI-generated summaries highlighting key information and links on top of search results.
Unfortunately, AI systems are inherently unreliable. Within days of AI Overviews' release in the US, users were sharing examples of it suggesting that they add glue to pizza, that they eat at least one small rock a day, and that former US president Andrew Johnson earned university degrees between 1947 and 2012, despite dying in 1875.
On Thursday, Liz Reid, head of Google Search, announced that the company has been making technical improvements to the system to make it less likely to generate incorrect answers to users' queries. These include better detection mechanisms for nonsensical queries and limits on the inclusion of satire, humor, and user-generated content in responses, since such material could lead the system to offer misleading advice.
But why is AI Overviews returning unreliable, potentially dangerous information? And what, if anything, can be done to fix it?
How does AI Overviews work?
In order to understand why AI-powered search engines get things wrong, we need to look at how they’ve been optimized to work. We know that AI Overviews uses a new generative AI model in Gemini, Google’s family of large language models (LLMs), that’s been customized for Google Search. That model has been integrated with Google’s core web ranking systems and designed to pull out relevant results from its index of websites.
Most LLMs simply predict the next word (or token) in a sequence, which makes them appear fluent but also leaves them prone to making things up. They have no ground truth to rely on; instead, they choose each word purely on the basis of a statistical calculation, which leads to hallucinations. To get around this, the Gemini model in AI Overviews is highly likely to use an AI technique called retrieval-augmented generation (RAG), which allows an LLM to check specific sources outside the data it's been trained on, such as certain web pages, says Chirag Shah, a professor at the University of Washington who specializes in online search.
Once a user enters a query, it's checked against the documents that make up the system's information sources and used to generate a response. Because the system can match the original query to specific parts of web pages, it's able to cite where it drew its answer from—something normal LLMs cannot do.
One major upside of RAG is that the responses it generates to a user's queries should be more up to date, more factually accurate, and more relevant than those of a typical model that just generates an answer based on its training data. The technique is often used to try to prevent LLMs from hallucinating. (A Google spokesperson would not confirm whether AI Overviews uses RAG.)
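The retrieve-then-generate loop described above can be sketched in highly simplified form. Everything here—the toy corpus, the word-overlap scoring, the templated answer—is an illustrative stand-in; a production system uses learned rankers and a large language model, not these functions.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# The corpus, scoring, and answer template are invented for illustration.

CORPUS = {
    "doc1": "Mount Everest is the highest mountain above sea level.",
    "doc2": "The Nile is often cited as the longest river in the world.",
}

def retrieve(query: str, corpus: dict, k: int = 1) -> list:
    """Rank documents by naive word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

def generate(query: str, corpus: dict) -> str:
    """'Generate' an answer by quoting the top source and citing it."""
    doc_id = retrieve(query, corpus)[0]
    return f"{corpus[doc_id]} (source: {doc_id})"

print(generate("What is the highest mountain?", CORPUS))
```

The citation is the key point: because the answer is grounded in a specific retrieved document, the system can say where it came from, which a plain LLM cannot.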
So why does it return bad answers?
But RAG is far from foolproof. For an LLM using RAG to come up with a good answer, it has to both retrieve the information correctly and generate the response correctly. A bad answer results from one or both parts of the process failing.
In the case of AI Overviews recommending a pizza recipe that contains glue—drawing from a joke post from Reddit—it’s likely that the post appeared relevant to the user’s original query about cheese not sticking to pizza, but that something went wrong in the retrieval process, says Shah. “Just because it’s relevant doesn’t mean it’s right, and the generation part of the process doesn’t question that,” he says.
Similarly, if a RAG system comes across conflicting information, such as a policy handbook and an updated version of the same handbook, it's unable to work out which version to draw its response from. Instead, it may combine information from both to create a potentially misleading answer.
“The large language model generates fluent language based on the provided sources, but fluent language is not the same as correct information,” says Suzan Verberne, a professor at Leiden University who specializes in natural language processing.
The more specific a topic is, the higher the chance of misinformation in a large language model’s output, she says, adding: “This is a problem in the medical domain, but also education and science.”
According to the Google spokesperson, in many cases when AI Overviews returns incorrect answers, it's because there's not a lot of high-quality information available on the web for the query—or because the query most closely matches satirical sites or joke posts.
The spokesperson says that the vast majority of AI Overviews provide high-quality information and that many of the examples of bad answers were in response to uncommon queries, adding that AI Overviews containing potentially harmful, obscene, or otherwise violative content appeared in response to fewer than one in every 7 million unique queries. Google is continuing to remove AI Overviews on certain queries in accordance with its content policies.
It’s not just about bad training data
Although the pizza glue blunder is a good example of AI Overviews pointing to an unreliable source, the system can also generate misinformation from factually correct sources. Melanie Mitchell, an artificial intelligence researcher at the Santa Fe Institute in New Mexico, googled “How many Muslim presidents has the US had?”, to which AI Overviews responded: “The United States has had one Muslim president, Barack Hussein Obama.”
While Barack Obama himself is not Muslim, making AI Overviews’ response wrong, it drew its information from a chapter in an academic book titled Barack Hussein Obama: America’s First Muslim President? So not only did the AI system miss the entire point of the essay, it interpreted it in the exact opposite way, says Mitchell. “There’s a few problems here for the AI; one is finding a good source that’s not a joke, but another is interpreting what the source is saying correctly,” she adds. “This is something that AI systems have trouble doing, and it’s important to note that even when it does get a good source, it can still make errors.”
Can the problem be fixed?
Ultimately, we know that AI systems are unreliable, and so long as they are generating text word-by-word based on word probabilities, hallucination is always going to be a risk. And while AI Overviews is likely to improve as Google tweaks it behind the scenes, we can never be certain it’ll be 100% accurate.
Google has said that it's restricting when AI Overviews appear for queries where they were proving less helpful, and has added additional “triggering refinements” for queries related to health. The company could add a step to the information retrieval process designed to flag when a query is risky, and have the system refuse to generate an answer in those instances, says Verberne. Google doesn't aim to show AI Overviews for explicit or dangerous topics, or for queries that indicate a vulnerable situation, the Google spokesperson says.
Techniques like reinforcement learning from human feedback, which incorporates human judgments into an LLM's training, can also help to improve the quality of its answers.
Similarly, LLMs could be trained specifically for the task of identifying when a question cannot be answered, and it could also pay to instruct the LLM to carefully assess the quality of a retrieved document before generating an answer, Verberne says. “Proper instruction helps a lot!”
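In practice, the kind of instruction Verberne describes amounts to constructing a prompt that asks the model to vet its source and to decline rather than guess. The wording below is invented for illustration; it sketches the idea, not Google's actual instruction to the model.

```python
def build_prompt(query: str, retrieved_text: str) -> str:
    """Assemble an instruction that asks the model to assess its source
    and to refuse rather than guess. The wording is illustrative."""
    return (
        "Before answering, assess whether the source below is reliable, "
        "serious (not satire or a joke), and actually answers the question. "
        "If it does not, reply exactly: 'I cannot answer this reliably.'\n\n"
        f"Source:\n{retrieved_text}\n\n"
        f"Question: {query}\nAnswer:"
    )

# A prompt built this way gives the model an explicit escape hatch
# instead of forcing it to generate an answer from a joke post.
prompt = build_prompt(
    "How do I make cheese stick to pizza?",
    "Reddit joke post: just add glue to the sauce.",
)
print(prompt)
```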
Although Google has added a label to AI Overviews answers reading “Generative AI is experimental,” it should consider making it much clearer that the feature is in beta and emphasizing that it is not ready to provide fully reliable answers, says Shah. “Until it's no longer beta—which it currently definitely is, and will be for some time—it should be completely optional. It should not be forced on us as part of core search.”