Google’s new Project Astra could be generative AI’s killer app
Google DeepMind has announced an impressive grab bag of new products and prototypes that may just let it seize back its lead in the race to turn generative artificial intelligence into a mass-market concern.
Top billing goes to Gemini 2.0—the latest iteration of Google DeepMind’s family of multimodal large language models, now redesigned around the ability to control agents—and a new version of Project Astra, the experimental everything app that the company teased at Google I/O in May.
MIT Technology Review got to try out Astra in a closed-door live demo last week. It was a stunning experience, but there’s a gulf between polished promo and live demo.
Astra uses Gemini 2.0’s built-in agent framework to answer questions and carry out tasks via text, speech, image, and video, calling up existing Google apps like Search, Maps, and Lens when it needs to. “It’s merging together some of the most powerful information retrieval systems of our time,” says Bibo Xu, product manager for Astra.
Gemini 2.0 and Astra are joined by Mariner, a new agent built on top of Gemini that can browse the web for you; Jules, a new Gemini-powered coding assistant; and Gemini for Games, an experimental assistant that you can chat to and ask for tips as you play video games.
(And let’s not forget that in the last week Google DeepMind also announced Veo, a new video generation model; Imagen 3, a new version of its image generation model; and Willow, a new kind of chip for quantum computers. Whew. Meanwhile, CEO Demis Hassabis was in Sweden yesterday receiving his Nobel Prize.)
Google DeepMind claims that Gemini 2.0 is twice as fast as the previous version, Gemini 1.5, and outperforms it on a number of standard benchmarks, including MMLU-Pro, a large set of multiple-choice questions designed to test the abilities of large language models across a range of subjects, from math and physics to health, psychology, and philosophy.
But the margins between top-end models like Gemini 2.0 and those from rival labs like OpenAI and Anthropic are now slim. These days, advances in large language models are less about how good they are and more about what you can do with them.
And that’s where agents come in.
Hands on with Project Astra
Last week I was taken through an unmarked door on an upper floor of a building in London’s King’s Cross district into a room with strong secret-project vibes. The word “ASTRA” was emblazoned in giant letters across one wall. Xu’s dog, Charlie, the project’s de facto mascot, roamed between desks where researchers and engineers were busy building a product that Google is betting its future on.
“The pitch to my mum is that we’re building an AI that has eyes, ears, and a voice. It can be anywhere with you, and it can help you with anything you’re doing” says Greg Wayne, co-lead of the Astra team. “It’s not there yet, but that’s the kind of vision.”
The official term for what Xu, Wayne, and their colleagues are building is “universal assistant.” Exactly what that means in practice, they’re still figuring out.
At one end of the Astra room were two stage sets that the team uses for demonstrations: a drinks bar and a mocked-up art gallery. Xu took me to the bar first. “A long time ago we hired a cocktail expert and we got them to instruct us to make cocktails,” said Praveen Srinivasan, another co-lead. “We recorded those conversations and used that to train our initial model.”
Xu opened a cookbook to a recipe for a chicken curry, pointed her phone at it, and woke up Astra. “Ni hao, Bibo!” said a female voice.
“Oh! Why are you speaking to me in Mandarin?” Xu asked her phone. “Can you speak to me in English, please?”
“My apologies, Bibo. I was following a previous instruction to speak in Mandarin. I will now speak in English as you have requested.”
Astra remembers previous conversations, Xu told me. It also keeps track of the previous 10 minutes of video. (There’s a remarkable moment in the promo video that Google put out in May when Astra tells the person giving the demo where she had left her glasses, having spotted them on a desk a few seconds earlier. But I saw nothing like this in the live demo.)
Back to the cookbook. Moving her phone camera over the page for a few seconds, Xu asked Astra to read the recipe and tell her what spices were in it. “I recall the recipe mentioning a teaspoon of black peppercorns, a teaspoon of hot chili powder, and a cinnamon stick,” it replied.
“I think you’re missing a few,” said Xu. “Take another look.”
“You are correct—I apologize. I also see ground turmeric and curry leaves in the ingredients.”
Seeing this tech in action, two things hit you straight away. First, it’s glitchy and often needs correcting. Second, those glitches can be corrected with just a few spoken words. You simply interrupt the voice, repeat your instructions, and move on. It feels more like coaching a child than butting heads with broken software.
Next Xu pointed her phone at a row of wine bottles and asked Astra to pick the one that would go best with the chicken curry. It went for a rioja and explained why. Xu asked how much a bottle would cost. Astra said it would need to use Search to look prices up online. A few seconds later it came back with its answer.
We moved to the art gallery, and Xu showed Astra a number of screens with famous paintings on them: the Mona Lisa, Munch’s The Scream, a Vermeer, a Seurat, and several others. “Ni hao, Bibo!” the voice said.
“You’re speaking to me in Mandarin again,” Xu said. “Try to speak to me in English, please.”
“My apologies, I seem to have misunderstood. Yes, I will respond in English.” (I should know better, but I could swear I heard the snark.)
It was my turn. Xu handed me her phone.
I tried to trip Astra up, but it was having none of it. I asked it what famous art gallery we were in, but it refused to hazard a guess. I asked why it had identified the paintings as replicas and it started to apologize for its mistake (Astra apologizes a lot). I was compelled to interrupt: “No, no—you’re right, it’s not a mistake. You’re correct to identify paintings on screens as fake paintings.” I couldn’t help feeling a bit bad: I’d confused an app that exists only to please.
When it works well, Astra is enthralling. The experience of striking up a conversation with your phone about whatever you’re pointing it at feels fresh and seamless. In a media briefing yesterday, Google DeepMind shared a video showing off other uses: reading an email on your phone’s screen to find a door code (and then reminding you of that code later), pointing a phone at a passing bus and asking where it goes, quizzing it about a public artwork as you walk past. This could be generative AI’s killer app.
And yet there’s a long way to go before most people get their hands on tech like this. There’s no mention of a release date. Google DeepMind has also shared videos of Astra working on a pair of smart glasses, but that tech is even further down the company’s wish list.
Mixing it up
For now, researchers outside Google DeepMind are keeping a close eye on its progress. “The way that things are being combined is impressive,” says Maria Liakata, who works on large language models at Queen Mary University of London and the Alan Turing Institute. “It’s hard enough to do reasoning with language, but here you need to bring in images and more. That’s not trivial.”
Liakata is also impressed by Astra’s ability to recall things it has seen or heard. She works on what she calls long-range context, getting models to keep track of information that they have come across before. “This is exciting,” says Liakata. “Even doing it in a single modality is exciting.”
But she admits that a lot of her assessment is guesswork. “Multimodal reasoning is really cutting-edge,” she says. “But it’s very hard to know exactly where they’re at, because they haven’t said a lot about what is in the technology itself.”
For Bodhisattwa Majumder, a researcher who works on multimodal models and agents at the Allen Institute for AI, that’s a key concern. “We absolutely don’t know how Google is doing it,” he says.
He notes that if Google were to be a little more open about what it is building, it would help consumers understand the limitations of the tech they could soon be holding in their hands. “They need to know how these systems work,” he says. “You want a user to be able to see what the system has learned about you, to correct mistakes, or to remove things you want to keep private.”
Liakata is also worried about the implications for privacy, pointing out that people could be monitored without their consent. “I think there are things I’m excited about and things that I’m concerned about,” she says. “There’s something about your phone becoming your eyes—there’s something unnerving about it.”
“The impact these products will have on society is so big that it should be taken more seriously,” she says. “But it’s become a race between the companies. It’s problematic, especially since we don’t have any agreement on how to evaluate this technology.”
Google DeepMind says it takes a long, hard look at privacy, security, and safety for all its new products. Its tech will be tested by teams of trusted users for months before it hits the public. “Obviously, we’ve got to think about misuse. We’ve got to think about, you know, what happens when things go wrong,” says Dawn Bloxwich, director of responsible development and innovation at the company. “There’s huge potential. The productivity gains are huge. But it is also risky.”
No team of testers can anticipate all the ways that people will use and misuse new technology. So what’s the plan for when the inevitable happens? Companies need to design products that can be recalled or switched off just in case, says Bloxwich: “If we need to make changes quickly or pull something back, then we can do that.”