The Download: GPT-4o’s polluted Chinese training data, and astronomy’s AI challenge

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o last Monday, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. GPT-4o is supposed to be better than its predecessors at handling multi-language tasks, and many of the advances were achieved through a new tokenization tool that does a better job compressing texts in non-English languages.
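
To make the idea of tokens and compression concrete, here is a minimal sketch in Python. It is not OpenAI's tokenizer: it uses a simple greedy longest-match rule as a stand-in for the more involved methods used to train production tokenizers, and both vocabularies are hypothetical, invented purely for illustration. It shows that when a longer phrase exists as a single vocabulary entry, the same text is split into fewer tokens.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a toy vocabulary.

    Illustrative stand-in only; real LLM tokenizers are trained differently.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies, made up for this example.
small_vocab = {"free", "video"}
large_vocab = {"free", "video", "free video", "hello"}

text = "free video hello"
small_tokens = tokenize(text, small_vocab)
large_tokens = tokenize(text, large_vocab)

print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

With the smaller vocabulary the text falls apart into many short pieces, while the larger one covers it in a handful of tokens. The same mechanism means that whatever phrases are common in the data used to build the vocabulary, useful or not, can end up as tokens of their own.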

But, at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases—and experts say that’s likely due to insufficient data cleaning and filtering before the tokenizer was trained. If left unresolved, it could lead to hallucinations, poor performance, and misuse. Read the full story.

—Zeyi Yang

Astronomers are enlisting AI to prepare for a data downpour

In deserts across Australia and South Africa, astronomers are planting forests of metallic detectors that will together scour the cosmos for radio signals. When it boots up in five years or so, the Square Kilometer Array Observatory will look for new information about the universe’s first stars and the different stages of galactic evolution.

But after synching hundreds of thousands of dishes and antennas, astronomers will quickly face a new challenge: combing through some 300 petabytes of cosmological data a year—enough to fill a million laptops. So in preparation for the information deluge, astronomers are turning to AI for assistance. Read the full story.

—Zack Savitsky

Join us for Future Compute

If you’re interested in learning more about how to navigate the rapid changes in technology, Future Compute is the conference for you. It’s designed to help leaders develop strategic vision, agility, and a deep understanding of emerging technologies, and it will be held tomorrow, May 21, on MIT’s campus. Join us in person or online by registering today.

EmTech Digital kicks off this week

The pace of AI development is truly breakneck these days—and we’ve got a sneak peek at what’s coming next. If you want to learn about how Google plans to develop and deploy AI, come and hear from its vice president of AI, Jay Yagnik, at our flagship AI conference, EmTech Digital.

We’ll hear from OpenAI about its video generation model Sora too, and Nick Clegg, Meta’s president of global affairs, will also join MIT Technology Review’s executive editor Amy Nordrum for an exclusive interview on stage.

It’ll be held at the MIT campus and streamed live online this week on May 22-23. Readers of The Download get 30% off tickets with the code DOWNLOADD24—here’s how to register. See you there!

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Apple is teaming up with OpenAI to overhaul iOS 18
In the hopes it’ll give Apple an edge over rivals Google and Microsoft. (Bloomberg $)
+ OpenAI and Google recently launched their own supercharged AI assistants. (MIT Technology Review)

2 Blue Origin took six customers to the edge of space on Sunday
It’s the company’s first tourist flight in almost two years. (CNN)
+ Space tourism hasn’t exactly got off the ground yet. (WP $)

3 How TikTok users are skirting around its weight-loss drug promotion ban
Talking in code is becoming increasingly common. (WP $)
+ A new kind of weight-loss therapy is on the horizon. (Fast Company $)
+ What don’t we know about Ozempic? Quite a lot, actually. (Vox)
+ Weight-loss injections have taken over the internet. But what does this mean for people IRL? (MIT Technology Review)

4 Chinese companies are pushing ‘AI-in-a-box’ products
They’re sold as all-in-one cloud computing solutions, much to cloud providers’ chagrin. (FT $)

5 Microscopic blood clots could explain the severity of long covid
But doctors are calling for rigorous peer review before any solid conclusions can be drawn. (Undark Magazine)
+ Scientists are finding signals of long covid in blood. They could lead to new treatments. (MIT Technology Review)

6 How hackers saved stalled Polish trains
It looks as though the locomotives’ manufacturer could be behind the breakdown. (WSJ $)

7 We’re getting closer to making an HIV vaccine
A successful trial is giving researchers new hope. (Wired $)
+ Three people were gene-edited in an effort to cure their HIV. The result is unknown. (MIT Technology Review)

8 Most healthy people don’t need to track their blood glucose
That doesn’t stop companies trying to sell you their monitoring services, though. (The Guardian)

9 Filming strangers in public is not okay
And yet, people keep doing it. Why? (Vox)

10 Beware the spread of AI slop
Spam is no longer a strong enough term—the latest wave of AI images is slop. (The Guardian)

Quote of the day

“It’s a process of trust collapsing bit by bit, like dominoes falling one by one.”

—An anonymous OpenAI insider tells Vox that safety-minded employees are losing faith in the company’s CEO Sam Altman.

The big story

What does GPT-3 “know” about me?

August 2022

One of the biggest stories in tech is the rise of large language models that produce text that reads like a human might have written it.

These models’ power comes from being trained on troves of publicly available human-created text hoovered up from the internet. If you’ve posted anything even remotely personal in English on the internet, chances are your data might be part of some of the world’s most popular LLMs.

Melissa Heikkilä, MIT Technology Review’s AI reporter, wondered what data these models might have on her—and how it could be misused. So she put OpenAI’s GPT-3 to the test. Read about what she found.

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or tweet ’em at me.)

+ Sea urchins just love tiny hats.
+ There’s nothing better than a Lego optical illusion of sorts.
+ Waking up each morning can be tough. Maybe a better alarm is the way forward?
+ Out of the way: it’s the annual worm charming championships!