The race to understand the exhilarating, dangerous world of language AI

On May 18, Google CEO Sundar Pichai announced an impressive new tool: an AI system called LaMDA that can chat to users about any subject.

To start, Google plans to integrate LaMDA into its main search portal, its voice assistant, and Workspace, its collection of cloud-based work software that includes Gmail, Docs, and Drive. But the eventual goal, said Pichai, is to create a conversational interface that allows people to retrieve any kind of information (text, visual, audio) across all of Google’s products just by asking.

LaMDA’s rollout signals yet another way in which language technologies are becoming enmeshed in our day-to-day lives. But Google’s flashy presentation belied the ethical debate that now surrounds such cutting-edge systems. LaMDA is what’s known as a large language model (LLM)—a deep-learning algorithm trained on enormous amounts of text data.

Studies have already shown how racist, sexist, and abusive ideas are embedded in these models. They associate categories like doctors with men and nurses with women; good words with white people and bad ones with Black people. Probe them with the right prompts, and they also begin to encourage things like genocide, self-harm, and child sexual abuse. Because of their size, they have a shockingly high carbon footprint. Because of their fluency, they easily confuse people into thinking a human wrote their outputs, which experts warn could enable the mass production of misinformation.

In December, Google ousted its ethical AI co-lead Timnit Gebru after she refused to retract a paper that made many of these points. A few months later, after wide-scale denunciation of what an open letter from Google employees called the company’s “unprecedented research censorship,” it fired Gebru’s coauthor and co-lead Margaret Mitchell as well.

It’s not just Google that is deploying this technology. The highest-profile language models so far have been OpenAI’s GPT-2 and GPT-3, which spew remarkably convincing passages of text and can even be repurposed to finish off music compositions and computer code. Microsoft now exclusively licenses GPT-3 to incorporate into yet-unannounced products. Facebook has developed its own LLMs for translation and content moderation. And startups are creating dozens of products and services based on the tech giants’ models. Soon enough, all of our digital interactions—when we email, search, or post on social media—will be filtered through LLMs.

Unfortunately, very little research is being done to understand how the flaws of this technology could affect people in real-world applications, or to figure out how to design better LLMs that mitigate these challenges. As Google underscored in its treatment of Gebru and Mitchell, the few companies rich enough to train and maintain LLMs have a heavy financial interest in declining to examine them carefully. In other words, LLMs are increasingly being integrated into the linguistic infrastructure of the internet atop shaky scientific foundations.

More than 500 researchers around the world are now racing to learn more about the capabilities and limitations of these models. Working together under the BigScience project led by Huggingface, a startup that takes an “open science” approach to understanding natural-language processing (NLP), they seek to build an open-source LLM that will serve as a shared resource for the scientific community. The goal is to generate as much scholarship as possible within a single focused year. Their central question: How and when should LLMs be developed and deployed to reap their benefits without their harmful consequences?

“We can’t really stop this craziness around large language models, where everybody wants to train them,” says Thomas Wolf, the chief science officer at Huggingface, who is co-leading the initiative. “But what we can do is try to nudge this in a direction that is in the end more beneficial.”

Stochastic parrots

In the same month that BigScience kicked off its activities, a startup named Cohere quietly came out of stealth. Started by former Google researchers, it promises to bring LLMs to any business that wants one—with a single line of code. It has developed a technique to train and host its own model with the idle scraps of computational resources in a data center, which holds down the costs of renting out the necessary cloud space for upkeep and deployment.

Among its early clients is the startup Ada Support, a platform for building no-code customer support chatbots, which itself has clients like Facebook and Zoom. And Cohere’s investor list includes some of the biggest names in the field: computer vision pioneer Fei-Fei Li, Turing Award winner Geoffrey Hinton, and Apple’s head of AI, Ian Goodfellow.

Cohere is one of several startups and initiatives now seeking to bring LLMs to various industries. There’s also Aleph Alpha, a startup based in Germany that seeks to build a German GPT-3; an unnamed venture started by several former OpenAI researchers; and the open-source initiative Eleuther, which recently launched GPT-Neo, a free (and somewhat less powerful) reproduction of GPT-3.

But it’s the gap between what LLMs are and what they aspire to be that has concerned a growing number of researchers. LLMs are effectively the world’s most powerful autocomplete technologies. By ingesting millions of sentences, paragraphs, and even samples of dialogue, they learn the statistical patterns that govern how each of these elements should be assembled in a sensible order. This means LLMs can enhance certain activities: for example, they are good for creating more interactive and conversationally fluid chatbots that follow a well-established script. But they do not actually understand what they’re reading or saying. Many of the most advanced capabilities of LLMs today are also available only in English.
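The "autocomplete" idea can be made concrete with a toy sketch. The Python below is not how LaMDA or GPT-3 work internally (those are deep neural networks trained on vastly more text). It is a minimal word-pair counter, assuming a tiny made-up corpus, and the function names `train_bigram_model` and `autocomplete` are hypothetical. It only illustrates the underlying principle of predicting the next word from statistical patterns in the training text.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def autocomplete(model, prompt, max_new_words=5):
    """Extend the prompt by repeatedly appending the most frequent next word."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_new_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # the last word was never seen with a successor
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Made-up training text (an assumption for illustration only).
corpus = (
    "the doctor said he was busy . "
    "the doctor said he was late . "
    "the nurse said she was busy ."
)
model = train_bigram_model(corpus)
print(autocomplete(model, "the nurse", max_new_words=3))
# prints: the nurse said he was
```

In this sketch the completion swaps "she" for "he" simply because "said he" appears more often than "said she" in the training text. The model reproduces whatever patterns dominate its data without any grasp of what the words mean.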

Among other things, this is what Gebru, Mitchell, and five other scientists warned about in their paper, which calls LLMs “stochastic parrots.” “Language technology can be very, very useful when it is appropriately scoped and situated and framed,” says Emily Bender, a professor of linguistics at the University of Washington and one of the coauthors of the paper. But the general-purpose nature of LLMs—and the persuasiveness of their mimicry—entices companies to use them in areas they aren’t necessarily equipped for.

In a recent keynote at one of the largest AI conferences, Gebru tied this hasty deployment of LLMs to consequences she’d experienced in her own life. Gebru was born and raised in Ethiopia, where an escalating war has ravaged the northernmost Tigray region. Ethiopia is also a country where 86 languages are spoken, nearly all of them unaccounted for in mainstream language technologies.

Despite LLMs having these linguistic deficiencies, Facebook relies heavily on them to automate its content moderation globally. When the war in Tigray first broke out in November, Gebru saw the platform flounder to get a handle on the flurry of misinformation. This is emblematic of a persistent pattern that researchers have observed in content moderation. Communities that speak languages not prioritized by Silicon Valley suffer the most hostile digital environments.

Gebru noted that this isn’t where the harm ends, either. When fake news, hate speech, and even death threats aren’t moderated out, they are then scraped as training data to build the next generation of LLMs. And those models, parroting back what they’re trained on, end up regurgitating these toxic linguistic patterns on the internet.

In many cases, researchers haven’t investigated thoroughly enough to know how this toxicity might manifest in downstream applications. But some scholarship does exist. In her 2018 book Algorithms of Oppression, Safiya Noble, an associate professor of information and African-American studies at the University of California, Los Angeles, documented how biases embedded in Google search perpetuate racism and, in extreme cases, perhaps even motivate racial violence.

“The consequences are pretty severe and significant,” she says. Google isn’t just the primary knowledge portal for average citizens. It also provides the information infrastructure for institutions, universities, and state and federal governments.

Google already uses an LLM to optimize some of its search results. With its latest announcement of LaMDA and a recent proposal it published in a preprint paper, the company has made clear it will only increase its reliance on the technology. Noble worries this could make the problems she uncovered even worse: “The fact that Google’s ethical AI team was fired for raising very important questions about the racist and sexist patterns of discrimination embedded in large language models should have been a wake-up call.”

BigScience

The BigScience project began in direct response to the growing need for scientific scrutiny of LLMs. In observing the technology’s rapid proliferation and Google’s attempted censorship of Gebru and Mitchell, Wolf and several colleagues realized it was time for the research community to take matters into its own hands.

Inspired by open scientific collaborations like CERN in particle physics, they conceived of an idea for an open-source LLM that could be used to conduct critical research independent of any company. In April of this year, the group received a grant to build it using the French government’s supercomputer.

At tech companies, LLMs are often built by only half a dozen people who have primarily technical expertise. BigScience wanted to bring in hundreds of researchers from a broad range of countries and disciplines to participate in a truly collaborative model-construction process. Wolf, who is French, first approached the French NLP community. From there, the initiative snowballed into a global operation encompassing more than 500 people.

The collaborative is now loosely organized into a dozen working groups and counting, each tackling different aspects of model development and investigation. One group will measure the model’s environmental impact, including the carbon footprint of training and running the LLM and factoring in the life-cycle costs of the supercomputer. Another will focus on developing responsible ways of sourcing the training data—seeking alternatives to simply scraping data from the web, such as transcribing historical radio archives or podcasts. The goal here is to avoid toxic language and nonconsensual collection of private information.

Other working groups are dedicated to developing and evaluating the model’s “multilinguality.” To start, BigScience has selected eight languages or language families, including English, Chinese, Arabic, Indic (including Hindi and Urdu), and Bantu (including Swahili). The plan is to work closely with every language community to map out as many of its regional dialects as possible and ensure that its distinct data privacy norms are respected. “We want people to have a say in how their data is used,” says Yacine Jernite, a Huggingface researcher.

The point is not to build a commercially viable LLM to compete with the likes of GPT-3 or LaMDA. The model will be too big and too slow to be useful to companies, says Karën Fort, an associate professor at the Sorbonne. Instead, the resource is being designed purely for research. Every data point and every modeling decision is being carefully and publicly documented, so it’s easier to analyze how all the pieces affect the model’s outcomes. “It’s not just about delivering the final product,” says Angela Fan, a Facebook researcher. “We envision every single piece of it as a delivery point, as an artifact.”

The project is undoubtedly ambitious: more globally expansive and collaborative than any the AI community has seen before. Coordinating so many researchers is itself a challenge. (In fact, there’s a working group for that, too.) What’s more, every single researcher is contributing on a volunteer basis. The grant from the French government covers only computational, not human, resources.

But researchers say the shared need that brought the community together has galvanized an impressive level of energy and momentum. Many are optimistic that by the end of the project, which will run until May of next year, they will have produced not only deeper scholarship on the limitations of LLMs but also better tools and practices for building and deploying them responsibly.

The organizers hope this will inspire more people within industry to incorporate those practices into their own LLM strategy, though they are the first to admit they are being idealistic. If anything, the sheer number of researchers involved, including many from tech giants, will help establish new norms within the NLP community.

In some ways the norms have already shifted. In response to conversations around the firing of Gebru and Mitchell, Cohere heard from several of its clients that they were worried about the technology’s safety. Cohere now includes a page on its website featuring a pledge to continuously invest in technical and non-technical research to mitigate the possible harms of its model. It says it will also assemble an advisory council made up of external experts to help it create policies on the permissible use of its technologies.

“NLP is at a very important turning point,” says Fort. That’s why BigScience is exciting. It allows the community to push the research forward and provide a hopeful alternative to the status quo within industry: “It says, ‘Let’s take another pass. Let’s take it together—to figure out all the ways and all the things we can do to help society.’”

“I want NLP to help people,” she says, “not to put them down.”
