A way to let robots learn by listening will make them more useful

Most AI-powered robots today use cameras to understand their surroundings and learn new tasks, but it’s becoming easier to train robots with sound too, helping them adapt to tasks and environments where visibility is limited.

Though sight is important, there are daily tasks where sound is actually more helpful, like listening to onions sizzling on the stove to tell whether the pan is at the right temperature. Training robots with audio has been done only in highly controlled lab settings, however, and the techniques have lagged behind other fast robot-teaching methods.

Researchers at the Robotics and Embodied AI Lab at Stanford University set out to change that. They first built a system for collecting audio data, consisting of a GoPro camera and a gripper with a microphone designed to filter out background noise. Human demonstrators used the gripper to perform a variety of household tasks, and the researchers then used the resulting data to teach robotic arms how to execute the tasks on their own. The team’s new training algorithms help robots gather clues from audio signals to perform more effectively.

“Thus far, robots have been training on videos that are muted,” says Zeyi Liu, a PhD student at Stanford and lead author of the study. “But there is so much helpful data in audio.”

To test how much more successful a robot can be if it’s capable of “listening,” the researchers chose four tasks: flipping a bagel in a pan, erasing a whiteboard, putting two Velcro strips together, and pouring dice out of a cup. In each task, sounds provide clues that cameras or tactile sensors struggle to pick up, like whether the eraser is properly contacting the whiteboard or whether the cup contains dice.

After demonstrating each task a couple of hundred times, the team compared the success rates of training with audio and training only with vision. The results, published in a paper on arXiv that has not been peer-reviewed, were promising. When using vision alone in the dice test, the robot could tell 27% of the time if there were dice in the cup, but that rose to 94% when sound was included.

It isn’t the first time audio has been used to train robots, says Shuran Song, the head of the lab that produced the study, but it’s a big step toward doing so at scale: “We are making it easier to use audio collected ‘in the wild,’ rather than being restricted to collecting it in the lab, which is more time consuming.”

The research signals that audio might become a more sought-after data source in the race to train robots with AI. Researchers are teaching robots faster than ever before using imitation learning, showing them hundreds of examples of tasks being done instead of hand-coding each one. If audio could be collected at scale using devices like the one in the study, it could give robots an entirely new “sense,” helping them adapt more quickly to environments where visibility is limited or not useful.

“It’s safe to say that audio is the most understudied modality for sensing [in robots],” says Dmitry Berenson, associate professor of robotics at the University of Michigan, who was not involved in the study. That’s because the bulk of research on training robots to manipulate objects has been for industrial pick-and-place tasks, like sorting objects into bins. Those tasks don’t benefit much from sound, instead relying on tactile or visual sensors. But as robots broaden into tasks in homes, kitchens, and other environments, audio will become increasingly useful, Berenson says.

Consider a robot trying to find which bag or pocket contains a set of keys, all with limited visibility. “Maybe even before you touch the keys, you hear them kind of jangling,” Berenson says. “That’s a cue that the keys are in that pocket instead of others.”

Still, audio has limits. The team points out that sound won’t be as useful with so-called soft or flexible objects like clothes, which don’t create as much usable audio. The robots also struggled to filter out their own motor noise during tasks, since that noise was not present in the training data produced by humans. To fix this, the researchers needed to add robot sounds (whirs, hums, and actuator noises) to the training sets so the robots could learn to tune them out.
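
For readers curious what adding robot sounds to human-collected training audio might look like in practice, here is a minimal sketch of that kind of noise mixing in Python. It is not the researchers' code: the function name `mix_in_robot_noise`, the `noise_gain` parameter, and the choice to represent audio as plain lists of samples between -1.0 and 1.0 are all assumptions made for illustration.

```python
import random


def mix_in_robot_noise(demo_audio, robot_noise, noise_gain=0.5, rng=None):
    """Overlay recorded robot noise on a human demonstration clip.

    Hypothetical helper for illustration; the study does not describe
    its mixing procedure at this level of detail. Both inputs are lists
    of float samples in [-1.0, 1.0], assumed to share one sample rate.
    """
    if not robot_noise:
        raise ValueError("robot_noise must contain at least one sample")
    rng = rng or random.Random()
    # Start at a random offset so each augmented clip gets a different noise segment.
    start = rng.randrange(len(robot_noise))
    mixed = []
    for i, sample in enumerate(demo_audio):
        # Wrap around so a short noise recording can cover a longer demonstration.
        noise = robot_noise[(start + i) % len(robot_noise)]
        value = sample + noise_gain * noise
        # Clamp to the valid sample range to avoid overflow after mixing.
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed
```

Applied to every demonstration clip before training, a step like this would expose a model to motor whirs and hums it would otherwise encounter only once it is running on the robot.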

The next step, Liu says, is to see how much better the models can get with more data, which could mean adding more microphones, collecting spatial audio, and incorporating microphones into other types of data-collection devices.