Imagine you’re at a concert with your significant other and want to say something special when their favorite song starts to play. However, the music is so loud that they won’t be able to hear your special words. What can you do?
You’ll most likely have to postpone it, because talking and listening to someone in a loud, crowded environment is often challenging. But, you know what? AI can solve this challenge too.
A team of researchers at the University of Washington (UW) has developed an intriguing AI system that lets you single out and listen to a specific person in a crowded environment using ordinary noise-canceling headphones.
All you need to do is look at the person, press a button, and enroll them. The AI system, called “Target Speech Hearing,” removes all the surrounding noise and sounds. You can then talk to and listen to the enrolled person even when they are no longer facing you or have disappeared into the crowd.
“As urban environments get more noisy, this technology gives us back some control over our acoustic scene and what we want to focus on. This can also be very beneficial for hearing aids for folks who have hearing loss,” Shyam Gollakota, one of the researchers and Head of the Mobile Intelligence Lab at the University of Washington, told ZME Science.
How does target speech hearing work?
Commercially available noise-canceling headphones eliminate ambient noise, allowing you to listen to music undisturbed. However, you can’t use them to isolate the sound of a particular person or object. This is where target speech hearing (TSH) can help.
Do you ever wonder why familiar voices, like those of a close friend or parent, stand out to us in crowded environments? This is because our brains are capable of focusing on sounds from a target source, given prior knowledge of what the source sounds like.
So, TSH works similarly to the human brain. It allows headphones to learn a target speaker’s voice and how it differs from the other voices in the environment. Here is how it works, step by step:
- A user wearing headphones equipped with TSH clicks a button on the headphones and looks at the target speaker for a few (two to five) seconds.
- During this time, the system captures a noisy audio sample of the target speaker through the headphones’ left and right microphones.
- The system uses this recording to extract the speaker’s voice characteristics even when there are other speakers and noises in the vicinity. This is called the enrollment stage.
- A neural network then learns the target voice’s characteristics from that two- to five-second sample.
Once the AI has learned the voice’s characteristics, it cancels all other sounds in the environment and plays just the enrolled speaker’s voice in real time, even as the listener moves around noisy places and no longer faces the speaker.
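The enroll-then-filter flow can be sketched in miniature. The Python sketch below (assuming NumPy) replaces the team’s neural network with a crude spectral fingerprint: `spectral_embedding` and `target_filter` are hypothetical names invented for illustration, and real speech separation is far more sophisticated, but the overall flow — capture a short sample, derive a voice signature, then pass only matching audio — is the same.

```python
import numpy as np

def spectral_embedding(audio, frame=256):
    """Average, normalized magnitude spectrum over short frames -- a crude
    stand-in for the learned speaker embedding a neural encoder would produce."""
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    emb = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-9)

def target_filter(mixture, enrollment, frame=256, threshold=0.5):
    """Pass frames whose spectrum resembles the enrolled voice; strongly
    attenuate the rest. Illustrative only -- not the UW team's pipeline."""
    target = spectral_embedding(enrollment, frame)
    n = len(mixture) // frame
    out = np.empty(n * frame)
    for i in range(n):
        seg = mixture[i * frame : (i + 1) * frame]
        spec = np.abs(np.fft.rfft(seg))
        spec = spec / (np.linalg.norm(spec) + 1e-9)
        gain = 1.0 if float(spec @ target) >= threshold else 0.05
        out[i * frame : (i + 1) * frame] = gain * seg
    return out

# Toy demo: the "speakers" take turns -- a 220 Hz tone stands in for the
# enrolled voice and a 3000 Hz tone for an interfering sound.
sr = 8000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 220 * t)
noise = np.sin(2 * np.pi * 3000 * t)
mixture = np.concatenate([voice, noise])
filtered = target_filter(mixture, voice)  # voice passes, noise is attenuated
```

In this toy version the “enrollment” is just an averaged spectrum and the “separation” a per-frame gain; the actual system must instead untangle voices that overlap in time and frequency, which is why a learned model is needed.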
“Since all this is happening in real-time, we effectively suppress all sounds except for say the chirping of the birds,” Gollakota said.
Unlike ChatGPT, TSH doesn’t need data centers
According to the researchers, when people talk about neural networks and artificial intelligence these days, they typically mean large language models like ChatGPT. Such models run in huge data centers. However, relying on data centers would make TSH impractical.
“So, we had to design special neural networks that can run on a smartphone and can extract the sound we care about in real-time. This is because the kind of sound intelligence one needs for this is likely something that even small insects have. So, what we are showing here is that we do not need a large neural model to achieve these tasks,” Gollakota told ZME Science.
The researchers demonstrated the system on a specific pair of commercial noise-canceling headphones, but it should work with most noise-canceling headphones. The technology could also be used in earbuds and hearing aids.
However, the AI also has some limitations. For instance, the TSH system can enroll only one speaker at a time, and it’s only able to enroll a speaker when there is no other loud voice coming from the same direction as the target speaker’s voice.
The researchers are working to overcome these limitations and plan to make the AI system commercially available through a startup.
“We are working on getting this to a much smaller form factor, e.g., a wireless earbud or a hearing aid. That would be transformative since it can then be included in billions of earbuds that folks use today,” Gollakota said.
The study is published in the ACM Digital Library.