I am really good at hearing people’s voices in a crowd. I can tell who is talking 50 feet away in the sea of cubicles at work. But when you try to talk to a smart speaker from 10 feet away, it can be challenging to say the least. And if there’s any kind of background noise, it becomes all but impossible. But Google may have a surprisingly intuitive solution. Google researchers have developed a deep learning system that can pick out specific voices by looking at people’s faces while they’re speaking.
The team trained its neural network model to recognize individual people speaking on their own, then created virtual “parties” to teach the AI to separate overlapping voices into distinct audio tracks. Insane, right? Not only that, but these “parties” simulated real-world scenarios, background noise and all. The results are incredible and can be seen below.
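The core of that training trick is simple: if you already have clean recordings of individual speakers, you can sum them (plus some noise) to manufacture a noisy “party,” and the original clean tracks become the ground-truth targets the network learns to recover. Here is a minimal sketch of that mixing step using NumPy; the function name and the synthetic sine-wave “speakers” are illustrative assumptions, not Google’s actual pipeline.

```python
import numpy as np

def make_synthetic_mixture(clean_tracks, noise, noise_scale=0.3):
    """Sum several single-speaker waveforms plus scaled background noise
    into one 'cocktail party' mixture. The clean inputs serve as the
    training targets for a separation model."""
    length = min(len(t) for t in clean_tracks + [noise])
    mixture = np.zeros(length, dtype=np.float32)
    for track in clean_tracks:
        mixture += track[:length]
    mixture += noise_scale * noise[:length]
    # Normalize if the summed signal would clip outside [-1, 1]
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture /= peak
    return mixture

# Two stand-in "speakers" (sine tones at different pitches) and random noise
sr = 16000  # 1 second of audio at 16 kHz
t = np.linspace(0, 1, sr, endpoint=False)
speaker_a = 0.5 * np.sin(2 * np.pi * 220 * t)
speaker_b = 0.5 * np.sin(2 * np.pi * 440 * t)
noise = np.random.default_rng(0).uniform(-1, 1, sr).astype(np.float32)

party = make_synthetic_mixture([speaker_a, speaker_b], noise)
print(party.shape)  # one mixed waveform, same length as the inputs
```

Because the mixture is built from known ingredients, the model can be scored directly on how well its separated output matches each original clean track — no hand-labeled data required.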
But what does this mean? It means AI is getting better. It means that Google identified a problem and came up with a possible solution. This particular solution lets AI generate a clean audio track for one person just by focusing on their face, but in the future, this kind of technology could enable some other incredible advancements. What, exactly? I don’t have a crystal ball, so I can’t see into the future.
Google is currently looking at how this feature can be used in its products, and there are more than a few prime candidates. It could benefit video chat services like Hangouts or Duo when you’re talking in a noisy room. It could also be useful for voice enhancement in video recording. From a noise-cancellation perspective, this could be a big technological breakthrough.
Where I find this technology most interesting is how it pertains to people with disabilities. There are huge implications for accessibility. Think about it: a camera could be linked to someone’s hearing aids, and the hearing aids would boost the sound of whoever is in front of you. It would require someone to be wearing the camera, but I don’t think we’re that far off from these kinds of wearables being “normal” or mainstream. Can you say Google Glass? This could also make closed captioning more effective. There is a huge need to have more videos online captioned for people who are Deaf or hard of hearing.
All these possible benefits bring with them some concerns. There are potential privacy issues, as the technology could be used to eavesdrop in public. But it also wouldn’t be too difficult to limit the voice separation to people who have clearly given their consent. It will be interesting to see where this goes and what additional uses it finds. Like I said – this could have some pretty big technological implications in the future.