Siri hasn’t always been all that reliable. I remember once looking for a restaurant “near me” and getting some wildly off-base locations on a map. I know Apple has been trying to improve Siri, but it hasn’t had much luck. Recently, Apple’s Siri team published a new Machine Learning Journal entry that details some of the work behind making the voice-activated “Hey Siri” trigger respond to your voice alone. They previously documented part of this process last fall, but this is the first Machine Learning Journal entry of 2018, and it focuses on the challenge of speaker recognition.
In the previous entry, Apple indicated that the phrase was chosen mostly because a number of users were already using it naturally:
The phrase “Hey Siri” was originally chosen to be as natural as possible; in fact, it was so natural that even before this feature was introduced, users would invoke Siri using the home button and inadvertently prepend their requests with the words, “Hey Siri.”
The new journal entry outlines three main challenges with activating Siri by voice:

- The primary user saying a phrase similar to “Hey Siri”
- Another user saying “Hey Siri”, or
- Another user saying a phrase similar to “Hey Siri”
Confusing, right? By limiting activation to the primary user’s voice, the design ideally eliminates the last two of those three issues. The journal entry briefly touches on how Apple frames the problem:
We measure the performance of a speaker recognition system as a combination of an Imposter Accept (IA) rate and a False Reject (FR) rate. It is important, however, to distinguish (and equate) these values from those used to measure the quality of a key-phrase trigger system.
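To make those two metrics concrete, here is a minimal sketch (my own illustration, not Apple’s code) of how IA and FR rates could be computed from a set of scored trials, where each trial is a similarity score paired with a label saying whether the speaker really was the enrolled user:

```python
def ia_fr_rates(trials, threshold):
    """Return (imposter_accept_rate, false_reject_rate) at a given threshold.

    trials: list of (score, is_genuine_speaker) pairs.
    """
    imposter_scores = [s for s, genuine in trials if not genuine]
    genuine_scores = [s for s, genuine in trials if genuine]
    # Imposter Accept: an imposter's score clears the acceptance threshold.
    ia = sum(s >= threshold for s in imposter_scores) / len(imposter_scores)
    # False Reject: the enrolled user's score falls below the threshold.
    fr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return ia, fr

# Made-up scores for illustration only.
trials = [(0.9, True), (0.8, True), (0.4, True),
          (0.7, False), (0.3, False), (0.2, False), (0.1, False)]
print(ia_fr_rates(trials, threshold=0.5))  # (0.25, 0.3333...)
```

Raising the threshold trades one error for the other, which is why the quote stresses measuring the two rates as a pair rather than in isolation.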
Each Machine Learning Journal entry looks at Apple’s implementation in detail before touching on the unsolved problems with the feature, such as using “Hey Siri” in a noisy environment or a large room. Voice-activated Siri started with the iPhone 6s. Today, however, “Hey Siri” works on newer iPhones, iPads, and Apple Watches, and it’s the main way to control the HomePod. In the latest journal entry, Apple states:
One of our current research efforts is focused on understanding and quantifying the degradation in these difficult conditions in which the environment of an incoming test utterance is a severe mismatch from the existing utterances in a user’s speaker profile.
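The “speaker profile” mentioned here is built from a user’s enrollment utterances. The journal describes summarizing utterances as fixed-length speaker vectors; assuming that representation, a hedged sketch of scoring an incoming test utterance against a stored profile might look like this (the vectors and threshold below are invented for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def accepts(profile_vectors, test_vector, threshold=0.85):
    # Score the test utterance against each enrollment vector, average,
    # and accept only if the mean similarity clears the threshold.
    score = sum(cosine(v, test_vector) for v in profile_vectors) / len(profile_vectors)
    return score >= threshold

profile = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]]  # hypothetical enrollment utterances
print(accepts(profile, [0.85, 0.15, 0.15]))   # similar voice -> True
print(accepts(profile, [0.1, 0.9, 0.3]))      # different voice -> False
```

The “severe mismatch” Apple describes is exactly the failure mode of this kind of comparison: if the enrollment vectors were all captured in a quiet room, a test utterance shouted across a noisy kitchen lands far from the profile even when it’s the same person.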
This is an issue for voice recognition in general, but how are Amazon and Google handling it? Siri was arguably the first assistant of its kind, so why is it so difficult for Apple to get it right? Do Amazon and Google just not have these problems, or do we simply not hear about them? Sure, Alexa does respond when someone on TV says “Alexis,” but even I can’t hear clearly enough to tell that’s what is being said, so I can hardly expect a machine to tell the difference.
All of that said, I think this is good research, and I’m happy that Apple is looking into it. But Apple still has a lot more work to do on Siri, and it’s not just about voice recognition.