Apple Takes a Closer Look at How the 'Hey Siri' Feature Works
In an interesting and detailed post to its Machine Learning Journal blog, Apple provides a unique look at how the always-listening “Hey Siri” feature works.
The "Hey Siri" workflow.
Apple uses a machine learning pipeline built around a deep neural network (DNN) to power the feature, which is available on most newer iPhone and iPad models as well as the Apple Watch:
The microphone in an iPhone or Apple Watch turns your voice into a stream of instantaneous waveform samples, at a rate of 16000 per second. A spectrum analysis stage converts the waveform sample stream to a sequence of frames, each describing the sound spectrum of approximately 0.01 sec. About twenty of these frames at a time (0.2 sec of audio) are fed to the acoustic model, a Deep Neural Network (DNN) which converts each of these acoustic patterns into a probability distribution over a set of speech sound classes: those used in the “Hey Siri” phrase, plus silence and other speech, for a total of about 20 sound classes.
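To make that pipeline concrete, here is a minimal Swift sketch of the flow Apple describes: 16 kHz samples are grouped into roughly 10 ms frames, and a sliding window of about 20 frames is scored by an acoustic model that outputs a distribution over roughly 20 sound classes. The function names (`spectrumFrame`, `acousticModel`, `scoreStream`) and their placeholder bodies are hypothetical stand-ins, not Apple's implementation.

```swift
// Minimal sketch of the front-end pipeline described above.
// The helper names and their bodies are hypothetical placeholders.
import Foundation

let sampleRate = 16_000                 // waveform samples per second
let frameSamples = sampleRate / 100     // ~0.01 s of audio per frame
let windowFrames = 20                   // ~0.2 s of audio per model input
let soundClassCount = 20                // "Hey Siri" phones + silence + other speech

// Stand-in for the spectrum analysis stage: one frame of samples in,
// one spectral feature vector out.
func spectrumFrame(_ samples: ArraySlice<Float>) -> [Float] {
    // Placeholder: real code would compute a spectral representation here.
    return [Float](repeating: samples.reduce(0, +) / Float(samples.count), count: 13)
}

// Stand-in for the acoustic model (DNN): a window of ~20 frames in,
// a probability distribution over the sound classes out.
func acousticModel(_ frames: [[Float]]) -> [Float] {
    // Placeholder: uniform distribution instead of a trained network.
    return [Float](repeating: 1.0 / Float(soundClassCount), count: soundClassCount)
}

// Chop the incoming waveform into 10 ms frames, then slide a 20-frame
// window over them and score each window with the acoustic model.
func scoreStream(_ waveform: [Float]) -> [[Float]] {
    let frames = stride(from: 0, through: waveform.count - frameSamples, by: frameSamples)
        .map { spectrumFrame(waveform[$0 ..< $0 + frameSamples]) }
    guard frames.count >= windowFrames else { return [] }
    return (0 ... frames.count - windowFrames).map { start in
        acousticModel(Array(frames[start ..< start + windowFrames]))
    }
}
```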
Another interesting tidbit is that if the device misses your first attempt at the phrase, it will momentarily enter a more sensitive state to make it easier to catch a repeat of “Hey Siri”:
We built in some flexibility to make it easier to activate Siri in difficult conditions while not significantly increasing the number of false activations. There is a primary, or normal threshold, and a lower threshold that does not normally trigger Siri. If the score exceeds the lower threshold but not the upper threshold, then it may be that we missed a genuine “Hey Siri” event. When the score is in this range, the system enters a more sensitive state for a few seconds, so that if the user repeats the phrase, even without making more effort, then Siri triggers. This second-chance mechanism improves the usability of the system significantly, without increasing the false alarm rate too much because it is only in this extra-sensitive state for a short time.
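The second-chance behavior boils down to a small piece of state: when a score lands between the two thresholds, the detector temporarily lowers its bar for a few seconds. Here is a rough Swift sketch of that logic; the threshold values, the five-second window, and the `HeySiriDetector` type are illustrative assumptions, not details from Apple's post.

```swift
// Minimal sketch of the two-threshold "second chance" logic described above.
// The numbers and the type are illustrative, not Apple's.
import Foundation

struct HeySiriDetector {
    let primaryThreshold: Float = 0.9
    let lowerThreshold: Float = 0.6
    let sensitiveWindow: TimeInterval = 5.0   // "a few seconds"

    private var sensitiveUntil: Date = .distantPast

    // Returns true when the detector should trigger Siri for a given score.
    mutating func shouldTrigger(score: Float, at now: Date = Date()) -> Bool {
        if score >= primaryThreshold {
            return true
        }
        if score >= lowerThreshold {
            if now < sensitiveUntil {
                // A repeat of the phrase during the sensitive window triggers.
                return true
            }
            // Possible missed "Hey Siri": temporarily lower the bar.
            sensitiveUntil = now.addingTimeInterval(sensitiveWindow)
        }
        return false
    }
}
```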
The post also details how the “Hey Siri” phrase was selected:
Well before there was a Hey Siri feature, a small proportion of users would say “Hey Siri” at the start of a request, having started by pressing the button. We used such “Hey Siri” utterances for the initial training set for the US English detector model. We also included general speech examples, as used for training the main speech recognizer. In both cases, we used automatic transcription on the training phrases. Siri team members checked a subset of the transcriptions for accuracy.
We created a language-specific phonetic specification of the “Hey Siri” phrase. In US English, we had two variants, with different first vowels in “Siri”—one as in “serious” and the other as in “Syria.” We also tried to cope with a short break between the two words, especially as the phrase is often written with a comma: “Hey, Siri.” Each phonetic symbol results in three speech sound classes (beginning, middle and end) each of which has its own output from the acoustic model.
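Those numbers line up neatly: a six-phone rendering of “Hey Siri,” with three sound classes per phonetic symbol plus silence and other speech, gives the roughly 20 classes mentioned earlier. The short Swift sketch below illustrates that expansion; the specific phone symbols are an illustrative guess, not Apple's actual phonetic specification.

```swift
// Minimal sketch of expanding a phonetic spelling of "Hey Siri" into
// per-phone sound classes (beginning/middle/end), plus silence and
// other speech. The phone symbols are illustrative, not Apple's.
enum Segment: String, CaseIterable { case beginning, middle, end }

let heySiriPhones = ["h", "ey", "s", "ih", "r", "iy"]   // one US English variant

// Each phonetic symbol contributes three sound classes.
var soundClasses = heySiriPhones.flatMap { phone in
    Segment.allCases.map { "\(phone)_\($0.rawValue)" }
}
soundClasses += ["silence", "other_speech"]

print(soundClasses.count)   // 6 phones * 3 segments + 2 = 20 classes
```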
While you might not think twice when you say “Hey Siri” to wake the virtual assistant on your iPhone, iPad, or Apple Watch, there is a huge amount of work going on behind the scenes to make the feature work reliably. Definitely take a look at the post if you’re interested in speech recognition or machine learning.