Automatic Speaker Recognition for Authenticating Users in the Internet of Things

Our world is becoming increasingly mobile. People rely on their mobile devices not only for communication, but also for applications ranging from commerce to home automation and security, banking, and entertainment, to name a few. Coupled with today’s expanding environment of web-connected “smart” devices, known as the Internet of Things (IoT), this makes accurate user authentication for these devices more critical than ever.

The challenge lies in how to remotely authenticate a person when there is no physical or visual contact. Mobile speech services already listen for commands such as “Hey Siri” or “Ok Google”; however, the device is not necessarily trained to distinguish its owner’s voice from anyone else’s, or, where it is, the training serves convenience rather than security. Voice biometrics, also known as automatic speaker recognition, provides a reliable way to verify a person’s unique identity for secure access control over their device. Automatic speaker recognition technology is used to determine who is speaking, rather than what is being spoken.

At SRI’s Speech Technology and Research (STAR) Lab, we’ve been developing automatic speaker recognition technology that allows a device to simultaneously identify a user, determine their authorization to access a device, and execute their command. We call this technology ‘command authentication’. Not only does it provide a more secure environment in which the device can be used, but it can also be helpful in practical applications. For example, why should unlocking a phone and then opening an app be two separate steps? With command authentication, you can register a command that opens the app while also authenticating that the authorized user indeed issued the request. In the case of a home automation application, if you asked your home system to “switch to my profile,” how would it know who the “my” is out of several people in the home? Command authentication allows multiple users of a single device to register commands and then tie the commands to distinct user-specific tasks.

There are many potential applications for command authentication. It is important to note that while the device may always be listening, all of the speech processing is done on the device itself. The speech is never sent to the cloud for processing, as is the case with many other speech recognition technologies today. Because SRI’s command authentication technology is based on pattern matching and not speech recognition, it confines the processing of the speech to the privacy of your own device.

The previous examples of command authentication require a user to register a command and associate it with a task to be executed. This combination of user plus command is predefined, resulting in very high accuracy of user identification from very little speech, typically only two or three seconds.
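To make the idea of a predefined user-plus-command combination concrete, here is a minimal sketch of how such a lookup might work. It is not SRI's implementation: the `Enrollment` record, the fixed-length feature vectors, the cosine-similarity score, and the `threshold` value are all assumptions made for illustration. The sketch also assumes that some on-device front end has already converted the spoken audio into a feature vector.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Enrollment:
    """One registered (user, command, task) combination. Hypothetical structure."""
    user: str
    command: str
    task: str
    template: List[float]  # assumed feature vector captured at registration


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def authenticate_command(
    features: List[float],
    enrollments: List[Enrollment],
    threshold: float = 0.8,  # illustrative value, not a recommended setting
) -> Optional[Enrollment]:
    """Return the best-matching enrollment, or None if nothing clears the threshold."""
    best: Optional[Enrollment] = None
    best_score = threshold
    for enrollment in enrollments:
        score = cosine(features, enrollment.template)
        if score >= best_score:
            best, best_score = enrollment, score
    return best


# Two users register different commands on the same device.
enrollments = [
    Enrollment("alice", "open banking app", "launch_bank", [1.0, 0.0, 0.2]),
    Enrollment("bob", "switch to my profile", "load_profile_bob", [0.0, 1.0, 0.1]),
]

match = authenticate_command([0.9, 0.1, 0.2], enrollments)
rejected = authenticate_command([0.5, 0.5, 0.0], enrollments)
```

In this sketch a single match both identifies the user and selects the task to run, and an utterance that resembles no registered template is rejected outright. Everything runs locally on plain vectors, with no transcription step.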

SRI is also extending this technology to free-speech recognition. Consider a home automation system that doesn’t require a predefined message or set of keywords to determine who it is talking to. With free-speech (also known as text-independent) speaker recognition, the system could determine which household members were conversing with it, all without transcribing what was being said and with the processing kept local rather than sent to the cloud. This type of speaker recognition can allow people to go about their daily lives without needing to remember key phrases; as they converse with the device rather than issuing a single command, the device will know who it’s talking to. Free-speech speaker recognition could also be useful for identifying speakers in conference calls in the workplace, or even as a factor of identification in a call center.

The free-speech recognition technology can also be extended to audio/visual speaker search. Have you ever watched a show, heard a voice-over or narrator, thought “that voice sounds familiar,” and then tried to find out who it was? SRI’s technology can rapidly locate video or audio clips that contain a speaker of interest, and take the user to the point at which that person speaks. All of the processing is based on audio, rather than image or video processing, which significantly reduces the computation required to search through a multimedia database. The technology then extracts thumbnail images from the point of highest confidence for rapid visual confirmation if a speaker is on screen.
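The search described above can be pictured as scoring short stretches of audio against a model of the speaker of interest and jumping to the best-scoring moment. The following is a rough sketch under stated assumptions, not SRI's method: the `Segment` layout, the per-segment feature vectors, the cosine score used as a stand-in for "confidence," and the `threshold` value are all hypothetical.

```python
import math
from typing import List, Optional, Tuple

# (start time in seconds, assumed voice feature vector for that stretch of audio)
Segment = Tuple[float, List[float]]
Hit = Tuple[float, float]  # (start time in seconds, confidence score)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_speaker(
    segments: List[Segment],
    speaker_model: List[float],
    threshold: float = 0.7,  # illustrative value only
) -> Tuple[List[Hit], Optional[Hit]]:
    """Return every segment that matches the speaker, plus the single best match."""
    scored = [(start, cosine(vec, speaker_model)) for start, vec in segments]
    hits = [(start, score) for start, score in scored if score >= threshold]
    best = max(hits, key=lambda hit: hit[1], default=None)
    return hits, best


# Three stretches of audio from one clip, already reduced to feature vectors.
segments = [
    (0.0, [0.0, 1.0]),
    (12.5, [1.0, 0.1]),
    (30.0, [0.9, 0.4]),
]

hits, best = find_speaker(segments, speaker_model=[1.0, 0.0])
```

Here `hits` lists the points where the speaker of interest appears to talk, and `best` marks the point of highest confidence, which is where a thumbnail would be pulled for visual confirmation. Only audio-derived vectors are compared, with no image processing involved.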

Similar to how people can tag faces in a photo collection, where a system models a face that is tagged and then attempts to find other instances of that person in the collection, this technology could be used to tag audio in a home video collection, or potentially in commercial applications such as broadcast news. This is particularly useful where rapid indexing and retrieval might be required and image processing for face recognition might not be applicable, such as where the host of a program or the user of the video camera may speak regularly but not actually appear on screen. In the Internet of Things, speaker tagging can generate a wealth of metadata that helps users find the results they want and parse through them rapidly.

These are just a few of the many applications for speaker recognition technology. At SRI, we’ve been focused on developing this technology for real-world conditions that include reverb, background noise, chatter, and competing voices in everyday life. We believe the time is near when the technology will be commercially viable, and we are looking forward to making it available to the consumer and business markets.