VOiCES: SRI and IQT Labs collaborate on advancing speech research for far-field application


With nearly one million audio segments and more than 3,800 hours of audio, the VOiCES corpus is poised to power AI and machine learning research that could generate exciting advancements in everything from speech and speaker recognition technology to hearing aid function.

It’s called VOiCES for short, and it just might provide the foundation for huge improvements in everything from digital voice assistants and smart devices to hearing aid function.

Formally known as Voices Obscured in Complex Environmental Settings, the VOiCES data set is the result of a collaboration between SRI International and IQT Labs, an In-Q-Tel (IQT) initiative that brings together startups, academia, and government to best leverage big data with the help of the right tools and technology.

At its core, VOiCES is a large-scale speech corpus that mimics conditions in acoustically challenging environments. It stands out for a variety of reasons, a key one being that it was recorded at a distance rather than at close range, as most speech datasets are.

Researchers require massive amounts of data to develop the algorithms and related technology needed to fuel the creation of products like voice assistants that can accurately understand and respond to human speech in environments with considerable background noise.

“The ultimate goal of this corpus is to advance acoustic research, affording researchers free access to complex acoustic data,” says Michael Lomnitz, a data scientist with IQT Labs and a member of the VOiCES team. “The corpus is open source, Creative Commons BY 4.0, free for commercial, academic, and government use.” Amazon Web Services is a distribution partner.

Existing large-scale corpora for speech research under reverberant conditions suffer from significant shortcomings. Most prior work either uses software simulations to generate data representing reverberant conditions or uses real recordings from very few speakers, which limits the significance of any subsequent analysis. The VOiCES corpus was collected by recording audio retransmitted through high-quality loudspeakers in real rooms, each with a distinct acoustic profile, and capturing the natural reverberation with multiple microphones.

“This is one of the largest datasets ever released for far-field conditions to the speech community,” says Mahesh Kumar Nandwana, an SRI advanced computer scientist and co-organizer of VOiCES. “It has the potential to help lead to some really amazing advancements for real-world applications.”

The audio challenge

Humans are increasingly interacting with everyday technology and products through voice recognition interfaces. These interfaces are woven into everything from smartphones and vehicles to personal digital assistants like Alexa and Siri. Interaction with more products via voice will only increase in the years ahead. And yet, for all the marvelous convenience that voice recognition tech has offered to date, its performance can often fall short of the mark. One example: You’re not alone if you’ve asked your iPhone to call your friend “Bill” and ended up ringing “Phil.”

A key reason for the underperformance is that voice recognition tech is driven by machine learning that is itself based on clean, close-talk audio. Speech in the real world, however, doesn’t happen in a controlled, clutter-free environment. It happens in a highly reverberant room and against the backdrop of a TV or music, for instance.

To improve voice recognition technology, as well as make other advancements in audio functionality, researchers need to bolster AI and machine learning, basing new, smarter algorithms on audio drawn from real-world settings. There has been one problem, though: audio collected under these far-field conditions hasn’t existed, at least not in the abundance needed, and not free and open to any researcher interested in using it to advance the greater good.

IQT Labs and SRI International joined forces to change that.

“We wanted our dataset to have other speakers and ambient noise, like the classic cocktail party problem, where the task is to hear and understand a sound of interest, like someone speaking, in a sort of complex auditory setting that you’d find at a cocktail party with all its din and conversation,” says Todd Stavish, senior vice president at IQT Labs.


Collaborating toward a solution

For help creating that dataset, IQT Labs turned to SRI International, which came highly recommended. “Some of our government partners had said how good SRI was with audio and data collection, and as we looked into things, it really made sense to partner up,” says Stavish.

Nandwana says SRI was excited to help tackle the challenge: “We have frequently released big data sets,” he says. “The community has come to trust us because, even though anyone can collect data, it’s another thing to do it properly with the right protocols in place. We’ve earned a good reputation for doing that, and we were eager to work on this project.”

Colleen Richey, an SRI senior linguist and co-organizer of VOiCES, played an instrumental role in creating the dataset.

As she explains, the recordings took place in real rooms of different sizes, each with its own background audio and reverberation profile. While the clean speech played on one loudspeaker, other loudspeakers played various types of distractor sounds, including television, music, or just a kind of hum or babble, Richey says.

“We played the same audio in the different rooms in very different acoustic environments to mimic a range of the types of settings speech happens within in the real world,” Richey explains.

In a unique approach, recordings occurred under far-field conditions; twelve to twenty microphones were strategically placed at a distance throughout the rooms to help replicate real-world settings. To imitate the way people often move their heads, gesticulate or otherwise move during a conversation, the foreground loudspeaker sat on a motorized platform that rotated over a range of angles during recordings. Ultimately, the resulting dataset came to feature 3,800+ hours (more than 22 weeks) of common human speech from 300 speakers (half male, half female).

A notable benefit of the VOiCES dataset is the broad distribution of speech quality present in the recordings. Depending on the presence and type of distractor sounds, the distance from the microphone to the source and distractor loudspeakers, and the microphone type, the intelligibility of the recorded audio can range from very clear to very poor. This makes it possible to quantify the correlation between objective measures of speech quality and the performance of a speech processing system.
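As a rough illustration of what quantifying that correlation could look like, the sketch below computes a Pearson correlation coefficient between a per-recording quality score and a per-recording word error rate. Everything in it is an assumption made for illustration: the `pearson` helper, the variable names, and the numbers themselves are invented and are not drawn from the VOiCES corpus or from any published result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (covariance, up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances, up to the same factor).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for this example only:
# an objective quality score per recording (higher means clearer audio)
quality_scores = [4.1, 3.6, 3.0, 2.4, 1.8, 1.3]
# and the word error rate a recognizer might produce on the same recordings.
word_error_rates = [0.06, 0.09, 0.15, 0.24, 0.38, 0.52]

r = pearson(quality_scores, word_error_rates)
print(f"Pearson r between quality and word error rate: {r:.2f}")
```

With real data, a strongly negative coefficient would suggest that the chosen quality measure tracks recognizer performance well; a value near zero would suggest it does not.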

All told, the corpus contains the source audio, the retransmitted audio, orthographic transcriptions, and speaker labels.

“With this data, teams can develop the AI algorithms or machine learning to solve different problems in the sphere of audio and speech technology,” says Nandwana.

Building the future now

Researchers are already jumping at the chance.

As Nandwana shares, the team hosted a challenge based on the VOiCES corpus at INTERSPEECH 2019, the International Speech Communication Association’s annual conference. The challenge received a tremendous response from the speech research community. A total of 60 international research organizations from academia and industry registered for it. About 12 papers were published at the conference after peer review, he says. Meanwhile, there will be a VOiCES 2020 special session at Odyssey 2020: The Speaker and Language Recognition Workshop, planned for November. The session is dedicated to a broad range of research areas aimed at advancing far-field/distant speaker recognition using the VOiCES corpus.

And, as alluded to earlier, the potential end-game applications are thrilling.

VOiCES can propel acoustic research that helps smartphones, voice assistants, and the like better detect and understand human speech, as well as identify a particular speaker. The latter development could be crucial to enhancing the security of voice biometrics and preventing voice spoofing on phones. Additionally, the VOiCES corpus can assist with auditory source separation and localization, noise reduction, and general enhancement of acoustic quality, which can eventually translate to better devices for assisting people with hearing problems.

What’s more, SRI and IQT Labs aren’t done with VOiCES yet. There are plans for a next-generation iteration: a multimodal data collection with potential elements like dialogs, video, LIDAR, and more.

“The relationship we’ve established with SRI has been fantastic,” says Lomnitz. “They’ve been extremely successful at designing the experiments, collecting the data and more. I don’t see us partnering with anyone else for future VOiCES work, given their expertise.”

Helpful links:

1. Introducing the VOiCES dataset: https://voices18.github.io/

2. VOiCES 2020: Advances in Far-Field Speaker Recognition Odyssey 2020 session: http://www.odyssey2020.org/voices2020.html
