Faster voice recognition for Windows Phone 8


Via the official Windows Phone Blog comes news that Microsoft has quietly upgraded the backend for Windows Phone 8's voice recognition service, with results now returned twice as fast and accuracy improved by 15%. The update has already been rolled out in select markets (e.g. the US got the upgrade over the last few weeks) and should reach all markets that support voice recognition in due course.

Voice recognition is used by Windows Phone in a wide range of areas, including search, text message composition, and the dictation of text in OneNote. Windows Phone platform APIs also allow third-party apps to add their own voice recognition features, powered by the Microsoft backend. In both cases audio is captured by the phone and sent for processing on Microsoft's servers. It is this server-side stage that has been upgraded.

From the Bing Blogs post:

If you’re using a Windows Phone in the US, you may have noticed that the voice capabilities on your phone have gotten better. Over the past few weeks, we’ve been rolling out updates to Windows Phone customers, to improve the speed and accuracy of voice to text and voice search. Now when you compose a text message or search using your voice, Bing will return results twice as fast as before and increase accuracy by 15 percent.

The accuracy and speed gains are the result of a series of changes based on research into an approach that Microsoft refers to as Deep Neural Networks:

Over the past year, we’ve been working closely with Microsoft Research (MSR) to address limitations of the previous voice experience. To achieve the speed and accuracy improvements, we focused on an advanced approach called Deep Neural Networks (DNNs). DNN is a technology that is inspired by the functioning of neurons in the brain. In a similar way, DNN technology can detect patterns akin to the way biological systems recognize patterns.

By coupling MSR’s major research breakthroughs in the use of DNNs with the large datasets provided by Bing’s massive index, the DNNs were able to learn more quickly and help Bing voice capabilities get noticeably closer to the way humans recognize speech. We also made a few improvements under the hood that allowed Bing to more easily identify speech patterns and cut through ambient and background noise – cutting down response time by half and improving the word error rate by 15 percent, even in noisy situations.
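The "word error rate" mentioned in the quote is a standard speech recognition metric: the number of word substitutions, insertions, and deletions needed to turn the recognised text into what was actually said, divided by the number of words spoken. As a rough illustration only, it can be computed as in the sketch below. The function name and the sample phrases are made up for this example, and nothing here reflects how Microsoft measures its own figures. The post also does not say whether the 15 percent improvement is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missed (deletion)
                curr[j - 1] + 1,     # extra word recognised (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative phrases only: one wrong word out of six spoken.
spoken = "text mum i am running late"
recognised = "text mum i am running light"
print(word_error_rate(spoken, recognised))
```

A lower value is better, so an "improvement" in word error rate means fewer misrecognised words per message or search.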

More information on the DNN approach and the underlying research is available in this blog post. The science is fascinating, but the end result is what will have the big impact. Improving accuracy and speed should result in greater use of voice recognition features.

Just as importantly, the DNN approach also has the potential to reduce the amount of work needed to make a new language available for the voice control feature. This is important because Windows Phone 8's voice features are available in only a subset of the countries where Windows Phone is sold (only France, Germany, Italy, Spain, the United Kingdom, and the United States get "full" voice support). In part this is because of a dependence on local data, but a large part of the reason is the need to train voice control for each language / country.

Source / Credit: Bing