Earlier this year, we moved Project Vaani into a prototyping phase with user trials. We collected valuable feedback on the product concept and design, and validated our assumption that a voice interface helps create a faster, easier, and less constrained way of doing things. This solution was designed and built around users’ lives, as opposed to service providers’ business interests. With that validation, we decided to focus on implementing the core voice technology components that everyone (Mozilla, as well as partners and the community) can use in their projects.
We also found that current market solutions offer speech recognition via cloud-based services, which raises privacy concerns and requires investment in cloud infrastructure. We believe an offline solution that can be embedded in other applications and low-footprint devices is essential to avoid these issues.
With that, we identified the following next steps:
- We are starting with an online solution in order to create trained models first. We will then compress these models so that they can be used offline on small-footprint devices.
- We will also create a public corpus of voices, which will help with creating inclusive technologies for both speech-to-text recognition and text-to-speech synthesis.
Deep-Learning-Based Online Speech Recognition
In the past ten years, deep learning has revolutionized numerous fields: natural language processing, image classification, and automatic translation. Recently, speech recognition has also benefited from research in this space. Over the past few months, Mozilla’s deep-learning team has been using TensorFlow to build a speech decoder based on the findings in Baidu’s published Deep Speech research paper. The paper describes achieving high accuracy by using a bidirectional recurrent neural network (BRNN) to ingest speech spectrograms and generate English text transcriptions.
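To make the BRNN idea concrete, here is a minimal NumPy sketch of a single bidirectional recurrent layer over spectrogram frames, followed by a per-frame softmax over characters. The layer sizes, the tanh activation, and the shared input weights are illustrative assumptions for this sketch only; the actual Deep Speech architecture uses several layers, clipped-ReLU units, and CTC training, none of which are shown here.

```python
import numpy as np

def rnn_pass(frames, W_in, W_rec, b, reverse=False):
    """Run a simple tanh RNN over spectrogram frames in one direction."""
    order = reversed(range(len(frames))) if reverse else range(len(frames))
    h = np.zeros(W_rec.shape[0])
    outputs = [None] * len(frames)
    for t in order:
        # Each hidden state mixes the current frame with the previous state.
        h = np.tanh(frames[t] @ W_in + h @ W_rec + b)
        outputs[t] = h
    return np.stack(outputs)  # (T, hidden)

def brnn_char_probs(frames, params):
    """Bidirectional pass, then a per-frame softmax over the character set."""
    fwd = rnn_pass(frames, params["W_in"], params["W_rec_f"], params["b_f"])
    bwd = rnn_pass(frames, params["W_in"], params["W_rec_b"], params["b_b"],
                   reverse=True)
    h = np.concatenate([fwd, bwd], axis=1)           # (T, 2 * hidden)
    logits = h @ params["W_out"] + params["b_out"]   # (T, n_chars)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)          # rows sum to 1
```

In a trained system, these per-frame character probabilities would be decoded into a transcription (e.g. with a CTC beam search), which the sketch omits.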
This project will produce a Speech-To-Text (STT) engine. Training it requires a server-class machine with an adequately powerful CPU, GPU, and memory. Once the model is trained, we plan to use TensorFlow Serving to query it without those high resource requirements. This will pave the way for creating the offline solution noted below.
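As a rough sketch of what querying a served model could look like, the snippet below builds the JSON body for TensorFlow Serving's REST predict API. Sending an utterance's spectrogram frames as a single instance, and the endpoint shown in the comment, are illustrative assumptions rather than details from this post.

```python
import json

def build_predict_request(spectrogram_frames):
    """Build the JSON body for TensorFlow Serving's REST predict API.

    The row-format body is {"instances": [...]}; each instance here is
    one utterance's spectrogram as a list of per-frame feature vectors.
    """
    return json.dumps({"instances": [spectrogram_frames]})

# The body would then be POSTed to the model's predict endpoint, e.g.
#   http://<host>:8501/v1/models/<model_name>:predict
# where <model_name> is whatever name the model was served under.
```

The response would contain the model's per-frame outputs under a `"predictions"` key, which the client then decodes into text.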
Project PipSqueak – Offline Local Speech Recognition
Project PipSqueak is a client-based offline STT engine that targets devices with a smaller footprint (e.g. RPi 3 and Android phones). Based on Google’s research in this area, the idea is to shrink the neural-network model through quantization and a compression technique based on singular value decomposition (SVD). By reducing the model size, we’ll be able to translate speech to text on small-footprint devices, removing the need for servers and an internet connection for accurate speech decoding. Once implemented, Project PipSqueak could be embedded in other platforms and applications, such as Firefox or connected devices.
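A minimal NumPy sketch of the two compression ideas: replacing a weight matrix with a truncated SVD factorization, and linearly quantizing float weights to 8-bit integers. The rank and bit-width used here are illustrative assumptions, not the project's actual settings.

```python
import numpy as np

def svd_compress(W, rank):
    """Approximate weight matrix W with a rank-`rank` factorization.

    Replaces the m x n matrix with two factors of shapes (m, rank) and
    (rank, n), shrinking the parameter count whenever
    rank * (m + n) < m * n.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (m, rank), singular values folded in
    B = Vt[:rank, :]             # (rank, n)
    return A, B

def quantize_int8(W):
    """Linearly quantize float weights to int8 plus one float scale."""
    scale = np.abs(W).max() / 127.0
    q = np.round(W / scale).astype(np.int8)
    return q, scale

# Example: a 1024 x 1024 layer truncated to rank 64 keeps
# 1024*64 + 64*1024 = 131,072 parameters instead of 1,048,576 -- an
# 8x reduction, before quantization shrinks each stored weight by
# another 4x (float32 -> int8).
```

At inference time the layer computes `x @ A @ B` instead of `x @ W`, and the int8 weights are rescaled by `scale` (or the whole computation is done in integer arithmetic).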
Project VoiceBank – A Public Voice Corpus

Project VoiceBank enables our community of volunteers from around the world to “donate” their voices, creating a public corpus that includes a variety of languages and accents. This public resource will be available to all open source voice interface projects, to help everyone build inclusive technologies that work for all. The Project VoiceBank team will kick off this project at the Mozilla Work Week in Hawaii in December 2016.
Architecture / Code Samples / Repos can be found here.