Command-line tools for speech and intent recognition on Linux

voice2json

voice2json is a collection of command-line tools for offline speech/intent recognition on Linux. It is free, open source (MIT), and supports 17 human languages.

From the command line:

$ voice2json transcribe-wav \
      < turn-on-the-light.wav | \
      voice2json recognize-intent | \
      jq .

produces a JSON event like:

{
    "text": "turn on the light",
    "intent": {
        "name": "LightState"
    },
    "slots": {
        "state": "on"
    }
}

when trained with this template:

[LightState]
states = (on | off)
turn (<states>){state} [the] light

Tools like Node-RED can be easily integrated with voice2json through MQTT.

voice2json is optimized for:

Sets of voice commands that are described well by a grammar
Commands with uncommon words or pronunciations
Commands or intents that can vary at runtime

It can be used to:

Supported speech-to-text systems include:

CMU’s pocketsphinx
Dan Povey’s Kaldi
Mozilla’s DeepSpeech 0.6
Kyoto University’s Julius

Unique Features

voice2json is more than just a wrapper around pocketsphinx, Kaldi, DeepSpeech, and Julius!

Training produces both a speech and intent recognizer. By describing your voice commands with voice2json’s templating language, you get more than just transcriptions for free.
Re-training is fast enough to be done at runtime (usually < 5s), even up to millions of possible voice commands. This means you can change referenced slot values or add/remove intents on the fly.
All of the available commands are designed to work well in Unix pipelines, typically consuming/emitting plaintext or newline-delimited JSON. Audio input/output is file-based, so you can receive audio from any source.
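
Because the tools emit newline-delimited JSON, a downstream consumer only needs to read one line at a time. The following is a minimal sketch of such a consumer, assuming events shaped like the LightState example earlier on this page. The names `handle_events` and `set_light` are hypothetical and not part of voice2json.

```python
import io
import json


def handle_events(lines, handlers):
    """Call handlers[intent name](slots) for each JSON event; return the results."""
    results = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        name = event.get("intent", {}).get("name", "")
        handler = handlers.get(name)
        if handler is not None:
            results.append(handler(event.get("slots", {})))
    return results


def set_light(slots):
    # Hypothetical action: a real handler might publish an MQTT message instead.
    return "light -> " + slots.get("state", "unknown")


# In a real pipeline, pass sys.stdin instead of this in-memory sample.
sample = io.StringIO(
    '{"text": "turn on the light", '
    '"intent": {"name": "LightState"}, "slots": {"state": "on"}}\n'
)
print(handle_events(sample, {"LightState": set_light}))
```

Events whose intent has no registered handler are skipped, so the same script can sit at the end of any pipeline.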

How it Works

voice2json needs a description of the voice commands you want to be recognized in a file named sentences.ini. This can be as simple as a listing of [Intents] and sentences:

[GarageDoor]
open the garage door
close the garage door

[LightState]
turn on the living room lamp
turn off the living room lamp
...

A small templating language is available to describe sets of valid voice commands, with [optional words], (alternative | choices), and <shared rules>. Portions of (commands can be){annotated} as containing slot values that you want in the recognized JSON.
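
To make the template syntax concrete, here is a small sketch that lists every sentence a template can produce. It only illustrates what the syntax means, covers just the constructs named above, and is not how voice2json itself is implemented. The names `expand`, `_parse`, and `TOKEN` are hypothetical.

```python
import re

# Brackets and pipes, {tags}, <rule references>, or plain words.
TOKEN = re.compile(r"[\[\]()|]|\{[^}]*\}|<[^>]+>|[^\s\[\]()|{}<>]+")


def _parse(tokens, pos, closer, rules):
    """Parse alternatives until `closer`; return (word lists, next position)."""
    finished = []   # alternatives that are already complete
    current = [[]]  # expansions of the alternative being read
    while pos < len(tokens) and tokens[pos] != closer:
        token = tokens[pos]
        pos += 1
        if token == "|":
            finished.extend(current)
            current = [[]]
            continue
        if token.startswith("{"):
            continue  # slot tag: changes the JSON output, not the spoken words
        if token == "(":
            choices, pos = _parse(tokens, pos, ")", rules)
        elif token == "[":
            choices, pos = _parse(tokens, pos, "]", rules)
            choices = choices + [[]]  # optional: the words may also be absent
        elif token.startswith("<"):
            rule_tokens = TOKEN.findall(rules[token[1:-1]])
            choices, _ = _parse(rule_tokens, 0, None, rules)
        else:
            choices = [[token]]
        current = [left + right for left in current for right in choices]
    return finished + current, pos + 1  # +1 skips the closing bracket


def expand(template, rules=None):
    """Return every sentence the template can produce, with tags dropped."""
    word_lists, _ = _parse(TOKEN.findall(template), 0, None, rules or {})
    return [" ".join(words) for words in word_lists]


# The LightState template from the top of this page.
for sentence in expand("turn (<states>){state} [the] light", {"states": "(on | off)"}):
    print(sentence)
```

For the LightState template this yields four sentences: "turn on" or "turn off", each with or without "the".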

When trained, voice2json will transform audio data into JSON objects with the recognized intent and slots.

Assumptions

voice2json is designed to work under the following assumptions:

Speech can be segmented into voice commands by a wake word + silence, or via a push-to-talk mechanism
A voice command contains at most one intent
Intents and slot values are equally likely

Getting Started

Install voice2json
Download a profile and extract it to $HOME/.config/voice2json
- Your profile settings will be in $HOME/.config/voice2json/profile.yml
Edit sentences.ini in your profile and add your custom voice commands
Train your profile
Use the transcribe-wav and recognize-intent commands to do speech/intent recognition
- See the recipes for more possibilities
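
The last step can also be driven from a script. The sketch below mirrors the shell pipeline at the top of this page using Python's `subprocess` module, and assumes a trained profile and `voice2json` on the PATH. The function name `recognize_wav` and its `run` parameter are hypothetical; `run` exists only so the subprocess call can be swapped out when testing.

```python
import json
import subprocess


def recognize_wav(wav_path, run=subprocess.run):
    """Transcribe a WAV file, then recognize the intent of the transcription."""
    with open(wav_path, "rb") as wav_file:
        wav_bytes = wav_file.read()

    # Equivalent to: voice2json transcribe-wav < file.wav
    transcription = run(
        ["voice2json", "transcribe-wav"],
        input=wav_bytes,
        capture_output=True,
        check=True,
    )
    # Equivalent to piping that JSON into: voice2json recognize-intent
    recognition = run(
        ["voice2json", "recognize-intent"],
        input=transcription.stdout,
        capture_output=True,
        check=True,
    )
    return json.loads(recognition.stdout)
```

Calling `recognize_wav("turn-on-the-light.wav")` would then return the recognized event as a Python dictionary.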

Why Not That

Why not just use Google, Dragon, or something else?

Cloud-based speech and intent recognition services, such as Google Assistant or Amazon’s Alexa, require a constant Internet connection to function. Additionally, they keep a copy of everything you say on their servers. Despite the high accuracy and deep integration with other services, this approach is too brittle and uncomfortable for me.

Dragon Naturally Speaking offers local installations and offline functionality. Great! Unfortunately, Dragon requires Microsoft Windows to function. It is possible to use Dragon in Wine on Linux or via a virtual machine, but it is difficult to set up and not officially supported by Nuance.

Until recently, Snips offered an impressive amount of functionality offline and was easy to interoperate with. Unfortunately, they were purchased by Sonos and have since shut down their online services (required to change your Snips assistants). See Rhasspy if you are looking for a Snips replacement, and avoid investing time and effort in a platform you cannot control!

If you feel comfortable sending your voice commands through the Internet for someone else to process, or are not comfortable with Linux and the command line, I recommend taking a look at Mycroft.

No Magic, No Surprises

voice2json is not an A.I. or gee-whizzy machine learning system. It does not attempt to guess what you want to do, and keeps everything on your local machine. There is no online account sign-up needed, no privacy policy to review, and no advertisements. All generated artifacts are in standard data formats; typically just text.

Once you’ve installed voice2json and downloaded a profile, there is no longer a need for an Internet connection. At runtime, voice2json will only ever write to your profile directory or the system’s temporary directory (/tmp).

Supported Languages

voice2json supports the following languages/locales. I don’t speak or write any language besides U.S. English very well, so please let me know if any profile is broken or could be improved! I’m mostly Chinese Room-ing it.

Untested profiles may work, but I don’t have the necessary data or enough understanding of the language to test them.

| Language | Locale | System | Closed | Open |
| --- | --- | --- | --- | --- |
| Catalan | ca-es | pocketsphinx | UNTESTED | UNTESTED |
| Dutch (Nederlands) | nl | kaldi | ★ ★ ★ ★ ★ (2x) | ☹ (1x) |
| Dutch (Nederlands) | nl | pocketsphinx | ★ ★ ★ ★ (18x) | ☹ (3x) |
| English | en-in | pocketsphinx | ☹ (4x) | ☹ (4x) |
| English | en-us | deepspeech | ★ ★ ★ ★ ★ (1x) | ★ ★ ★ ★ (1x) |
| English | en-us | julius | ★ ★ ★ ★ (1x) | UNTESTED |
| English | en-us | kaldi | ★ ★ ★ ★ ★ (3x) | ★ ★ ★ ★ (1x) |
| English | en-us | pocketsphinx | ★ ★ ★ ★ ★ (9x) | ★ ★ ★ ★ (2x) |
| French (Français) | fr | kaldi | ★ ★ ★ ★ (4x) | ★ ★ ★ ★ (1x) |
| French (Français) | fr | pocketsphinx | ★ ★ ★ ★ (23x) | ☹ (3x) |
| German (Deutsch) | de | pocketsphinx | ★ ★ ★ ★ ★ (17x) | ★ ★ ★ ★ ★ (3x) |
| German (Deutsch) | de-DE | deepspeech | ★ ★ ★ ★ ★ (1x) | ★ ★ ★ ★ (1x) |
| German (Deutsch) | de-DE | kaldi | ★ ★ ★ ★ ★ (4x) | ★ ★ ★ ★ (1x) |
| Greek (Ελληνικά) | el-gr | pocketsphinx | ★ ★ ★ ★ ★ (15x) | ☹ (1x) |
| Hindi (Devanagari) | hi | pocketsphinx | UNTESTED | UNTESTED |
| Italian (Italiano) | it | pocketsphinx | ★ ★ ★ ★ ★ (21x) | ★ ★ ★ ★ ★ (7x) |
| Kazakh (қазақша) | kz | pocketsphinx | UNTESTED | UNTESTED |
| Korean | ko-kr | kaldi | ☹ (4x) | ☹ (4x) |
| Mandarin | zh-cn | pocketsphinx | UNTESTED | UNTESTED |
| Polish (polski) | pl | julius | UNTESTED | UNTESTED |
| Portuguese (Português) | pt-br | pocketsphinx | ★ ★ ★ ★ (51x) | ☹ (11x) |
| Russian (Русский) | ru | pocketsphinx | ★ ★ ★ ★ ★ (17x) | ☹ (1x) |
| Spanish (Español) | es | pocketsphinx | ★ ★ ★ ★ (25x) | ★ ★ ★ ★ (15x) |
| Spanish | es-mexican | pocketsphinx | ★ ★ ★ ★ ★ (9x) | ★ ★ ★ ★ (2x) |
| Swedish (svenska) | sv | kaldi | ★ ★ ★ ★ (3x) | ☹ (1x) |
| Vietnamese (Tiếng Việt) | vi | kaldi | ★ ★ ★ ★ ★ (4x) | ☹ (1x) |

Legend

Each profile is given a ★ rating, indicating how accurate it was at transcribing a set of test WAV files. I’m considering anything below 75% accuracy to be effectively unusable (☹).

| Rating | Transcription Accuracy |
| --- | --- |
| ★ ★ ★ ★ ★ | [95%, 100%] |
| ★ ★ ★ ★ | [90%, 95%) |
| ★ ★ ★ | [85%, 90%) |
| ★ ★ | [80%, 85%) |
| ★ | [75%, 80%) |
| ☹ | [0%, 75%) |
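
The accuracy bands in the legend can be written as a small lookup. This is only an illustration of the intervals, with accuracy given as a fraction between 0 and 1. The function name `star_rating` is hypothetical.

```python
def star_rating(accuracy):
    """Map a transcription accuracy in [0, 1] to the legend's rating string."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    # Lower bounds are inclusive, matching the half-open intervals in the legend.
    for stars, lower_bound in ((5, 0.95), (4, 0.90), (3, 0.85), (2, 0.80), (1, 0.75)):
        if accuracy >= lower_bound:
            return " ".join("★" * stars)
    return "☹"  # below 75%: considered effectively unusable


print(star_rating(0.97))
print(star_rating(0.60))
```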

Profiles are tested in two conditions:

Closed
- All example sentences from the profile’s sentences.ini are run through Google WaveNet to produce synthetic speech
- The profile is trained and tested on exactly the sentences it should recognize (ideal case)
- This resembles the intended use case of voice2json, though real world speech will be less perfect
Open
- Speech examples are provided by contributors, VoxForge, or Mozilla Common Voice
- The profile is tested using the sample WAV files with the --open flag
- This (usually) demonstrates why it’s best to define voice commands first!

Transcription speed-up is given as (Nx) where N is the average ratio of real-time to transcription time. A value of 2x means that voice2json was able to transcribe the test WAV files twice as fast as their real-time durations on average. The reported values come from an Intel Core i7-based laptop with 16GB of RAM, so expect slower transcriptions on a Raspberry Pi.
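
As a worked example of the speed-up figure, the sketch below averages per-file ratios of audio duration to transcription time. Whether the reported values average per-file ratios or divide total durations is not stated here, so the per-file average is an assumption. The function name `average_speedup` is hypothetical.

```python
def average_speedup(measurements):
    """measurements: (audio_seconds, transcription_seconds) pairs, one per WAV file."""
    ratios = [audio / elapsed for audio, elapsed in measurements]
    return sum(ratios) / len(ratios)


# Two files, each transcribed in half of its real-time duration.
print(average_speedup([(10.0, 5.0), (6.0, 3.0)]))
```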

Contributing

Community contributions are welcome! There are many different ways to contribute:

Pull requests for bug fixes, new features, or corrections to the documentation
Help with any of the supported language profiles, including:
- Testing to make sure the acoustic models and default pronunciation dictionaries are working
- Translations of the example voice commands
- Example WAV files of you speaking with text transcriptions for performance testing
Contributing to Mozilla Common Voice
Assist other voice2json community members
Implement or critique one of my crazy ideas

Ideas

Here are some ideas I have for making voice2json better that I don’t have time to implement.

Yet Another Wake Word Library

Porcupine is the best free wake word library I’ve found to date, but it has two major limitations for me:

It is not entirely open source
- I can’t build it for architectures that aren’t currently supported
Custom wake words expire after 30 days
- I can’t include custom wake words in pre-built packages/images

Picovoice has been very generous to release porcupine for free, so I’m not suggesting they change anything. Instead, I’d love to see a free and open source wake word library that has these features:

Free and completely open source
Performance close to porcupine or snowboy
Able to run on a Raspberry Pi alongside other software (no 100% CPU usage)
Can add custom wake words without hours of training

Mycroft Precise comes close, but requires a lot of expertise and time to train custom wake words. Its performance is also unfortunately poorer than porcupine (in my limited experience).

I’ve wondered if Mycroft Precise’s approach (a GRU) could be extended to include Pocketsphinx’s keyword search mode as an input feature during training and at runtime. On its own, Pocketsphinx’s performance as a wake word detector is abysmal. But perhaps as one of several features in a neural network, it could help more than hurt.

Acoustic Models From Audiobooks

The paper LibriSpeech: An ASR Corpus Based on Public Domain Audio Books describes a method for taking free audiobooks from LibriVox and training acoustic models from them using Kaldi. For languages besides English, this may be a way of getting around the lack of free transcribed audio datasets! Although not ideal, it’s better than nothing.

For some languages, the audiobook approach may be especially useful with end-to-end machine learning approaches, like Mozilla’s DeepSpeech and Facebook’s wav2letter. Typical approaches to building acoustic models require the identification of a language’s phonemes and the construction of a large pronunciation dictionary. End-to-end approaches go directly from acoustic features to graphemes (letters), subsuming the phonetic dictionary step. More data is required, of course, but books tend to be quite long.

Android Support

voice2json uses pocketsphinx, Kaldi, and Julius for speech recognition. All of these libraries have at least a proof-of-concept Android build.

It seems feasible that voice2json could be ported to Android, providing decent offline mobile speech/intent recognition.

Browser-Based voice2json

Could Emscripten be used to compile WebAssembly versions of voice2json’s dependencies? Combined with something like pyodide, it might be possible to run (most of) voice2json entirely in a modern web browser.

voice2json is maintained by synesthesiam. This page was generated by GitHub Pages.
