Sphinx [http://cmusphinx.sourceforge.net/] is probably the best engine you could use at the moment.
If you want a hack of a solution, you could try implementing the methods listed here [http://mikepultz.com/2011/03/accessing-google-speech-api-chrome-11/] with your own decision tree of possible results.
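If you go that route, the decision tree can start as a simple lookup over the hypothesis strings the API hands back. A rough Python sketch (the phrases and actions here are made up for illustration; the API call itself is what the linked article covers):

```python
# Toy dispatcher over speech-API hypotheses, best guess first.
# The phrase -> action table is illustrative, not from the article.
def dispatch(hypotheses):
    actions = {
        'lights on': lambda: print('turning lights on'),
        'lights off': lambda: print('turning lights off'),
    }
    for hyp in hypotheses:
        action = actions.get(hyp.lower().strip())
        if action is not None:
            return action()
    print('no match for:', hypotheses)

dispatch(['Lights on', 'light son'])  # falls back through misheard variants
```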
What solutions have you looked into? There are several.
The CMU Sphinx Project by Carnegie Mellon is the most popular, and has been proven to outperform Google's engine in at least one research paper.
I have a BA in General Linguistics and an MA in Computational Linguistics. If I had it to do over again, I would have done an MS in Computer Science and tried to steer my CS projects toward NLP. This seems to me like the best way to get into NLP engineering, which I discovered along the way is actually what I wanted to do.
Probably the best thing you can do is develop your programming, math, and statistics skills and start doing personal programming projects in NLP. Mess around with CMU Sphinx, NLTK, take a Machine Learning course online, that kind of thing.
A PhD is a really valuable thing in the job market if you want to do NLP, surprisingly enough. I thought it would pretty much limit me to work in academia, but I was super wrong. One thing you might want to do is look on some job boards (I like Indeed and Dice) for some NLP/CompLing/ML job postings, identify the gaps between what companies are looking for and what you can do, and figure out how you can close those gaps in the next few years.
First of all this is a badass project, and is awesome in every way. I've been fooling around with pocketsphinx on a Beaglebone Black, and the recognition is very accurate regardless of voice. You can also improve recognition time by restricting the words the engine will listen for.
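For anyone curious, that restriction can be as simple as a keyword list file, one phrase per line with a detection threshold (the phrases and thresholds here are placeholders you'd tune against false alarms):

```
turn on the lights /1e-20/
turn off the lights /1e-20/
what time is it /1e-10/
```

You pass it with the -kws option (pocketsphinx_continuous -inmic yes -kws keyphrases.list, if I remember the flags right), and the engine will only ever report those phrases.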
CMU Sphinx (http://cmusphinx.sourceforge.net/) is a great speech-to-text engine. I've run it on a Raspberry Pi for simple speech commands, and it worked well. No internet connection required, by the way.
I'm working on a project that will eventually integrate voice recognition, so I always take note of these threads. The outcome is never very good though. It seems that the Linux ecosystem is really lacking in this area.
The name I hear most is CMU Sphinx. I've yet to really look into it, but it's almost certainly not what you're looking for.
There are some other suggestions, but rather than recopy them here, I'll just link to those threads. Maybe they'll help some of you out.
Ask Linux: Mature, stable speech recognition for Ubuntu?
It's been a while, so here it is again: Speech recognition on Linux?
Sphinx3 does phoneme recognition, but don't expect too much:
>... convert speech to a stream of phonemes rather than words. This is possible, although the results can be disappointing. The reason is that automatic speech recognition relies heavily on contextual constraints (i.e. language modeling) to guide the search algorithm. The phoneme recognition task is much less constrained than word decoding, and therefore the error rate (even when measured in terms of phoneme error for word decoding) is considerably higher. For mostly the same reason, phoneme decoding is quite slow.
If you still want to try it, here is one Sphinx3 phoneme -> IPA mapping I found.
I've had some success running Pocket Sphinx on a Raspberry Pi -- just so long as I created a corpus with a limited set of words. (If you are running on a full-powered system, that shouldn't be necessary.)
I have, however, been unable to get the Python libraries to build correctly, so I can't yet build out a programmatic interface.
Here are some of the links that helped get me this far:
Pretty much any grammar-based speech reco engine could be used for this off-the-shelf. On Windows you can use Voice Recognition Server (see Unity guide); a cross-platform option is CMU Sphinx. Some speech reco was already done for the early Voyager Bridge demo, not sure what system they used.
It'd be great however to have a solution that is fine-tuned and trained specifically for the Rift mic, which does not exist right now.
I'll take a stab at the interpreter half of your question. I did some research into it a couple years ago.
CMU has a great introduction to speech recognition here. The basic problem is to figure out what you think someone is going to say, and what you think their accent is going to be. Desktop-level speech recognition is limited to a set of expected queries ("what time is it?" "launch firefox" etc) in a fairly quiet environment. A speech recognition engine will come up with a bunch of potential mappings (is it more likely that you said '/u/andyflip' or 'you and he flip'?)
Siri & Google (and Amazon) level stuff is going to take advantage of their processing capacity. They can hand your audio clip off to N computers to figure out what words you said and what that probably means. And they have a ton of processing power that is idle over short amounts of time, so they can do some much more detailed guessing.
Not hard to believe. I wrote something similar using Sphinx and it had two modes of recognition:
What I did was "continuously listen" until I heard the trigger word and then begin recording until the volume returns to background levels. Then submit the recording to Sphinx for processing.
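In case it helps anyone, here's a bare-bones Python sketch of the record-until-quiet part (PyAudio, 16 kHz mono; the RMS threshold and silence window are placeholders you'd tune to your room):

```python
import audioop
import wave
import pyaudio

RATE, CHUNK, SILENCE_RMS, SILENCE_CHUNKS = 16000, 1024, 500, 30

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
frames, quiet = [], 0
while quiet < SILENCE_CHUNKS:       # ~2 s below threshold ends the capture
    data = stream.read(CHUNK)
    frames.append(data)
    quiet = quiet + 1 if audioop.rms(data, 2) < SILENCE_RMS else 0
stream.stop_stream(); stream.close(); pa.terminate()

with wave.open('capture.wav', 'wb') as wav_out:   # hand this file to Sphinx
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)
    wav_out.setframerate(RATE)
    wav_out.writeframes(b''.join(frames))
```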
In any case, since it uses your own WiFi you can snoop on what the Echo is sending up to Amazon and make sure it's not sending up constant recordings.
I'm using a custom C# wrapper around CMU's Pocketsphinx. I'm a fan since it has good recognition rates, runs locally (no sending data off to the cloud), and works across a good range of platforms (it was designed for use on lower-power devices like cell phones, but also works just fine on desktop OSes).
Native interop lets me call directly into libraries, with some thin wrapper classes to make things look more natural. From there most of the work is trying to integrate cleanly with Unity (e.g. properly capturing a continuous stream of audio using Microphone and re-encoding it for Pocketsphinx) and building utilities for semantic extraction to make it easier to use outputs (to avoid needing absurd amounts of fragile regexes to turn text into actions).
Sphinx is a standalone open-source voice recognition engine. It can be used for voice to text, but I'm not sure if it can be used for speaker recognition. People say it is documented fairly well; have a look.
http://cmusphinx.sourceforge.net/
Just as an anecdote: Pocketsphinx works great for me with American English. That suggests to me that the acoustic model (I'm assuming you switched from en-US to German Voxforge) isn't matching your accent. Have you tried training a new acoustic model?
Another option for real-time text-to-speech that sounds pretty good is pico2wave. And pocketsphinx, properly wielded, can actually work to do speech recognition. For example.
The first package you might try is Sphinx (not to be confused with Sphinx Search). http://cmusphinx.sourceforge.net/ http://blends.debian.org/accessibility/tasks/speechrecognition
I know nothing of the program; this is google-fu.
What platform are you targeting? I'll assume PC.
For speech-to-text:
for Java, there is Sphinx
for .NET, there's Microsoft Speech SDK (Windows only)
Both pretty easy to use.
Now the question is what kinds of text do you want (free-form dictation vs grammar-based) and what do you want to do with the acquired text.
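For the grammar-based case, Sphinx-family engines (and most SAPI engines) take a JSGF grammar that pins down exactly what can be said. A toy example, just to show the shape (the rule and phrases are mine):

```
#JSGF V1.0;
grammar commands;
public <command> = (open | close) the (door | window) | what time is it;
```

Free-form dictation instead uses a statistical language model, which is where recognition gets much harder.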
You want CMU's open source Sphinx speech recognition project. It has something you can run on your server, as well as a whole bunch of nice write-ups about how it all works and hooks for experimentation. Have used it in grant-funded projects: highly recommend.
Is there any reason why you can't use the API and programs from this tutorial?
Alternatively, if some lag is acceptable, you could use temporary files with sound samples from the microphone and process responses in discrete chunks a few seconds long.
Is this helpful or am I missing the point of your question?
You can use a lot of things for speech. If you're on Windows you can just use Microsoft's Speech APIs; they are OK. If you want open source, try Sphinx.
If I remember correctly, Java only defined the speech recognition and text-to-speech interfaces; they do not provide implementations, you have to find those from some third party. Sphinx is an open source speech library from CMU.
I benchmarked Kaldi and Julius in 2013, and they were terrible for my application with error rates in the high 40%s when Sphinx-3 was in the low 20%s. Did they get a lot better lately?
What is your source for the "obsolete" claim?
See line 687 in http://cmusphinx.sourceforge.net/doc/pocketsphinx/fsg__search_8c_source.html
GMM scoring is part of the inner loops of the Viterbi beam search. Yes I've profiled it. I've literally been working with cmusphinx source code since 1989, or even earlier if you count very isolated module optimizations such as log addition tables.
Are there any open source DNN recognizers?
The go-to FOSS speech recognizer is Sphinx, but unfortunately, it doesn't have any .NET bindings that I can find easily. It's non-trivial (but not impossible) to build a wrapper around the API.
I know .NET 3.0 had a built-in speech recognition API. I have no idea if it's been migrated to Mono or Core.
I found this link: https://www.lifewire.com/state-of-linux-voice-recognition-2204883 and then looked at Sphinx (http://cmusphinx.sourceforge.net/wiki/download/). I have downloaded and compiled it. I will probably do multiple scripts (too lazy to write one big one).
Pass 1:
For each file: Convert mp3 to wav: ffmpeg -i inputfilename.mp3 -acodec pcm_s16le -ar 16000 outputfilename.wav
then run Sphinx on that file: pocketsphinx_continuous -dict /usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic -infile voice2.wav -lm cmusphinx-5.0-en-us.lm 2>voice2.log
Take the logs and throw them in a folder.
Pass 2: Grep the folder for instances of the sound tag (assuming Sphinx can detect it) and generate a kill list. (Rough script sketch below.)
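Here's roughly what those passes look like glued together in one Python script (paths, the model files, and the SOUND_TAG string are placeholders standing in for whatever the tag actually looks like in the logs):

```python
import pathlib
import subprocess

LOGS = pathlib.Path('logs')
LOGS.mkdir(exist_ok=True)

# Pass 1: convert each mp3 and run pocketsphinx, one log per file.
for mp3 in pathlib.Path('input').glob('*.mp3'):
    wav = mp3.with_suffix('.wav')
    subprocess.run(['ffmpeg', '-i', str(mp3),
                    '-acodec', 'pcm_s16le', '-ar', '16000', str(wav)],
                   check=True)
    with open(LOGS / (mp3.stem + '.log'), 'w') as log:
        subprocess.run(['pocketsphinx_continuous',
                        '-dict', '/usr/share/pocketsphinx/model/lm/en_US/cmu07a.dic',
                        '-lm', 'cmusphinx-5.0-en-us.lm',
                        '-infile', str(wav)],
                       stdout=log, stderr=subprocess.STDOUT, check=True)

# Pass 2: grep the logs for the tag and build the kill list.
kill_list = [log.name for log in sorted(LOGS.glob('*.log'))
             if 'SOUND_TAG' in log.read_text()]
print('\n'.join(kill_list))
```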
Have you looked at pocketsphinx? Apparently it has keyword detection now, but what I've done in the past is use the Python bindings in "continuous listen" mode and then, as soon as your keyword is triggered, begin recording. Once you stop talking (a second of silence), it sends the recording to Google or whoever to do the STT. That way, all you really have to train is the keyword.
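For the curious, the keyword-spotting half is only a few lines with the pocketsphinx Python package's LiveSpeech helper (the keyphrase and threshold below are placeholders to tune against false triggers):

```python
from pocketsphinx import LiveSpeech

# Listen continuously for one keyphrase; no full language model needed.
speech = LiveSpeech(lm=False, keyphrase='hey computer', kws_threshold=1e-20)
for phrase in speech:
    print('keyword heard:', phrase)
    # ...start recording here, then ship the clip to a cloud STT service
```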
Chriscicc, have you tried the Kinect in "dumb" audio mode? I'm curious why you say it is critical to use the MS SDK.
The open Linux driver for the Kinect, libfreenect, has support for the audio system on the Kinect. You can record (or pipe) each of the 4 microphones simultaneously. Of course having 4 wav streams is far from having a useful beamformed signal, but really the beamforming is for voice localization, not speech recognition. Maybe they use it for noise suppression, but with the mics being so close together I'm not sure how that would work.
You can also get the motion information (RGB-D cloud, etc.) via libfreenect simultaneously.
This is far from "official support", I agree, but it is technically already possible to pipe the audio streams, or blend them (via sox), over to a voice recognition SDK. For a cloud-based approach, you can use the Google speech recognition API:
http://www.webupd8.org/2014/02/linux-speech-recognition-using-google.html
For an offline approach try sphinx:
It's really good for voice commands and limited dictation; it's not as good as Siri or Google Now, but it performs well. The code is around 18M, and the models (acoustic and language) are around 12M depending on how you get/generate them. More here: http://cmusphinx.sourceforge.net/wiki/
Would upvote twice, if I could. There is open source speech-to-text software that will run on a Raspberry Pi and doesn't require an internet connection (http://cmusphinx.sourceforge.net/). More robust, more privacy.
If you think Youtube CCs are hilariously bad, wait until you see transcriptions generated by pocketsphinx, which most people use to do transcriptions locally :) I just tried to offer a "less worse" option.
Keyword then Google voice recognition is exactly what I've done. I just used an old Android phone (with an external mic) as the hardware. Sphinx currently has a comparison of recognition accuracy vs dictionary and language model size on their website.
Since you don't mind some programming, how about CMU Sphinx? It is probably the most cutting-edge open-source voice recognition software out there. They actually have a post on their website right now talking about the performance of various dictionary sizes WRT error rates and memory size.
It has the same accuracy, just a smaller number of available languages and words. I have written an Android app with PocketSphinx and it handles keyword detection and full-dictionary recognition quite nicely. For custom words or custom languages the packages get very large, though, so it's more economical (process-wise) to send out the recording for off-device recognition if you want to be able to handle lots of variation. For my purposes, though, I wanted to not share that spoken data, and I only needed to handle US English with a standard dictionary, so all's easy on-device.
Do you have recommendations for speech recognition software? I have tried to use cmusphinx but I always got strange results.
Modern systems are usually built from two major components: an acoustic model and a language model. The first is trained on a large database of recorded speech that has already been transcribed. The second component (the language model) is trained on a large amount of text written in the language you want to model.
An example of open source software for building the first component: http://cmusphinx.sourceforge.net/wiki/
A similar example for the second component:
http://sourceforge.net/apps/mediawiki/irstlm
A deeper understanding of how these components work requires knowledge of statistics, optimization methods, and machine learning.
Incidentally, as an example, anyone can train similar systems to translate from, say, Turkish to Japanese, provided they have a relatively large amount of parallel Turkish-Japanese text and a large amount of Japanese text. You don't need to know either Turkish or Japanese to do it.
Thanks for the response, and for pointing me in the right direction.
I have more questions, sorry if this is asking too much.
>You should concentrate on the Viterbi algorithm first, using the more complex Baum-Welch algorithm is often not worth the trouble.
Please pardon my ignorance, but it seems to me like you're comparing Viterbi vs. Baum-Welch here - aren't they completely different algorithms? From my understanding, Viterbi is for testing HMMs' state sequence likelihoods based on an observed sequence, while Baum-Welch is for optimizing an HMM's state sequence probability given a training sequence? Or - since Baum-Welch iteratively changes the transition probs - were you implying that if I used fixed transition probs, I could skip Baum-Welch and move on to Viterbi testing?
Also, by fixed transition probs, did you mean discrete pdfs - as in segregating observation vectors into clusters (VQ/k-means/etc.?), corresponding to the states in the word model? How would it be different if I used individual phonemes as clusters/states, rather than including the phoneme transitions, or using biphones/triphones? Am I correct in assuming that the deltas & double-deltas are sufficient to account for the phoneme transitions?
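While I'm at it, here's the Viterbi decoding step as I currently understand it, as a toy NumPy sketch (discrete observations, log probabilities; all names are mine) - mostly so someone can tell me if I have the algorithm itself wrong:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for discrete observations obs.
    pi: (S,) initial probs; A: (S, S) transitions; B: (S, O) emissions."""
    with np.errstate(divide='ignore'):          # log(0) -> -inf is fine here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, S = len(obs), len(pi)
    delta = np.empty((T, S))                    # best log-prob ending in each state
    psi = np.zeros((T, S), dtype=int)           # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: state i -> state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]            # backtrace from best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Tiny smoke test: 2 states, 2 observation symbols.
print(viterbi([0, 1, 1],
              pi=np.array([0.6, 0.4]),
              A=np.array([[0.7, 0.3], [0.4, 0.6]]),
              B=np.array([[0.9, 0.1], [0.2, 0.8]])))
```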
BTW, here are the specifications and feature vector components that I used for each frame: can anyone tell if they're optimal? Redundant?
1 = log energy = log of the mean of the squares of the frame samples
2:13 = 12 Mel Frequency Cepstral Coefficients (MFCCs). I used the same specs as in the linked paper:
20 overlapping triangular mel filterbanks (only args(3:14) used)
frequency range: 0-8 kHz (Nyquist frequency = 16 kHz sampling rate / 2)
14:25 = delta MFCCs aka velocities (like this)
26:37 = double-delta MFCCs aka accelerations (see above link)
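If anyone wants to poke at the same features, here's roughly how that 37-dimensional frame vector comes together, sketched with the python_speech_features package (an assumption on my part - swap in your own front end; the filterbank count matches the spec above, the window sizes are the package defaults):

```python
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc, delta

rate, signal = wav.read('utterance.wav')   # assumes 16 kHz mono PCM
feats = mfcc(signal, samplerate=rate, numcep=13, nfilt=20,
             lowfreq=0, highfreq=8000, appendEnergy=True)
log_e = feats[:, :1]   # col 1: log frame energy (total, not mean -- off by a constant)
ceps = feats[:, 1:]    # cols 2:13 -- the 12 MFCCs
d1 = delta(ceps, 2)    # cols 14:25 -- velocities
d2 = delta(d1, 2)      # cols 26:37 -- accelerations
frames = np.hstack([log_e, ceps, d1, d2])  # shape: (num_frames, 37)
print(frames.shape)
```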
Just noticed this got some upvotes. If it helps any, I'm an IT project manager and can contribute in that manner. I've only used linux for a few months now, but am getting more comfortable every day.
Voxforge is a database to collect transcriptions for open source speech recognition engines such as CMU Sphinx. OpenClipart.org might also be a good starting place for FOSS pecs/pictures.
edit: manner not manor... and formatting.
What you want to do is called forced alignment - where you take a speech recognition engine but instead of just giving it the audio and asking it to find words, you give it the transcription and just ask it to align the transcription with the audio.
Forced alignment is much faster and more accurate than general speech recognition. So your idea is great. It could be real-time.
You don't want to build your own speech recognition engine, though. It's extremely difficult, would take years to study the theory and build a system, and you'd also need tons of data that would be very expensive to collect. However, you could download and use an existing speech recognition engine. For example, CMU Sphinx is an open-source speech reco engine, or you could license a commercial engine from some place like Nuance.
Learning to use an existing engine to do forced alignment and making a bouncing-ball animation is not a beginner-level project, but it's something an intermediate programmer could do in a few weeks. It would definitely be a fun challenge. My main advice would be to break the problem up into a bunch of smaller pieces rather than trying to build the whole thing at once.
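A hedged sketch of what the glue code might look like: the open-source Gentle aligner (built on Kaldi rather than Sphinx, to be clear) runs locally and exposes an HTTP endpoint for exactly this. The field names and response shape below are from memory of Gentle's README, so verify against the project before relying on them:

```python
import requests

# Gentle listens on localhost:8765 by default when run as a server.
with open('speech.wav', 'rb') as audio, open('transcript.txt', 'rb') as text:
    resp = requests.post('http://localhost:8765/transcriptions?async=false',
                         files={'audio': audio, 'transcript': text})
resp.raise_for_status()

# Each aligned word carries start/end times in seconds --
# exactly what a bouncing-ball animation needs.
for word in resp.json().get('words', []):
    print(word.get('alignedWord'), word.get('start'), word.get('end'))
```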
Unfortunately we work more on the theoretical side of things. To implement a custom solution would take some effort, but there may be a relevant open-source project based on Sphinx. Sorry that I can't be of more help!
http://en.wikipedia.org/wiki/CMU_Sphinx
Contact any of the PhDs on this list: http://cmusphinx.sourceforge.net/wiki/research/
and see if they can use your services or want to perform any tests or interviews with you.
According to the Oracle website, they have defined a speech API but do not provide an implementation themselves. You have already found what appears to be the most commonly used implementation, Sphinx4, although there are others. Here is the sourceforge page for Sphinx4, which has information about installing and using it, and here is the wiki (also on sourceforge) which has more documentation and tutorials.