Android Question VOSK Speech to Text - Text to Speech crash

xldaedalus · Jun 4, 2024

I'm a newbie to Java / Android / B4A.

I'm writing an app using Biswajit's Speech to Text. That part works great. Once I have created an action from the text, I want the app to speak the result so the user knows what the app is actually doing. The speaking is done through a JavaObject and works great using the voice engines on the phone. However, the crash occurs once speaking completes, so I'm not doing something right. Here's the problem code. Line 14 appears to be the first sign of trouble, and STT.startListening doesn't / can't start. Below the code is the unfiltered log. I assume a "frame" is something like a system clock counter? So I don't know what skipped frames are. I added and removed Try / Catch to catch exceptions, but whether it is there or not makes no difference.

Any ideas, suggestions would be greatly appreciated.

B4X:

Sub speak_now( cTxt As String )
    If cTxt.Length>0 Then
        Try
            Log("Start Speaking")
            tts.Speak(cTxt, True )
            Dim jo As JavaObject = tts
            Log("Start Speaking after sleep")

            Do While jo.RunMethod("isSpeaking",Null)
                Sleep(250)
                Log("Speaking")
            Loop
            '
            Log("Done Speaking")

            timer.Enabled=True
            Log("Timer Enabled")
            STT.prepareMicrophone(jsontext)
            Log("microphone prepared")
            ''            STT.prepareMicrophone("")
            If STT.startListening(-1) Then
                Log("Speak_Now failed...")
            Else
                Log("TTS re-started...")
            End If
        Catch
            Log("Speak_Now Catch" & LastException)
        End Try
    End If
End Sub

Start Speaking
Skipped 93 frames! The application may be doing too much work on its main thread.
Davey! duration=1572ms; Flags=0, FrameTimelineVsyncId=2821073, IntendedVsync=1167105536958399, Vsync=1167107090707074, InputEventId=0, HandleInputStart=1167107100661864, AnimationStart=1167107100666969, PerformTraversalsStart=1167107100670041, DrawStart=1167107101106187, FrameDeadline=1167105553625065, FrameInterval=1167107098855614, FrameStartTime=16706975, SyncQueued=1167107101774677, SyncStart=1167107102330198, IssueDrawCommandsStart=1167107102572802, SwapBuffers=1167107105736760, FrameCompleted=1167107110042750, DequeueBufferDuration=40938, QueueBufferDuration=2910990, GpuCompleted=1167107107814052, SwapBuffersCompleted=1167107110042750, DisplayPresentTime=0, CommandSubmissionCompleted=1167107105736760,
Start Speaking after sleep
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Speaking
Done Speaking
Timer Enabled
UpdateGrammarFst():recognizer.cc:287) ["hey", "joe", "jill", "yo", "set", "channel", "value", "of", "oh", "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "percent", "intensity", "saturation", "cct", "kelvin", "fixture", "query", "on", "off", "all", "up", "down", "degrees", "point", "type"]
Estimate():language_model.cc:142) Estimating language model with ngram-order=2, discount=0.5
OutputToFst():language_model.cc:209) Created language model with 54 states and 106 arcs.
Unexpected event (missing RaiseSynchronousEvents): stt_readytolisten
Check the unfiltered logs for the full stack trace.
java.lang.Exception: Stack trace
at java.lang.Thread.dumpStack(Thread.java:1615)
at anywheresoftware.b4a.shell.Shell.raiseEventImpl(Shell.java:314)
at anywheresoftware.b4a.shell.Shell.raiseEvent(Shell.java:255)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.ShellBA.raiseEvent2(ShellBA.java:157)
at anywheresoftware.b4a.BA.raiseEvent(BA.java:201)
at com.biswajit.vosk.SpeechToText.prepareMicrophone(SpeechToText.java:116)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.shell.Shell.runVoidMethod(Shell.java:777)
at anywheresoftware.b4a.shell.Shell.raiseEventImpl(Shell.java:354)
at anywheresoftware.b4a.shell.Shell.raiseEvent(Shell.java:255)
at java.lang.reflect.Method.invoke(Native Method)
at anywheresoftware.b4a.ShellBA.raiseEvent2(ShellBA.java:157)
at anywheresoftware.b4a.BA.raiseEvent2(BA.java:205)
at anywheresoftware.b4a.BA.raiseEvent(BA.java:201)
at anywheresoftware.b4a.shell.DebugResumableSub$RemoteResumableSub.resume(DebugResumableSub.java:22)
at anywheresoftware.b4a.keywords.Common$14.run(Common.java:1748)
at android.os.Handler.handleCallback(Handler.java:942)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loopOnce(Looper.java:201)
at android.os.Looper.loop(Looper.java:288)
at android.app.ActivityThread.main(ActivityThread.java:7920)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:549)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:942)
microphone prepared
Speak_Now failed...
Speak_Now Catch(Exception) Not initialized
STT ready
STT Failed to Start...
Unable to match the desired swap behavior.
Expecting binder but got null!
Skipped 36 frames! The application may be doing too much work on its main thread.
Skipped 62 frames! The application may be doing too much work on its main thread.
Davey! duration=1056ms; Flags=0, FrameTimelineVsyncId=2823206, IntendedVsync=1167165030295356, Vsync=1167166063628648, InputEventId=0, HandleInputStart=1167166077388925, AnimationStart=1167166077396790, PerformTraversalsStart=1167166077403613, DrawStart=1167166077914602, FrameDeadline=1167165046962022, FrameInterval=1167166076824759, FrameStartTime=16666666, SyncQueued=1167166079759186, SyncStart=1167166080399186, IssueDrawCommandsStart=1167166080802988, SwapBuffers=1167166083309134, FrameCompleted=1167166087738509, DequeueBufferDuration=64375, QueueBufferDuration=952552, GpuCompleted=1167166087738509, SwapBuffersCompleted=1167166085466373, DisplayPresentTime=1167107139784208, CommandSubmissionCompleted=1167166083309134,
Skipped 36 frames! The application may be doing too much work on its main thread.
Skipped 40 frames! The application may be doing too much work on its main thread.
Skipped 153 frames! The application may be doing too much work on its main thread.

Erel · Jun 4, 2024

Unexpected event (missing RaiseSynchronousEvents): stt_readytolisten
Check the unfiltered logs for the full stack trace.

This is not a crash. It means that there is a difference between the behavior in debug mode and release mode (the event is not raised immediately, it is related to a missing declaration in the library). Test it in release mode.

xldaedalus · Jun 4, 2024

Well, it seems like a "crash" because the STT stops completely and won't re-start. I have to re-compile each time I hit this function, which makes it hard to debug the other areas of the app I'm working on.

I think I may have found the problem. STT.stop is called before this function is called. I thought maybe calling STT.prepareMicrophone before re-starting the STT was panicking the process. I don't know how to "re-start" the STT, so I commented out the code that calls it:

        timer.Enabled=True
        Log("Timer Enabled")
        STT.prepareMicrophone(jsontext)
        Log("microphone prepared")
        ''            STT.prepareMicrophone("")
        If STT.startListening(-1) Then
            Log("Speak_Now failed...")
        Else
            Log("TTS re-started...")
        End If

And call the first STT "init" function and re-initialize the STT from the function that calls this one. This seems to have solved the problem. But as I don't yet know how all this works together, I realize I may be creating other problems I am, as yet, unaware of. I need to test it a lot more.

I have to be on the road for a week or so, but I will try as you suggest and get back here. I didn't know there might be a difference between debug and release, but that makes sense.

Thanks so much for your help. This is a great program and I'm enjoying myself very much! Biswajit has done an amazing job with the VOSK STT.

Next, I want to figure out how I can get my own AI voice and add it to the TTS.

Thanks again.

drgottjr · Jun 5, 2024

tts and stt function asynchronously; they tell you when they're ready to do something
or have finished doing it. you can't keep pestering them like a small child in a car
asking a parent over and over, "are we there? are we there? are we there?"

prepareMicrophone raises the ReadyToListen event. you cannot proceed until
you consume that event.
eg, your statements:

B4X:

STT.prepareMicrophone(jsontext)
log("microphone prepared")

are wrong. you don't know that the microphone, in fact, is prepared. just saying it's
prepared doesn't make it so. the preparation takes time. and every time you stop
stt, you have to go through the preparation again. this is a particular problem when
you combine stt with tts. (ideally, you wouldn't stop listening until you were finished.)

you are trying to do too much with your try/catch block. both tts and stt have error
reporting events. you should handle them.

after you prepareMicrophone and the ReadyToListen event is raised, you should log
"ready" in the event. you will see there is a noticeable pause before "ready". you can't
call StartListening until ReadyToListen is raised and handled. if you "startlistening"
before "readytolisten", the call fails.
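
a rough sketch of that sequence, for illustration only. it assumes STT, tts and jsontext are declared and initialized elsewhere (as in the code at the top of the thread), that the library was initialized with the event name "STT", and that the ReadyToListen event takes no parameters; none of those details are confirmed here, so check them against the library's own example. the meaning of startListening's return value isn't confirmed either, so the sketch only logs it.

```b4x
' Sketch only, under the assumptions stated above.
Sub speak_now(cTxt As String)
    If cTxt.Length = 0 Then Return
    tts.Speak(cTxt, True)
    Dim jo As JavaObject = tts
    ' Poll without blocking the main thread until speech has finished.
    Do While jo.RunMethod("isSpeaking", Null)
        Sleep(250)
    Loop
    ' Only request preparation here; do not call startListening yet.
    STT.prepareMicrophone(jsontext)
End Sub

' Raised by the library once the microphone is prepared.
' Assumption: the event has no parameters.
Sub STT_ReadyToListen
    Log("STT ready")
    ' -1 means wait indefinitely. The return value is only logged
    ' because its meaning is not confirmed in this thread.
    Log("startListening returned: " & STT.startListening(-1))
End Sub
```

the point of the split is that startListening is only ever called from the event sub, so it runs after the library has reported that it is ready, whether the event arrives immediately or after a delay.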

i can't imagine what your timer is for; StartListening(-1) means wait indefinitely.

the original example which came with the library shows clearly how the vosk library works.
it uses an older method of consuming events, but it still works. you can easily remove it
and use waitfor instead.

i have used the vosk library with android's tts. i've even used it to listen in one language
and speak the translation back in another.

Talk To The Hand (Redux)

So, here is Mi Dica II. The initial version https://www.b4x.com/android/forum/threads/talk-to-the-hand.143453/ produced a half-duplex speech recognition/translation app using Google's Voice Recognition engine, MLKit's on-device translator and Android's TTS engine. Basically, I speak in my...

www.b4x.com

xldaedalus · Jun 18, 2024

Thanks for the tips drgottjr. I don't know what they mean, but I will do a bit more research to understand, as you suggest. FYI, Erel is right, if I run the code in release mode, the STT re-starts per the given code. So, until I figure out how to do it properly, I'll either run the Bridgelogger for debug statements or I'll just re-initialize the STT.

The voice detection works well, but if you hesitate too long, it does cut you off. So, it's a question of what counts as "finished speaking".

Thank you for taking the time to reply.

drgottjr · Jun 19, 2024

vosk can stop listening after a certain period of time regardless of whether or not someone is speaking.
it will also stop when it detects silence (as defined by it). vosk also has some latency issues which can
cause a timeout. you have to crank it back up, wait for the event signaling it's ready and continue (or
repeat what you were saying before being cut off). in some cases, it is easy to overload it, and it can't
recover.
if you are hoping to use the software in a hospital emergency room setting to direct, eg, brain surgery operations,
you might want to re-assess. it's free, it's fun, it mostly works, but you have to be able to deal with its limitations
in a timely manner.

xldaedalus · Jun 26, 2024

Definitely not doing brain surgery! Haha. Yes, I do realize there are limitations. I'm thinking I might eventually use VOSK only when no internet is available. But, for now, I think it's good enough for my purpose. It is a lot of fun working on the concept of converting language into action, giving a machine a sense of humor. A good challenge to be sure. Is your translation app up and running? It seems a great idea. I have used apps like Google Translate while abroad, but without internet or cell service data, it doesn't work so well, esp when it looks like the cabbie is taking you toward the bad part of town. I wonder if there is a way to add some tonal modulation analysis to sense for stress, ie lying. Hmmm.

Thanks again for your advice.
