The idea is to use streaming. The goal is for the sound to start playing after 200–400 ms, instead of waiting for the entire sentence to be generated.
Pocket-TTS offers a function called "tts_model.generate_audio_stream(voice_state, text)".
It returns small audio chunks as they are generated.
So you can receive a chunk, play it immediately, and continue receiving the next ones.
from pocket_tts import TTSModel
import numpy as np

# Load the model and voice only once
tts_model = TTSModel.load_model()
voice_state = tts_model.get_state_for_audio_prompt("alba")

def stream_tts(text):
    # Returns a list of audio chunks (numpy arrays)
    chunks = []
    for chunk in tts_model.generate_audio_stream(voice_state, text):
        chunks.append(chunk)
    return chunks
With this, each chunk can be played as soon as it arrives, which greatly reduces latency. Note, however, that stream_tts as written collects every chunk into a list before returning, so the caller only gets audio once the whole sentence is generated; for true streaming, each chunk has to be handed over the moment it is produced.
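A minimal sketch of that incremental pattern, with a hypothetical fake_audio_stream generator standing in for tts_model.generate_audio_stream (raw bytes stand in for the numpy chunks):

```python
def fake_audio_stream(text):
    # Stand-in for tts_model.generate_audio_stream(voice_state, text):
    # yields small audio blocks one at a time instead of all at once.
    for _ in range(4):
        yield b"\x00\x00" * 1024  # one block of silent 16-bit samples

def stream_tts_incremental(text, on_chunk):
    # Hand each chunk to the caller the moment it exists,
    # instead of collecting the whole sentence into a list first.
    count = 0
    for chunk in fake_audio_stream(text):
        on_chunk(chunk)  # e.g. queue it for immediate playback
        count += 1
    return count

received = []
n = stream_tts_incremental("hello", received.append)
```

With a callback (or a generator consumed chunk by chunk on the host side), playback of chunk 1 can start while chunk 2 is still being synthesized.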
Private Sub Button1_Click
  StreamSpeak(TextArea1.Text)
End Sub

Private Sub StreamSpeak(Text As String)
  Dim PyStream As PyWrapper = Py.ImportModule("python_streaming")
  Dim Result As PyWrapper = PyStream.Run("stream_tts").Arg(Text)
  ' Result = list of numpy chunks
  Dim chunks As List = Result.ToList
  For Each chunk As PyWrapper In chunks
    PlayChunk(chunk)
  Next
End Sub

Private Sub PlayChunk(chunk As PyWrapper)
  ' Convert numpy chunk → WAV bytes.
  ' IO, ScipyWavfile, Winsound and TTSModel are assumed to have been set up
  ' once elsewhere, e.g. Py.ImportModule("io"), Py.ImportModule("scipy.io.wavfile"),
  ' Py.ImportModule("winsound"), and the wrapper holding the loaded tts_model.
  Dim buffer As PyWrapper = IO.Run("BytesIO")
  ScipyWavfile.Run("write").Arg(buffer).Arg(TTSModel.GetField("sample_rate")).Arg(chunk)
  Winsound.Run("PlaySound").Arg(buffer.Run("getvalue")).Arg(Winsound.GetField("SND_MEMORY"))
End Sub
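The conversion done in PlayChunk can also be checked on the Python side without scipy. A sketch using only the standard library's wave module, assuming 16-bit mono PCM chunks and a hypothetical 24000 Hz sample rate (real pocket_tts chunks may be float arrays that first need scaling to int16, and the actual rate comes from tts_model.sample_rate):

```python
import io
import wave

def chunk_to_wav_bytes(samples_int16: bytes, sample_rate: int = 24000) -> bytes:
    # Wrap raw 16-bit mono PCM in a WAV header so winsound
    # (or any other player) can play it straight from memory.
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(samples_int16)
    return buffer.getvalue()

wav_bytes = chunk_to_wav_bytes(b"\x00\x00" * 480)
```

The returned bytes start with a standard 44-byte RIFF/WAVE header followed by the raw samples, which is exactly what winsound.PlaySound expects with the SND_MEMORY flag.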
With this approach, the first chunk arrives after roughly 200–400 ms: the sound starts almost immediately, and the rest of the phrase keeps generating during playback. The perceived latency is very low.
Untested code.