# TTS Modules
## Overview
The TTS layer is defined by `TTSInterface`. All engines generate a NumPy audio array plus a sample rate, which the brain plays via `sounddevice`. The active engine is selected with `tts_provider` in the config.
```
src/modules/tts/
├── edge_tts_wrapper.py      Microsoft EdgeTTS (free, online)
├── kokoro_tts_wrapper.py    Kokoro ONNX (local, no API)
└── orpheus_tts_wrapper.py   Orpheus (API, high quality)
```
## Interface Contract
```python
class TTSInterface(ABC):
    @abstractmethod
    async def generate_audio(self, text: str) -> tuple[np.ndarray, int]: ...
    @abstractmethod
    async def speak(self, text: str, output_device_id: int) -> None: ...
    @abstractmethod
    def reload_config(self, config: BrainConfig) -> None: ...
```
The brain always calls `generate_audio()` and handles playback itself via `sounddevice.play()`; keeping the audio buffer at the brain level is what enables interrupt/resume functionality. `speak()` is also an abstract method, so implementations must provide it, even if only as a thin wrapper around `generate_audio()`. The Kokoro wrapper includes a working `speak()` for direct use; the brain itself never calls it.

**Important for custom TTS engines:** if you omit `speak()` from your implementation, Python will raise `TypeError` at instantiation time, because it is declared `@abstractmethod` in `TTSInterface`.
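As an illustration (the subclass name here is hypothetical), a class that omits `speak()` fails at construction, not at call time:

```python
class IncompleteTTS(TTSInterface):
    async def generate_audio(self, text: str) -> tuple[np.ndarray, int]: ...
    def reload_config(self, config: BrainConfig) -> None: ...

IncompleteTTS()
# TypeError: Can't instantiate abstract class IncompleteTTS ...
# (exact message varies by Python version; 'speak' is the missing method)
```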
## Providers
### EdgeTTS (`edge_tts_wrapper.py`)
- **Library:** `edge-tts`
- **Cost:** Free (uses Microsoft Edge's TTS API)
- **Config keys:** `tts_voice`, `tts_pitch`, `tts_rate`, `tts_volume`
Generates audio to a temporary MP3 file, reads it back as a NumPy array via `soundfile`, then deletes the temp file. Each generation uses a unique UUID filename to avoid collisions during concurrent calls.
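A minimal sketch of that flow, assuming `edge-tts` ≥ 6.1 (for the `pitch` keyword) and an MP3-capable `libsndfile`; the wrapper's exact filename scheme may differ:

```python
import os
import uuid

import edge_tts
import soundfile as sf

async def generate_audio(text: str, voice: str, pitch: str, rate: str, volume: str):
    tmp = f"tts_{uuid.uuid4().hex}.mp3"  # unique name avoids concurrent-call collisions
    await edge_tts.Communicate(text, voice, pitch=pitch, rate=rate, volume=volume).save(tmp)
    try:
        audio, sample_rate = sf.read(tmp)  # decode the MP3 into a NumPy array
    finally:
        os.remove(tmp)  # temp file is always cleaned up
    return audio, sample_rate
```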
Voice format: `"it-IT-IsabellaNeural"`, `"en-US-AvaNeural"`, etc.
Full voice list: `edge-tts --list-voices`
```python
tts = EdgeTTSWrapper(voice="en-US-AvaNeural", pitch="+5Hz", rate="+10%", volume="+33%")
audio, sr = await tts.generate_audio("Hello!")
```
**Constructor note:** `EdgeTTSWrapper.__init__` also accepts an `output_file` parameter (default: `"temp_tts.mp3"`). This parameter is vestigial: `generate_audio()` ignores it and always uses a unique UUID-based filename to prevent collisions during concurrent calls, so it is safe to omit. The class-level default for `voice` is `"en-US-JennyNeural"`; at runtime the value from `BrainConfig.tts_voice` (`"en-US-AvaNeural"`) is always passed explicitly.
**Default mismatch note (EdgeTTS):** The class-level constructor defaults for `pitch`, `rate`, and `volume` are `"+0Hz"`, `"+0%"`, and `"+0%"` respectively, which differ from the `BrainConfig` defaults of `"+5Hz"`, `"+10%"`, and `"+33%"`. The brain always passes the config values explicitly, so the class defaults only matter if `EdgeTTSWrapper` is instantiated directly without arguments (e.g. in tests or standalone usage).
**Dead method note:** `EdgeTTSWrapper` contains a vestigial `generate_audio(self, text: str, filename: str) -> None` definition (the original helper that wrote to a fixed filename). Python silently shadows it with the second `generate_audio(self, text: str) -> tuple[np.ndarray, int]` definition, which is the one that actually executes. The first definition is unreachable and has no effect at runtime.
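This is ordinary Python behavior: when a class body binds the same name twice, the later binding wins. A standalone illustration:

```python
class Demo:
    def greet(self) -> str:
        return "first"   # created, then immediately replaced below

    def greet(self) -> str:
        return "second"  # this binding is what Demo.greet refers to

print(Demo().greet())  # prints "second"; the first definition is unreachable
```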
### Kokoro ONNX (`kokoro_tts_wrapper.py`)
- **Library:** `kokoro-onnx`
- **Cost:** Free (runs entirely locally)
- **Config keys:** `kokoro_model`, `kokoro_voices_file`, `kokoro_voice`, `kokoro_speed`, `kokoro_lang`
Runs the Kokoro TTS model locally via ONNX Runtime. No internet connection required after downloading the model files. Best for privacy or offline use.
**Model files:** `kokoro-v0_19.onnx` and `voices.bin` are downloaded automatically on first launch if missing (from GitHub Releases, ~125 MB total); no manual download is needed. To use a custom path, update `kokoro_model` and `kokoro_voices_file` in `config.json`.
Voice examples: `af_bella`, `af_sarah`, `am_adam`, `bf_emma`
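For standalone experimentation, a minimal sketch using the `kokoro-onnx` library directly (API as documented in its README; file paths match the defaults above):

```python
from kokoro_onnx import Kokoro

kokoro = Kokoro("kokoro-v0_19.onnx", "voices.bin")  # model + voice embeddings
samples, sample_rate = kokoro.create(
    "Hello from Kokoro!", voice="af_bella", speed=1.0, lang="en-us"
)
```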
### Orpheus (`orpheus_tts_wrapper.py`)
- **Library:** `requests`
- **Cost:** API-based (Baseten; billed per inference)
- **Config keys:** `orpheus_voice`
- **Env vars (secrets, never saved to `config.json`):** `ORPHEUS_API_KEY`, `ORPHEUS_ENDPOINT`
Calls a self-deployed Orpheus model on Baseten. Produces highly expressive, human-like speech; it is the highest-quality TTS option available.
**Setup required:** You must deploy the Orpheus model to your own Baseten workspace before use. See Setup Guide → Orpheus TTS Setup for step-by-step instructions.
The wrapper POSTs to your endpoint with `stream: true`, collects raw PCM bytes (24 kHz, 16-bit mono), decodes them into a NumPy array, and returns it to the brain for playback.
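A hedged sketch of that request/decode path; the JSON field names here are assumptions, so check your Baseten deployment's schema:

```python
import numpy as np
import requests

def fetch_orpheus_audio(endpoint: str, api_key: str, text: str, voice: str):
    resp = requests.post(
        endpoint,
        headers={"Authorization": f"Api-Key {api_key}"},
        json={"prompt": text, "voice": voice, "stream": True},  # field names assumed
        stream=True,
    )
    resp.raise_for_status()
    pcm = b"".join(resp.iter_content(chunk_size=4096))  # raw 16-bit mono PCM
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    return audio, 24000  # Orpheus streams at 24 kHz
```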
Voice examples: `zoe`, `tara`, `leo`, `leah`
**Default mismatch note (Orpheus):** The `OrpheusTTSWrapper` constructor defaults to `voice="tara"`, while the `BrainConfig` dataclass default for `orpheus_voice` is `"zoe"`. At runtime the brain always passes `config.orpheus_voice` explicitly, so the effective default seen by users is `"zoe"`. The class default only matters for direct instantiation without arguments.
## Hot Reload
`reload_config()` updates `voice` in place for EdgeTTS (along with `pitch`, `rate`, and `volume`). For Kokoro it updates `voice`, `speed`, and `lang`. For Orpheus it updates the API key, endpoint URL, and `voice`. Changing `tts_provider` itself requires a restart, because the object type changes.
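For EdgeTTS this amounts to reassigning attributes. A sketch, assuming the attribute names mirror the config keys:

```python
def reload_config(self, config: BrainConfig) -> None:
    # Takes effect on the next generate_audio() call; no object rebuild needed.
    self.voice = config.tts_voice
    self.pitch = config.tts_pitch
    self.rate = config.tts_rate
    self.volume = config.tts_volume
```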
## Adding a New TTS Engine
- Create `src/modules/tts/my_tts.py` and extend `TTSInterface`.
- Implement `async generate_audio(text) -> (np.ndarray, int)`, `speak()`, and `reload_config()` (a minimal skeleton follows this list).
- In `main.py`, add the instantiation branch.
- Add the provider name to the `--tts-provider` choices.
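A minimal skeleton for those steps; the interface import path and config attribute names are assumptions:

```python
# src/modules/tts/my_tts.py
import numpy as np
import sounddevice as sd

from .tts_interface import TTSInterface  # adjust to the actual module path

class MyTTS(TTSInterface):
    def __init__(self, voice: str = "default"):
        self.voice = voice

    async def generate_audio(self, text: str) -> tuple[np.ndarray, int]:
        # Call your engine here; return float samples plus their sample rate.
        audio = np.zeros(24000, dtype=np.float32)  # placeholder: 1 s of silence
        return audio, 24000

    async def speak(self, text: str, output_device_id: int) -> None:
        # Required by the ABC even though the brain only calls generate_audio().
        audio, sr = await self.generate_audio(text)
        sd.play(audio, sr, device=output_device_id)
        sd.wait()

    def reload_config(self, config) -> None:
        self.voice = config.tts_voice  # attribute name assumed
```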