POST /text-to-speech/stream — send text in, get a binary audio stream back. The response starts arriving as soon as the first audio chunk is ready, so you can begin playback or piping without waiting for the full file.
No WebSocket handshake, no config messages, no connection lifecycle. One HTTP request, one streamed response.
Common use cases:
HTTP Stream and WebSocket both give you streaming audio. The difference is how much control you need.
Use HTTP Stream when:
curl works out of the box
Use WebSocket when:
Since the response is a raw binary audio stream, you can pipe it directly without buffering the whole file in memory.
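As a rough illustration of piping without buffering, the sketch below copies the response to a sink one chunk at a time using only the Python standard library. The host name, the `Authorization` header, and the `text` body field are assumptions made for the example, not values confirmed by this guide; check the API Reference for the real ones.

```python
import json
import urllib.request
from typing import BinaryIO

# Hypothetical host, used only for illustration.
BASE_URL = "https://api.example.com"


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build the POST request. Header and body field names are assumptions."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/text-to-speech/stream",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def pipe_audio(source: BinaryIO, sink: BinaryIO, chunk_size: int = 4096) -> int:
    """Copy audio from source to sink chunk by chunk; return the bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty read means the stream has ended
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# Possible usage (performs a real network call, so it is left commented out):
# with urllib.request.urlopen(build_request("Hello", "YOUR_API_KEY")) as resp, \
#         open("out.mp3", "wb") as f:
#     pipe_audio(resp, f)
```

Because `pipe_audio` only needs objects with `read` and `write`, the sink could just as well be the stdin of a player process instead of a file.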
The response is a binary audio stream — not JSON, not base64. The Content-Type header matches your requested codec (e.g., audio/mpeg for MP3).
You can:
Save it to a file (--output in cURL, f.write(chunk) in Python)
This is different from the REST endpoint (/text-to-speech), which returns base64-encoded audio inside a JSON response. The stream endpoint returns raw binary audio, so no decoding is needed.
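The contrast between the two endpoints can be sketched in a few lines. The JSON field name `audio` is a hypothetical placeholder, since this guide does not show the REST response shape; the point is only that the REST path needs a base64 decode step while the stream path does not.

```python
import base64
import json
from typing import Iterable


def audio_from_rest_body(body: bytes, field: str = "audio") -> bytes:
    """REST endpoint: parse JSON, then base64-decode the audio field.

    The default field name "audio" is an assumption for illustration.
    """
    payload = json.loads(body)
    return base64.b64decode(payload[field])


def audio_from_stream_chunks(chunks: Iterable[bytes]) -> bytes:
    """Stream endpoint: chunks are already raw audio bytes, so just join them."""
    return b"".join(chunks)
```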
Pass dict_id to apply custom pronunciations during streaming synthesis:
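A minimal sketch of a request body that includes `dict_id` might look like the following. It assumes that `dict_id` sits in the JSON body next to a `text` field; the guide confirms only the `dict_id` parameter name, so treat the rest as illustrative. The ID value in the usage comment is made up.

```python
import json
from typing import Optional


def build_stream_payload(text: str, dict_id: Optional[str] = None) -> bytes:
    """Encode the request body; dict_id is added only when provided.

    Assumes a JSON body with a "text" field (not confirmed by this guide).
    """
    payload = {"text": text}
    if dict_id is not None:
        payload["dict_id"] = dict_id
    return json.dumps(payload).encode("utf-8")


# Example: build_stream_payload("Hello", dict_id="my-dictionary-id")
```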
See the Pronunciation Dictionary guide for setup.
Errors return JSON (not audio) with the standard error format:
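Since a failed request yields JSON rather than audio, a client should check the response before writing bytes to a player or file. The helper below is one possible way to do that; it makes no assumption about the fields inside the error object and simply returns whatever JSON the server sent. The function name is hypothetical.

```python
import json
from typing import Any, Optional


def parse_error(status: int, content_type: str, body: bytes) -> Optional[Any]:
    """Return the decoded error for a failed request, or None if the body is audio."""
    media_type = content_type.split(";")[0].strip().lower()
    is_json = media_type == "application/json"
    if status < 400 and not is_json:
        return None  # success: body is the binary audio stream
    try:
        return json.loads(body)
    except ValueError:
        # Body was not valid JSON; keep the raw text for debugging.
        return {"raw": body.decode("utf-8", errors="replace")}
```

With `urllib.request`, a 4xx or 5xx response raises `urllib.error.HTTPError`; its `code`, `headers`, and `read()` can supply the three arguments.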
Full endpoint spec with all parameters and error details is in the API Reference.
Need help? Reach out on Discord.