HTTP Streaming API
POST /text-to-speech/stream — send text in, get a binary audio stream back. The response starts arriving as soon as the first audio chunk is ready, so you can begin playback or piping without waiting for the full file.
No WebSocket handshake, no config messages, no connection lifecycle. One HTTP request, one streamed response.
Common use cases:
- Backend audio generation — Pipe audio directly to a file, S3, or a downstream service
- API proxying — Forward the stream to your frontend or mobile client as-is
- Batch processing — Generate audio for a queue of texts using simple HTTP calls
- Serverless / edge — Works in any environment that supports HTTP — no WebSocket runtime needed
HTTP Stream vs WebSocket — When to Use Which
Both give you streaming audio. The difference is how much control you need.
Use HTTP Stream when:
- You have a complete text and just need audio back
- You’re generating audio server-side (batch jobs, API endpoints, CI pipelines)
- Your runtime doesn’t support WebSocket (serverless functions, edge workers)
- You want the simplest possible integration — curl works out of the box
Use WebSocket when:
- You’re building a conversational agent that streams text incrementally (e.g., from an LLM)
- You need to send multiple texts without reconnecting
- Low time-to-first-byte on successive utterances matters (connection is already warm)
- You need fine-grained control over buffering and flushing
Code Examples
Piping the Stream
Since the response is a raw binary audio stream, you can pipe it directly without buffering the whole file in memory.
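A minimal sketch of this pattern in Python, using only the standard library. The host URL, Authorization header, and the text/voice_id field names are assumptions for illustration; substitute your actual endpoint and parameters from the API Reference.

```python
import json
import shutil
import urllib.request

# Placeholder base URL -- replace with your real API host.
API_URL = "https://api.example.com/text-to-speech/stream"

def stream_to_file(text: str, path: str, api_key: str, voice_id: str = "default") -> None:
    """POST text and pipe the binary audio response straight to disk.

    copyfileobj reads the response in fixed-size chunks, so audio is
    written as it arrives -- the whole file is never held in memory.
    """
    body = json.dumps({"text": text, "voice_id": voice_id}).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp, open(path, "wb") as f:
        shutil.copyfileobj(resp, f, length=8192)
```

The same shape works for piping to S3 or a downstream service: anything that accepts a writable file-like object can take the place of the local file.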
Request Parameters
Response
The response is a binary audio stream — not JSON, not base64. The Content-Type header matches your requested codec (e.g., audio/mpeg for MP3).
You can:
- Save it directly to a file (--output in cURL, f.write(chunk) in Python)
- Pipe it to an audio player
- Forward it to a client as a streaming HTTP response
This is different from the REST endpoint (/text-to-speech), which returns base64-encoded audio inside a JSON response. The stream endpoint returns raw binary audio — no decoding needed.
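Because the body is already raw bytes, forwarding it to your own client is just re-yielding chunks. A framework-agnostic sketch (the empty-chunk filtering is a common defensive touch, not something the API mandates):

```python
from typing import Iterable, Iterator

def relay_chunks(upstream: Iterable[bytes]) -> Iterator[bytes]:
    """Re-yield audio chunks from the upstream response as they arrive.

    Pass the result to your web framework's streaming-response type;
    nothing is buffered beyond the chunk in flight.
    """
    for chunk in upstream:
        if chunk:  # skip empty keep-alive chunks
            yield chunk
```

With requests, `upstream` would be `resp.iter_content(chunk_size=8192)`; with urllib, a small generator around `resp.read(8192)`.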
With Pronunciation Dictionary
Pass dict_id to apply custom pronunciations during streaming synthesis:
See the Pronunciation Dictionary guide for setup.
Error Handling
Errors return JSON (not audio) with the standard error format:
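In practice this means you should check the status code and Content-Type before treating the body as audio. A sketch, assuming a JSON error body (its exact shape is in the API Reference):

```python
import json

def parse_stream_response(status: int, content_type: str, first_bytes: bytes) -> bytes:
    """Distinguish an audio stream from a JSON error body.

    On error the whole body is JSON, so decode and surface it;
    on success return the first audio chunk and keep reading the stream.
    """
    if status != 200 or content_type.startswith("application/json"):
        err = json.loads(first_bytes.decode("utf-8"))
        raise RuntimeError(f"TTS request failed: {err}")
    return first_bytes
```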
Full endpoint spec with all parameters and error details is in the API Reference.
Need help? Reach out on Discord.