Speech to Text
Transcribe audio files and live streams with word-level timestamps and diarization.
The /transcriptions endpoint converts audio into text. Upload a file for batch
transcription, or stream audio over the Realtime API for live captions.
Transcribe a file
curl https://api.vocenza.com/v1/transcriptions \
  -H "Authorization: Bearer $VOCENZA_API_KEY" \
  -F model="vocenza-stt-1" \
  -F file="@meeting.wav" \
  -F "timestamps=word"

{
  "text": "Welcome everyone, let's get started.",
  "language": "en",
  "duration": 2.74,
  "words": [
    { "word": "Welcome", "start": 0.10, "end": 0.52 },
    { "word": "everyone", "start": 0.54, "end": 1.05 }
  ]
}

Parameters
| Parameter | Type | Description |
|---|---|---|
| model | string | vocenza-stt-1 (accuracy) or vocenza-stt-1-flash (realtime). |
| file | binary | Audio file. wav, mp3, m4a, flac, or ogg. |
| language | string | ISO-639-1 hint, e.g. en. Omit to auto-detect. |
| timestamps | string | none, segment, or word. Defaults to segment. |
| diarize | boolean | Label speakers (speaker_0, speaker_1, …). |
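
The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names (build_fields, encode_multipart, transcribe) are hypothetical, and sending diarize as the lowercase string "true" is an assumption about how the endpoint reads booleans from form data. The URL, field names, and allowed timestamps values come from the curl example and the table above.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.vocenza.com/v1/transcriptions"  # from the curl example
VALID_TIMESTAMPS = {"none", "segment", "word"}


def build_fields(model="vocenza-stt-1", language=None, timestamps="segment", diarize=False):
    """Return the non-file form fields as a dict of strings."""
    if timestamps not in VALID_TIMESTAMPS:
        raise ValueError(f"timestamps must be one of {sorted(VALID_TIMESTAMPS)}")
    fields = {"model": model, "timestamps": timestamps}
    if language is not None:
        # Omitting the field leaves language detection to the service.
        fields["language"] = language
    if diarize:
        # Assumption: the form field carries booleans as "true".
        fields["diarize"] = "true"
    return fields


def encode_multipart(fields, filename, file_bytes):
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode() + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, **options):
    """Upload one audio file and return the parsed JSON response."""
    with open(path, "rb") as audio:
        body, content_type = encode_multipart(
            build_fields(**options), os.path.basename(path), audio.read()
        )
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ['VOCENZA_API_KEY']}",
            "Content-Type": content_type,
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With VOCENZA_API_KEY set, calling transcribe("meeting.wav", timestamps="word") should return a dict shaped like the JSON response shown earlier, so result["words"] can be iterated for per-word start and end times.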
Diarization
Set diarize: true to attribute each segment to a speaker:
{
  "segments": [
    { "speaker": "speaker_0", "text": "Are we ready?", "start": 0.0, "end": 1.1 },
    { "speaker": "speaker_1", "text": "Yes, go ahead.", "start": 1.3, "end": 2.2 }
  ]
}

Live transcription
For captions on a live microphone, stream PCM audio frames to the Realtime API
and read transcript.delta events as the user speaks — no file upload needed.
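
The two client-side steps described here, framing the audio and folding delta events into a caption, can be sketched without committing to a transport. Everything beyond the transcript.delta event name is an assumption for illustration: the 16 kHz, 16-bit mono format, the 20 ms frame size, and the "type" and "delta" keys on each event are hypothetical, so check the Realtime API reference for the actual values.

```python
from typing import Iterable, Iterator


def pcm_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20,
               bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Yield fixed-duration frames of mono PCM; the last frame may be shorter."""
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


def caption_from_events(events: Iterable[dict]) -> str:
    """Concatenate text from transcript.delta events and ignore all others."""
    pieces = []
    for event in events:
        # Assumed event shape: {"type": "transcript.delta", "delta": "<text>"}
        if event.get("type") == "transcript.delta":
            pieces.append(event.get("delta", ""))
    return "".join(pieces)
```

In a real client, each frame from pcm_frames would be sent over the Realtime connection as it is captured, and the caption would be re-rendered each time a new delta arrives rather than once at the end.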
Supported languages
Vocenza auto-detects and transcribes 50+ languages. Pass language only when
you want to lock detection (for example, to avoid code-switching errors on short
clips).