Speech to Text

Transcribe audio files and live streams with word-level timestamps and diarization.

The /transcriptions endpoint converts audio into text. Upload a file for batch transcription, or stream audio over the Realtime API for live captions.

Transcribe a file

curl https://api.vocenza.com/v1/transcriptions \
  -H "Authorization: Bearer $VOCENZA_API_KEY" \
  -F model="vocenza-stt-1" \
  -F file="@meeting.wav" \
  -F "timestamps=word"

response.json
{
  "text": "Welcome everyone, let's get started.",
  "language": "en",
  "duration": 2.74,
  "words": [
    { "word": "Welcome", "start": 0.10, "end": 0.52 },
    { "word": "everyone", "start": 0.54, "end": 1.05 }
  ]
}
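
The word entries in the response above can be grouped into caption lines. The following is a minimal sketch, not part of the Vocenza API: the helper names (format_timestamp, words_to_cues) and the max_gap and max_words thresholds are assumptions chosen for illustration, and the sample payload is copied from the example response.

```python
import json

# Sample payload copied from the example response above.
RESPONSE = json.loads("""
{
  "text": "Welcome everyone, let's get started.",
  "language": "en",
  "duration": 2.74,
  "words": [
    { "word": "Welcome", "start": 0.10, "end": 0.52 },
    { "word": "everyone", "start": 0.54, "end": 1.05 }
  ]
}
""")


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS,mmm (the SubRip timestamp layout)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_cues(words, max_gap=0.6, max_words=8):
    """Group word entries into caption cues.

    A new cue starts when the silence between two words exceeds max_gap
    seconds, or when the current cue already holds max_words words.
    Both thresholds are illustrative defaults, not API values.
    """
    groups = []
    current = []
    for word in words:
        if current and (
            word["start"] - current[-1]["end"] > max_gap
            or len(current) >= max_words
        ):
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["word"] for w in group),
        }
        for group in groups
    ]


for cue in words_to_cues(RESPONSE["words"]):
    print(format_timestamp(cue["start"]), "-->",
          format_timestamp(cue["end"]), cue["text"])
```

Splitting on pauses rather than on a fixed word count tends to keep cues aligned with natural phrasing; tune both thresholds to your audio.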

Parameters

Parameter    Type      Description
model        string    vocenza-stt-1 (accuracy) or vocenza-stt-1-flash (realtime).
file         binary    Audio file. wav, mp3, m4a, flac, or ogg.
language     string    ISO-639-1 hint, e.g. en. Omit to auto-detect.
timestamps   string    none, segment, or word. Defaults to segment.
diarize      boolean   Label speakers (speaker_0, speaker_1, …).
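
As a rough illustration of how these parameters fit together, the sketch below validates values client-side and assembles the non-file form fields. The function name build_form_fields is hypothetical, the default model is an arbitrary choice, and sending diarize as the string "true" is an assumption about how the API reads booleans from a multipart form.

```python
ALLOWED_TIMESTAMPS = {"none", "segment", "word"}
ALLOWED_EXTENSIONS = {"wav", "mp3", "m4a", "flac", "ogg"}


def build_form_fields(path, model="vocenza-stt-1", language=None,
                      timestamps="segment", diarize=False):
    """Return the non-file form fields for a /transcriptions request.

    Raises ValueError for values the parameter table does not list.
    """
    extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if extension not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {path!r}")
    if timestamps not in ALLOWED_TIMESTAMPS:
        raise ValueError(
            f"timestamps must be one of {sorted(ALLOWED_TIMESTAMPS)}"
        )

    fields = {"model": model, "timestamps": timestamps}
    if language is not None:
        # Leaving the field out keeps language auto-detection on.
        fields["language"] = language
    if diarize:
        # Assumption: booleans travel as the literal string "true".
        fields["diarize"] = "true"
    return fields


print(build_form_fields("meeting.wav", timestamps="word", diarize=True))
```

The returned dictionary could then be sent with any HTTP client that supports multipart uploads, with the audio attached as the file field and the API key in the Authorization header, as in the curl example.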

Diarization

Set diarize to true to attribute each segment to a speaker. The response then includes a segments array:

{
  "segments": [
    { "speaker": "speaker_0", "text": "Are we ready?", "start": 0.0, "end": 1.1 },
    { "speaker": "speaker_1", "text": "Yes, go ahead.", "start": 1.3, "end": 2.2 }
  ]
}
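
One way to consume a diarized response is to merge consecutive segments from the same speaker into turns and total each speaker's talk time. This is a sketch only: merge_turns and talk_time are hypothetical helpers, and the sample segments are copied from the response above.

```python
from collections import defaultdict

# Sample segments copied from the diarized response above.
SEGMENTS = [
    {"speaker": "speaker_0", "text": "Are we ready?", "start": 0.0, "end": 1.1},
    {"speaker": "speaker_1", "text": "Yes, go ahead.", "start": 1.3, "end": 2.2},
]


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(dict(seg))  # copy so the input is not mutated
    return turns


def talk_time(segments):
    """Sum the seconds of speech attributed to each speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


for turn in merge_turns(SEGMENTS):
    print(f'{turn["speaker"]}: {turn["text"]}')
```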

Live transcription

For captions on a live microphone, stream PCM audio frames to the Realtime API and read transcript.delta events as the user speaks. No file upload is needed.
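
The two client-side halves of that flow, slicing audio into frames and folding deltas into a caption, might look like the sketch below. Nothing here comes from the Realtime API reference: the 16 kHz, 16-bit mono format, the 20 ms frame size, and the event shape (a dict with "type" and "delta" keys) are all assumptions, and the transport itself is left out.

```python
def pcm_frames(pcm: bytes, sample_rate=16_000, frame_ms=20, bytes_per_sample=2):
    """Yield fixed-duration frames from raw mono PCM bytes.

    Defaults assume 16 kHz, 16-bit audio in 20 ms frames (640 bytes each).
    A trailing partial frame is yielded as-is.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


def apply_events(events):
    """Concatenate the text carried by transcript.delta events.

    Assumption: each event is a dict with a "type" key and, for deltas,
    a "delta" key holding the newly recognized text. Other event types
    are ignored.
    """
    parts = []
    for event in events:
        if event.get("type") == "transcript.delta":
            parts.append(event.get("delta", ""))
    return "".join(parts)


# 50 ms of silence splits into two full frames and one partial frame.
print([len(frame) for frame in pcm_frames(bytes(1600))])

# Hypothetical event stream, for illustration only.
print(apply_events([
    {"type": "transcript.delta", "delta": "Are we "},
    {"type": "transcript.delta", "delta": "ready?"},
]))
```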

Supported languages

Vocenza auto-detects and transcribes 50+ languages. Pass language only when you want to lock detection to a single language (for example, to avoid code-switching errors on short clips).