Speech to Text

Transcribe audio files and live streams with word-level timestamps and diarization.

The `/transcriptions` endpoint converts audio into text. Upload a file for batch transcription, or stream audio over the Realtime API for live captions.

Transcribe a file

# Upload meeting.wav and request word-level timestamps
curl https://api.vocenza.com/v1/transcriptions \
  -H "Authorization: Bearer $VOCENZA_API_KEY" \
  -F "model=vocenza-stt-1" \
  -F "file=@meeting.wav" \
  -F "timestamps=word"

response.json

{
  "text": "Welcome everyone, let's get started.",
  "language": "en",
  "duration": 2.74,
  "words": [
    { "word": "Welcome", "start": 0.10, "end": 0.52 },
    { "word": "everyone", "start": 0.54, "end": 1.05 }
  ]
}
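
Word-level timestamps are typically used to build captions. The following is a minimal sketch, assuming the response has been parsed into a Python dict with the `words` shape shown above. The helper name `words_to_captions` and the `max_gap` and `max_words` thresholds are illustrative choices, not part of the Vocenza API.

```python
def words_to_captions(response, max_gap=0.6, max_words=8):
    """Group word-level timestamps into caption lines.

    A new caption starts when the pause before a word exceeds max_gap
    seconds, or when the current caption already holds max_words words.
    """
    captions = []
    current = []
    for word in response.get("words", []):
        if current:
            pause = word["start"] - current[-1]["end"]
            if pause > max_gap or len(current) >= max_words:
                captions.append(_to_caption(current))
                current = []
        current.append(word)
    if current:
        captions.append(_to_caption(current))
    return captions


def _to_caption(words):
    """Collapse a run of word entries into one caption entry."""
    return {
        "text": " ".join(w["word"] for w in words),
        "start": words[0]["start"],
        "end": words[-1]["end"],
    }


# Sample data copied from the response above.
sample = {
    "words": [
        {"word": "Welcome", "start": 0.10, "end": 0.52},
        {"word": "everyone", "start": 0.54, "end": 1.05},
    ]
}

for caption in words_to_captions(sample):
    print(f"{caption['start']:.2f}-{caption['end']:.2f} {caption['text']}")
```

Splitting on pauses rather than on a fixed word count alone tends to keep caption breaks aligned with natural phrasing; tune both thresholds to your display.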

Parameters

Parameter	Type	Description
`model`	string	`vocenza-stt-1` (tuned for accuracy) or `vocenza-stt-1-flash` (tuned for realtime use).
`file`	binary	Audio file in `wav`, `mp3`, `m4a`, `flac`, or `ogg` format.
`language`	string	ISO 639-1 language hint, e.g. `en`. Omit to auto-detect.
`timestamps`	string	`none`, `segment`, or `word`. Defaults to `segment`.
`diarize`	boolean	Label speakers (`speaker_0`, `speaker_1`, …).
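
To catch invalid values before sending a request, the parameters above can be validated client-side. This is a sketch only: the helper `build_form_fields` is hypothetical, and encoding the boolean as the string `"true"` is an assumption about how the endpoint reads multipart form fields.

```python
ALLOWED_MODELS = {"vocenza-stt-1", "vocenza-stt-1-flash"}
ALLOWED_TIMESTAMPS = {"none", "segment", "word"}


def build_form_fields(model, language=None, timestamps="segment", diarize=False):
    """Return the non-file form fields for a transcription request.

    Raises ValueError for values outside the documented options.
    """
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if timestamps not in ALLOWED_TIMESTAMPS:
        raise ValueError(f"unknown timestamps value: {timestamps!r}")

    fields = {"model": model, "timestamps": timestamps}
    if language is not None:
        # Omitting the field leaves language auto-detection on.
        fields["language"] = language
    if diarize:
        # Assumption: the form field accepts the literal string "true".
        fields["diarize"] = "true"
    return fields


print(build_form_fields("vocenza-stt-1", timestamps="word", diarize=True))
```

The returned dict holds only the text fields; the audio itself still goes in the `file` field of the multipart body, as in the curl example.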

Diarization

Set `diarize: true` to attribute each segment to a speaker:

{
  "segments": [
    { "speaker": "speaker_0", "text": "Are we ready?", "start": 0.0, "end": 1.1 },
    { "speaker": "speaker_1", "text": "Yes, go ahead.", "start": 1.3, "end": 2.2 }
  ]
}
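
A diarized response can be turned into a readable transcript by merging consecutive segments from the same speaker. The sketch below assumes the `segments` shape shown above, already parsed into a dict; `format_transcript` is an illustrative helper, not part of the API.

```python
def format_transcript(response):
    """Merge consecutive same-speaker segments and render one line per turn."""
    turns = []
    for segment in response.get("segments", []):
        if turns and turns[-1]["speaker"] == segment["speaker"]:
            # Same speaker kept talking: extend the current turn.
            turns[-1]["text"] += " " + segment["text"]
            turns[-1]["end"] = segment["end"]
        else:
            # Copy so the caller's data is not modified.
            turns.append(dict(segment))
    return "\n".join(
        f"[{t['start']:.1f}-{t['end']:.1f}] {t['speaker']}: {t['text']}"
        for t in turns
    )


# Sample data copied from the response above.
diarized = {
    "segments": [
        {"speaker": "speaker_0", "text": "Are we ready?", "start": 0.0, "end": 1.1},
        {"speaker": "speaker_1", "text": "Yes, go ahead.", "start": 1.3, "end": 2.2},
    ]
}

print(format_transcript(diarized))
```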

Live transcription

To caption a live microphone, stream PCM audio frames to the Realtime API and read `transcript.delta` events as the user speaks. No file upload is needed.
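
On the client side, a live caption is usually built by appending each delta as it arrives. The sketch below shows only that accumulation step and makes no network calls. The event payload shape is an assumption: this page names the `transcript.delta` event but does not document its fields, so the `type` and `delta` keys are hypothetical.

```python
def caption_from_events(events):
    """Concatenate the text carried by transcript.delta events.

    Assumes each event is a dict with a "type" key and, for delta
    events, a "delta" key holding newly recognized text.
    """
    parts = []
    for event in events:
        if event.get("type") == "transcript.delta":
            parts.append(event.get("delta", ""))
        # Other event types are ignored in this sketch.
    return "".join(parts)


# Hypothetical events, for illustration only.
fake_events = [
    {"type": "transcript.delta", "delta": "Are we "},
    {"type": "session.ping"},
    {"type": "transcript.delta", "delta": "ready?"},
]

print(caption_from_events(fake_events))
```

In a real client, `events` would be an iterator over messages decoded from the Realtime API connection, and the caption would be re-rendered after each delta rather than once at the end.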

Supported languages

Vocenza auto-detects and transcribes 50+ languages. Pass `language` only when you want to pin the language instead of relying on auto-detection (for example, to avoid code-switching errors on short clips).
