Text & Audio API: Translation and transcription service

A content-agnostic translation and transcription service. Submit text or audio in your input language, specify a target language and an output format (text or audio), and, if the language pair is supported, receive the translated result back in that format. The API does not interpret content or assume a domain; what you do with it is up to you.

Endpoints: POST /api/v1/text · POST /api/v1/audio
Scopes: field:text / field:audio

What is the Text & Audio API?

The Text & Audio API is a stateless, content-agnostic translation and transcription service. You supply the input (text or audio), its language, the target language, and the output format you want back (text or audio). If the language pair is supported, the API returns the translated result. The service does not interpret what you send: there is no agricultural model, no follow-up logic, and no memory between calls.

Two flows are supported:

  • Text → text or audio: submit a string in your input language; receive the translation as text, or as synthesised speech.
  • Audio → text or audio: submit an audio file in your input language; receive the transcript, optionally translated to a target language and optionally synthesised back to speech.

Local dialect coverage is the focus

The service was built to expand translation and speech support to African and Bantu languages that mainstream translation tools cover poorly. You can use it for any content; the API does not assume a topic. Audio is processed asynchronously through a job queue so larger files can complete in the background.

Authentication and scopes

All endpoints require a valid API key. The key can be sent in the X-Api-Key header; Bearer tokens in the Authorization header are also accepted where the key audience is allowed:

X-Api-Key: YOUR_API_KEY
Authorization: Bearer YOUR_API_KEY
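As a minimal sketch, the two header forms above can be produced by one small helper. The function name and the use_bearer flag are illustrative, not part of the API:

```python
def auth_headers(api_key, use_bearer=False):
    """Return the header carrying the API key in one of the two documented forms."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    if use_bearer:
        # Bearer form; accepted only where the key audience is allowed.
        return {"Authorization": f"Bearer {api_key}"}
    return {"X-Api-Key": api_key}
```

Merging the returned dictionary into the request headers keeps the choice of form in one place.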

Text and audio operations use separate scopes, allowing you to grant minimal permissions to each integration:

Text scope, field:text

  • POST /api/v1/text, Synchronous text translation
  • POST /api/v1/text/jobs, Create async text job
  • GET /api/v1/text/jobs/job_abc123, Check job status

Audio scope, field:audio

  • POST /api/v1/audio, Upload audio file for transcription/translation
  • GET /api/v1/audio/audio_job_abc123, Check audio job status
  • GET /api/v1/locales, List supported languages

Billing

POST /api/v1/text accepts user API keys in production. Text job polling, audio upload/status, and locale listing currently require a server-audience key. GET endpoints are not billed.

Synchronous text translation

POST /api/v1/text translates a text string and optionally synthesises it to audio in a single synchronous call. Use this for short strings where low latency matters.

Required fields

  • text (string), The input text to translate
  • lang (string), Target language locale key (e.g. sw-ke)
  • output (string), Either "text" or "audio"

Optional fields

  • source_lang (string, default "auto"), Source language; set to "auto" for automatic detection

Example, text output

POST /api/v1/text
Authorization: Bearer YOUR_API_KEY
Content-Type: application/json

{
  "text": "Hello, how are you today?",
  "lang": "sw-ke",
  "output": "text",
  "source_lang": "en-us"
}

The response includes translation_used (whether translation was applied), text (the output string), and, when output is "audio", an audio_url link and audio_format.
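The request above could be assembled in Python roughly as follows. This is a sketch using only the standard library; BASE_URL is a placeholder host (an assumption, since this page does not state one), and the local checks simply mirror the documented empty_text and invalid_output errors:

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: replace with the real host


def build_text_request(api_key, text, lang, output="text", source_lang="auto"):
    """Build (but do not send) a POST /api/v1/text request."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if output not in ("text", "audio"):
        raise ValueError('output must be "text" or "audio"')
    payload = {
        "text": text,
        "lang": lang,
        "output": output,
        "source_lang": source_lang,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the call; the JSON body of the reply should then carry the fields described above.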

Asynchronous audio jobs

Audio processing (transcription and synthesis) is handled asynchronously via a job queue. This allows large audio files to be processed in the background without blocking your application.

Upload audio

POST /api/v1/audio accepts a multipart form upload containing the audio file plus the lang and output form fields. It returns a job_id immediately.

Poll for status

GET /api/v1/audio/audio_job_abc123 returns the current job status; audio_job_abc123 here stands for the job_id returned by the upload. When status is completed, the response includes the transcript, the translation, and, if synthesis was requested, the audio URL.

Accepted formats

Common audio container formats are accepted (MP3, M4A, WAV, OGG, WebM, FLAC). Maximum upload size is 15 MB per file; see the Input limits section for full details. Shorter clips process faster.

Async text jobs

For longer text that needs audio synthesis, use POST /api/v1/text/jobs to submit a text translation job asynchronously and retrieve the result with GET /api/v1/text/jobs/job_abc123.

Example, upload audio

POST /api/v1/audio
Authorization: Bearer YOUR_API_KEY
Content-Type: multipart/form-data

file=@recording.m4a
lang=sw-ke
output=text
source_lang=auto
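For callers without an HTTP library that handles multipart forms, the upload body above can be encoded by hand. The sketch below uses only the standard library; the helper name is illustrative, and the form field name file follows the example above:

```python
import uuid


def encode_multipart(fields, filename, file_bytes, content_type="application/octet-stream"):
    """Encode text fields plus one file part as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    crlf = "\r\n"
    chunks = []
    for name, value in fields.items():
        header = f'--{boundary}{crlf}Content-Disposition: form-data; name="{name}"{crlf}{crlf}'
        chunks.append(header.encode("utf-8") + str(value).encode("utf-8") + crlf.encode("utf-8"))
    file_header = (
        f'--{boundary}{crlf}'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"{crlf}'
        f'Content-Type: {content_type}{crlf}{crlf}'
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + crlf.encode("utf-8"))
    # The closing delimiter ends the form.
    chunks.append(f"--{boundary}--{crlf}".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"
```

The returned body goes in the POST /api/v1/audio request, and the second value goes in its Content-Type header so the server can find the boundary.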

Polling job status

After submitting an audio or text job, poll the status endpoint until the job reaches a terminal state. Job status values:

Active states

  • queued, Job received and waiting to be processed
  • processing, Transcription or synthesis is in progress

Terminal states

  • completed, Processing finished; results are available in the response
  • failed, Processing failed; check error_code and error_message for details

Polling interval

For short audio messages, start polling after 3–5 seconds and use exponential backoff; most jobs complete within 10–30 seconds. Avoid polling more frequently than once every 2 seconds to stay within rate limits.
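The polling guidance above can be sketched as a small loop. The fetch_status callable is a hypothetical stand-in for whatever code performs the GET and returns the parsed JSON; the maximum interval and overall timeout are illustrative defaults, not documented values:

```python
import time

TERMINAL_STATES = {"completed", "failed"}


def poll_job(fetch_status, initial_delay=3.0, min_interval=2.0,
             max_interval=30.0, timeout=300.0, sleep=time.sleep):
    """Poll until the job reaches a terminal state, doubling the wait each time.

    fetch_status: zero-argument callable returning a dict with a "status" key.
    sleep: injectable so the loop can be exercised without real waiting.
    """
    waited = 0.0
    delay = max(initial_delay, min_interval)
    while True:
        sleep(delay)
        waited += delay
        job = fetch_status()
        if job.get("status") in TERMINAL_STATES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job still {job.get('status')!r} after {waited:.0f}s")
        # Exponential backoff, never faster than min_interval or slower than max_interval.
        delay = min(max(delay * 2, min_interval), max_interval)
```

With the defaults, the waits are 3, 6, 12, and 24 seconds, then 30 seconds thereafter. A returned job with status failed still needs its error_code and error_message inspected by the caller.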

Supported languages

Use GET /api/v1/locales to retrieve the current list of supported languages at runtime. Each locale entry includes whether audio synthesis (audio_supported) is available, since not all text locales have voice synthesis.

The two lists below are accurate as of the current deployment. Treat them as a snapshot: language coverage expands over time, so always query GET /api/v1/locales at runtime instead of hardcoding the list.

Languages with audio (TTS) support

  • Bemba, bem
  • Chichewa / Nyanja, ny
  • Shona, sn
  • Swahili, sw
  • English, en
  • French, fr
  • Spanish, es
  • Portuguese, pt
  • Chinese (Simplified), zh
  • Chinese (Traditional), zh-tw

When you set output: "audio", the target language must be one of the codes above. Other languages return 400 audio_unavailable.

Text translation only (no voice)

  • Lozi, loz
  • Tumbuka, tum
  • Tonga, toi
  • Zulu, zu
  • Xhosa, xh
  • Lingala, ln
  • Luganda, lg
  • Kinyarwanda, rw
  • Yoruba, yo
  • Hausa, ha
  • Igbo, ig

These languages translate but have no voice synthesis. Use output: "text" only.

Check at runtime

Language support expands over time. Always query /api/v1/locales at runtime rather than hardcoding the supported language list in your application.
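One way to act on the runtime list is to decide the output mode from it before sending a request. The sketch below assumes the locales response has already been parsed into a list of dictionaries. The audio_supported flag is described above, but the code key holding the locale identifier is a hypothetical name, so adjust it to the actual payload:

```python
def pick_output(locales, lang, wanted="audio"):
    """Return the output mode usable for lang, or None if lang is not listed.

    locales: list of dicts, each with a "code" key (assumed name) and an
    "audio_supported" boolean.
    """
    by_code = {entry["code"]: entry for entry in locales}
    entry = by_code.get(lang)
    if entry is None:
        return None  # language not offered at all
    if wanted == "audio" and not entry.get("audio_supported", False):
        return "text"  # fall back rather than trigger 400 audio_unavailable
    return wanted
```

Falling back to text is only one possible policy; a caller that strictly needs speech may prefer to surface the missing voice support to the user instead.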

Output modes

The output parameter controls what the API returns. Both text and audio endpoints support the same two modes:

output: "text"

Returns the translated text string in the target language, with no audio synthesis. Use this when the downstream system will display the text or speak it through its own TTS engine.

output: "audio"

Translates the text and synthesises it to speech in the target language. Returns an audio_url to the generated audio file and the audio_format (e.g. mp3).
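A caller typically branches on which mode came back. As a rough sketch, assuming the response body has been parsed into a dictionary with the documented fields (text, translation_used, audio_url, audio_format), the helper below normalises it; the helper name and the shape of its return value are illustrative:

```python
def extract_result(body):
    """Collect the documented response fields into one small dict."""
    result = {
        "text": body.get("text"),
        "translated": bool(body.get("translation_used", False)),
    }
    audio_url = body.get("audio_url")
    if audio_url:
        # Present only when output was "audio".
        result["audio_url"] = audio_url
        result["audio_format"] = body.get("audio_format")
    return result
```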

Input limits

Limits are enforced to preserve translation and transcription quality and to prevent abuse. Exceeding any limit returns a 400 response with details so you can resize the request and retry.

Input | Limit | Notes
text | Character limit enforced server-side | Long inputs are rejected so translation quality stays consistent. The exact maximum is returned in the error response if you exceed it. Split long content into multiple requests.
audio (upload) | 15 MB max file size | Files larger than 15 MB are rejected before the job is queued. Accepted containers: MP3, M4A, WAV, OGG, WebM, FLAC. For best transcription quality, use 16 kHz mono or higher.
audio (live recording) | 10 seconds max duration | Applies to live in-browser recordings from the FieldAudio playground. Direct file uploads via the API are bound by the 15 MB file-size limit only.

Why limits exist

Translation and transcription quality degrades on very long inputs, and large uploads consume queue capacity that other callers share. The limits above are not a usage cap; they bound the size of a single request. Use them as guardrails when designing batching and retry logic.
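Splitting long text into several requests can be done with a simple greedy packer. In this sketch max_chars is supplied by the caller, because the exact limit is not stated here and is only reported in the text_too_long error response:

```python
def split_text(text, max_chars):
    """Split text on whitespace into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks, current = [], ""
    for word in text.split():
        # Hard-slice any single word that could never fit in one chunk.
        pieces = [word[i:i + max_chars] for i in range(0, len(word), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

This version collapses runs of whitespace and ignores sentence boundaries. Splitting at sentence ends instead would likely give the translator better context per request, at the cost of a more involved splitter.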

Error codes

Common error responses from the Text and Audio API:

Client errors (4xx)

  • 400 empty_text, The input text is empty
  • 400 text_too_long, The input text exceeds the character limit. See Input limits.
  • 400 unsupported_target_lang, The target language is not supported
  • 400 audio_unavailable, Audio synthesis not supported for this language
  • 400 invalid_output, output must be "text" or "audio"
  • 400 file_too_large, Audio upload exceeds the 15 MB limit
  • 401, Missing or invalid API key
  • 404, Job not found (for status polling)

Server errors (5xx)

  • 503, Translation or synthesis service temporarily unavailable
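The codes listed above suggest different reactions on the client side. The sketch below maps them to coarse actions; the action names are illustrative. The error code is passed in as an argument because its location in the error response body is not specified here:

```python
RESIZE_CODES = {"text_too_long", "file_too_large"}


def classify_error(status_code, error_code=None):
    """Map an HTTP status and optional error code to a suggested client action."""
    if status_code == 503:
        return "retry"  # temporary outage; retry with backoff
    if status_code == 401:
        return "fix_credentials"
    if status_code == 404:
        return "unknown_job"
    if status_code == 400:
        if error_code in RESIZE_CODES:
            return "resize_request"
        if error_code == "audio_unavailable":
            return "fallback_to_text"
        return "fix_request"  # empty_text, unsupported_target_lang, invalid_output, ...
    return "unexpected"
```

Only the 503 case is worth retrying unchanged; the 400 cases need a modified request before another attempt.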

Audio file validation

Audio uploads require a non-empty file. If the file field is missing from the multipart form or the uploaded file is empty, the API returns 400 immediately, before the job is queued. Validate file size and presence before uploading to avoid wasted calls.
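A pre-upload check along these lines avoids the wasted calls mentioned above. It is a sketch with two labelled assumptions: 15 MB is interpreted as 15 × 1024 × 1024 bytes, and the container is guessed from the file extension, which is only a heuristic:

```python
import os

MAX_UPLOAD_BYTES = 15 * 1024 * 1024  # assumption: MB taken as 2**20 bytes
ACCEPTED_EXTENSIONS = {".mp3", ".m4a", ".wav", ".ogg", ".webm", ".flac"}


def check_audio_file(path):
    """Return a list of problems found; an empty list means the file looks uploadable."""
    if not os.path.isfile(path):
        return ["file is missing"]
    problems = []
    size = os.path.getsize(path)
    if size == 0:
        problems.append("file is empty")
    if size > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 15 MB limit")
    extension = os.path.splitext(path)[1].lower()
    if extension not in ACCEPTED_EXTENSIONS:
        problems.append(f"unrecognised extension {extension!r}")
    return problems
```

The server remains the authority: a file that passes these checks can still be rejected, so the 400 responses need handling regardless.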