Last Updated: November 26, 2024

Transcription API Features Guide

Introduction

The Salad Transcription API provides parameters for customizing the transcription output of your audio and video content. This guide covers the key parameters:

Transcription Output Options:
- return_as_file
- sentence_level_timestamps
- word_level_timestamps
Speaker Identification:
- diarization
- sentence_diarization
- multichannel
Language Specification:
- language_code

Used appropriately, these parameters improve the accuracy, efficiency, and usability of your transcriptions.

Transcription Parameters

1. return_as_file

Description

The return_as_file parameter allows you to receive your transcription output as a downloadable file URL. This is particularly useful when dealing with large responses, as it helps avoid issues with response size limitations.

Default: false
Type: boolean

Usage

Set "return_as_file": true in your request to receive the transcription output as a file URL. Example:

"input": {
  "url": "https://example.com/path/to/file.mp3",
  "return_as_file": true
}

Note:

If the response size exceeds 1 MB, the output will automatically be returned as a file URL, even if return_as_file is set to false.
This helps ensure reliable delivery of large transcription outputs.
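
The rule in the note can be restated as a small predicate. This is only a sketch: the function name `delivered_as_file` is hypothetical, and it assumes "1 MB" means 1,000,000 bytes of UTF-8 encoded JSON, which the guide does not specify.

```python
import json

# Assumption: "1 MB" is read as 1,000,000 bytes of serialized JSON.
SIZE_LIMIT_BYTES = 1_000_000


def delivered_as_file(output: dict, return_as_file: bool = False) -> bool:
    """Return True if, under the rule above, the output would arrive as a file URL."""
    size = len(json.dumps(output).encode("utf-8"))
    # A file URL is used when requested explicitly or when the output is too large.
    return return_as_file or size > SIZE_LIMIT_BYTES


small = {"text": "Welcome to our presentation."}
large = {"text": "word " * 300_000}  # about 1.5 MB once serialized

print(delivered_as_file(small))                       # False
print(delivered_as_file(small, return_as_file=True))  # True
print(delivered_as_file(large))                       # True
```

In practice this means client code should be prepared to receive a file URL even when it did not ask for one.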

2. sentence_level_timestamps

Description

Include timestamps at the sentence level in your transcription output.

Default: false
Type: boolean

Usage

Set "sentence_level_timestamps": true to include sentence-level timestamps. Set to false if you do not need them. Example:

"input": {
  "url": "https://example.com/path/to/file.mp3",
  "sentence_level_timestamps": true
}

Output Format

"sentence_level_timestamps": [
    {
        "text": "Welcome to our presentation.",
        "timestamp": [
            29.98,
            29.98
        ],
        "start": 29.98,
        "end": 29.98
    }
]
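
As an illustration of consuming this structure, the following sketch turns sentence entries into caption-style lines. The helper names `format_seconds` and `caption_lines` are hypothetical, and the code assumes `start` and `end` are expressed in seconds, as the sample values suggest.

```python
def format_seconds(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def caption_lines(sentences: list[dict]) -> list[str]:
    """Build one '[start --> end] text' line per sentence entry."""
    return [
        f"[{format_seconds(s['start'])} --> {format_seconds(s['end'])}] {s['text']}"
        for s in sentences
    ]


# Entry copied from the output format above.
sentences = [
    {
        "text": "Welcome to our presentation.",
        "timestamp": [29.98, 29.98],
        "start": 29.98,
        "end": 29.98,
    }
]

for line in caption_lines(sentences):
    print(line)  # [00:00:29.980 --> 00:00:29.980] Welcome to our presentation.
```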

3. word_level_timestamps

Description

Include timestamps for each word in your transcription output.

Default: false
Type: boolean

Usage

Set "word_level_timestamps": true to include word-level timestamps. Example:

"input": {
  "url": "https://example.com/path/to/file.mp3",
  "word_level_timestamps": true
}

Output Format

Word-level timestamps are provided in the word_segments array of the output.

"word_segments": [
    {
        "word": "I",
        "start": 4.963,
        "end": 5.024,
        "score": 0.597
    },
    {
        "word": "understand",
        "start": 5.084,
        "end": 5.548,
        "score": 0.471
    },
    {
        "word": "something.",
        "start": 6.073,
        "end": 6.376,
        "score": 0.295
    }
]
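
A short sketch of working with word_segments follows. It rebuilds the text and lists words whose score falls below a cutoff. The guide does not define score, so treating it as a confidence-like value is an assumption, and the 0.4 threshold and function names are purely illustrative.

```python
# Entries copied from the output format above.
word_segments = [
    {"word": "I", "start": 4.963, "end": 5.024, "score": 0.597},
    {"word": "understand", "start": 5.084, "end": 5.548, "score": 0.471},
    {"word": "something.", "start": 6.073, "end": 6.376, "score": 0.295},
]


def transcript_text(words: list[dict]) -> str:
    """Join the words back into a single string."""
    return " ".join(w["word"] for w in words)


def low_score_words(words: list[dict], threshold: float = 0.4) -> list[str]:
    """Return words scoring below the threshold (assumed to mean lower confidence)."""
    return [w["word"] for w in words if w["score"] < threshold]


print(transcript_text(word_segments))   # I understand something.
print(low_score_words(word_segments))   # ['something.']
```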

4. diarization

Description

Enable speaker separation and identification at the word level.

Default: false
Type: boolean

Usage

Set "diarization": true to enable word-level speaker labels. Example:

"input": {
  "url": "https://example.com/path/to/file.mp3",
  "diarization": true
}

Output Format

Speaker labels are included in both segments and word_segments when diarization is enabled.

"word_segments": [
    {
        "word": "I",
        "start": 4.963,
        "end": 5.024,
        "score": 0.597,
        "speaker": "SPEAKER_00"
    },
    {
        "word": "understand",
        "start": 5.084,
        "end": 5.548,
        "score": 0.471,
        "speaker": "SPEAKER_00"
    },
    {
        "word": "something.",
        "start": 6.073,
        "end": 6.376,
        "score": 0.295,
        "speaker": "SPEAKER_00"
    }
]
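
Word-level speaker labels are often easier to read once consecutive words from the same speaker are merged into turns. The sketch below does this with itertools.groupby. The function name `speaker_turns` is hypothetical, and the fourth sample word (from SPEAKER_01) is invented for illustration and is not part of the API example.

```python
from itertools import groupby


def speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words that share a speaker label into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.get("speaker")):
        chunk = list(group)
        turns.append(
            {
                "speaker": speaker,
                "start": chunk[0]["start"],
                "end": chunk[-1]["end"],
                "text": " ".join(w["word"] for w in chunk),
            }
        )
    return turns


words = [
    {"word": "I", "start": 4.963, "end": 5.024, "score": 0.597, "speaker": "SPEAKER_00"},
    {"word": "understand", "start": 5.084, "end": 5.548, "score": 0.471, "speaker": "SPEAKER_00"},
    {"word": "something.", "start": 6.073, "end": 6.376, "score": 0.295, "speaker": "SPEAKER_00"},
    # Hypothetical extra word, added only to show a speaker change.
    {"word": "Really?", "start": 6.9, "end": 7.3, "score": 0.8, "speaker": "SPEAKER_01"},
]

for turn in speaker_turns(words):
    print(f"{turn['speaker']} [{turn['start']} to {turn['end']}]: {turn['text']}")
```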

5. sentence_diarization

Description

Include speaker information at the sentence level.

Default: false
Type: boolean

Usage

Set "sentence_diarization": true to include sentence-level speaker labels. Example:

"input": {
  "url": "https://example.com/path/to/file.mp3",
  "sentence_diarization": true
}

Output Format

Speaker labels are included in the segments array when sentence_diarization is enabled.

"sentence_level_timestamps": [
    {
        "text": "I understand something.",
        "timestamp": [
            4.64,
            6.82
        ],
        "start": 4.64,
        "end": 6.82,
        "speaker": "SPEAKER_00"
    }
]

Note: If several speakers are identified in one sentence, the speaker who appears most often in that sentence is returned.
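
The note describes a majority choice, which can be sketched as follows. This is not the service's implementation: counting by number of words (rather than by duration) and resolving ties in favor of the speaker seen first are assumptions, and `dominant_speaker` is a hypothetical name.

```python
from collections import Counter
from typing import Optional


def dominant_speaker(words: list[dict]) -> Optional[str]:
    """Return the speaker label attached to the most words in a sentence."""
    counts = Counter(w["speaker"] for w in words if "speaker" in w)
    if not counts:
        return None
    # most_common orders by count; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]


# Hypothetical sentence in which two speakers overlap.
sentence_words = [
    {"word": "I", "speaker": "SPEAKER_00"},
    {"word": "understand", "speaker": "SPEAKER_00"},
    {"word": "something.", "speaker": "SPEAKER_01"},
]

print(dominant_speaker(sentence_words))  # SPEAKER_00
```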

6. multichannel

Description

Enable multichannel transcription for audio files with separate speaker channels. Requires either diarization or sentence_diarization to return speaker/channel information.

Default: false
Type: boolean

If you have a multichannel audio file with multiple speakers, you can transcribe each channel separately. Set "multichannel": true and pair it with either:

- "diarization": true for word-level speaker and channel labeling
- "sentence_diarization": true for sentence-level speaker and channel labeling

If the audio has only one channel, the system will automatically fall back to regular speaker diarization. Multichannel transcription is supported for all languages, has no channel limit and incurs no additional cost. It may increase transcription time by approximately 25%.

Usage

"input": {
  "url": "https://example.com/path/to/file.wav",
  "multichannel": true,
  "diarization": true
}
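
Because multichannel only returns speaker and channel information when paired with a diarization option, a client may want to check this before sending a request. The helper below is a hypothetical client-side sketch; raising an error is a design choice made here, not documented API behavior.

```python
def build_input(
    url: str,
    multichannel: bool = False,
    diarization: bool = False,
    sentence_diarization: bool = False,
) -> dict:
    """Build the "input" object, enforcing the multichannel pairing rule client-side."""
    if multichannel and not (diarization or sentence_diarization):
        raise ValueError(
            "multichannel needs diarization or sentence_diarization "
            "to return speaker/channel information"
        )
    payload = {"url": url}
    # Include only the options that were switched on.
    for name, enabled in (
        ("multichannel", multichannel),
        ("diarization", diarization),
        ("sentence_diarization", sentence_diarization),
    ):
        if enabled:
            payload[name] = True
    return payload


print(build_input("https://example.com/path/to/file.wav", multichannel=True, diarization=True))
```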

Output Format (word-level)

The output will include speaker and channel labels for each word.

"word_segments": [
  {
    "word": "Okay,",
    "start": 0.324,
    "end": 0.566,
    "timestamp": [
        0.324,
        0.566
    ],
    "speaker": "SPEAKER_0",
    "channel": 0
  }
]

Output Format (sentence-level)

The output will include speaker and channel labels for each sentence.

"sentence_level_timestamps": [
  {
    "start": 0.324,
    "end": 1.335,
    "text": "Okay, how are you?",
    "speaker": "SPEAKER_0",
    "channel": "0",
    "score": 100.0
  }
]
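
One way to use the channel label is to split the transcript per channel. In the samples above the channel appears once as a number and once as a string, so this sketch normalizes it to an integer before grouping. The function name `sentences_by_channel` and the second sample sentence are hypothetical.

```python
from collections import defaultdict


def sentences_by_channel(sentences: list[dict]) -> dict:
    """Group sentence texts by channel, accepting channel as int or numeric string."""
    grouped = defaultdict(list)
    for sentence in sentences:
        grouped[int(sentence["channel"])].append(sentence["text"])
    return dict(grouped)


sentences = [
    {"start": 0.324, "end": 1.335, "text": "Okay, how are you?",
     "speaker": "SPEAKER_0", "channel": "0", "score": 100.0},
    # Hypothetical reply on a second channel, added for illustration only.
    {"start": 1.5, "end": 2.1, "text": "Fine, thanks.",
     "speaker": "SPEAKER_1", "channel": 1, "score": 100.0},
]

print(sentences_by_channel(sentences))
# {0: ['Okay, how are you?'], 1: ['Fine, thanks.']}
```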

7. language_code

Description

The language_code parameter specifies the language of the audio, which improves the accuracy of transcription, diarization, and timestamps. This parameter supports both full language names and their short versions.

Default: en (English)
Type: string

If no language is specified, or the provided language is not in the supported list, the system detects the language automatically. Specifying the correct language still improves transcription accuracy. If the wrong language is specified, the system returns a translation into the specified language. For multilingual audio, leave language_code unset for best results. Accuracy varies by language; for tested results, see our accuracy benchmarks.

Usage

Set the language_code to the ISO 639-1 code of the audio’s language. Example:

"input": {
  "url": "https://example.com/path/to/french_audio.mp3",
  "language_code": "fr"
}

Supported Language Codes (ISO 639-1):

Code	Language	Code	Language	Code	Language
en	English	zh	Chinese	de	German
es	Spanish	ru	Russian	ko	Korean
fr	French	ja	Japanese	pt	Portuguese
tr	Turkish	pl	Polish	ca	Catalan
nl	Dutch	ar	Arabic	sv	Swedish
it	Italian	id	Indonesian	hi	Hindi
fi	Finnish	vi	Vietnamese	he	Hebrew
uk	Ukrainian	el	Greek	ms	Malay
cs	Czech	ro	Romanian	da	Danish
hu	Hungarian	ta	Tamil	no	Norwegian
th	Thai	ur	Urdu	hr	Croatian
bg	Bulgarian	lt	Lithuanian	la	Latin
mi	Maori	ml	Malayalam	cy	Welsh
sk	Slovak	te	Telugu	fa	Persian
lv	Latvian	bn	Bengali	sr	Serbian
az	Azerbaijani	sl	Slovenian	kn	Kannada
et	Estonian	mk	Macedonian	br	Breton
eu	Basque	is	Icelandic	hy	Armenian
ne	Nepali	mn	Mongolian	bs	Bosnian
kk	Kazakh	sq	Albanian	sw	Swahili
gl	Galician	mr	Marathi	pa	Punjabi
si	Sinhala	km	Khmer	sn	Shona
yo	Yoruba	so	Somali	af	Afrikaans
oc	Occitan	ka	Georgian	be	Belarusian
tg	Tajik	sd	Sindhi	gu	Gujarati
am	Amharic	yi	Yiddish	lo	Lao
uz	Uzbek	fo	Faroese	ht	Haitian Creole
ps	Pashto	tk	Turkmen	nn	Nynorsk
mt	Maltese	sa	Sanskrit	lb	Luxembourgish
my	Myanmar	bo	Tibetan	tl	Tagalog
mg	Malagasy	as	Assamese	tt	Tatar
haw	Hawaiian	ln	Lingala	ha	Hausa
ba	Bashkir	jw	Javanese	su	Sundanese
yue	Cantonese

Note

- Mandatory for Diarization: When using diarization or sentence-level diarization, specifying language_code is required for optimal performance.
- Automatic Detection: If no language is specified or the provided language is not supported, the system will automatically detect the language.
- Translation Option: If the wrong language is specified, the system will provide a transcription translated into the specified language.
- Multilingual Content: For multilingual audio, it is better not to specify a language to ensure the best results.

By setting the correct language_code or leveraging the automatic detection feature, you ensure the best possible transcription quality tailored to your audio content.
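
The guidance in the notes can be condensed into a small client-side helper that decides whether to send language_code at all. This is a sketch under stated assumptions: `language_option` is a hypothetical name, the set of codes is a shortened excerpt of the table above, and omitting the field for unsupported codes simply relies on the automatic detection described in this guide.

```python
from typing import Optional

# Abbreviated excerpt of the supported codes table, for illustration only.
SUPPORTED_SUBSET = frozenset({"en", "es", "fr", "de", "ja", "zh"})


def language_option(
    code: Optional[str],
    multilingual: bool = False,
    supported: frozenset = SUPPORTED_SUBSET,
) -> dict:
    """Return the language_code field to merge into "input", or {} to use auto-detection."""
    # For multilingual audio the guide recommends not specifying a language.
    if multilingual or code is None:
        return {}
    code = code.strip().lower()
    # Unknown codes are dropped so the service falls back to detection.
    return {"language_code": code} if code in supported else {}


payload = {"url": "https://example.com/path/to/french_audio.mp3"}
payload.update(language_option("fr"))
print(payload)
# {'url': 'https://example.com/path/to/french_audio.mp3', 'language_code': 'fr'}
```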
