Last Updated: November 26, 2024

Transcription API Features Guide

Introduction

The Salad Transcription API provides parameters for customizing the transcription output of your audio and video content. This guide covers the key parameters:
  • Transcription Output Options:
    • return_as_file
    • sentence_level_timestamps
    • word_level_timestamps
  • Speaker Identification:
    • diarization
    • sentence_diarization
    • multichannel
  • Language Specification:
    • language_code
Used correctly, these parameters can improve the accuracy, efficiency, and usability of your transcriptions.

Transcription Parameters

1. return_as_file

Description

The return_as_file parameter allows you to receive your transcription output as a downloadable file URL. This is particularly useful when dealing with large responses, as it helps avoid issues with response size limitations.
  • Default: false
  • Type: boolean

Usage

Set "return_as_file": true in your request to receive the transcription output as a file URL. Example:
"input": {
  "url": "https://example.com/path/to/file.mp3",
  "return_as_file": true
}
Note:
  • If the response size exceeds 1 MB, the output will automatically be returned as a file URL, even if return_as_file is set to false.
  • This helps ensure reliable delivery of large transcription outputs.
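The delivery rule in the note above can be sketched as a small predicate. This is an illustration only: the function name is hypothetical, and treating "1 MB" as 1,000,000 bytes is an assumption, since the guide does not define the unit precisely.

```python
# Assumption: "1 MB" means 1,000,000 bytes; the guide does not say whether
# the limit is decimal or binary.
SIZE_LIMIT_BYTES = 1_000_000


def delivered_as_file(return_as_file: bool, response_size_bytes: int) -> bool:
    """Return True when the output is expected to arrive as a file URL.

    Mirrors the documented rule: a file URL is returned when it is requested
    explicitly, or when the response exceeds the size limit.
    """
    return return_as_file or response_size_bytes > SIZE_LIMIT_BYTES


if __name__ == "__main__":
    print(delivered_as_file(False, 250_000))    # small response, inline
    print(delivered_as_file(False, 3_000_000))  # large response, file URL
```

Client code that handles both cases (inline output and file URL) avoids depending on the size of any particular transcription.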

2. sentence_level_timestamps

Description

Include timestamps at the sentence level in your transcription output.
  • Default: false
  • Type: boolean

Usage

Set "sentence_level_timestamps": true to include sentence-level timestamps. Set to false if you do not need them. Example:
"input": {
  "url": "https://example.com/path/to/file.mp3",
  "sentence_level_timestamps": true
}
Output Format
"sentence_level_timestamps": [
    {
        "text": "Welcome to our presentation.",
        "timestamp": [
            29.98,
            29.98
        ],
        "start": 29.98,
        "end": 29.98
    }
]
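One common use of sentence-level timestamps is generating subtitles. The following is a minimal sketch, assuming each entry carries the "text", "start", and "end" fields shown above, with times in seconds. The helper names are hypothetical and not part of the API.

```python
def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def sentences_to_srt(sentences: list[dict]) -> str:
    """Turn sentence-level timestamp entries into SRT subtitle text."""
    blocks = []
    for index, sentence in enumerate(sentences, start=1):
        start = to_srt_time(sentence["start"])
        end = to_srt_time(sentence["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{sentence['text']}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [{"text": "I understand something.", "start": 4.64, "end": 6.82}]
    print(sentences_to_srt(sample))
```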

3. word_level_timestamps

Description

Include timestamps for each word in your transcription output.
  • Default: false
  • Type: boolean

Usage

Set "word_level_timestamps": true to include word-level timestamps. Example:
"input": {
  "url": "https://example.com/path/to/file.mp3",
  "word_level_timestamps": true
}
Output Format
Word-level timestamps are provided in the word_segments array of the output.
"word_segments": [
    {
        "word": "I",
        "start": 4.963,
        "end": 5.024,
        "score": 0.597
    },
    {
        "word": "understand",
        "start": 5.084,
        "end": 5.548,
        "score": 0.471
    },
    {
        "word": "something.",
        "start": 6.073,
        "end": 6.376,
        "score": 0.295
    }
]
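Word-level output can be post-processed to flag words that may need review. The sketch below assumes that "score" is a confidence-like value where higher means more reliable; the guide does not define the field, so both that interpretation and the 0.5 threshold are assumptions. The function name is hypothetical.

```python
def low_confidence_words(word_segments: list[dict], threshold: float = 0.5) -> list[str]:
    """Return the words whose score falls below the threshold.

    Assumption: "score" behaves like a confidence value. Entries without a
    score are treated as low confidence so they are not silently skipped.
    """
    return [w["word"] for w in word_segments if w.get("score", 0.0) < threshold]


if __name__ == "__main__":
    segments = [
        {"word": "I", "start": 4.963, "end": 5.024, "score": 0.597},
        {"word": "understand", "start": 5.084, "end": 5.548, "score": 0.471},
        {"word": "something.", "start": 6.073, "end": 6.376, "score": 0.295},
    ]
    print(low_confidence_words(segments))
```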

4. diarization

Description

Enable speaker separation and identification at the word level.
  • Default: false
  • Type: boolean

Usage

Example:
"input": {
  "url": "https://example.com/path/to/file.mp3",
  "diarization": true
}
Output Format
Speaker labels are included in both segments and word_segments when diarization is enabled.
"word_segments": [
    {
        "word": "I",
        "start": 4.963,
        "end": 5.024,
        "score": 0.597,
        "speaker": "SPEAKER_00"
    },
    {
        "word": "understand",
        "start": 5.084,
        "end": 5.548,
        "score": 0.471,
        "speaker": "SPEAKER_00"
    },
    {
        "word": "something.",
        "start": 6.073,
        "end": 6.376,
        "score": 0.295,
        "speaker": "SPEAKER_00"
    }
]
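Word-level speaker labels are often easier to read when merged into speaker turns. The following is a minimal sketch, assuming each entry in word_segments has the "word", "start", "end", and "speaker" fields shown above. The function name and the "UNKNOWN" fallback label are hypothetical.

```python
def speaker_turns(word_segments: list[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[dict] = []
    for w in word_segments:
        # Hypothetical fallback label for words that carry no speaker field.
        speaker = w.get("speaker", "UNKNOWN")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + w["word"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(
                {"speaker": speaker, "start": w["start"], "end": w["end"], "text": w["word"]}
            )
    return turns


if __name__ == "__main__":
    words = [
        {"word": "I", "start": 4.963, "end": 5.024, "speaker": "SPEAKER_00"},
        {"word": "understand", "start": 5.084, "end": 5.548, "speaker": "SPEAKER_00"},
        {"word": "Good.", "start": 6.9, "end": 7.2, "speaker": "SPEAKER_01"},
    ]
    for turn in speaker_turns(words):
        print(f"{turn['speaker']} [{turn['start']}-{turn['end']}]: {turn['text']}")
```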

5. sentence_diarization

Description

Include speaker information at the sentence level.
  • Default: false
  • Type: boolean

Usage

Example:
"input": {
  "url": "https://example.com/path/to/file.mp3",
  "sentence_diarization": true
}
Output Format
Speaker labels are included in the segments array when sentence_diarization is enabled.
"sentence_level_timestamps": [
    {
        "text": "I understand something.",
        "timestamp": [
            4.64,
            6.82
        ],
        "start": 4.64,
        "end": 6.82,
        "speaker": "SPEAKER_00"
    }
]
Note: If several speakers are identified within one sentence, the speaker who appears most often in that sentence is returned.
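The "most frequent speaker" rule from the note can be illustrated with a short sketch. This is not the service's implementation: the function name is hypothetical, and the tie-breaking behavior (the first speaker encountered wins) is an assumption, since the guide does not describe how ties are resolved.

```python
from collections import Counter
from typing import Optional


def dominant_speaker(word_speakers: list[str]) -> Optional[str]:
    """Return the speaker label that occurs most often in a sentence.

    Assumption: on a tie, the label seen first wins. This follows from
    Counter.most_common, which keeps first-encountered order for equal counts.
    """
    if not word_speakers:
        return None
    return Counter(word_speakers).most_common(1)[0][0]


if __name__ == "__main__":
    labels = ["SPEAKER_00", "SPEAKER_01", "SPEAKER_00"]
    print(dominant_speaker(labels))
```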

6. multichannel

Description

Enable multichannel transcription for audio files with separate speaker channels. Requires either diarization or sentence_diarization to return speaker/channel information.
  • Default: false
  • Type: boolean
If you have a multichannel audio file with multiple speakers, you can transcribe each of them separately. Set "multichannel": true and pair it with either:
  • "diarization": true → word-level speaker and channel labeling
  • "sentence_diarization": true → sentence-level speaker and channel labeling
If the audio has only one channel, the system will automatically fall back to regular speaker diarization. Multichannel transcription is supported for all languages, has no channel limit and incurs no additional cost. It may increase transcription time by approximately 25%.

Usage

"input": {
  "url": "https://example.com/path/to/file.wav",
  "multichannel": true,
  "diarization": true
}
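Because multichannel only returns speaker and channel information when paired with a diarization option, a client-side check before sending the request can catch a missing flag. The following is a hedged sketch: the function name and the warning text are hypothetical, and the API itself may not reject such a request.

```python
def multichannel_warnings(input_params: dict) -> list[str]:
    """Return warnings for a request "input" object that uses multichannel."""
    warnings = []
    uses_diarization = bool(
        input_params.get("diarization") or input_params.get("sentence_diarization")
    )
    if input_params.get("multichannel") and not uses_diarization:
        warnings.append(
            "multichannel is set without diarization or sentence_diarization; "
            "no speaker/channel labels will be returned"
        )
    return warnings


if __name__ == "__main__":
    params = {"url": "https://example.com/path/to/file.wav", "multichannel": True}
    for message in multichannel_warnings(params):
        print("warning:", message)
```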

Output Format (word-level)
The output will include speaker and channel labels for each word.
"word_segments": [
  {
    "word": "Okay,",
    "start": 0.324,
    "end": 0.566,
    "timestamp": [
        0.324,
        0.566
    ],
    "speaker": "SPEAKER_0",
    "channel": 0
  }
]
Output Format (sentence-level)
The output will include speaker and channel labels for each sentence.
"sentence_level_timestamps": [
  {
    "start": 0.324,
    "end": 1.335,
    "text": "Okay, how are you?",
    "speaker": "SPEAKER_0",
    "channel": "0",
    "score": 100.0
  }
]
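To work with each channel separately, the sentence-level entries can be grouped by their "channel" value. The sketch below converts the channel to an integer first, because the examples above show it both as a number (0) and as a string ("0"). The function name is hypothetical.

```python
from collections import defaultdict


def sentences_by_channel(sentences: list[dict]) -> dict[int, list[str]]:
    """Group sentence texts by channel index.

    The channel is normalized with int() because the documented examples
    show it as either a number or a numeric string.
    """
    grouped: dict[int, list[str]] = defaultdict(list)
    for sentence in sentences:
        grouped[int(sentence["channel"])].append(sentence["text"])
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        {"text": "Okay, how are you?", "speaker": "SPEAKER_0", "channel": "0"},
        {"text": "Fine, thanks.", "speaker": "SPEAKER_1", "channel": 1},
    ]
    for channel, texts in sentences_by_channel(sample).items():
        print(channel, texts)
```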

7. language_code

Description

The language_code parameter allows you to specify the language of the transcription to improve the accuracy of transcription, diarization and timestamps. This parameter supports both full language names and their short versions.
  • Default: en (English)
  • Type: string
If no language is specified or the provided language is not in the list, the system automatically detects the language. Specifying the correct language still improves transcription accuracy. If the wrong language is specified, the system returns a translation into the specified language. For multilingual audio, leave the language unspecified for best results. Accuracy varies by language. For tested results, see our accuracy benchmarks.

Usage

Set the language_code to the ISO 639-1 code of the audio’s language. Example:
"input": {
  "url": "https://example.com/path/to/french_audio.mp3",
  "language_code": "fr"
}
Supported Language Codes (ISO 639-1):
Code Language        Code Language        Code Language
en   English         zh   Chinese         de   German
es   Spanish         ru   Russian         ko   Korean
fr   French          ja   Japanese        pt   Portuguese
tr   Turkish         pl   Polish          ca   Catalan
nl   Dutch           ar   Arabic          sv   Swedish
it   Italian         id   Indonesian      hi   Hindi
fi   Finnish         vi   Vietnamese      he   Hebrew
uk   Ukrainian       el   Greek           ms   Malay
cs   Czech           ro   Romanian        da   Danish
hu   Hungarian       ta   Tamil           no   Norwegian
th   Thai            ur   Urdu            hr   Croatian
bg   Bulgarian       lt   Lithuanian      la   Latin
mi   Maori           ml   Malayalam       cy   Welsh
sk   Slovak          te   Telugu          fa   Persian
lv   Latvian         bn   Bengali         sr   Serbian
az   Azerbaijani     sl   Slovenian       kn   Kannada
et   Estonian        mk   Macedonian      br   Breton
eu   Basque          is   Icelandic       hy   Armenian
ne   Nepali          mn   Mongolian       bs   Bosnian
kk   Kazakh          sq   Albanian        sw   Swahili
gl   Galician        mr   Marathi         pa   Punjabi
si   Sinhala         km   Khmer           sn   Shona
yo   Yoruba          so   Somali          af   Afrikaans
oc   Occitan         ka   Georgian        be   Belarusian
tg   Tajik           sd   Sindhi          gu   Gujarati
am   Amharic         yi   Yiddish         lo   Lao
uz   Uzbek           fo   Faroese         ht   Haitian Creole
ps   Pashto          tk   Turkmen         nn   Nynorsk
mt   Maltese         sa   Sanskrit        lb   Luxembourgish
my   Myanmar         bo   Tibetan         tl   Tagalog
mg   Malagasy        as   Assamese        tt   Tatar
haw  Hawaiian        ln   Lingala         ha   Hausa
ba   Bashkir         jw   Javanese        su   Sundanese
yue  Cantonese

Note

  • Mandatory for Diarization: When using diarization or sentence-level diarization, specifying language_code is required for optimal performance.
  • Automatic Detection: If no language is specified or the provided language is not supported, the system will automatically detect the language.
  • Translation Option: If the wrong language is specified, the system will provide a transcription translated into the specified language.
  • Multilingual Content: For multilingual audio, it is better not to specify a language to ensure the best results.
By setting the correct language_code or leveraging the automatic detection feature, you ensure the best possible transcription quality tailored to your audio content.
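The guidance above (specify a supported code, omit it for multilingual audio, and rely on automatic detection otherwise) can be sketched as a small request builder. This is an illustration under stated assumptions: the function name and the multilingual flag are hypothetical, the set below is only a subset of the codes in the table, and full language names (which the parameter also accepts) are not handled.

```python
from typing import Optional

# Subset of the supported codes listed in the table, for illustration only.
SUPPORTED_CODES_SUBSET = {"en", "es", "fr", "de", "ja", "pt", "yue"}


def build_input(url: str, language_code: Optional[str] = None, multilingual: bool = False) -> dict:
    """Build the request "input" object, adding language_code only when useful.

    The code is omitted for multilingual audio and for codes outside the known
    set, so the service falls back to automatic language detection.
    """
    params = {"url": url}
    if multilingual or language_code is None:
        return params
    code = language_code.strip().lower()
    if code in SUPPORTED_CODES_SUBSET:
        params["language_code"] = code
    return params


if __name__ == "__main__":
    print(build_input("https://example.com/path/to/french_audio.mp3", "fr"))
    print(build_input("https://example.com/path/to/mixed_audio.mp3", multilingual=True))
```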