Audio Transcription

Action ID: openai_transcriptions

Description

Transcribe audio to text using OpenAI's Whisper model. This node converts spoken audio in multiple formats into accurate written transcripts. Whisper handles numerous languages, accents, and technical vocabulary, making it highly versatile for audio-processing workflows.

Provider

OpenAI

Connection

| Name | Description | Required | Category |
|------|-------------|----------|----------|
| OpenAI Connection | The OpenAI connection to use for transcription. | Yes | openai |

Input Parameters

| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| model | dropdown | No | whisper-1 | The model to use for transcription. |
| audio_file | string | Yes | - | The audio file to transcribe. Accepts a file_id, an HTTP URL, or an uploaded file. Supported formats: MP3, MP4, WebM. |
| response_format | dropdown | No | text | The format of the transcript output. Options: json, text, srt, verbose_json, vtt. |
| language | string | No | - | The language of the input audio. Supplying the input language in ISO-639-1 format (e.g., "en") improves accuracy and reduces latency. |
| prompt | string | No | - | Optional text to guide the model's style or continue a previous audio segment. |
| temperature | number | No | 0.0 | The sampling temperature, between 0 and 1. Higher values (e.g., 0.8) make the output more random; lower values (e.g., 0.2) make it more focused and deterministic. |

View JSON Schema
{
  "description": "Transcription node input.",
  "properties": {
    "model": {
      "default": "whisper-1",
      "description": "The model to use for transcription.",
      "enum": [
        "whisper-1"
      ],
      "title": "Model",
      "type": "string"
    },
    "audio_file": {
      "description": "The audio file to transcribe.",
      "title": "Audio File",
      "type": "string"
    },
    "response_format": {
      "default": "text",
      "description": "The format of the transcript output.",
      "enum": [
        "json",
        "text",
        "srt",
        "verbose_json",
        "vtt"
      ],
      "title": "Response Format",
      "type": "string"
    },
    "language": {
      "description": "The language of the input audio. Supplying the input language in ISO-639-1 format will improve accuracy and latency.",
      "title": "Language",
      "type": "string"
    },
    "prompt": {
      "description": "An optional text to guide the model's style or continue a previous audio segment.",
      "title": "Prompt",
      "type": "string"
    },
    "temperature": {
      "default": 0.0,
      "description": "The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.",
      "maximum": 1.0,
      "minimum": 0.0,
      "title": "Temperature",
      "type": "number"
    }
  },
  "required": [
    "audio_file"
  ],
  "title": "TranscriptionInput",
  "type": "object"
}

Output Parameters

| Name | Type | Description |
|------|------|-------------|
| text | string | The transcribed text from the audio file. |

View JSON Schema
{
  "description": "Response from transcription.",
  "properties": {
    "text": {
      "title": "Text",
      "type": "string"
    }
  },
  "title": "TranscriptionResponse",
  "type": "object"
}

How It Works

This node sends your audio file to OpenAI's Whisper API for transcription. Whisper automatically detects the language if not specified, though providing the language code (ISO-639-1 format like "en" for English, "fr" for French) improves accuracy and speed. The response format determines how the transcription is returned: plain text, JSON with metadata, or timestamped formats (SRT/VTT). Optional prompts can guide the model's interpretation of ambiguous words or maintain consistency with previous segments.
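The mapping from node inputs to an API call can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the `build_transcription_kwargs` helper is a hypothetical name that mirrors the input schema above (enum check, temperature bounds, the single required field), and the commented-out call shows roughly what the official `openai` Python SDK expects.

```python
# Sketch: validate node inputs and assemble the keyword arguments a Whisper
# transcription call would receive. Validation mirrors the input JSON schema.

VALID_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}

def build_transcription_kwargs(audio_file, model="whisper-1",
                               response_format="text", language=None,
                               prompt=None, temperature=0.0):
    """Enforce the schema locally before anything is sent over the wire."""
    if not audio_file:
        raise ValueError("audio_file is required")
    if response_format not in VALID_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(VALID_FORMATS)}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0.0 and 1.0")
    kwargs = {"model": model, "file": audio_file,
              "response_format": response_format, "temperature": temperature}
    # Optional fields are omitted entirely when unset, matching the schema.
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt
    return kwargs

# With the official SDK the call would look roughly like:
#   from openai import OpenAI
#   client = OpenAI()
#   with open("interview.mp3", "rb") as f:
#       text = client.audio.transcriptions.create(
#           **build_transcription_kwargs(f, language="en"))
```

Failing fast on bad enum values and out-of-range temperatures surfaces the same errors listed under Error Handling without spending an API call.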

Usage Examples

Example 1: Simple Audio Transcription

Input:

audio_file: "https://example.com/interview.mp3"
model: "whisper-1"
response_format: "text"
language: "en"
temperature: 0.0

Output:

text: "Good morning, thank you for joining us today. I'm excited to discuss the new project developments..."

Example 2: Timestamped Video Subtitle Transcription

Input:

audio_file: "file_id_abc123"
model: "whisper-1"
response_format: "vtt"
language: "en"
prompt: "This is a technical presentation about machine learning"

Output:

text: "WEBVTT

00:00.000 --> 00:05.000
Good morning, thank you for joining us today.

00:05.000 --> 00:12.000
I'm excited to discuss the new project developments..."
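A VTT transcript like the one above can be split into timed cues downstream. The sketch below is a deliberately minimal, standard-library-only parser for output shaped like Example 2; it ignores WEBVTT headers, cue identifiers, and cue settings, so treat it as illustrative rather than a full WebVTT implementation.

```python
import re

# Timestamps may appear as MM:SS.mmm or HH:MM:SS.mmm.
_CUE_TIMING = re.compile(
    r"((?:\d{2}:)?\d{2}:\d{2}\.\d{3}) --> ((?:\d{2}:)?\d{2}:\d{2}\.\d{3})")

def parse_vtt(vtt_text):
    """Split a WEBVTT transcript (response_format="vtt") into
    (start, end, text) tuples, one per cue."""
    cues = []
    for block in vtt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if not lines:
            continue
        m = _CUE_TIMING.match(lines[0])
        if m:  # non-matching blocks (e.g., the WEBVTT header) are skipped
            cues.append((m.group(1), m.group(2), " ".join(lines[1:])))
    return cues
```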

Example 3: Non-English Audio with Context

Input:

audio_file: "https://example.com/spanish_podcast.mp3"
model: "whisper-1"
response_format: "json"
language: "es"
prompt: "This is a Spanish business podcast discussing marketing strategies"
temperature: 0.2

Output:

text: "Buenos días, gracias por acompañarnos. Hoy hablaremos sobre nuevas estrategias de marketing digital..."

Common Use Cases

  • Meeting Transcriptions: Convert recorded meetings into text transcripts for documentation

  • Podcast Transcription: Create searchable text versions of podcast episodes

  • Accessibility: Generate captions and transcripts for video content

  • Customer Support: Transcribe customer service calls for training and quality assurance

  • Research: Convert interview recordings into text for analysis

  • Content Creation: Generate blog post content from audio recordings

  • Legal Documentation: Create accurate transcripts of depositions and proceedings

Error Handling

| Error Type | Cause | Solution |
|------------|-------|----------|
| Invalid Audio Format | Audio file format not supported | Use MP3, MP4, or WebM formats |
| Audio Too Long | Audio file exceeds size or duration limits | Split longer audio into smaller chunks |
| Invalid Language Code | Language code is incorrectly formatted | Use ISO-639-1 codes (e.g., "en", "es", "fr") |
| Invalid Response Format | Response format not supported | Use text, json, srt, verbose_json, or vtt |
| Invalid Temperature | Temperature outside the 0-1 range | Use a value between 0.0 and 1.0 |
| Authentication Error | Invalid or missing API key | Verify the OpenAI connection is properly configured |
| File Not Found | Audio file URL is invalid or inaccessible | Check the URL and ensure the file is publicly accessible |
| Audio Quality Issues | Audio is too unclear or noisy | Use clearer audio or a lower temperature |
| Timeout Error | Request took too long to process | Use shorter audio or ensure a stable connection |
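The first two rows above can be caught locally before any API call. The following is a sketch under two assumptions taken from this page: the 25 MB limit stated in the Notes section, and an extension allow-list limited to the formats named in the `audio_file` parameter description. Actual platform limits may differ, and `preflight_check` is a hypothetical helper, not part of the node.

```python
import os

MAX_BYTES = 25 * 1024 * 1024          # 25 MB upload limit (see Notes below)
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".webm"}

def preflight_check(path):
    """Catch the two most common local failures before the API call:
    an unsupported container format and an oversized file."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(
            f"unsupported format {ext!r}; use one of {sorted(ALLOWED_EXTENSIONS)}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(
            f"file is {size} bytes; split audio over {MAX_BYTES} bytes into chunks")
    return True
```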

Notes

  • Supported Formats: MP3, MP4, WebM, and other common audio formats are supported. Audio files should be under 25MB.

  • Language Detection: Whisper can auto-detect language, but specifying it (e.g., "en", "es", "fr") improves accuracy and speed.

  • Response Formats: Choose "text" for simple transcript, "srt" or "vtt" for timestamped subtitles, "json" or "verbose_json" for detailed metadata.

  • Prompt Engineering: Use prompts to guide interpretation of technical terms, proper nouns, or to maintain consistency across segments.

  • Temperature Control: Lower temperature (0.0-0.3) for accurate transcription. Higher (0.5-1.0) for more variable interpretation.

  • Multi-language Support: Whisper supports 99+ languages. Works reliably across accents and dialects.

  • Accuracy: For best results, use clear audio with minimal background noise. Whisper is quite robust but benefits from good audio quality.
