# Speech to text

**Action ID:** `speech_to_text`

## Description

Transcribes an audio file into text using a selectable speech-to-text provider (Groq, Azure, or AWS Transcribe) via the PixelML API.

## Connection

| Name               | Description                                 | Required | Category |
| ------------------ | ------------------------------------------- | -------- | -------- |
| PixelML Connection | The PixelML connection used to call the PixelML API. | True     | pixelml  |

## Input Parameters

| Name     | Type     | Required | Default | Description                                                                                                                                             |
| -------- | -------- | :------: | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| provider | dropdown |     -    | Groq    | Which provider to use for speech to text. Available options: Groq, Azure, AWS\_Transcribe                                                               |
| language | dropdown |     ✓    | -       | The language the audio is in. Supports: English (UK, Canada, US, South Africa), French, Italian, Japanese, Russian, Vietnamese, Chinese variants |
| audio    | string   |     ✓    | -       | Audio file URL to convert and transcribe                                                                                                                |

<details>

<summary>View JSON Schema</summary>

**Input Schema**

```json
{
  "$defs": {
    "SpeechToTextProvider": {
      "description": "Speech to text provider.",
      "enum": [
        "Groq",
        "Azure",
        "AWS_Transcribe"
      ],
      "title": "SpeechToTextProvider",
      "type": "string"
    },
    "SpeechToTextSupportedLanguage": {
      "description": "Speech to text supported language.",
      "enum": [
        "English (United Kingdom)",
        "English (Canada)",
        "English (United States)",
        "English (South Africa)",
        "French (France)",
        "Italian (Italy)",
        "Japanese (Japan)",
        "Russian (Russia)",
        "Vietnamese (Vietnam)",
        "Chinese (Wu, Simplified)",
        "Chinese (Cantonese, Simplified)",
        "Chinese (Mandarin, Simplified)"
      ],
      "title": "SpeechToTextSupportedLanguage",
      "type": "string"
    }
  },
  "description": "Speech to text node input.",
  "properties": {
    "provider": {
      "$ref": "#/$defs/SpeechToTextProvider",
      "default": "Groq",
      "description": "Which provider to use for speech to text",
      "title": "Provider"
    },
    "language": {
      "$ref": "#/$defs/SpeechToTextSupportedLanguage",
      "description": "Which language that the audio is in",
      "title": "Language"
    },
    "audio": {
      "description": "Audio file url to convert transcribe",
      "title": "Audio file",
      "type": "string"
    }
  },
  "required": [
    "language",
    "audio"
  ],
  "title": "SpeechToTextNodeInput",
  "type": "object"
}
```

</details>

## Output Parameters

| Name             | Type   | Description                                  |
| ---------------- | ------ | -------------------------------------------- |
| transcript       | string | The transcribed text from the audio file     |
| transcript\_file | string | URL to a file containing the full transcript |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Speech To Text node output.",
  "properties": {
    "transcript": {
      "title": "Transcribed text",
      "type": "string"
    },
    "transcript_file": {
      "title": "Transcript file",
      "type": "string"
    }
  },
  "required": [
    "transcript",
    "transcript_file"
  ],
  "title": "SpeechToTextNodeOutput",
  "type": "object"
}
```

</details>
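As a sketch of consuming this output, the two required fields can be unpacked defensively before use. The field names come from the output schema above; the helper itself is a hypothetical convenience, not part of any PixelML SDK.

```python
# Illustrative helper: unpack a SpeechToTextNodeOutput dict.
# "transcript" and "transcript_file" are the schema's required fields;
# the function itself is an assumed convenience wrapper.

def parse_stt_output(output: dict) -> tuple:
    """Return (transcript, transcript_file), failing fast on missing keys."""
    missing = {"transcript", "transcript_file"} - output.keys()
    if missing:
        raise KeyError(f"missing required output fields: {sorted(missing)}")
    return output["transcript"], output["transcript_file"]

text, file_url = parse_stt_output({
    "transcript": "Good morning team.",
    "transcript_file": "https://pixelml-storage.com/transcripts/abc123-meeting.txt",
})
```

Failing fast here keeps a missing field from surfacing later as a confusing downstream error.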

## How It Works

This node takes an audio file URL and a language, sends the audio to your chosen speech-to-text provider (Groq, Azure, or AWS Transcribe) through the PixelML API, and returns both the transcribed text and a URL to a file containing the complete transcript.
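The request side of this flow can be sketched as a small helper that assembles a schema-valid input payload. The field names and the provider enum come from the input schema above; the function name and the validation logic are illustrative assumptions, not part of the PixelML SDK.

```python
# Illustrative helper: assemble a SpeechToTextNodeInput payload.
# Field names ("provider", "language", "audio") and the provider enum
# match the published input schema; everything else is an assumption.

SUPPORTED_PROVIDERS = {"Groq", "Azure", "AWS_Transcribe"}

def build_stt_input(audio_url: str, language: str, provider: str = "Groq") -> dict:
    """Return a dict matching the SpeechToTextNodeInput schema."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if not audio_url.startswith(("http://", "https://")):
        raise ValueError("audio must be an HTTP(S) URL to an audio file")
    return {"provider": provider, "language": language, "audio": audio_url}

payload = build_stt_input(
    "https://example.com/team-meeting.mp3", "English (United States)"
)
```

The resulting dict can then be submitted as the node's input wherever your workflow expects it.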

## Usage Examples

### Example 1: English Meeting Transcription with Groq

**Input:**

```
provider: "Groq"
language: "English (United States)"
audio: "https://example.com/team-meeting.mp3"
```

**Output:**

```
transcript: "Good morning team. Today we'll discuss the Q4 roadmap and our strategic priorities for the upcoming quarter. Let's start with product updates from the engineering team."
transcript_file: "https://pixelml-storage.com/transcripts/abc123-meeting.txt"
```

### Example 2: French Customer Call with Azure

**Input:**

```
provider: "Azure"
language: "French (France)"
audio: "https://example.com/customer-call-fr.wav"
```

**Output:**

```
transcript: "Bonjour, merci d'avoir appelé notre service client. Comment puis-je vous aider aujourd'hui? Je comprends votre problème et je vais vous aider à le résoudre."
transcript_file: "https://pixelml-storage.com/transcripts/def456-call-fr.txt"
```

### Example 3: Japanese Interview with AWS Transcribe

**Input:**

```
provider: "AWS_Transcribe"
language: "Japanese (Japan)"
audio: "https://example.com/interview-jp.mp3"
```

**Output:**

```
transcript: "本日はインタビューにお越しいただきありがとうございます。まず、あなたの経験とスキルについて教えてください。"
transcript_file: "https://pixelml-storage.com/transcripts/ghi789-interview-jp.txt"
```

## Common Use Cases

* **Meeting Transcription**: Convert recorded business meetings, standups, or conference calls into searchable text documents
* **Customer Service Analysis**: Transcribe support calls for quality assurance, training, or sentiment analysis
* **Interview Documentation**: Create written records of job interviews, research interviews, or media interviews
* **Podcast Production**: Generate transcripts for podcast episodes to improve accessibility and SEO
* **Voice Note Processing**: Convert voice memos and audio notes into text for easier organization and search
* **Multilingual Content Creation**: Transcribe audio content in multiple languages for translation or localization workflows
* **Legal Documentation**: Create accurate transcripts of depositions, hearings, or client consultations

## Error Handling

| Error Type               | Cause                                                   | Solution                                                               |
| ------------------------ | ------------------------------------------------------- | ---------------------------------------------------------------------- |
| Invalid API Connection   | PixelML connection credentials are missing or incorrect | Verify your PixelML API credentials in the connection settings         |
| Audio URL Inaccessible   | Cannot download audio file from provided URL            | Ensure the URL is publicly accessible and returns a valid audio file   |
| Unsupported Audio Format | Audio format not supported by the selected provider     | Convert audio to a commonly supported format like MP3, WAV, or M4A     |
| Language Not Supported   | Selected language not available for the chosen provider | Select a different language or switch to a provider that supports it   |
| Transcription Failed     | Provider unable to process the audio                    | Check audio quality and ensure it contains clear speech                |
| Provider Unavailable     | Selected speech-to-text provider is temporarily down    | Try a different provider or retry after a short delay                  |
| Rate Limit Exceeded      | Too many transcription requests in a short time         | Implement delays between requests or contact PixelML about rate limits |
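Two of the errors above (Provider Unavailable, Rate Limit Exceeded) are transient and worth retrying with a delay, as the solutions column suggests. The sketch below assumes the caller surfaces those errors as exceptions whose message matches the error type; adapt the matching to however your client actually reports failures.

```python
import time

# Error types from the table above that are safe to retry.
TRANSIENT_ERRORS = {"Provider Unavailable", "Rate Limit Exceeded"}

def call_with_retry(transcribe, payload, retries=3, base_delay=1.0):
    """Call transcribe(payload), retrying transient failures with
    exponential backoff (base_delay, 2*base_delay, 4*base_delay, ...)."""
    for attempt in range(retries):
        try:
            return transcribe(payload)
        except RuntimeError as err:
            if str(err) not in TRANSIENT_ERRORS or attempt == retries - 1:
                raise  # non-transient error, or out of attempts
            time.sleep(base_delay * (2 ** attempt))
```

Non-transient errors (for example, invalid credentials) are re-raised immediately, since retrying them only wastes quota.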

## Notes

* **Provider Selection**: Each provider (Groq, Azure, AWS Transcribe) has different strengths. Groq offers fast processing, Azure excels at multiple languages, and AWS provides robust accuracy.
* **Language Matching**: Always select the correct language to ensure accurate transcription. Mismatched languages result in poor quality output.
* **Audio Quality**: Clear audio with minimal background noise produces the best transcription results. Consider audio preprocessing for noisy files.
* **Supported Languages**: The node supports 12 language variants including multiple English dialects, French, Italian, Japanese, Russian, Vietnamese, and Chinese dialects.
* **File Output**: The transcript\_file URL provides a persistent copy of the full transcript, useful for long audio files or archiving.
* **Cost Considerations**: Different providers have different pricing models. Check PixelML's pricing for each provider.
* **Processing Time**: Transcription time varies by provider and audio length. Longer files take more time to process.
* **Accuracy**: Transcription accuracy depends on audio quality, speaker clarity, accent, and technical terminology. Review transcripts for critical applications.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.agenticflow.ai/reference/nodes/speech_to_text.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
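For example, the query above could be issued from Python using only the standard library. The base URL and the `ask` parameter are exactly those documented above; the helper name is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

DOCS_URL = "https://docs.agenticflow.ai/reference/nodes/speech_to_text.md"

def ask_docs_url(question: str) -> str:
    """Build the documentation-query URL for a natural-language question."""
    return f"{DOCS_URL}?{urlencode({'ask': question})}"

# Performing the GET requires network access:
# with urlopen(ask_docs_url("Which audio formats does each provider accept?")) as resp:
#     print(resp.read().decode("utf-8"))
```

`urlencode` takes care of escaping spaces and punctuation so the question survives as a single query parameter.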
