# Text Extract

**Action ID:** `text_extract`

## Description

Extracts text from PDFs, images, and other document formats using OCR and document parsing.

## Connection

| Name               | Description                                 | Required | Category |
| ------------------ | ------------------------------------------- | :------: | -------- |
| PixelML Connection | Connection used to call the PixelML API.    |     ✓    | pixelml  |

## Input Parameters

| Name | Type   | Required | Default | Description                                                                               |
| ---- | ------ | :------: | ------- | ----------------------------------------------------------------------------------------- |
| file | string |     ✓    | -       | URL of the file to extract text from. Supports PDFs, images, and various document formats |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Text extract node input.",
  "properties": {
    "file": {
      "description": "File to extract text from",
      "title": "File",
      "type": "string"
    }
  },
  "required": [
    "file"
  ],
  "title": "TextExtractNodeInput",
  "type": "object"
}
```

</details>
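
Because the schema requires a single string field, `file`, it can be useful to check a payload before the node runs. The following is a minimal sketch, not part of the node itself: the helper name `validate_text_extract_input` is hypothetical, and the http/https check is an assumption based on the parameter being described as a URL.

```python
from urllib.parse import urlparse


def validate_text_extract_input(payload: dict) -> list[str]:
    """Return a list of problems found in a text_extract input payload.

    Hypothetical helper: mirrors the input schema (one required string
    field, `file`) and adds an assumed http/https URL check.
    """
    if "file" not in payload:
        return ["missing required field: file"]
    value = payload["file"]
    if not isinstance(value, str):
        return ["field 'file' must be a string"]
    problems = []
    parsed = urlparse(value)
    # Assumption: the file is fetched over HTTP(S), since the parameter is a URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("field 'file' does not look like an http(s) URL")
    return problems
```

An empty list means the payload passed every check in this sketch; it does not guarantee that the URL is reachable or that the format is supported.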

## Output Parameters

| Name | Type   | Description                              |
| ---- | ------ | ---------------------------------------- |
| text | string | The extracted text content from the file |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Text extract node output.",
  "properties": {
    "text": {
      "description": "Extracted text from the file",
      "title": "Extracted text",
      "type": "string"
    }
  },
  "required": [
    "text"
  ],
  "title": "TextExtractNodeOutput",
  "type": "object"
}
```

</details>

## How It Works

This node combines OCR (Optical Character Recognition) with document parsing to extract text from a file. For PDFs, it reads embedded text directly when it is available and applies OCR to scanned PDFs. For images, it uses computer vision to detect and recognize text characters. Where possible, the extracted text keeps the original line breaks and paragraph structure.
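
The branching described above can be pictured with a small sketch. It is an illustration only, not the node's actual implementation: the function name `choose_strategy`, the `Strategy` values, and the extension list are all assumptions made for the example.

```python
from enum import Enum


class Strategy(Enum):
    EMBEDDED_TEXT = "embedded_text"  # read text already stored in the PDF
    OCR = "ocr"                      # recognize characters from pixels


# Assumed image extensions, taken from the formats named in the Notes section.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")


def choose_strategy(filename: str, embedded_text: str = "") -> Strategy:
    """Pick an extraction strategy following the description above.

    `embedded_text` stands in for whatever text a PDF parser found in the
    file; an empty or whitespace-only value is treated as a scanned PDF.
    """
    lower = filename.lower()
    if lower.endswith(IMAGE_EXTENSIONS):
        return Strategy.OCR
    if lower.endswith(".pdf"):
        return Strategy.EMBEDDED_TEXT if embedded_text.strip() else Strategy.OCR
    raise ValueError(f"unsupported format in this sketch: {filename}")
```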

## Usage Examples

### Example 1: Extract Text from PDF

**Input:**

```
file: "https://example.com/invoice.pdf"
```

**Output:**

```
text: "Invoice #12345
Date: January 15, 2025
Customer: Acme Corp
Total Amount: $1,250.00
..."
```
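
Once the node returns text like the output above, downstream steps often need individual fields rather than one string. A minimal sketch of that post-processing follows; the function name `parse_invoice_fields` and the regular expressions are illustrative assumptions that match this sample layout only, and they are not part of the node.

```python
import re


def parse_invoice_fields(text: str) -> dict:
    """Pull a few labelled fields out of extracted invoice text.

    The patterns match the layout of the sample output only; real
    invoices vary and will need their own patterns.
    """
    patterns = {
        "invoice_number": r"Invoice #(\w+)",
        "date": r"Date:\s*(.+)",
        "customer": r"Customer:\s*(.+)",
        "total": r"Total Amount:\s*\$([\d,]+\.\d{2})",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            # Fields that are not found are simply left out of the result.
            fields[name] = match.group(1).strip()
    return fields
```

Missing fields are omitted instead of raising an error, which lets a workflow decide how to handle partially readable documents.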

### Example 2: Extract Text from Image

**Input:**

```
file: "https://example.com/business-card.jpg"
```

**Output:**

```
text: "John Doe
Senior Developer
Acme Technology
john.doe@example.com
+1 (555) 123-4567"
```

### Example 3: Extract Text from Scanned Document

**Input:**

```
file: "https://example.com/scanned-contract.pdf"
```

**Output:**

```
text: "SERVICE AGREEMENT

This agreement is made on January 15, 2025 between...
Terms and Conditions:
1. Service Duration
2. Payment Terms
..."
```

## Common Use Cases

* **Invoice Processing**: Extract text from invoices for automated accounting and data entry
* **Document Digitization**: Convert scanned documents and images into searchable, editable text
* **Receipt Processing**: Extract information from receipts for expense tracking and reporting
* **Business Card Scanning**: Extract contact information from business card images
* **Form Processing**: Extract data from filled forms and applications
* **Legal Document Processing**: Extract text from contracts and legal documents for review
* **ID Verification**: Extract text from identification documents for verification workflows

## Error Handling

| Error Type         | Cause                                            | Solution                                                 |
| ------------------ | ------------------------------------------------ | -------------------------------------------------------- |
| Invalid File URL   | URL is malformed or file is inaccessible         | Verify the file URL is valid and publicly accessible     |
| Unsupported Format | File format is not supported for text extraction | Convert file to a supported format (PDF, JPG, PNG, etc.) |
| No Text Found      | File contains no readable text                   | Ensure the file contains visible text and is not blank   |
| File Too Large     | File size exceeds maximum allowed                | Compress or split the file into smaller parts            |
| Low Image Quality  | Image quality is too poor for OCR                | Use a higher resolution scan or image                    |
| Connection Error   | Cannot connect to PixelML API                    | Check your PixelML connection settings and API key       |
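
The table suggests that most errors need a different file or URL, while a connection problem may clear up on its own. The sketch below retries only that case. It is an assumption-heavy illustration: this page does not specify how errors are surfaced, so the `ExtractError` class, the `RETRYABLE` set, and the `extract` callable are all hypothetical.

```python
import time


class ExtractError(Exception):
    """Hypothetical exception carrying one of the error types from the table."""

    def __init__(self, error_type: str):
        super().__init__(error_type)
        self.error_type = error_type


# Assumption: only connection problems are worth retrying unchanged; the other
# rows in the table call for a different file or URL before trying again.
RETRYABLE = {"Connection Error"}


def run_with_retry(extract, file_url: str, attempts: int = 3, delay: float = 1.0) -> str:
    """Call `extract(file_url)`, retrying retryable errors with linear backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return extract(file_url)
        except ExtractError as err:
            # Give up immediately on non-retryable errors or on the last attempt.
            if err.error_type not in RETRYABLE or attempt == attempts:
                raise
            time.sleep(delay * attempt)
```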

## Notes

* **Supported Formats**: Works with PDF, JPG/JPEG, PNG, WebP, and other common image and document formats.
* **OCR Accuracy**: Accuracy depends on image quality, text clarity, and font legibility. High-resolution images produce the best results.
* **Language Support**: The system supports multiple languages for text extraction. The input schema on this page exposes only `file`, so the node has no documented parameter for specifying a language.
* **Text Structure**: The node attempts to preserve text structure including paragraphs and line breaks.
* **Processing Time**: Extraction time varies based on file size and complexity. Large PDFs may take 30-60 seconds.
* **Handwriting**: OCR works best with printed text. Handwritten text may have lower accuracy rates.
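
Because the node tries to preserve paragraphs and line breaks, the output can be split back into paragraphs for later steps. A minimal sketch, assuming paragraphs are separated by blank lines as in the scanned-contract example; the helper name `split_paragraphs` is hypothetical.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split extracted text into paragraphs on blank lines.

    Line breaks inside a paragraph are kept as-is; only runs of blank
    lines are treated as paragraph boundaries.
    """
    parts = re.split(r"\n\s*\n", text.strip())
    return [part for part in parts if part.strip()]
```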

