Extract Content

Action ID: extract_content

Description

Extract structured content from text using a specified schema.

Input Parameters

| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| extract_from | string | Yes | - | The content to extract from |
| extract_schema | object | Yes | - | A JSON schema representing the structure for extracted content |
| model | dropdown | No | gpt-3.5-turbo-0613 | The LLM model to use for extracting content. Available options: gpt-3.5-turbo-0613, gpt-4-32k-0613 |

Input Schema

{
  "$defs": {
    "AllowedModels": {
      "description": "Allowed models for the web scraping node.",
      "enum": [
        "gpt-3.5-turbo-0613",
        "gpt-4-32k-0613"
      ],
      "title": "AllowedModels",
      "type": "string"
    }
  },
  "description": "Extract content node input.",
  "properties": {
    "extract_from": {
      "description": "The content to extract from.",
      "title": "Extract From",
      "type": "string"
    },
    "extract_schema": {
      "additionalProperties": true,
      "description": "A json string represent schema for the extracted content.",
      "title": "Extract Schema",
      "type": "object"
    },
    "model": {
      "$ref": "#/$defs/AllowedModels",
      "default": "gpt-3.5-turbo-0613",
      "description": "The LLM model to use for extracting content.",
      "title": "Model"
    }
  },
  "required": [
    "extract_from",
    "extract_schema"
  ],
  "title": "ExtractContentNodeInput",
  "type": "object"
}
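
The input schema above can be checked on the caller's side before the node runs. The following is a minimal sketch, not the node's own validation code: `build_extract_input` is a hypothetical helper name, and the only rules it applies are the ones stated in the schema (two required fields, an enum of two models, and a default model).

```python
# Values copied from the input schema above.
ALLOWED_MODELS = ("gpt-3.5-turbo-0613", "gpt-4-32k-0613")
DEFAULT_MODEL = "gpt-3.5-turbo-0613"


def build_extract_input(extract_from, extract_schema, model=DEFAULT_MODEL):
    """Return a dict shaped like ExtractContentNodeInput, or raise ValueError.

    Hypothetical helper: it mirrors the published schema only.
    """
    if not isinstance(extract_from, str):
        raise ValueError("extract_from must be a string")
    if not isinstance(extract_schema, dict):
        raise ValueError("extract_schema must be a JSON object (dict)")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")
    return {
        "extract_from": extract_from,
        "extract_schema": extract_schema,
        "model": model,
    }


payload = build_extract_input(
    "John Smith is the CEO.",
    {"type": "object", "properties": {"name": {"type": "string"}}},
)
```

Omitting `model` falls back to gpt-3.5-turbo-0613, matching the default declared in the schema.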

Output Parameters

| Name | Type | Description |
|---|---|---|
| extracted_content | array | The extracted content from the text as an array of objects matching the provided schema |

Output Schema

{
  "description": "Extract content node output.",
  "properties": {
    "extracted_content": {
      "description": "The extracted content from the web URLs.",
      "items": {
        "additionalProperties": true,
        "type": "object"
      },
      "title": "Extracted Content",
      "type": "array"
    }
  },
  "required": [
    "extracted_content"
  ],
  "title": "ExtractContentNodeOutput",
  "type": "object"
}
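
Downstream steps may want to confirm that a result has the shape described by the output schema before using it. This is a small sketch under that assumption; `check_extract_output` is a hypothetical name, and it checks only what the schema states: a required `extracted_content` key holding an array of objects.

```python
def check_extract_output(output):
    """Return the list of extracted objects, or raise ValueError.

    Hypothetical helper that mirrors ExtractContentNodeOutput.
    """
    if not isinstance(output, dict) or "extracted_content" not in output:
        raise ValueError("missing required key: extracted_content")
    items = output["extracted_content"]
    if not isinstance(items, list):
        raise ValueError("extracted_content must be an array")
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not an object")
    return items


items = check_extract_output({"extracted_content": [{"name": "John Smith"}]})
```

Because the item schema sets `additionalProperties` to true, the sketch does not restrict which keys each object may carry.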

How It Works

This node uses a large language model (LLM) to parse unstructured text and extract the fields defined by a JSON schema you provide. The LLM analyzes the input text, identifies the data points that match your schema, and returns them as an array of objects in a consistent structure. This enables automated data extraction from emails, documents, web content, and other text sources.
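
The flow described above can be pictured with a short sketch. The node's actual prompt and internals are not documented here, so everything in this block is an assumption for illustration: the prompt wording, the `extract_content` function, and the `call_llm` parameter (any callable that takes a prompt string and returns the model's reply as text).

```python
import json


def extract_content(extract_from, extract_schema, call_llm):
    """Illustrative flow only: build a prompt, call an LLM, parse the reply.

    call_llm is a hypothetical callable(prompt: str) -> str.
    """
    prompt = (
        "Extract every item that matches this JSON schema from the text.\n"
        "Reply with a JSON array of objects and nothing else.\n\n"
        f"Schema:\n{json.dumps(extract_schema, indent=2)}\n\n"
        f"Text:\n{extract_from}"
    )
    reply = call_llm(prompt)
    parsed = json.loads(reply)  # raises ValueError if the reply is not JSON
    if isinstance(parsed, dict):
        parsed = [parsed]  # normalize a single object to an array
    return {"extracted_content": parsed}


def fake_llm(prompt):
    """Stand-in for a real model call, so the sketch runs offline."""
    return '[{"name": "John Smith"}]'


result = extract_content(
    "John Smith is the CEO.",
    {"type": "object", "properties": {"name": {"type": "string"}}},
    fake_llm,
)
```

Passing the model call in as a parameter keeps the sketch independent of any particular API client; swapping `fake_llm` for a real client is left to the reader.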

Usage Examples

Example 1: Extract Contact Information

Input:

extract_from: "John Smith is the CEO. You can reach him at [email protected] or call (555) 123-4567. His office is at 123 Main St, New York, NY 10001."
extract_schema: {
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "email": {"type": "string"},
    "phone": {"type": "string"},
    "address": {"type": "string"}
  }
}
model: "gpt-3.5-turbo-0613"

Output:

extracted_content: [
  {
    "name": "John Smith",
    "email": "[email protected]",
    "phone": "(555) 123-4567",
    "address": "123 Main St, New York, NY 10001"
  }
]

Example 2: Extract Product Information

Input:

extract_from: "We have the iPhone 15 Pro available for $999. The Samsung Galaxy S24 costs $899 and comes in blue, black, and silver. The Google Pixel 8 is priced at $699."
extract_schema: {
  "type": "object",
  "properties": {
    "product_name": {"type": "string"},
    "price": {"type": "number"},
    "colors": {"type": "array", "items": {"type": "string"}}
  }
}
model: "gpt-4-32k-0613"

Output:

extracted_content: [
  {
    "product_name": "iPhone 15 Pro",
    "price": 999,
    "colors": []
  },
  {
    "product_name": "Samsung Galaxy S24",
    "price": 899,
    "colors": ["blue", "black", "silver"]
  },
  {
    "product_name": "Google Pixel 8",
    "price": 699,
    "colors": []
  }
]
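
A later workflow step would typically filter or reshape this array. The sketch below works on the Example 2 output copied verbatim; the helper names `products_under` and `products_with_colors` are hypothetical and not part of the node.

```python
# Output of Example 2, copied from above.
extracted_content = [
    {"product_name": "iPhone 15 Pro", "price": 999, "colors": []},
    {"product_name": "Samsung Galaxy S24", "price": 899,
     "colors": ["blue", "black", "silver"]},
    {"product_name": "Google Pixel 8", "price": 699, "colors": []},
]


def products_under(items, max_price):
    """Names of products with a numeric price strictly below max_price."""
    return [
        item["product_name"]
        for item in items
        if isinstance(item.get("price"), (int, float))
        and item["price"] < max_price
    ]


def products_with_colors(items):
    """Map product name to its colors, skipping empty or missing lists."""
    return {
        item["product_name"]: item["colors"]
        for item in items
        if item.get("colors")
    }


cheap = products_under(extracted_content, 900)
colored = products_with_colors(extracted_content)
```

The `item.get(...)` guards are there because a field that is absent from the source text may come back empty or missing, as the empty `colors` arrays in this example show.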

Example 3: Extract Event Details

Input:

extract_from: "Join us for the Tech Summit 2024 on March 15-17 at the Convention Center. Registration opens at 8 AM. Keynote speaker: Dr. Jane Doe. Cost: $299 for early bird."
extract_schema: {
  "type": "object",
  "properties": {
    "event_name": {"type": "string"},
    "dates": {"type": "string"},
    "location": {"type": "string"},
    "speaker": {"type": "string"},
    "price": {"type": "number"}
  }
}
model: "gpt-3.5-turbo-0613"

Output:

extracted_content: [
  {
    "event_name": "Tech Summit 2024",
    "dates": "March 15-17",
    "location": "Convention Center",
    "speaker": "Dr. Jane Doe",
    "price": 299
  }
]

Common Use Cases

  • Email Processing: Extract key information like names, dates, and action items from emails

  • Document Parsing: Pull structured data from invoices, contracts, and business documents

  • Web Scraping: Extract specific data points from web page content

  • Customer Data Extraction: Parse customer inquiries to extract contact details and requirements

  • Product Catalog Creation: Extract product details from descriptions to build structured catalogs

  • Resume Parsing: Extract candidate information like skills, experience, and education from resumes

  • Lead Generation: Extract business contact information from various text sources

Error Handling

| Error Type | Cause | Solution |
|---|---|---|
| Invalid Schema | JSON schema is malformed or invalid | Validate your JSON schema structure and ensure it follows proper JSON syntax |
| Model Error | LLM API is unavailable or rate limited | Retry the operation or switch to an alternative model |
| Empty Content | extract_from field is empty or null | Provide valid text content for extraction |
| Schema Mismatch | Content doesn't match expected schema structure | Adjust schema to match the actual content structure or provide appropriate content |
| Token Limit Exceeded | Input text is too long for the model | Split text into smaller chunks or use gpt-4-32k-0613 for larger content |
| Authentication Failed | Invalid API credentials | Verify your OpenAI API connection and credentials |
| Extraction Failed | LLM unable to extract matching data | Simplify the schema or provide more explicit content that matches the schema |
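
Two of the solutions above, splitting long text and retrying after a model error, can be sketched in a few lines. All names here are hypothetical (`split_into_chunks`, `run_with_retry`, `run_node`), and two assumptions are worth stating plainly: character count is used as a rough stand-in for tokens, and a failed run is assumed to surface as a `RuntimeError`, which may not match how your platform reports errors.

```python
import time


def split_into_chunks(text, max_chars=8000):
    """Split text on blank lines into pieces of at most max_chars characters.

    max_chars is an assumed, rough proxy for the model's token limit.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single oversized paragraph is cut hard at max_chars boundaries.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks


def run_with_retry(run_node, payload, retries=3, base_delay=1.0,
                   sleep=time.sleep):
    """Call run_node(payload), retrying with exponential backoff.

    run_node is a hypothetical callable that executes the node and
    raises RuntimeError (assumed) on a transient model error.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return run_node(payload)
        except RuntimeError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Each chunk would be sent as its own `extract_from` value and the resulting `extracted_content` arrays concatenated; items that straddle a chunk boundary may be missed, which is the usual trade-off of this approach.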

Notes

  • Schema Design: Design clear, specific schemas that match your expected data structure. Include field types and nested objects as needed.

  • Model Selection: Use gpt-3.5-turbo-0613 for simple extractions and gpt-4-32k-0613 for complex content or higher accuracy requirements.

  • Content Quality: Well-structured input text tends to produce more accurate extraction results.

  • Array Output: The node returns an array of objects, allowing extraction of multiple items from a single text source.

  • Data Types: Ensure your schema specifies appropriate data types (string, number, boolean, array, object) for accurate extraction.

  • Performance: Extraction time varies with content length and model choice. gpt-4-32k-0613 is generally slower but more accurate than gpt-3.5-turbo-0613.
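
As a companion to the Data Types note, the sketch below checks each extracted field against the `type` declared in a flat schema. It is a hypothetical helper (`type_mismatches`), not something the node provides, and it covers only the five types named in the note; nested schemas and other JSON Schema keywords are ignored.

```python
# Python equivalents of the JSON types named in the Data Types note.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def type_mismatches(item, schema):
    """Return the names of fields whose value does not match the schema type.

    Fields missing from the item, or declared with an unknown type,
    are skipped.
    """
    problems = []
    for field, spec in schema.get("properties", {}).items():
        if field not in item:
            continue
        declared = spec.get("type")
        expected = JSON_TYPES.get(declared)
        if expected is None:
            continue
        value = item[field]
        # bool is a subclass of int in Python, so reject it for "number".
        if declared == "number" and isinstance(value, bool):
            problems.append(field)
        elif not isinstance(value, expected):
            problems.append(field)
    return problems


event_schema = {
    "type": "object",
    "properties": {
        "event_name": {"type": "string"},
        "price": {"type": "number"},
    },
}
bad_fields = type_mismatches(
    {"event_name": "Tech Summit 2024", "price": "$299"}, event_schema
)
```

Here `bad_fields` names `price`, because the value is a string where the schema declares a number.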
