Web Scraping

Action ID: web_scraping

Description

Scrape content from a web page

Input Parameters

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| web_url | string | Yes | - | The URL of the web page to scrape |
| scraping_type | dropdown | No | Text | Type of content to extract. Available options: Html, Text |
| max_tokens | integer | No | 10000 | Maximum number of tokens to extract. Range: 1 to 32000 |
| tags_to_extract | array | No | ["span"] | List of HTML tags to extract content from (only used when scraping_type is Html) |

JSON Schema:
{
  "description": "Web scraping node input.",
  "properties": {
    "web_url": {
      "description": "Web URL to scrape.",
      "title": "Web URL",
      "type": "string"
    },
    "scraping_type": {
      "default": "Text",
      "description": "Type of scraping to perform.",
      "enum": [
        "Html",
        "Text"
      ],
      "title": "Scraping Type",
      "type": "string"
    },
    "max_tokens": {
      "default": 10000,
      "description": "The maximum number of tokens to scrape.",
      "maximum": 32000,
      "minimum": 1,
      "title": "Max Tokens",
      "type": "integer"
    },
    "tags_to_extract": {
      "default": [
        "span"
      ],
      "description": "List of tags to extract from the HTML.",
      "items": {
        "type": "string"
      },
      "title": "Tags to Extract",
      "type": "array"
    }
  },
  "required": [
    "web_url"
  ],
  "title": "WebScrapingNodeInput",
  "type": "object"
}
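
The constraints in the schema above can be checked before a workflow calls the node. The following is a minimal sketch in plain Python, not the platform's own validator; the function name `normalize_input` is hypothetical, and only the defaults, the enum values, and the 1 to 32000 range come from the schema.

```python
from typing import Any

ALLOWED_SCRAPING_TYPES = ("Html", "Text")


def normalize_input(params: dict[str, Any]) -> dict[str, Any]:
    """Apply the schema defaults and check the documented constraints.

    Hypothetical helper: it mirrors the JSON schema, not the node's code.
    """
    web_url = params.get("web_url")
    if not isinstance(web_url, str):
        raise ValueError("web_url is required and must be a string")

    scraping_type = params.get("scraping_type", "Text")
    if scraping_type not in ALLOWED_SCRAPING_TYPES:
        raise ValueError(f"scraping_type must be one of {ALLOWED_SCRAPING_TYPES}")

    max_tokens = params.get("max_tokens", 10000)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int):
        raise ValueError("max_tokens must be an integer")
    if not 1 <= max_tokens <= 32000:
        raise ValueError("max_tokens must be between 1 and 32000")

    tags = params.get("tags_to_extract", ["span"])
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        raise ValueError("tags_to_extract must be a list of strings")

    return {
        "web_url": web_url,
        "scraping_type": scraping_type,
        "max_tokens": max_tokens,
        "tags_to_extract": list(tags),
    }
```

Calling `normalize_input({"web_url": "https://example.com"})` fills in the three optional parameters with their schema defaults; an out-of-range `max_tokens` raises `ValueError`.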

Output Parameters

| Name | Type | Description |
| --- | --- | --- |
| scraped_content | string | The extracted content from the web page, either as plain text or HTML based on scraping_type |

JSON Schema:
{
  "description": "Web scraping node output.",
  "properties": {
    "scraped_content": {
      "description": "The extracted content from the web URLs.",
      "title": "Extracted Content",
      "type": "string"
    }
  },
  "required": [
    "scraped_content"
  ],
  "title": "WebScrapingNodeOutput",
  "type": "object"
}

How It Works

This node fetches the page at web_url and returns its content as a single string in scraped_content. In Text mode it strips the HTML tags and returns readable text. In Html mode it returns only the content of the tags listed in tags_to_extract; that parameter is ignored in Text mode. In both modes the output is capped at max_tokens (1 to 32000, default 10000), which prevents excessive data retrieval and keeps processing efficient.
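
To make the two modes concrete, here is a rough sketch of the extraction step using only the Python standard library. It is an illustration under stated assumptions, not the node's actual implementation: the function and class names are hypothetical, a token is approximated as one whitespace-separated word (the node's real tokenizer is not documented here), and the HTML is passed in as a string so that no network access is needed.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


class _TagExtractor(HTMLParser):
    """Collects '<tag>text</tag>' fragments for the requested tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.fragments = []
        self._stack = []  # open requested tags, each with a text buffer

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._stack.append((tag, []))

    def handle_data(self, data):
        # Text is credited to the innermost open requested tag.
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            name, buffer = self._stack.pop()
            text = "".join(buffer).strip()
            if text:
                self.fragments.append(f"<{name}>{text}</{name}>")


def truncate_tokens(text, max_tokens):
    # Assumption: one whitespace-separated word counts as one token.
    words = text.split()
    if len(words) <= max_tokens:
        return text
    # Truncated output is re-joined with single spaces.
    return " ".join(words[:max_tokens])


def extract(html, scraping_type="Text", max_tokens=10000, tags_to_extract=("span",)):
    """Return text or tag fragments from an HTML string, capped at max_tokens."""
    if scraping_type == "Text":
        parser = _TextExtractor()
    else:
        parser = _TagExtractor(tags_to_extract)
    parser.feed(html)
    parser.close()
    pieces = parser.parts if scraping_type == "Text" else parser.fragments
    return truncate_tokens("\n".join(pieces), max_tokens)
```

In this sketch, nested requested tags are simplified: text is attributed to the innermost open tag, and attributes are dropped from the output. The real node may behave differently on both points.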

Usage Examples

Example 1: Extract Article Text

Input:

web_url: "https://example.com/blog/article-about-ai"
scraping_type: "Text"
max_tokens: 5000
tags_to_extract: ["span"]

Output:

scraped_content: "Understanding Artificial Intelligence\n\nArtificial Intelligence has become one of the most transformative technologies of our time. From machine learning to natural language processing, AI is reshaping industries and creating new possibilities...\n\n[Article content continues for ~5000 tokens]"

Example 2: Extract Product Information from HTML

Input:

web_url: "https://shop.example.com/products/laptop-pro"
scraping_type: "Html"
max_tokens: 2000
tags_to_extract: ["h1", "h2", "p", "span", "div"]

Output:

scraped_content: "<h1>Laptop Pro 15</h1>\n<h2>Specifications</h2>\n<p>Processor: Intel Core i7</p>\n<span>RAM: 16GB</span>\n<div>Storage: 512GB SSD</div>\n<p>Display: 15.6\" 4K</p>\n<span>Price: $1,299.99</span>"
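
A downstream step often needs the individual fields rather than the raw fragment string. The sketch below, which assumes the "Label: value" layout shown in the output above, turns scraped_content into a dictionary using the standard library's html.parser. The helper name `parse_key_values` is hypothetical, and real product pages rarely follow such a regular pattern.

```python
from html.parser import HTMLParser


class _FragmentTexts(HTMLParser):
    """Collects the non-empty text of every element in the fragment string."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def parse_key_values(scraped_content):
    """Return {'Label': 'value'} for each text of the form 'Label: value'."""
    parser = _FragmentTexts()
    parser.feed(scraped_content)
    parser.close()

    fields = {}
    for text in parser.texts:
        # Split on the first colon only, so values may contain colons.
        key, separator, value = text.partition(":")
        if separator and value.strip():
            fields[key.strip()] = value.strip()
    return fields


fields = parse_key_values("<span>RAM: 16GB</span>\n<span>Price: $1,299.99</span>")
print(fields["Price"])
```

Texts without a colon, such as the h1 and h2 headings in the example, are skipped.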

Example 3: Scrape News Headlines

Input:

web_url: "https://news.example.com/technology"
scraping_type: "Text"
max_tokens: 15000
tags_to_extract: ["span"]

Output:

scraped_content: "Technology News\n\nBreaking: New AI Model Achieves Human-Level Performance\nPublished 2 hours ago\n\nTech Giants Announce Cloud Computing Partnership\nPublished 5 hours ago\n\nCybersecurity Trends for 2024\nPublished 1 day ago\n\n[Additional headlines and content...]"

Common Use Cases

  • Content Aggregation: Collect articles, blog posts, or news content from multiple websites for analysis or republishing

  • Price Monitoring: Extract product prices and specifications from e-commerce sites for competitive analysis

  • Data Collection: Gather structured data from web pages for market research or lead generation

  • News Monitoring: Track news articles and updates from various sources for media monitoring workflows

  • SEO Analysis: Extract meta tags, headings, and content structure for SEO auditing purposes

  • Content Migration: Pull content from legacy websites for migration to new platforms

  • Competitive Intelligence: Monitor competitor websites for product updates, pricing changes, or content strategies

Error Handling

| Error Type | Cause | Solution |
| --- | --- | --- |
| Invalid URL | URL is malformed or uses an unsupported protocol | Ensure the URL starts with http:// or https:// and is properly formatted |
| Page Not Found | The URL returns a 404 or doesn't exist | Verify the URL is correct and the page is accessible |
| Connection Timeout | Server didn't respond within the timeout period | Try again later or check if the website is operational |
| Access Denied | Website blocks scraping or requires authentication | Check robots.txt, use appropriate authentication, or request API access |
| Content Too Large | Page content exceeds the max_tokens limit | Increase the max_tokens value or scrape specific sections of the page |
| Invalid HTML Structure | HTML parsing fails due to malformed markup | Try "Text" scraping_type instead of "Html" for more forgiving extraction |
| Rate Limiting | Too many requests to the same domain | Add delays between requests or reduce scraping frequency |
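
Two of the fixes in the table above lend themselves to a short sketch: checking the URL before the call (Invalid URL) and retrying with growing delays (Connection Timeout, Rate Limiting). Everything named here is an assumption for illustration. `action` stands for whatever callable runs the node in your environment, and the exception types it raises are not documented on this page, so the built-in `TimeoutError` and `ConnectionError` are used as stand-ins.

```python
import time
from urllib.parse import urlparse


def is_valid_url(url):
    """True when the URL uses http or https and has a host part."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def call_with_backoff(action, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action(); on a timeout or connection error, wait and retry.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    After `retries` failed retries the last exception is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except (TimeoutError, ConnectionError):
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without real waiting.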

Notes

  • Robots.txt Compliance: Respect website robots.txt files and terms of service; avoid scraping sites that prohibit automated access.

  • Text vs Html: Use "Text" mode for clean readable content; use "Html" mode when you need to preserve structure or extract from specific tags.

  • Token Limits: The max_tokens parameter prevents excessive data retrieval; adjust based on your content needs and processing capabilities.

  • Tag Selection: When using Html mode, specify relevant tags like ["p", "h1", "h2", "article"] to extract meaningful content sections.

  • Dynamic Content: This node may not capture JavaScript-rendered content; it scrapes the initial HTML response from the server.

  • Legal Considerations: Ensure you have permission to scrape the target website; web scraping may have legal implications depending on jurisdiction and use case.

  • Performance: Large pages with high max_tokens values may take longer to process; optimize your extraction strategy for efficiency.

  • Character Encoding: The node handles various character encodings automatically, but some special characters may require additional processing.
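
The robots.txt note above can be turned into a pre-flight check with Python's standard urllib.robotparser module. This is a minimal sketch that runs outside the node; the helper name `is_allowed` is hypothetical, and the robots.txt content is passed in as a string so the example works offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/blog/post"))
print(is_allowed(rules, "https://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string. A passing check covers robots.txt only; the site's terms of service still apply.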
