# Web Scraping

**Action ID:** `web_scraping`

## Description

Scrape content from a web page

## Input Parameters

| Name              | Type     | Required | Default   | Description                                                                       |
| ----------------- | -------- | :------: | --------- | --------------------------------------------------------------------------------- |
| web\_url          | string   |     ✓    | -         | The URL of the web page to scrape                                                 |
| scraping\_type    | dropdown |     -    | Text      | Type of content to extract. Available options: Html, Text                         |
| max\_tokens       | integer  |     -    | 10000     | Maximum number of tokens to extract. Range: 1 to 32000                            |
| tags\_to\_extract | array    |     -    | \["span"] | List of HTML tags to extract content from (only used when scraping\_type is Html) |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Web scraping node input.",
  "properties": {
    "web_url": {
      "description": "Web URL to scrape.",
      "title": "Web URL",
      "type": "string"
    },
    "scraping_type": {
      "default": "Text",
      "description": "Type of scraping to perform.",
      "enum": [
        "Html",
        "Text"
      ],
      "title": "Scraping Type",
      "type": "string"
    },
    "max_tokens": {
      "default": 10000,
      "description": "The maximum number of tokens to scrape.",
      "maximum": 32000,
      "minimum": 1,
      "title": "Max Tokens",
      "type": "integer"
    },
    "tags_to_extract": {
      "default": [
        "span"
      ],
      "description": "List of tags to extract from the HTML.",
      "items": {
        "type": "string"
      },
      "title": "Tags to Extract",
      "type": "array"
    }
  },
  "required": [
    "web_url"
  ],
  "title": "WebScrapingNodeInput",
  "type": "object"
}
```

</details>

## Output Parameters

| Name             | Type   | Description                                                                                   |
| ---------------- | ------ | --------------------------------------------------------------------------------------------- |
| scraped\_content | string | The extracted content from the web page, either as plain text or HTML based on scraping\_type |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Web scraping node output.",
  "properties": {
    "scraped_content": {
      "description": "The extracted content from the web URLs.",
      "title": "Extracted Content",
      "type": "string"
    }
  },
  "required": [
    "scraped_content"
  ],
  "title": "WebScrapingNodeOutput",
  "type": "object"
}
```

</details>

## How It Works

This node fetches the page at web\_url and extracts either clean text or structured HTML, depending on scraping\_type. In Text mode, it strips HTML tags and returns readable text content. In Html mode, it returns the markup of the tags listed in tags\_to\_extract. Extraction is capped by the max\_tokens parameter to prevent excessive data retrieval and keep processing efficient.
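The node's actual implementation is not shown in this reference, but Text-mode extraction can be sketched with Python's standard library alone. This is a simplified stand-in, not the node's code: the whitespace-split word cap is only a rough approximation of real token counting.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script/style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def scrape_text(html: str, max_tokens: int = 10000) -> str:
    """Text-mode sketch: strip tags, then cap output at max_tokens words."""
    parser = TextExtractor()
    parser.feed(html)
    text = "\n".join(parser.chunks)
    words = text.split()  # crude stand-in for real tokenization
    return " ".join(words[:max_tokens])

page = "<html><head><style>p{}</style></head><body><h1>Title</h1><p>Body text.</p></body></html>"
print(scrape_text(page))  # → Title Body text.
```

In a real workflow the HTML would come from an HTTP fetch of web\_url; the sketch operates on a string so it stays self-contained.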

## Usage Examples

### Example 1: Extract Article Text

**Input:**

```
web_url: "https://example.com/blog/article-about-ai"
scraping_type: "Text"
max_tokens: 5000
tags_to_extract: ["span"]
```

**Output:**

```
scraped_content: "Understanding Artificial Intelligence\n\nArtificial Intelligence has become one of the most transformative technologies of our time. From machine learning to natural language processing, AI is reshaping industries and creating new possibilities...\n\n[Article content continues for ~5000 tokens]"
```

### Example 2: Extract Product Information from HTML

**Input:**

```
web_url: "https://shop.example.com/products/laptop-pro"
scraping_type: "Html"
max_tokens: 2000
tags_to_extract: ["h1", "h2", "p", "span", "div"]
```

**Output:**

```
scraped_content: "<h1>Laptop Pro 15</h1>\n<h2>Specifications</h2>\n<p>Processor: Intel Core i7</p>\n<span>RAM: 16GB</span>\n<div>Storage: 512GB SSD</div>\n<p>Display: 15.6\" 4K</p>\n<span>Price: $1,299.99</span>"
```

### Example 3: Scrape News Headlines

**Input:**

```
web_url: "https://news.example.com/technology"
scraping_type: "Text"
max_tokens: 15000
tags_to_extract: ["span"]
```

**Output:**

```
scraped_content: "Technology News\n\nBreaking: New AI Model Achieves Human-Level Performance\nPublished 2 hours ago\n\nTech Giants Announce Cloud Computing Partnership\nPublished 5 hours ago\n\nCybersecurity Trends for 2024\nPublished 1 day ago\n\n[Additional headlines and content...]"
```
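Html mode, as shown in Example 2, can be approximated with a flat tag collector. This is a simplified sketch, not the node's implementation: it rebuilds plain `<tag>text</tag>` fragments for the requested tags and ignores attributes and nested occurrences of wanted tags.

```python
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    """Rebuilds simple <tag>text</tag> fragments for a set of wanted tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.fragments = []
        self._current = None  # (tag, [text pieces]) while inside a wanted tag

    def handle_starttag(self, tag, attrs):
        if tag in self.tags and self._current is None:
            self._current = (tag, [])

    def handle_data(self, data):
        if self._current:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if self._current and tag == self._current[0]:
            t, parts = self._current
            self.fragments.append(f"<{t}>{''.join(parts)}</{t}>")
            self._current = None

def scrape_html(html, tags_to_extract=("span",)):
    parser = TagExtractor(tags_to_extract)
    parser.feed(html)
    return "\n".join(parser.fragments)

print(scrape_html("<h1>Laptop Pro 15</h1><p>RAM: 16GB</p><em>skip</em>", ["h1", "p"]))
# → <h1>Laptop Pro 15</h1>
#   <p>RAM: 16GB</p>
```

Tags not listed in tags\_to\_extract (like `<em>` above) are dropped, matching the Html-mode behavior described in the input parameters table.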

## Common Use Cases

* **Content Aggregation**: Collect articles, blog posts, or news content from multiple websites for analysis or republishing
* **Price Monitoring**: Extract product prices and specifications from e-commerce sites for competitive analysis
* **Data Collection**: Gather structured data from web pages for market research or lead generation
* **News Monitoring**: Track news articles and updates from various sources for media monitoring workflows
* **SEO Analysis**: Extract meta tags, headings, and content structure for SEO auditing purposes
* **Content Migration**: Pull content from legacy websites for migration to new platforms
* **Competitive Intelligence**: Monitor competitor websites for product updates, pricing changes, or content strategies

## Error Handling

| Error Type             | Cause                                              | Solution                                                                   |
| ---------------------- | -------------------------------------------------- | -------------------------------------------------------------------------- |
| Invalid URL            | URL is malformed or uses unsupported protocol      | Ensure the URL starts with http\:// or https\:// and is properly formatted |
| Page Not Found         | The URL returns a 404 or doesn't exist             | Verify the URL is correct and the page is accessible                       |
| Connection Timeout     | Server didn't respond within timeout period        | Try again later or check if the website is operational                     |
| Access Denied          | Website blocks scraping or requires authentication | Check robots.txt, use appropriate authentication, or request API access    |
| Content Too Large      | Page content exceeds max\_tokens limit             | Increase max\_tokens value or scrape specific sections of the page         |
| Invalid HTML Structure | HTML parsing fails due to malformed markup         | Try "Text" scraping\_type instead of "Html" for more forgiving extraction  |
| Rate Limiting          | Too many requests to the same domain               | Add delays between requests or reduce scraping frequency                   |
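Rate limiting and timeout errors from the table above are typically handled with retries and exponential backoff. A minimal sketch, where the `fetch` callable stands in for whatever HTTP client your workflow actually uses:

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Catching a narrower exception type (e.g. only timeouts and HTTP 429 responses) is usually preferable in production, so that permanent errors like 404s fail fast instead of being retried.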

## Notes

* **Robots.txt Compliance**: Respect website robots.txt files and terms of service; avoid scraping sites that prohibit automated access.
* **Text vs Html**: Use "Text" mode for clean readable content; use "Html" mode when you need to preserve structure or extract from specific tags.
* **Token Limits**: The max\_tokens parameter prevents excessive data retrieval; adjust based on your content needs and processing capabilities.
* **Tag Selection**: When using Html mode, specify relevant tags like \["p", "h1", "h2", "article"] to extract meaningful content sections.
* **Dynamic Content**: This node may not capture JavaScript-rendered content; it scrapes the initial HTML response from the server.
* **Legal Considerations**: Ensure you have permission to scrape the target website; web scraping may have legal implications depending on jurisdiction and use case.
* **Performance**: Large pages with high max\_tokens values may take longer to process; optimize your extraction strategy for efficiency.
* **Character Encoding**: The node handles various character encodings automatically, but some special characters may require additional processing.
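The robots.txt compliance mentioned above is not necessarily enforced by the node itself. Python's standard library offers a simple way to apply the rules yourself; this sketch parses the rules from a string so the example stays offline:

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if robots.txt rules permit agent to fetch url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "https://example.com/blog/post"))  # → True
print(allowed_by_robots(rules, "https://example.com/private/x"))  # → False
```

In practice you would first fetch `https://<domain>/robots.txt` and pass its body to this check before scraping any page on that domain.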


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.agenticflow.ai/reference/nodes/web_scraping.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
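Any HTTP client can issue this query; the only detail to get right is percent-encoding the question. A small sketch building the request URL with the standard library:

```python
from urllib.parse import urlencode

BASE = "https://docs.agenticflow.ai/reference/nodes/web_scraping.md"

def ask_url(question: str) -> str:
    """Build the documentation-query URL, percent-encoding the question."""
    return f"{BASE}?{urlencode({'ask': question})}"

print(ask_url("What happens when max_tokens is exceeded?"))
```

The resulting URL can then be fetched with a plain GET request; no headers or authentication are described for this endpoint.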
