# Web Scraping using Apify

**Action ID:** `web_scraping_apify`

## Description

Scrapes text content from up to 10 web pages using Apify's scraping infrastructure via the PixelML API, with a choice of headless-browser (Playwright) or static HTML-parsing (Cheerio/JSDOM) crawling engines.

## Connection

| Name               | Description                                 | Required | Category |
| ------------------ | ------------------------------------------- | -------- | -------- |
| PixelML Connection | The PixelML connection to call PixelML API. | True     | pixelml  |

## Input Parameters

| Name                  | Type    | Required | Default            | Description                                                 |
| --------------------- | ------- | :------: | ------------------ | ----------------------------------------------------------- |
| web\_urls             | array   |     ✓    | -                  | List of web URLs to scrape (1-10 URLs)                      |
| crawler\_type         | string  |     -    | playwright:firefox | Crawling engine: playwright:firefox, playwright:chrome, playwright:adaptive, cheerio, or jsdom |
| max\_tokens\_per\_url | integer |     -    | 32000              | Maximum tokens to scrape per URL (1-64000)                  |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Web scraping node input.",
  "properties": {
    "web_urls": {
      "description": "List of web URLs to scrape.",
      "items": {
        "type": "string"
      },
      "maxItems": 10,
      "minItems": 1,
      "title": "Web URLs",
      "type": "array"
    },
    "crawler_type": {
      "default": "playwright:firefox",
      "description": "Crawling engine to use. Default is playwright:firefox.",
      "enum": [
        "playwright:firefox",
        "playwright:chrome",
        "playwright:adaptive",
        "cheerio",
        "jsdom"
      ],
      "title": "Crawling Engine",
      "type": "string"
    },
    "max_tokens_per_url": {
      "default": 32000,
      "description": "Maximum number of tokens to scrape per URL. Default is 32000.",
      "maximum": 64000,
      "minimum": 1,
      "title": "Max Tokens per URL",
      "type": "integer"
    }
  },
  "required": [
    "web_urls"
  ],
  "title": "ApifyWebScrapingNodeInput",
  "type": "object"
}
```

</details>

## Output Parameters

| Name              | Type  | Description                                     |
| ----------------- | ----- | ----------------------------------------------- |
| scraped\_contents | array | List of extracted content strings from each URL |

<details>

<summary>View JSON Schema</summary>

```json
{
  "description": "Web scraping node output.",
  "properties": {
    "scraped_contents": {
      "description": "The extracted content from the web URLs.",
      "items": {
        "type": "string"
      },
      "title": "List of extracted content",
      "type": "array"
    }
  },
  "required": [
    "scraped_contents"
  ],
  "title": "ApifyWebScrapingNodeOutput",
  "type": "object"
}
```

</details>

## How It Works

This node uses Apify's web scraping infrastructure, accessed through the PixelML API, to extract content from web pages. You provide a list of URLs and select a crawling engine: the Playwright engines drive headless browsers (so JavaScript executes), while Cheerio and JSDOM parse static HTML directly. The node fetches each page, extracts its text content, and returns the results as an array of strings. The max\_tokens\_per\_url parameter caps how much content is extracted per URL, which helps control cost and processing time.
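The exact client call is not documented here, but the input schema's constraints can be enforced locally before submitting a request. A minimal sketch (the `build_scrape_input` helper and its validation rules are illustrative, not part of the PixelML SDK):

```python
from typing import List

ALLOWED_CRAWLERS = {
    "playwright:firefox", "playwright:chrome", "playwright:adaptive",
    "cheerio", "jsdom",
}

def build_scrape_input(web_urls: List[str],
                       crawler_type: str = "playwright:firefox",
                       max_tokens_per_url: int = 32000) -> dict:
    """Build an ApifyWebScrapingNodeInput payload, enforcing the schema limits."""
    if not 1 <= len(web_urls) <= 10:
        raise ValueError("web_urls must contain between 1 and 10 URLs")
    if not all(u.startswith(("http://", "https://")) for u in web_urls):
        raise ValueError("each URL must include http:// or https://")
    if crawler_type not in ALLOWED_CRAWLERS:
        raise ValueError(f"crawler_type must be one of {sorted(ALLOWED_CRAWLERS)}")
    if not 1 <= max_tokens_per_url <= 64000:
        raise ValueError("max_tokens_per_url must be between 1 and 64000")
    return {
        "web_urls": web_urls,
        "crawler_type": crawler_type,
        "max_tokens_per_url": max_tokens_per_url,
    }
```

Validating up front turns schema violations into immediate local errors instead of failed API calls.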

## Usage Examples

### Example 1: Scrape Blog Articles

**Input:**

```
web_urls: ["https://example.com/blog/article-1", "https://example.com/blog/article-2"]
crawler_type: "playwright:firefox"
max_tokens_per_url: 32000
```

**Output:**

```
scraped_contents: [
  "Article 1 title and content extracted here...",
  "Article 2 title and content extracted here..."
]
```

### Example 2: Fast Scraping with Cheerio

**Input:**

```
web_urls: ["https://news.example.com/latest"]
crawler_type: "cheerio"
max_tokens_per_url: 10000
```

**Output:**

```
scraped_contents: [
  "Latest news article content with text only..."
]
```

### Example 3: JavaScript-Heavy Site

**Input:**

```
web_urls: ["https://spa-app.example.com/data"]
crawler_type: "playwright:adaptive"
max_tokens_per_url: 20000
```

**Output:**

```
scraped_contents: [
  "Dynamically loaded content from JavaScript app..."
]
```

## Common Use Cases

* **Content Aggregation**: Collect articles, blog posts, or news from multiple sources
* **Competitive Analysis**: Extract product information, pricing, or features from competitor websites
* **Data Mining**: Gather data from public websites for research or analysis
* **SEO Analysis**: Extract meta tags, headings, and content for SEO auditing
* **Price Monitoring**: Track product prices across e-commerce sites
* **News Monitoring**: Collect latest news and updates from multiple sources
* **Research Automation**: Automatically gather information from academic or research websites

## Error Handling

| Error Type              | Cause                                    | Solution                                                                         |
| ----------------------- | ---------------------------------------- | -------------------------------------------------------------------------------- |
| Invalid URL             | URL is malformed or inaccessible         | Verify URLs are complete and publicly accessible (include http\:// or https\://) |
| Connection Timeout      | Website took too long to respond         | Increase timeout or try again later; check if website is online                  |
| Rate Limit Exceeded     | Too many scraping requests               | Space out requests or upgrade your PixelML/Apify plan                            |
| Bot Detection           | Website blocked the scraping attempt     | Try different crawler\_type or use "playwright:adaptive" for better bot evasion  |
| Content Too Large       | Scraped content exceeds max\_tokens      | Reduce max\_tokens\_per\_url or target specific pages with less content          |
| JavaScript Required     | Page requires JS but using cheerio/jsdom | Switch to playwright:firefox or playwright:chrome for JS execution               |
| Authentication Required | Website requires login                   | This node works with public pages only; use authenticated scraping methods       |
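
Several of these errors (Bot Detection, JavaScript Required) can be handled by retrying with a heavier crawler engine. A hedged sketch, where `run_scrape` stands in for whatever client function invokes this node and is assumed to raise `RuntimeError` on a failed scrape:

```python
# Escalate from the fastest engine to progressively more capable ones.
FALLBACK_ORDER = ["cheerio", "playwright:firefox", "playwright:adaptive"]

def scrape_with_fallback(web_urls, run_scrape):
    """Try each crawler engine in order until one succeeds."""
    last_error = None
    for engine in FALLBACK_ORDER:
        try:
            return run_scrape(web_urls=web_urls, crawler_type=engine)
        except RuntimeError as exc:  # e.g. "JavaScript Required", "Bot Detection"
            last_error = exc
    raise RuntimeError(f"all crawler engines failed: {last_error}")
```

Starting with Cheerio keeps static pages cheap, while JavaScript-heavy pages still succeed on a later attempt.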

## Notes

* **Crawler Selection**: Use Playwright for JavaScript-heavy sites, Cheerio/JSDOM for faster, static HTML parsing.
* **Token Limits**: Higher max\_tokens increases cost and processing time. Start with lower values and adjust as needed.
* **URL Limits**: Maximum 10 URLs per request. For batch scraping, use multiple node instances or loop structures.
* **Adaptive Mode**: playwright:adaptive intelligently switches between browsers for optimal scraping success.
* **Legal Compliance**: Ensure you have permission to scrape target websites and comply with their robots.txt and terms of service.
* **Performance**: Cheerio is fastest but doesn't execute JavaScript. Playwright is slower but handles dynamic content.
* **Cost Management**: Each scraping operation consumes API credits. Monitor usage especially with high max\_tokens values.
* **Content Format**: Scraped content is returned as plain text. HTML structure is typically stripped out.
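
Because of the 10-URL cap noted above, larger jobs must be split into batches before being sent to separate node invocations. A minimal sketch:

```python
def chunk_urls(urls, batch_size=10):
    """Split a URL list into batches that respect the 10-URL-per-request cap."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each resulting batch can then be passed as the `web_urls` input of one node instance or loop iteration.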


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.agenticflow.ai/reference/nodes/web_scraping_apify.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
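
Since the question is passed as a query parameter, it must be URL-encoded. A sketch of building the request URL (fetching it with an HTTP client is omitted):

```python
from urllib.parse import urlencode

DOC_URL = "https://docs.agenticflow.ai/reference/nodes/web_scraping_apify.md"

def build_ask_url(question: str) -> str:
    """URL-encode a natural-language question for the docs `ask` parameter."""
    return f"{DOC_URL}?{urlencode({'ask': question})}"
```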
