Web Scraping using Apify

Action ID: web_scraping_apify

Description

Web Scraping using Apify

Connection

| Name | Description | Required | Category |
| --- | --- | --- | --- |
| PixelML Connection | The PixelML connection used to call the PixelML API. | True | pixelml |

Input Parameters

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| web_urls | array | Yes | - | List of web URLs to scrape (1-10 URLs) |
| crawler_type | string | No | playwright:firefox | Crawling engine (playwright:firefox, playwright:chrome, playwright:adaptive, cheerio, jsdom) |
| max_tokens_per_url | integer | No | 32000 | Maximum tokens to scrape per URL (1-64000) |

JSON Schema:

```json
{
  "description": "Web scraping node input.",
  "properties": {
    "web_urls": {
      "description": "List of web URLs to scrape.",
      "items": {
        "type": "string"
      },
      "maxItems": 10,
      "minItems": 1,
      "title": "Web URLs",
      "type": "array"
    },
    "crawler_type": {
      "default": "playwright:firefox",
      "description": "Crawling engine to use. Default is playwright:firefox.",
      "enum": [
        "playwright:firefox",
        "playwright:chrome",
        "playwright:adaptive",
        "cheerio",
        "jsdom"
      ],
      "title": "Crawling Engine",
      "type": "string"
    },
    "max_tokens_per_url": {
      "default": 32000,
      "description": "Maximum number of tokens to scrape per URL. Default is 32000.",
      "maximum": 64000,
      "minimum": 1,
      "title": "Max Tokens per URL",
      "type": "integer"
    }
  },
  "required": [
    "web_urls"
  ],
  "title": "ApifyWebScrapingNodeInput",
  "type": "object"
}
```

Output Parameters

| Name | Type | Description |
| --- | --- | --- |
| scraped_contents | array | List of extracted content strings from each URL |

JSON Schema:

```json
{
  "description": "Web scraping node output.",
  "properties": {
    "scraped_contents": {
      "description": "The extracted content from the web URLs.",
      "items": {
        "type": "string"
      },
      "title": "List of extracted content",
      "type": "array"
    }
  },
  "required": [
    "scraped_contents"
  ],
  "title": "ApifyWebScrapingNodeOutput",
  "type": "object"
}
```

How It Works

This node uses Apify's web scraping infrastructure through the PixelML API to extract content from web pages. You provide the URLs and select a crawling engine: a headless browser (Playwright) for pages that render content with JavaScript, or a lightweight HTML parser (Cheerio/JSDOM) for static pages. The node fetches each page, extracts its text content, and returns it in a structured format. The max_tokens_per_url parameter caps the amount of content extracted per URL, which is useful for controlling cost and processing time.
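The schema constraints above (1-10 URLs, a fixed set of engines, a 1-64000 token range) can be checked client-side before a request is submitted. A minimal sketch; `validate_input` is an illustrative helper, not part of the PixelML API:

```python
# Illustrative client-side validation mirroring the node's input schema.
# validate_input is a hypothetical helper, not a PixelML function.

ALLOWED_CRAWLERS = {
    "playwright:firefox", "playwright:chrome", "playwright:adaptive",
    "cheerio", "jsdom",
}

def validate_input(web_urls, crawler_type="playwright:firefox",
                   max_tokens_per_url=32000):
    """Return a request payload, raising ValueError on schema violations."""
    if not 1 <= len(web_urls) <= 10:
        raise ValueError("web_urls must contain 1-10 URLs")
    if not all(isinstance(u, str) for u in web_urls):
        raise ValueError("each URL must be a string")
    if crawler_type not in ALLOWED_CRAWLERS:
        raise ValueError(f"unknown crawler_type: {crawler_type}")
    if not 1 <= max_tokens_per_url <= 64000:
        raise ValueError("max_tokens_per_url must be between 1 and 64000")
    return {
        "web_urls": web_urls,
        "crawler_type": crawler_type,
        "max_tokens_per_url": max_tokens_per_url,
    }
```

Catching these violations locally avoids spending a round trip (and API credits) on a request the server would reject anyway.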

Usage Examples

Example 1: Scrape Blog Articles

Input:

```
web_urls: ["https://example.com/blog/article-1", "https://example.com/blog/article-2"]
crawler_type: "playwright:firefox"
max_tokens_per_url: 32000
```

Output:

```
scraped_contents: [
  "Article 1 title and content extracted here...",
  "Article 2 title and content extracted here..."
]
```

Example 2: Fast Scraping with Cheerio

Input:

```
web_urls: ["https://news.example.com/latest"]
crawler_type: "cheerio"
max_tokens_per_url: 10000
```

Output:

```
scraped_contents: [
  "Latest news article content with text only..."
]
```

Example 3: JavaScript-Heavy Site

Input:

```
web_urls: ["https://spa-app.example.com/data"]
crawler_type: "playwright:adaptive"
max_tokens_per_url: 20000
```

Output:

```
scraped_contents: [
  "Dynamically loaded content from JavaScript app..."
]
```
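The three examples follow a simple selection rule: static HTML goes to Cheerio, JavaScript-rendered pages to Playwright, and uncertain cases to adaptive mode. A sketch of that heuristic; `pick_crawler` is a hypothetical helper, not part of the node:

```python
# Hypothetical helper encoding the crawler-selection guidance from the
# examples above: fast HTML parsing for static pages, a real browser for
# JavaScript-rendered ones, adaptive mode when the page type is unknown.

def pick_crawler(needs_javascript: bool, unsure: bool = False) -> str:
    if unsure:
        return "playwright:adaptive"   # decides strategy per page
    if needs_javascript:
        return "playwright:firefox"    # full browser, executes JS
    return "cheerio"                   # fastest, static HTML only
```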

Common Use Cases

  • Content Aggregation: Collect articles, blog posts, or news from multiple sources

  • Competitive Analysis: Extract product information, pricing, or features from competitor websites

  • Data Mining: Gather data from public websites for research or analysis

  • SEO Analysis: Extract meta tags, headings, and content for SEO auditing

  • Price Monitoring: Track product prices across e-commerce sites

  • News Monitoring: Collect latest news and updates from multiple sources

  • Research Automation: Automatically gather information from academic or research websites

Error Handling

| Error Type | Cause | Solution |
| --- | --- | --- |
| Invalid URL | URL is malformed or inaccessible | Verify URLs are complete (include http:// or https://) and publicly accessible |
| Connection Timeout | Website took too long to respond | Try again later; check that the website is online |
| Rate Limit Exceeded | Too many scraping requests | Space out requests or upgrade your PixelML/Apify plan |
| Bot Detection | Website blocked the scraping attempt | Try a different crawler_type, or use playwright:adaptive for better bot evasion |
| Content Too Large | Scraped content exceeds max_tokens_per_url | Reduce max_tokens_per_url or target specific pages with less content |
| JavaScript Required | Page requires JS but cheerio/jsdom was used | Switch to playwright:firefox or playwright:chrome for JS execution |
| Authentication Required | Website requires login | This node works with public pages only; pages behind a login need an authenticated scraping approach |
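For the transient failures above (timeouts, rate limits), retrying with exponential backoff is usually enough. A minimal sketch, assuming a zero-argument `scrape` callable that raises on failure; the wrapper and its error handling are illustrative, not part of the node:

```python
import time

# Illustrative retry wrapper for transient scraping failures such as
# timeouts or rate limits. `scrape` stands in for whatever call submits
# the request; it is an assumption, not a PixelML API.

def scrape_with_retry(scrape, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call scrape() with exponential backoff: 1s, 2s, 4s, ... between tries."""
    for attempt in range(max_attempts):
        try:
            return scrape()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
```

Permanent errors (invalid URL, authentication required) should not be retried; in practice you would inspect the error before backing off.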

Notes

  • Crawler Selection: Use Playwright for JavaScript-heavy sites, Cheerio/JSDOM for faster, static HTML parsing.

  • Token Limits: Higher max_tokens increases cost and processing time. Start with lower values and adjust as needed.

  • URL Limits: Maximum 10 URLs per request. For batch scraping, use multiple node instances or loop structures.

  • Adaptive Mode: playwright:adaptive decides per page whether full browser rendering is needed or plain HTTP crawling is enough, trading some speed for a higher success rate.

  • Legal Compliance: Ensure you have permission to scrape target websites and comply with their robots.txt and terms of service.

  • Performance: Cheerio is fastest but doesn't execute JavaScript. Playwright is slower but handles dynamic content.

  • Cost Management: Each scraping operation consumes API credits. Monitor usage especially with high max_tokens values.

  • Content Format: Scraped content is returned as plain text. HTML structure is typically stripped out.
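The 10-URL cap per request means larger jobs must be split into batches, as the URL Limits note above suggests. A minimal chunking sketch; `chunk_urls` is an illustrative helper:

```python
# Illustrative batching helper for scraping more than 10 URLs: split the
# list into request-sized chunks and submit one node call per chunk.

def chunk_urls(urls, batch_size=10):
    """Split a URL list into batches no larger than the node's 10-URL cap."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```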
