Web Scraping

Action ID: web_scraping

Description

Scrape content from a web page

Input Parameters

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| web_url | string | ✓ | | The URL of the web page to scrape |
| scraping_type | dropdown | | Text | Type of content to extract. Available options: Html, Text |
| max_tokens | integer | | 10000 | Maximum number of tokens to extract. Range: 1 to 32000 |
| tags_to_extract | array | | ["span"] | List of HTML tags to extract content from (only used when scraping_type is Html) |

View JSON Schema

{
  "description": "Web scraping node input.",
  "properties": {
    "web_url": {
      "description": "Web URL to scrape.",
      "title": "Web URL",
      "type": "string"
    },
    "scraping_type": {
      "default": "Text",
      "description": "Type of scraping to perform.",
      "enum": [
        "Html",
        "Text"
      ],
      "title": "Scraping Type",
      "type": "string"
    },
    "max_tokens": {
      "default": 10000,
      "description": "The maximum number of tokens to scrape.",
      "maximum": 32000,
      "minimum": 1,
      "title": "Max Tokens",
      "type": "integer"
    },
    "tags_to_extract": {
      "default": [
        "span"
      ],
      "description": "List of tags to extract from the HTML.",
      "items": {
        "type": "string"
      },
      "title": "Tags to Extract",
      "type": "array"
    }
  },
  "required": [
    "web_url"
  ],
  "title": "WebScrapingNodeInput",
  "type": "object"
}
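
The constraints in the schema above can be checked before a request is sent. The following is a minimal sketch, not the node's own validation code: the class name `WebScrapingInput` is hypothetical, and the http/https scheme check is an assumption borrowed from the Error Handling section rather than from the schema itself.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

ALLOWED_SCRAPING_TYPES = ("Html", "Text")  # enum values from the schema


@dataclass
class WebScrapingInput:
    """Hypothetical client-side mirror of the WebScrapingNodeInput schema."""

    web_url: str
    scraping_type: str = "Text"
    max_tokens: int = 10000
    tags_to_extract: list = field(default_factory=lambda: ["span"])

    def __post_init__(self):
        # Assumption: only http:// and https:// URLs are accepted.
        parsed = urlparse(self.web_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("web_url must be an http:// or https:// URL")
        if self.scraping_type not in ALLOWED_SCRAPING_TYPES:
            raise ValueError("scraping_type must be 'Html' or 'Text'")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.max_tokens, bool) or not isinstance(self.max_tokens, int):
            raise ValueError("max_tokens must be an integer")
        if not 1 <= self.max_tokens <= 32000:
            raise ValueError("max_tokens must be between 1 and 32000")
        if not all(isinstance(tag, str) for tag in self.tags_to_extract):
            raise ValueError("tags_to_extract must be a list of strings")
```

Constructing `WebScrapingInput(web_url="https://example.com")` fills in the documented defaults; an out-of-range `max_tokens` raises `ValueError` before any request is made.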

Output Parameters

| Name | Type | Description |
| --- | --- | --- |
| scraped_content | string | The extracted content from the web page, either as plain text or HTML based on scraping_type |

View JSON Schema

{
  "description": "Web scraping node output.",
  "properties": {
    "scraped_content": {
      "description": "The extracted content from the web URLs.",
      "title": "Extracted Content",
      "type": "string"
    }
  },
  "required": [
    "scraped_content"
  ],
  "title": "WebScrapingNodeOutput",
  "type": "object"
}

How It Works

This node fetches the page at web_url and extracts either plain text or HTML, depending on scraping_type. In Text mode, it strips the HTML tags and returns readable text. In Html mode, it returns content only from the tags listed in tags_to_extract. Extraction is capped at max_tokens to prevent excessive data retrieval and keep processing efficient.
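
The behavior described above can be approximated locally. This is an illustrative sketch under stated assumptions, not the node's implementation: the function and class names are hypothetical, tokens are approximated as whitespace-separated words (the node's actual tokenizer is not documented here), and nested selected tags are handled only in a simplified way.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


class _TagExtractor(HTMLParser):
    """Collects the text of selected tags, re-wrapped as <tag>text</tag>."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.results = []
        self._stack = []  # currently open selected tags: (name, text parts)

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._stack.append((tag, []))

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            name, parts = self._stack.pop()
            self.results.append(f"<{name}>{''.join(parts).strip()}</{name}>")


def truncate_tokens(text, max_tokens):
    # Assumption: one token is roughly one whitespace-separated word.
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens])


def extract(html, scraping_type="Text", max_tokens=10000, tags_to_extract=("span",)):
    """Extract text or selected tags from an HTML string, then cap the size."""
    if scraping_type == "Text":
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()
        content = "\n".join(parser.parts)
    elif scraping_type == "Html":
        parser = _TagExtractor(tags_to_extract)
        parser.feed(html)
        parser.close()
        content = "\n".join(parser.results)
    else:
        raise ValueError("scraping_type must be 'Html' or 'Text'")
    return truncate_tokens(content, max_tokens)
```

The sketch works on an HTML string rather than a live URL so that the two modes and the token cap can be compared without network access.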

Usage Examples

Example 1: Extract Article Text

Input:

web_url: "https://example.com/blog/article-about-ai"
scraping_type: "Text"
max_tokens: 5000
tags_to_extract: ["span"]

Output:

scraped_content: "Understanding Artificial Intelligence\n\nArtificial Intelligence has become one of the most transformative technologies of our time. From machine learning to natural language processing, AI is reshaping industries and creating new possibilities...\n\n[Article content continues for ~5000 tokens]"

Example 2: Extract Product Information from HTML

Input:

web_url: "https://shop.example.com/products/laptop-pro"
scraping_type: "Html"
max_tokens: 2000
tags_to_extract: ["h1", "h2", "p", "span", "div"]

Output:

scraped_content: "<h1>Laptop Pro 15</h1>\n<h2>Specifications</h2>\n<p>Processor: Intel Core i7</p>\n<span>RAM: 16GB</span>\n<div>Storage: 512GB SSD</div>\n<p>Display: 15.6\" 4K</p>\n<span>Price: $1,299.99</span>"

Example 3: Scrape News Headlines

Input:

web_url: "https://news.example.com/technology"
scraping_type: "Text"
max_tokens: 15000
tags_to_extract: ["span"]

Output:

scraped_content: "Technology News\n\nBreaking: New AI Model Achieves Human-Level Performance\nPublished 2 hours ago\n\nTech Giants Announce Cloud Computing Partnership\nPublished 5 hours ago\n\nCybersecurity Trends for 2024\nPublished 1 day ago\n\n[Additional headlines and content...]"

Common Use Cases

Content Aggregation: Collect articles, blog posts, or news content from multiple websites for analysis or republishing
Price Monitoring: Extract product prices and specifications from e-commerce sites for competitive analysis
Data Collection: Gather structured data from web pages for market research or lead generation
News Monitoring: Track news articles and updates from various sources for media monitoring workflows
SEO Analysis: Extract meta tags, headings, and content structure for SEO auditing purposes
Content Migration: Pull content from legacy websites for migration to new platforms
Competitive Intelligence: Monitor competitor websites for product updates, pricing changes, or content strategies

Error Handling

| Error Type | Cause | Solution |
| --- | --- | --- |
| Invalid URL | URL is malformed or uses unsupported protocol | Ensure the URL starts with http:// or https:// and is properly formatted |
| Page Not Found | The URL returns a 404 or doesn't exist | Verify the URL is correct and the page is accessible |
| Connection Timeout | Server didn't respond within timeout period | Try again later or check if the website is operational |
| Access Denied | Website blocks scraping or requires authentication | Check robots.txt, use appropriate authentication, or request API access |
| Content Too Large | Page content exceeds max_tokens limit | Increase max_tokens value or scrape specific sections of the page |
| Invalid HTML Structure | HTML parsing fails due to malformed markup | Try "Text" scraping_type instead of "Html" for more forgiving extraction |
| Rate Limiting | Too many requests to the same domain | Add delays between requests or reduce scraping frequency |
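
For the Connection Timeout and Rate Limiting cases, the suggested "try again later" and "add delays" advice can be expressed as a retry loop with exponential backoff. This is a sketch only: `TransientScrapeError` and `call_with_backoff` are hypothetical names, and how the node actually reports these errors to calling code is not specified in this document.

```python
import time


class TransientScrapeError(Exception):
    """Hypothetical error standing in for a timeout or rate-limit failure."""


def call_with_backoff(scrape, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call scrape(), retrying transient failures with doubling delays.

    `scrape` is any zero-argument callable; `sleep` is injectable so the
    delay schedule can be tested without actually waiting.
    """
    for attempt in range(retries + 1):
        try:
            return scrape()
        except TransientScrapeError:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Doubling the delay after each failure reduces pressure on a domain that is already rate limiting, which is the intent of the Rate Limiting row above.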

Notes

Robots.txt Compliance: Respect website robots.txt files and terms of service; avoid scraping sites that prohibit automated access.
Text vs Html: Use "Text" mode for clean readable content; use "Html" mode when you need to preserve structure or extract from specific tags.
Token Limits: The max_tokens parameter prevents excessive data retrieval; adjust based on your content needs and processing capabilities.
Tag Selection: When using Html mode, specify relevant tags like ["p", "h1", "h2", "article"] to extract meaningful content sections.
Dynamic Content: This node may not capture JavaScript-rendered content; it scrapes the initial HTML response from the server.
Legal Considerations: Ensure you have permission to scrape the target website; web scraping may have legal implications depending on jurisdiction and use case.
Performance: Large pages with high max_tokens values may take longer to process; optimize your extraction strategy for efficiency.
Character Encoding: The node handles various character encodings automatically, but some special characters may require additional processing.
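
The robots.txt note above can be checked before a URL is passed to the node. The sketch below uses Python's standard `urllib.robotparser`; the helper name `is_allowed` is hypothetical, and it takes the robots.txt body as a string, so fetching that file is left to the caller. Whether the node performs a check like this itself is not stated in this document.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all agents.
robots = "User-agent: *\nDisallow: /private/\n"
blog_ok = is_allowed(robots, "https://example.com/blog/post")
private_ok = is_allowed(robots, "https://example.com/private/page")
```

A robots.txt check covers only the site's crawl rules; terms of service and the legal considerations listed above still need separate review.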


Last updated 3 months ago
