Web Scraping
Action ID: web_scraping
Description
Scrape content from a web page
Input Parameters
| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| web_url | string | ✓ | - | The URL of the web page to scrape |
| scraping_type | dropdown | - | Text | Type of content to extract. Available options: Html, Text |
| max_tokens | integer | - | 10000 | Maximum number of tokens to extract. Range: 1 to 32000 |
| tags_to_extract | array | - | ["span"] | List of HTML tags to extract content from (only used when scraping_type is Html) |
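The parameter constraints above can be checked before a workflow runs. The following is a minimal sketch, not the node's actual validation logic: the `WebScrapingInput` class and its `validate` method are hypothetical names introduced for illustration, and only the defaults, the allowed `scraping_type` values, and the `max_tokens` range come from this page.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

# Allowed values as documented for scraping_type.
VALID_SCRAPING_TYPES = ("Html", "Text")


@dataclass
class WebScrapingInput:
    """Hypothetical container mirroring the documented input parameters."""
    web_url: str
    scraping_type: str = "Text"
    max_tokens: int = 10000
    tags_to_extract: list = field(default_factory=lambda: ["span"])

    def validate(self) -> list:
        """Return a list of human-readable problems; an empty list means valid."""
        problems = []
        parsed = urlparse(self.web_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("web_url must be an absolute http:// or https:// URL")
        if self.scraping_type not in VALID_SCRAPING_TYPES:
            problems.append("scraping_type must be 'Html' or 'Text'")
        if not 1 <= self.max_tokens <= 32000:
            problems.append("max_tokens must be between 1 and 32000")
        return problems
```

Calling `WebScrapingInput(web_url="https://example.com").validate()` returns an empty list, because every other field falls back to its documented default.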
Output Parameters
| Name | Type | Description |
| --- | --- | --- |
| scraped_content | string | The extracted content from the web page, either as plain text or HTML based on scraping_type |
How It Works
This node fetches the page at web_url and returns either clean text or selected HTML, depending on scraping_type. In Text mode, it removes HTML tags and returns readable text content. In Html mode, it extracts content only from the tags listed in tags_to_extract. Output is limited by the max_tokens parameter to prevent excessive data retrieval and keep processing efficient.
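To make the two modes concrete, here is a rough sketch of how such extraction could be written with Python's standard `html.parser` module. It is an illustration under stated assumptions, not the node's implementation: the `scrape` and `truncate_to_tokens` functions are hypothetical, the page is passed in as an already-fetched HTML string, a token is approximated as one whitespace-separated word, and nested markup inside a selected tag is flattened to plain text.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


class _TagExtractor(HTMLParser):
    """Collects the text of each outermost element whose tag is wanted."""

    def __init__(self, tags):
        super().__init__()
        self.wanted = set(tags)
        self.parts = []
        self._open = None   # tag currently being captured
        self._depth = 0     # nesting depth of that same tag name
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._open is None and tag in self.wanted:
            self._open, self._depth, self._buffer = tag, 1, []
        elif tag == self._open:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join(self._buffer)
            self.parts.append(f"<{tag}>{text}</{tag}>")
            self._open = None

    def handle_data(self, data):
        if self._open is not None and data.strip():
            self._buffer.append(data.strip())


def truncate_to_tokens(text, max_tokens):
    """Keep at most max_tokens whitespace-separated words (an approximation)."""
    matches = list(re.finditer(r"\S+", text))
    if len(matches) <= max_tokens:
        return text
    return text[: matches[max_tokens - 1].end()]


def scrape(html, scraping_type="Text", max_tokens=10000, tags_to_extract=("span",)):
    """Extract text or selected tags from an HTML string, then apply the limit."""
    if scraping_type == "Html":
        parser = _TagExtractor(tags_to_extract)
    else:
        parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return truncate_to_tokens("\n".join(parser.parts), max_tokens)
```

Because the sketch counts words rather than model tokens, a low `max_tokens` value in Html mode can cut the output in the middle of an element; a real tokenizer and element-aware truncation would behave differently.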
Usage Examples
Example 1: Extract Article Text
Input:
web_url: "https://example.com/blog/article-about-ai"
scraping_type: "Text"
max_tokens: 5000
tags_to_extract: ["span"]
Output:
scraped_content: "Understanding Artificial Intelligence\n\nArtificial Intelligence has become one of the most transformative technologies of our time. From machine learning to natural language processing, AI is reshaping industries and creating new possibilities...\n\n[Article content continues for ~5000 tokens]"
Example 2: Extract Product Information from HTML
Input:
web_url: "https://shop.example.com/products/laptop-pro"
scraping_type: "Html"
max_tokens: 2000
tags_to_extract: ["h1", "h2", "p", "span", "div"]
Output:
scraped_content: "<h1>Laptop Pro 15</h1>\n<h2>Specifications</h2>\n<p>Processor: Intel Core i7</p>\n<span>RAM: 16GB</span>\n<div>Storage: 512GB SSD</div>\n<p>Display: 15.6\" 4K</p>\n<span>Price: $1,299.99</span>"
Example 3: Scrape News Headlines
Input:
web_url: "https://news.example.com/technology"
scraping_type: "Text"
max_tokens: 15000
tags_to_extract: ["span"]
Output:
scraped_content: "Technology News\n\nBreaking: New AI Model Achieves Human-Level Performance\nPublished 2 hours ago\n\nTech Giants Announce Cloud Computing Partnership\nPublished 5 hours ago\n\nCybersecurity Trends for 2024\nPublished 1 day ago\n\n[Additional headlines and content...]"
Common Use Cases
Content Aggregation: Collect articles, blog posts, or news content from multiple websites for analysis or republishing
Price Monitoring: Extract product prices and specifications from e-commerce sites for competitive analysis
Data Collection: Gather structured data from web pages for market research or lead generation
News Monitoring: Track news articles and updates from various sources for media monitoring workflows
SEO Analysis: Extract meta tags, headings, and content structure for SEO auditing purposes
Content Migration: Pull content from legacy websites for migration to new platforms
Competitive Intelligence: Monitor competitor websites for product updates, pricing changes, or content strategies
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| Invalid URL | URL is malformed or uses an unsupported protocol | Ensure the URL starts with http:// or https:// and is properly formatted |
| Page Not Found | The URL returns a 404 or doesn't exist | Verify the URL is correct and the page is accessible |
| Connection Timeout | Server didn't respond within the timeout period | Try again later or check if the website is operational |
| Access Denied | Website blocks scraping or requires authentication | Check robots.txt, use appropriate authentication, or request API access |
| Content Too Large | Page content exceeds the max_tokens limit | Increase the max_tokens value or scrape specific sections of the page |
| Invalid HTML Structure | HTML parsing fails due to malformed markup | Try "Text" scraping_type instead of "Html" for more forgiving extraction |
| Rate Limiting | Too many requests to the same domain | Add delays between requests or reduce scraping frequency |
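A workflow that wraps this node might map low-level failures to the error names above and retry only the transient ones. The sketch below is one possible approach, not documented node behavior: `classify_failure` and `fetch_with_retry` are hypothetical helpers, the mapping of HTTP 401/403 to Access Denied and 429 to Rate Limiting is an assumption, and `fetch` stands for any caller-supplied function that returns a `(status, body)` pair.

```python
import time
from urllib.parse import urlparse

# Assumption: only these two documented errors are worth retrying.
RETRYABLE = {"Connection Timeout", "Rate Limiting"}


def classify_failure(url, status=None, timed_out=False):
    """Map a request outcome to one of the documented error names, or None."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return "Invalid URL"
    if timed_out:
        return "Connection Timeout"
    if status == 404:
        return "Page Not Found"
    if status in (401, 403):
        return "Access Denied"
    if status == 429:
        return "Rate Limiting"
    return None


def fetch_with_retry(url, fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body), retrying transient errors with growing delays."""
    error = classify_failure(url)
    if error is not None:
        raise RuntimeError(error)  # do not send a request for a malformed URL
    for attempt in range(attempts):
        try:
            status, body = fetch(url)
            error = classify_failure(url, status=status)
        except TimeoutError:
            body, error = None, classify_failure(url, timed_out=True)
        if error is None:
            return body
        if error not in RETRYABLE or attempt == attempts - 1:
            break
        sleep(base_delay * 2 ** attempt)  # delay doubles after each failed attempt
    raise RuntimeError(error)
```

Passing `sleep` as a parameter keeps the delay logic testable without actually waiting, and doubling the delay follows the "add delays between requests" advice for rate limiting.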
Notes
Robots.txt Compliance: Respect website robots.txt files and terms of service; avoid scraping sites that prohibit automated access.
Text vs Html: Use "Text" mode for clean readable content; use "Html" mode when you need to preserve structure or extract from specific tags.
Token Limits: The max_tokens parameter prevents excessive data retrieval; adjust based on your content needs and processing capabilities.
Tag Selection: When using Html mode, specify relevant tags like ["p", "h1", "h2", "article"] to extract meaningful content sections.
Dynamic Content: This node may not capture JavaScript-rendered content; it scrapes the initial HTML response from the server.
Legal Considerations: Ensure you have permission to scrape the target website; web scraping may have legal implications depending on jurisdiction and use case.
Performance: Large pages with high max_tokens values may take longer to process; optimize your extraction strategy for efficiency.
Character Encoding: The node handles various character encodings automatically, but some special characters may require additional processing.
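The robots.txt note can be checked programmatically before a URL is passed to the node. The following sketch uses Python's standard `urllib.robotparser`; the `is_allowed` helper is a hypothetical name, and it assumes the robots.txt body has already been downloaded as a string. Whether the node performs such a check itself is not stated on this page.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules: everything under /private/ is off limits to all agents.
rules = "User-agent: *\nDisallow: /private/\n"
blocked = is_allowed(rules, "https://example.com/private/page")
permitted = is_allowed(rules, "https://example.com/blog/post")
```

A robots.txt check covers only automated-access rules; a site's terms of service still need to be reviewed separately.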