Web Scraping using Apify
Plugin ID: web_scraping_apify
Description
The Web Scraping using Apify tool provides powerful web scraping capabilities through the Apify platform. Extract content from multiple websites simultaneously with advanced crawling engines, customizable depth and page limits, and flexible output formats including Markdown and HTML. Perfect for data collection, content analysis, and automated web research.
Cost Information
Cost: 4 credits
PixelML Cost: $0.01
Input Parameters
web_urls (array, required)
List of web URLs to scrape. Must contain 1-10 URLs.
Example: ["https://example.com", "https://news.site.com"]

crawler_type (string, optional, default "playwright:firefox")
Crawling engine to use. Options: playwright:firefox, playwright:chrome, playwright:adaptive, cheerio, jsdom.

max_crawl_pages (integer, optional, default 10)
Maximum number of pages to crawl per URL, including pagination and content pages. Prevents runaway crawling. Range: 1-100.

max_crawl_depth (integer, optional, default 20)
Maximum link depth to follow from the start URLs. Depth 0 = start URLs only, depth 1 = pages linked directly from them, and so on. Range: 1-50.

max_tokens_per_url (integer, optional, default 32000)
Maximum number of tokens to extract per URL. Controls content volume and processing time. Range: 1-64,000.

save_markdown (boolean, optional, default true)
Convert HTML content to Markdown format for easier processing and readability.

save_html_as_file (boolean, optional, default false)
Save the original HTML content as files for detailed analysis and preservation.
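For reference, a complete input payload built only from the parameters above might look like the following sketch (the dictionary form is illustrative; how the payload is submitted depends on your workflow):

```python
# Illustrative input for the web_scraping_apify plugin, using only the
# documented parameters (values show the defaults where one exists).
plugin_input = {
    "web_urls": ["https://example.com", "https://news.site.com"],  # 1-10 URLs, required
    "crawler_type": "playwright:firefox",  # or playwright:chrome, playwright:adaptive, cheerio, jsdom
    "max_crawl_pages": 10,        # 1-100
    "max_crawl_depth": 20,        # 1-50
    "max_tokens_per_url": 32000,  # 1-64,000
    "save_markdown": True,
    "save_html_as_file": False,
}
```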
Crawler Engine Comparison
playwright:firefox: Full JavaScript support for modern sites. Best for SPAs, dynamic content, and complex sites. Speed: slower.
playwright:chrome: Fast rendering with good compatibility. Best for general web scraping and most websites. Speed: medium.
playwright:adaptive: Intelligent engine selection. Best for mixed content types and optimal results. Speed: variable.
cheerio: Fast, lightweight, server-side parsing. Best for static HTML and simple sites. Speed: fastest.
jsdom: Node.js DOM implementation. Best for basic JavaScript and lightweight sites. Speed: fast.
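As a rough illustration of this trade-off, a helper like the one below could map a site's characteristics to an engine from the table; the function and its flags are hypothetical, not part of the plugin:

```python
def choose_crawler_type(needs_javascript: bool, prefer_speed: bool = False) -> str:
    """Pick a crawler engine based on the comparison above (illustrative only)."""
    if not needs_javascript:
        # Static HTML: cheerio is the fastest option; jsdom handles light JavaScript.
        return "cheerio"
    if prefer_speed:
        # JavaScript-heavy but speed matters: chrome renders quickly with good compatibility.
        return "playwright:chrome"
    # Default to firefox for SPAs and complex dynamic sites; use playwright:adaptive
    # when the URL list mixes static and dynamic content.
    return "playwright:firefox"
```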
Output
scraped_contents (array)
Array of extracted content strings from each successfully scraped URL, formatted according to the selected settings.
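Assuming the result is returned as an object with a scraped_contents field, downstream processing might look like this sketch:

```python
# Hypothetical result shape: one extracted string per successfully scraped URL.
result = {"scraped_contents": ["# Example Domain\n\nThis domain is for use in examples...", "..."]}

for i, content in enumerate(result["scraped_contents"], start=1):
    # Each entry is one URL's content, in Markdown or HTML depending on the save settings.
    print(f"--- Document {i}: {len(content)} characters ---")
    print(content[:200])  # preview the first 200 characters
```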
How It Works
URL Processing: The system validates and prepares the provided URLs for scraping
Engine Selection: The chosen crawler engine initializes based on your configuration
Crawling Execution: The crawler visits URLs, follows links up to the specified depth, and extracts content
Content Processing: Raw HTML is processed and converted to the requested format (Markdown/HTML)
Token Management: Content is truncated to stay within the specified token limits per URL
Result Compilation: All extracted content is compiled into a structured array output
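The plugin's internal pipeline is not exposed, but a roughly equivalent flow can be reproduced directly on Apify with the official Python client. The actor ID and input field names below are assumptions based on Apify's public Website Content Crawler, not a description of this plugin's internals:

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

# Assumed actor and input fields; adjust to whatever actor actually backs the plugin.
run = client.actor("apify/website-content-crawler").call(run_input={
    "startUrls": [{"url": "https://example.com/blog"}],
    "crawlerType": "playwright:firefox",
    "maxCrawlPages": 5,
    "maxCrawlDepth": 2,
    "saveMarkdown": True,
})

# Each dataset item corresponds to one crawled page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("markdown") or ""))
```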
Use Cases
Content Research: Gather information from multiple sources for analysis
Competitive Intelligence: Monitor competitor websites and industry trends
Data Collection: Extract structured data from e-commerce, news, or directory sites
SEO Analysis: Analyze content and structure across multiple websites
Market Research: Collect product information, reviews, and pricing data
News Monitoring: Track updates and articles from multiple news sources
Academic Research: Gather information from educational and research websites
Example Usage
Basic Website Scraping
Web URLs: ["https://example.com/blog", "https://news.example.com"]
Crawler Type: playwright:firefox
Max Crawl Pages: 5
Max Crawl Depth: 2
Save Markdown: true
Large-Scale Content Collection
Web URLs: ["https://docs.example.com", "https://help.example.com", "https://support.example.com"]
Crawler Type: playwright:adaptive
Max Crawl Pages: 50
Max Crawl Depth: 10
Max Tokens per URL: 64000
Save Markdown: true
Save HTML as File: true
Fast Static Site Scraping
Web URLs: ["https://static-site.com", "https://simple-blog.com"]
Crawler Type: cheerio
Max Crawl Pages: 20
Max Crawl Depth: 5
Max Tokens per URL: 16000
Save Markdown: true
Configuration Guidelines
Crawler Type Selection
Use Playwright engines for:
Single-page applications (SPAs)
Sites with heavy JavaScript
Dynamic content loading
Complex interactions required
Use Cheerio/jsdom for:
Static HTML sites
Simple blog content
Fast bulk scraping
Minimal JavaScript requirements
Depth and Page Limits
Low depth (1-3): Focus on main content, faster execution
Medium depth (4-10): Comprehensive site coverage
High depth (11-50): Deep crawling, extensive data collection
Page limits: Balance between coverage and performance
1-10 pages: Quick sampling
11-50 pages: Standard coverage
51-100 pages: Comprehensive scraping
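One way to apply these tiers in practice is to keep them as named presets and merge per-request overrides on top; the preset names and helper below are hypothetical, not part of the plugin:

```python
# Hypothetical presets reflecting the guideline tiers above.
CRAWL_PRESETS = {
    "quick_sample":  {"max_crawl_pages": 5,   "max_crawl_depth": 2},
    "standard":      {"max_crawl_pages": 30,  "max_crawl_depth": 6},
    "comprehensive": {"max_crawl_pages": 100, "max_crawl_depth": 15},
}

def build_input(web_urls, preset="standard", **overrides):
    """Merge a preset with per-request overrides into a plugin input dict."""
    params = {"web_urls": web_urls, "save_markdown": True, **CRAWL_PRESETS[preset]}
    params.update(overrides)
    return params

# Example: deep documentation crawl with a larger token budget.
docs_input = build_input(["https://docs.example.com"], preset="comprehensive",
                         max_tokens_per_url=64000)
```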
Tips for Best Results
Choose the Right Engine: Select crawler type based on website complexity and JavaScript requirements
Set Appropriate Limits: Balance depth/pages with performance needs and content requirements
Use Markdown Format: Enable Markdown saving for easier content processing and analysis
Monitor Token Usage: Adjust max_tokens_per_url based on content density and processing needs
Test with Small Batches: Start with fewer URLs and smaller limits to optimize settings
Respect Rate Limits: Be mindful of target websites' rate limiting and terms of service
Limitations
URL Limit: Maximum of 10 URLs per request
Page Limits: Up to 100 pages per crawling session
Depth Restrictions: Maximum crawl depth of 50 levels
Token Constraints: Maximum 64,000 tokens per URL
Content Dependencies: Some dynamic content may require specific crawler engines
Rate Limiting: Subject to Apify platform rate limits and target site restrictions
JavaScript Execution: Complex JavaScript may require Playwright engines for full functionality
Privacy & Compliance
Only scrapes publicly accessible web content
Respects robots.txt files and crawling directives when possible
Complies with Apify platform terms of service and usage policies
Users should ensure they have appropriate permissions for target websites
Scraped content should be used in accordance with website terms of service
Does not store scraped content beyond the immediate processing session
Respects copyright and intellectual property rights of scraped content
Error Handling
Common error scenarios:
Invalid URLs: Returns error for malformed or inaccessible URLs
URL Limit Exceeded: Returns error if more than 10 URLs are provided
Parameter Out of Range: Returns error for values outside specified min/max ranges
Crawling Failed: Returns error if websites cannot be accessed or crawled
Content Too Large: Returns error if content exceeds token limits significantly
Engine Unavailable: Returns error if selected crawler engine is not available
Rate Limited: Temporary restriction if Apify platform limits are exceeded
Timeout: Returns error if crawling takes too long to complete
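Several of these errors can be caught locally before a request is submitted. Below is a validation sketch based on the documented limits (the helper name is hypothetical):

```python
def validate_input(params: dict) -> list[str]:
    """Return a list of problems based on the documented limits; an empty list means OK."""
    errors = []
    urls = params.get("web_urls") or []
    if not 1 <= len(urls) <= 10:
        errors.append("web_urls must contain between 1 and 10 URLs")
    if not all(u.startswith(("http://", "https://")) for u in urls):
        errors.append("every URL must start with http:// or https://")
    for name, lo, hi in [("max_crawl_pages", 1, 100),
                         ("max_crawl_depth", 1, 50),
                         ("max_tokens_per_url", 1, 64000)]:
        value = params.get(name)
        if value is not None and not lo <= value <= hi:
            errors.append(f"{name} must be between {lo} and {hi}")
    return errors
```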