Web Scraping using Apify

Plugin ID: web_scraping_apify

Description

The Web Scraping using Apify tool provides powerful web scraping capabilities through the Apify platform. Extract content from multiple websites simultaneously with advanced crawling engines, customizable depth and page limits, and flexible output formats including Markdown and HTML. Perfect for data collection, content analysis, and automated web research.

Cost Information

  • Cost: 4 credits

  • PixelML Cost: $0.01

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| web_urls | array | Yes | - | List of web URLs to scrape. Must contain 1-10 URLs. Example: ["https://example.com", "https://news.site.com"] |
| crawler_type | string | No | "playwright:firefox" | Crawling engine to use. Options: playwright:firefox, playwright:chrome, playwright:adaptive, cheerio, jsdom |
| max_crawl_pages | integer | No | 10 | Maximum number of pages to crawl per URL, including pagination and content pages. Prevents runaway crawling. Range: 1-100 |
| max_crawl_depth | integer | No | 20 | Maximum link depth to follow from the start URLs. Depth 0 = start URLs only, depth 1 = direct links, etc. Range: 1-50 |
| max_tokens_per_url | integer | No | 32000 | Maximum number of tokens to extract per URL. Controls content volume and processing time. Range: 1-64,000 |
| save_markdown | boolean | No | true | Convert HTML content to Markdown format for easier processing and readability. |
| save_html_as_file | boolean | No | false | Save the original HTML content as files for detailed analysis and preservation. |

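The sketch below shows one way these parameters could be assembled into a request payload. It assumes the plugin accepts a JSON object keyed by the parameter names above; how that object is submitted depends on the platform hosting the plugin, so no invocation call is shown.

```python
import json

# A minimal payload sketch, assuming the plugin accepts a JSON object keyed by
# the parameter names documented above. Only web_urls is required; the other
# values shown here are the documented defaults.
payload = {
    "web_urls": ["https://example.com/blog", "https://news.example.com"],  # 1-10 URLs
    "crawler_type": "playwright:firefox",
    "max_crawl_pages": 10,        # 1-100
    "max_crawl_depth": 20,        # 1-50
    "max_tokens_per_url": 32000,  # 1-64,000
    "save_markdown": True,
    "save_html_as_file": False,
}

print(json.dumps(payload, indent=2))
```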
Crawler Engine Comparison

| Engine | Strengths | Best For | Performance |
| --- | --- | --- | --- |
| playwright:firefox | Full JavaScript support, modern sites | SPAs, dynamic content, complex sites | Slower |
| playwright:chrome | Fast rendering, good compatibility | General web scraping, most websites | Medium |
| playwright:adaptive | Intelligent engine selection | Mixed content types, optimal results | Variable |
| cheerio | Fast, lightweight, server-side parsing | Static HTML, simple sites | Fastest |
| jsdom | Node.js DOM implementation | Basic JavaScript, lightweight sites | Fast |

Output

| Field | Type | Description |
| --- | --- | --- |
| scraped_contents | array | Array of extracted content strings from each successfully scraped URL, formatted according to settings |

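As a rough illustration, the snippet below consumes a response of this shape. The `result` value is a stand-in for an actual tool response, assuming scraped_contents is an array of strings with one entry per successfully scraped URL.

```python
# Illustrative only: `result` stands in for a real tool response. It assumes
# scraped_contents is an array of strings, one per successfully scraped URL,
# in Markdown when save_markdown is true.
result = {
    "scraped_contents": [
        "# Example Blog\n\nFirst post...",
        "# News Example\n\nTop story...",
    ]
}

for i, content in enumerate(result["scraped_contents"], start=1):
    print(f"URL {i}: {len(content)} characters extracted")
```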
How It Works

  1. URL Processing: The system validates and prepares the provided URLs for scraping

  2. Engine Selection: The chosen crawler engine initializes based on your configuration

  3. Crawling Execution: The crawler visits URLs, follows links up to the specified depth, and extracts content

  4. Content Processing: Raw HTML is processed and converted to the requested format (Markdown/HTML)

  5. Token Management: Content is truncated to stay within the specified token limits per URL

  6. Result Compilation: All extracted content is compiled into a structured array output

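To make the steps above concrete, here is a greatly simplified, self-contained stand-in that mirrors the same flow using only the Python standard library. It does not use Apify, does not follow links or choose between crawler engines, and the 4-characters-per-token estimate is an assumption made purely for illustration.

```python
import urllib.request

def scrape(web_urls, max_tokens_per_url=32000):
    # 1. URL processing: validate the input list
    if not 1 <= len(web_urls) <= 10:
        raise ValueError("web_urls must contain 1-10 URLs")
    scraped_contents = []
    for url in web_urls:
        # 2-3. Fetch the page (the real tool picks an engine and follows links)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # 4. Content processing: the real tool can convert HTML to Markdown here
        text = html
        # 5. Token management: rough cap, assuming ~4 characters per token
        scraped_contents.append(text[: max_tokens_per_url * 4])
    # 6. Result compilation
    return {"scraped_contents": scraped_contents}

print(len(scrape(["https://example.com"])["scraped_contents"]))
```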
Use Cases

  • Content Research: Gather information from multiple sources for analysis

  • Competitive Intelligence: Monitor competitor websites and industry trends

  • Data Collection: Extract structured data from e-commerce, news, or directory sites

  • SEO Analysis: Analyze content and structure across multiple websites

  • Market Research: Collect product information, reviews, and pricing data

  • News Monitoring: Track updates and articles from multiple news sources

  • Academic Research: Gather information from educational and research websites

Example Usage

Basic Website Scraping

Web URLs: ["https://example.com/blog", "https://news.example.com"]
Crawler Type: playwright:firefox
Max Crawl Pages: 5
Max Crawl Depth: 2
Save Markdown: true

Large-Scale Content Collection

Web URLs: ["https://docs.example.com", "https://help.example.com", "https://support.example.com"]
Crawler Type: playwright:adaptive
Max Crawl Pages: 50
Max Crawl Depth: 10
Max Tokens per URL: 64000
Save Markdown: true
Save HTML as File: true

Fast Static Site Scraping

Web URLs: ["https://static-site.com", "https://simple-blog.com"]
Crawler Type: cheerio
Max Crawl Pages: 20
Max Crawl Depth: 5
Max Tokens per URL: 16000
Save Markdown: true

Configuration Guidelines

Crawler Type Selection

  • Use Playwright engines for:

    • Single-page applications (SPAs)

    • Sites with heavy JavaScript

    • Dynamic content loading

    • Complex interactions required

  • Use Cheerio/jsdom for:

    • Static HTML sites

    • Simple blog content

    • Fast bulk scraping

    • Minimal JavaScript requirements

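The helper below is a toy heuristic that encodes the guidance above. It is not part of the tool; the needs_js and mixed_sites flags are things you would determine yourself, for example by checking whether a page renders without JavaScript.

```python
def pick_crawler_type(needs_js: bool, mixed_sites: bool = False) -> str:
    # Toy heuristic mirroring the guidance above; not part of the tool itself.
    if mixed_sites:
        return "playwright:adaptive"   # let the engine adapt per page
    if needs_js:
        return "playwright:firefox"    # SPAs, dynamic content, complex sites
    return "cheerio"                   # static HTML, fastest option

print(pick_crawler_type(needs_js=False))  # -> cheerio
```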
Depth and Page Limits

  • Low depth (1-3): Focus on main content, faster execution

  • Medium depth (4-10): Comprehensive site coverage

  • High depth (11-50): Deep crawling, extensive data collection

  • Page limits: Balance between coverage and performance

    • 1-10 pages: Quick sampling

    • 11-50 pages: Standard coverage

    • 51-100 pages: Comprehensive scraping

Tips for Best Results

  1. Choose the Right Engine: Select crawler type based on website complexity and JavaScript requirements

  2. Set Appropriate Limits: Balance depth/pages with performance needs and content requirements

  3. Use Markdown Format: Enable Markdown saving for easier content processing and analysis

  4. Monitor Token Usage: Adjust max_tokens_per_url based on content density and processing needs

  5. Test with Small Batches: Start with fewer URLs and smaller limits to optimize settings

  6. Respect Rate Limits: Be mindful of target websites' rate limiting and terms of service

Limitations

  • URL Limit: Maximum of 10 URLs per request

  • Page Limits: Up to 100 pages per crawling session

  • Depth Restrictions: Maximum crawl depth of 50 levels

  • Token Constraints: Maximum 64,000 tokens per URL

  • Content Dependencies: Some dynamic content may require specific crawler engines

  • Rate Limiting: Subject to Apify platform rate limits and target site restrictions

  • JavaScript Execution: Complex JavaScript may require Playwright engines for full functionality

Privacy & Compliance

  • Only scrapes publicly accessible web content

  • Respects robots.txt files and crawling directives when possible

  • Complies with Apify platform terms of service and usage policies

  • Users should ensure they have appropriate permissions for target websites

  • Scraped content should be used in accordance with website terms of service

  • Does not store scraped content beyond the immediate processing session

  • Respects copyright and intellectual property rights of scraped content

Error Handling

Common error scenarios:

  • Invalid URLs: Returns error for malformed or inaccessible URLs

  • URL Limit Exceeded: Returns error if more than 10 URLs are provided

  • Parameter Out of Range: Returns error for values outside specified min/max ranges

  • Crawling Failed: Returns error if websites cannot be accessed or crawled

  • Content Too Large: Returns error if content exceeds token limits significantly

  • Engine Unavailable: Returns error if selected crawler engine is not available

  • Rate Limited: Temporary restriction if Apify platform limits are exceeded

  • Timeout: Returns error if crawling takes too long to complete

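Many of these errors can be caught before a request is made. The helper below is a hypothetical pre-flight check, not part of the tool, that mirrors the documented URL limit and parameter ranges.

```python
from urllib.parse import urlparse

def validate_input(payload: dict) -> None:
    # Hypothetical pre-flight check mirroring the documented limits.
    urls = payload.get("web_urls", [])
    if not 1 <= len(urls) <= 10:
        raise ValueError("web_urls must contain between 1 and 10 URLs")
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Malformed URL: {url}")
    ranges = {
        "max_crawl_pages": (1, 100),
        "max_crawl_depth": (1, 50),
        "max_tokens_per_url": (1, 64000),
    }
    for name, (low, high) in ranges.items():
        value = payload.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

validate_input({"web_urls": ["https://example.com"], "max_crawl_pages": 5})
```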