Web Scraping using Apify

Plugin ID: web_scraping_apify

Description

The Web Scraping using Apify tool provides powerful web scraping capabilities through the Apify platform. Extract content from multiple websites simultaneously with advanced crawling engines, customizable depth and page limits, and flexible output formats including Markdown and HTML. Perfect for data collection, content analysis, and automated web research.

Cost Information

  • Cost: 4 credits

  • PixelML Cost: $0.01

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| web_urls | array | Yes | - | List of web URLs to scrape. Must contain 1-10 URLs. Example: ["https://example.com", "https://news.site.com"] |
| crawler_type | string | No | "playwright:firefox" | Crawling engine to use. Options: playwright:firefox, playwright:chrome, playwright:adaptive, cheerio, jsdom |
| max_crawl_pages | integer | No | 10 | Maximum number of pages to crawl per URL, including pagination and content pages. Prevents runaway crawling. Range: 1-100 |
| max_crawl_depth | integer | No | 20 | Maximum link depth to follow from the start URLs. Depth 0 = start URLs only, depth 1 = direct links, etc. Range: 1-50 |
| max_tokens_per_url | integer | No | 32000 | Maximum number of tokens to extract per URL. Controls content volume and processing time. Range: 1-64,000 |
| save_markdown | boolean | No | true | Convert HTML content to Markdown format for easier processing and readability. |
| save_html_as_file | boolean | No | false | Save the original HTML content as files for detailed analysis and preservation. |

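The sketch below shows one way these parameters could be assembled into a request payload. It assumes the plugin accepts a JSON object keyed by the parameter names above; how that object is submitted depends on the platform hosting the plugin, so no invocation call is shown.

```python
import json

# A minimal payload sketch, assuming the plugin accepts a JSON object keyed by
# the parameter names documented above. Only web_urls is required; the other
# values shown here are the documented defaults.
payload = {
    "web_urls": ["https://example.com/blog", "https://news.example.com"],  # 1-10 URLs
    "crawler_type": "playwright:firefox",
    "max_crawl_pages": 10,        # 1-100
    "max_crawl_depth": 20,        # 1-50
    "max_tokens_per_url": 32000,  # 1-64,000
    "save_markdown": True,
    "save_html_as_file": False,
}

print(json.dumps(payload, indent=2))
```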
Crawler Engine Comparison

| Engine | Strengths | Best For | Performance |
| --- | --- | --- | --- |
| playwright:firefox | Full JavaScript support, modern sites | SPAs, dynamic content, complex sites | Slower |
| playwright:chrome | Fast rendering, good compatibility | General web scraping, most websites | Medium |
| playwright:adaptive | Intelligent engine selection | Mixed content types, optimal results | Variable |
| cheerio | Fast, lightweight, server-side parsing | Static HTML, simple sites | Fastest |
| jsdom | Node.js DOM implementation | Basic JavaScript, lightweight sites | Fast |

Output

| Field | Type | Description |
| --- | --- | --- |
| scraped_contents | array | Array of extracted content strings from each successfully scraped URL, formatted according to settings |

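As a rough illustration, the snippet below consumes a response of this shape. The `result` value is a stand-in for an actual tool response, assuming scraped_contents is an array of strings with one entry per successfully scraped URL.

```python
# Illustrative only: `result` stands in for a real tool response. It assumes
# scraped_contents is an array of strings, one per successfully scraped URL,
# in Markdown when save_markdown is true.
result = {
    "scraped_contents": [
        "# Example Blog\n\nFirst post...",
        "# News Example\n\nTop story...",
    ]
}

for i, content in enumerate(result["scraped_contents"], start=1):
    print(f"URL {i}: {len(content)} characters extracted")
```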
How It Works

  1. URL Processing: The system validates and prepares the provided URLs for scraping

  2. Engine Selection: The chosen crawler engine initializes based on your configuration

  3. Crawling Execution: The crawler visits URLs, follows links up to the specified depth, and extracts content

  4. Content Processing: Raw HTML is processed and converted to the requested format (Markdown/HTML)

  5. Token Management: Content is truncated to stay within the specified token limits per URL

  6. Result Compilation: All extracted content is compiled into a structured array output

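To make the steps above concrete, here is a greatly simplified, self-contained stand-in that mirrors the same flow using only the Python standard library. It does not use Apify, does not follow links or choose between crawler engines, and the 4-characters-per-token estimate is an assumption made purely for illustration.

```python
import urllib.request

def scrape(web_urls, max_tokens_per_url=32000):
    # 1. URL processing: validate the input list
    if not 1 <= len(web_urls) <= 10:
        raise ValueError("web_urls must contain 1-10 URLs")
    scraped_contents = []
    for url in web_urls:
        # 2-3. Fetch the page (the real tool picks an engine and follows links)
        with urllib.request.urlopen(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # 4. Content processing: the real tool can convert HTML to Markdown here
        text = html
        # 5. Token management: rough cap, assuming ~4 characters per token
        scraped_contents.append(text[: max_tokens_per_url * 4])
    # 6. Result compilation
    return {"scraped_contents": scraped_contents}

print(len(scrape(["https://example.com"])["scraped_contents"]))
```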
Use Cases

  • Content Research: Gather information from multiple sources for analysis

  • Competitive Intelligence: Monitor competitor websites and industry trends

  • Data Collection: Extract structured data from e-commerce, news, or directory sites

  • SEO Analysis: Analyze content and structure across multiple websites

  • Market Research: Collect product information, reviews, and pricing data

  • News Monitoring: Track updates and articles from multiple news sources

  • Academic Research: Gather information from educational and research websites

Example Usage

Basic Website Scraping

Web URLs: ["https://example.com/blog", "https://news.example.com"]
Crawler Type: playwright:firefox
Max Crawl Pages: 5
Max Crawl Depth: 2
Save Markdown: true

Large-Scale Content Collection

Web URLs: ["https://docs.example.com", "https://help.example.com", "https://support.example.com"]
Crawler Type: playwright:adaptive
Max Crawl Pages: 50
Max Crawl Depth: 10
Max Tokens per URL: 64000
Save Markdown: true
Save HTML as File: true

Fast Static Site Scraping

Web URLs: ["https://static-site.com", "https://simple-blog.com"]
Crawler Type: cheerio
Max Crawl Pages: 20
Max Crawl Depth: 5
Max Tokens per URL: 16000
Save Markdown: true

Configuration Guidelines

Crawler Type Selection

  • Use Playwright engines for:

    • Single-page applications (SPAs)

    • Sites with heavy JavaScript

    • Dynamic content loading

    • Complex interactions required

  • Use Cheerio/jsdom for:

    • Static HTML sites

    • Simple blog content

    • Fast bulk scraping

    • Minimal JavaScript requirements

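The helper below is a toy heuristic that encodes the guidance above. It is not part of the tool; the needs_js and mixed_sites flags are things you would determine yourself, for example by checking whether a page renders without JavaScript.

```python
def pick_crawler_type(needs_js: bool, mixed_sites: bool = False) -> str:
    # Toy heuristic mirroring the guidance above; not part of the tool itself.
    if mixed_sites:
        return "playwright:adaptive"   # let the engine adapt per page
    if needs_js:
        return "playwright:firefox"    # SPAs, dynamic content, complex sites
    return "cheerio"                   # static HTML, fastest option

print(pick_crawler_type(needs_js=False))  # -> cheerio
```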
Depth and Page Limits

  • Low depth (1-3): Focus on main content, faster execution

  • Medium depth (4-10): Comprehensive site coverage

  • High depth (11-50): Deep crawling, extensive data collection

  • Page limits: Balance between coverage and performance

    • 1-10 pages: Quick sampling

    • 11-50 pages: Standard coverage

    • 51-100 pages: Comprehensive scraping

Tips for Best Results

  1. Choose the Right Engine: Select crawler type based on website complexity and JavaScript requirements

  2. Set Appropriate Limits: Balance depth/pages with performance needs and content requirements

  3. Use Markdown Format: Enable Markdown saving for easier content processing and analysis

  4. Monitor Token Usage: Adjust max_tokens_per_url based on content density and processing needs

  5. Test with Small Batches: Start with fewer URLs and smaller limits to optimize settings

  6. Respect Rate Limits: Be mindful of target websites' rate limiting and terms of service

Limitations

  • URL Limit: Maximum of 10 URLs per request

  • Page Limits: Up to 100 pages per crawling session

  • Depth Restrictions: Maximum crawl depth of 50 levels

  • Token Constraints: Maximum 64,000 tokens per URL

  • Content Dependencies: Some dynamic content may require specific crawler engines

  • Rate Limiting: Subject to Apify platform rate limits and target site restrictions

  • JavaScript Execution: Complex JavaScript may require Playwright engines for full functionality

Privacy & Compliance

  • Only scrapes publicly accessible web content

  • Respects robots.txt files and crawling directives when possible

  • Complies with Apify platform terms of service and usage policies

  • Users should ensure they have appropriate permissions for target websites

  • Scraped content should be used in accordance with website terms of service

  • Does not store scraped content beyond the immediate processing session

  • Respects copyright and intellectual property rights of scraped content

Error Handling

Common error scenarios:

  • Invalid URLs: Returns error for malformed or inaccessible URLs

  • URL Limit Exceeded: Returns error if more than 10 URLs are provided

  • Parameter Out of Range: Returns error for values outside specified min/max ranges

  • Crawling Failed: Returns error if websites cannot be accessed or crawled

  • Content Too Large: Returns error if content exceeds token limits significantly

  • Engine Unavailable: Returns error if selected crawler engine is not available

  • Rate Limited: Temporary restriction if Apify platform limits are exceeded

  • Timeout: Returns error if crawling takes too long to complete

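Many of these errors can be caught before a request is made. The helper below is a hypothetical pre-flight check, not part of the tool, that mirrors the documented URL limit and parameter ranges.

```python
from urllib.parse import urlparse

def validate_input(payload: dict) -> None:
    # Hypothetical pre-flight check mirroring the documented limits.
    urls = payload.get("web_urls", [])
    if not 1 <= len(urls) <= 10:
        raise ValueError("web_urls must contain between 1 and 10 URLs")
    for url in urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Malformed URL: {url}")
    ranges = {
        "max_crawl_pages": (1, 100),
        "max_crawl_depth": (1, 50),
        "max_tokens_per_url": (1, 64000),
    }
    for name, (low, high) in ranges.items():
        value = payload.get(name)
        if value is not None and not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

validate_input({"web_urls": ["https://example.com"], "max_crawl_pages": 5})
```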