Web Scraping using Apify
Action ID: web_scraping_apify
Description
Scrape content from one or more web pages using Apify's crawling infrastructure via the PixelML API.
Connection
PixelML Connection (required, key: pixelml)
The PixelML connection used to call the PixelML API.
Input Parameters
web_urls (array, required): List of web URLs to scrape (1-10 URLs)
crawler_type (string, default: playwright:firefox): Crawling engine (firefox, chrome, adaptive, cheerio, jsdom)
max_tokens_per_url (integer, default: 32000): Maximum tokens to scrape per URL (1-64000)
Output Parameters
scraped_contents (array): List of extracted content strings, one per URL
How It Works
This node uses Apify's web scraping infrastructure through the PixelML API to extract content from web pages. You provide URLs and select a crawling engine. The node uses headless browsers (Playwright) or HTML parsers (Cheerio/JSDOM) to access pages, extract text content, and return it in a structured format. The max_tokens_per_url parameter limits the amount of content extracted per URL, which is useful for controlling costs and processing time.
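The input constraints above (1-10 URLs, 1-64000 tokens, absolute URLs) can be validated before invoking the node. A minimal sketch in Python; the helper name and payload shape mirror this page's parameter names but are illustrative assumptions, not a documented PixelML API contract:

```python
def build_scrape_payload(web_urls, crawler_type="playwright:firefox",
                         max_tokens_per_url=32000):
    """Check inputs against the node's documented limits and build a payload.

    Hypothetical helper: the dict keys follow the node's input parameters,
    but how the payload is submitted depends on your PixelML setup.
    """
    if not 1 <= len(web_urls) <= 10:
        raise ValueError("web_urls must contain 1-10 URLs")
    if not all(u.startswith(("http://", "https://")) for u in web_urls):
        raise ValueError("each URL must include http:// or https://")
    if not 1 <= max_tokens_per_url <= 64000:
        raise ValueError("max_tokens_per_url must be 1-64000")
    return {
        "web_urls": web_urls,
        "crawler_type": crawler_type,
        "max_tokens_per_url": max_tokens_per_url,
    }

payload = build_scrape_payload(["https://example.com/blog/article-1"],
                               crawler_type="cheerio",
                               max_tokens_per_url=10000)
```

Failing fast on the client side avoids spending API credits on requests the node would reject anyway.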
Usage Examples
Example 1: Scrape Blog Articles
Input:
web_urls: ["https://example.com/blog/article-1", "https://example.com/blog/article-2"]
crawler_type: "playwright:firefox"
max_tokens_per_url: 32000
Output:
scraped_contents: [
"Article 1 title and content extracted here...",
"Article 2 title and content extracted here..."
]
Example 2: Fast Scraping with Cheerio
Input:
web_urls: ["https://news.example.com/latest"]
crawler_type: "cheerio"
max_tokens_per_url: 10000
Output:
scraped_contents: [
"Latest news article content with text only..."
]
Example 3: JavaScript-Heavy Site
Input:
web_urls: ["https://spa-app.example.com/data"]
crawler_type: "playwright:adaptive"
max_tokens_per_url: 20000
Output:
scraped_contents: [
"Dynamically loaded content from JavaScript app..."
]
Common Use Cases
Content Aggregation: Collect articles, blog posts, or news from multiple sources
Competitive Analysis: Extract product information, pricing, or features from competitor websites
Data Mining: Gather data from public websites for research or analysis
SEO Analysis: Extract meta tags, headings, and content for SEO auditing
Price Monitoring: Track product prices across e-commerce sites
News Monitoring: Collect latest news and updates from multiple sources
Research Automation: Automatically gather information from academic or research websites
Error Handling
Invalid URL: the URL is malformed or inaccessible. Verify URLs are complete and publicly accessible (include http:// or https://).
Connection Timeout: the website took too long to respond. Increase the timeout or try again later; check whether the website is online.
Rate Limit Exceeded: too many scraping requests. Space out requests or upgrade your PixelML/Apify plan.
Bot Detection: the website blocked the scraping attempt. Try a different crawler_type, or use "playwright:adaptive" for better bot evasion.
Content Too Large: the scraped content exceeds max_tokens_per_url. Reduce max_tokens_per_url or target specific pages with less content.
JavaScript Required: the page requires JavaScript but cheerio/jsdom was used. Switch to playwright:firefox or playwright:chrome for JS execution.
Authentication Required: the website requires a login. This node works with public pages only; use authenticated scraping methods instead.
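When a lightweight engine hits the Bot Detection or JavaScript Required cases above, a common pattern is to retry the same URL with a heavier crawler. A minimal sketch, where the scrape callable is a hypothetical stand-in for however your workflow invokes this node:

```python
# Fallback chain: try the fast static parser first, then full browsers.
# The engine names come from this node's crawler_type options; `scrape`
# is a placeholder for the actual node invocation, not a real API.
FALLBACK_ORDER = ["cheerio", "playwright:firefox", "playwright:adaptive"]

def scrape_with_fallback(url, scrape, crawlers=FALLBACK_ORDER):
    """Try each crawler in order; return the first successful result."""
    last_error = None
    for crawler_type in crawlers:
        try:
            return scrape(url, crawler_type)  # expected to return scraped text
        except RuntimeError as exc:  # e.g. bot detection, JS required
            last_error = exc
    raise RuntimeError(f"all crawlers failed for {url}") from last_error
```

The trade-off is extra credits on failed attempts; if you know a site is JavaScript-heavy, start with a Playwright engine directly.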
Notes
Crawler Selection: Use Playwright for JavaScript-heavy sites, Cheerio/JSDOM for faster, static HTML parsing.
Token Limits: Higher max_tokens increases cost and processing time. Start with lower values and adjust as needed.
URL Limits: Maximum 10 URLs per request. For batch scraping, use multiple node instances or loop structures.
Adaptive Mode: playwright:adaptive intelligently switches between browsers for optimal scraping success.
Legal Compliance: Ensure you have permission to scrape target websites and comply with their robots.txt and terms of service.
Performance: Cheerio is fastest but doesn't execute JavaScript. Playwright is slower but handles dynamic content.
Cost Management: Each scraping operation consumes API credits. Monitor usage especially with high max_tokens values.
Content Format: Scraped content is returned as plain text. HTML structure is typically stripped out.
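The 10-URL cap mentioned in the notes above can be handled on the caller's side by batching. A minimal sketch, assuming each batch is submitted as a separate node invocation:

```python
def batch_urls(urls, batch_size=10):
    """Split a long URL list into chunks that fit the node's per-request limit."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# 23 URLs split into batches of at most 10: [10, 10, 3]
batches = batch_urls([f"https://example.com/page/{n}" for n in range(23)])
```

Each batch can then feed one node instance, or one iteration of a loop structure.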