Firecrawl Scrape

Action ID: firecrawl_scrape

Description

Scrape a single web page and extract its content in a structured format. This node uses Firecrawl to fetch and parse the page, returning clean HTML, markdown, or JSON data.

Provider

Firecrawl

Connection

| Name | Description | Required | Category |
| --- | --- | --- | --- |
| Firecrawl Connection | The Firecrawl connection to use for the scrape. | | firecrawl |

Input Parameters

| Name | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | The URL to scrape |
| format | array | No | ["json"] | The format to use for the scrape |
| json_options | object | No | {} | The options to use for the JSON scrape |

JSON Schema
{
  "description": "Firecrawl scrape node input.",
  "properties": {
    "url": {
      "title": "URL",
      "type": "string",
      "format": "uri",
      "description": "The URL to scrape."
    },
    "format": {
      "title": "Format",
      "type": "array",
      "items": {"type": "string"},
      "default": ["json"],
      "description": "The format to use for the scrape."
    },
    "json_options": {
      "title": "JSON Options",
      "type": "object",
      "default": {},
      "description": "The options to use for the JSON scrape."
    }
  },
  "required": [
    "url"
  ],
  "title": "FirecrawlScrapeInput",
  "type": "object"
}
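
The defaults and the single required field in the input schema can be applied before the node is called. The following is a minimal sketch, not part of the node itself: `build_scrape_input` is a hypothetical helper name, and the check for an absolute http(s) URL is an assumption based on the Notes section rather than a documented rule of the node.

```python
from urllib.parse import urlparse


def build_scrape_input(url, formats=None, json_options=None):
    """Return a firecrawl_scrape input dict with the schema defaults applied.

    Hypothetical helper: the field names and defaults come from the input
    schema; the URL check is an assumption for illustration.
    """
    parsed = urlparse(url)
    # Require a full protocol and a host, e.g. https://example.com/page
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"url must be an absolute http(s) URL, got {url!r}")

    # A bare string would be split into characters by list(), so reject it.
    if isinstance(formats, str):
        raise TypeError("format must be an array of strings, not a single string")
    format_list = ["json"] if formats is None else list(formats)
    if not all(isinstance(item, str) for item in format_list):
        raise TypeError("format must be an array of strings")

    options = {} if json_options is None else dict(json_options)
    return {"url": url, "format": format_list, "json_options": options}


node_input = build_scrape_input("https://example.com/article/tech-news")
```

Omitted fields fall back to `["json"]` and `{}`, which matches the defaults listed in the schema.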

Output Parameters

| Name | Type | Description |
| --- | --- | --- |
| result | object | The output from the Firecrawl scrape |

JSON Schema
{
  "description": "Firecrawl scrape node output.",
  "properties": {
    "result": {
      "title": "Result",
      "type": "object",
      "description": "The output from the Firecrawl scrape."
    }
  },
  "required": [
    "result"
  ],
  "title": "FirecrawlScrapeOutput",
  "type": "object"
}
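
Because the output schema only guarantees a `result` object and does not fix the keys inside it, downstream steps may want to read fields defensively. A minimal sketch, assuming the node output arrives as a Python dict; `extract_result` is a hypothetical helper name, and the sample keys are taken from the usage examples rather than from the schema.

```python
def extract_result(node_output):
    """Return the `result` object from a firecrawl_scrape output, or raise."""
    if not isinstance(node_output, dict) or "result" not in node_output:
        raise KeyError("output is missing the required 'result' field")
    result = node_output["result"]
    if not isinstance(result, dict):
        raise TypeError("'result' must be an object")
    return result


output = {"result": {"title": "Latest Tech News", "author": "John Doe"}}
result = extract_result(output)

# Keys inside `result` depend on the page, so use .get() with a fallback.
title = result.get("title", "")
price = result.get("price")  # absent on an article page, so this is None
```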

How It Works

This node sends a URL to Firecrawl's scraping service, which navigates to the page, renders it (including JavaScript-generated content), extracts clean content, and returns it in the requested format as a structured object.

Usage Examples

Example 1: Scrape Article Content

Input:

url: "https://example.com/article/tech-news"
format: ["json"]
json_options: {}

Output:

result: {
  "title": "Latest Tech News",
  "author": "John Doe",
  "content": "Article content here...",
  "publish_date": "2024-01-15",
  "tags": ["technology", "news"]
}

Example 2: Scrape Product Page

Input:

url: "https://example.com/products/laptop"
format: ["json"]
json_options: {
  "include_images": true,
  "include_pricing": true
}

Output:

result: {
  "product_name": "Professional Laptop",
  "price": "$999.99",
  "rating": 4.5,
  "reviews_count": 250,
  "images": ["url1", "url2", "url3"],
  "specifications": {
    "processor": "Intel i7",
    "memory": "16GB RAM",
    "storage": "512GB SSD"
  }
}

Example 3: Scrape News Article

Input:

url: "https://news.example.com/story/123"
format: ["json"]
json_options: {}

Output:

result: {
  "headline": "Breaking News",
  "byline": "Staff Reporter",
  "publish_time": "2024-01-15T10:30:00Z",
  "body": "Full article content...",
  "images": ["image1.jpg", "image2.jpg"],
  "related_stories": ["story1", "story2"]
}

Common Use Cases

  • Content Extraction: Extract article text, headlines, and metadata from web pages

  • Price Monitoring: Scrape product pages to track pricing and availability changes

  • Research Data Collection: Gather data from multiple websites for analysis

  • News Aggregation: Collect news articles and summaries from news websites

  • Lead Generation: Extract contact information and business details from company websites

  • Market Intelligence: Monitor competitor websites for product updates and announcements

  • Data Enrichment: Augment existing data with information scraped from web sources

Error Handling

| Error Type | Cause | Solution |
| --- | --- | --- |
| Invalid URL | The URL format is incorrect or the domain does not exist | Verify that the URL is valid and properly formatted |
| Access Denied | The website blocks automated scraping or requires authentication | Check robots.txt and the site terms; consider using proxies if allowed |
| Page Not Found | The URL returns a 404 status | Verify that the URL is correct and the page still exists |
| Timeout | The page takes too long to load or render | Try a different URL or reduce the complexity of json_options |
| JavaScript Error | The page requires complex JavaScript that fails to execute | Ensure the website supports standard JavaScript and has no rendering issues |
| Empty Result | The page content cannot be extracted or parsed | Check that the page contains the expected content and that its structure has not changed |
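
Timeouts and empty results are the two cases above that a caller can usually handle in code. The sketch below retries on timeouts with exponential backoff and treats an empty `result` as an error. It rests on assumptions: `scrape` is a hypothetical callable standing in for however the node is invoked, and it is assumed to raise Python's built-in `TimeoutError` on a timeout. The actual exception types raised by the node are not documented here.

```python
import time


def scrape_with_retry(scrape, node_input, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `scrape(node_input)`, retrying on timeouts and rejecting empty results.

    `scrape` is a hypothetical stand-in for the node invocation.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            output = scrape(node_input)
        except TimeoutError:
            if attempt == attempts:
                raise  # out of retries, surface the timeout to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
            continue
        # A missing or empty `result` corresponds to the "Empty Result" row.
        if not output.get("result"):
            raise ValueError(f"empty result for {node_input['url']}")
        return output
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.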

Notes

  • URL Validation: Ensure URLs are publicly accessible and include the full protocol (http:// or https://).

  • Dynamic Content: Firecrawl handles JavaScript-rendered content automatically, so dynamic websites are supported.

  • Format Options: Specify multiple formats to get the same content in different structures.

  • JSON Options: Use json_options to customize the output structure and include/exclude specific elements.

  • Rate Limits: Be mindful of Firecrawl's rate limits when scraping multiple pages in rapid succession.

  • Robots.txt Compliance: Respect website terms of service and robots.txt directives when scraping.
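
The robots.txt note above can be checked before a URL is submitted. This is a minimal sketch using Python's standard `urllib.robotparser`. It assumes the robots.txt text has already been retrieved, and `allowed_by_robots` is a hypothetical helper name, not something the node provides.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots_txt = "User-agent: *\nDisallow: /private/\n"
ok = allowed_by_robots(robots_txt, "https://example.com/article/tech-news")
blocked = allowed_by_robots(robots_txt, "https://example.com/private/data")
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file directly. A robots.txt check does not replace reading the site's terms of service.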
