Advanced Web Scraping

A guide to the Apify integration for advanced, large-scale web scraping tasks.

While the standard Web Scraping action is great for simple pages, the Advanced Web Scraping Action provides a powerful, large-scale scraping solution by integrating with Apify.

Use this action when you need to:

  • Scrape data from complex, modern websites that rely heavily on JavaScript.

  • Handle features like infinite scrolling, pagination, and pop-ups.

  • Extract thousands of records from a site.

  • Run scrapers on a schedule.

**Ethical Scraping** Always check a website's `robots.txt` file and Terms of Service before scraping. Respect the site's rules and avoid overwhelming its servers with too many requests.
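As a quick illustration (outside of AgenticFlow itself), Python's standard library can check whether a given path is allowed by a site's `robots.txt`. The domain and rules below are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt body; in practice you would fetch it from
# https://www.example-store.com/robots.txt (hypothetical domain).
robots_txt = """
User-agent: *
Disallow: /checkout/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# Category pages are allowed; the checkout flow is not.
print(parser.can_fetch("*", "https://www.example-store.com/category/all-products"))  # True
print(parser.can_fetch("*", "https://www.example-store.com/checkout/cart"))          # False
```

If `can_fetch` returns `False` for the pages you planned to scrape, pick a different data source rather than working around the restriction.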

Connection Setup

You will need an Apify account and your Apify API token.

  1. Sign up for an account on apify.com.

  2. Find your API token in your Apify account settings under Settings > Integrations.

  3. In AgenticFlow, navigate to Settings > Connections and add a new Apify Connection, providing your API token.

How it Works: Apify Actors

Apify works using "Actors," which are pre-built cloud programs designed for specific scraping tasks. The most common one is the Website Content Crawler, which can be configured to extract specific data from a site.

You don't configure the scraper inside AgenticFlow. Instead, you configure an Actor on the Apify platform and then simply tell the AgenticFlow action to run it.
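Under the hood, running an Actor amounts to a call against Apify's v2 REST API. The sketch below shows roughly what such a request looks like; it only builds the request (the token is a placeholder, and sending it is left commented out):

```python
import json
from urllib.request import Request

APIFY_API = "https://api.apify.com/v2"

def build_run_request(actor_id: str, token: str, run_input: dict) -> Request:
    """Build the HTTP request that starts an Actor run.

    The Apify REST API addresses Actors as `username~actorname`,
    so the `/` in an ID like `apify/website-content-crawler` is
    replaced with `~` in the URL.
    """
    url = f"{APIFY_API}/acts/{actor_id.replace('/', '~')}/runs?token={token}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

req = build_run_request(
    "apify/website-content-crawler",
    "YOUR_APIFY_TOKEN",  # placeholder, not a real token
    {"startUrls": [{"url": "https://www.example-store.com/category/all-products"}]},
)
print(req.full_url)
# Sending it with urllib.request.urlopen(req) would start the run.
```

AgenticFlow handles this request (and fetching the run's results) for you; the snippet is only meant to demystify what the action does on your behalf.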

Configuration

Input Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Connection | Connection | Select the Apify connection you created. |
| Actor ID | Text | The ID of the Apify Actor you want to run (e.g., `apify/website-content-crawler`). |
| Run Parameters | JSON | A JSON object containing the specific configuration for the Actor run, such as the target URLs and the data to extract. |

Output Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Output | Array | The structured data extracted by the Apify Actor, usually an array of JSON objects. |

Example: Scraping Product Names and Prices

Let's say you want to get the names and prices of all products from a specific e-commerce category page.

  1. On Apify:

    • Find the "Website Content Crawler" actor and create a new task.

    • Configure it to scrape the e-commerce category URL.

    • Specify the CSS selectors for the product name (`h2.product-title`) and price (`span.price`).

    • Save the task and note its Actor ID.

  2. In AgenticFlow:

    • Add the Advanced Web Scraping Action.

    • Connection: Select your Apify connection.

    • Actor ID: `apify/website-content-crawler` (or the specific ID of your saved task).

    • Run Parameters: This JSON tells the Actor what to do for this specific run.

      {
        "startUrls": [
          { "url": "https://www.example-store.com/category/all-products" }
        ],
        "crawlerType": "cheerio"
      }
  3. Result:

    • The action will trigger the Apify Actor, which will visit the URL, extract the data, and return it.

    • The Output of the action will be an array of objects, ready to be used in a Map action or saved to a Google Sheet.

      [
        { "productName": "Wireless Mouse", "price": "$29.99" },
        { "productName": "Mechanical Keyboard", "price": "$89.99" }
      ]
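Downstream steps often need the price as a number rather than a string. A minimal sketch of that normalization, using the field names from the sample output above (adjust them to whatever your Actor actually returns):

```python
# Sample Output array, as returned by the Actor in the example above.
output = [
    {"productName": "Wireless Mouse", "price": "$29.99"},
    {"productName": "Mechanical Keyboard", "price": "$89.99"},
]

def parse_price(price: str) -> float:
    """Strip the currency symbol and thousands separators, then parse."""
    return float(price.replace("$", "").replace(",", ""))

# Rows ready for a Map action or a Google Sheet.
rows = [(item["productName"], parse_price(item["price"])) for item in output]
print(rows)  # [('Wireless Mouse', 29.99), ('Mechanical Keyboard', 89.99)]

total = sum(price for _, price in rows)
print(f"Total: ${total:.2f}")  # Total: $119.98
```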
