Web Scraping

A guide to using the Web Scraping Action to extract content from any public web page.

The Web Scraping Action allows you to extract content directly from a public web page. This is useful for gathering information, monitoring changes, or feeding website content into an LLM for analysis or summarization.

Configuration

You need to provide the URL to scrape and specify what kind of content you want to extract.

Input Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| Web URL | Text | The full URL of the web page you want to scrape (e.g., https://example.com/blog-post). |
| Scraping Type | Dropdown | Determines the type of content to extract: Text or Html. |
| Tags to Extract | List | (Only for Text scraping) A list of HTML tags to target for text extraction (e.g., p, h1, div, span). This helps you focus on the most relevant content. |
| Max Tokens | Integer | The maximum amount of text (measured in tokens) to scrape from the page to avoid processing excessively large pages. |
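
To make the effect of Max Tokens concrete, here is a minimal Python sketch of capping a block of text at a token budget. It is an illustration only: the function name `truncate_to_max_tokens` is hypothetical, and it assumes that a whitespace-separated word is a rough stand-in for a token. The Action's actual tokenizer is not documented here and will likely count differently.

```python
def truncate_to_max_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated words.

    Assumption: one word is treated as one token. Real tokenizers
    split text differently, so this is only an approximation.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    words = text.split()
    # Re-joining with single spaces also collapses runs of whitespace.
    return " ".join(words[:max_tokens])


sample = "one two three four five six"
print(truncate_to_max_tokens(sample, 4))  # prints "one two three four"
```

A page shorter than the limit comes back unchanged apart from normalized whitespace, so a generous limit only matters for very large pages.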

Scraping Types

  • Text: The most common option. It extracts only the textual content of the page, limited to the HTML tags you list in Tags to Extract. Use it when you want clean content to pass to an LLM.

  • Html: This option scrapes the raw HTML source code of the page. This is useful if you need to parse the HTML structure itself in a subsequent Code Action.
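
As an illustration of the Html option, the sketch below shows how a subsequent Code Action might parse the raw HTML, here by collecting link targets. It assumes the Code Action can run Python and receives the scraped HTML as a string. `LinkCollector`, `extract_links`, and `sample_html` are hypothetical names, and the sample markup is made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(scraped_content: str) -> list[str]:
    """Return all link targets found in a raw HTML string."""
    parser = LinkCollector()
    parser.feed(scraped_content)
    parser.close()
    return parser.links


# Hypothetical stand-in for the scraped_content output in Html mode.
sample_html = (
    '<html><body><a href="https://example.com/a">A</a>'
    '<p>text</p><a href="/b">B</a></body></html>'
)
print(extract_links(sample_html))  # prints ['https://example.com/a', '/b']
```

The standard-library parser keeps the sketch dependency-free. If third-party packages are available in your environment, a dedicated HTML library may be more convenient for complex pages.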

Output

The action returns the content it scraped from the page as a single block of text.

Output Parameter

| Parameter | Type | Description |
| --- | --- | --- |
| scraped_content | Text | The content extracted from the web page, either as plain text or as raw HTML, depending on the selected scraping type. |

Example

If you configure the action to scrape the Text from a blog post URL and specify the tags h1 and p, the {{scrape_action.scraped_content}} output might look like this:

My Awesome Blog Post

This is the first paragraph of my blog post. It contains some very interesting information.

This is the second paragraph. It continues the topic and provides further details.
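
The tag filtering behind this example can be sketched in a few lines of Python. This is not the Action's actual implementation, only a minimal model of it under stated assumptions: `TagTextExtractor`, `extract_text`, and the sample page are hypothetical, and each targeted element is assumed to produce one block of text separated by blank lines.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects text that appears inside any of the targeted tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = {t.lower() for t in tags}
        self.depth = 0      # number of targeted tags currently open
        self.blocks = []    # one entry per outermost targeted element
        self._buffer = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse internal whitespace and skip empty elements.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_text(html: str, tags) -> str:
    """Return the text of the targeted tags, one block per element."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


# Hypothetical page; the navigation <div> is not in the tag list.
sample_html = """
<html><body>
<h1>My Awesome Blog Post</h1>
<div class="nav">Home | About</div>
<p>This is the first paragraph of my blog post. It contains some very interesting information.</p>
<p>This is the second paragraph. It continues the topic and provides further details.</p>
</body></html>
"""
print(extract_text(sample_html, ["h1", "p"]))
```

Run on the sample page, the sketch prints the heading and the two paragraphs and skips the navigation text, because div is not among the requested tags. This is how a short tag list keeps unrelated page chrome out of the output.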
