Web Scraping
A guide to using the Web Scraping Action to extract content from any public web page.
The Web Scraping Action allows you to extract content directly from a public web page. This is useful for gathering information, monitoring changes, or feeding website content into an LLM for analysis or summarization.
Configuration
You need to provide the URL to scrape and specify what kind of content you want to extract.
Input Parameters
Web URL (Text): The full URL of the web page you want to scrape (e.g., https://example.com/blog-post).
Scraping Type (Dropdown): Determines the type of content to extract: Text or Html.
Tags to Extract (List): Only for Text scraping. A list of HTML tags to target for text extraction (e.g., p, h1, div, span). This helps you focus on the most relevant content.
Max Tokens (Integer): The maximum amount of text (measured in tokens) to scrape from the page to avoid processing excessively large pages.
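To make the Max Tokens limit more concrete, here is a minimal Python sketch of capping text at a token budget. It is an illustration only, not the action's implementation: this guide does not say which tokenizer the action uses, so the sketch assumes whitespace-separated words as a rough stand-in for tokens, and the function name truncate_to_max_tokens is hypothetical.

```python
def truncate_to_max_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens whitespace-separated words of text.

    Assumption: one word counts as one token. Real LLM tokenizers split
    text differently, so actual token counts will vary.
    """
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    words = text.split()
    # Slicing past the end is safe, so short texts are returned unchanged.
    return " ".join(words[:max_tokens])


if __name__ == "__main__":
    sample = "Web scraping extracts content from a public web page"
    print(truncate_to_max_tokens(sample, 4))
```

Because the word count only approximates a token count, treat any limit you pick as a budget with some margin rather than an exact cutoff.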
Scraping Types
Text: This is the most common option. It extracts only the textual content from the page, filtered by the specific HTML tags you provide. This is the best way to get clean content for an LLM.
Html: This option scrapes the raw HTML source code of the page. This is useful if you need to parse the HTML structure itself in a subsequent Code Action.
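The difference between the two types can be sketched in Python. The example below shows one way that text could be pulled out of raw HTML for a given list of tags, using only the standard library. It is an assumption about the general technique, not the action's actual code; the names TagTextExtractor and extract_text and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside any of the target tags."""

    def __init__(self, target_tags):
        super().__init__()
        self.target_tags = {tag.lower() for tag in target_tags}
        self.depth = 0       # number of target tags currently open
        self.chunks = []     # one entry per outermost target element
        self._current = []   # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag in self.target_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.target_tags and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._current).split())
                if text:
                    self.chunks.append(text)
                self._current = []

    def handle_data(self, data):
        if self.depth > 0:
            self._current.append(data)


def extract_text(html, target_tags):
    """Return the text of the target tags, one element per line."""
    parser = TagTextExtractor(target_tags)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><body><h1>Hello World</h1><nav>Menu</nav>"
        "<p>First <b>bold</b> paragraph.</p><div>Skip me</div></body></html>"
    )
    print(extract_text(page, ["h1", "p"]))
```

With the tags h1 and p, the navigation and div text is dropped while text nested inside a paragraph (such as the bold word) is kept. The sketch assumes well-formed HTML in which every target tag is explicitly closed.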
Output
The action returns the content it scraped from the page as a single block of text.
Output Parameter
scraped_content (Text): The content extracted from the web page, either as plain text or as raw HTML, depending on the selected scraping type.
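When the scraping type is Html, a later step can parse scraped_content for structure. The sketch below collects link targets from raw HTML using Python's standard library. It assumes the downstream Code Action can run Python, which this guide does not specify, and the names LinkCollector and collect_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Record the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_links(scraped_content):
    """Return all link targets found in a raw HTML string, in page order."""
    collector = LinkCollector()
    collector.feed(scraped_content)
    collector.close()
    return collector.links


if __name__ == "__main__":
    html = '<p><a href="https://example.com/a">A</a> <a href="/b">B</a></p>'
    print(collect_links(html))
```

Relative links such as /b are returned as written; resolving them against the scraped page's URL would be a separate step.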
Example
If you configure the action to scrape the Text from a blog post URL and specify the tags h1 and p, the {{scrape_action.scraped_content}} output contains only the text of the post's h1 headings and p paragraphs, returned as a single block of text.