Extract Website Content

The internet holds a large amount of text that can serve as source material for tasks such as question answering, research, and context generation. The Extract Website Content action is a ready-made component for scraping the content of a web page.

How to Use the Extract Website Content Action

Add the Component

Navigate to the Workflow page.
Click on + Create Workflow or select an existing workflow.
Click on + Add Action.
Select Extract Website Content from the list of action components.

Website URL

Enter a URL directly in the box, or use the {{ }} syntax to activate variable mode. For instance, if the URL comes from an input component called my_url, use {{my_url}}. If the URL is the output of a Google-Search step, use {{google.organic[0].link}} to access the URL of the first search result.
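To make the variable syntax concrete, the sketch below shows one way a {{ }} path could be resolved against the outputs of earlier steps. It is an illustration only, not the platform's implementation: the context dictionary, the example URLs, and the helper names resolve and render are all hypothetical.

```python
import re

# Hypothetical outputs of earlier steps, keyed by step name (assumed data).
context = {
    "my_url": "https://example.com/docs",
    "google": {"organic": [{"link": "https://example.com/first-result"}]},
}

# Matches either a key name (google, organic, link) or an index such as [0].
TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def resolve(path, data):
    """Walk a dotted/indexed path such as google.organic[0].link."""
    value = data
    for name, index in TOKEN.findall(path):
        value = value[name] if name else value[int(index)]
    return value


def render(template, data):
    """Replace every {{ path }} placeholder with its resolved value."""
    return re.sub(
        r"\{\{\s*(.+?)\s*\}\}",
        lambda match: str(resolve(match.group(1), data)),
        template,
    )


input_url = render("{{my_url}}", context)
first_result_url = render("{{google.organic[0].link}}", context)
```

The path is read left to right: the step name first, then each key or list index inside that step's output.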

Method

Specify whether to scrape the content as Text or HTML. The default is Text.

Element Selector

Specify which HTML element to scrape. The default is body. To scrape several elements, use + Add new to build a list of them.
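As a rough picture of what scraping an element as Text means, the sketch below pulls the visible text out of one element of a sample page using only Python's standard library. It is not how the action works internally, and it matches plain tag names only (an assumption made to keep it short); the class TagText, the function extract_text, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect the visible text found inside every <tag> element."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, tag="body"):
    """Return the text inside the given tag; body mirrors the default selector."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Pricing</h1><p>Free tier</p></body></html>"
)
body_text = extract_text(sample)           # whole body
heading_text = extract_text(sample, "h1")  # a narrower element
```

Narrowing the selector from body to a specific element trims away navigation, footers, and other text that is usually irrelevant downstream.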

Extra Headers

If a website requires extra information in the request before it can be scraped, provide that information as a JSON object. The example below shows a case where an authentication token called auth-token and a user-id are required.

{
    "auth-token": "AUTHENTICATION-TOKEN",
    "user-id": "USER-ID"
}
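For readers who want to see what such headers amount to, here is a minimal sketch that attaches the same JSON object to an HTTP request with Python's standard library. The request is built but never sent, and the target URL is a placeholder; how the action itself forwards the headers is not documented here.

```python
import json
from urllib.request import Request

# The same JSON object as above, parsed into a Python dict.
extra_headers = json.loads(
    '{"auth-token": "AUTHENTICATION-TOKEN", "user-id": "USER-ID"}'
)

# Build (but do not send) a request that carries the extra headers.
# The URL is a placeholder chosen for illustration.
request = Request("https://example.com/private-page", headers=extra_headers)

# urllib stores header names via str.capitalize(): "auth-token" -> "Auth-token".
sent_headers = dict(request.header_items())
```

HTTP header names are case-insensitive, so the change in capitalization does not affect how a server reads them.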

Access the Action Output

The output is a dictionary with two keys: page, which contains the extracted text, and selectors, which contains any selectors used. In the examples below, the step has the default name scrape.

Example Access

# Accessing the extracted text
scrape.output.page

# Accessing the selectors used
scrape.output.selectors

Note that a step name is different from a step title. The step title appears at the top left of a step. The step name appears at the bottom left, in a smaller font and highlighted in green.

Common Errors

Wrong URL Formatting

This error occurs when the URL field is set to a value that is not a string. When using the output of another step, make sure you reference the field that holds the URL itself (for example, the link of a single search result) rather than an enclosing object or list.

Error:

URL must be a string

Non-array Elements

The Element Selector field must be a list. To scrape more than one element, use + Add new to add a row for each one. If you click the button, do not leave the new row empty: fill it in, or remove it with the x icon to the right of the row.

Error:

Studio transformation browserless_scrape input validation error: must be array {"type":"array"} /element_selector
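The validator expects element_selector to be an array. If the selectors are assembled by another step or script, a small normalizing helper along these lines can guarantee that shape. This is a sketch under assumptions: the function name as_selector_list is hypothetical, and falling back to body simply mirrors the documented default.

```python
def as_selector_list(value, default="body"):
    """Coerce a selector value into a non-empty list of non-empty strings."""
    if value is None:
        return [default]
    if isinstance(value, str):
        value = [value]  # a single selector becomes a one-item list
    # Drop empty rows, which correspond to rows left blank in the form.
    cleaned = [s.strip() for s in value if isinstance(s, str) and s.strip()]
    return cleaned or [default]
```

For example, as_selector_list("h1") yields a one-item list, and a list containing a blank entry comes back without it.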

Invalid URL

This error occurs when the provided URL is not valid.

Error:

Protocol error (Page.navigate): Cannot navigate to invalid URL
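Both URL errors (a non-string value and an invalid URL) can be caught before the step runs. The check below is a minimal sketch using Python's standard library. The function name check_url is hypothetical, and accepting only http and https URLs is an assumption, not a documented rule of the action.

```python
from urllib.parse import urlparse


def check_url(url):
    """Return None if the URL looks usable, otherwise a short reason."""
    if not isinstance(url, str):
        return "URL must be a string"
    parts = urlparse(url.strip())
    # Assumption: a navigable URL needs an http(s) scheme and a host.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return "invalid URL"
    return None
```

A common cause of the second error is a missing scheme: example.com/page fails the check, while https://example.com/page passes.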

Network Issue

This error usually indicates a network problem. Check that your connection is stable, refresh the page, and try again.

Error:

Navigation failed because browser has disconnected!

Timeout

This error occurs when loading the page takes longer than the navigation timeout of 30000 ms (30 seconds).

Error:

Navigation timeout of 30000 ms exceeded
