Extract Website Content
The internet holds a large amount of text data, which can be a valuable source for tasks such as Q/A, research, and context generation. The Extract Website Content action provides an easy-to-use component for scraping website content.
How to Use the Extract Website Content Action
Add the Component
Navigate to the Workflow page.
Click on + Create Workflow or select an existing workflow.
Click on + Add Action.
Select Extract Website Content from the list of action components.
Website URL
Enter a URL directly in the box, or use the {{ }} syntax to activate variable mode. For instance, if the URL is in an input component called my_url, use {{my_url}}. If the URL is the output of a Google-Search step, use {{google.organic[0].link}} to access the URL in the first search result.
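To make the variable syntax concrete, the sketch below shows one way a path such as google.organic[0].link could be resolved against the outputs of earlier steps. It is only an illustration: the context dictionary, its values, and the resolve and render helpers are hypothetical and are not part of the product.

```python
import re

# Hypothetical outputs of earlier steps; names mirror the examples in the text.
context = {
    "my_url": "https://example.com",
    "google": {"organic": [{"link": "https://example.org/first"}]},
}

# Matches either a field name (my_url, organic) or a list index ([0]).
_TOKEN = re.compile(r"([A-Za-z_]\w*)|\[(\d+)\]")


def resolve(path, data):
    """Walk a dotted, indexed path such as 'google.organic[0].link'."""
    value = data
    for name, index in _TOKEN.findall(path):
        value = value[int(index)] if index else value[name]
    return value


def render(template, data):
    """Replace each {{ path }} placeholder with the value it points to."""
    return re.sub(
        r"\{\{\s*(.+?)\s*\}\}",
        lambda match: str(resolve(match.group(1), data)),
        template,
    )


print(render("{{my_url}}", context))
print(render("{{google.organic[0].link}}", context))
```

A missing key or an out-of-range index raises an exception in this sketch, which is a reminder to check that the referenced step actually produced the field being accessed.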
Method
Specify whether to scrape the content as Text or HTML. By default, the content is scraped as Text.
Element Selector
You can specify which HTML element to scrape. By default, it is set to body. Note that using + Add new, you can specify a list of elements to be scraped.
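As a rough picture of what scraping an element as Text means, the sketch below collects the text inside a given tag using only the Python standard library. It is not the action's implementation, which is not described here; it assumes plain tag names such as body or h1 rather than full selectors, and the TagText class, the extract_text helper, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class TagText(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0   # greater than 0 while inside the target tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html, tag="body"):
    """Return the text inside `tag`, defaulting to body like the action."""
    parser = TagText(tag)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page used only for this example.
page = (
    "<html><head><title>T</title></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)

print(extract_text(page))                             # default element: body
print([extract_text(page, tag) for tag in ["h1", "p"]])  # a list of elements
```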
Extra Headers
If a website requires special information before it can be scraped, provide that data as a JSON object. For example, a website may require an authentication token called auth-token and a user-id.
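A headers object for that case could look like the output of the short sketch below. The two key names come from the description above; the values are placeholders, and the script is only a convenient way to produce and check valid JSON before pasting it into the field.

```python
import json

# Placeholder values: replace them with what the target website expects.
extra_headers = {
    "auth-token": "<your-auth-token>",
    "user-id": "<your-user-id>",
}

# Serialize to a JSON object suitable for the Extra Headers field.
headers_json = json.dumps(extra_headers, indent=2)
print(headers_json)
```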
Access the Action Output
The output is a dictionary with two keys, page and selectors, containing the extracted text and any selectors used. Below are examples where the default name assigned to the step is scrape.
Example Access
Note that a step name is different from the step title. Step titles can be found on the top left of steps. A step name is shown on the bottom left, in a smaller font and highlighted green.
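Following the variable syntax described under Website URL, a later step would presumably reference this output as {{scrape.page}} and {{scrape.selectors}}. The sketch below models the same access on a plain Python dictionary. The sample values are invented, and the exact shape of selectors is not specified here, so a list is assumed.

```python
# Hypothetical output of a step named "scrape"; values are illustrative only.
scrape = {
    "page": "Example Domain. This domain is for use in illustrative examples.",
    "selectors": ["body"],  # assumed shape: the list of selectors used
}

page_text = scrape["page"]            # corresponds to {{scrape.page}}
used_selectors = scrape["selectors"]  # corresponds to {{scrape.selectors}}

print(page_text)
print(used_selectors)
```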
Common Errors
Wrong URL Formatting
This error occurs when the URL field is set to a value that is not a string. When using the output of another step, make sure you access the URL field correctly.
Error:
Non-array Elements
When specifying elements to scrape, use + Add new only when you need more than one element. If you click the button, do not leave the new row empty. Use the x icon to the right of the row to remove the extra line.
Error:
Invalid URL
This error occurs when the provided URL is not valid.
Error:
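Both URL-related errors (a value that is not a string, and a URL that is not valid) can often be caught before a run. The sketch below is one possible pre-check using Python's standard library. The check_url helper is hypothetical, and the rule that only http and https URLs are accepted is an assumption, not something this page states.

```python
from urllib.parse import urlparse


def check_url(value):
    """Return None if value looks like a usable URL, else a short reason."""
    if not isinstance(value, str):
        # Typical cause: passing a whole list or dict from a previous step.
        return f"URL must be a string, got {type(value).__name__}"
    parsed = urlparse(value.strip())
    if parsed.scheme not in ("http", "https"):
        # Assumption: only web URLs are accepted.
        return "URL must start with http:// or https://"
    if not parsed.netloc:
        return "URL has no host name"
    return None


print(check_url("https://example.com"))
print(check_url(["https://example.com"]))
print(check_url("example.com"))
```

A result of None means the value passed these basic checks; it does not guarantee that the page exists or is reachable.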
Network Issue
This error normally occurs when there are network issues. Ensure your connection is stable, refresh the page, and try again.
Error:
Timeout
This error occurs when page navigation takes longer than the 30000 ms (30 seconds) timeout.
Error: