The Scrape URL block
The Scrape URL block allows you to save a webpage's contents in a variable that can be used in other parts of your workflow.
How to configure the Scrape URL block
Once the Scrape URL block is added to your workflow, you can configure it using the panel on the right. Here is how you would configure the block for an application that scrapes an online store and lists the products under $50:
- URL: Input the URL you want to scrape. Since the user will provide the URL, you can use a variable. In this example, the variable name is URL.
- Output variable: Set the Output Variable name. In this example, the Output Variable name is url_output.
- Provider: Choose the web scraper you want to use. MindStudio offers two options, Default and Firecrawl:
  - The Default web scraper is a basic scraper that pulls only the text from the first page of the URL.
  - Firecrawl is an LLM-powered web scraper that offers additional configuration for more advanced scraping, such as capturing images and hyperlinks. For this example, we’ve selected Firecrawl.
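To make the difference between text-only output and richer output concrete, here is a minimal Python sketch. It is not how either provider is implemented; the `PageSummary` class and the sample HTML are hypothetical and only illustrate what "text only" leaves out compared with a result that also captures hyperlinks and images.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Hypothetical helper: collects visible text, links, and image sources."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self.images = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not visible

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a script or style element.
        if not self._skip and data.strip():
            self.text.append(data.strip())


# Made-up page fragment standing in for an online store listing.
SAMPLE_HTML = """
<html><body>
  <h1>Desk Lamp</h1>
  <p>Price: $35.00</p>
  <a href="/products/desk-lamp">Details</a>
  <img src="/img/desk-lamp.png" alt="Desk lamp">
</body></html>
"""

page = PageSummary()
page.feed(SAMPLE_HTML)

# Text-only view, comparable in spirit to a basic scraper's output.
text_only = " ".join(page.text)

# Richer view that also keeps hyperlinks and images.
rich = {"text": text_only, "links": page.links, "images": page.images}

print(text_only)
print(rich["links"], rich["images"])
```

If your workflow only needs the words on the page, the text-only view is enough. If later blocks need product links or images, the richer result is the one to aim for.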
Firecrawl options
Here are the additional options available when using Firecrawl:
- Only Main Content: Firecrawl will return only the main content of the page, excluding any headers, footers, or navigation bars.
- Include Screenshot: Creates a screenshot of the top of the page that you are scraping.
- Screenshot variable: Assign a name to the screenshot. For this example, the screenshot variable is image.
- Wait for: Wait a specified number of milliseconds for the page to load before fetching the content. For this example, we are setting the wait time to 0 milliseconds.
- Advanced Options: Firecrawl offers three advanced settings:
  - Absolute Paths: Ensures all links and resources in the scraped content have full URLs.
  - Headers: Send headers with the request. Headers can be used to send cookies, user agents, etc.
  - Remove Tags: Specify HTML tags that should be removed from the scraped content.
- Use Extractor: Use an LLM to extract structured data from the page. You must specify the extraction mode, extraction prompt, and extraction schema. For this example, we have enabled Use Extractor with LLM extraction as the extraction mode. We’ve also added the Extraction Prompt “Based on the information on the page, extract the information from the schema.” and an Extraction Schema that describes the data structure we expect to extract from the page.
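As an illustration of the store example, the sketch below shows what an Extraction Schema for products might look like and how the extracted data could be filtered to products under $50. It assumes the schema field accepts a JSON Schema style object and that the extractor returns JSON matching it. Neither assumption is confirmed here, and the field names (`products`, `name`, `price`), the sample output, and the `products_under` helper are all hypothetical.

```python
import json

# Hypothetical Extraction Schema (JSON Schema style) for the store example.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                },
                "required": ["name", "price"],
            },
        }
    },
    "required": ["products"],
}

# Made-up extractor output, standing in for the contents of url_output.
SAMPLE_OUTPUT = json.dumps(
    {
        "products": [
            {"name": "Desk Lamp", "price": 35.0},
            {"name": "Office Chair", "price": 129.99},
            {"name": "Notebook", "price": 4.5},
        ]
    }
)


def products_under(raw_json, limit):
    """Return the products whose numeric price is strictly below limit."""
    data = json.loads(raw_json)
    return [
        p
        for p in data.get("products", [])
        if isinstance(p.get("price"), (int, float)) and p["price"] < limit
    ]


cheap = products_under(SAMPLE_OUTPUT, 50)
for product in cheap:
    print(f"{product['name']}: ${product['price']:.2f}")

# The schema itself can be pasted into a text field as JSON.
schema_text = json.dumps(EXTRACTION_SCHEMA, indent=2)
```

Declaring `price` as a number rather than a string keeps the comparison against 50 simple. If the page shows prices as text such as "$35.00", the extraction prompt would need to ask for the numeric value.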