Scrape Sitemap

Minimum playground version: v1.0.1

Get the product page URLs from a sitemap.xml file and scrape data from each page.

DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsSitemap; Server: wds://localhost:2807; StartUrls: http://playground.svc/sitemap.xml';

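SELECT
    products.Task.Url AS ProductUrl,
    -- One ScrapeFirst call per field, each against the same product page task
    wds.ScrapeFirst(products.Task, 'css: h1', null) AS ProductName,
    wds.ScrapeFirst(products.Task, 'css: .price span', null) AS ProductPrice
FROM wds.Start(@jobConfig) root
    -- Follow every <loc> entry inside a <url> element of the sitemap;
    -- local-name() matches the elements regardless of their XML namespace
    OUTER APPLY wds.Crawl(root.Task, 'xpath: //*[local-name()="url"]/*[local-name()="loc"]', 'val') products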
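
The XPath selector in this query uses local-name() because sitemap files normally declare a default XML namespace, so a plain //url/loc path would match nothing. The sketch below shows the same expression at work using only SQL Server's built-in xml type, without WDS. The sitemap fragment and the product URLs in it are hypothetical and stand in for whatever the playground actually serves.

```sql
-- Hypothetical sitemap fragment (assumption); the real playground sitemap may differ.
DECLARE @sitemap xml = N'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://playground.svc/product/1</loc></url>
  <url><loc>http://playground.svc/product/2</loc></url>
</urlset>';

-- local-name() compares element names without their namespace,
-- so the query needs no namespace declaration.
SELECT t.loc.value('.', 'nvarchar(2048)') AS ProductUrl
FROM @sitemap.nodes('//*[local-name()="url"]/*[local-name()="loc"]') AS t(loc);
```

Under these assumptions the query should return one row per loc element. It is meant only to illustrate what the selector matches, not how wds.Crawl processes the sitemap internally.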

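Get the product page URLs from a sitemap.xml file and scrape data from each page using the ScrapeMultiple function. This approach is somewhat faster because both fields are requested through a single ScrapeMultiple call per page instead of one ScrapeFirst call per field, so fewer requests to the WDS API Server are required.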

DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsSitemap; Server: wds://localhost:2807; StartUrls: http://playground.svc/sitemap.xml';

SELECT
    products.Task.Url AS ProductUrl,
    -- Read each named field back from the combined scrape result
    product.ScrapeResult.GetFirst('ProductName') AS ProductName,
    product.ScrapeResult.GetFirst('ProductPrice') AS ProductPrice
FROM wds.Start(@jobConfig) root
    -- Follow every <loc> entry inside a <url> element of the sitemap
    OUTER APPLY wds.Crawl(root.Task, 'xpath: //*[local-name()="url"]/*[local-name()="loc"]', 'val') products
    CROSS APPLY (
        -- Register both selectors under a name on one ScrapeMultiple call per page
        SELECT wds.ScrapeMultiple(products.Task)
                .AddScrapeParams('ProductName', 'css: h1', null)
                .AddScrapeParams('ProductPrice', 'css: .price span', null) AS ScrapeResult
    ) product
