Scrape Sitemap
Minimal playground version: v1.0.1
Get the product pages listed in a sitemap.xml and scrape data from each of them.
DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsSitemap; Server: wds://localhost:2807; StartUrls: http://playground.svc/sitemap.xml';
SELECT
    products.Task.Url AS ProductUrl,
    wds.ScrapeFirst(products.Task, 'css: h1', null) AS ProductName,
    wds.ScrapeFirst(products.Task, 'css: .price span', null) AS ProductPrice
FROM wds.Start(@jobConfig) root
-- Follow every <loc> URL in the sitemap; local-name() matches elements regardless of XML namespace.
OUTER APPLY wds.Crawl(root.Task, 'xpath: //*[local-name()="url"]/*[local-name()="loc"]', 'val') products;
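To see what the XPath selector picks out of a sitemap, the same expression can be tried against an inline document using SQL Server's built-in xml type. This is only a sketch and does not involve WDS: the two product URLs and the @sitemap variable are hypothetical sample data, and the urlset namespace is the one defined by the sitemaps.org protocol.

```sql
-- Hypothetical two-entry sitemap, inlined so the snippet needs no network access.
DECLARE @sitemap xml = N'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://playground.svc/product/1</loc></url>
  <url><loc>http://playground.svc/product/2</loc></url>
</urlset>';

-- local-name() compares element names without their namespace,
-- so the default sitemap namespace does not have to be declared.
SELECT loc.value('.', 'nvarchar(2048)') AS ProductUrl
FROM @sitemap.nodes('//*[local-name()="url"]/*[local-name()="loc"]') AS s(loc);
```

The query returns one row per loc element. A plain //url/loc path would match nothing here, because the elements live in the sitemap namespace.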
Get the product pages listed in a sitemap.xml and scrape data from each of them with the ScrapeMultiple function. This approach is a bit faster because all fields of a page are scraped in one call, so fewer requests to the WDS API Server are required.
DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsSitemap; Server: wds://localhost:2807; StartUrls: http://playground.svc/sitemap.xml';
SELECT
    products.Task.Url AS ProductUrl,
    product.ScrapeResult.GetFirst('ProductName') AS ProductName,
    product.ScrapeResult.GetFirst('ProductPrice') AS ProductPrice
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'xpath: //*[local-name()="url"]/*[local-name()="loc"]', 'val') products
CROSS APPLY (
    SELECT wds.ScrapeMultiple(products.Task)
        .AddScrapeParams('ProductName', 'css: h1', null)
        .AddScrapeParams('ProductPrice', 'css: .price span', null) AS ScrapeResult
) product;