The CrawlMdr tool recursively crawls and scrapes web resources according to a hierarchical configuration. It executes every crawling and scraping operation defined in that configuration: traversing pages, following links, and extracting structured data fields. The tool is designed for efficient, large-scale data extraction.
Arguments
| Name | Type | Description |
| --- | --- | --- |
| tasks | Array of DownloadTask | Required. Initial download tasks (from StartJob) |
| crawlMdrConfig | CrawlMdrConfig | Required. Crawl Multi-Dimensional Recursive (MDR) configuration |
DownloadTask
| Name | Type | Description |
| --- | --- | --- |
| Id | String | Required. Task Id |
| Url | String | Required. Page URL |
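
Both DownloadTask fields are required, so it can be useful to check tasks before passing them on. The following is a minimal Python sketch, assuming tasks are represented as plain dictionaries keyed by the field names above. The helper `missing_required`, the sample Ids, and the example.com URL are illustrative and not part of the tool.

```python
def missing_required(task: dict) -> list[str]:
    """Return the names of required DownloadTask fields that are absent or empty."""
    return [field for field in ("Id", "Url") if not task.get(field)]


# Hypothetical sample tasks; real tasks come from StartJob.
tasks = [
    {"Id": "task-1", "Url": "https://example.com/catalog"},
    {"Id": "task-2"},  # Url left out on purpose
]

for task in tasks:
    print(task.get("Id"), missing_required(task))
```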
CrawlMdrConfig
Crawl Multi-Dimensional Recursive (MDR) configuration

| Name | Type | Description | MCP Tools |
| --- | --- | --- | --- |
| Name | String | Required. Name of the level (e.g., ‘/’, ‘products’, etc.) | Set via CrawlMdrConfigCreate, CrawlMdrConfigUpsertSub tools |
| ScrapeParams | Array of ScrapeParams | List of data fields to extract | Set via CrawlMdrConfigUpsertScrapeParams tool |
| CrawlParams | Array of CrawlParams | List of link selectors for crawling on the current level | Set via CrawlMdrConfigUpsertCrawlParams tool |
| SubCrawlMdrConfigs | Array of SubCrawlMdrConfigs | List of sub-levels (child pages/sections), with transition crawl parameters | Set via CrawlMdrConfigUpsertSub tool |
ScrapeParams
| Name | Type | Description |
| --- | --- | --- |
| FieldName | String | Required. Name of the data field to extract |
| Selector | String | Required. A valid CSS or XPath selector. |
| Attribute | String | Optional. Attribute name to get data from. Use val to get inner text. Default value: val |
CrawlParams
| Name | Type | Description |
| --- | --- | --- |
| Selector | String | Required. A valid CSS or XPath selector. |
| Attribute | String | Optional. Attribute name to get data from. Use val to get inner text. Default value: href |
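
ScrapeParams and CrawlParams share the Selector and Attribute fields but differ in the default Attribute: val for the former, href for the latter. A minimal Python sketch of that defaulting rule follows, assuming parameters are plain dictionaries keyed by the field names above. The helper `effective_attribute` and the sample selectors are hypothetical, and treating an empty Attribute like a missing one is an assumption.

```python
# Defaults as documented: ScrapeParams fall back to "val" (inner text),
# CrawlParams fall back to "href".
DEFAULT_ATTRIBUTE = {"scrape": "val", "crawl": "href"}


def effective_attribute(param: dict, kind: str) -> str:
    """Return the attribute read for a ScrapeParams ("scrape") or CrawlParams ("crawl") entry."""
    if kind not in DEFAULT_ATTRIBUTE:
        raise ValueError(f"unknown parameter kind: {kind!r}")
    # Assumption: a missing or empty Attribute falls back to the default.
    return param.get("Attribute") or DEFAULT_ATTRIBUTE[kind]


# Hypothetical sample parameters.
title_field = {"FieldName": "title", "Selector": "h1"}
image_field = {"FieldName": "image", "Selector": "img.main", "Attribute": "src"}
next_page = {"Selector": "a.next"}

print(effective_attribute(title_field, "scrape"))
print(effective_attribute(image_field, "scrape"))
print(effective_attribute(next_page, "crawl"))
```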
SubCrawlMdrConfigs
SubCrawlMdrConfigs is a CrawlMdrConfig with one additional field:

| Name | Type | Description |
| --- | --- | --- |
| SubCrawlParams | CrawlParams | Required. Transition crawl parameters to move to a sublevel |
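
To make the hierarchy concrete, here is a hedged Python sketch of a two-level configuration built from the field names documented above. The level names, selectors, and the helper `level_paths` are illustrative assumptions, and the exact serialized shape the tool expects (for example, key casing) is not confirmed by this page.

```python
import json

# Hypothetical configuration: a root level "/" that follows pagination links,
# and a "products" sub-level reached by following product links.
config = {
    "Name": "/",
    "ScrapeParams": [],
    "CrawlParams": [{"Selector": "a.next-page"}],  # Attribute defaults to href
    "SubCrawlMdrConfigs": [
        {
            "Name": "products",
            # Transition parameters: links leading from "/" to this sub-level.
            "SubCrawlParams": {"Selector": "a.product-link"},
            "ScrapeParams": [
                {"FieldName": "title", "Selector": "h1"},
                {"FieldName": "image", "Selector": "img.main", "Attribute": "src"},
            ],
            "CrawlParams": [],
            "SubCrawlMdrConfigs": [],
        }
    ],
}


def level_paths(level: dict, prefix: str = "") -> list[str]:
    """List every level in the hierarchy as a slash-separated path, depth first."""
    name = level["Name"]
    path = name if not prefix else f"{prefix.rstrip('/')}/{name}"
    paths = [path]
    for sub in level.get("SubCrawlMdrConfigs", []):
        paths.extend(level_paths(sub, path))
    return paths


print(level_paths(config))
print(json.dumps(config, indent=2))
```

Because each sub-level is itself a CrawlMdrConfig, the same recursive walk works at any depth.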
Return Type
Returns a CrawlMdrResult
CrawlMdrResult
| Name | Type | Description |
| --- | --- | --- |
| FailedDownloadTaskIds | Array of String | Required. List of IDs for download tasks that failed |
| FailedDownloadTaskCount | Int | Required. Number of failed download tasks |
| SuccessfulDownloadTaskCount | Int | Required. Number of successful download tasks |
| DataCursor | CrawlMdrDataCursor | Optional. Cursor for fetching batches of scraped data (null if no data) |
CrawlMdrDataCursor
| Name | Type | Description |
| --- | --- | --- |
| JobId | String | Required. Job Id |
| NextCursor | String | Optional. Cursor for fetching the next batch of scraped data (null if done) |
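
A caller will typically want to inspect the counts and the two nullable cursor fields before deciding what to do next. The following is a minimal Python sketch, assuming the result arrives as a dictionary keyed by the field names above with null mapped to `None`. The helper `summarize` and the sample values are hypothetical. How batches are actually fetched with the cursor is not described here, so the sketch only reports whether a cursor is present.

```python
def summarize(result: dict) -> dict:
    """Condense a CrawlMdrResult-shaped dictionary into a few decision-ready values."""
    failed = result["FailedDownloadTaskCount"]
    succeeded = result["SuccessfulDownloadTaskCount"]
    total = failed + succeeded
    cursor = result.get("DataCursor")  # None when no data was scraped
    return {
        "total": total,
        "failure_rate": failed / total if total else 0.0,
        "failed_ids": list(result["FailedDownloadTaskIds"]),
        "has_data": cursor is not None,
        # NextCursor is None once there are no further batches.
        "has_next_batch": cursor is not None and cursor.get("NextCursor") is not None,
    }


# Hypothetical sample result.
sample = {
    "FailedDownloadTaskIds": ["task-3"],
    "FailedDownloadTaskCount": 1,
    "SuccessfulDownloadTaskCount": 3,
    "DataCursor": {"JobId": "job-1", "NextCursor": "abc123"},
}

print(summarize(sample))
```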