# Release v2.1

# WDS API Server

# Deployment Methods

# Overview

Choose the deployment that fits your environment and scale — from a single Docker container for quick trials to Kubernetes with Helm for resilient, multi‑service production. Deployment options are not tied to specific WDS versions and may evolve over time.

## Options at a Glance

- Docker: fastest path to run WDS in a single container via Solidstack. Great for quick local trials and demos.
- Docker Compose: recommended for evaluation and development; includes Docs and Playground services out of the box.
- Helm (Kubernetes): best for staging and production; supports single‑service or multi‑service mode, scaling, and high availability.
- Air‑Gapped: mirror images to a private registry, then deploy via Helm or Compose without internet access.

## When to Use Which

- Use Docker if you want the simplest setup to validate APIs quickly on your laptop. Note: this mode omits the Docs and Playground services.
- Use Docker Compose if you need a complete local stack for development, examples, and iterative testing.
- Use Helm if you need resilience, scaling, and separation of concerns across services (recommended for non‑trivial environments).
- Use Air‑Gapped if your environment has no outbound network access and you must pre‑stage images in a private registry.

## Prerequisites (Highlights)

- Docker/Docker Compose: a working Docker installation; a MongoDB connection string for Solidstack and Compose options that provision the DB.
- Helm: a Kubernetes cluster, the Helm CLI (or Terraform), and a MongoDB connection string.
- Air‑Gapped: a private registry and permissions to push/pull images; use the provided scripts to mirror images.
## Detailed Guides

- [Docker](/releases/latest/server/deployments/docker.html)
- [Docker Compose](/releases/latest/server/deployments/dockercompose.html)
- [Helm Chart](/releases/latest/server/deployments/helm.html)
- [Air-Gapped](/releases/latest/server/deployments/airgapped.html)

# Overview

The WDS API Server powers scalable web crawling and data extraction. It discovers pages, downloads content (with proxy/cookie/HTTPS controls), and scrapes structured fields — all exposed via a simple REST API and an optional MCP server for IDE/agent workflows.

## Services

- Dapi: public REST API and job/task orchestration
- Datakeeper: durable storage and cache of downloaded pages
- Crawler: high‑throughput HTTP downloader with throttling and retries
- Scraper: selector‑driven extraction (text or attributes)
- Idealer: consistent ID generation for jobs and tasks
- Retriever: retrieval service (vector search)
- Solidstack: single‑container bundle for fast local trials

## Deployment Options

Pick the setup that matches your environment: see [Deployment Methods](./deployments/index.html)

- Docker: run Solidstack in one container for quick evaluation
- Docker Compose: recommended for dev; includes Playground and Docs
- Helm (Kubernetes): staging/production, scaling, and resilience
- Air‑Gapped: mirror images to a private registry and deploy offline

## Developer Entry Points

- REST API docs and try‑it: `/api/swagger` (see [API](./api/index.html))
- MCP Server (for IDE/agents): `/mcp` (see [MCP](../mcp/index.html))
- Playground test site (if deployed): `/playground/`
- Local documentation (if deployed): `/docs/`

## Next Steps

- Choose a deployment: [Deployments](./deployments/index.html)
- Explore capabilities and endpoints: [API](./api/index.html)
- Understand the architecture: [Services](./services/index.html)

# API

# Overview

The WDS REST API lets you configure crawl/scrape jobs, discover pages, extract data, and monitor task execution — all via simple, versioned HTTP endpoints.
## Base URL

- The base URL depends on your [deployment method](../deployments/index.html).
- Example (Docker): `http://localhost:2807`
- Endpoints live under `/api/{version}` by default (see the links below for concrete routes). In [Helm](../deployments/helm.html) deployments, you can add a base‑path prefix via `global.ingress.basePath`.

## Explore the API

- Swagger UI: browse and try endpoints interactively at `/api/swagger`.
- Playground: if deployed, use the test site at `/playground/` for predictable, repeatable examples.

## Key Resources

- Jobs: start a job with a `JobConfig`, receive initial `DownloadTask`s.
  - Reference: jobs overview and start endpoint in `../jobs.html`.
- Tasks: operate on tasks to continue the crawl or extract data.
  - Crawl: discover follow‑up pages and return new `DownloadTask`s.
  - Scrape: extract text/attributes from a page.
  - Scrape Multiple: batch multiple extractions in one request.
  - Info: get `DownloadTaskStatus` (state, errors, request/response details).
  - Reference: task endpoints in `../tasks.html`.

## Typical Flow

- Start: POST Jobs Start with `JobConfig` -> returns initial `DownloadTask`s (one per Start URL).
- Crawl: GET Tasks Crawl with a task + selector -> returns more `DownloadTask`s.
- Scrape: GET Tasks Scrape (or Scrape Multiple) with a task + selector(s) -> returns extracted values.
- Monitor: GET Tasks Info for `DownloadTaskStatus` to check progress and results.

## Documentation

- [Quickstart](/releases/latest/server/api/quickstart.html)
- [Jobs](/releases/latest/server/api/jobs.html)
- [Tasks](/releases/latest/server/api/tasks.html)
- [Tenants](/releases/latest/server/api/tenants.html)
- [Retrieval](/releases/latest/server/api/retrieval.html)

# Quickstart — up and running in a couple of minutes

Deploy WDS and run your first crawl/scrape entirely in Swagger UI.
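If you prefer code to Swagger UI, the typical flow above maps onto a handful of HTTP calls. Below is a minimal, stdlib-only Python sketch of that sequence. The `/api/v2/...` routes and the Docker base URL are taken from the reference pages above, but the `WdsClient` class itself is illustrative, not an official SDK — verify the exact routes in your deployment's Swagger UI.

```python
import json
import urllib.parse
import urllib.request


class WdsClient:
    """Illustrative sketch of the start -> crawl -> scrape flow."""

    def __init__(self, base_url: str):
        # e.g. "http://localhost:2807" for a local Docker deployment
        self.base_url = base_url.rstrip("/")

    def _url(self, path: str, **query: str) -> str:
        # Endpoints live under /api/{version}; v2 is used in the reference pages.
        url = f"{self.base_url}/api/v2/{path}"
        if query:
            url += "?" + urllib.parse.urlencode(query)
        return url

    def start_job(self, job_name: str, job_config: dict) -> list:
        # POST /api/v2/jobs/{jobName}/start -> array of initial DownloadTask items
        req = urllib.request.Request(
            self._url(f"jobs/{job_name}/start"),
            data=json.dumps(job_config).encode(),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def crawl(self, task_id: str, selector: str) -> list:
        # GET /api/v2/tasks/{taskId}/crawl -> array of follow-up DownloadTask items
        with urllib.request.urlopen(
            self._url(f"tasks/{task_id}/crawl", selector=selector)
        ) as resp:
            return json.load(resp)

    def scrape(self, task_id: str, selector: str) -> list:
        # GET /api/v2/tasks/{taskId}/scrape -> array of extracted string values
        with urllib.request.urlopen(
            self._url(f"tasks/{task_id}/scrape", selector=selector)
        ) as resp:
            return json.load(resp)
```

A job started this way can then be driven exactly as in the quickstart: crawl with a link selector, then scrape the resulting tasks.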
## Prerequisites

- Docker installed and running
- WDS deployed following the guide: [Deploying WDS API Server in Docker Compose](../deployments/dockercompose.html) (using the '`BOX (Free)`' option)

Once deployed, the API is available at: `http://localhost:2807`

## Step 1 — Open Swagger UI

Open: `http://localhost:2807/api/swagger`

You’ll use three endpoints:

- Jobs -> `start` — create a job and get the initial task
- Tasks -> `crawl` — discover follow-up pages (links) from a page
- Tasks -> `scrape-multiple` — extract data from a page

## Step 2 — Start a job

In Swagger UI:

1. Expand Jobs -> POST `start`, then click “Try it out”.
2. Path parameter `jobName`: enter `playground` (or any unique name).
3. Request body:

```json
{
  "StartUrls": ["http://playground"],
  "Type": "Intranet"
}
```

4. Click “Execute”.

Response: `200 OK` returns an array of DownloadTask items. Copy one `id` value (this is your first page task).

## Step 3 — Discover pages (Crawl)

In Swagger UI:

1. Expand Tasks -> GET `crawl`, then click “Try it out”.
2. Path parameter `taskId`: paste the `id` from the Start response.
3. Query parameter `selector`: enter `css: a[href*='/cloak_of_the_phantom.html']` to target one of the product pages.
4. Leave `attributeName` empty (defaults to `href`).
5. Click “Execute”.

Response: `200 OK` returns an array of new DownloadTask items (in this example, a single item). Copy its `id` value for the scraping step.

## Step 4 — Extract content (Scrape)

In Swagger UI:

1. Expand Tasks -> POST `scrape-multiple`, then click “Try it out”.
2. Path parameter `taskId`: paste the selected task id from Step 3.
3. Request body:

```json
[
  { "name": "Title", "selector": "css: h1" },
  { "name": "Price", "selector": "css: div.price span" },
  { "name": "Description", "selector": "css: div.desc p" }
]
```

4. Click “Execute”.

Response: `200 OK` returns an array of objects with values for each field.
For example:

```json
[
  { "name": "Title", "values": [ "Cloak of the Phantom" ] },
  { "name": "Price", "values": [ "100 Fairy Coins" ] },
  { "name": "Description", "values": [ "Made from the feathers of a phoenix, it grants the power of rebirth." ] }
]
```

You’ve successfully extracted data — all within Swagger UI.

## Conclusion

That’s it — deploy, start, crawl, and scrape using only Swagger UI. For more, see the full [API docs](./index.html) and [Services](../services/index.html).

# Jobs

Create and control crawl/scrape jobs: start with a configuration and receive initial tasks to drive further processing.

## Info

Returns existing jobs info.

`GET /api/v2/jobs/info`

### Responses

#### 200 (Ok)

Returns array of [JobInfo](#jobinfo)

##### JobInfo

General information about a job

Fields:

| Name | Type | Description |
| --- | --- | --- |
| JobId | string | **Required.** Job ID |
| JobName | string | **Required.** Job Name |
| Host | string | **Required.** Web resource host |
| StartDateUtc | datetime | Optional. Job start date (UTC) |
| CompleteDateUtc | datetime | Optional. Job complete date (UTC) |

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information

## Start

Creates or updates a job with the provided configuration and enqueues initial download tasks for the specified start URLs.

`POST /api/v2/jobs/{jobName}/start`

### Path Parameters

| Name | Type | Description |
| --- | --- | --- |
| jobName | string | **Required.** Unique job name, used to identify the job in the system; the domain name is often used (e.g., example.com) |

### Request Body

[JobConfig](#jobconfig)

#### JobConfig

Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).

Fields:

| Name | Type | Description |
| --- | --- | --- |
| StartUrls | array of Strings | **Required.** Initial URLs. Crawling entry points |
| Type | [JobTypes](#jobtypes) | Optional. Job type |
| Headers | [HeadersConfig](#headersconfig) | Optional. Headers settings |
| Restart | [RestartConfig](#restartconfig) | Optional. Job restart settings |
| Https | [HttpsConfig](#httpsconfig) | Optional. HTTPS settings |
| Cookies | [CookiesConfig](#cookiesconfig) | Optional. Cookies settings |
| Proxy | [ProxiesConfig](#proxiesconfig) | Optional. Proxy settings |
| DownloadErrorHandling | [DownloadErrorHandling](#downloaderrorhandling) | Optional. Download error handling settings |
| CrawlersProtectionBypass | [CrawlersProtectionBypass](#crawlersprotectionbypass) | Optional. Crawler-protection countermeasure settings |
| CrossDomainAccess | [CrossDomainAccess](#crossdomainaccess) | Optional. Cross-domain access settings |
| RetrievalConfig | [RetrievalConfig](#retrievalconfig) | Optional. Retrieval settings |

#### JobTypes

> **_NOTE!_** Possible value restrictions and the default value for all jobs can be configured in the Dapi service.

> **_NOTE!_** The Crawler service should be correctly configured to handle jobs of different types.

Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.
Enumeration values:

| Name | Description |
| --- | --- |
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits |

#### HeadersConfig

Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, custom headers, etc.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| DefaultRequestHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers that will be sent with each request |

#### HttpHeader

Represents a single HTTP header definition with a name and one or more values.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Name | string | **Required.** Header name |
| Values | array of String | **Required.** Header values |

#### RestartConfig

Controls what happens when a job restarts: continue from cached state or rebuild from scratch.
Fields:

| Field | Type | Description |
| --- | --- | --- |
| JobRestartMode | [JobRestartModes](#jobrestartmodes) | **Required.** Job restart mode |

#### JobRestartModes

Describes restart strategies and their effect on previously cached data.

Enumeration values:

| Name | Description |
| --- | --- |
| Continue | Reuse cached data and continue crawling and parsing new data |
| FromScratch | Clear cached data and start from scratch |

#### HttpsConfig

Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| SuppressHttpsCertificateValidation | bool | **Required.** Suppress HTTPS certificate validation of a web resource |

#### CookiesConfig

Controls cookie persistence between requests to maintain sessions or state across navigations.
Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseCookies | bool | **Required.** Save and reuse cookies between requests |

#### ProxiesConfig

Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseProxy | bool | **Required.** Use proxies for requests |
| SendOvertRequestsOnProxiesFailure | bool | **Required.** Send a request from the host's real IP address if all proxies failed |
| IterateProxyResponseCodes | string | Optional. Comma-separated HTTP response codes to iterate proxies on. Default: '401, 403' |
| Proxies | array of [ProxyConfig](#proxyconfig) | Optional. Proxy configurations. Default: empty array |

#### ProxyConfig

Defines an individual proxy endpoint and its connection characteristics.
Fields:

| Field | Type | Description |
| --- | --- | --- |
| Protocol | string | **Required.** Proxy protocol (http, https, socks5) |
| Host | string | **Required.** Proxy host |
| Port | int | **Required.** Proxy port |
| UserName | string | Optional. Proxy username |
| Password | string | Optional. Proxy password |
| ConnectionsLimit | int | Optional. Max concurrent connections |
| AvailableHosts | array of String | Optional. Hosts accessible via this proxy |

#### DownloadErrorHandling

Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.
Fields:

| Field | Type | Description |
| --- | --- | --- |
| Policy | [DownloadErrorHandlingPolicies](#downloaderrorhandlingpolicies) | **Required.** Error handling policy (Skip, Retry) |
| RetryPolicyParams | [RetryPolicyParams](#retrypolicyparams) | Optional. Retry params |

##### DownloadErrorHandlingPolicies

Available strategies for handling request or network failures during content download.

Enumeration values:

| Name | Description |
| --- | --- |
| Skip | Skip an error and continue crawling |
| Retry | Try again |

##### RetryPolicyParams

Specifies how the crawler performs retries.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| RetriesLimit | int | **Required.** Max retries |
| RetryDelayMs | int | **Required.** Delay before retry in ms |

#### CrawlersProtectionBypass

Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| MaxResponseSizeKb | int | Optional. Max response size in KB |
| MaxRedirectHops | int | Optional. Max redirect hops |
| RequestTimeoutSec | int | Optional. Max request timeout in seconds |
| CrawlDelays | array of [CrawlDelay](#crawldelay) | Optional. Crawl delays for hosts |

#### CrawlDelay

Per-host throttling rule to space out requests and respect site limits or robots guidance.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Host | string | **Required.** Host |
| Delay | string | **Required.** Delay value (0, 1-5, robots) |

#### CrossDomainAccess

Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Policy | [CrossDomainAccessPolicies](#crossdomainaccesspolicies) | **Required.** Cross-domain policy (None, Subdomains, CrossDomains) |

##### CrossDomainAccessPolicies

Domain scoping modes that determine which hosts are considered in-bounds while crawling.

Enumeration values:

| Name | Description |
| --- | --- |
| None | No subdomain or cross-domain access. Only the main domain is allowed |
| Subdomains | The subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
| CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |

#### RetrievalConfig

Configuration for enrolling crawled pages into a vector index for later vector search. RetrievalConfig controls what gets embedded and how enrollment behaves. Retrieval is part of the RAG pipeline.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| EnrollInIndex | bool | **Required.** Enroll crawled pages into the vector index |
| Force | bool | **Required.** Whether already existing data in the index should be overridden |
| MaxTokensPerChunk | int | Optional. Maximum tokens per chunk. Default: 512 |
| ContentScopes | array of [RetrievalContentScope](#retrievalcontentscope) | Optional. Selectors for page content to enroll. Default: entire page |
| EnrollmentWaitMode | [RetrievalEnrollmentWaitMode](#retrievalenrollmentwaitmode) | Optional. Enrollment wait mode. Default: Eventually |

##### RetrievalContentScope

Defines which parts of which pages are enrolled, using URL path matching and selectors. This lets you enroll only meaningful blocks (e.g., product descriptions, docs body) and ignore noise (menus, footers, ads).

Fields:

| Field | Type | Description |
| --- | --- | --- |
| PathPattern | string | **Required.** URL path pattern (case sensitive). See the examples below for details |
| [Selector](#selector-format) | string | **Required.** Selector for getting interesting data on a web page |

PathPattern examples:

| URL | Pattern | Matches |
| --- | --- | --- |
| https://example.com/path/to/resource | * | Yes |
| https://example.com/path/to/resource | /* | Yes |
| https://example.com/path/to/resource | /path/to/resource | Yes |
| https://example.com/path/to/resource | /path/to/* | Yes |
| https://example.com/path/to/resource | /path/*/resource | Yes |
| https://example.com/path/to/resource | /**/res* | Yes |
| https://example.com/path/to/resource | /res* | No |
| https://example.com/path/to/resource | /path/to/RESOURCE | No |

##### Selector Format

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second part should be a selector of the corresponding type.

Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

##### RetrievalEnrollmentWaitMode

Specifies whether to wait for each crawled document to be enrolled into the index.

Enumeration values:

| Name | Description |
| --- | --- |
| Eventually | Don't wait. Queue for enrollment; the index catches up asynchronously. FAST |
| WaitEach | Wait for each document. Logs an error if not enrolled within 1 minute. SLOW |
| WaitJob | Wait for all document enrollments when the entire job is completed. FAST |

### Responses

#### 200 (Ok)

Job has been successfully inserted or updated

Returns array of [DownloadTask](#downloadtask)

##### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Id | string | **Required.** Task Id |
| Url | string | **Required.** Page URL |

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information

## Config

Returns the job config of an existing job.

`GET /api/v2/jobs/{jobName}/config`

### Responses

#### 200 (Ok)

Returns [JobConfig](#jobconfig)

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information

#### 404 (Not Found)

Job not found

## Fetch

Fetches the content of a page with the provided URL within a configured job, taking into account all of its settings. Can be run only after the job has been started.

`GET /api/v2/jobs/{jobName}/fetch`

### Path Parameters

| Name | Type | Description |
| --- | --- | --- |
| jobName | string | **Required.** The name of an existing job. |

### Query Parameters

| Name | Type | Description |
| --- | --- | --- |
| url | string | **Required.** A page URL. |

### Responses

#### 200 (Ok)

Page data has been successfully fetched. Returns the page's HTML.

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information

#### 404 (Not Found)

Job not found

# Tasks

Work with download tasks within a job: discover new pages (crawl), extract data (scrape), and inspect status/results.
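Task endpoints respond with 202 (Accepted) while a task is still queued, and clients are expected to retry until a different status arrives. A minimal sketch of that retry loop, decoupled from any HTTP library by injecting a `fetch` callable; the helper name, attempt limit, and delay are assumptions for illustration, not documented values:

```python
import time


def poll_task(fetch, max_attempts: int = 30, delay_s: float = 1.0):
    """Call `fetch` until it stops returning 202 (Accepted).

    `fetch` is any zero-argument callable returning (status_code, body),
    e.g. a wrapper around GET /api/v2/tasks/{taskId}/crawl.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 202:
            # 200 with results, or a 4xx error to surface to the caller.
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(delay_s)
    raise TimeoutError(f"task still queued after {max_attempts} attempts")
```

A fixed delay keeps the sketch short; in practice an exponential backoff would reduce load on the server while a large crawl queue drains.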
## Crawl

Discovers and queues follow-up pages from the current task’s URL (e.g., pagination and links), returning new download tasks to continue the crawl.

`GET /api/v2/tasks/{taskId}/crawl`

### Path Parameters

| Name | Type | Description |
| --- | --- | --- |
| taskId | string | **Required.** A task ID returned by previous calls |

### Query Parameters

| Name | Type | Description |
| --- | --- | --- |
| [selector](#selector-format) | string | **Required.** Selector for getting interesting links on a web page |
| attributeName | string | Optional. Attribute name to get data from. Use `val` to get inner text. Default value: `href` |
| maxDepth | int | Optional. Maximum depth for crawling based on the URL path ('example.com' = 0, 'example.com/index.html' = 0, 'example.com/path/' = 1, etc.). A non-negative integer value. If null, there is no limit on the depth |

### Responses

#### 200 (OK)

Page data processed successfully

Returns array of follow-up [DownloadTask](#downloadtask)

##### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Id | string | **Required.** Task Id |
| Url | string | **Required.** Page URL |

#### 202 (Accepted)

Task has been queued and is awaiting execution.
Retry the request later, repeating until a response other than 202 (Accepted) is received #### 400 (Bad Request) Invalid request parameters. Refer to the response text for more information #### 403 (Forbidden) Access restricted. Refer to the response text for more information #### 404 (Not Found) Task not found #### 422 (Unprocessable Content) There is an issue with processing the page content. Refer to the response text for more information --- ## Scrape Extracts data from the current page using the provided selector (and optional attribute), returning the matched text or attribute values. `GET /api/v2/tasks/{taskId}/scrape` ### Path Parameters | Name | Type | Description | | ------ | ------ | -------------------------------------------------- | | taskId | string | **Required.** A task ID returned by previous calls | | ------ | ------ | -------------------------------------------------- | ### Query Parameters | Name | Type | Description | | ---------------------------- | ------ | ------------------------------------------------------------ | | [selector](#selector-format) | string | **Required.** Selector for getting interesting data on a web page | | ---------------------------- | ------ | ------------------------------------------------------------ | | attributeName | string | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text | | ---------------------------- | ------ | ------------------------------------------------------------ | | convert | string | Optional. A data conversion function to apply to the scraped data. If not specified, no conversion will be applied. 
Available functions: `md()` - convert to markdown format, `sr()` - apply the Mozzila Readability algorithm to try to extract the main content of the page | | ---------------------------- | ------ | ------------------------------------------------------------ | ### Responses #### 200 (OK) Page data processed successfully Returns an array of strings with all data items found on a page according to the selector #### 202 (Accepted) Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received #### 400 (Bad Request) Invalid request parameters. Refer to the response text for more information #### 403 (Forbidden) Access restricted. Refer to the response text for more information #### 404 (Not Found) Task not found #### 422 (Unprocessable Content) There is an issue with processing the page content. Refer to the response text for more information --- ## Scrape Multiple Extracts data from the current page using the provided selector (and optional attribute), returning the matched text or attribute values. 
`GET /api/v2/tasks/{taskId}/scrape-multiple`

### Path Parameters

| Name   | Type   | Description                                        |
| ------ | ------ | -------------------------------------------------- |
| taskId | string | **Required.** A task ID returned by previous calls |

### Query Parameters

| Name | Type | Description |
| ---- | ---- | ----------- |
| scrapeParams | array of [ScrapeParams](#scrapeparams) | **Required.** Scraping parameters |

#### ScrapeParams

| Field | Type | Description |
| ----- | ---- | ----------- |
| name | string | **Required.** A name to find the corresponding scrape result in the response |
| [selector](#selector-format) | string | **Required.** Selector for getting interesting data on a web page |
| attributeName | string | Optional. Attribute name to get data from. Use `val` or leave null to get inner text |
| convert | string | Optional. A data conversion function to apply to the scraped data. If not specified, no conversion is applied. Available functions: `md()` - convert to markdown format; `sr()` - apply the Mozilla Readability algorithm to try to extract the main content of the page |

#### Query Example

```
GET /api/v2/tasks/{taskId}/scrape-multiple?scrapeParams[0].name=name&scrapeParams[0].selector=css:%20h1&scrapeParams[0].attributeName=val&scrapeParams[1].name=params&scrapeParams[1].selector=css:%20b
```

### Responses

#### 200 (OK)

Page data processed successfully.

Returns an array of [ScrapeResult](#scraperesult) objects.

##### ScrapeResult

| Field | Type | Description |
| ------ | --------------- | ------------------------------------------------------------------------------ |
| name | string | **Required.** The name specified in the request ScrapeParams |
| values | array of string | **Required.** Data extracted from the page according to the specified selector |

#### 202 (Accepted)

Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received.

#### 400 (Bad Request)

Invalid request parameters. Refer to the response text for more information.

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information.

#### 404 (Not Found)

Task not found.

#### 422 (Unprocessable Content)

There is an issue with processing the page content. Refer to the response text for more information.

---

## Scrape Multiple Body

Extracts several named fields from the current page in a single call, accepting the scraping parameters in the request body.
`POST /api/v2/tasks/{taskId}/scrape-multiple`

This method performs the same function as [Scrape Multiple](#scrape-multiple), but accepts ScrapeParams in the request body instead of serialized as query parameters.\
Not all reverse proxies pass request bodies when the method is GET, so the POST method is used here. This is a reasonable trade-off.

### Path Parameters

| Name   | Type   | Description                                        |
| ------ | ------ | -------------------------------------------------- |
| taskId | string | **Required.** A task ID returned by previous calls |

### Request Body

Array of [ScrapeParams](#scrapeparams) in JSON format

### Responses

#### 200 (OK)

Page data processed successfully.

Returns an array of [ScrapeResult](#scraperesult) objects.

#### 202 (Accepted)

Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received.

#### 400 (Bad Request)

Invalid request parameters. Refer to the response text for more information.

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information.

#### 404 (Not Found)

Task not found.

#### 422 (Unprocessable Content)

There is an issue with processing the page content. Refer to the response text for more information.

---

## Info

Retrieves the current status and execution trace for a download task, including errors and links to result details when available.
`GET /api/v2/tasks/{taskId}/info`

### Path Parameters

| Name   | Type   | Description                                        |
| ------ | ------ | -------------------------------------------------- |
| taskId | string | **Required.** A task ID returned by previous calls |

### Responses

#### 200 (OK)

Download task status found.

Returns [DownloadTaskStatus](#downloadtaskstatus).

##### DownloadTaskStatus

Summarizes the execution state and outputs of a single download operation, including the current status, any error, and final or intermediate results.

Fields:

| Name | Type | Description |
| ---- | ---- | ----------- |
| Error | string | Optional. Request execution error |
| TaskState | [DownloadTaskStates](#downloadtaskstates) | Optional. Task state |
| Result | [DownloadInfo](#downloadinfo) | Optional. Download result |
| intermedResults | array of [DownloadInfo](#downloadinfo) | Optional. Stack of download results for intermediate requests |

##### DownloadTaskStates

Lifecycle states a download task can transition through from creation to completion or deletion.

Enumeration values:

| Name | Description |
| ---- | ----------- |
| Handled | Task is handled and its results are available |
| AccessDeniedForRobots | Access to a URL is denied by robots.txt |
| AllRequestGatesExhausted | All request gateways (proxy and host IP addresses) were exhausted but no data was received |
| InProgress | Task is in progress |
| Created | Task has not been started yet |
| Deleted | Task has been deleted |

##### DownloadInfo

Captures request/response details for a download attempt, including HTTP metadata, headers, cookies, and payload.

Fields:

| Name | Type | Description |
| ---- | ---- | ----------- |
| Method | string | **Required.** HTTP method |
| Url | string | **Required.** Request URL |
| IsSuccess | bool | **Required.** Whether the request was successful |
| HttpStatusCode | int | **Required.** [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) |
| ReasonPhrase | string | **Required.** HTTP reason phrase |
| RequestHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers sent with the request |
| ResponseHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers received in the response |
| RequestCookies | array of [Cookie](#cookie) | **Required.** Cookies sent with the request |
| ResponseCookies | array of [Cookie](#cookie) | **Required.** Cookies received in the response |
| RequestDateUtc | datetime | **Required.** Request date and time in UTC |
| DownloadTimeSec | double | **Required.** Download time in seconds |
| ViaProxy | bool | **Required.** Whether the request was made via a proxy |
| WaitTimeSec | double | **Required.** The delay (in seconds) applied before the request was executed (crawl latency, etc.) |
| CrawlDelaySec | int | **Required.** The crawl delay in seconds applied to the request |

##### HttpHeader

Represents a single HTTP header with a name and one or more values.
Fields:

| Name | Type | Description |
| ------ | --------------- | --------------------------- |
| Name | string | **Required.** Header name |
| Values | array of string | **Required.** Header values |

##### [Cookie](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies)

Represents an HTTP cookie as sent via the Set-Cookie/Cookie headers, including its attributes.

Fields:

| Name | Type | Description |
| ---- | ---- | ----------- |
| Name | string | **Required.** [Name](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#attributes) |
| Value | string | **Required.** [Value](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#attributes) |
| Domain | string | **Required.** [Domain](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#domaindomain-value) |
| Path | string | **Required.** [Path](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#pathpath-value) |
| HttpOnly | bool | **Required.** [HttpOnly](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#httponly) |
| Secure | bool | **Required.** [Secure](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#secure) |
| Expires | datetime | Optional. [Expires](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#expiresdate) |

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information.

#### 404 (Not Found)

Task not found.

---

## Selector Format

The selector argument has the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second part is a selector of that type.

Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

# Tenants

Manage tenants that scope data and configuration; delete a tenant when deprovisioning environments or consolidating accounts.

## Delete

Removes a tenant and all of its data across databases and caches; use with care, as this operation is irreversible.

`DELETE /api/v2/tenants/{tenantId}`

### Path Parameters

| Name     | Type   | Description               |
| -------- | ------ | ------------------------- |
| tenantId | string | **Required.** A tenant ID |

### Responses

#### 200 (OK)

Tenant deleted.

# Retrieval

Built‑in search (FullText and Vector) that turns everything you crawl into answers. Retrieval makes your crawled content instantly searchable with natural‑language queries.
As WDS discovers pages, it adds them to a full‑text index (Lucene) and creates embeddings stored in a vector index, so you can retrieve the most relevant snippets — across a single job or your entire tenant — and plug them straight into RAG workflows.

By default, WDS is configured to use the [Gemma](https://deepmind.google/models/gemma/) embedding model to generate high‑quality vector representations (embeddings) for indexed content.

To automatically enroll crawled pages into the indexes, make sure your crawling jobs are properly [configured](./jobs.html#retrievalconfig) to enable this feature.

Which indexes to use (Full‑Text, Vector, or both) can be configured at the service level. See the [Solidstack](../services/solidstack.html) and [Retriever](../services/retriever.html) service configurations for reference.

## Query

### Job‑scoped Search

`GET /api/retrieval/v1/{jobName}/query`

#### Path Parameters

| Name    | Type   | Description |
| ------- | ------ | ----------- |
| jobName | string | **Required.** Unique job name used to identify the job in the system; the domain name is often used (e.g., example.com) |

### Tenant‑wide Search

`GET /api/retrieval/v1/query`

### Query Parameters (Job‑scoped and Tenant‑wide Search)

| Name | Type | Description |
| ---- | ---- | ----------- |
| q | string | **Required.** The natural‑language query to match against indexed content. |
| limit | int | Optional. Maximum number of results to return. Default: 5. |
| threshold | string | Optional. Minimum relevance score using [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Default: [same-domain](#similarity-thresholds). |

##### Similarity Thresholds

Choose a preset for quick, predictable relevance — or provide a numeric value. Presets map to cosine similarity scores.

| Name | When to use |
| ---- | ----------- |
| exact-match | The query and result describe essentially the same thing (exact term or strong synonym). |
| same-category | Not identical, but clearly the same family/category and very relevant. |
| same-domain | Topically aligned within the same thematic domain; balanced recall vs. precision. |
| generic-similarity | Broad lexical similarity; maximize recall when you’ll filter results later. |

### Responses

#### 200 (OK)

Returns an array of [RetrievalItem](#retrievalitem) objects.
##### RetrievalItem

| Field | Type | Description |
| ----- | ---- | ----------- |
| Span | string | **Required.** Text span - the found text with its surrounding semantic context. |
| Score | float | **Required.** Relevance score. |
| DownloadTasks | array of [DownloadTaskInfo](#downloadtaskinfo) | **Required.** Download tasks containing this text span. |

##### DownloadTaskInfo

| Field | Type | Description |
| -------------- | ------ | ---------------------------------------------------------------------- |
| DownloadTaskId | string | **Required.** The download task ID where this content was captured. |
| Url | string | **Required.** Source page URL. |
| CaptureDateUtc | date | **Required.** Capture timestamp in UTC. |

#### 403 (Forbidden)

Access restricted. Refer to the response text for more information.

#### 404 (Not Found)

The specified job was not found.
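A sketch of how a client might build a job-scoped retrieval query URL for the endpoint above; the base URL and the helper function are illustrative assumptions, not part of WDS:

```python
from urllib.parse import urlencode

# Hypothetical helper: builds a job-scoped retrieval query URL.
# `threshold` accepts a preset name (e.g. "same-domain") or a numeric
# cosine-similarity value serialized as a string.
def retrieval_query_url(base, job_name, q, limit=5, threshold="same-domain"):
    params = urlencode({"q": q, "limit": limit, "threshold": threshold})
    return f"{base}/api/retrieval/v1/{job_name}/query?{params}"

url = retrieval_query_url("http://localhost:8080", "example.com",
                          "how do I configure proxies", limit=3)
# The resulting URL can then be issued with any HTTP client.
```

`urlencode` takes care of percent-encoding the natural-language query, so free-form questions can be passed as-is.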
# Services

# Overview

The WDS API Server consists of three groups of components:

* [Core Services](#core-services)
* [Feature Services](#feature-services)
* [Auxiliary Services](#auxiliary-services)

## Core Services

Core Services handle the main workload and support two deployment modes:

* Single Service Mode
* Multi Service Mode

### At a Glance

- Dapi: public REST API and orchestration gateway for jobs and tasks
- Datakeeper: durable storage and cache of downloaded pages and metadata
- Crawler: high‑throughput HTTP downloader with proxy, cookie, and HTTPS controls
- Scraper: selector‑driven extraction of text and attributes from pages
- Idealer: unique ID generation and consistency across entities

### Single Service Mode

In this mode, the entire application is packaged into a single service — [Solidstack](./solidstack.html). Only one instance can run at a time. It is ideal for quick evaluation, development, and small test tasks because it is simple to deploy and maintain and consumes fewer resources. The trade‑offs are that it does not scale horizontally, and its availability is tied to a single node; a node restart will cause a service outage until the instance starts again.

### Multi Service Mode

In this mode, the application runs as a set of independent, horizontally scalable services. The core stack comprises [Dapi](./dapi.html), [Datakeeper](./datakeeper.html), [Crawler](./crawler.html), [Scraper](./scraper.html), and [Idealer](./idealer.html); feature services include [Retriever](./retriever.html).

* Scalability: run multiple instances of each service to handle workload spikes and grow with demand.
* Resilience: isolates failures; individual services can restart without taking the whole system down.
* Availability: when deployed to Kubernetes with the [Helm Chart](../deployments/helm.html), the platform tolerates node restarts and continues operating.
* Operations: enables rolling upgrades and resource isolation per service.
This mode is available starting from the Business [plan](https://webdatasource.com/plans.html) and is the recommended setup for staging and production environments.

#### Choosing a Mode

- Choose Single Service (Solidstack) for quick trials, demos, and small non‑critical workloads.
- Choose Multi Service for environments that require scaling, fault isolation, and zero‑downtime operations.

### Third-Party Components

Core Services have one required third‑party dependency: **MongoDB**, which stores all system data.

Optionally, to optimize cost and performance, you can use:

* **S3‑compatible storage** — caches and reuses downloaded web resource pages. If not configured, MongoDB is used for this purpose.

#### MongoDB

Both in-cluster and managed (SaaS) MongoDB deployments work well with the WDS Server.

* Supported versions: **6.x**, **7.x**, **8.x**
* Supported deployments: Atlas, Enterprise, Community

#### S3-Compatible Storage

> NOTE: Available only in Multi Service Mode. See the [Datakeeper](./datakeeper.html) documentation for configuration instructions.

Any S3‑compatible storage can be used, for example:

* AWS S3
* MinIO
* Other compatible services

The [MinIO .NET Client](https://min.io/docs/minio/linux/developers/dotnet/minio-dotnet.html) is used to integrate with S3‑compatible storage.

## Feature Services

Services deployed to provide a specific feature when that feature is licensed and turned on:

- [Retriever](./retriever.html) - performs vector search

## Auxiliary Services

Services that help with testing and evaluation:

- [Playground](./playground.html) - test queries and prompts idempotently in a stable environment
- [Docs](./docs.html) - keep all necessary documentation at hand in case the website is not accessible, for example in [air-gapped](../deployments/airgapped.html) environments.
# Dapi Service Provides the public REST API for WDS — brokering requests between clients and core services, enforcing configuration constraints, and orchestrating jobs. Used by: * [MS SQL CLR Functions](../../mssql/clr-functions/index.html) * [WDS MCP Server](../../mcp/index.html) This service is not open-source but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/dapi/general) ## Configuration The following environment variables are used to configure this service: | Name | Description | | ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | MONGOMONGODB_CONNECTION_STRING | **Required** [MongoDB connection string](https://www.mongodb.com/docs/v6.0/reference/connection-string/) | | ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | MONGODB_DATABASE_NAME | Optional. Sets or overrides the database name. 
| ---------------------------------- | ----------- |
| DATAKEEPER_ORIGIN | **Required.** The [origin](https://developer.mozilla.org/en-US/docs/Web/API/URL/origin) of the [datakeeper service](./datakeeper.html) |
| SCRAPER_ORIGIN | **Required.** The [origin](https://developer.mozilla.org/en-US/docs/Web/API/URL/origin) of the [scraper service](./scraper.html) |
| IDEALER_ORIGIN | **Required.** The [origin](https://developer.mozilla.org/en-US/docs/Web/API/URL/origin) of the [idealer service](./idealer.html) |
| RETRIEVER_ORIGIN | **Required.** The [origin](https://developer.mozilla.org/en-US/docs/Web/API/URL/origin) of the [retriever service](./retriever.html) |
| JOB_TYPES | **Required.** Available [job types](#jobtypes) for job configs |
| LICENSE_KEY | **Required.** A license key. See [Plans](https://webdatasource.com/plans.html) |
| GLOBAL_EXCEPTION_RESPONSE_DELAY_MS | Optional. How long to wait before responding with an HTTP error after a server error occurs. This is useful because some clients cannot delay retries on their side, so the delay is applied on the server side. Default value is 1 second |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |
| FEATURE_FLAG_RETRIEVAL_ENABLED | Optional. Turns the Retrieve feature on or off. Default value is false |

### JobTypes

Supported job execution environments.\
Default and allowed values can be restricted in the Dapi or Solidstack services.\
Ensure the Crawler/Solidstack service is configured to handle the selected job type.

| Name | Description |
| -------- | ------------------------------------------------------------------------------------------------ |
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits |

# Crawler Service

Performs page downloads for running jobs — executing HTTP requests, applying proxy/cookie/HTTPS settings, and honoring crawl delays and throttling.
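A single Crawler instance is typically launched with its required environment wired up as below. This is a sketch only: the `datakeeper` hostname assumes a shared container network (as in the Compose setup), and the license value is a placeholder.

``` BASH
# Sketch: run one crawler for intranet jobs on a shared Docker network.
# Hostnames and the license key are placeholders, not canonical values.
docker run -d --name wds-crawler \
  -e DATAKEEPER_ORIGIN=http://datakeeper \
  -e SERVICE_HOST=wds-crawler \
  -e EXTERNAL_IP_ADDRESS_CONFIGS=intranet \
  -e LICENSE_KEY=<your-license-key> \
  webdatasource/crawler
```

The individual variables are described in the configuration table below.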
This service is not open-source, but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/crawler/general)

## Configuration

The following environment variables are used to configure this service:

| Name | Description |
| ------------------------------- | ----------- |
| DATAKEEPER_ORIGIN | **Required.** The [origin](https://developer.mozilla.org/en-US/docs/Web/API/URL/origin) of the [datakeeper service](./datakeeper.html) |
| SERVICE_HOST | **Required.** The host on which the current service is available |
| EXTERNAL_IP_ADDRESS_CONFIGS | **Required.** A comma-separated list of [external IP getter services](#external-ip-getter-services) |
| LICENSE_KEY | **Required.** A license key. See [Plans](https://webdatasource.com/plans.html) |
| MAX_INACTIVE_SEC_TO_REREGISTRAR | Optional. Each crawler service registers itself at a datakeeper service on start. Sometimes something goes wrong and a crawler might get forgotten. This parameter defines after what period without requests the crawler should remind the datakeeper about itself. Default value is 60 seconds |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |

## External IP getter services

For the Internet [job type](../../mssql/user-defined-types/job-config.html#jobtypes), requests can be sent via [proxies and/or from crawlers' public IP addresses](../../mssql/user-defined-types/proxies-config.html)\
In the latter case, it is essential to know the crawlers' public IP addresses. There are several ways to provide this information:

* **amazon** - ```https://checkip.amazonaws.com``` is used to get the crawler's public IP address
* **directIP** - a particular IP is used. The value should be a valid IP address in the format XX.XX.XX.XX
* **intranet** - reserved value to allow crawlers to be used for Intranet jobs.

### External IP address configs examples

##### Crawler can be used only for Intranet jobs

``` BASH
EXTERNAL_IP_ADDRESS_CONFIGS=intranet
```

##### Crawler can be used for Intranet and Internet jobs

``` BASH
EXTERNAL_IP_ADDRESS_CONFIGS=intranet, amazon
```

##### Crawler can be used for Internet jobs only, and its public IP address is static

``` BASH
EXTERNAL_IP_ADDRESS_CONFIGS=20.21.22.23
```

# Datakeeper Service

Provides durable storage and fast access to job data and cached web pages, enabling efficient scraping and re-use of previously downloaded content.

This service is not open-source, but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/datakeeper/general)

## Configuration

The following environment variables are used to configure this service:

| Name | Description |
| ------------------------- | ----------- |
| MONGODB_CONNECTION_STRING | **Required.** [MongoDB connection string](https://www.mongodb.com/docs/v6.0/reference/connection-string/) with database name |
| CACHE_CONNECTION_STRING | Optional. Cache connection string. See [Caching](#caching) |
| MONGODB_DATABASE_NAME | Optional. Sets or overrides the database name. |
| IDEALER_ORIGIN | **Required.** The [origin](https://developer.mozilla.org/en-US/docs/Web/API/URL/origin) of the [Idealer](./idealer.html) service |
| LICENSE_KEY | **Required.** A license key. See [Plans](https://webdatasource.com/plans.html) |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |

## Caching

All downloaded pages are cached to reduce pressure on web resources. The cached pages can be used to scrape additional data without extra requests to the web resources.\
If a web page is in the cache and the web resource supports cache control (sends `ETag` and/or `Last-Modified` headers in responses), all requests to this page URL will be sent with `If-None-Match` and/or `If-Modified-Since` headers to reuse the cached data if it is still current.

By default, web pages are cached in the system DB (MongoDB). This behavior can be changed by providing the service with a CACHE_CONNECTION_STRING. The following providers are supported:

* MongoDB
* S3 compatible services

### MongoDB

To cache web pages on a MongoDB instance other than the one used as the system database, initialize CACHE_CONNECTION_STRING with a valid [MongoDB connection string](https://www.mongodb.com/docs/v6.0/reference/connection-string/) with database name. For instance:

- `mongodb://<username>:<password>@localhost:27017/<database>`
- `mongodb+srv://<username>:<password>@<cluster>.mongodb.net/<database>`

### S3 compatible services

To cache web pages on an S3 compatible service, initialize CACHE_CONNECTION_STRING with a connection string of the following format:

- `s3://<accessKey>:<secretKey>@<host>:<port>/<bucket>?ssl=(true|false)`

# Scraper Service

Parses downloaded pages to extract text or attribute values according to selectors, returning structured results for downstream processing.

This service is not open-source, but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/scraper/general)

## Configuration

The following environment variables are used to configure this service:

| Name | Description |
| ----------------------- | ------------------------------------------------------------------- |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |
| LICENSE_KEY | **Required.** A license key. See [Plans](https://webdatasource.com/plans.html) |

# Idealer Service

Issues and tracks unique identifiers for entities across services, ensuring consistent references for jobs, tasks, and related records.
This service is not open-source, but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/idealer/general)

## Configuration

The following environment variables are used to configure this service:

| Name | Description |
| ------------------------- | ----------- |
| MONGODB_CONNECTION_STRING | **Required.** [MongoDB connection string](https://www.mongodb.com/docs/v6.0/reference/connection-string/) |
| MONGODB_DATABASE_NAME | Optional. Sets or overrides the database name. |
| LICENSE_KEY | **Required.** A license key. See [Plans](https://webdatasource.com/plans.html) |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |

# Retriever Service

Vector indexing and similarity search backing the [Retrieval API](../api/retrieval.html). Generates embeddings via a configurable HTTP embedding service and stores/searches vectors in MongoDB or a dedicated vector database.
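The embedding call the Retriever makes is a plain HTTP POST: the content to embed is substituted into the request template at the configured JSON path, and vectors are read back from the response at another JSON path. With the defaults, the exchange looks roughly like this. The endpoint URL is a placeholder for whatever embedding service you run; only the template and JSON paths come from the defaults documented below.

``` BASH
# Sketch: the request shape Retriever sends with the default template,
# where $.input has been replaced by the content to embed.
# http://localhost:11434/api/embed is a placeholder endpoint, not a default.
curl -s -X POST http://localhost:11434/api/embed \
  -H 'Content-Type: application/json' \
  -d '{ "model": "embeddinggemma", "input": "Cloak of the Phantom" }'
# Vectors are then read from $.embeddings in the JSON response.
```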
This service is not open-source, but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/retriever/general)

## Configuration

The following environment variables are used to configure this service:

| Name | Description |
| ----------------------------------- | ----------- |
| MONGODB_CONNECTION_STRING | **Required.** [MongoDB connection string](https://www.mongodb.com/docs/v6.0/reference/connection-string/) |
| MONGODB_DATABASE_NAME | Optional. Sets or overrides the database name. |
| LICENSE_KEY | **Required.** A license key. See [Plans](https://webdatasource.com/plans.html) |
| SEARCH_MODE | Optional. Search mode. Supported modes: `FullText`, `Vector`, `FullTextAndVector`. Default: `FullText` |
| EMBEDDING_SERVICE_URL | Optional. Embedding service URL. **Required if SEARCH_MODE is Vector or FullTextAndVector** |
| EMBEDDING_SERVICE_API_KEY | Optional. Embedding service API key, if applicable |
| EMBEDDING_SERVICE_REQUEST_TEMPLATE | Optional. Request body to send to an embedding service. Default: `{ 'model': 'embeddinggemma', 'input': null }` |
| EMBEDDING_SERVICE_CONTENT_JSON_PATH | Optional. JSON path at which to substitute the content into the template. Default: `$.input` |
| EMBEDDING_SERVICE_RESULT_JSON_PATH | Optional. JSON path to the results array in the embedding service response. Default: `$.embeddings` |
| EMBEDDING_VECTORS_LENGTH | Optional. Embedding vector length. Default: 768 (the default value for the embeddinggemma model) |
| SEARCH_MONGODB_CONNECTION_STRING | Optional. [Vector database](#vector-database) connection string. If not specified, the primary MongoDB is used (in this case it must be MongoDB Atlas). |
| SEARCH_DB_DATABASE_NAME | Optional. Sets or overrides the vector database name. |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |

### Vector database

To perform vector search, a vector database is required to store embeddings and execute similarity queries. Supported vector databases:

- [MongoDB Atlas](https://www.mongodb.com/docs/atlas/)

# Solidstack Service

Single-container bundle of all core services for quick evaluation and small workloads.\
Simplifies setup and reduces resource footprint, but lacks horizontal scaling and provides less isolation than multi-service deployments.
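For a quick local trial, the whole stack can be started as one container roughly as follows. This is a sketch: the MongoDB URL and license key are placeholders, the variable names are those listed in the configuration table below, and the `-p 2807:2807` mapping assumes the API listens on 2807, the default port used by the examples elsewhere in these docs.

``` BASH
# Sketch only: placeholder values for a local evaluation run.
docker run -d --name wds-solidstack -p 2807:2807 \
  -e MONGODB_CONNECTION_STRING='mongodb://user:pass@host.docker.internal:27017/wds' \
  -e JOB_TYPES=intranet \
  -e EXTERNAL_IP_ADDRESS_CONFIGS=intranet \
  -e LICENSE_KEY=<your-license-key> \
  webdatasource/solidstack
```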
This service is not open-source, but its image is publicly accessible on [DockerHub](https://hub.docker.com/repository/docker/webdatasource/solidstack/general)

## Configuration

The following environment variables are used to configure this service:

| Name | Description |
| ----------------------------------- | ----------- |
| MONGODB_CONNECTION_STRING | **Required.** [MongoDB connection string](https://www.mongodb.com/docs/v6.0/reference/connection-string/) |
| MONGODB_DATABASE_NAME | Optional. Sets or overrides the database name. |
| JOB_TYPES | **Required.** Available [job types](#jobtypes) for job configs |
| GLOBAL_EXCEPTION_RESPONSE_DELAY_MS | Optional. How long to wait before responding with an HTTP error after a server error occurs. This is useful because some clients cannot delay retries on their side, so the delay is applied on the server side. Default value is 1 second |
| EXTERNAL_IP_ADDRESS_CONFIGS | **Required.** A comma-separated list of [external IP getter services](./crawler.html#external-ip-getter-services) |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is Information |
| FEATURE_FLAG_RETRIEVAL_ENABLED | Optional. Turns the Retrieve feature on or off. **If set to true, only MongoDB Atlas can be used as the system DB, and EMBEDDING_SERVICE_URL must be set.** Default value is false |
| SEARCH_MODE | Optional. Search mode. Supported modes: `FullText`, `Vector`, `FullTextAndVector`. Default: `FullText` |
| EMBEDDING_SERVICE_URL | Optional. Embedding service URL. **Required if FEATURE_FLAG_RETRIEVAL_ENABLED is set to true and SEARCH_MODE is Vector or FullTextAndVector** |
| EMBEDDING_SERVICE_API_KEY | Optional. Embedding service API key, if applicable |
| EMBEDDING_SERVICE_REQUEST_TEMPLATE | Optional. Request body to send to an embedding service. Default: `{ 'model': 'embeddinggemma', 'input': null }` |
| EMBEDDING_SERVICE_CONTENT_JSON_PATH | Optional. JSON path at which to substitute the content into the template. Default: `$.input` |
| EMBEDDING_SERVICE_RESULT_JSON_PATH | Optional. JSON path to the results array in the embedding service response. Default: `$.embeddings` |
| EMBEDDING_VECTORS_LENGTH | Optional. Embedding vector length. Default: 768 (the default value for the embeddinggemma model) |

### JobTypes

Supported job execution environments.\
Default and allowed values can be restricted in the Dapi or Solidstack services.\
Ensure the Crawler/Solidstack service is configured to handle the selected job type.

| Name | Description |
| -------- | ------------------------------------------------------------------------------------------------ |
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits |

# Playground

A companion test site for evaluation and automated checks — provides predictable pages and structures to validate crawling and scraping end‑to‑end.

Inside a docker-compose network, it is available at ```http://playground```, and all examples on this website use the service to provide users with predictable, idempotent query results. Additionally, the playground is available at ```http://localhost:2808```.

This service is a web server with static files in the following structure:

```
playground/
|- index.html
|- about.html
|- faq.html
|- robots.txt
|- sitemap.xml
|- armor_and_accessories/
|  |- 1/
|  |  |- index.html
|  |  |- cloak_of_the_phantom.html
|  |  |- ...
|  |  |- shield_of_the_thunder_god.html
|  |  ...
|  |- 3/
|  |  |- index.html
|  |  |- chalice_of_dreams.html
|  |  |- ...
|  |  |- mask_of_the_forgotten.html
|- beast_and_creature_items/
|  - ...
```

The playground is designed to allow testing of most web crawling scenarios, for instance:

* Tree page navigation
* Graph page navigation
* Paging
* etc.

As for scraping, the leaf pages have the same structure:
<h1>NAME</h1>
<div class="price">
    <b>Price: </b>
    <span>PRICE</span>
</div>
<div class="desc">
    <b>Description:</b>
    <p>DESCRIPTION</p>
</div>
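Given this markup, the three fields map to straightforward selectors (in CSS terms: `h1`, `div.price span`, `div.desc p`). Purely as an illustration of the structure — not how WDS itself scrapes — here is how the fields line up in a flattened copy of such a page, using stock `sed`:

``` BASH
# Illustration only: pull the three fields out of a flattened leaf page.
page='<h1>NAME</h1><div class="price"><b>Price: </b><span>PRICE</span></div><div class="desc"><b>Description:</b><p>DESCRIPTION</p></div>'
name=$(printf '%s' "$page" | sed -n 's/.*<h1>\([^<]*\)<\/h1>.*/\1/p')
price=$(printf '%s' "$page" | sed -n 's/.*<span>\([^<]*\)<\/span>.*/\1/p')
desc=$(printf '%s' "$page" | sed -n 's/.*<p>\([^<]*\)<\/p>.*/\1/p')
echo "$name | $price | $desc"   # NAME | PRICE | DESCRIPTION
```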
Most of the example queries therefore extract these three data items from the pages:

* NAME
* PRICE
* DESCRIPTION

In future versions, the playground might be extended with new data items and subfolders, but this core part will be kept for as long as possible for backward compatibility.

# Docs

Hosts the product documentation alongside the platform, ensuring the right version is always available — especially useful in [air-gapped](../deployments/airgapped.html) environments. Content mirrors the official site for the current release.

This service is available on ```http://localhost:2809```

# MS SQL Server

# CLR Functions

# Overview

WDS MS SQL CLR Functions is a set of functions embeddable into an MS SQL Server instance to crawl through web resources and scrape data from them.\
The functions are packed in a single .NET Framework library. This library is open source, so anyone can review what data it sends, what it receives, and how it handles it - [wds.mssql.clr](https://github.com/webdatasource/wds.mssql.clr)

## Context diagram

The following context diagram shows how this works in general:

![Context diagram](/assets/img/clr-functions/context-diagram.png?fp=UxzORxhteMx6eOBl)

## Table of contents

- [Install](/releases/latest/mssql/clr-functions/install.html)
- [ServerStatus](/releases/latest/mssql/clr-functions/server-status.html)
- [Start](/releases/latest/mssql/clr-functions/start.html)
- [Crawl](/releases/latest/mssql/clr-functions/crawl.html)
- [ScrapeFirst](/releases/latest/mssql/clr-functions/scrape-first.html)
- [ScrapeAll](/releases/latest/mssql/clr-functions/scrape-all.html)
- [ScrapeMultiple](/releases/latest/mssql/clr-functions/scrape-multiple.html)
- [TaskStatus](/releases/latest/mssql/clr-functions/task-status.html)
- [ToStringsTable](/releases/latest/mssql/clr-functions/to-strings-table.html)

# Installing CLR library to MS SQL Server

The following MS SQL Server versions are supported and tested:

- MS SQL Server 2022 running on a Windows machine (Linux-based versions, such as those run in a docker container, do not support CLR functions)

> **NOTE:** This doesn't mean that other versions are not supported at all. WDS just hasn't been tested with them yet, so follow the releases.

To install the CLR functions into an SQL Server instance, on the Releases page choose the [latest](https://github.com/webdatasource/wds.mssql.clr/releases/tag/latest) or a [specific version](https://github.com/webdatasource/wds.mssql.clr/releases) and download its Artifacts.zip archive. This archive contains the following files:

1. **WDS.MsSql.Clr.dll** - .NET Framework 4.8 assembly with the CLR functions
2. **WDS.MsSql.Clr.hash** - a hash of WDS.MsSql.Clr.dll that is required for adding the assembly to an SQL Server instance
3. **WdsClrFunctions.sql** - SQL script that adds the functions to an SQL Server instance
4. **Install.bat** - Windows script that configures and runs WdsClrFunctions.sql against a particular SQL Server instance. This script is idempotent, so it can be run multiple times and all components will be reinstalled from scratch.

The **WdsClrFunctions.sql** and **Install.bat** scripts are available in the [wds.mssql.clr](https://github.com/webdatasource/wds.mssql.clr) repository for evaluation (see the WDS.MsSql.Scripts directory). The **WDS.MsSql.Clr.dll** and **WDS.MsSql.Clr.hash** files are built automatically by GitHub Actions. Nonetheless, these two files can be compiled from the source code (the WDS.MsSql.Clr.hash is created by an after-build script in the WDS.MsSql.Clr project file).

To add CLR functions to an SQL Server instance, the CLR feature must be enabled. To enable it, the WdsClrFunctions.sql script contains the following section:

``` SQL
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
```

There is always the [latest](https://github.com/webdatasource/wds.mssql.clr/releases/tag/latest) release alongside the versioned releases. For evaluation purposes, the latest release is recommended, but it is better to use specific versions in production and perform updates from version to version according to a release process.

## Installation Steps

1. Download an Artifacts.zip archive from one of the releases or the latest one
2. Unarchive Artifacts.zip and enter the unpacked folder
3. Run the Install.bat script and follow the instructions
4. After the script completes successfully, the following message will be shown: ```All done. The MS SQL instance on "server_address" is ready to run WDS CLR functions.```

All [CLR functions](../clr-functions/index.html) and [UDTs](../user-defined-types/index.html) are installed to the ```wds``` namespace by default, and all examples use it. This can be changed manually in WdsClrFunctions.sql if necessary.

### Install.bat configuration

The following environment variables are used to configure the Install.bat script:

| Environment Variable | Default Value | Description |
| -------------------- | --------------- | ---------------------------------- |
| SERVER | localhost | SQL Server instance address |

#### Install.bat run examples with overrides

1. With a custom SQL Server address

``` BASH
cmd /c "SET SERVER=10.11.12.13 && Install.bat"
```

# ServerStatus

Checks the WDS API Server readiness and key indicators, returning a table of status values useful for connectivity and health checks.
## Syntax ``` SQL wds.ServerStatus( serverConfig ) ``` ## Arguments | Name | Type | Description | | ------------ | -------------------------------------------------------- | -------------------------------------- | | serverConfig | [ServerConfig](../user-defined-types/server-config.html) | **Required.** API server configuration | | ------------ | -------------------------------------------------------- | -------------------------------------- | ## Return type ``` SQL TABLE (Name NVARCHAR(255), Value NVARCHAR(MAX), Description NVARCHAR(MAX)) ``` ## Return value A table with the following status indicators: | Name | Value | Description | | -------------------- | --------------- | --------------------------------------- | | Ready | True of False | HTTP Response code with a reason phrase | | -------------------- | --------------- | --------------------------------------- | ## Examples ##### Checking the status of a WDS API Server running on localhost with port 2807 (the default one) ``` SQL DECLARE @serverConfig wds.ServerConfig = 'wds://localhost:2807'; SELECT * FROM wds.ServerStatus(@serverConfig) ``` # Start Starts a new job from the given configuration and returns initial download tasks (one per start URL) to drive subsequent crawl/scrape steps. ## Syntax ``` SQL wds.Start( jobConfig ) ``` ## Arguments | Name | Type | Description | | --------- | -------------------------------------------------- | ------------------------------- | | jobConfig | [JobConfig](../user-defined-types/job-config.html) | **Required.** Job configuration | | --------- | -------------------------------------------------- | ------------------------------- | ## Return type ``` SQL TABLE (Task wds.DownloadTask) ``` ## Return value List of [DownloadTask](../user-defined-types/download-task.html) (one per Start URL) in a form of table. 
## Examples

##### Creating a job and getting initial download tasks

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
SELECT root.Task.Url URL FROM wds.Start(@jobConfig) root
```

| URL                |
| ------------------ |
| http://playground/ |

# Crawl

Discovers and queues follow-up pages from the current task’s URL (e.g., pagination and links) using a selector, returning new download tasks to continue the crawl.

## Syntax

``` SQL
wds.Crawl( downloadTask, selector, [attributeName] )
```

## Arguments

| Name          | Type                                                     | Description                                                        |
| ------------- | -------------------------------------------------------- | ------------------------------------------------------------------ |
| downloadTask  | [DownloadTask](../user-defined-types/download-task.html) | **Required.** A download task from a previous command result set   |
| selector      | string                                                   | **Required.** Selector for getting interesting links on a web page |
| attributeName | string                                                   | Optional. Attribute name to get data from. Use ```val``` to get inner text. Default value: ```href``` |

### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type, the second one should be a selector of the corresponding type.
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

## Return type

``` SQL
TABLE (Task wds.DownloadTask)
```

## Return value

A list of [DownloadTask](../user-defined-types/download-task.html) values (one per found URL) in the form of a table.

## Examples

##### Creating a job and getting download tasks for all sidebar links on the index page of the Playground

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
SELECT nav.Task.Url URL
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: ul.nav a', null) nav
```

| URL                                             |
| ----------------------------------------------- |
| http://playground/                              |
| http://playground/armor_and_accessories/1/      |
| http://playground/beast_and_creature_items/1/   |
| http://playground/elemental_and_nature_items/1/ |
| http://playground/magical_artifacts/1/          |
| http://playground/potions_and_elixirs/1/        |
| http://playground/rings_and_amulets/1/          |
| http://playground/wands_and_staffs/1/           |
| http://playground/weapons/1/                    |

# ScrapeFirst

Returns the first matching value from the page using the selector (optionally an attribute), ideal when only a single item is needed.
## Syntax

``` SQL
wds.ScrapeFirst( downloadTask, selector, [attributeName] )
```

## Arguments

| Name          | Type                                                     | Description                                                       |
| ------------- | -------------------------------------------------------- | ----------------------------------------------------------------- |
| downloadTask  | [DownloadTask](../user-defined-types/download-task.html) | **Required.** A download task from a previous command result set  |
| selector      | string                                                   | **Required.** Selector for getting interesting data on a web page |
| attributeName | string                                                   | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text |

### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type, the second one should be a selector of the corresponding type.
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

## Return type

String

## Return value

Either found data or NULL if nothing found

## Examples

##### Creating a job and getting data from the Cloak of the Phantom page on the [Playground](../../server/services/playground.html)

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
SELECT product.Task.Url as URL,
       wds.ScrapeFirst(product.Task, 'css: h1', null) AS ProductName,
       wds.ScrapeFirst(product.Task, 'css: .price span', null) AS ProductPrice
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: table a[href*="/cloak_of_the_phantom.html"]', null) product
```

| URL                                                                 | ProductName          | ProductPrice    |
| ------------------------------------------------------------------- | -------------------- | --------------- |
| http://playground/armor_and_accessories/1/cloak_of_the_phantom.html | Cloak of the Phantom | 100 Fairy Coins |

# ScrapeAll

Returns all matching values from the page using the selector (optionally an attribute), useful when you need a complete list.
## Syntax

``` SQL
wds.ScrapeAll( downloadTask, selector, [attributeName] )
```

## Arguments

| Name          | Type                                                     | Description                                                       |
| ------------- | -------------------------------------------------------- | ----------------------------------------------------------------- |
| downloadTask  | [DownloadTask](../user-defined-types/download-task.html) | **Required.** A download task from a previous command result set  |
| selector      | string                                                   | **Required.** Selector for getting interesting data on a web page |
| attributeName | string                                                   | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text |

### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type, the second one should be a selector of the corresponding type.
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

## Return type

``` SQL
TABLE (Data NVARCHAR(MAX))
```

## Return value

List of found data or the empty list if nothing found

## Examples

##### Creating a job and getting all product names as a single string from the first page of the section Armor And Accessories on the Playground

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
SELECT section.Task.Url as URL,
       (SELECT STRING_AGG(Data, ', ') FROM wds.ScrapeAll(section.Task, 'css: table tr td:first-child', null)) Products
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: ul.nav li a[href^="/armor_and_accessories"]', null) section
```

| URL                                         | Products                                                                                                                 |
| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| http://playground/armor_and_accessories/1/  | Cloak of the Phantom, Crown of the Forest King, Frostbound Crown, Scepter of the Golden Dragon, Shield of the Thunder God |

##### Creating a job and getting all product names as a list from the first page of the section Armor And Accessories on the Playground

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
SELECT section.Task.Url as URL,
       products.Data as ProductName
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: ul.nav li a[href^="/armor_and_accessories"]', null) section
OUTER APPLY wds.ScrapeAll(section.Task, 'css: table tr td:first-child', null) products
```

| URL                                        | ProductName                  |
| ------------------------------------------ | ---------------------------- |
| http://playground/armor_and_accessories/1/ | Cloak of the Phantom         |
| http://playground/armor_and_accessories/1/ | Crown of the Forest King     |
| http://playground/armor_and_accessories/1/ | Frostbound Crown             |
| http://playground/armor_and_accessories/1/ | Scepter of the Golden Dragon |
| http://playground/armor_and_accessories/1/ | Shield of the Thunder God    |

# ScrapeMultiple

Batch-scrapes multiple fields from a page in a single call: define named selectors once, then fetch the first or all values per field efficiently.

## Syntax

``` SQL
wds.ScrapeMultiple( downloadTask )
```

## Arguments

| Name         | Type                                                     | Description                                                      |
| ------------ | -------------------------------------------------------- | ---------------------------------------------------------------- |
| downloadTask | [DownloadTask](../user-defined-types/download-task.html) | **Required.** A download task from a previous command result set |

## Return type

[ScrapeMultipleParams](#scrapemultipleparams)

## Return value

A special object that is used to:

1. configure what data needs to be scraped from a web page
2. return the scraped data

# ScrapeMultipleParams

A fluent helper to define selectors/attributes per field and retrieve results.
## Methods

Methods that are used to configure scraping and get its results

#### AddScrapeParams

Adds a new scrape parameter

##### Syntax

``` SQL
AddScrapeParams( name, selector, [attributeName] )
```

##### Arguments

| Name          | Type   | Description                                                          |
| ------------- | ------ | -------------------------------------------------------------------- |
| name          | string | **Required.** Scrape parameter name that is used to get scraped data |
| selector      | string | **Required.** Selector of data elements on a web page                |
| attributeName | string | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text |

##### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type, the second one should be a selector of the corresponding type.
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

##### Return type

[ScrapeMultipleParams](#scrapemultipleparams)

##### Return value

Returns the instance on which it was called

#### GetFirst

Returns the first scraped value

##### Syntax

``` SQL
GetFirst( name )
```

##### Arguments

| Name | Type   | Description                         |
| ---- | ------ | ----------------------------------- |
| name | string | **Required.** Scrape parameter name |

##### Return type

String

##### Return value

Either found data or NULL if nothing found

#### GetAll

Returns all scraped values

##### Syntax

``` SQL
GetAll( name )
```

##### Arguments

| Name | Type   | Description                         |
| ---- | ------ | ----------------------------------- |
| name | string | **Required.** Scrape parameter name |

##### Return type

[StringDataItems](./to-strings-table.html#stringdataitems)

##### Return value

List of found data or the empty list if nothing found

## Examples

##### Creating a job and getting data from the Cloak of the Phantom page on the Playground

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
SELECT product.Task.Url as URL,
       productData.ScrapeResult.GetFirst('ProductName') AS ProductName,
       (SELECT STRING_AGG(Data, ', ') FROM wds.ToStringsTable(productData.ScrapeResult.GetAll('AvailableProductParams'))) AS AvailableProductParams
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: table a[href*="/cloak_of_the_phantom.html"]', null) product
CROSS APPLY (
    SELECT wds.ScrapeMultiple(product.Task)
        .AddScrapeParams('ProductName', 'css: h1', null)
        .AddScrapeParams('AvailableProductParams', 'css: b', null) AS ScrapeResult
) productData
```

| URL                                                                 | ProductName          | AvailableProductParams |
| ------------------------------------------------------------------- | -------------------- | ---------------------- |
| http://playground/armor_and_accessories/1/cloak_of_the_phantom.html | Cloak of the Phantom | Price: , Description:  |

# TaskStatus

Returns the current status and execution trace for a download task, including errors and links to result details when available, useful for monitoring progress and debugging.

## Syntax

``` SQL
wds.TaskStatus( downloadTask )
```

## Arguments

| Name         | Type                                                     | Description                   |
| ------------ | -------------------------------------------------------- | ----------------------------- |
| downloadTask | [DownloadTask](../user-defined-types/download-task.html) | **Required.** A download task |

## Return type

[DownloadTaskStatus](../user-defined-types/download-task-status.html)

## Return value

An object with information about a task's execution and result

## Examples

##### Getting an init download task status

``` SQL
DECLARE @jobConfig wds.JobConfig = 'Server: wds://localhost:2807; StartUrls: http://playground';
SELECT root.Task.Url URL, wds.TaskStatus(root.Task).TaskState Status FROM wds.Start(@jobConfig) root
```

| URL               | Status  |
| ----------------- | ------- |
| http://playground | Created |

##### Getting an init download task information as XML

``` SQL
DECLARE @jobConfig wds.JobConfig = 'Server: wds://localhost:2807; StartUrls: http://playground';
SELECT root.Task.Url URL, wds.TaskStatus(root.Task).ToString() Status FROM wds.Start(@jobConfig) root
```

| URL               | Status                                                            |
| ----------------- | ----------------------------------------------------------------- |
| http://playground | See [Init Download Task StatusXml](#init-download-task-statusxml) |

###### Init Download Task StatusXml

``` XML
Created
```

##### Getting a download task information as XML

``` SQL
DECLARE @jobConfig wds.JobConfig = 'Server: wds://localhost:2807; StartUrls: http://playground';
SELECT TOP 1 nav.Task.Url URL, wds.TaskStatus(root.Task).ToString() Status
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: ul.nav a', null) nav
```

| URL               | Status                                                  |
| ----------------- | ------------------------------------------------------- |
| http://playground | See [Download Task StatusXml](#download-task-statusxml) |

##### Download Task StatusXml

``` XML
Handled
GET http://playground/
true
200 OK
date ddd, DD MMM YYYY HH:mm:ss GMT
server Kestrel
accept-ranges bytes
etag "etag string"
content-length 21040
content-type text/html
last-modified ddd, DD MMM YYYY HH:mm:ss GMT
YYYY-MM-DDTHH:mm:ss.zzz
0.0591974
false
0
0
```

# ToStringsTable

Converts a list of strings to a single-column table, handy for composing queries with batch scrape results.
Used with:

* [ScrapeMultipleParams.GetAll](./scrape-multiple.html#getall)

## Syntax

``` SQL
wds.ToStringsTable( items )
```

## Arguments

| Name  | Type                                | Description                                                |
| ----- | ----------------------------------- | ---------------------------------------------------------- |
| items | [StringDataItems](#stringdataitems) | **Required.** A special object that contains an items list |

## Return type

``` SQL
TABLE (Data NVARCHAR(MAX))
```

## Return value

List of the data items passed to the function

## Examples

See [ScrapeMultipleParams.GetAll](./scrape-multiple.html#getall)

# StringDataItems

A utility data type that is used to pass data between CLR methods. It should not be used in SQL queries directly.

# WDS for MS SQL Server

Bring web crawling and scraping into T‑SQL. WDS for MS SQL Server is a CLR library with user‑defined types (UDTs) and functions that let you start jobs, discover pages, extract data, and check task status directly from SQL.

## What You Can Do

- Start: launch a job with a `JobConfig` and receive initial `DownloadTask`s.
- Crawl: discover follow‑up pages and get new `DownloadTask`s.
- Scrape: extract one value (ScrapeFirst), all values (ScrapeAll), or multiple fields in one call (ScrapeMultiple).
- Inspect: query `DownloadTaskStatus` to monitor progress and debug issues.

## Prerequisites

- SQL Server: tested with SQL Server 2022 (Windows).
- WDS API Server: running and reachable (see [Deployments](../server/deployments/index.html)).
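With both prerequisites in place, a quick way to confirm connectivity is the `wds.ServerStatus` function documented in this guide. The address below assumes the default port 2807 used throughout the examples; adjust it to your server.

``` SQL
-- Connectivity check against a locally running WDS API Server (default port assumed)
DECLARE @serverConfig wds.ServerConfig = 'wds://localhost:2807';
SELECT * FROM wds.ServerStatus(@serverConfig);
```

If the server is reachable, the result set includes a Ready indicator row; otherwise the call fails, pointing at a network or configuration problem.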
## Components

- UDTs: configure jobs and pass results in SQL — see [User‑Defined Types](./user-defined-types/index.html)
- CLR Functions: Start, Crawl, Scrape, Status — see [CLR Functions](./clr-functions/index.html)

## Examples

Explore end‑to‑end scripts for common scenarios — see [Examples](./examples/index.html)

## Install

Step‑by‑step instructions to enable CLR and load the library — see [Install](./clr-functions/install.html)

## Support

If you hit issues, please use [GitHub Issues](https://github.com/webdatasource/wds.mssql.clr/issues).

# User Defined Types

# User Defined Types (UDT)

[CLR functions](../clr-functions/index.html) use UDTs as input and output parameters. All these data contracts are added to an SQL Server instance by the same script that adds the CLR functions (see the [Install](../clr-functions/install.html) page). By default, all UDTs are added to the **wds** namespace.

Most of the UDTs that are used as input parameters can be initialized from a string. Review the data contract documentation below for the details.

Data contract UDT fields are readable and writable. When a UDT has been initialized from a string, the corresponding fields are prefilled with values from the string, but they all can be overridden.

In general, initialization strings have the following format:

```
FieldName1: FieldValue1; FieldName2: FieldListValue1, FieldListValue2;
```

UDTs have validation methods to check required fields. Validation is performed when UDTs are used (not on initialization).
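As a sketch of this pattern, a UDT such as JobConfig (documented below) can be initialized from a string and then have individual fields overridden before use:

``` SQL
-- Prefill fields from an initialization string
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
-- Fields remain writable after initialization and can be overridden individually
SET @jobConfig.JobType = 'Intranet';
```

Because validation runs only when the UDT is used, a partially configured instance like this raises no error until it is passed to a function such as `wds.Start`.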
## Table of contents

- [ServerConfig](/releases/latest/mssql/user-defined-types/server-config.html)
- [JobConfig](/releases/latest/mssql/user-defined-types/job-config.html)
- [HeadersConfig](/releases/latest/mssql/user-defined-types/headers-config.html)
- [RestartConfig](/releases/latest/mssql/user-defined-types/restart-config.html)
- [HttpsConfig](/releases/latest/mssql/user-defined-types/https-config.html)
- [CookiesConfig](/releases/latest/mssql/user-defined-types/cookies-config.html)
- [ProxiesConfig](/releases/latest/mssql/user-defined-types/proxies-config.html)
- [DownloadErrorHandling](/releases/latest/mssql/user-defined-types/download-error-handling.html)
- [CrawlersProtectionBypass](/releases/latest/mssql/user-defined-types/crawlers-protection-bypass.html)
- [DownloadTask](/releases/latest/mssql/user-defined-types/download-task.html)
- [DownloadTaskStatus](/releases/latest/mssql/user-defined-types/download-task-status.html)
- [CrossDomainAccess](/releases/latest/mssql/user-defined-types/cross-domain-access.html)

# ServerConfig

Defines how MSSQL UDTs connect to the WDS API Server, including the base URL and optional credentials/HTTPS settings.

| Name | Type   | Description                      |
| ---- | ------ | -------------------------------- |
| Url  | string | **Required.** WDS API Server URL |

## Initialization String Format

An instance can be initialized with a connection string of the following format: ```wds://user:password@host:port?https=false```

The **wds** scheme is required for this connection string.

The user and password parameters are optional and should be provided only if authentication is required.

The https parameter is optional. The default value is false. For production environments, it is strongly recommended to use HTTPS.
## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @serverConfig wds.ServerConfig = 'wds://localhost:2807';
```

# JobConfig

Defines the top‑level job configuration when starting via MSSQL: connection to the WDS API, entry URLs, job type, and request/runtime settings (headers, cookies, HTTPS, proxies, error handling, domain scope).

| Name                     | Type                                                          | Description                                                              |
| ------------------------ | ------------------------------------------------------------- | ------------------------------------------------------------------------ |
| Server                   | [ServerConfig](./server-config.html)                          | **Required.** WDS API Server connection parameters                       |
| StartUrls                | array of Strings                                              | **Required.** Initial URLs. Crawling entry points                        |
| JobName                  | string                                                        | Optional. Job name. If not specified, a randomly generated value is used |
| JobType                  | [JobTypes](#jobtypes)                                         | Optional. Job type                                                       |
| Headers                  | [HeadersConfig](./headers-config.html)                        | Optional. Headers settings                                               |
| Restart                  | [RestartConfig](./restart-config.html)                        | Optional. Job restart settings                                           |
| Https                    | [HttpsConfig](./https-config.html)                            | Optional. HTTPS settings                                                 |
| Cookies                  | [CookiesConfig](./cookies-config.html)                        | Optional. Cookies settings                                               |
| Proxy                    | [ProxiesConfig](./proxies-config.html)                        | Optional. Proxy settings                                                 |
| DownloadErrorHandling    | [DownloadErrorHandling](./download-error-handling.html)       | Optional. Download error handling settings                               |
| CrawlersProtectionBypass | [CrawlersProtectionBypass](./crawlers-protection-bypass.html) | Optional. Crawler protection countermeasure settings                     |
| CrossDomainAccess        | [CrossDomainAccess](./cross-domain-access.html)               | Optional. Cross-domain access settings                                   |

## Initialization String Format

An instance can be initialized with a string of the following format: ```JobName: jobname; Server: serverConnectionString; StartUrls: URL1, URL2```

## Methods

Methods that help with initialization.

#### AddStartUrl

Adds a new start URL

##### Syntax

``` SQL
AddStartUrl( url )
```

##### Arguments

| Name | Type   | Description |
| ---- | ------ | ----------- |
| url  | string | Start URL   |

##### Return type

[JobConfig](#jobconfig)

##### Return value

Returns the instance on which it was called

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: TestJob1; Server: wds://localhost:2807; StartUrls: http://playground';
```

Adding one more start URL:

``` SQL
SET @jobConfig = @jobConfig.AddStartUrl('http://example.com');
```

---

# JobTypes

Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.

Restrictions on the possible values and the default value for all jobs can be configured in the [Dapi](../../server/services/dapi.html) service. Additionally, the [Crawler](../../server/services/crawler.html) service should be correctly configured to handle jobs of different types.

## Values

| Name     | Description                                                                                      |
| -------- | ------------------------------------------------------------------------------------------------ |
| internet | Crawl data from internet sources via request gateways (Proxy addresses, Host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits                                                  |

## Examples

Changing the job type:

``` SQL
SET @jobConfig.JobType = 'Intranet';
```

# HeadersConfig

Configures default HTTP headers to include with every request (e.g., User-Agent, Authorization), and provides helpers to add or append values.

| Name                  | Type                               | Description                                                                                     |
| --------------------- | ---------------------------------- | ----------------------------------------------------------------------------------------------- |
| DefaultRequestHeaders | array of [HttpHeader](#httpheader) | Optional. HTTP headers that will be sent with all requests. The default value is an empty array |

## Initialization String Format

This UDT can't be configured with a string. Initialization methods are used instead.

## Methods

Methods that help with initialization.
#### AddHeader

Adds or replaces a header

##### Syntax

``` SQL
AddHeader( header )
```

##### Arguments

| Name   | Type                      | Description               |
| ------ | ------------------------- | ------------------------- |
| header | [HttpHeader](#httpheader) | **Required.** HTTP header |

##### Return type

[HeadersConfig](#headersconfig)

##### Return value

Returns the instance on which it was called

#### AppendHeader

Adds a new header or appends a value to an existing header

##### Syntax

``` SQL
AppendHeader( name, value )
```

##### Arguments

| Name  | Type   | Description                |
| ----- | ------ | -------------------------- |
| name  | string | **Required.** Header name  |
| value | string | **Required.** Header value |

##### Return type

[HeadersConfig](#headersconfig)

##### Return value

Returns the instance on which it was called

## Examples

Creating a new instance:

``` SQL
DECLARE @headersConfig wds.HeadersConfig = '';
DECLARE @httpHeader wds.HttpHeader = 'Name: Accept; Values: text/html';
SET @headersConfig = @headersConfig.AddHeader(@httpHeader);
SET @headersConfig = @headersConfig.AppendHeader('Accept', 'application/xml');
```

---

# HttpHeader

Represents a single HTTP header definition with a name and one or more values.
| Name   | Type            | Description                 |
| ------ | --------------- | --------------------------- |
| Name   | string          | **Required.** Header name   |
| Values | array of String | **Required.** Header values |

## Initialization String Format

An instance can be initialized with a string of the following format: ```Name: name; Values: value1, value2;```

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @httpHeader wds.HttpHeader = 'Name: Accept; Values: text/html, application/xml';
```

# RestartConfig

Controls behavior when restarting a job — either continue from cached state or rebuild from scratch.

| Name        | Type                                | Description                    |
| ----------- | ----------------------------------- | ------------------------------ |
| RestartMode | [JobRestartModes](#jobrestartmodes) | **Required.** Job restart mode |

## Initialization String Format

An instance can be initialized with a string of the following format: ```RestartMode: Continue|FromScratch```

## Examples

Quickly changing the job restart mode:

``` SQL
SET @jobConfig.Restart = 'RestartMode: Continue';
```

Changing the job restart mode:

``` SQL
DECLARE @restartConfig wds.RestartConfig = 'RestartMode: FromScratch'
SET @restartConfig.RestartMode = 'Continue';
SET @jobConfig.Restart = @restartConfig;
```

---

# JobRestartModes

Describes restart strategies and their effect on previously cached data.
## Values

| Name        | Description                                                  |
| ----------- | ------------------------------------------------------------ |
| Continue    | Reuse cached data and continue crawling and parsing new data |
| FromScratch | Clear cached data and start from scratch                     |

# HttpsConfig

Controls HTTPS validation behavior for target resources; useful for development or crawling hosts with self‑signed certificates.

| Name                               | Type | Description                                                                              |
| ---------------------------------- | ---- | ---------------------------------------------------------------------------------------- |
| SuppressHttpsCertificateValidation | bool | **Required.** Defines whether to suppress HTTPS certificate validation of a web resource |

## Initialization String Format

An instance can be initialized with a string of the following format:

```SuppressHttpsCertificateValidation: true|false```

## Examples

Quickly changing the job Https config:

``` SQL
SET @jobConfig.Https = 'SuppressHttpsCertificateValidation: true';
```

Changing the job Https config:

``` SQL
DECLARE @jobHttps wds.HttpsConfig = 'SuppressHttpsCertificateValidation: false';
SET @jobHttps.SuppressHttpsCertificateValidation = 1;
SET @jobConfig.Https = @jobHttps;
```

# CookiesConfig

Controls whether requests persist and reuse cookies to maintain session state across page navigations.
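The one-line string assignments shown throughout this reference compose naturally on a single job config. A minimal sketch (the job name and URLs are placeholders; a @jobConfig variable of type wds.JobConfig is assumed, declared as in the script examples later in this document):

``` SQL
-- Placeholder job config; see the Scripts Examples section for full declarations
DECLARE @jobConfig wds.JobConfig = 'JobName: ConfigDemo; Server: wds://localhost:2807; StartUrls: http://playground';
-- Accept self-signed certificates and keep cookies between requests
SET @jobConfig.Https = 'SuppressHttpsCertificateValidation: true';
SET @jobConfig.Cookies = 'UseCookies: true';
```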
| Name       | Type | Description                                                                  |
| ---------- | ---- | ---------------------------------------------------------------------------- |
| UseCookies | bool | **Required.** Defines if cookies should be saved and reused between requests |

## Initialization String Format

An instance can be initialized with a string of the following format:

```UseCookies: true|false```

## Examples

Quickly changing the job Cookies config:

``` SQL
SET @jobConfig.Cookies = 'UseCookies: true';
```

Changing the job Cookies config:

``` SQL
DECLARE @jobCookies wds.CookiesConfig = 'UseCookies: false';
SET @jobCookies.UseCookies = 1;
SET @jobConfig.Cookies = @jobCookies;
```

# ProxiesConfig

Configures whether and how requests are routed through proxies, including fallback to direct IP, response-code rotation, and managing a proxy pool.

| Name                              | Type                                 | Description                                                                                      |
| --------------------------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------ |
| UseProxy                          | bool                                 | **Required.** Use proxies for requests                                                           |
| SendOvertRequestsOnProxiesFailure | bool                                 | **Required.** Send a request from the host's real IP address if all proxies failed               |
| IterateProxyResponseCodes         | string                               | Optional. Comma-separated HTTP response codes to iterate proxies on. Default value is '401, 403' |
| Proxies                           | array of [ProxyConfig](#proxyconfig) | Optional. Proxy configurations. Default value is an empty array                                  |

## Initialization String Format

An instance can be initialized with a string of the following format:

```UseProxy: true|false; SendOvertRequestsOnProxiesFailure: true|false; IterateProxyResponseCodes: 401, 403```

## Methods

Methods that help with initialization.

#### AddProxyConfig

Adds a new proxy configuration

##### Syntax

``` SQL
AddProxyConfig( proxyConfig )
```

##### Arguments

| Name        | Type                        | Description                       |
| ----------- | --------------------------- | --------------------------------- |
| proxyConfig | [ProxyConfig](#proxyconfig) | **Required.** Proxy configuration |

##### Return type

[ProxiesConfig](#proxiesconfig)

##### Return value

Returns the instance on which it was called

#### AddProxy

Adds a new proxy

##### Syntax

``` SQL
AddProxy( protocol, host, port, userName, password, connectionsLimit, availableHosts )
```

##### Arguments

| Name             | Type   | Description                                                                                                |
| ---------------- | ------ | ---------------------------------------------------------------------------------------------------------- |
| protocol         | string | **Required.** Proxy protocol (http\|https\|socks5)                                                         |
| host             | string | **Required.** Proxy host                                                                                   |
| port             | int    | **Required.** Proxy port                                                                                   |
| userName         | string | Optional. Proxy username                                                                                   |
| password         | string | Optional. Proxy password                                                                                   |
| connectionsLimit | int    | Optional. Proxy connections limit (how many connections can be established through this proxy at one time) |
| availableHosts   | string | Optional. A comma-separated list of available hosts that can be accessed through this proxy                |

##### Return type

[ProxiesConfig](#proxiesconfig)

##### Return value

Returns the instance on which it was called

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @proxiesConfig wds.ProxiesConfig = 'UseProxy: true; SendOvertRequestsOnProxiesFailure: true; IterateProxyResponseCodes: 401, 403';
SET @proxiesConfig = @proxiesConfig.AddProxy('https', 'example.com', 3128, null, null, null, null);
SET @proxiesConfig = @proxiesConfig.AddProxy('https', 'example.com', 3128, 'user', 'password', 10, 'host1.com, host2.com');
```

Creating a new instance:

``` SQL
DECLARE @proxiesConfig wds.ProxiesConfig = 'UseProxy: true; SendOvertRequestsOnProxiesFailure: true; IterateProxyResponseCodes: 401, 403';
SET @proxiesConfig.UseProxy = 1;
SET @proxiesConfig = @proxiesConfig.AddProxy('https', 'example.com', 3128, null, null, null, null);
```

---

# ProxyConfig

Defines an individual proxy endpoint and its connection characteristics.

| Name             | Type            | Description                                                                                                |
| ---------------- | --------------- | ---------------------------------------------------------------------------------------------------------- |
| Protocol         | string          | **Required.** Proxy protocol (http\|https\|socks5)                                                         |
| Host             | string          | **Required.** Proxy host                                                                                   |
| Port             | int             | **Required.** Proxy port                                                                                   |
| UserName         | string          | Optional. Proxy username                                                                                   |
| Password         | string          | Optional. Proxy password                                                                                   |
| ConnectionsLimit | int             | Optional. Proxy connections limit (how many connections can be established through this proxy at one time) |
| AvailableHosts   | array of String | Optional. A list of available hosts that can be accessed through this proxy                                |

## Initialization String Format

An instance can be initialized with a string of the following format:

```Protocol: protocol; Host: host; Port: port; UserName: userName; Password: password; ConnectionsLimit: connectionsLimit; AvailableHosts: availableHost1, availableHost2;```

## Methods

The ProxyConfig UDT has helper methods that make its initialization more convenient.

#### AddAvailableHost

Adds a new available host

##### Syntax

``` SQL
AddAvailableHost( host )
```

##### Arguments

| Name | Type   | Description        |
| ---- | ------ | ------------------ |
| host | string | **Required.** Host |

##### Return type

[ProxyConfig](#proxyconfig)

##### Return value

Returns the instance on which it was called

## Examples

Creating a new instance initialized from a string and adding it to a proxies config:

``` SQL
DECLARE @proxyConfig wds.ProxyConfig = 'Protocol: socks5; Host: example.com; Port: 1080';
SET @proxyConfig.ConnectionsLimit = 10;
SET @proxiesConfig = @proxiesConfig.AddProxyConfig(@proxyConfig);
```

Creating a fully initialized instance from a string, adding one more available host, and adding it to a proxies config:

``` SQL
DECLARE @proxyConfig wds.ProxyConfig = 'Protocol: socks5; Host: example.com; Port: 1080; UserName: userName; Password: password; ConnectionsLimit: 10; AvailableHosts: host1.com, host2.com;';
SET @proxyConfig = @proxyConfig.AddAvailableHost('host3.com');
SET @proxiesConfig = @proxiesConfig.AddProxyConfig(@proxyConfig);
```

# DownloadErrorHandling

Specifies how a job should react to download failures (e.g., network timeouts or HTTP errors), either skipping failures or retrying with limits.
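As a quick orientation before the field reference, a hedged sketch of enabling retries (the initialization strings follow the formats documented in this section and in RetryPolicyParams; a @jobConfig variable of type wds.JobConfig is assumed to be declared elsewhere):

``` SQL
-- Retry failed downloads up to 3 times, waiting 1 second between attempts;
-- @jobConfig is assumed to exist
DECLARE @downloadErrorHandling wds.DownloadErrorHandling = 'ErrorHandlingPolicy: Retry';
SET @downloadErrorHandling.RetryPolicyParams = 'RetryDelayMs: 1000; RetriesLimit: 3';
SET @jobConfig.DownloadErrorHandling = @downloadErrorHandling;
```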
| Name                | Type                                                            | Description                                                                                                             |
| ------------------- | --------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| ErrorHandlingPolicy | [DownloadErrorHandlingPolicies](#downloaderrorhandlingpolicies) | **Required.** Download error handling policy                                                                             |
| RetryPolicyParams   | [RetryPolicyParams](#retrypolicyparams)                         | Optional. Retry settings. Comes into play if the Retry error handling policy is selected. If not set, 0 values are used |

## Initialization String Format

An instance can be initialized with a string of the following format:

```ErrorHandlingPolicy: policy```

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @downloadErrorHandling wds.DownloadErrorHandling = 'ErrorHandlingPolicy: Skip';
SET @jobConfig.DownloadErrorHandling = @downloadErrorHandling;
```

Setting the job DownloadErrorHandling from a string:

``` SQL
SET @jobConfig.DownloadErrorHandling = 'ErrorHandlingPolicy: Skip';
```

# DownloadErrorHandlingPolicies

Available strategies for handling request or network failures during content download.

## Values

| Name  | Description                         |
| ----- | ----------------------------------- |
| Skip  | Skip an error and continue crawling |
| Retry | Try again                           |

# RetryPolicyParams

Parameters that control retry behavior when the Retry policy is selected.
| Name         | Type | Description                                                   |
| ------------ | ---- | ------------------------------------------------------------- |
| RetryDelayMs | int  | Optional. Delay between retries in milliseconds. Default is 0 |
| RetriesLimit | int  | Optional. Maximum number of retries. Default is 0             |

## Initialization String Format

An instance can be initialized with a string of the following format:

```RetryDelayMs: delay; RetriesLimit: retries```

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @retryPolicyParams wds.RetryPolicyParams = 'RetryDelayMs: 1000; RetriesLimit: 3';
SET @downloadErrorHandling.RetryPolicyParams = @retryPolicyParams;
```

Setting the DownloadErrorHandling RetryPolicyParams from a string:

``` SQL
SET @downloadErrorHandling.RetryPolicyParams = 'RetryDelayMs: 1000; RetriesLimit: 3';
```

# CrawlersProtectionBypass

Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, timeouts, and per‑host crawl delays.

| Name              | Type                               | Description                                                   |
| ----------------- | ---------------------------------- | ------------------------------------------------------------- |
| MaxResponseSizeKb | int                                | Optional. Max response size in kilobytes. Default value is 1000 |
| MaxRedirectHops   | int                                | Optional. Max redirect hops. Default value is 10              |
| RequestTimeoutSec | int                                | Optional. Max request timeout in seconds. Default value is 30 |
| CrawlDelays       | array of [CrawlDelay](#crawldelay) | Optional. Crawl delays for hosts                              |

## Initialization String Format

An instance can be initialized with a string of the following format:

```MaxResponseSizeKb: size; MaxRedirectHops: hops; RequestTimeoutSec: timeout```

## Methods

Methods that help with initialization.

#### AddCrawlDelay

Adds a new crawl delay

##### Syntax

``` SQL
AddCrawlDelay( crawlDelay )
```

##### Arguments

| Name       | Type                      | Description                       |
| ---------- | ------------------------- | --------------------------------- |
| crawlDelay | [CrawlDelay](#crawldelay) | **Required.** CrawlDelay instance |

##### Return type

[CrawlersProtectionBypass](#crawlersprotectionbypass)

##### Return value

Returns the instance on which it was called

#### AddDelay

Adds a new crawl delay from a host and a delay string

##### Syntax

``` SQL
AddDelay( host, delay )
```

##### Arguments

| Name  | Type   | Description                                               |
| ----- | ------ | --------------------------------------------------------- |
| host  | string | **Required.** Host                                        |
| delay | string | **Required.** Delay string. See [CrawlDelay](#crawldelay) |

##### Return type

[CrawlersProtectionBypass](#crawlersprotectionbypass)

##### Return value

Returns the instance on which it was called

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @crawlersProtectionBypass wds.CrawlersProtectionBypass = 'MaxResponseSizeKb: 1000; MaxRedirectHops: 3; RequestTimeoutSec: 1';
SET @crawlersProtectionBypass = @crawlersProtectionBypass.AddDelay('host1.com', '0');
SET @crawlersProtectionBypass = @crawlersProtectionBypass.AddDelay('host2.com', '1-3');
SET @crawlersProtectionBypass = @crawlersProtectionBypass.AddDelay('host3.com', 'robots');
SET @jobConfig.CrawlersProtectionBypass = @crawlersProtectionBypass;
```

Setting the CrawlersProtectionBypass from a string:

``` SQL
SET @jobConfig.CrawlersProtectionBypass = 'MaxResponseSizeKb: 1000; MaxRedirectHops: 3; RequestTimeoutSec: 1';
```

---

# CrawlDelay

Per-host throttling rule to space out requests and respect site limits or robots guidance.
| Name  | Type   | Description                |
| ----- | ------ | -------------------------- |
| Host  | string | **Required.** Host         |
| Delay | string | **Required.** Delay string |

### Remarks

**Delay string** can be a single number, a range of numbers separated by a dash, or 'robots':

* A single value means a delay of that many seconds
* A range means a delay of seconds chosen from the range
* 'robots' means using the delay defined in robots.txt (if not specified there, 0 is used)

## Initialization String Format

An instance can be initialized with a string of the following format:

```Host: host; Delay: 0|1-5|robots```

## Examples

Creating new instances initialized from strings:

``` SQL
DECLARE @robotsCrawlDelay wds.CrawlDelay = 'Host: host1.com; Delay: robots';
DECLARE @rangeCrawlDelay wds.CrawlDelay = 'Host: host2.com; Delay: 1-5';
DECLARE @noCrawlDelay wds.CrawlDelay = 'Host: host3.com; Delay: 0';
SET @crawlersProtectionBypass = @crawlersProtectionBypass.AddCrawlDelay(@robotsCrawlDelay);
SET @crawlersProtectionBypass = @crawlersProtectionBypass.AddCrawlDelay(@rangeCrawlDelay);
SET @crawlersProtectionBypass = @crawlersProtectionBypass.AddCrawlDelay(@noCrawlDelay);
```

# DownloadTask

Represents a single page download request and its metadata when interacting with WDS from MSSQL.

| Name   | Type                                 | Description                                    |
| ------ | ------------------------------------ | ---------------------------------------------- |
| Error  | string                               | Optional. Request execution error              |
| Server | [ServerConfig](./server-config.html) | Optional. WDS API Server connection parameters |
| Id     | string                               | Optional. Download task ID                     |
| Url    | string                               | Optional. Download task URL                    |

# DownloadTaskStatus

Summarizes the execution state and outputs of a single download operation, including current status, any error, and final or intermediate results.

Fields:

| Name            | Type                                      | Description                                              |
| --------------- | ----------------------------------------- | -------------------------------------------------------- |
| Error           | string                                    | Optional. Request execution error                        |
| TaskState       | [DownloadTaskStates](#downloadtaskstates) | Optional. Task state                                     |
| Result          | [DownloadInfo](#downloadinfo)             | Optional. Download result                                |
| intermedResults | array of [DownloadInfo](#downloadinfo)    | Optional. Stack of intermediate request download results |

# DownloadTaskStates

Lifecycle states a download task can transition through from creation to completion or deletion.
Enumeration values:

| Name                     | Description                                                                                |
| ------------------------ | ------------------------------------------------------------------------------------------ |
| Handled                  | Task is handled and its results are available                                              |
| AccessDeniedForRobots    | Access to a URL is denied by robots.txt                                                    |
| AllRequestGatesExhausted | All request gateways (proxy and host IP addresses) were exhausted but no data was received |
| InProgress               | Task is in progress                                                                        |
| Created                  | Task has not been started yet                                                              |
| Deleted                  | Task has been deleted                                                                      |

# DownloadInfo

Captures request/response details for a download attempt, including HTTP metadata, headers, cookies, and payload.
Fields:

| Name            | Type                               | Description                                                                                |
| --------------- | ---------------------------------- | ------------------------------------------------------------------------------------------ |
| Method          | string                             | **Required.** HTTP method                                                                  |
| Url             | string                             | **Required.** Request URL                                                                  |
| IsSuccess       | bool                               | **Required.** Whether the request was successful                                           |
| HttpStatusCode  | int                                | **Required.** [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) |
| ReasonPhrase    | string                             | **Required.** HTTP reason phrase                                                           |
| RequestHeaders  | array of [HttpHeader](#httpheader) | **Required.** HTTP headers sent with the request                                           |
| ResponseHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers received in the response                                        |
| RequestCookies  | array of [Cookie](#cookie)         | **Required.** Cookies sent with the request                                                |
| ResponseCookies | array of [Cookie](#cookie)         | **Required.** Cookies received in the response                                             |
| RequestDateUtc  | datetime                           | **Required.** Request date and time in UTC                                                 |
| DownloadTimeSec | double                             | **Required.** Download time in seconds                                                     |
| ViaProxy        | bool                               | **Required.** Whether the request was made via a proxy                                     |
| WaitTimeSec     | double                             | **Required.** Delay in seconds before the request was executed (crawl latency, etc.)       |
| CrawlDelaySec   | int                                | **Required.** A delay in seconds applied to the request                                    |

# HttpHeader

Represents a single HTTP header with a name and one or more values.
Fields:

| Name   | Type            | Description                 |
| ------ | --------------- | --------------------------- |
| Name   | string          | **Required.** Header name   |
| Values | array of String | **Required.** Header values |

# [Cookie](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies)

Represents an HTTP cookie as sent via the Set-Cookie / Cookie headers, including its attributes.

Fields:

| Name     | Type     | Description                                                                                                     |
| -------- | -------- | --------------------------------------------------------------------------------------------------------------- |
| Name     | string   | **Required.** [Name](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#attributes)           |
| Value    | string   | **Required.** [Value](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#attributes)          |
| Domain   | string   | **Required.** [Domain](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#domaindomain-value) |
| Path     | string   | **Required.** [Path](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#pathpath-value)       |
| HttpOnly | bool     | **Required.** [HttpOnly](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#httponly)         |
| Secure   | bool     | **Required.** [Secure](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#secure)             |
| Expires  | datetime | Optional. [Expires](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#expiresdate)           |

# CrossDomainAccess

Controls the domain scope a job may follow from its starting hosts: only the main domain, include subdomains, or allow cross‑domain navigation.

| Name         | Type                                                    | Description                              |
| ------------ | ------------------------------------------------------- | ---------------------------------------- |
| AccessPolicy | [CrossDomainAccessPolicies](#crossdomainaccesspolicies) | **Required.** Cross-domain access policy |

## Initialization String Format

An instance can be initialized with a string of the following format:

```AccessPolicy: policy```

## Examples

Creating a new instance initialized from a string:

``` SQL
DECLARE @crossDomainAccess wds.CrossDomainAccess = 'AccessPolicy: Subdomains';
SET @jobConfig.CrossDomainAccess = @crossDomainAccess;
```

Setting the job CrossDomainAccess from a string:

``` SQL
SET @jobConfig.CrossDomainAccess = 'AccessPolicy: Subdomains';
```

---

# CrossDomainAccessPolicies

Domain scoping modes that determine which hosts are in‑bounds while crawling.

## Values

| Name         | Description                                                                            |
| ------------ | -------------------------------------------------------------------------------------- |
| None         | No subdomain or cross-domain access. Only the main domain is allowed                   |
| Subdomains   | The subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
| CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com")    |

# Examples

# Scripts Examples

This section contains T-SQL script examples that can be used for evaluation or as a base for writing queries against real web sources.
All these examples use the [Playground](../../server/services/playground.html) as the target web resource.

By default, the RestartMode is [Continue](../user-defined-types/restart-config.html#jobrestartmodes), so the first runs take some time to grab real data from the playground, while subsequent runs work on the cache, which is much faster. To re-crawl fresh data instead, [change the RestartMode](../user-defined-types/restart-config.html#examples).

- [Scrape Paged](/releases/latest/mssql/examples/scrape-paged.html)
- [Scrape Sitemap](/releases/latest/mssql/examples/scrape-sitemap.html)

# Scrape Paged

> **_Minimal playground version:_** v1.0.0

Demonstrates iterating through paginated category pages to visit item pages and scrape fields; includes a ScrapeMultiple variant for fewer API calls.
``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProducts; Server: wds://localhost:2807; StartUrls: http://playground';
DECLARE @pages TABLE (Task wds.DownloadTask);

-- Gathering categories' first pages
INSERT INTO @pages (Task)
SELECT nav.Task
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: ul.nav a:not([href="/"])', null) nav

-- Gathering the other pages of categories
WHILE @@ROWCOUNT > 0
BEGIN
    INSERT INTO @pages (Task)
    SELECT newPages.Task
    FROM @pages curPages
    CROSS APPLY wds.Crawl(curPages.Task, 'css: ul.pagination li:not(.disabled) a', null) newPages
    WHERE NOT EXISTS (SELECT NULL FROM @pages p WHERE p.Task = newPages.Task)
END

-- Iterating through pages, visiting product pages, and scraping data from them
SELECT
    products.Task.Url ProductUrl,
    wds.ScrapeFirst(products.Task, 'css: h1', null) AS ProductName,
    wds.ScrapeFirst(products.Task, 'css: .price span', null) AS ProductPrice
FROM @pages pages
CROSS APPLY wds.Crawl(pages.Task, 'css: .table a', null) products
```

Iterate over all pages and scrape data using the [ScrapeMultiple](../clr-functions/scrape-multiple.html) function. This approach is a bit faster because fewer requests to the WDS API Server are required.
``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsBatch; Server: wds://localhost:2807; StartUrls: http://playground';
DECLARE @pages TABLE (Task wds.DownloadTask);

-- Gathering categories' first pages
INSERT INTO @pages (Task)
SELECT nav.Task
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'css: ul.nav a:not([href="/"])', null) nav

-- Gathering the other pages of categories
WHILE @@ROWCOUNT > 0
BEGIN
    INSERT INTO @pages (Task)
    SELECT newPages.Task
    FROM @pages curPages
    CROSS APPLY wds.Crawl(curPages.Task, 'css: ul.pagination li:not(.disabled) a', null) newPages
    WHERE NOT EXISTS (SELECT NULL FROM @pages p WHERE p.Task = newPages.Task)
END

-- Iterating through pages, visiting product pages, and scraping data from them
SELECT products.Task.Url ProductUrl,
       product.ScrapeResult.GetFirst('ProductName') AS ProductName,
       product.ScrapeResult.GetFirst('ProductPrice') AS ProductPrice
FROM @pages pages
CROSS APPLY wds.Crawl(pages.Task, 'css: .table a', null) products
CROSS APPLY (
    SELECT wds.ScrapeMultiple(products.Task)
           .AddScrapeParams('ProductName', 'css: h1', null)
           .AddScrapeParams('ProductPrice', 'css: .price span', null) AS ScrapeResult
) product
```

# Scrape Sitemap

> **_Minimal playground version:_** v1.0.1

Shows how to collect product pages from sitemap.xml and scrape fields from each page; includes a ScrapeMultiple variant to minimize round trips.
``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsSitemap; Server: wds://localhost:2807; StartUrls: http://playground/sitemap.xml';

SELECT products.Task.Url ProductUrl,
       wds.ScrapeFirst(products.Task, 'css: h1', null) AS ProductName,
       wds.ScrapeFirst(products.Task, 'css: .price span', null) AS ProductPrice
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'xpath: //*[local-name()="url"]/*[local-name()="loc"]', 'val') products
```

Get product pages from sitemap.xml and scrape data from these pages using the [ScrapeMultiple](../clr-functions/scrape-multiple.html) function. This approach is a bit faster because fewer requests to the WDS API Server are required.

``` SQL
DECLARE @jobConfig wds.JobConfig = 'JobName: CrawlAllProductsSitemap; Server: wds://localhost:2807; StartUrls: http://playground/sitemap.xml';

SELECT products.Task.Url ProductUrl,
       product.ScrapeResult.GetFirst('ProductName') AS ProductName,
       product.ScrapeResult.GetFirst('ProductPrice') AS ProductPrice
FROM wds.Start(@jobConfig) root
OUTER APPLY wds.Crawl(root.Task, 'xpath: //*[local-name()="url"]/*[local-name()="loc"]', 'val') products
CROSS APPLY (
    SELECT wds.ScrapeMultiple(products.Task)
           .AddScrapeParams('ProductName', 'css: h1', null)
           .AddScrapeParams('ProductPrice', 'css: .price span', null) AS ScrapeResult
) product
```

# MCP Server

# MCP Server

A Model Context Protocol (MCP) server that lets IDEs and agentic systems orchestrate crawling and scraping with WDS. It exposes ready‑to‑use tools and prompts for link discovery, data extraction, and workflow automation — all backed by the WDS API Server.

## What You Can Do

- Run crawl/scrape workflows with tools (start jobs, follow links, extract fields, check status).
- Use one‑shot prompts to guide an agent through discovery and extraction tasks.
- Stream large extractions via cursors for efficient, incremental processing.
- Retrieve data from crawled pages

## WDS API Server

The [WDS API Server](../server/index.html) provides the core functionality, including the WDS MCP Server.

## WDS MCP Server

When the WDS API Server is up and running, the WDS MCP Server can be connected to IDEs:

#### Visual Studio Code

Here is the [official guide](https://code.visualstudio.com/docs/copilot/chat/mcp-servers) on how to connect MCP servers to Visual Studio Code. Use the following values to connect the WDS MCP Server:

| Parameter | Value                  | Description                                                                                                         |
| --------- | ---------------------- | ------------------------------------------------------------------------------------------------------------------- |
| Name      | wds                    | A name to find the WDS MCP Server among other connected MCP servers                                                   |
| Type      | http                   | The WDS MCP Server is connected using the HTTP protocol                                                               |
| URL       | http://[host:port]/mcp | The WDS MCP Server URL. If the WDS API Server is deployed locally in Docker, the URL is `http://localhost:2807/mcp`   |

## Quick Start

1. Deploy the WDS API Server (see Server [Deployments](../server/deployments/index.html)).
2. Connect the MCP server in your IDE (see the table above).
3. Try a prompt, for example in VS Code: `/mcp.wds.scrape-data` with optional `urls` and `mainTask` args.

## Tools at a Glance

- StartJob: create/update a job from a Job Config, return initial Download Tasks.
- Crawl: discover follow‑up pages using a selector, return Download Tasks.
- Scrape: extract text/attribute values with a selector.
- GetDownloadTaskStatus: inspect status, errors, and request/response details.
- CrawlMdr: execute hierarchical crawl/scrape plans with cursor‑based results.
- CrawlMdrConfig* helpers: build/update multi‑level plans (subs, crawl params, scrape params).
- GetCrawlMdrData: fetch the next batch of scraped JSON documents via cursor.
- Retrieve: retrieve relevant data from the indexed web resources.

See full details in [MCP Tools](./tools/index.html).

## Prompts at a Glance

- ScrapeData: discover pages, define fields, and configure site‑wide scraping to JSON.
- Resume: crawl and summarize an entire site into a structured overview.
- Query: retrieve data from the WDS index based on a user query, analyze the retrieved documents, enrich the response with additional details, and generate an answer using Retrieval-Augmented Generation (RAG).

See full details in [MCP Prompts](./prompts/index.html).

## Typical Flows

- Simple: StartJob → Crawl → Scrape.
- Hierarchical (MDR): CrawlMdrConfigCreate/Upsert* → StartJob → CrawlMdr → GetCrawlMdrData (repeat until the cursor is empty).

## Endpoint and Base Path

- Default MCP endpoint: `/mcp` under the WDS base URL (e.g., `http://localhost:2807/mcp`).
- [Helm](../server/deployments/helm.html) deployments can add a base‑path prefix via `global.ingress.basePath`.

# WDS MCP Prompts

# WDS MCP Prompts

Prebuilt prompts that guide an AI agent to use the [WDS MCP Tools](../tools/index.html) effectively — from discovering pages to extracting data and summarizing sites. Each prompt includes intent, usage guidance, and examples tailored to a specific outcome.

## What They’re Good For

- Speed: get productive fast with sensible defaults and working examples.
- Consistency: standardize how agents discover, scrape, and format results.
- Extendability: adapt prompts to your domain (add constraints, schemas, formats).

## How to Run

- In supported IDEs (e.g., VS Code), trigger the prompt command (see each prompt page for the command and arguments).
- Ensure the WDS MCP Server is reachable (see [MCP Server](../index.html)).
- The [Claude Opus 4.5](https://www.anthropic.com/news/claude-opus-4-5) model was used to generate the Output Examples because it showed the best results during testing.

## Customize for Your Use Case

- Start from the provided prompts and tune examples, output schemas (JSON), constraints, or safety rules.
- Combine with tool flows (StartJob → Crawl → Scrape or CrawlMdr) for more complex pipelines.

## Prompt Catalog

Review the list and choose the best fit for your task:

- [Resume](/releases/latest/mcp/prompts/resume.html)
- [Sliced Resume](/releases/latest/mcp/prompts/sliced-resume.html)
- [ScrapeData](/releases/latest/mcp/prompts/scrape-data.html)
- [Index](/releases/latest/mcp/prompts/index-wr.html)
- [Reindex](/releases/latest/mcp/prompts/reindex-wr.html)
- [Query](/releases/latest/mcp/prompts/query.html)

# Resume Prompt

Instructs an AI agent to crawl and summarize an entire site: discover sections, scrape key fields, and produce a concise, structured overview.

## How to call

The command may differ between IDEs. Here are the commands for well-known IDEs, assuming the WDS MCP server has been registered under the name `wds`:

| IDE                | Command           |
| ------------------ | ----------------- |
| Visual Studio Code | `/mcp.wds.resume` |

## Arguments

| Name | Type   | Description                                                                                                                        |
| ---- | ------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| url  | string | Optional. Initial crawling entry point URL. If not specified, the [Playground](../../server/services/playground.html) URL is used     |

### Output Example

Output of this prompt, run with default arguments:

![Resume Prompt](/assets/img/mcp-prompts/resume2.png?fp=K5iGiJtBm7lceQ9R)

# Sliced Resume Prompt

Instructs an AI agent to crawl and summarize an entire site: discover sections, scrape key fields, and produce a concise, structured overview in a fast-first-result manner. It uses the max-depth parameter to retrieve the main data first (overview, contacts, prices), giving the user an overview of a web resource within seconds, and then continues crawling to gather all the information for the final resume.

## How to call

The command may differ between IDEs. Here are the commands for well-known IDEs, assuming the WDS MCP server has been registered under the name `wds`:

| IDE                | Command                  |
| ------------------ | ------------------------ |
| Visual Studio Code | `/mcp.wds.sliced-resume` |

## Arguments

| Name | Type   | Description                                                                                                                        |
| ---- | ------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| url  | string | Optional. Initial crawling entry point URL. If not specified, the [Playground](../../server/services/playground.html) URL is used     |

### Output Example

Output of this prompt, run with default arguments:

![Sliced Resume Prompt](/assets/img/mcp-prompts/sliced-resume1.png?fp=MIq8Uo3VrUd2EsjO)

# ScrapeData Prompt

Guides an AI agent to discover relevant pages, define fields, and configure WDS to scrape site-wide data for analysis.
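For agentic systems that call the WDS MCP Server directly rather than through an IDE command, prompts are fetched via the standard MCP `prompts/get` JSON-RPC method. The following sketch builds such a request body; the prompt name `scrape-data` and the `url` argument are assumptions inferred from this page, so verify them against your server's `prompts/list` response:

```python
import json

# Sketch (assumption): a JSON-RPC 2.0 body for the MCP "prompts/get" method.
# The prompt name "scrape-data" and the "url" argument are illustrative;
# confirm the exact names via the server's "prompts/list" response.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "prompts/get",
    "params": {
        "name": "scrape-data",
        "arguments": {"url": "http://playground"},
    },
}

# Serialize for an HTTP POST to the MCP endpoint (e.g., http://localhost:2807/mcp).
body = json.dumps(request)
print(body)
```

An MCP client library normally handles this framing for you; the sketch only shows what travels over the wire.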
## How to call

The command may differ between IDEs. Here are the commands for well-known IDEs, assuming the WDS MCP server has been registered under the name `wds`:

| IDE                | Command                |
| ------------------ | ---------------------- |
| Visual Studio Code | `/mcp.wds.scrape-data` |

## Arguments

| Name         | Type   | Description                                                                                                               |
| ------------ | ------ | ----------------------------------------------------------------------------------------------------------------------------- |
| url          | string | Optional. Start URL. If not specified, the [Playground](../../server/services/playground.html) URL is used                     |
| outputFormat | string | Optional. The output format for the scraped data. Default: a table with the following columns: Name, Price, Description        |

### Output Example

Output of this prompt, run with default arguments:

![ScrapeData Prompt](/assets/img/mcp-prompts/scrape-data2.png?fp=IYGPyzdMXATIquHm)

# Index Prompt

Instructs an AI agent to crawl a web resource and add its payload data to the search index. The agent analyses the web resource and configures the crawler to index only the payload, skipping data noise (ads, navigation, etc.).\
If web pages are cached, they are skipped because they are assumed to have already been indexed. If reindexing is required, see the [reindex](./reindex-wr.html) prompt.

## How to call

The command may differ between IDEs. Here are the commands for well-known IDEs, assuming the WDS MCP server has been registered under the name `wds`:

| IDE                | Command          |
| ------------------ | ---------------- |
| Visual Studio Code | `/mcp.wds.index` |

## Arguments

| Name | Type   | Description                                                                                                                        |
| ---- | ------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| url  | string | Optional. Initial crawling entry point URL. If not specified, the [Playground](../../server/services/playground.html) URL is used     |

### Output Example

Output of this prompt, run with default arguments:

![Index Prompt](/assets/img/mcp-prompts/index1.png?fp=QitxZiT01njIFmjP)

# Reindex Prompt

Instructs an AI agent to crawl a web resource and re-add its payload data to the search index (even for cached pages). The agent analyses the web resource and configures the crawler to index only the payload, skipping data noise (ads, navigation, etc.).

## How to call

The command may differ between IDEs. Here are the commands for well-known IDEs, assuming the WDS MCP server has been registered under the name `wds`:

| IDE                | Command            |
| ------------------ | ------------------ |
| Visual Studio Code | `/mcp.wds.reindex` |

## Arguments

| Name | Type   | Description                                                                                                                        |
| ---- | ------ | ------------------------------------------------------------------------------------------------------------------------------------ |
| url  | string | Optional. Initial crawling entry point URL. If not specified, the [Playground](../../server/services/playground.html) URL is used     |

### Output Example

Output of this prompt, run with default arguments:

![Reindex Prompt](/assets/img/mcp-prompts/reindex1.png?fp=MlEChwaG0trvQJWx)

# Query Prompt

Instructs an AI agent to retrieve data from the WDS index based on a user query, analyze the retrieved documents, enrich the response with additional details, and generate an answer using Retrieval-Augmented Generation (RAG).\
See the [index](./index-wr.html) prompt to find out how to index web resources for RAG.

## How to call

The command may differ between IDEs. Here are the commands for well-known IDEs, assuming the WDS MCP server has been registered under the name `wds`:

| IDE                | Command          |
| ------------------ | ---------------- |
| Visual Studio Code | `/mcp.wds.query` |

## Arguments

| Name         | Type   | Description                                                                                                                                                          |
| ------------ | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| role         | string | Optional. The Role section for the prompt. Default: You are an AI assistant that helps people find information                                                             |
| userTask     | string | Optional. The Mission section for the prompt. Default: Analyse the playground (http://playground) and find items that grant the wearer control over the creatures of the forest |
| outputFormat | string | Optional. Output format. Default: markdown                                                                                                                                 |

### Output Example

Output of this prompt, run with default arguments:

![Query Prompt](/assets/img/mcp-prompts/query1.png?fp=xMwmvRG-LV8gAtg1)

# WDS MCP Tools

# WDS MCP Tools

Orchestrate crawling and scraping from IDEs and agentic systems using a focused set of MCP tools. These tools wrap the WDS REST API with simple actions for starting jobs, following links, extracting data, and inspecting task status. The companion [WDS MCP Prompts](../prompts/index.html) build on these tools for common outcomes.

## What You Can Do

- Start jobs with a `JobConfig` and receive initial `DownloadTask`s
- Crawl pages via selectors and return new `DownloadTask`s
- Scrape text or attribute values (single or multiple fields)
- Track execution with `DownloadTaskStatus` and request/response details
- Execute hierarchical (MDR) crawl/scrape plans and stream results by cursor
- Index web page data for retrieval (RAG) based on full-text and vector searches

## Quick Start

1. Ensure the WDS API Server (and MCP endpoint) is reachable (e.g., `http://localhost:2807/mcp`)
2. Connect the MCP server in your IDE (see the MCP overview for setup)
3. Call StartJob with a minimal JobConfig, then Crawl and Scrape, or query data

## Tools at a Glance

- StartJob: create or update a job from `JobConfig`; returns initial `DownloadTask`s
- JobConfig*: helpers to add URLs, headers, proxies, restart/error policies, domain scope
- Crawl: find links using a selector/attribute and return `DownloadTask`s
- Scrape: extract values using a selector (and optional attribute)
- GetDownloadTaskStatus: inspect status, errors, and HTTP details for a task
- CrawlMdr: run hierarchical crawl/scrape plans with cursor‑based results
- CrawlMdrConfig*: create/update MDR plans (subs, crawl params, scrape params)
- GetCrawlMdrData: fetch the next batch of scraped JSON documents using a cursor
- Retrieve: retrieve relevant data from the indexed web resources

## Table of Contents

- [JobConfig](/releases/latest/mcp/tools/job-config.html)
- [StartJob](/releases/latest/mcp/tools/start-job.html)
- [UpsertJobConfig](/releases/latest/mcp/tools/upsert-job-config.html)
- [StartExistingJob](/releases/latest/mcp/tools/start-existing-job.html)
- [StartJobForFastWebResourceEvaluation](/releases/latest/mcp/tools/start-job-for-fast-web-resource-evaluation.html)
- [UpdateJobToContinueWebResourceEvaluation](/releases/latest/mcp/tools/update-job-to-continue-web-resource-evaluation.html)
- [Crawl](/releases/latest/mcp/tools/crawl.html)
- [Scrape](/releases/latest/mcp/tools/scrape.html)
- [JobFetch](/releases/latest/mcp/tools/job-fetch.html)
- [CrawlMdrConfig](/releases/latest/mcp/tools/crawl-mdr-config.html)
- [CrawlMdr](/releases/latest/mcp/tools/crawl-mdr.html)
- [CrawlAllMdr](/releases/latest/mcp/tools/crawl-all-mdr.html)
- [GetCrawlMdrData](/releases/latest/mcp/tools/get-crawl-mdr-data.html)
- [GetJobsInfo](/releases/latest/mcp/tools/get-jobs-info.html)
- [GetJobConfig](/releases/latest/mcp/tools/get-job-config.html)
- [GetDownloadTaskStatus](/releases/latest/mcp/tools/get-download-task-status.html)
- [Retrieve](/releases/latest/mcp/tools/retrieve.html)

# JobConfig Tools

Build and update JobConfig objects for use with [StartJob](./start-job.html): set start URLs, job type, headers/cookies/HTTPS, proxies, error handling, and domain scope. Each tool returns a new or modified JobConfig object. The returned JobConfig object is passed to the next tool call as a required input parameter.

## JobConfigCreate

Creates a new job configuration object.

### Arguments

| Name     | Type   | Description                                                                                                               |
| -------- | ------ | ----------------------------------------------------------------------------------------------------------------------------- |
| jobName  | string | **Required.** Unique job name. Used to identify the job in the system; the domain name is often used (e.g., example.com)       |
| startUrl | string | **Required.** Initial crawling entry point URL                                                                                 |

## JobConfigAddStartUrl

Adds a new start URL to an existing job configuration.

### Arguments

| Name      | Type   | Description                        |
| --------- | ------ | ---------------------------------- |
| jobConfig | object | **Required.** JobConfig object     |
| startUrl  | string | **Required.** Additional start URL |

## JobConfigSetJobType

Sets the job type for the job configuration.

### Arguments

| Name      | Type   | Description                                       |
| --------- | ------ | ------------------------------------------------- |
| jobConfig | object | **Required.** JobConfig object                    |
| jobType   | string | **Required.** Job type ("Internet" or "Intranet") |

## JobConfigHeadersUpsertDefaultHeader

Adds or updates a default HTTP header in the job configuration.
### Arguments

| Name        | Type   | Description                    |
| ----------- | ------ | ------------------------------ |
| jobConfig   | object | **Required.** JobConfig object |
| headerName  | string | **Required.** Header name      |
| headerValue | string | **Required.** Header value     |

## JobConfigRestartSetJobRestartMode

Sets the job restart mode for the job configuration.

### Arguments

| Name           | Type   | Description                                              |
| -------------- | ------ | -------------------------------------------------------- |
| jobConfig      | object | **Required.** JobConfig object                           |
| jobRestartMode | string | **Required.** Restart mode ("Continue" or "FromScratch") |

## JobConfigHttpsSetSuppressHttpsCertificateValidation

Sets whether to suppress HTTPS certificate validation.

### Arguments

| Name                               | Type   | Description                                         |
| ---------------------------------- | ------ | --------------------------------------------------- |
| jobConfig                          | object | **Required.** JobConfig object                      |
| suppressHttpsCertificateValidation | bool   | **Required.** Suppress HTTPS certificate validation |

## JobConfigCookiesSetUseCookies

Sets whether to use cookies for requests in the job configuration.
### Arguments

| Name       | Type   | Description                    |
| ---------- | ------ | ------------------------------ |
| jobConfig  | object | **Required.** JobConfig object |
| useCookies | bool   | **Required.** Use cookies      |

## JobConfigProxySetUseProxy

Sets whether to use a proxy for requests in the job configuration.

### Arguments

| Name      | Type   | Description                    |
| --------- | ------ | ------------------------------ |
| jobConfig | object | **Required.** JobConfig object |
| useProxy  | bool   | **Required.** Use proxy        |

## JobConfigProxySetSendOvertRequestsOnProxiesFailure

Sets whether to send overt requests if all proxies fail.

### Arguments

| Name                              | Type   | Description                                        |
| --------------------------------- | ------ | -------------------------------------------------- |
| jobConfig                         | object | **Required.** JobConfig object                     |
| sendOvertRequestsOnProxiesFailure | bool   | **Required.** Send overt requests on proxy failure |

## JobConfigProxySetIterateProxyResponseCodes

Sets HTTP response codes for which requests should be resent with another proxy.
### Arguments

| Name                      | Type   | Description                                                         |
| ------------------------- | ------ | -------------------------------------------------------------------- |
| jobConfig                 | object | **Required.** JobConfig object                                        |
| iterateProxyResponseCodes | string | **Required.** Comma-separated HTTP response codes (e.g., "401,403")   |

## JobConfigProxyUpsertProxy

Adds or updates a proxy configuration in the job configuration.

### Arguments

| Name             | Type   | Description                                                  |
| ---------------- | ------ | ------------------------------------------------------------ |
| jobConfig        | object | **Required.** JobConfig object                               |
| protocol         | string | **Required.** Proxy protocol ("http", "https", or "socks5")  |
| host             | string | **Required.** Proxy host                                     |
| port             | int    | **Required.** Proxy port                                     |
| userName         | string | Optional. Proxy username                                     |
| password         | string | Optional. Proxy password                                     |
| connectionsLimit | int    | Optional. Max connections                                    |
| availableHosts   | string | Optional. Comma-separated list of available hosts            |

## JobConfigDownloadErrorHandlingSetPolicy

Sets the download error handling policy for the job configuration.
### Arguments

| Name                        | Type   | Description                                               |
| --------------------------- | ------ | --------------------------------------------------------- |
| jobConfig                   | object | **Required.** JobConfig object                            |
| downloadErrorHandlingPolicy | string | **Required.** Policy ("Skip" or "Retry")                  |
| retriesLimit                | int    | Optional. Max retries (if policy is "Retry")              |
| retryDelayMs                | int    | Optional. Delay before retry in ms (if policy is "Retry") |

## JobConfigCrawlersProtectionBypassSetMaxResponseSizeKb

Sets the maximum response size (in KB) for the job configuration.

### Arguments

| Name              | Type   | Description                           |
| ----------------- | ------ | ------------------------------------- |
| jobConfig         | object | **Required.** JobConfig object        |
| maxResponseSizeKb | int    | **Required.** Max response size in KB |

## JobConfigCrawlersProtectionBypassSetMaxRedirectHops

Sets the maximum number of redirect hops for the job configuration.
### Arguments

| Name            | Type   | Description                     |
| --------------- | ------ | ------------------------------- |
| jobConfig       | object | **Required.** JobConfig object  |
| maxRedirectHops | int    | **Required.** Max redirect hops |

## JobConfigCrawlersProtectionBypassSetRequestTimeoutSec

Sets the request timeout (in seconds) for the job configuration.

### Arguments

| Name              | Type   | Description                      |
| ----------------- | ------ | -------------------------------- |
| jobConfig         | object | **Required.** JobConfig object   |
| requestTimeoutSec | int    | **Required.** Timeout in seconds |

## JobConfigCrawlersProtectionBypassUpsertCrawlDelay

Adds or updates a crawl delay for a specific host in the job configuration.

### Arguments

| Name      | Type   | Description                                      |
| --------- | ------ | ------------------------------------------------ |
| jobConfig | object | **Required.** JobConfig object                   |
| host      | string | **Required.** Host for crawl delay               |
| delay     | string | **Required.** Delay value ("0", "1-5", "robots") |

## JobConfigCrossDomainAccessSetPolicy

Sets the cross-domain access policy for the job configuration.
### Arguments

| Name              | Type   | Description                                                 |
| ----------------- | ------ | ----------------------------------------------------------- |
| jobConfig         | object | **Required.** JobConfig object                              |
| crossDomainAccess | string | **Required.** Policy ("None", "Subdomains", "CrossDomains") |

## JobConfigRetrievalConfigSetEnrollInIndex

Sets whether to enroll all crawled pages within the job in the index.

### Arguments

| Name          | Type   | Description                                                                                                                                       |
| ------------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| jobConfig     | object | **Required.** JobConfig object                                                                                                                           |
| enrollInIndex | bool   | **Required.** Enroll crawled pages in the index. If true, all crawled pages within the job will be enrolled in an index and their data will be available for retrieval |

## JobConfigRetrievalConfigSetMaxTokensPerChunk

Sets the maximum number of tokens per chunk for indexing.
### Arguments

| Name | Type | Description |
| --- | --- | --- |
| jobConfig | object | **Required.** JobConfig object |
| maxTokensPerChunk | int | Optional. Max tokens per chunk. The maximum number of tokens per chunk when splitting the content of a page into chunks for indexing |

## JobConfigRetrievalConfigUpsertContentScope

Upserts a content scope into the job configuration. Content scopes let you index only certain parts of the crawled pages.

### Arguments

| Name | Type | Description |
| --- | --- | --- |
| jobConfig | object | **Required.** JobConfig object |
| pathPattern | string | **Required.** Path pattern. A glob-style pattern, as used in file searches. Supports * and ** wildcards. Examples: /products/*, /**/blog/*, etc. |
| selector | string | **Required.** A single content selector or a comma-separated list of them. For instance, 'CSS: selector1' or 'CSS: selector1, selector2, selector3' |

## JobConfigRetrievalConfigSetWaitForEnrollment

Sets the enrollment wait mode on the job configuration, which controls how the job waits for crawled pages to be enrolled in the index.

### Arguments

| Name | Type | Description |
| --- | --- | --- |
| jobConfig | object | **Required.** JobConfig object |
| waitForEnrollment | bool | **Required.** Wait for enrollment. If true, the Crawler will wait for all documents to be enrolled in the index |

## JobConfigRetrievalConfigSetForce

Sets whether to force re-enrollment of all crawled pages within the job.

### Arguments

| Name | Type | Description |
| --- | --- | --- |
| jobConfig | object | **Required.** JobConfig object |
| force | bool | **Required.** Force re-enrollment. If true, all crawled pages within the job will be re-enrolled in the index even if they were previously enrolled or cached |

# StartJob Tool

Starts a new WDS job using a provided JobConfig — validates settings, enqueues initial downloads for the start URLs, and returns tasks to continue processing.

## Arguments

| Name | Type | Description |
| --- | --- | --- |
| jobName | string | **Required.** Unique job name. Identifies the job in the system; the domain name is often used (e.g., example.com) |
| jobConfig | [JobConfig](#jobconfig) | **Required.** Job configuration object containing all job parameters |

### JobConfig

Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).

Fields:

| Name | Type | Description |
| --- | --- | --- |
| StartUrls | array of String | **Required.** Initial URLs. Crawling entry points |
| Type | [JobTypes](#jobtypes) | Optional. Job type |
| Headers | [HeadersConfig](#headersconfig) | Optional. Headers settings |
| Restart | [RestartConfig](#restartconfig) | Optional. Job restart settings |
| Https | [HttpsConfig](#httpsconfig) | Optional. HTTPS settings |
| Cookies | [CookiesConfig](#cookiesconfig) | Optional. Cookies settings |
| Proxy | [ProxiesConfig](#proxiesconfig) | Optional. Proxy settings |
| DownloadErrorHandling | [DownloadErrorHandling](#downloaderrorhandling) | Optional. Download error handling settings |
| CrawlersProtectionBypass | [CrawlersProtectionBypass](#crawlersprotectionbypass) | Optional. Crawler-protection countermeasure settings |
| CrossDomainAccess | [CrossDomainAccess](#crossdomainaccess) | Optional. Cross-domain access settings |
| RetrievalConfig | [RetrievalConfig](#retrievalconfig) | Optional. Retrieval settings |

### JobTypes

> **_NOTE!_** Restrictions on possible values and the default value for all jobs can be configured in the Dapi service.

> **_NOTE!_** The Crawler service must be correctly configured to handle jobs of different types.

Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.

Enumeration values:

| Name | Description |
| --- | --- |
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits |

### HeadersConfig

Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, custom headers, etc.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| DefaultRequestHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers that will be sent with each request |

### HttpHeader

Represents a single HTTP header definition with a name and one or more values.
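As a sketch of how these settings might be expressed, here is a hypothetical HeadersConfig fragment with two HttpHeader entries. Field names follow the tables in this document; the exact JSON casing accepted by the server is an assumption.

```python
import json

# Hypothetical JobConfig fragment: a HeadersConfig with two HttpHeader
# entries. Field names follow the tables above; the exact wire format
# accepted by the server is an assumption.
headers_config = {
    "DefaultRequestHeaders": [
        {"Name": "User-Agent", "Values": ["WDS-Crawler/1.0"]},
        {"Name": "Accept-Language", "Values": ["en-US", "en"]},
    ]
}

print(json.dumps(headers_config, indent=2))
```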
Fields:

| Name | Type | Description |
| --- | --- | --- |
| Name | string | **Required.** Header name |
| Values | array of String | **Required.** Header values |

### RestartConfig

Controls what happens when a job restarts: continue from cached state or rebuild from scratch.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| JobRestartMode | [JobRestartModes](#jobrestartmodes) | **Required.** Job restart mode |

### JobRestartModes

Describes restart strategies and their effect on previously cached data.

Enumeration values:

| Name | Description |
| --- | --- |
| Continue | Reuse cached data and continue crawling and parsing new data |
| FromScratch | Clear cached data and start from scratch |

### HttpsConfig

Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.
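Putting the restart and HTTPS settings together, a hedged JobConfig fragment might look like the following. Field names come from the RestartConfig and HttpsConfig tables; the JSON casing is an assumption.

```python
# Hypothetical JobConfig fragment combining RestartConfig and HttpsConfig.
# Suppressing certificate validation is only advisable for development hosts
# with self-signed certificates.
job_fragment = {
    "Restart": {"JobRestartMode": "Continue"},  # reuse cached data on restart
    "Https": {"SuppressHttpsCertificateValidation": True},
}

assert job_fragment["Restart"]["JobRestartMode"] in ("Continue", "FromScratch")
```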
Fields:

| Field | Type | Description |
| --- | --- | --- |
| SuppressHttpsCertificateValidation | bool | **Required.** Suppress HTTPS certificate validation of a web resource |

### CookiesConfig

Controls cookie persistence between requests to maintain sessions or state across navigations.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseCookies | bool | **Required.** Save and reuse cookies between requests |

### ProxiesConfig

Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseProxy | bool | **Required.** Use proxies for requests |
| SendOvertRequestsOnProxiesFailure | bool | **Required.** Send a request from the host's real IP address if all proxies failed |
| IterateProxyResponseCodes | string | Optional. Comma-separated HTTP response codes to iterate proxies on. Default: '401, 403' |
| Proxies | array of [ProxyConfig](#proxyconfig) | Optional. Proxy configurations. Default: empty array |

### ProxyConfig

Defines an individual proxy endpoint and its connection characteristics.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Protocol | string | **Required.** Proxy protocol (http, https, socks5) |
| Host | string | **Required.** Proxy host |
| Port | int | **Required.** Proxy port |
| UserName | string | Optional. Proxy username |
| Password | string | Optional. Proxy password |
| ConnectionsLimit | int | Optional. Max concurrent connections |
| AvailableHosts | array of String | Optional. Hosts accessible via this proxy |

### DownloadErrorHandling

Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Policy | [DownloadErrorHandlingPolicies](#downloaderrorhandlingpolicies) | **Required.** Error handling policy (Skip, Retry) |
| RetryPolicyParams | [RetryPolicyParams](#retrypolicyparams) | Optional. Retry params |

#### DownloadErrorHandlingPolicies

Available strategies for handling request or network failures during content download.

Enumeration values:

| Name | Description |
| --- | --- |
| Skip | Skip an error and continue crawling |
| Retry | Try again |

#### RetryPolicyParams

Specifies how the crawler performs retries.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| RetriesLimit | int | **Required.** Max retries |
| RetryDelayMs | int | **Required.** Delay before retry in ms |

### CrawlersProtectionBypass

Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.
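For instance, a hedged CrawlersProtectionBypass fragment might look like this. Field names match this section; the concrete values, the JSON casing, and the reading of the Delay values ("1-5" as a seconds range, "robots" as honoring robots.txt guidance) are assumptions.

```python
# Hypothetical CrawlersProtectionBypass fragment (field names from this
# document; values, JSON casing, and Delay semantics are assumptions).
crawlers_protection_bypass = {
    "MaxResponseSizeKb": 2048,       # drop responses larger than ~2 MB
    "MaxRedirectHops": 5,
    "RequestTimeoutSec": 30,
    "CrawlDelays": [
        {"Host": "example.com", "Delay": "1-5"},          # assumed: delay range in seconds
        {"Host": "docs.example.com", "Delay": "robots"},  # assumed: honor robots.txt guidance
    ],
}
```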
Fields:

| Field | Type | Description |
| --- | --- | --- |
| MaxResponseSizeKb | int | Optional. Max response size in KB |
| MaxRedirectHops | int | Optional. Max redirect hops |
| RequestTimeoutSec | int | Optional. Max request timeout in seconds |
| CrawlDelays | array of [CrawlDelay](#crawldelay) | Optional. Crawl delays for hosts |

### CrawlDelay

Per-host throttling rule to space out requests and respect site limits or robots guidance.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Host | string | **Required.** Host |
| Delay | string | **Required.** Delay value (0, 1-5, robots) |

### CrossDomainAccess

Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Policy | [CrossDomainAccessPolicies](#crossdomainaccesspolicies) | **Required.** Cross-domain policy (None, Subdomains, CrossDomains) |

#### CrossDomainAccessPolicies

Domain scoping modes that determine which hosts are considered in-bounds while crawling.
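The scoping rules can be sketched as a small predicate. This is an illustration of the policy semantics described here, not the server's implementation.

```python
# Illustrative sketch (not the server's code) of how the three policies
# scope a candidate host against the main domain.
def in_bounds(host: str, main_domain: str, policy: str) -> bool:
    if policy == "CrossDomains":
        return True  # any domain is allowed
    if policy == "Subdomains":
        return host == main_domain or host.endswith("." + main_domain)
    if policy == "None":
        return host == main_domain  # only the main domain itself
    raise ValueError(f"unknown policy: {policy!r}")

assert in_bounds("sub.example.com", "example.com", "Subdomains")
assert not in_bounds("sub.example.com", "example.com", "None")
assert in_bounds("another.com", "example.com", "CrossDomains")
```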
Enumeration values:

| Name | Description |
| --- | --- |
| None | No subdomain or cross-domain access. Only the main domain is allowed |
| Subdomains | Subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
| CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |

### RetrievalConfig

Configuration for enrolling pages into a vector index for vector search; retrieval is part of the RAG pipeline. RetrievalConfig controls what gets embedded and how enrollment behaves.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| EnrollInIndex | bool | **Required.** Enroll crawled pages into the vector index |
| Force | bool | **Required.** Whether already existing data in the index should be overridden |
| MaxTokensPerChunk | int | Optional. Maximum tokens per chunk. Default: 512 |
| ContentScopes | array of [RetrievalContentScope](#retrievalcontentscope) | Optional. Selectors for page content to enroll. Default: entire page |
| EnrollmentWaitMode | [RetrievalEnrollmentWaitMode](#retrievalenrollmentwaitmode) | Optional. Enrollment wait mode. Default: Eventually |

#### RetrievalContentScope

Defines which parts of which pages are enrolled, using URL path matching and selectors. This lets you enroll only meaningful blocks (e.g., product descriptions, docs body) and ignore noise (menus, footers, ads).

Fields:

| Field | Type | Description |
| --- | --- | --- |
| PathPattern | string | **Required.** URL path pattern (case sensitive). See the examples below for details |
| [Selector](#selector-format) | string | **Required.** Selector for extracting the interesting data on a web page |

PathPattern examples:

| URL | Pattern | Corresponds |
| --- | --- | --- |
| https://example.com/path/to/resource | * | Yes |
| https://example.com/path/to/resource | /* | Yes |
| https://example.com/path/to/resource | /path/to/resource | Yes |
| https://example.com/path/to/resource | /path/to/* | Yes |
| https://example.com/path/to/resource | /path/*/resource | Yes |
| https://example.com/path/to/resource | /**/res* | Yes |
| https://example.com/path/to/resource | /res* | No |
| https://example.com/path/to/resource | /path/to/RESOURCE | No |

#### Selector Format

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second is a selector of the corresponding type.
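A minimal parser for this format could look like the following sketch. It is deliberately naive: it splits the selector list on commas, which would mis-handle XPath expressions that themselves contain commas, and the accepted type names are taken from the supported types listed in this section.

```python
# Naive sketch (not the server's code) of parsing the 'TYPE: selector'
# format. Splitting on commas would mis-handle XPath selectors that
# themselves contain commas.
def parse_selector(raw: str):
    kind, sep, body = raw.partition(":")
    kind = kind.strip().upper()
    if not sep or kind not in ("CSS", "XPATH") or not body.strip():
        raise ValueError(f"bad selector: {raw!r}")
    return kind, [s.strip() for s in body.split(",")]

print(parse_selector("CSS: .product-desc, article p"))
```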
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

#### RetrievalEnrollmentWaitMode

Specifies whether to wait for each crawled document to be enrolled into the index.

Enumeration values:

| Name | Description |
| --- | --- |
| Eventually | Don't wait. Queue for enrollment; the index catches up asynchronously. FAST |
| WaitEach | Wait for each document. Logs an error if not enrolled within 1 minute. SLOW |
| WaitJob | Wait for all document enrollments when the entire job is completed. FAST |

## Return Type

Array of [DownloadTask](#downloadtask)

### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Id | string | **Required.** Task Id |
| Url | string | **Required.** Page URL |

# UpsertJobConfig Tool

Creates or updates a WDS job config without starting a job.

## Arguments

| Name | Type | Description |
| --- | --- | --- |
| jobName | string | **Required.** Unique job name. Identifies the job in the system; the domain name is often used (e.g., example.com) |
| jobConfig | [JobConfig](#jobconfig) | **Required.** Job configuration object containing all job parameters |

### JobConfig

Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).

Fields:

| Name | Type | Description |
| --- | --- | --- |
| StartUrls | array of String | **Required.** Initial URLs. Crawling entry points |
| Type | [JobTypes](#jobtypes) | Optional. Job type |
| Headers | [HeadersConfig](#headersconfig) | Optional. Headers settings |
| Restart | [RestartConfig](#restartconfig) | Optional. Job restart settings |
| Https | [HttpsConfig](#httpsconfig) | Optional. HTTPS settings |
| Cookies | [CookiesConfig](#cookiesconfig) | Optional. Cookies settings |
| Proxy | [ProxiesConfig](#proxiesconfig) | Optional. Proxy settings |
| DownloadErrorHandling | [DownloadErrorHandling](#downloaderrorhandling) | Optional. Download error handling settings |
| CrawlersProtectionBypass | [CrawlersProtectionBypass](#crawlersprotectionbypass) | Optional. Crawler-protection countermeasure settings |
| CrossDomainAccess | [CrossDomainAccess](#crossdomainaccess) | Optional. Cross-domain access settings |
| RetrievalConfig | [RetrievalConfig](#retrievalconfig) | Optional. Retrieval settings |

### JobTypes

> **_NOTE!_** Restrictions on possible values and the default value for all jobs can be configured in the Dapi service.

> **_NOTE!_** The Crawler service must be correctly configured to handle jobs of different types.

Specifies how and where the crawler operates.
Choose the mode that matches the environment your job targets.

Enumeration values:

| Name | Description |
| --- | --- |
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits |

### HeadersConfig

Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, custom headers, etc.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| DefaultRequestHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers that will be sent with each request |

### HttpHeader

Represents a single HTTP header definition with a name and one or more values.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Name | string | **Required.** Header name |
| Values | array of String | **Required.** Header values |

### RestartConfig

Controls what happens when a job restarts: continue from cached state or rebuild from scratch.
Fields:

| Field | Type | Description |
| --- | --- | --- |
| JobRestartMode | [JobRestartModes](#jobrestartmodes) | **Required.** Job restart mode |

### JobRestartModes

Describes restart strategies and their effect on previously cached data.

Enumeration values:

| Name | Description |
| --- | --- |
| Continue | Reuse cached data and continue crawling and parsing new data |
| FromScratch | Clear cached data and start from scratch |

### HttpsConfig

Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| SuppressHttpsCertificateValidation | bool | **Required.** Suppress HTTPS certificate validation of a web resource |

### CookiesConfig

Controls cookie persistence between requests to maintain sessions or state across navigations.
Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseCookies | bool | **Required.** Save and reuse cookies between requests |

### ProxiesConfig

Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseProxy | bool | **Required.** Use proxies for requests |
| SendOvertRequestsOnProxiesFailure | bool | **Required.** Send a request from the host's real IP address if all proxies failed |
| IterateProxyResponseCodes | string | Optional. Comma-separated HTTP response codes to iterate proxies on. Default: '401, 403' |
| Proxies | array of [ProxyConfig](#proxyconfig) | Optional. Proxy configurations. Default: empty array |

### ProxyConfig

Defines an individual proxy endpoint and its connection characteristics.
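A hedged ProxiesConfig fragment with a single ProxyConfig entry might look like this. Field names come from the tables in this document; the hostname, credentials, and JSON casing are placeholder assumptions.

```python
# Hypothetical ProxiesConfig with one ProxyConfig entry. Field names follow
# this document; host, credentials, and JSON casing are placeholders.
proxies_config = {
    "UseProxy": True,
    "SendOvertRequestsOnProxiesFailure": False,
    "IterateProxyResponseCodes": "401, 403, 429",
    "Proxies": [
        {
            "Protocol": "socks5",
            "Host": "proxy.internal.example",
            "Port": 1080,
            "UserName": "crawler",
            "Password": "secret",  # placeholder credential
            "ConnectionsLimit": 8,
            "AvailableHosts": ["example.com"],
        }
    ],
}
```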
Fields:

| Field            | Type            | Description                                        |
|------------------|-----------------|----------------------------------------------------|
| Protocol         | string          | **Required.** Proxy protocol (http, https, socks5) |
| Host             | string          | **Required.** Proxy host                           |
| Port             | int             | **Required.** Proxy port                           |
| UserName         | string          | Optional. Proxy username                           |
| Password         | string          | Optional. Proxy password                           |
| ConnectionsLimit | int             | Optional. Max concurrent connections               |
| AvailableHosts   | array of string | Optional. Hosts accessible via this proxy          |

### DownloadErrorHandling

Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.
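For orientation, the proxy objects above can be combined like this (a sketch in Python dict form; `proxy.internal` and the credentials are made-up values, and the surrounding JobConfig envelope is not shown — check the exact JSON casing against your server):

```python
# Illustrative ProxiesConfig with one ProxyConfig entry, assembled from
# the field tables above. Field names follow the docs; values are made up.
proxies_config = {
    "UseProxy": True,
    "SendOvertRequestsOnProxiesFailure": False,
    "IterateProxyResponseCodes": "401, 403, 429",
    "Proxies": [
        {
            "Protocol": "socks5",           # http, https, or socks5
            "Host": "proxy.internal",       # hypothetical proxy host
            "Port": 1080,
            "UserName": "crawler",          # optional credentials
            "Password": "secret",
            "ConnectionsLimit": 8,          # optional concurrency cap
            "AvailableHosts": ["example.com"],
        }
    ],
}
```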
Fields:

| Field             | Type                                                            | Description                                       |
|-------------------|-----------------------------------------------------------------|---------------------------------------------------|
| Policy            | [DownloadErrorHandlingPolicies](#downloaderrorhandlingpolicies) | **Required.** Error handling policy (Skip, Retry) |
| RetryPolicyParams | [RetryPolicyParams](#retrypolicyparams)                         | Optional. Retry params                            |

#### DownloadErrorHandlingPolicies

Available strategies for handling request or network failures during content download.

Enumeration values:

| Name  | Description                         |
|-------|-------------------------------------|
| Skip  | Skip an error and continue crawling |
| Retry | Try again                           |

#### RetryPolicyParams

Specifies how the crawler performs retries.

Fields:

| Field        | Type | Description                            |
|--------------|------|----------------------------------------|
| RetriesLimit | int  | **Required.** Max retries              |
| RetryDelayMs | int  | **Required.** Delay before retry in ms |

### CrawlersProtectionBypass

Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.

Fields:

| Field             | Type                               | Description                              |
|-------------------|------------------------------------|------------------------------------------|
| MaxResponseSizeKb | int                                | Optional. Max response size in KB        |
| MaxRedirectHops   | int                                | Optional. Max redirect hops              |
| RequestTimeoutSec | int                                | Optional. Max request timeout in seconds |
| CrawlDelays       | array of [CrawlDelay](#crawldelay) | Optional. Crawl delays for hosts         |

### CrawlDelay

Per-host throttling rule to space out requests and respect site limits or robots guidance.

Fields:

| Field | Type   | Description                                |
|-------|--------|--------------------------------------------|
| Host  | string | **Required.** Host                         |
| Delay | string | **Required.** Delay value (0, 1-5, robots) |

### CrossDomainAccess

Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.

Fields:

| Field  | Type                                                    | Description                                                        |
|--------|---------------------------------------------------------|--------------------------------------------------------------------|
| Policy | [CrossDomainAccessPolicies](#crossdomainaccesspolicies) | **Required.** Cross-domain policy (None, Subdomains, CrossDomains) |

#### CrossDomainAccessPolicies

Domain scoping modes that determine which hosts are considered in-bounds while crawling.

Enumeration values:

| Name         | Description                                                                         |
|--------------|-------------------------------------------------------------------------------------|
| None         | No subdomain or cross-domain access. Only the main domain is allowed                |
| Subdomains   | Subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com")  |
| CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |

### RetrievalConfig

Controls what gets embedded and how enrollment behaves: configuration for enrolling crawled pages into a vector index for later vector search. Retrieval is part of the RAG pipeline.

Fields:

| Field              | Type                                                        | Description                                                            |
|--------------------|-------------------------------------------------------------|------------------------------------------------------------------------|
| EnrollInIndex      | bool                                                        | **Required.** Enroll crawled pages into the vector index               |
| Force              | bool                                                        | **Required.** Whether existing data in the index should be overwritten |
| MaxTokensPerChunk  | int                                                         | Optional. Maximum tokens per chunk. Default: 512                       |
| ContentScopes      | array of [RetrievalContentScope](#retrievalcontentscope)    | Optional. Selectors for page content to enroll. Default: entire page   |
| EnrollmentWaitMode | [RetrievalEnrollmentWaitMode](#retrievalenrollmentwaitmode) | Optional. Enrollment wait mode. Default: Eventually                    |

#### RetrievalContentScope

Defines which parts of which pages are enrolled, using URL path matching and selectors. This lets you enroll only meaningful blocks (e.g., product descriptions, docs body) and ignore noise (menus, footers, ads).

Fields:

| Field                        | Type   | Description                                                                          |
|------------------------------|--------|--------------------------------------------------------------------------------------|
| PathPattern                  | string | **Required.** URL path pattern (case sensitive). See the examples below for details. |
| [Selector](#selector-format) | string | **Required.** Selector for getting interesting data on a web page                    |

PathPattern examples:

| URL                                  | Pattern           | Matches |
|--------------------------------------|-------------------|---------|
| https://example.com/path/to/resource | *                 | Yes     |
| https://example.com/path/to/resource | /*                | Yes     |
| https://example.com/path/to/resource | /path/to/resource | Yes     |
| https://example.com/path/to/resource | /path/to/*        | Yes     |
| https://example.com/path/to/resource | /path/*/resource  | Yes     |
| https://example.com/path/to/resource | /**/res*          | Yes     |
| https://example.com/path/to/resource | /res*             | No      |
| https://example.com/path/to/resource | /path/to/RESOURCE | No      |

#### Selector Format

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second must be a selector of the corresponding type.

Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

#### RetrievalEnrollmentWaitMode

Specifies whether to wait for each crawled document to be enrolled into the index.

Enumeration values:

| Name       | Description                                                                 |
|------------|-----------------------------------------------------------------------------|
| Eventually | Don't wait. Queue for enrollment; the index catches up asynchronously. FAST |
| WaitEach   | Wait for each document. Logs an error if not enrolled within 1 minute. SLOW |
| WaitJob    | Wait for all document enrollments when the entire job is completed. FAST    |

# StartExistingJob Tool

Starts a new WDS job using an existing JobConfig — enqueues initial downloads for the start URLs and returns tasks to continue processing.
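As an aside on the PathPattern rules shown earlier: the behavior in the examples table can be reproduced with a small matcher (a sketch only — `path_matches` is a hypothetical helper, not part of the WDS API, and the server's real implementation may differ):

```python
import re

def path_matches(pattern: str, path: str) -> bool:
    """Hypothetical matcher reproducing the PathPattern examples table.

    Both '*' and '**' match any run of characters; matching is
    case-sensitive and anchored to the whole URL path.
    """
    parts = re.split(r"\*+", pattern)               # literal pieces between wildcards
    regex = ".*".join(re.escape(p) for p in parts)  # wildcards become '.*'
    return re.fullmatch(regex, path) is not None
```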
## Arguments

| Name    | Type   | Description                                                                                                              |
|---------|--------|--------------------------------------------------------------------------------------------------------------------------|
| jobName | string | **Required.** Unique job name. Used to identify the job in the system; the domain name is often used (e.g., example.com) |

## Return Type

Array of [DownloadTask](#downloadtask)

### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type   | Description            |
|------|--------|------------------------|
| Id   | string | **Required.** Task Id  |
| Url  | string | **Required.** Page URL |

# StartJobForFastWebResourceEvaluation Tool

Starts a job for fast web resource evaluation. It creates a job with the provided URL as the start URL and with minimal configuration, so crawling can begin right away. Returns a job name.

## Arguments

| Name | Type   | Description                                     |
|------|--------|-------------------------------------------------|
| url  | string | **Required.** Initial URL. Crawling entry point |

## Return Type

string

# UpdateJobToContinueWebResourceEvaluation Tool

Updates an existing job to continue web resource evaluation while respecting the web resource's crawl delays. This allows the job to keep crawling the resource without overloading it.

## Arguments

| Name    | Type   | Description                                                                                                              |
|---------|--------|--------------------------------------------------------------------------------------------------------------------------|
| jobName | string | **Required.** Unique job name. Used to identify the job in the system; the domain name is often used (e.g., example.com) |

# Crawl Tool

Finds links on the current page using a selector and returns new download tasks to continue the crawl; supports notifying on retries.

## Arguments

| Name          | Type                          | Description                                                                                                                                                                                                        |
|---------------|-------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| task          | [DownloadTask](#downloadtask) | **Required.** A task from the previous Start or Crawl tool response                                                                                                                                                |
| selector      | string                        | **Required.** Selector for getting interesting links on a web page                                                                                                                                                 |
| attributeName | string                        | Optional. Attribute name to get data from. Use ```val``` to get inner text. Default value: ```href```                                                                                                              |
| maxDepth      | int                           | Optional. Maximum depth for crawling based on the URL path ('example.com' = 0, 'example.com/index.html' = 0, 'example.com/path/' = 1, etc.). A non-negative integer value. If null, there is no limit on the depth |

##### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second must be a selector of the corresponding type.
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type   | Description            |
|------|--------|------------------------|
| Id   | string | **Required.** Task Id  |
| Url  | string | **Required.** Page URL |

## Return Type

Array of [DownloadTask](#downloadtask)

# Scrape Tool

Extracts text or attribute values from the current page using a selector (and optional attribute), returning the matched values.

## Arguments

| Name          | Type                          | Description                                                                                                                                                                                                                                             |
|---------------|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| task          | [DownloadTask](#downloadtask) | **Required.** A task from the previous Start or Crawl tool response                                                                                                                                                                                     |
| selector      | string                        | **Required.** Selector for getting interesting data on a web page                                                                                                                                                                                       |
| attributeName | string                        | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text                                                                                                                                                                |
| convert       | string                        | Optional. A data conversion function to apply to the scraped data. If not specified, no conversion is applied. Available functions: `md()` - convert to markdown format, `sr()` - apply the Mozilla Readability algorithm to try to extract the main content of the page |

##### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second must be a selector of the corresponding type.

Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type   | Description            |
|------|--------|------------------------|
| Id   | string | **Required.** Task Id  |
| Url  | string | **Required.** Page URL |

## Return Type

Array of String

# JobFetch Tool

Fetches a web page's content using a job config.

## Arguments

| Name    | Type   | Description                                                                               |
|---------|--------|-------------------------------------------------------------------------------------------|
| jobName | string | **Required.** The name of an existing job whose settings will be applied to the HTTP call |
| url     | string | **Required.** A web page URL                                                              |

## Return Type

string

# CrawlMdrConfig Tools

Build and update multi‑level crawl/scrape plans: define the tree structure, link selectors, and field extraction rules for complex extractions. Each tool returns a new or modified CrawlMdrConfig object, which is then passed to the next tool call as a required input parameter.
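The chaining contract described above can be sketched as follows. Here `call_tool` stands in for whatever invocation helper your MCP client provides (a hypothetical function, not part of WDS), and the level names and selectors are made up for illustration; selector strings follow the documented ```CSS|XPATH: selector``` format:

```python
# Sketch of chaining CrawlMdrConfig tools: each call returns a config
# object that is fed into the next call as `crawlMdrConfig`.
def build_catalog_config(call_tool):
    # Start with a new, empty config rooted at '/'.
    cfg = call_tool("CrawlMdrConfigCreate")
    # Add a child level 'products', reached via category links.
    cfg = call_tool(
        "CrawlMdrConfigUpsertSub",
        crawlMdrConfig=cfg,
        path="/products",
        selector="CSS: a.category-link",   # hypothetical selector
    )
    # Extract a 'title' field on the 'products' level.
    cfg = call_tool(
        "CrawlMdrConfigUpsertScrapeParams",
        crawlMdrConfig=cfg,
        path="/products",
        fieldName="title",
        selector="CSS: h1.product-title",  # hypothetical selector
        attributeName="val",               # 'val' means inner text
    )
    # Cap crawl depth for the whole tree.
    return call_tool("CrawlMdrConfigSetMaxDepth", crawlMdrConfig=cfg, maxDepth=2)
```

The resulting object is what the CrawlMdr tool later consumes as its `crawlMdrConfig` argument.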
## CrawlMdrConfigCreate

Creates a new, empty CrawlMdrConfig object with path `/`.

### Arguments

None

## CrawlMdrConfigUpsertSub

Adds or updates a child level and the transition crawl parameters used to reach it.

##### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second must be a selector of the corresponding type.

Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

### Arguments

| Name           | Type   | Description                                                                                                                                                     |
|----------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| crawlMdrConfig | object | **Required.** MDR configuration object from the previous tool call                                                                                              |
| path           | string | **Required.** Path to a level in the MDR tree. It must start with `/` and contain at least one step. Steps are separated by `/`. The path must not end with `/` |
| selector       | string | **Required.** Selector for getting interesting links on a web page                                                                                              |
| attributeName  | string | Optional. Attribute name to get data from. Use ```val``` to get inner text. Default value: ```href```                                                           |

## CrawlMdrConfigUpsertCrawlParams

Adds or updates link selectors for a specific MDR level.
### Arguments

| Name           | Type   | Description                                                                                                                                                     |
|----------------|--------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| crawlMdrConfig | object | **Required.** MDR configuration object from the previous tool call                                                                                              |
| path           | string | **Required.** Path to a level in the MDR tree. It must start with `/` and contain at least one step. Steps are separated by `/`. The path must not end with `/` |
| selector       | string | **Required.** Selector for getting interesting links on a web page                                                                                              |
| attributeName  | string | Optional. Attribute name to get data from. Use ```val``` to get inner text. Default value: ```href```                                                           |

## CrawlMdrConfigUpsertScrapeParams

Adds or updates a field’s selector/attribute for a specific MDR level.

### Arguments

| Name           | Type   | Description                                                                                                                                                                                                                                             |
|----------------|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| crawlMdrConfig | object | **Required.** MDR configuration object from the previous tool call                                                                                                                                                                                      |
| path           | string | **Required.** Path to a level in the MDR tree. It must start with `/` and contain at least one step. Steps are separated by `/`. The path must not end with `/`                                                                                          |
| fieldName      | string | **Required.** Name of the data field that will contain data scraped according to the provided selector and attribute name                                                                                                                               |
| selector       | string | **Required.** Selector for getting interesting data on a web page                                                                                                                                                                                       |
| attributeName  | string | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text                                                                                                                                                                |
| convert        | string | Optional. A data conversion function to apply to the scraped data. If not specified, no conversion is applied. Available functions: `md()` - convert to markdown format, `sr()` - apply the Mozilla Readability algorithm to try to extract the main content of the page |

## CrawlMdrConfigSetMaxDepth

Sets the maximum depth for crawling in the CrawlMdrConfig tree.

### Arguments

| Name           | Type   | Description                                                                                                                                                                                                       |
|----------------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| crawlMdrConfig | object | **Required.** MDR configuration object from the previous tool call                                                                                                                                                |
| maxDepth       | int    | Optional. Maximum depth for crawling based on the URL path ('example.com' = 0, 'example.com/index.html' = 0, 'example.com/path/' = 1, etc.). A non-negative integer value. If null, there is no limit on the depth |

# CrawlMdr Tool

Performs recursive crawling and scraping based on a hierarchical configuration: follows links, extracts fields per level, and returns a cursor to stream large result sets.

## Arguments

| Name           | Type                                   | Description                                                         |
|----------------|----------------------------------------|---------------------------------------------------------------------|
| tasks          | array of [DownloadTask](#downloadtask) | **Required.** Initial download tasks (from StartJob)                |
| crawlMdrConfig | [CrawlMdrConfig](#crawlmdrconfig)      | **Required.** Crawl Multi-Dimensional Recursive (MDR) configuration |

### DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type   | Description            |
|------|--------|------------------------|
| Id   | string | **Required.** Task Id  |
| Url  | string | **Required.** Page URL |

### CrawlMdrConfig

Hierarchical crawl/scrape plan that defines fields to extract, link selectors, and child levels.
| Name               | Type                                               | Description                                                                 | MCP Tools                                                   |
|--------------------|----------------------------------------------------|-----------------------------------------------------------------------------|-------------------------------------------------------------|
| Name               | string                                             | **Required.** Name of the level (e.g., '/', 'products', etc.)               | Set via CrawlMdrConfigCreate, CrawlMdrConfigUpsertSub tools |
| ScrapeParams       | array of [ScrapeParams](#scrapeparams)             | List of data fields to extract                                              | Set via CrawlMdrConfigUpsertScrapeParams tool               |
| CrawlParams        | array of [CrawlParams](#crawlparams)               | List of link selectors for crawling on the current level                    | Set via CrawlMdrConfigUpsertCrawlParams tool                |
| SubCrawlMdrConfigs | array of [SubCrawlMdrConfigs](#subcrawlmdrconfigs) | List of sub-levels (child pages/sections), with transition crawl parameters | Set via CrawlMdrConfigUpsertSub tool                        |

##### Remarks

The selector argument is a selector of the following format: ```CSS|XPATH: selector```. The first part defines the selector type; the second must be a selector of the corresponding type.
Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

### ScrapeParams

| Name      | Type   | Description                                                                              |
|-----------|--------|------------------------------------------------------------------------------------------|
| FieldName | string | **Required.** Name of the data field to extract                                          |
| Selector  | string | **Required.** Selector for getting interesting data on a web page                        |
| Attribute | string | Optional. Attribute name to get data from. Use ```val``` or leave null to get inner text |

### CrawlParams

| Name      | Type   | Description                                                                                           |
|-----------|--------|-------------------------------------------------------------------------------------------------------|
| Selector  | string | **Required.** Selector for getting interesting links on a web page                                    |
| Attribute | string | Optional. Attribute name to get data from. Use ```val``` to get inner text. Default value: ```href``` |

### SubCrawlMdrConfigs

A child [CrawlMdrConfig](#crawlmdrconfig) that includes the transition crawl parameters used to reach the sublevel.
| Name           | Type                        | Description                                                     |
|----------------|-----------------------------|-----------------------------------------------------------------|
| SubCrawlParams | [CrawlParams](#crawlparams) | **Required.** Transition crawl parameters to move to a sublevel |

## Return Type

Returns a [CrawlMdrResult](#crawlmdrresult)

### CrawlMdrResult

Represents the result of a crawl operation.

| Name                        | Type                                               | Description                                                             |
|-----------------------------|----------------------------------------------------|-------------------------------------------------------------------------|
| FailedDownloadTasks         | array of [FailedDownloadTask](#faileddownloadtask) | **Required.** List of failed tasks grouped by their parent pages' URLs  |
| FailedDownloadTaskCount     | int                                                | **Required.** Number of failed download tasks                           |
| SuccessfulDownloadTaskCount | int                                                | **Required.** Number of successful download tasks                       |
| DataCursor                  | [CrawlMdrDataCursor](#crawlmdrdatacursor)          | Optional. Cursor for fetching batches of scraped data (null if no data) |

#### FailedDownloadTask

| Name                  | Type                                   | Description                         |
|-----------------------|----------------------------------------|-------------------------------------|
| ParentDownloadTaskUrl | string                                 | **Required.** Parent page URL       |
| FailedDownloadTasks   | array of [DownloadTask](#downloadtask) | **Required.** Failed download tasks |

#### CrawlMdrDataCursor

Cursor for fetching batches of scraped data.

| Name       | Type   | Description                                                                                    |
|------------|--------|------------------------------------------------------------------------------------------------|
| JobId      | string | **Required.** Job Id                                                                           |
| Path       | string | **Required.** Path to a level in the MDR tree. Defines the level at which data should be built |
| NextCursor | string | Optional. Cursor for fetching the next batch of scraped data (null if done)                    |

# CrawlAllMdr Tool

Starts crawling all data from a web resource using the job with the provided name. Returns a cursor to the beginning of the data batch.

## Arguments

| Name     | Type   | Description                                                                                                                                                                                                                                             |
|----------|--------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| jobName  | string | **Required.** Unique job name. Used to identify the job in the system; the domain name is often used (e.g., example.com)                                                                                                                                |
| convert  | string | Optional. A data conversion function to apply to the scraped data. If not specified, no conversion is applied. Available functions: `md()` - convert to markdown format, `sr()` - apply the Mozilla Readability algorithm to try to extract the main content of the page |
| maxDepth | string | Optional. Maximum depth for crawling based on the URL path ('example.com' = 0, 'example.com/index.html' = 0, 'example.com/path/' = 1, etc.). A non-negative integer value. If null, there is no limit on the depth                                        |

## Return Type

Returns a [CrawlMdrResult](#crawlmdrresult)

### CrawlMdrResult

Represents the result of a crawl operation.

| Name                        | Type                                               | Description                                                             |
|-----------------------------|----------------------------------------------------|-------------------------------------------------------------------------|
| FailedDownloadTasks         | array of [FailedDownloadTask](#faileddownloadtask) | **Required.** List of failed tasks grouped by their parent pages' URLs  |
| FailedDownloadTaskCount     | int                                                | **Required.** Number of failed download tasks                           |
| SuccessfulDownloadTaskCount | int                                                | **Required.** Number of successful download tasks                       |
| DataCursor                  | [CrawlMdrDataCursor](#crawlmdrdatacursor)          | Optional. Cursor for fetching batches of scraped data (null if no data) |
| Optional. Cursor for fetching batches of scraped data (null if no data) | |-----------------------------| ----------------------------------------------- |--------------------------------------------------------------------------| #### FailedDownloadTask | Name | Type | Description | | --------------------- | -------------------------------------- | ----------------------------------- | | ParentDownloadTaskUrl | string | **Required.** Parent page URL | | --------------------- | -------------------------------------- | ----------------------------------- | | FailedDownloadTasks | array of [DownloadTask](#downloadtask) | **Required.** Failed download tasks | | --------------------- | -------------------------------------- | ----------------------------------- | #### CrawlMdrDataCursor Cursor for fetching batches of scraped data | Name | Type | Description | |--------------|--------|----------------------------------------------------------------------------------------------| | JobId | string | **Required.** Job Id | |--------------|--------|----------------------------------------------------------------------------------------------| | Path | string | **Required.** Path to a level in the MDR tree. Defines a level at which data should be built | |--------------|--------|----------------------------------------------------------------------------------------------| | NextCursor | string | Optional. Cursor for fetching the next batch of scraped data (null if done) | |--------------|--------|----------------------------------------------------------------------------------------------| # GetCrawlMdrData Tool Fetches the next batch of scraped data using a data cursor from CrawlMdr, returning JSON documents and an updated cursor when more data remains. 
## Arguments

| Name               | Type                                      | Description                                                         |
|--------------------|-------------------------------------------|---------------------------------------------------------------------|
| dataCursor         | [CrawlMdrDataCursor](#crawlmdrdatacursor) | **Required.** Cursor from CrawlMdrResult for fetching data batches. |
| downloadTasksCount | int                                       | **Required.** Number of download tasks to process in this request. In most cases one download task corresponds to one document. However, if a table was handled, each download task returns as many documents as there were entries in the table. Additionally, if there are multiple data objects at each level, the document count is the product of the counts at each level |
| path               | string                                    | **Required.** Path to a level in the MDR tree. It must start with `/` and contain at least one step; steps are separated by `/`. The path must not end with `/` |

## Return Type

Returns a `CrawlMdrData` object containing the scraped data and a cursor for the next batch.

| Name       | Type                                      | Description                                                                  |
|------------|-------------------------------------------|------------------------------------------------------------------------------|
| Data       | array of string                           | **Required.** Array of scraped data objects in JSON format.                  |
| DataCursor | [CrawlMdrDataCursor](#crawlmdrdatacursor) | Optional. Cursor for fetching the next batch of data (null if no more data). |

### CrawlMdrDataCursor

Cursor for fetching batches of scraped data

| Name       | Type   | Description                                                                                    |
|------------|--------|------------------------------------------------------------------------------------------------|
| JobId      | string | **Required.** Job Id                                                                           |
| Path       | string | **Required.** Path to a level in the MDR tree. Defines the level at which data should be built |
| NextCursor | string | Optional. Cursor for fetching the next batch of scraped data (null if done)                    |

# GetJobsInfo Tool

Returns an array of job info objects.

## Return Type

Returns an array of [JobInfo](#jobinfo)

### JobInfo

Job info.

#### Fields

| Name            | Type     | Description                       |
|-----------------|----------|-----------------------------------|
| JobId           | string   | **Required.** Job ID.             |
| JobName         | string   | **Required.** Job Name.           |
| Host            | string   | **Required.** Web resource host   |
| StartDateUtc    | datetime | Optional. Job start date (UTC)    |
| CompleteDateUtc | datetime | Optional. Job complete date (UTC) |

# GetJobConfig Tool

Returns the job config for a particular job.
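As an illustration of the returned shape, the sketch below fetches a config and reads a couple of fields. The `call_tool` wrapper and the canned response are hypothetical; field names follow the JobConfig reference below.

```python
def call_tool(name, **arguments):
    """Hypothetical MCP-client wrapper replaying a canned JobConfig."""
    assert name == "GetJobConfig"
    return {
        "StartUrls": ["https://example.com/"],
        "Type": "internet",
        "Cookies": {"UseCookies": True},
        "Restart": {"JobRestartMode": "Continue"},
    }

config = call_tool("GetJobConfig", jobName="example.com")

# StartUrls is the only required field; every other section may be absent,
# so defensive access with defaults is appropriate.
entry_points = config["StartUrls"]
uses_cookies = config.get("Cookies", {}).get("UseCookies", False)

assert entry_points == ["https://example.com/"]
assert uses_cookies is True
```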
## Arguments

| Name    | Type   | Description |
|---------|--------|-------------|
| jobName | string | **Required.** Unique job name. Used to identify the job in the system; the domain name is often used (e.g., example.com) |

## Return Type

Returns a [JobConfig](#jobconfig)

### JobConfig

Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).

Fields:

| Name                     | Type                                                  | Description                                          |
|--------------------------|-------------------------------------------------------|------------------------------------------------------|
| StartUrls                | array of String                                       | **Required.** Initial URLs. Crawling entry points    |
| Type                     | [JobTypes](#jobtypes)                                 | Optional. Job type                                   |
| Headers                  | [HeadersConfig](#headersconfig)                       | Optional. Headers settings                           |
| Restart                  | [RestartConfig](#restartconfig)                       | Optional. Job restart settings                       |
| Https                    | [HttpsConfig](#httpsconfig)                           | Optional. HTTPS settings                             |
| Cookies                  | [CookiesConfig](#cookiesconfig)                       | Optional. Cookies settings                           |
| Proxy                    | [ProxiesConfig](#proxiesconfig)                       | Optional. Proxy settings                             |
| DownloadErrorHandling    | [DownloadErrorHandling](#downloaderrorhandling)       | Optional. Download error handling settings           |
| CrawlersProtectionBypass | [CrawlersProtectionBypass](#crawlersprotectionbypass) | Optional. Crawler protection countermeasure settings |
| CrossDomainAccess        | [CrossDomainAccess](#crossdomainaccess)               | Optional. Cross-domain access settings               |
| RetrievalConfig          | [RetrievalConfig](#retrievalconfig)                   | Optional. Retrieval settings                         |

### JobTypes

> **_NOTE!_** Value restrictions and the default value for all jobs can be configured in the Dapi service.

> **_NOTE!_** The Crawler service must be correctly configured to handle jobs of different types.

Specifies how and where the crawler operates.
Choose the mode that matches the environment your job targets.

Enumeration values:

| Name     | Description                                                                                      |
|----------|--------------------------------------------------------------------------------------------------|
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits                                                  |

### HeadersConfig

Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, custom headers, etc.

Fields:

| Field                 | Type                               | Description                                                    |
|-----------------------|------------------------------------|----------------------------------------------------------------|
| DefaultRequestHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers that will be sent with each request |

### HttpHeader

Represents a single HTTP header definition with a name and one or more values.

Fields:

| Name   | Type            | Description                 |
|--------|-----------------|-----------------------------|
| Name   | string          | **Required.** Header name   |
| Values | array of String | **Required.** Header values |

### RestartConfig

Controls what happens when a job restarts: continue from cached state or rebuild from scratch.
Fields:

| Field          | Type                                | Description                    |
|----------------|-------------------------------------|--------------------------------|
| JobRestartMode | [JobRestartModes](#jobrestartmodes) | **Required.** Job restart mode |

### JobRestartModes

Describes restart strategies and their effect on previously cached data.

Enumeration values:

| Name        | Description                                                  |
|-------------|--------------------------------------------------------------|
| Continue    | Reuse cached data and continue crawling and parsing new data |
| FromScratch | Clear cached data and start from scratch                     |

### HttpsConfig

Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.

Fields:

| Field                              | Type | Description                                                           |
|------------------------------------|------|-----------------------------------------------------------------------|
| SuppressHttpsCertificateValidation | bool | **Required.** Suppress HTTPS certificate validation of a web resource |

### CookiesConfig

Controls cookie persistence between requests to maintain sessions or state across navigations.
Fields:

| Field      | Type | Description                                           |
|------------|------|-------------------------------------------------------|
| UseCookies | bool | **Required.** Save and reuse cookies between requests |

### ProxiesConfig

Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.

Fields:

| Field                             | Type                                 | Description                                                                            |
|-----------------------------------|--------------------------------------|----------------------------------------------------------------------------------------|
| UseProxy                          | bool                                 | **Required.** Use proxies for requests                                                 |
| SendOvertRequestsOnProxiesFailure | bool                                 | **Required.** Send requests from the host's real IP address if all proxies have failed |
| IterateProxyResponseCodes         | string                               | Optional. Comma-separated HTTP response codes that trigger iterating to the next proxy. Default: '401, 403' |
| Proxies                           | array of [ProxyConfig](#proxyconfig) | Optional. Proxy configurations. Default: empty array                                   |

### ProxyConfig

Defines an individual proxy endpoint and its connection characteristics.
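Putting the two together, a ProxiesConfig fragment with a single endpoint might look like the sketch below. It is built from the field tables in this section; all values are placeholders, not a validated configuration.

```python
# Sketch of a ProxiesConfig fragment (placeholder values).
proxies_config = {
    "UseProxy": True,
    # Do NOT fall back to the host's real IP if every proxy fails.
    "SendOvertRequestsOnProxiesFailure": False,
    # Rotate to the next proxy on these HTTP status codes.
    "IterateProxyResponseCodes": "401, 403",
    "Proxies": [
        {
            "Protocol": "socks5",
            "Host": "proxy.internal.example",      # placeholder host
            "Port": 1080,
            "UserName": "crawler",                 # optional
            "Password": "secret",                  # optional
            "ConnectionsLimit": 8,                 # optional
            "AvailableHosts": ["example.com"],     # optional
        }
    ],
}

assert proxies_config["Proxies"][0]["Port"] == 1080
```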
Fields:

| Field            | Type            | Description                                        |
|------------------|-----------------|----------------------------------------------------|
| Protocol         | string          | **Required.** Proxy protocol (http, https, socks5) |
| Host             | string          | **Required.** Proxy host                           |
| Port             | int             | **Required.** Proxy port                           |
| UserName         | string          | Optional. Proxy username                           |
| Password         | string          | Optional. Proxy password                           |
| ConnectionsLimit | int             | Optional. Max concurrent connections               |
| AvailableHosts   | array of String | Optional. Hosts accessible via this proxy          |

### DownloadErrorHandling

Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.
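For example, a retry policy with three attempts and a two-second delay could be expressed as the fragment below (a sketch; field names come from the tables in this section, values are placeholders):

```python
# Sketch of a DownloadErrorHandling fragment: retry transient
# failures up to 3 times, waiting 2000 ms between attempts.
error_handling = {
    "Policy": "Retry",
    "RetryPolicyParams": {
        "RetriesLimit": 3,
        "RetryDelayMs": 2000,
    },
}

assert error_handling["RetryPolicyParams"]["RetriesLimit"] == 3
```

With `"Policy": "Skip"`, the `RetryPolicyParams` block would be omitted entirely.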
Fields:

| Field             | Type                                                            | Description                                       |
|-------------------|-----------------------------------------------------------------|---------------------------------------------------|
| Policy            | [DownloadErrorHandlingPolicies](#downloaderrorhandlingpolicies) | **Required.** Error handling policy (Skip, Retry) |
| RetryPolicyParams | [RetryPolicyParams](#retrypolicyparams)                         | Optional. Retry params                            |

#### DownloadErrorHandlingPolicies

Available strategies for handling request or network failures during content download.

Enumeration values:

| Name  | Description                          |
|-------|--------------------------------------|
| Skip  | Skip the error and continue crawling |
| Retry | Try again                            |

#### RetryPolicyParams

Specifies how the crawler performs retries.

Fields:

| Field        | Type | Description                            |
|--------------|------|----------------------------------------|
| RetriesLimit | int  | **Required.** Max retries              |
| RetryDelayMs | int  | **Required.** Delay before retry in ms |

### CrawlersProtectionBypass

Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.

Fields:

| Field             | Type                               | Description                              |
|-------------------|------------------------------------|------------------------------------------|
| MaxResponseSizeKb | int                                | Optional. Max response size in KB        |
| MaxRedirectHops   | int                                | Optional. Max redirect hops              |
| RequestTimeoutSec | int                                | Optional. Max request timeout in seconds |
| CrawlDelays       | array of [CrawlDelay](#crawldelay) | Optional. Crawl delays for hosts         |

### CrawlDelay

Per-host throttling rule to space out requests and respect site limits or robots guidance.

Fields:

| Field | Type   | Description                                      |
|-------|--------|--------------------------------------------------|
| Host  | string | **Required.** Host                               |
| Delay | string | **Required.** Delay value (`0`, `1-5`, `robots`) |

### CrossDomainAccess

Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.

Fields:

| Field  | Type                                                    | Description                                                        |
|--------|---------------------------------------------------------|--------------------------------------------------------------------|
| Policy | [CrossDomainAccessPolicies](#crossdomainaccesspolicies) | **Required.** Cross-domain policy (None, Subdomains, CrossDomains) |

#### CrossDomainAccessPolicies

Domain scoping modes that determine which hosts are considered in-bounds while crawling.

Enumeration values:

| Name         | Description                                                                          |
|--------------|--------------------------------------------------------------------------------------|
| None         | No subdomain or cross-domain access. Only the main domain is allowed                 |
| Subdomains   | Subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com")   |
| CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com")  |

### RetrievalConfig

Configuration for enrolling crawled pages into a vector index for vector search; it controls what gets embedded and how enrollment behaves. Retrieval is part of the RAG pipeline.

Fields:

| Field              | Type                                                        | Description                                                               |
|--------------------|-------------------------------------------------------------|---------------------------------------------------------------------------|
| EnrollInIndex      | bool                                                        | **Required.** Enroll crawled pages into the vector index.                 |
| Force              | bool                                                        | **Required.** Whether already existing data in the index should be overridden |
| MaxTokensPerChunk  | int                                                         | Optional. Maximum tokens per chunk. Default: 512.                         |
| ContentScopes      | array of [RetrievalContentScope](#retrievalcontentscope)    | Optional. Selectors for page content to enroll. Default: entire page.     |
| EnrollmentWaitMode | [RetrievalEnrollmentWaitMode](#retrievalenrollmentwaitmode) | Optional. Enrollment wait mode. Default: Eventually.                      |

#### RetrievalContentScope

Defines which parts of which pages are enrolled, using URL path matching and selectors. This lets you enroll only meaningful blocks (e.g., product descriptions, docs body) and ignore noise (menus, footers, ads).

Fields:

| Field                        | Type   | Description                                                                          |
|------------------------------|--------|--------------------------------------------------------------------------------------|
| PathPattern                  | string | **Required.** URL path pattern (case sensitive). See the examples below for details. |
| [Selector](#selector-format) | string | **Required.** Selector for getting interesting data on a web page                    |

PathPattern examples:

| URL                                  | Pattern           | Matches |
|--------------------------------------|-------------------|---------|
| https://example.com/path/to/resource | *                 | Yes     |
| https://example.com/path/to/resource | /*                | Yes     |
| https://example.com/path/to/resource | /path/to/resource | Yes     |
| https://example.com/path/to/resource | /path/to/*        | Yes     |
| https://example.com/path/to/resource | /path/*/resource  | Yes     |
| https://example.com/path/to/resource | /**/res*          | Yes     |
| https://example.com/path/to/resource | /res*             | No      |
| https://example.com/path/to/resource | /path/to/RESOURCE | No      |

#### Selector Format

The selector argument is a selector of the following format: `CSS|XPATH: selector`. The first part defines the selector type; the second must be a selector of the corresponding type.

Supported types:

- [CSS](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_selectors)
- [XPATH](https://developer.mozilla.org/en-US/docs/Web/XML/XPath)

#### RetrievalEnrollmentWaitMode

Specifies whether to wait for each crawled document to be enrolled into the index.

Enumeration values:

| Name       | Description                                                                               |
|------------|-------------------------------------------------------------------------------------------|
| Eventually | Don't wait. Queue for enrollment; the index catches up asynchronously. FAST               |
| WaitEach   | Wait for each document. Logs an error if a document is not enrolled within 1 minute. SLOW |
| WaitJob    | Wait for all document enrollments when the entire job is completed. FAST                  |

# GetDownloadTaskStatus Tool

Retrieves the current status and request/response details for a download task, including errors and intermediate attempts, for troubleshooting and monitoring.

## Arguments

| Name   | Type   | Description                                         |
|--------|--------|-----------------------------------------------------|
| taskId | string | **Required.** The ID of the download task to check. |

## Return Type

Returns a [DownloadTaskStatus](#downloadtaskstatus)

### DownloadTaskStatus

Download task execution status

#### Fields

| Name            | Type                                      | Description                                            |
|-----------------|-------------------------------------------|--------------------------------------------------------|
| TaskState       | [DownloadTaskStates](#downloadtaskstates) | Optional. Task state                                   |
| Result          | [DownloadInfo](#downloadinfo)             | Optional. Download result                              |
| intermedResults | array of [DownloadInfo](#downloadinfo)    | Optional. Intermediate requests download results stack |

### DownloadTaskStates

Download task states enumeration.
#### Values

| Name                     | Description                                                                                |
| ------------------------ | ------------------------------------------------------------------------------------------ |
| Handled                  | Task is handled and its results are available                                              |
| AccessDeniedForRobots    | Access to a URL is denied by robots.txt                                                    |
| AllRequestGatesExhausted | All request gateways (proxy and host IP addresses) were exhausted but no data was received |
| Created                  | Task has not been started yet                                                              |
| InProgress               | Task is in progress                                                                        |
| Deleted                  | Task has been deleted                                                                      |

### DownloadInfo

Download attempt information

#### Fields

| Name            | Type                               | Description                                                                                |
| --------------- | ---------------------------------- | ------------------------------------------------------------------------------------------ |
| Method          | string                             | **Required.** HTTP method                                                                  |
| Url             | string                             | **Required.** Request URL                                                                  |
| IsSuccess       | bool                               | **Required.** Was the request successful                                                   |
| HttpStatusCode  | int                                | **Required.** [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) |
| ReasonPhrase    | string                             | **Required.** HTTP reason phrase                                                           |
| RequestHeaders  | array of [HttpHeader](#httpheader) | **Required.** HTTP headers sent with the request                                           |
| ResponseHeaders | array of [HttpHeader](#httpheader) | **Required.** HTTP headers received in the response                                        |
| RequestCookies  | array of [Cookie](#cookie)         | **Required.** Cookies sent with the request                                                |
| ResponseCookies | array of [Cookie](#cookie)         | **Required.** Cookies received in the response                                             |
| RequestDateUtc  | datetime                           | **Required.** Request date and time in UTC                                                 |
| DownloadTimeSec | double                             | **Required.** Download time in seconds                                                     |
| ViaProxy        | bool                               | **Required.** Was the request made via a proxy                                             |
| WaitTimeSec     | double                             | **Required.** Delay in seconds before the request was executed (crawl latency, etc.)       |
| CrawlDelaySec   | int                                | **Required.** A delay in seconds applied to the request                                    |

### HttpHeader

HTTP header

#### Fields

| Name   | Type            | Description                 |
| ------ | --------------- | --------------------------- |
| Name   | string          | **Required.** Header name   |
| Values | array of string | **Required.** Header values |

### Cookie

[Cookies](https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies)

#### Fields

| Name     | Type     | Description                                                                                                     |
| -------- | -------- | --------------------------------------------------------------------------------------------------------------- |
| Name     | string   | **Required.** [Name](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#attributes)            |
| Value    | string   | **Required.** [Value](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#attributes)           |
| Domain   | string   | **Required.** [Domain](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#domaindomain-value)  |
| Path     | string   | **Required.** [Path](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#pathpath-value)        |
| HttpOnly | bool     | **Required.** [HttpOnly](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#httponly)          |
| Secure   | bool     | **Required.** [Secure](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#secure)              |
| Expires  | datetime | Optional. [Expires](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#expiresdate)            |

# Retrieve Tool

Retrieves relevant data from the indexed web resources to augment the user prompt.
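For orientation, here is a minimal sketch of what a call to this tool could look like from an MCP client. The argument names and defaults come from the Arguments table; the tool name `retrieve`, the job name, and the JSON-RPC envelope are illustrative assumptions based on the standard MCP `tools/call` shape.

```python
import json

# Hypothetical MCP "tools/call" request for the Retrieve tool. The
# "arguments" keys match the documented Arguments table; the tool name
# is an assumption -- check your server's tool list.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "retrieve",  # assumed tool name
        "arguments": {
            "query": "shipping options for oversized items",  # required
            "jobName": "store-crawl",    # optional: restrict to one job
            "threshold": "same-domain",  # optional: the documented default
            "limit": 10,                 # optional: the documented default
        },
    },
}
print(json.dumps(request, indent=2))
```

Omitting `jobName` widens the scope to all jobs in the current tenant, as described below.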
## Arguments

| Name       | Type   | Description                                                                                                                                                                                                        |
| ---------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| query      | string | **Required.** A query to retrieve relevant data for                                                                                                                                                                |
| jobName    | string | Optional. Scope of data to use for augmentation. If specified, all data from the job with the given name is used for augmentation. If not, the data from all jobs in the current tenant is used                     |
| threshold  | string | Optional. A similarity threshold for filtering retrieval results: `exact-match`, `same-category`, `same-domain`, `generic-similarity`. If not specified, the default value is `same-domain`                         |
| limit      | int    | Optional. The maximum number of retrieval results to return. If not specified, the default value is 10                                                                                                             |
| nextCursor | string | Optional. A cursor for pagination. If specified, the retrieval continues from the position indicated by the cursor. If not specified, the retrieval starts from the beginning                                       |

## Return Type

Array of [RetrievalItem](#retrievalitem)

### RetrievalItem

| Name          | Type                                           | Description                                               |
|---------------|------------------------------------------------|-----------------------------------------------------------|
| Score         | double                                         | **Required.** Retrieval item score                        |
| Span          | string                                         | **Required.** Found text with its semantic context        |
| DownloadTasks | array of [DownloadTaskInfo](#downloadtaskinfo) | **Required.** Download tasks that contain the found text  |

#### DownloadTaskInfo

| Name           | Type     | Description                      |
|----------------|----------|----------------------------------|
| DownloadTaskId | string   | **Required.** Download task ID   |
| Url            | string   | **Required.** Page URL           |
| CaptureDateUtc | datetime | **Required.** Capture date (UTC) |
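To illustrate the return shape, a small sketch that filters an array of `RetrievalItem` objects by `Score` and collects the source page URLs. The field names follow the schema above; the values are invented sample data, and the 0.8 score cutoff is an arbitrary choice for the example.

```python
# A response shaped like the RetrievalItem schema; all values are
# invented sample data.
items = [
    {
        "Score": 0.92,
        "Span": "Returns are accepted within 30 days of delivery.",
        "DownloadTasks": [
            {
                "DownloadTaskId": "dt-001",
                "Url": "https://example.com/help/returns",
                "CaptureDateUtc": "2024-05-01T08:30:00Z",
            }
        ],
    },
    {
        "Score": 0.41,
        "Span": "Gift cards cannot be redeemed for cash.",
        "DownloadTasks": [
            {
                "DownloadTaskId": "dt-002",
                "Url": "https://example.com/help/gift-cards",
                "CaptureDateUtc": "2024-05-01T08:31:00Z",
            }
        ],
    },
]

# Keep only high-scoring spans and collect the pages they were found on.
relevant = [item for item in items if item["Score"] >= 0.8]
source_urls = sorted({t["Url"] for item in relevant for t in item["DownloadTasks"]})
print(source_urls)  # ['https://example.com/help/returns']
```

Because a span can occur on several pages, `DownloadTasks` is an array; deduplicating the URLs, as above, is a common first step before citing sources back to the user.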