Jobs
Create and control crawl/scrape jobs: start with a configuration and receive initial tasks to drive further processing.
Start
Creates or updates a job with the provided configuration and enqueues initial download tasks for the specified start URLs.
POST /api/v2/jobs/{jobName}/start
Path Parameters
Name | Type | Description |
---|---|---|
jobName | string | Required. A unique job name; the target domain name is often used (e.g., example.com) |
Request Body
JobConfig
Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).
Fields:
Name | Type | Description |
---|---|---|
StartUrls | Array of String | Required. Initial URLs that serve as crawling entry points |
Type | JobTypes | Optional. Job type |
Headers | HeadersConfig | Optional. Headers settings |
Restart | RestartConfig | Optional. Job restart settings |
Https | HttpsConfig | Optional. HTTPS settings |
Cookies | CookiesConfig | Optional. Cookies settings |
Proxy | ProxiesConfig | Optional. Proxy settings |
DownloadErrorHandling | DownloadErrorHandling | Optional. Download error handling settings |
CrawlersProtectionBypass | CrawlersProtectionBypass | Optional. Crawler-protection bypass settings |
CrossDomainAccess | CrossDomainAccess | Optional. Cross-domain access settings |
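An illustrative request body (a sketch, assuming the body is serialized as JSON with the PascalCase field names above; every value, including the start URL, is a placeholder):

```json
{
  "StartUrls": ["https://example.com"],
  "Type": "Internet",
  "Headers": {
    "HttpHeader": { "Name": "User-Agent", "Values": ["MyCrawler/1.0"] }
  },
  "Restart": { "JobRestartMode": "Continue" },
  "Https": { "SuppressHttpsCertificateValidation": false },
  "Cookies": { "UseCookies": true },
  "Proxy": { "UseProxy": false, "SendOvertRequestsOnProxiesFailure": true },
  "DownloadErrorHandling": { "Policy": "Retry", "RetriesLimit": 3, "RetryDelayMs": 1000 },
  "CrossDomainAccess": { "Policy": "Subdomains" }
}
```

The sub-objects are described in detail in the sections below.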
JobTypes
NOTE! Restrictions on the allowed values and the default value for all jobs can be configured in the Dapi service.
NOTE! The Crawler service must be correctly configured to handle jobs of different types.
Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.
Enumeration values:
Name | Description |
---|---|
Internet | Crawl data from internet sources via request gateways (Proxy addresses, Host IP addresses, etc.) |
Intranet | Crawl data from intranet sources with no limits |
HeadersConfig
Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, custom headers, and so on.
Fields:
Field | Type | Description |
---|---|---|
HttpHeader | HttpHeader | Required. HTTP header (name, values) |
HttpHeader
Represents a single HTTP header definition with a name and one or more values.
Fields:
Name | Type | Description |
---|---|---|
Name | String | Required. Header name |
Values | Array of String | Required. Header values |
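For example, a HeadersConfig object setting a custom user agent might look like this (an illustrative sketch, assuming JSON serialization; the header name and value are placeholders):

```json
{
  "HttpHeader": {
    "Name": "User-Agent",
    "Values": ["MyCrawler/1.0"]
  }
}
```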
RestartConfig
Controls what happens when a job restarts: continue from cached state or rebuild from scratch.
Fields:
Field | Type | Description |
---|---|---|
JobRestartMode | JobRestartModes | Required. Job restart mode |
JobRestartModes
Describes restart strategies and their effect on previously cached data.
Enumeration values:
Name | Description |
---|---|
Continue | Reuse cached data and continue crawling and parsing new data |
FromScratch | Clear cached data and start from scratch |
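For example, a RestartConfig that reuses cached data on restart (illustrative):

```json
{
  "JobRestartMode": "Continue"
}
```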
HttpsConfig
Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.
Fields:
Field | Type | Description |
---|---|---|
SuppressHttpsCertificateValidation | Bool | Required. Suppress HTTPS certificate validation of a web resource |
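For example, an HttpsConfig that disables certificate validation, e.g., for hosts with self-signed certificates (illustrative):

```json
{
  "SuppressHttpsCertificateValidation": true
}
```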
CookiesConfig
Controls cookie persistence between requests to maintain sessions or state across navigations.
Fields:
Field | Type | Description |
---|---|---|
UseCookies | Bool | Required. Save and reuse cookies between requests |
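For example, a CookiesConfig that keeps session cookies between requests (illustrative):

```json
{
  "UseCookies": true
}
```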
ProxiesConfig
Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.
Fields:
Field | Type | Description |
---|---|---|
UseProxy | Bool | Required. Use proxies for requests |
SendOvertRequestsOnProxiesFailure | Bool | Required. Send requests from the host's real IP address if all proxies fail |
IterateProxyResponseCodes | String | Optional. Comma-separated HTTP response codes that trigger switching to the next proxy. Default: "401, 403" |
Proxies | Array of ProxyConfig | Optional. Proxy configurations. Default: empty array |
ProxyConfig
Defines an individual proxy endpoint and its connection characteristics.
Fields:
Field | Type | Description |
---|---|---|
Protocol | String | Required. Proxy protocol (http, https, socks5) |
Host | String | Required. Proxy host |
Port | Int | Required. Proxy port |
UserName | String | Optional. Proxy username |
Password | String | Optional. Proxy password |
ConnectionsLimit | Int | Optional. Max concurrent connections |
AvailableHosts | Array of String | Optional. Hosts accessible via this proxy |
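For example, a ProxiesConfig routing requests through a single authenticated HTTP proxy (an illustrative sketch; host, port, and credentials are placeholders):

```json
{
  "UseProxy": true,
  "SendOvertRequestsOnProxiesFailure": false,
  "IterateProxyResponseCodes": "401, 403",
  "Proxies": [
    {
      "Protocol": "http",
      "Host": "proxy.example.net",
      "Port": 8080,
      "UserName": "crawler",
      "Password": "secret",
      "ConnectionsLimit": 10,
      "AvailableHosts": ["example.com"]
    }
  ]
}
```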
DownloadErrorHandling
Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.
Fields:
Field | Type | Description |
---|---|---|
Policy | DownloadErrorHandlingPolicies | Required. Error handling policy (Skip, Retry) |
RetriesLimit | Int | Optional. Maximum number of retries (applies when Policy is Retry) |
RetryDelayMs | Int | Optional. Delay before each retry, in milliseconds (applies when Policy is Retry) |
DownloadErrorHandlingPolicies
Available strategies for handling request or network failures during content download.
Enumeration values:
Name | Description |
---|---|
Skip | Skip the error and continue crawling |
Retry | Retry the failed download |
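For example, a DownloadErrorHandling configuration that retries failed downloads up to three times with a one-second delay (illustrative):

```json
{
  "Policy": "Retry",
  "RetriesLimit": 3,
  "RetryDelayMs": 1000
}
```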
CrawlersProtectionBypass
Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.
Fields:
Field | Type | Description |
---|---|---|
MaxResponseSizeKb | Int | Optional. Max response size in KB |
MaxRedirectHops | Int | Optional. Max redirect hops |
RequestTimeoutSec | Int | Optional. Max request timeout in seconds |
CrawlDelays | Array of CrawlDelay | Optional. Per-host crawl delays |
CrawlDelay
Per-host throttling rule to space out requests and respect site limits or robots guidance.
Fields:
Field | Type | Description |
---|---|---|
Host | String | Required. Host |
Delay | String | Required. Delay value: a fixed value (e.g., 0), a range (e.g., 1-5), or robots to follow the site's robots guidance |
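For example, a CrawlersProtectionBypass configuration combining global limits with per-host delays (an illustrative sketch; hosts and limits are placeholders):

```json
{
  "MaxResponseSizeKb": 2048,
  "MaxRedirectHops": 5,
  "RequestTimeoutSec": 30,
  "CrawlDelays": [
    { "Host": "example.com", "Delay": "1-5" },
    { "Host": "slow.example.com", "Delay": "robots" }
  ]
}
```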
CrossDomainAccess
Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.
Fields:
Field | Type | Description |
---|---|---|
Policy | CrossDomainAccessPolicies | Required. Cross-domain policy (None, Subdomains, CrossDomains) |
CrossDomainAccessPolicies
Domain scoping modes that determine which hosts are considered in-bounds while crawling.
Enumeration values:
Name | Description |
---|---|
None | No subdomain or cross-domain access. Only the main domain is allowed |
Subdomains | The subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |
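For example, a CrossDomainAccess configuration that keeps the crawl within the main domain and its subdomains (illustrative):

```json
{
  "Policy": "Subdomains"
}
```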
Responses
200 (Ok)
The job has been successfully inserted or updated.
Returns an array of DownloadTask.
DownloadTask
Represents a single page download request produced by a crawl or scrape job.
Fields:
Name | Type | Description |
---|---|---|
Id | String | Required. Task Id |
Url | String | Required. Page URL |
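An illustrative response body, assuming the tasks are returned as a JSON array with the field names above (the Id format is a placeholder):

```json
[
  { "Id": "a1b2c3", "Url": "https://example.com/" },
  { "Id": "d4e5f6", "Url": "https://example.com/about" }
]
```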