Jobs

Create and control crawl/scrape jobs: start a job with a configuration and receive the initial download tasks that drive further processing.

Start

Creates or updates a job with the provided configuration and enqueues initial download tasks for the specified start URLs.

POST /api/v2/jobs/{jobName}/start

Path Parameters

| Name | Type | Description |
| --- | --- | --- |
| jobName | string | Required. A unique job name; the domain name is often used (e.g., example.com). |

Request Body

JobConfig

JobConfig

Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).

Fields:

| Name | Type | Description |
| --- | --- | --- |
| StartUrls | Array of String | Required. Initial URLs; the crawling entry points. |
| Type | JobTypes | Optional. Job type. |
| Headers | HeadersConfig | Optional. Header settings. |
| Restart | RestartConfig | Optional. Job restart settings. |
| Https | HttpsConfig | Optional. HTTPS settings. |
| Cookies | CookiesConfig | Optional. Cookie settings. |
| Proxy | ProxiesConfig | Optional. Proxy settings. |
| DownloadErrorHandling | DownloadErrorHandling | Optional. Download error handling settings. |
| CrawlersProtectionBypass | CrawlersProtectionBypass | Optional. Crawler protection countermeasure settings. |
| CrossDomainAccess | CrossDomainAccess | Optional. Cross-domain access settings. |
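
For reference, a minimal request body could look like the following sketch. Only StartUrls is required; the URL, header value, and other settings here are illustrative, and the exact JSON nesting is assumed from the field tables in this section.

```json
{
  "StartUrls": ["https://example.com/"],
  "Type": "Internet",
  "Headers": {
    "HttpHeader": { "Name": "User-Agent", "Values": ["MyCrawler/1.0"] }
  },
  "Restart": { "JobRestartMode": "Continue" },
  "Https": { "SuppressHttpsCertificateValidation": false },
  "Cookies": { "UseCookies": true }
}
```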

JobTypes

NOTE! Restrictions on the possible values and the default value for all jobs can be configured in the Dapi service.

NOTE! The Crawler service must be correctly configured to handle jobs of different types.

Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.

Enumeration values:

| Name | Description |
| --- | --- |
| Internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.). |
| Intranet | Crawl data from intranet sources with no limits. |

HeadersConfig

Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, and other custom headers.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| HttpHeader | HttpHeader | Required. HTTP header (name, values). |

HttpHeader

Represents a single HTTP header definition with a name and one or more values.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Name | String | Required. Header name. |
| Values | Array of String | Required. Header values. |
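
Combining the two structures above, a HeadersConfig carrying a multi-valued header might be expressed as follows; this is a sketch assuming the nesting shown in the field tables:

```json
{
  "Headers": {
    "HttpHeader": {
      "Name": "Accept-Language",
      "Values": ["en-US", "en;q=0.8"]
    }
  }
}
```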

RestartConfig

Controls what happens when a job restarts: continue from cached state or rebuild from scratch.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| JobRestartMode | JobRestartModes | Required. Job restart mode. |

JobRestartModes

Describes restart strategies and their effect on previously cached data.

Enumeration values:

| Name | Description |
| --- | --- |
| Continue | Reuse cached data and continue crawling and parsing new data. |
| FromScratch | Clear cached data and start from scratch. |
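
For example, a job that should discard its cache on every restart would set (illustrative snippet):

```json
{
  "Restart": { "JobRestartMode": "FromScratch" }
}
```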

HttpsConfig

Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| SuppressHttpsCertificateValidation | Bool | Required. Suppress HTTPS certificate validation of a web resource. |

CookiesConfig

Controls cookie persistence between requests to maintain sessions or state across navigations.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseCookies | Bool | Required. Save and reuse cookies between requests. |

ProxiesConfig

Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| UseProxy | Bool | Required. Use proxies for requests. |
| SendOvertRequestsOnProxiesFailure | Bool | Required. Send the request from the host's real IP address if all proxies fail. |
| IterateProxyResponseCodes | String | Optional. Comma-separated HTTP response codes that trigger rotation to the next proxy. Default: "401, 403". |
| Proxies | Array of ProxyConfig | Optional. Proxy configurations. Default: empty array. |

ProxyConfig

Defines an individual proxy endpoint and its connection characteristics.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Protocol | String | Required. Proxy protocol (http, https, socks5). |
| Host | String | Required. Proxy host. |
| Port | Int | Required. Proxy port. |
| UserName | String | Optional. Proxy username. |
| Password | String | Optional. Proxy password. |
| ConnectionsLimit | Int | Optional. Max concurrent connections. |
| AvailableHosts | Array of String | Optional. Hosts accessible via this proxy. |
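
A proxied setup might look like the sketch below. The proxy host, credentials, and response codes are placeholders, and the nesting is assumed from the field tables above:

```json
{
  "Proxy": {
    "UseProxy": true,
    "SendOvertRequestsOnProxiesFailure": false,
    "IterateProxyResponseCodes": "401, 403, 429",
    "Proxies": [
      {
        "Protocol": "socks5",
        "Host": "proxy1.internal",
        "Port": 1080,
        "UserName": "crawler",
        "Password": "secret",
        "ConnectionsLimit": 10,
        "AvailableHosts": ["example.com"]
      }
    ]
  }
}
```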

DownloadErrorHandling

Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Policy | DownloadErrorHandlingPolicies | Required. Error handling policy (Skip, Retry). |
| RetriesLimit | Int | Optional. Max retries (when Policy is Retry). |
| RetryDelayMs | Int | Optional. Delay before retry, in milliseconds (when Policy is Retry). |

DownloadErrorHandlingPolicies

Available strategies for handling request or network failures during content download.

Enumeration values:

| Name | Description |
| --- | --- |
| Skip | Skip the error and continue crawling. |
| Retry | Retry the failed download. |
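
For instance, a retry policy with three attempts spaced five seconds apart might be configured as follows (values are illustrative):

```json
{
  "DownloadErrorHandling": {
    "Policy": "Retry",
    "RetriesLimit": 3,
    "RetryDelayMs": 5000
  }
}
```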

CrawlersProtectionBypass

Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| MaxResponseSizeKb | Int | Optional. Max response size in KB. |
| MaxRedirectHops | Int | Optional. Max redirect hops. |
| RequestTimeoutSec | Int | Optional. Max request timeout in seconds. |
| CrawlDelays | Array of CrawlDelay | Optional. Per-host crawl delays. |

CrawlDelay

Per-host throttling rule to space out requests and respect site limits or robots guidance.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Host | String | Required. Host the delay applies to. |
| Delay | String | Required. Delay value: a fixed number of seconds ("0"), a range ("1-5"), or "robots" to follow the site's robots.txt guidance. |
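
Taken together, a CrawlersProtectionBypass block with per-host delays might look like this sketch (all limits and hosts are illustrative):

```json
{
  "CrawlersProtectionBypass": {
    "MaxResponseSizeKb": 2048,
    "MaxRedirectHops": 5,
    "RequestTimeoutSec": 30,
    "CrawlDelays": [
      { "Host": "example.com", "Delay": "1-5" },
      { "Host": "sub.example.com", "Delay": "robots" }
    ]
  }
}
```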

CrossDomainAccess

Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.

Fields:

| Field | Type | Description |
| --- | --- | --- |
| Policy | CrossDomainAccessPolicies | Required. Cross-domain policy (None, Subdomains, CrossDomains). |

CrossDomainAccessPolicies

Domain scoping modes that determine which hosts are considered in-bounds while crawling.

Enumeration values:

| Name | Description |
| --- | --- |
| None | No subdomain or cross-domain access; only the main domain is allowed. |
| Subdomains | Subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com"). |
| CrossDomains | Access to any domain is allowed (e.g., "example.com", "sub.example.com", "another.com"). |
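
For example, to keep a crawl on the main domain and its subdomains (illustrative snippet):

```json
{
  "CrossDomainAccess": { "Policy": "Subdomains" }
}
```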

Responses

200 (OK)

The job has been successfully inserted or updated.

Returns an array of DownloadTask.

DownloadTask

Represents a single page download request produced by a crawl or scrape job.

Fields:

| Name | Type | Description |
| --- | --- | --- |
| Id | String | Required. Task ID. |
| Url | String | Required. Page URL. |
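
A successful response might therefore look like the following sketch; the task ID format shown here is an assumption and may differ in practice:

```json
[
  {
    "Id": "3f8a1c2e-9b7d-4e21-a0c5-6d4f8b2e1a90",
    "Url": "https://example.com/"
  }
]
```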
