Jobs
Create and control crawl/scrape jobs: start with a configuration and receive initial tasks to drive further processing.
Start
Creates or updates a job with the provided configuration and enqueues initial download tasks for the specified start URLs.
POST /api/v2/jobs/{jobName}/start
Path Parameters
Name | Type | Description |
---|---|---|
jobName | string | Required. A unique job name; the target domain name is often used (e.g., example.com) |
Request Body
JobConfig
Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).
Fields:
Name | Type | Description |
---|---|---|
StartUrls | Array of String | Required. Initial URLs that serve as crawling entry points |
Type | JobTypes | Optional. Job type |
Headers | HeadersConfig | Optional. Headers settings |
Restart | RestartConfig | Optional. Job restart settings |
Https | HttpsConfig | Optional. HTTPS settings |
Cookies | CookiesConfig | Optional. Cookies settings |
Proxy | ProxiesConfig | Optional. Proxy settings |
DownloadErrorHandling | DownloadErrorHandling | Optional. Download error handling settings |
CrawlersProtectionBypass | CrawlersProtectionBypass | Optional. Crawler-protection bypass settings |
CrossDomainAccess | CrossDomainAccess | Optional. Cross-domain access settings |
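An illustrative request body (a sketch, assuming the body is serialized as JSON with the PascalCase field names above; every value, including the start URL, is a placeholder):

```json
{
  "StartUrls": ["https://example.com"],
  "Type": "Internet",
  "Headers": {
    "HttpHeader": { "Name": "User-Agent", "Values": ["MyCrawler/1.0"] }
  },
  "Restart": { "JobRestartMode": "Continue" },
  "Https": { "SuppressHttpsCertificateValidation": false },
  "Cookies": { "UseCookies": true },
  "Proxy": { "UseProxy": false, "SendOvertRequestsOnProxiesFailure": true },
  "DownloadErrorHandling": { "Policy": "Retry", "RetriesLimit": 3, "RetryDelayMs": 1000 },
  "CrossDomainAccess": { "Policy": "Subdomains" }
}
```

The sub-objects are described in detail in the sections below.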
JobTypes
NOTE! Restrictions on the allowed values and the default value for all jobs can be configured in the Dapi service.
NOTE! The Crawler service must be correctly configured to handle jobs of different types.
Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.
Enumeration values:
Name | Description |
---|---|
Internet | Crawl data from internet sources via request gateways (Proxy addresses, Host IP addresses, etc.) |
Intranet | Crawl data from intranet sources with no limits |
HeadersConfig
Configures additional HTTP headers to be sent with every request. Use it to set user agents, auth tokens, custom headers, and so on.
Fields:
Field | Type | Description |
---|---|---|
HttpHeader | HttpHeader | Required. HTTP header (name, values) |
HttpHeader
Represents a single HTTP header definition with a name and one or more values.
Fields:
Name | Type | Description |
---|---|---|
Name | String | Required. Header name |
Values | Array of String | Required. Header values |
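For example, a HeadersConfig object setting a custom user agent might look like this (an illustrative sketch, assuming JSON serialization; the header name and value are placeholders):

```json
{
  "HttpHeader": {
    "Name": "User-Agent",
    "Values": ["MyCrawler/1.0"]
  }
}
```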
RestartConfig
Controls what happens when a job restarts: continue from cached state or rebuild from scratch.
Fields:
Field | Type | Description |
---|---|---|
JobRestartMode | JobRestartModes | Required. Job restart mode |
JobRestartModes
Describes restart strategies and their effect on previously cached data.
Enumeration values:
Name | Description |
---|---|
Continue | Reuse cached data and continue crawling and parsing new data |
FromScratch | Clear cached data and start from scratch |
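For example, a RestartConfig that reuses cached data on restart (illustrative):

```json
{
  "JobRestartMode": "Continue"
}
```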
HttpsConfig
Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.
Fields:
Field | Type | Description |
---|---|---|
SuppressHttpsCertificateValidation | Bool | Required. Suppress HTTPS certificate validation of a web resource |
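For example, an HttpsConfig that disables certificate validation, e.g., for hosts with self-signed certificates (illustrative):

```json
{
  "SuppressHttpsCertificateValidation": true
}
```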
CookiesConfig
Controls cookie persistence between requests to maintain sessions or state across navigations.
Fields:
Field | Type | Description |
---|---|---|
UseCookies | Bool | Required. Save and reuse cookies between requests |
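For example, a CookiesConfig that keeps session cookies between requests (illustrative):

```json
{
  "UseCookies": true
}
```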
ProxiesConfig
Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.
Fields:
Field | Type | Description |
---|---|---|
UseProxy | Bool | Required. Use proxies for requests |
SendOvertRequestsOnProxiesFailure | Bool | Required. Send requests from the host's real IP address if all proxies fail |
IterateProxyResponseCodes | String | Optional. Comma-separated HTTP response codes that trigger switching to the next proxy. Default: "401, 403" |
Proxies | Array of ProxyConfig | Optional. Proxy configurations. Default: empty array |
ProxyConfig
Defines an individual proxy endpoint and its connection characteristics.
Fields:
Field | Type | Description |
---|---|---|
Protocol | String | Required. Proxy protocol (http, https, socks5) |
Host | String | Required. Proxy host |
Port | Int | Required. Proxy port |
UserName | String | Optional. Proxy username |
Password | String | Optional. Proxy password |
ConnectionsLimit | Int | Optional. Max concurrent connections |
AvailableHosts | Array of String | Optional. Hosts accessible via this proxy |
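For example, a ProxiesConfig routing requests through a single authenticated HTTP proxy (an illustrative sketch; host, port, and credentials are placeholders):

```json
{
  "UseProxy": true,
  "SendOvertRequestsOnProxiesFailure": false,
  "IterateProxyResponseCodes": "401, 403",
  "Proxies": [
    {
      "Protocol": "http",
      "Host": "proxy.example.net",
      "Port": 8080,
      "UserName": "crawler",
      "Password": "secret",
      "ConnectionsLimit": 10,
      "AvailableHosts": ["example.com"]
    }
  ]
}
```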
DownloadErrorHandling
Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.
Fields:
Field | Type | Description |
---|---|---|
Policy | DownloadErrorHandlingPolicies | Required. Error handling policy (Skip, Retry) |
RetriesLimit | Int | Optional. Maximum number of retries (applies when Policy is Retry) |
RetryDelayMs | Int | Optional. Delay before each retry, in milliseconds (applies when Policy is Retry) |
DownloadErrorHandlingPolicies
Available strategies for handling request or network failures during content download.
Enumeration values:
Name | Description |
---|---|
Skip | Skip the error and continue crawling |
Retry | Retry the failed download |
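For example, a DownloadErrorHandling configuration that retries failed downloads up to three times with a one-second delay (illustrative):

```json
{
  "Policy": "Retry",
  "RetriesLimit": 3,
  "RetryDelayMs": 1000
}
```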
CrawlersProtectionBypass
Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.
Fields:
Field | Type | Description |
---|---|---|
MaxResponseSizeKb | Int | Optional. Max response size in KB |
MaxRedirectHops | Int | Optional. Max redirect hops |
RequestTimeoutSec | Int | Optional. Max request timeout in seconds |
CrawlDelays | Array of CrawlDelay | Optional. Per-host crawl delays |
CrawlDelay
Per-host throttling rule to space out requests and respect site limits or robots guidance.
Fields:
Field | Type | Description |
---|---|---|
Host | String | Required. Host |
Delay | String | Required. Delay value: a fixed value (e.g., 0), a range (e.g., 1-5), or robots to follow the site's robots guidance |
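For example, a CrawlersProtectionBypass configuration combining global limits with per-host delays (an illustrative sketch; hosts and limits are placeholders):

```json
{
  "MaxResponseSizeKb": 2048,
  "MaxRedirectHops": 5,
  "RequestTimeoutSec": 30,
  "CrawlDelays": [
    { "Host": "example.com", "Delay": "1-5" },
    { "Host": "slow.example.com", "Delay": "robots" }
  ]
}
```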
CrossDomainAccess
Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.
Fields:
Field | Type | Description |
---|---|---|
Policy | CrossDomainAccessPolicies | Required. Cross-domain policy (None, Subdomains, CrossDomains) |
CrossDomainAccessPolicies
Domain scoping modes that determine which hosts are considered in-bounds while crawling.
Enumeration values:
Name | Description |
---|---|
None | No subdomain or cross-domain access. Only the main domain is allowed |
Subdomains | The subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |
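For example, a CrossDomainAccess configuration that keeps the crawl within the main domain and its subdomains (illustrative):

```json
{
  "Policy": "Subdomains"
}
```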
Responses
200 (Ok)
The job has been successfully inserted or updated.
Returns an array of DownloadTask.
DownloadTask
Represents a single page download request produced by a crawl or scrape job.
Fields:
Name | Type | Description |
---|---|---|
Id | String | Required. Task Id |
Url | String | Required. Page URL |
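An illustrative response body, assuming the tasks are returned as a JSON array with the field names above (the Id format is a placeholder):

```json
[
  { "Id": "a1b2c3", "Url": "https://example.com/" },
  { "Id": "d4e5f6", "Url": "https://example.com/about" }
]
```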