GetJobConfig Tool
Returns a job config for a particular job.
Arguments
| Name | Type | Description |
|---|---|---|
| jobName | string | Required. Unique job name that identifies the job in the system; the domain name is often used as the job name (e.g., example.com) |
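For example, a call might pass the job name like this (assuming the tool takes its arguments as a JSON object; the name is illustrative):

```json
{ "jobName": "example.com" }
```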
Return Type
Returns a JobConfig
JobConfig
Defines the top-level configuration for a crawl job: entry URLs, job type, request/session behavior (headers, cookies, HTTPS), network routing (proxies), and runtime policies (restarts, error handling, domain scope).
Fields:
| Field | Type | Description |
|---|---|---|
| StartUrls | array of string | Required. Initial URLs used as the crawling entry points |
| Type | JobTypes | Optional. Job type |
| Headers | HeadersConfig | Optional. Headers settings |
| Restart | RestartConfig | Optional. Job restart settings |
| Https | HttpsConfig | Optional. HTTPS settings |
| Cookies | CookiesConfig | Optional. Cookies settings |
| Proxy | ProxiesConfig | Optional. Proxy settings |
| DownloadErrorHandling | DownloadErrorHandling | Optional. Download errors handling settings |
| CrawlersProtectionBypass | CrawlersProtectionBypass | Optional. Settings for bypassing crawler-protection countermeasures |
| CrossDomainAccess | CrossDomainAccess | Optional. Cross-domain access settings |
| RetrievalConfig | RetrievalConfig | Optional. Retrieval settings |
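A minimal sketch of a JobConfig, assuming the configuration is serialized as JSON with the field names above (all values are illustrative; only StartUrls is required):

```json
{
  "StartUrls": ["https://example.com"],
  "Type": "internet",
  "Restart": { "JobRestartMode": "Continue" },
  "Cookies": { "UseCookies": true },
  "CrossDomainAccess": { "Policy": "Subdomains" }
}
```

Each nested object is described in the sections that follow.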
JobTypes
Specifies how and where the crawler operates. Choose the mode that matches the environment your job targets.
NOTE! Restrictions on the possible values and the default value for all jobs can be configured in the Dapi service.
NOTE! The crawler service must be correctly configured to handle jobs of different types.
Enumeration values:
| Name | Description |
|---|---|
| internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
| intranet | Crawl data from intranet sources with no limits |
HeadersConfig
Configures additional HTTP headers to be sent with every request. Use to set user agents, auth tokens, custom headers, etc.
Fields:
| Field | Type | Description |
|---|---|---|
| DefaultRequestHeaders | array of HttpHeader | Required. HTTP headers that will be sent with each request |
HttpHeader
Represents a single HTTP header definition with a name and one or more values.
Fields:
| Field | Type | Description |
|---|---|---|
| Name | string | Required. Header name |
| Values | array of string | Required. Header values |
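For example, a HeadersConfig that sets a custom user agent and an authorization header might look like this (JSON serialization assumed; the header values are illustrative):

```json
{
  "DefaultRequestHeaders": [
    { "Name": "User-Agent", "Values": ["MyCrawler/1.0"] },
    { "Name": "Authorization", "Values": ["Bearer <token>"] }
  ]
}
```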
RestartConfig
Controls what happens when a job restarts: continue from cached state or rebuild from scratch.
Fields:
| Field | Type | Description |
|---|---|---|
| JobRestartMode | JobRestartModes | Required. Job restart mode |
JobRestartModes
Describes restart strategies and their effect on previously cached data.
Enumeration values:
| Name | Description |
|---|---|
| Continue | Reuse cached data and continue crawling and parsing new data |
| FromScratch | Clear cached data and start from scratch |
HttpsConfig
Defines HTTPS validation behavior for target resources. Useful for development or when crawling hosts with self-signed certificates.
Fields:
| Field | Type | Description |
|---|---|---|
| SuppressHttpsCertificateValidation | bool | Required. Suppress HTTPS certificate validation of a web resource |
CookiesConfig
Controls cookie persistence between requests to maintain sessions or state across navigations.
Fields:
| Field | Type | Description |
|---|---|---|
| UseCookies | bool | Required. Save and reuse cookies between requests |
ProxiesConfig
Configures whether and how requests are routed through proxy servers, including fallback behavior and specific proxy pools.
Fields:
| Field | Type | Description |
|---|---|---|
| UseProxy | bool | Required. Use proxies for requests |
| SendOvertRequestsOnProxiesFailure | bool | Required. Send the request from the host's real IP address if all proxies fail |
| IterateProxyResponseCodes | string | Optional. Comma-separated HTTP response codes that trigger iterating to the next proxy. Default: "401, 403" |
| Proxies | array of ProxyConfig | Optional. Proxy configurations. Default: empty array |
ProxyConfig
Defines an individual proxy endpoint and its connection characteristics.
Fields:
| Field | Type | Description |
|---|---|---|
| Protocol | string | Required. Proxy protocol (http, https, socks5) |
| Host | string | Required. Proxy host |
| Port | int | Required. Proxy port |
| UserName | string | Optional. Proxy username |
| Password | string | Optional. Proxy password |
| ConnectionsLimit | int | Optional. Max concurrent connections |
| AvailableHosts | array of string | Optional. Hosts accessible via this proxy |
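A ProxiesConfig with a single authenticated proxy might look like this (JSON serialization assumed; host, port, and credentials are illustrative placeholders):

```json
{
  "UseProxy": true,
  "SendOvertRequestsOnProxiesFailure": false,
  "IterateProxyResponseCodes": "401, 403",
  "Proxies": [
    {
      "Protocol": "http",
      "Host": "proxy.internal.example",
      "Port": 8080,
      "UserName": "crawler",
      "Password": "<secret>",
      "ConnectionsLimit": 10,
      "AvailableHosts": ["example.com"]
    }
  ]
}
```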
DownloadErrorHandling
Specifies how the crawler reacts to transient download errors, including retry limits and backoff delays.
Fields:
| Field | Type | Description |
|---|---|---|
| Policy | DownloadErrorHandlingPolicies | Required. Error handling policy (Skip, Retry) |
| RetryPolicyParams | RetryPolicyParams | Optional. Retry params |
DownloadErrorHandlingPolicies
Available strategies for handling request or network failures during content download.
Enumeration values:
| Name | Description |
|---|---|
| Skip | Skip an error and continue crawling |
| Retry | Retry the failed download according to RetryPolicyParams |
RetryPolicyParams
Specifies how the crawler performs retries.
Fields:
| Field | Type | Description |
|---|---|---|
| RetriesLimit | int | Required. Max retries |
| RetryDelayMs | int | Required. Delay before retry in ms |
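For example, a DownloadErrorHandling config that retries up to three times with a five-second delay might look like this (JSON serialization assumed; the limits are illustrative):

```json
{
  "Policy": "Retry",
  "RetryPolicyParams": {
    "RetriesLimit": 3,
    "RetryDelayMs": 5000
  }
}
```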
CrawlersProtectionBypass
Tuning options to reduce detection and throttling by target sites: response size limits, redirect depth, request timeouts, and host-specific crawl delays.
Fields:
| Field | Type | Description |
|---|---|---|
| MaxResponseSizeKb | int | Optional. Max response size in KB |
| MaxRedirectHops | int | Optional. Max redirect hops |
| RequestTimeoutSec | int | Optional. Max request timeout in seconds |
| CrawlDelays | array of CrawlDelay | Optional. Per-host crawl delays |
CrawlDelay
Per-host throttling rule to space out requests and respect site limits or robots guidance.
Fields:
| Field | Type | Description |
|---|---|---|
| Host | string | Required. Host |
| Delay | string | Required. Delay value: a number of seconds (e.g., 0), a range in seconds (e.g., 1-5), or robots to follow the site's robots guidance |
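Putting these together, a CrawlersProtectionBypass config might look like this (JSON serialization assumed; all limits and hosts are illustrative):

```json
{
  "MaxResponseSizeKb": 2048,
  "MaxRedirectHops": 5,
  "RequestTimeoutSec": 30,
  "CrawlDelays": [
    { "Host": "example.com", "Delay": "1-5" },
    { "Host": "docs.example.com", "Delay": "robots" }
  ]
}
```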
CrossDomainAccess
Controls which domains the crawler can follow from the starting hosts: only the main domain, include subdomains, or allow cross-domain navigation.
Fields:
| Field | Type | Description |
|---|---|---|
| Policy | CrossDomainAccessPolicies | Required. Cross-domain policy (None, Subdomains, CrossDomains) |
CrossDomainAccessPolicies
Domain scoping modes that determine which hosts are considered in-bounds while crawling.
Enumeration values:
| Name | Description |
|---|---|
| None | No subdomain or cross-domain access. Only the main domain is allowed |
| Subdomains | The subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
| CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |
RetrievalConfig
Configuration for enrolling crawled pages into a vector index for vector search; controls what gets embedded and how enrollment behaves. Retrieval is part of the RAG (retrieval-augmented generation) pipeline.
Fields:
| Field | Type | Description |
|---|---|---|
| EnrollInIndex | bool | Required. Enroll crawled pages into the vector index. |
| Force | bool | Required. Overwrite data that already exists in the index |
| MaxTokensPerChunk | int | Optional. Maximum tokens per chunk. Default: 512. |
| ContentScopes | array of RetrievalContentScope | Optional. Selectors for page content to enroll. Default: entire page. |
| EnrollmentWaitMode | RetrievalEnrollmentWaitMode | Optional. Enrollment wait mode. Default: Eventually. |
RetrievalContentScope
Defines which parts of which pages are enrolled, using URL path matching and selectors. This lets you enroll only the meaningful blocks (e.g., product descriptions, docs body) and ignore noise (menus, footers, ads).
Fields:
| Field | Type | Description |
|---|---|---|
| PathPattern | string | Required. URL path pattern (case sensitive). See the examples below for details |
| Selector | string | Required. Selector that extracts the relevant content from a page. See Selector Format below |
PathPattern Examples:
| URL | Pattern | Matches |
|---|---|---|
| https://example.com/path/to/resource | * | Yes |
| https://example.com/path/to/resource | /* | Yes |
| https://example.com/path/to/resource | /path/to/resource | Yes |
| https://example.com/path/to/resource | /path/to/* | Yes |
| https://example.com/path/to/resource | /path/*/resource | Yes |
| https://example.com/path/to/resource | /*/res | Yes |
| https://example.com/path/to/resource | /res* | No |
| https://example.com/path/to/resource | /path/to/RESOURCE | No |
Selector Format
The Selector field has the format TYPE: selector, where the first part (TYPE) defines the selector type and the second part is a selector of that type.
Supported types:
| Name | Description |
|---|---|
| CSS | A CSS selector |
| XPATH | An XPath expression |
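Putting it together, a RetrievalConfig that enrolls only the documentation body and product descriptions might look like this (JSON serialization assumed; the paths and selectors are illustrative):

```json
{
  "EnrollInIndex": true,
  "Force": false,
  "MaxTokensPerChunk": 512,
  "EnrollmentWaitMode": "WaitJob",
  "ContentScopes": [
    { "PathPattern": "/docs/*", "Selector": "CSS: article.main" },
    { "PathPattern": "/product/*", "Selector": "XPATH: //div[@id='description']" }
  ]
}
```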
RetrievalEnrollmentWaitMode
Specifies whether to wait for each crawled document to be enrolled into the index.
Enumeration values:
| Name | Description |
|---|---|
| Eventually | Don’t wait. Queue for enrollment; the index catches up asynchronously. FAST |
| WaitEach | Wait for each document. Logs an error if not enrolled within 1 minute. SLOW |
| WaitJob | Wait for all document enrollments when the entire job is completed. FAST |