StartJob Tool
The StartJob tool initiates a web data source (WDS) crawling or scraping job from a fully populated job configuration object. It launches the job with all specified parameters, including start URLs, crawling rules, and scraping settings, and returns a job object that can be used to track job status and retrieve results.
Arguments
Name | Type | Description |
---|---|---|
jobConfig | JobConfig | Required. Job configuration object containing all job parameters |
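Below is a minimal sketch of invoking StartJob through a generic MCP client. The `McpClient` type and the `callTool` shape are illustrative assumptions; only the `jobConfig` argument and its fields come from the tables in this document.

```typescript
// Hypothetical MCP client surface: any object exposing a callTool method.
type McpClient = {
  callTool(req: { name: string; arguments: Record<string, unknown> }): Promise<unknown>;
};

// Launches a job from a minimal config; returns the job object used to
// track status and retrieve results.
async function startExampleJob(client: McpClient): Promise<unknown> {
  return client.callTool({
    name: "StartJob",
    arguments: {
      jobConfig: {
        StartUrls: ["https://example.com"], // required: crawling entry points
        JobName: "example-crawl",           // optional: random name if omitted
      },
    },
  });
}
```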
JobConfig
The JobConfig object passed to StartJob contains the following fields. Each field can be set up using the corresponding MCP tools (see the MCP Tools column):
Name | Type | Description | MCP Tools |
---|---|---|---|
StartUrls | Array of String | Required. Initial URLs; the crawling entry points | Set via JobConfigCreate, JobConfigAddStartUrl tools |
JobName | String | Optional. Job name. If not specified, a randomly generated value is used | Set via JobConfigCreate tool |
Type | JobTypes | Optional. Job type | Set via JobConfigSetJobType tool |
Headers | HeadersConfig | Optional. Headers settings | Set via JobConfigHeaders* tools |
Restart | RestartConfig | Optional. Job restart settings | Set via JobConfigRestart* tools |
Https | HttpsConfig | Optional. HTTPS settings | Set via JobConfigHttps* tools |
Cookies | CookiesConfig | Optional. Cookies settings | Set via JobConfigCookies* tools |
Proxy | ProxiesConfig | Optional. Proxy settings | Set via JobConfigProxy* tools |
DownloadErrorHandling | DownloadErrorHandling | Optional. Download errors handling settings | Set via JobConfigDownloadErrorHandling* tools |
CrawlersProtectionBypass | CrawlersProtectionBypass | Optional. Settings for bypassing crawler-protection mechanisms | Set via JobConfigCrawlersProtectionBypass* tools |
CrossDomainAccess | CrossDomainAccess | Optional. Cross-domain access settings | Set via JobConfigCrossDomainAccess* tools |
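For orientation, here is a sketch of a fully assembled JobConfig as it might be passed to StartJob. In practice each block is built up via the JobConfig* tools listed above; field names follow the tables in this document, but the exact wire format (casing, enum encoding) may differ in your deployment.

```typescript
// Sketch of a complete JobConfig; every top-level key maps to a section below.
const jobConfig = {
  StartUrls: ["https://example.com/catalog"],            // required entry points
  JobName: "catalog-crawl",
  Type: "Internet",                                      // see JobTypes
  Headers: { HttpHeader: { Name: "Accept-Language", Values: ["en-US"] } },
  Restart: { JobRestartMode: "Continue" },               // see RestartConfig
  Https: { SuppressHttpsCertificateValidation: false },
  Cookies: { UseCookies: true },
  Proxy: { UseProxy: false, SendOvertRequestsOnProxiesFailure: true },
  DownloadErrorHandling: { Policy: "Retry", RetriesLimit: 3, RetryDelayMs: 1000 },
  CrawlersProtectionBypass: { RequestTimeoutSec: 30 },
  CrossDomainAccess: { Policy: "Subdomains" },
};
```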
JobTypes
Job types enumeration.
Restrictions on the allowed values and the default value for all jobs can be configured in the Dapi service.
Additionally, the Crawler service must be configured to handle jobs of each type you intend to run.
Values
Name | Description |
---|---|
Internet | Crawl data from internet sources via request gateways (proxy addresses, host IP addresses, etc.) |
Intranet | Crawl data from intranet sources with no limits |
HeadersConfig
Configuration for HTTP headers sent with each request.
Field | Type | Description |
---|---|---|
HttpHeader | HttpHeader | Required. HTTP header (name, values) |
HttpHeader
HTTP header configuration.
Fields
Name | Type | Description |
---|---|---|
Name | String | Required. Header name |
Values | Array of String | Required. Header values |
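A sketch of a HeadersConfig built from these fields (the user-agent string is a placeholder):

```typescript
// One header with a single value; Values is an array, so a multi-valued
// header lists each value as a separate element.
const headers = {
  HttpHeader: {
    Name: "User-Agent",
    Values: ["MyCrawler/1.0 (+https://example.com/bot)"],
  },
};
```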
RestartConfig
Settings for job restart behavior.
Field | Type | Description |
---|---|---|
JobRestartMode | JobRestartModes | Required. Job restart mode (Continue, FromScratch) |
JobRestartModes
Job restart mode enumeration.
Values
Name | Description |
---|---|
Continue | Reuse cached data and continue crawling and parsing new data |
FromScratch | Clear cached data and start from scratch |
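The two modes in a sketch:

```typescript
// Continue keeps cached data and picks up where the previous run stopped;
// FromScratch discards the cache and recrawls everything.
const resumeRun = { JobRestartMode: "Continue" };
const cleanRun  = { JobRestartMode: "FromScratch" };
```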
HttpsConfig
Settings for HTTPS certificate validation.
Field | Type | Description |
---|---|---|
SuppressHttpsCertificateValidation | Bool | Required. Suppress HTTPS certificate validation of a web resource |
CookiesConfig
Settings for cookies usage.
Field | Type | Description |
---|---|---|
UseCookies | Bool | Required. Save and reuse cookies between requests |
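Both HttpsConfig and CookiesConfig reduce to a single boolean; a minimal sketch of each:

```typescript
// Accept invalid or self-signed certificates (use with care).
const https = { SuppressHttpsCertificateValidation: true };

// Persist cookies across requests, e.g. to keep a session alive.
const cookies = { UseCookies: true };
```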
ProxiesConfig
Configuration for using proxies.
Field | Type | Description |
---|---|---|
UseProxy | Bool | Required. Use proxies for requests |
SendOvertRequestsOnProxiesFailure | Bool | Required. Send the request from the host's real IP address if all proxies fail |
IterateProxyResponseCodes | String | Optional. Comma-separated HTTP response codes that trigger switching to the next proxy. Default: '401, 403' |
Proxies | Array of ProxyConfig | Optional. Proxy configurations. Default: empty array |
ProxyConfig
Defines a single proxy server config.
Field | Type | Description |
---|---|---|
Protocol | String | Required. Proxy protocol (http, https, socks5) |
Host | String | Required. Proxy host |
Port | Int | Required. Proxy port |
UserName | String | Optional. Proxy username |
Password | String | Optional. Proxy password |
ConnectionsLimit | Int | Optional. Max concurrent connections |
AvailableHosts | Array of String | Optional. Hosts accessible via this proxy |
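A sketch of a ProxiesConfig with two upstream proxies. Hosts, ports, and credentials are placeholders; field names follow the tables above.

```typescript
const proxy = {
  UseProxy: true,
  SendOvertRequestsOnProxiesFailure: false,   // never fall back to the host's real IP
  IterateProxyResponseCodes: "401, 403, 429", // switch proxy on these codes
  Proxies: [
    // Anonymous HTTP proxy usable for any host.
    { Protocol: "http", Host: "proxy1.internal", Port: 8080 },
    // Authenticated SOCKS5 proxy restricted to example.com traffic.
    {
      Protocol: "socks5",
      Host: "proxy2.internal",
      Port: 1080,
      UserName: "crawler",
      Password: "s3cret",
      ConnectionsLimit: 10,
      AvailableHosts: ["example.com"],
    },
  ],
};
```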
DownloadErrorHandling
Settings for handling download errors.
Field | Type | Description |
---|---|---|
Policy | DownloadErrorHandlingPolicies | Required. Error handling policy (Skip, Retry) |
RetriesLimit | Int | Optional. Maximum number of retries (used when Policy is Retry) |
RetryDelayMs | Int | Optional. Delay before each retry, in milliseconds (used when Policy is Retry) |
DownloadErrorHandlingPolicies
Download error handling policies enumeration.
Values
Name | Description |
---|---|
Skip | Skip an error and continue crawling |
Retry | Retry the failed download |
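A sketch of each policy; with Skip, the retry fields are simply omitted:

```typescript
// Retry failed downloads up to 3 times, waiting 2 seconds between attempts.
const retryPolicy = { Policy: "Retry", RetriesLimit: 3, RetryDelayMs: 2000 };

// Failed downloads are skipped and crawling continues.
const skipPolicy = { Policy: "Skip" };
```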
CrawlersProtectionBypass
Settings for bypassing crawler-protection mechanisms.
Field | Type | Description |
---|---|---|
MaxResponseSizeKb | Int | Optional. Max response size in KB |
MaxRedirectHops | Int | Optional. Max redirect hops |
RequestTimeoutSec | Int | Optional. Max request timeout in seconds |
CrawlDelays | Array of CrawlDelay | Optional. Per-host crawl delays |
CrawlDelay
Defines a crawl delay for a host.
Field | Type | Description |
---|---|---|
Host | String | Required. Host |
Delay | String | Required. Delay value as a string: "0", a range such as "1-5", or "robots" |
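A sketch of a CrawlersProtectionBypass config using all three CrawlDelay forms; the comments mark where the semantics of the range and robots values are assumed rather than stated in this document.

```typescript
const crawlersProtectionBypass = {
  MaxResponseSizeKb: 2048, // drop responses larger than 2 MB
  MaxRedirectHops: 5,
  RequestTimeoutSec: 30,
  CrawlDelays: [
    { Host: "docs.example.com", Delay: "0" },      // no delay
    { Host: "example.com", Delay: "1-5" },         // assumed: a delay within the 1-5 range
    { Host: "shop.example.com", Delay: "robots" }, // assumed: honor the robots.txt crawl delay
  ],
};
```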
CrossDomainAccess
Settings for cross-domain crawling.
Field | Type | Description |
---|---|---|
Policy | CrossDomainAccessPolicies | Required. Cross-domain policy (None, Subdomains, CrossDomains) |
CrossDomainAccessPolicies
Cross-domain access policies enumeration.
Values
Name | Description |
---|---|
None | No subdomain or cross-domain access. Only the main domain is allowed |
Subdomains | The subdomains of the main domain are allowed (e.g., "example.com", "sub.example.com") |
CrossDomains | Allows access to any domain (e.g., "example.com", "sub.example.com", "another.com") |
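Assuming the StartUrls point at example.com, a sketch of what each policy admits:

```typescript
const sameDomainOnly = { Policy: "None" };       // example.com only
const withSubdomains = { Policy: "Subdomains" }; // example.com and sub.example.com
const anyDomain = { Policy: "CrossDomains" };    // any domain, e.g. another.com
```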