Tasks
Work with download tasks within a job: discover new pages (crawl), extract data (scrape), and inspect status/results.
Crawl
Discovers and queues follow-up pages from the current task’s URL (e.g., pagination and links), returning new download tasks to continue the crawl.
GET /api/v2/tasks/{taskId}/crawl
Path Parameters
Name | Type | Description |
---|---|---|
taskId | string | Required. A task ID returned by previous calls |
Query Parameters
Name | Type | Description |
---|---|---|
selector | string | Required. Selector for getting interesting links on a web page |
attributeName | string | Optional. Attribute name to get data from. Use val to get inner text. Default value: href |
Responses
200 (OK)
Page data processed successfully
Returns an array of follow-up DownloadTask objects
DownloadTask
Represents a single page download request produced by a crawl or scrape job.
Fields:
Name | Type | Description |
---|---|---|
Id | String | Required. Task Id |
Url | String | Required. Page URL |
202 (Accepted)
Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received
400 (Bad Request)
Invalid request parameters. Refer to the response text for more information
403 (Forbidden)
Unable to access the page content. Refer to the response text for more information
404 (Not Found)
Task not found
422 (Unprocessable Content)
There is an issue with processing the page content. Refer to the response text for more information
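As a minimal sketch, the crawl call and its 202 polling contract can be exercised as follows. This is Python; the base URL and the `fetch` transport are placeholders, not part of the API.

```python
import time
from urllib.parse import quote

API_BASE = "https://api.example.com"  # hypothetical host; substitute your own

def crawl_url(task_id, selector, attribute_name="href"):
    # Build the crawl request URL. attributeName defaults to href,
    # matching the documented default.
    return (f"{API_BASE}/api/v2/tasks/{task_id}/crawl"
            f"?selector={quote(selector, safe=':')}"
            f"&attributeName={quote(attribute_name)}")

def poll(fetch, url, delay_sec=1.0, max_attempts=30):
    # Re-issue the request while the task is still queued (202 Accepted),
    # per the response contract above.
    for _ in range(max_attempts):
        status, body = fetch(url)
        if status != 202:
            return status, body
        time.sleep(delay_sec)
    raise TimeoutError("task still queued after polling limit")
```

Here `fetch` is any callable returning `(status_code, parsed_body)`; on a 200, the body is the array of follow-up DownloadTask objects.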
Scrape
Extracts data from the current page using the provided selector (and optional attribute), returning the matched text or attribute values.
GET /api/v2/tasks/{taskId}/scrape
Path Parameters
Name | Type | Description |
---|---|---|
taskId | string | Required. A task ID returned by previous calls |
Query Parameters
Name | Type | Description |
---|---|---|
selector | string | Required. Selector for getting interesting data on a web page |
attributeName | string | Optional. Attribute name to get data from. Use val or leave null to get inner text |
Responses
200 (OK)
Page data processed successfully
Returns an array of strings with all data items found on a page according to the selector
202 (Accepted)
Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received
400 (Bad Request)
Invalid request parameters. Refer to the response text for more information
403 (Forbidden)
Unable to access the page content. Refer to the response text for more information
404 (Not Found)
Task not found
422 (Unprocessable Content)
There is an issue with processing the page content. Refer to the response text for more information
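A scrape request URL can be assembled in the same way. This sketch (Python, stdlib only) assumes a caller-supplied base URL; omitting `attributeName` requests inner text, per the parameter table above.

```python
from urllib.parse import quote

def scrape_url(base, task_id, selector, attribute_name=None):
    # attributeName may be left out (null) to get inner text.
    url = f"{base}/api/v2/tasks/{task_id}/scrape?selector={quote(selector, safe=':')}"
    if attribute_name is not None:
        url += f"&attributeName={quote(attribute_name)}"
    return url

print(scrape_url("https://api.example.com", "abc123", "css: h1"))
# -> https://api.example.com/api/v2/tasks/abc123/scrape?selector=css:%20h1
```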
Scrape Multiple
Extracts multiple data sets from the current page in a single request, applying each supplied selector (and optional attribute) and returning the matched text or attribute values under the corresponding name.
GET /api/v2/tasks/{taskId}/scrape-multiple
Path Parameters
Name | Type | Description |
---|---|---|
taskId | string | Required. A task ID returned by previous calls |
Query Parameters
Name | Type | Description |
---|---|---|
scrapeParams | array of ScrapeParams | Required. Scraping parameters |
ScrapeParams
Field | Type | Description |
---|---|---|
name | string | Required. A name to find the corresponding scrape result in a response |
selector | string | Required. Selector for getting interesting data on a web page |
attributeName | string | Optional. Attribute name to get data from. Use val or leave null to get inner text |
Query Example
GET /api/v2/tasks/{taskId}/scrape-multiple?scrapeParams[0].name=name&scrapeParams[0].selector=css:%20h1&scrapeParams[0].attributeName=val&scrapeParams[1].name=params&scrapeParams[1].selector=css:%20b
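The indexed query-string form above can be produced mechanically. A sketch in Python (the key layout `scrapeParams[i].field` follows the example; the encoding choice of keeping `:` literal is an assumption matched to it):

```python
from urllib.parse import quote

def scrape_multiple_query(scrape_params):
    # Serialize a list of ScrapeParams dicts into the indexed
    # query-string form shown in the example above.
    parts = []
    for i, p in enumerate(scrape_params):
        for key in ("name", "selector", "attributeName"):
            if p.get(key) is not None:
                parts.append(f"scrapeParams[{i}].{key}={quote(p[key], safe=':')}")
    return "&".join(parts)

params = [
    {"name": "name", "selector": "css: h1", "attributeName": "val"},
    {"name": "params", "selector": "css: b"},
]
print(scrape_multiple_query(params))
```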
Responses
200 (OK)
Page data processed successfully
Returns an array of ScrapeResult
ScrapeResult
Field | Type | Description |
---|---|---|
name | string | Required. A name specified in the request ScrapeParams |
values | array of string | Required. Data extracted from the page according to the specified selector |
202 (Accepted)
Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received
400 (Bad Request)
Invalid request parameters. Refer to the response text for more information
403 (Forbidden)
Unable to access the page content. Refer to the response text for more information
404 (Not Found)
Task not found
422 (Unprocessable Content)
There is an issue with processing the page content. Refer to the response text for more information
Scrape Multiple Body
Extracts multiple data sets from the current page in a single request, applying each supplied selector (and optional attribute) and returning the matched text or attribute values under the corresponding name.
POST /api/v2/tasks/{taskId}/scrape-multiple
This method performs the same function as Scrape Multiple, but accepts the array of ScrapeParams as the request body instead of serializing it into the query string.
Not all reverse proxies forward request bodies when the method is GET, so POST is used here. This is a reasonable trade-off.
Path Parameters
Name | Type | Description |
---|---|---|
taskId | string | Required. A task ID returned by previous calls |
Request Body
Array of ScrapeParams in JSON format
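For illustration, the same parameters shown in the Scrape Multiple query example can be expressed as a JSON body (a `null` attributeName requests inner text):

```python
import json

# Hypothetical request body, equivalent to the query-string example
# under Scrape Multiple.
scrape_params = [
    {"name": "name", "selector": "css: h1", "attributeName": "val"},
    {"name": "params", "selector": "css: b", "attributeName": None},
]
body = json.dumps(scrape_params)
print(body)
```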
Responses
200 (OK)
Page data processed successfully
Returns an array of ScrapeResult
202 (Accepted)
Task has been queued and is awaiting execution. Retry the request later, repeating until a response other than 202 (Accepted) is received
400 (Bad Request)
Invalid request parameters. Refer to the response text for more information
403 (Forbidden)
Unable to access the page content. Refer to the response text for more information
404 (Not Found)
Task not found
422 (Unprocessable Content)
There is an issue with processing the page content. Refer to the response text for more information
Info
Retrieves the current status and execution trace for a download task, including errors and links to result details when available.
GET /api/v2/tasks/{taskId}/info
Path Parameters
Name | Type | Description |
---|---|---|
taskId | string | Required. A task ID returned by previous calls |
Responses
200 (OK)
Download task status found
Returns DownloadTaskStatus
DownloadTaskStatus
Summarizes the execution state and outputs of a single download operation, including current status, any error, and final or intermediate results.
Fields:
Name | Type | Description |
---|---|---|
Error | String | Optional. Request execution error |
TaskState | DownloadTaskStates | Optional. Task state |
Result | DownloadInfo | Optional. Download result |
IntermedResults | Array of DownloadInfo | Optional. Intermediate requests download results stack |
DownloadTaskStates
Lifecycle states a download task can transition through from creation to completion or deletion.
Enumeration values:
Name | Description |
---|---|
Handled | Task is handled and its results are available |
AccessDeniedForRobots | Access to a URL is denied by robots.txt |
AllRequestGatesExhausted | All request gateways (proxy and host IP addresses) were exhausted but no data was received |
InProgress | Task is in progress |
Created | Task has not been started yet |
Deleted | Task has been deleted |
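When polling Info, a client typically needs to know whether a task can still make progress. A small helper, under the assumption (derived from the descriptions above) that Handled, AccessDeniedForRobots, AllRequestGatesExhausted, and Deleted are terminal:

```python
# Assumed terminal states: the task will not progress further from these.
TERMINAL_STATES = {"Handled", "AccessDeniedForRobots",
                   "AllRequestGatesExhausted", "Deleted"}

def is_terminal(state):
    # Created and InProgress tasks may still change state.
    return state in TERMINAL_STATES
```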
DownloadInfo
Captures request/response details for a download attempt, including HTTP metadata, headers, cookies, and payload.
Fields:
Name | Type | Description |
---|---|---|
Method | String | Required. HTTP method |
Url | String | Required. Request URL |
IsSuccess | Bool | Required. Was the request successful |
HttpStatusCode | Int | Required. HTTP status code |
ReasonPhrase | String | Required. HTTP reason phrase |
RequestHeaders | Array of HttpHeader | Required. HTTP headers sent with the request |
ResponseHeaders | Array of HttpHeader | Required. HTTP headers received in the response |
RequestCookies | Array of Cookie | Required. Cookies sent with the request |
ResponseCookies | Array of Cookie | Required. Cookies received in the response |
RequestDateUtc | DateTime | Required. Request date and time in UTC |
DownloadTimeSec | Double | Required. Download time in seconds |
ViaProxy | Bool | Required. Is the request made via a proxy |
WaitTimeSec | Double | Required. Delay (in seconds) before the request was executed (crawl delay, etc.) |
CrawlDelaySec | Int | Required. A delay in seconds applied to the request |
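A parsed DownloadInfo object can be condensed into a one-line log entry. A sketch assuming the JSON uses the field names from the table above:

```python
def summarize(info):
    # Produce a one-line log entry from a DownloadInfo dict (parsed JSON).
    flags = []
    if info["ViaProxy"]:
        flags.append("proxy")
    return (f'{info["Method"]} {info["Url"]} -> {info["HttpStatusCode"]} '
            f'{info["ReasonPhrase"]} ({info["DownloadTimeSec"]:.2f}s'
            + (", " + ", ".join(flags) if flags else "") + ")")
```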
HttpHeader
Represents a single HTTP header with a name and one or more values.
Fields:
Name | Type | Description |
---|---|---|
Name | String | Required. Header name |
Values | Array of String | Required. Header values |
Cookie
Represents an HTTP cookie as exchanged via the Set-Cookie and Cookie headers, including its attributes.
Fields:
Name | Type | Description |
---|---|---|
Name | String | Required. Name |
Value | String | Required. Value |
Domain | String | Required. Domain |
Path | String | Required. Path |
HttpOnly | Bool | Required. HttpOnly |
Secure | Bool | Required. Secure |
Expires | DateTime | Optional. Expires |
404 (Not Found)
Task not found
Selector Format
The selector argument has the format CSS|XPATH: selector. The prefix specifies the selector type; the remainder must be a valid selector of that type.
Supported types: