Crawler Service

This service downloads pages.

This service is not open-source, but its image is publicly available on Docker Hub.

Configuration

The following environment variables are used to configure this service:

DATAKEEPER_ORIGIN: Required. The origin of the datakeeper service.
SERVICE_HOST: Required. The host on which the current service is available.
EXTERNAL_IP_ADDRESS_CONFIGS: Required. A comma-separated list of external IP getter services.
MAX_INACTIVE_SEC_TO_REREGISTRAR: Optional. Each crawler service registers itself with a datakeeper service on start. Sometimes something goes wrong and a crawler gets forgotten; this parameter defines how long the crawler waits without receiving requests before re-registering itself. Default: 60 seconds.
MIN_LOG_LEVEL: Optional. Minimal log level. Default: INFO.
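To make the required/optional split concrete, here is a minimal sketch of how a service like this might read its settings from the environment. The function name `load_config` and the returned dict shape are illustrative assumptions, not the service's actual internals; only the variable names and defaults come from the table above.

```python
import os

def load_config(env=os.environ):
    """Collect the crawler's settings from environment variables.

    Variable names and defaults mirror the configuration table;
    the returned dict layout is purely illustrative.
    """
    def required(name):
        value = env.get(name)
        if not value:
            raise RuntimeError(f"missing required environment variable: {name}")
        return value

    return {
        "datakeeper_origin": required("DATAKEEPER_ORIGIN"),
        "service_host": required("SERVICE_HOST"),
        "external_ip_address_configs": required("EXTERNAL_IP_ADDRESS_CONFIGS"),
        # Optional settings fall back to the documented defaults.
        "max_inactive_sec_to_reregistrar": int(env.get("MAX_INACTIVE_SEC_TO_REREGISTRAR", "60")),
        "min_log_level": env.get("MIN_LOG_LEVEL", "INFO"),
    }
```

A missing required variable raises immediately, so a misconfigured container fails fast at startup rather than mid-crawl.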

External IP getter services

For the Internet job type, requests can be sent via proxies and/or directly from the crawlers' public IP addresses.
In the latter case, it is essential to know the crawlers' public IP addresses. There are several ways to provide this information; see the examples below.

External IP address config examples

The crawler can be used only for Intranet jobs:
EXTERNAL_IP_ADDRESS_CONFIGS=intranet

The crawler can be used for both Intranet and Internet jobs:
EXTERNAL_IP_ADDRESS_CONFIGS=intranet, amazon

The crawler can be used for Internet jobs only, and its public IP address is static:
EXTERNAL_IP_ADDRESS_CONFIGS=20.21.22.23
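Based on the examples above, a consumer of this variable could split it on commas and classify each entry as the literal `intranet` marker, a static IP address, or the name of an external IP getter service (such as `amazon`). The classification scheme below is an assumption inferred from the examples, and `parse_ip_configs` is a hypothetical helper, not part of the service's public API.

```python
import ipaddress

def parse_ip_configs(raw):
    """Split a comma-separated EXTERNAL_IP_ADDRESS_CONFIGS value and
    tag each entry as 'intranet', 'static_ip', or 'getter_service'.

    The tag names are illustrative; the real service may model this
    differently.
    """
    entries = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        if item == "intranet":
            entries.append(("intranet", item))
        else:
            try:
                # A parseable address is treated as a static public IP.
                ipaddress.ip_address(item)
                entries.append(("static_ip", item))
            except ValueError:
                # Anything else is assumed to name a getter service.
                entries.append(("getter_service", item))
    return entries
```

For instance, `parse_ip_configs("intranet, amazon")` yields one intranet entry and one getter-service entry, matching the second example above.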

