Crawler Service

This service downloads pages.

This service is not open-source, but its image is publicly available on Docker Hub.

Configuration

The following environment variables are used to configure this service:

DATAKEEPER_ORIGIN: Required. The origin of the datakeeper service.
SERVICE_HOST: Required. The host on which the current service is available.
EXTERNAL_IP_ADDRESS_CONFIGS: Required. A comma-separated list of external IP getter services.
MAX_INACTIVE_SEC_TO_REREGISTRAR: Optional. Each crawler service registers itself with a datakeeper service on start. Sometimes something goes wrong and a crawler gets forgotten; this parameter defines how long the crawler waits without receiving requests before re-registering itself. Default: 60 seconds.
MIN_LOG_LEVEL: Optional. Minimal log level. Default: INFO.
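To make the required/optional split concrete, here is a minimal sketch of how a service like this might read its settings from the environment. The function name `load_config` and the returned dict shape are illustrative assumptions, not the service's actual internals; only the variable names and defaults come from the table above.

```python
import os

def load_config(env=os.environ):
    """Collect the crawler's settings from environment variables.

    Variable names and defaults mirror the configuration table;
    the returned dict layout is purely illustrative.
    """
    def required(name):
        value = env.get(name)
        if not value:
            raise RuntimeError(f"missing required environment variable: {name}")
        return value

    return {
        "datakeeper_origin": required("DATAKEEPER_ORIGIN"),
        "service_host": required("SERVICE_HOST"),
        "external_ip_address_configs": required("EXTERNAL_IP_ADDRESS_CONFIGS"),
        # Optional settings fall back to the documented defaults.
        "max_inactive_sec_to_reregistrar": int(env.get("MAX_INACTIVE_SEC_TO_REREGISTRAR", "60")),
        "min_log_level": env.get("MIN_LOG_LEVEL", "INFO"),
    }
```

A missing required variable raises immediately, so a misconfigured container fails fast at startup rather than mid-crawl.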

External IP getter services

For the Internet job type, requests can be sent via proxies and/or directly from the crawlers' public IP addresses.
In the latter case, it is essential to know the crawlers' public IP addresses. There are several ways to provide this information; see the examples below.

External IP address config examples

The crawler can be used only for Intranet jobs:
EXTERNAL_IP_ADDRESS_CONFIGS=intranet

The crawler can be used for both Intranet and Internet jobs:
EXTERNAL_IP_ADDRESS_CONFIGS=intranet, amazon

The crawler can be used for Internet jobs only, and its public IP address is static:
EXTERNAL_IP_ADDRESS_CONFIGS=20.21.22.23
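Based on the examples above, a consumer of this variable could split it on commas and classify each entry as the literal `intranet` marker, a static IP address, or the name of an external IP getter service (such as `amazon`). The classification scheme below is an assumption inferred from the examples, and `parse_ip_configs` is a hypothetical helper, not part of the service's public API.

```python
import ipaddress

def parse_ip_configs(raw):
    """Split a comma-separated EXTERNAL_IP_ADDRESS_CONFIGS value and
    tag each entry as 'intranet', 'static_ip', or 'getter_service'.

    The tag names are illustrative; the real service may model this
    differently.
    """
    entries = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        if item == "intranet":
            entries.append(("intranet", item))
        else:
            try:
                # A parseable address is treated as a static public IP.
                ipaddress.ip_address(item)
                entries.append(("static_ip", item))
            except ValueError:
                # Anything else is assumed to name a getter service.
                entries.append(("getter_service", item))
    return entries
```

For instance, `parse_ip_configs("intranet, amazon")` yields one intranet entry and one getter-service entry, matching the second example above.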

