Crawler Service
Crawler Service downloads web pages.
The service is not open-source, but its image is publicly available on DockerHub.
Configuration
The following environment variables are used to configure this service:
| Name | Description |
|---|---|
| DATAKEEPER_ORIGIN | Required. The origin of the datakeeper service |
| SERVICE_HOST | Required. The host on which the current service is available |
| EXTERNAL_IP_ADDRESS_CONFIGS | Required. A comma-separated list of external IP getter services |
| MAX_INACTIVE_SEC_TO_REREGISTRAR | Optional. Each crawler service registers itself with a datakeeper service on start. Sometimes something goes wrong and a crawler gets forgotten, so this parameter defines the period without requests after which the crawler re-registers itself. Default value is 60 seconds |
| MIN_LOG_LEVEL | Optional. Minimal log level. Default value is INFO |
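For illustration, these variables can be passed to the container at startup. This is a minimal sketch: the image name, hosts, and values below are placeholders, not the real ones.

```bash
# All names and values here are placeholders; substitute the actual
# DockerHub image and your own hosts.
IMAGE="example/crawler:latest"

docker run -d \
  -e DATAKEEPER_ORIGIN="http://datakeeper.internal:8080" \
  -e SERVICE_HOST="crawler-01.internal" \
  -e EXTERNAL_IP_ADDRESS_CONFIGS="intranet, amazon" \
  -e MAX_INACTIVE_SEC_TO_REREGISTRAR=120 \
  -e MIN_LOG_LEVEL=INFO \
  "$IMAGE"
```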
External IP getter services
For the Internet job type, requests can be sent via proxies and/or from crawlers' public IP addresses.
In the latter case, it is essential to know the crawlers' public IP addresses. There are several ways to provide this information:
- amazon - the https://checkip.amazonaws.com service is used to get the crawler's public IP address
- directIP - a particular IP address is used. The value should be a valid IP address in the format XX.XX.XX.XX
- intranet - a reserved value that allows crawlers to be used for Intranet jobs
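The amazon option relies on AWS's public plain-text endpoint: a GET request to https://checkip.amazonaws.com returns the caller's public IP address as text. Roughly, the lookup amounts to the following (output shown is illustrative):

```bash
# Ask AWS which IP this machine's outbound traffic appears to come from.
# The endpoint responds with the public IP as plain text, e.g. "20.21.22.23".
curl -s https://checkip.amazonaws.com
```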
External IP address config examples
Crawler can be used only for Intranet jobs:

```
EXTERNAL_IP_ADDRESS_CONFIGS=intranet
```

Crawler can be used for Intranet and Internet jobs:

```
EXTERNAL_IP_ADDRESS_CONFIGS=intranet, amazon
```

Crawler can be used for Internet jobs only, and its public IP address is a static one:

```
EXTERNAL_IP_ADDRESS_CONFIGS=20.21.22.23
```
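As a sketch only, the value could also be assembled at deploy time. The CRAWLER_STATIC_IP variable below is hypothetical and is not read by the service itself:

```bash
# Hypothetical deploy-time helper: prefer a static public IP when one is
# assigned to this host, otherwise fall back to the amazon getter.
if [ -n "${CRAWLER_STATIC_IP:-}" ]; then
  export EXTERNAL_IP_ADDRESS_CONFIGS="$CRAWLER_STATIC_IP"
else
  export EXTERNAL_IP_ADDRESS_CONFIGS="amazon"
fi
```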