Troubleshooting large Fargate clusters
Occasionally, the SWO K8s Collector may fail when deployed to a large Kubernetes cluster running on AWS Fargate.
Identify the issue
The metrics collector that is part of the SWO K8s Collector needs approximately 10 MB of available memory per node in the cluster; otherwise, it can start failing. Typical symptoms include:
- The swo-k8s-collector-metrics deployment keeps restarting every few minutes, either because of out-of-memory issues or because of failing liveness/readiness probes.
- The logs from the deployment contain repeated messages about failed scraping of metrics, for example, Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/node-metrics", "data_type": "metrics", .... They may also contain messages about failures to send data to SolarWinds Observability SaaS, for example, Exporting failed. Dropping data.
Resolve the issue
To resolve the issue, adjust the Helm chart configuration:
otel:
  metrics:
    memory_limiter:
      limit_mib: 2560
    resources:
      limits:
        memory: 3Gi
      requests:
        memory: 3Gi
The otel.metrics.resources.limits.memory and otel.metrics.resources.requests.memory values should be large enough to cover all nodes in the Kubernetes cluster. For example, for a cluster with 600 nodes, the limit should be at least 6 GB. The default value is 3 GB.
Additionally, the otel.metrics.memory_limiter.limit_mib value should be set slightly lower than the resource limits. The SWO K8s Collector tries to keep its memory usage below this value. The default value is 2.5 GB.
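For instance, sizing for a cluster of roughly 600 nodes might look like the sketch below. The values 6Gi and 5120 are illustrative only, derived from the 10 MB-per-node guideline above; adjust them to your actual node count and keep limit_mib somewhat below the container memory limit:
otel:
  metrics:
    memory_limiter:
      # Illustrative: a few hundred MiB below the 6Gi (6144 MiB) container limit
      limit_mib: 5120
    resources:
      limits:
        # Roughly 600 nodes x 10 MB per node => at least 6 GB
        memory: 6Gi
      requests:
        memory: 6Gi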
In addition to the above, you can decrease memory usage by reducing the frequency of metrics scraping, that is, by increasing the scrape interval. This can be achieved with the following Helm chart configuration:
otel:
  metrics:
    prometheus:
      scrape_interval: 120s
The default scrape interval is 1 minute. Increasing it to 2 minutes or more will give the SWO K8s Collector more time to process the data. This slightly reduces its memory usage.
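Because both adjustments nest under the same otel.metrics key, they can be combined into a single Helm values override. The following sketch merges the memory sizing and scrape interval examples above, again with illustrative numbers for a cluster of roughly 600 nodes:
otel:
  metrics:
    memory_limiter:
      # Illustrative value for a ~600-node cluster
      limit_mib: 5120
    resources:
      limits:
        memory: 6Gi
      requests:
        memory: 6Gi
    prometheus:
      # Scrape every 2 minutes instead of the 1-minute default
      scrape_interval: 120s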