Troubleshooting large Fargate clusters
Occasionally, the SWO K8s Collector may fail when deployed to a large Kubernetes cluster running on AWS Fargate.
Identify the issue
The metrics collector that is part of the SWO K8s Collector needs approximately 10 MB of available memory per node in the cluster; otherwise, it can start failing. Typical symptoms include:
- The swo-k8s-collector-metrics deployment keeps restarting every few minutes, either because of out-of-memory issues or because of failing liveness/readiness probes.
- The logs from the deployment contain repeated messages about failed scraping of metrics, for example, Failed to scrape Prometheus endpoint {"kind": "receiver", "name": "prometheus/node-metrics", "data_type": "metrics", .... They may also contain messages about failures to send data to SolarWinds Observability SaaS, for example, Exporting failed. Dropping data.
Resolve the issue
To resolve the issue, adjust the Helm chart configuration:
otel:
  metrics:
    memory_limiter:
      limit_mib: 2560
    resources:
      limits:
        memory: 3Gi
      requests:
        memory: 3Gi
The otel.metrics.resources.limits.memory and otel.metrics.resources.requests.memory values should be large enough to cover all nodes in the Kubernetes cluster. For example, for a cluster with 600 nodes, the limit should be at least 6 GB. The default value is 3 GB.
Additionally, the otel.metrics.memory_limiter.limit_mib value should be set slightly lower than the resource limits. The SWO K8s Collector tries to keep its memory usage below this value. The default value is 2.5 GB.
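For instance, sizing for a cluster of roughly 600 nodes might look like the sketch below. The values 6Gi and 5120 are illustrative only, derived from the 10 MB-per-node guideline above; adjust them to your actual node count and keep limit_mib somewhat below the container memory limit:
otel:
  metrics:
    memory_limiter:
      # Illustrative: a few hundred MiB below the 6Gi (6144 MiB) container limit
      limit_mib: 5120
    resources:
      limits:
        # Roughly 600 nodes x 10 MB per node => at least 6 GB
        memory: 6Gi
      requests:
        memory: 6Gi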
In addition to the above, you can decrease memory usage by reducing the frequency of metrics scraping, that is, by increasing the scrape interval. This can be achieved with the following Helm chart configuration:
otel:
  metrics:
    prometheus:
      scrape_interval: 120s
The default scrape interval is 1 minute. Increasing it to 2 minutes or more will give the SWO K8s Collector more time to process the data. This slightly reduces its memory usage.
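Because both adjustments nest under the same otel.metrics key, they can be combined into a single Helm values override. The following sketch merges the memory sizing and scrape interval examples above, again with illustrative numbers for a cluster of roughly 600 nodes:
otel:
  metrics:
    memory_limiter:
      # Illustrative value for a ~600-node cluster
      limit_mib: 5120
    resources:
      limits:
        memory: 6Gi
      requests:
        memory: 6Gi
    prometheus:
      # Scrape every 2 minutes instead of the 1-minute default
      scrape_interval: 120s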