
resource usage metrics not collected when running gVisor #44084

@kpurdon


Agent version

7.x

Bug Report

Description

The kubelet check fails to collect actual resource usage metrics (kubernetes.cpu.usage.total, kubernetes.memory.usage, etc.) for pods running on gVisor nodes, while successfully collecting them for regular containerd pods on the same cluster.

Prometheus/Grafana successfully collects these same metrics from the same cAdvisor endpoint, so the data is available; the issue is specific to how the Datadog Agent processes it.

Environment

  • Datadog Agent version: 7.x (Helm chart v3.115.0)
  • Kubernetes version: 1.32 (EKS)
  • Container runtime: containerd with gVisor (runsc) RuntimeClass
  • Cloud provider: AWS EKS

Expected Behavior

The kubelet check should collect kubernetes.cpu.usage.total, kubernetes.memory.usage, kubernetes.memory.working_set, etc. for all pods, including those running on gVisor nodes.

Actual Behavior

For gVisor pods:

  • kubernetes.cpu.limits, kubernetes.cpu.requests - collected (from pod spec)
  • kubernetes.memory.limits, kubernetes.memory.requests - collected (from pod spec)
  • kubernetes.cpu.usage.total - NOT collected
  • kubernetes.memory.usage, kubernetes.memory.working_set, kubernetes.memory.rss - NOT collected

For regular pods on the same cluster: all metrics collected correctly.

Root Cause Analysis

1. gVisor cAdvisor output structure

gVisor runs containers inside a sandbox, making individual container cgroups invisible to the host. cAdvisor returns pod-level aggregate metrics with empty container identifiers:

Regular pod (has per-container metrics):

container_cpu_usage_seconds_total{container="cilium-agent",cpu="total",name="4e2e7b228584...",namespace="kube-system",pod="cilium-tstw9"} 2332.54

gVisor pod (only pod-level aggregate):

container_cpu_usage_seconds_total{container="",cpu="total",name="",namespace="console",pod="zcs-agent-xyz"} 767.62
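The difference can be made concrete with a short parsing sketch (Python here purely for illustration; the agent itself is written in Go). The two sample lines are the ones shown above:

```python
import re

# Matches one key="value" pair inside a Prometheus exposition-format label set.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_labels(sample: str) -> dict:
    """Return the label set of a single Prometheus sample line."""
    inside = sample[sample.index("{") + 1 : sample.index("}")]
    return dict(LABEL_RE.findall(inside))

regular = ('container_cpu_usage_seconds_total{container="cilium-agent",'
           'cpu="total",name="4e2e7b228584...",namespace="kube-system",'
           'pod="cilium-tstw9"} 2332.54')
gvisor = ('container_cpu_usage_seconds_total{container="",cpu="total",'
          'name="",namespace="console",pod="zcs-agent-xyz"} 767.62')

# A per-container sample carries both a container name and an ID; the gVisor
# pod-level aggregate carries neither, so any ID-keyed lookup has nothing
# to match on.
print(parse_labels(regular)["container"])  # cilium-agent
print(parse_labels(gvisor)["container"])   # (empty string)
```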

2. Kubelet check requires container ID correlation

Debug logs show the kubelet check failing to correlate pod-level metrics:

pod not found for id:, name:zcs-agent-xyz, namespace:console
Tags not found for container: console/zcs-agent-xyz/main

The code path in pkg/collector/corechecks/containers/kubelet/common/pod.go:175 (GetContainerID()) expects a container ID to match against the workloadmeta store. When cAdvisor returns container="", there's no ID to match.
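A minimal sketch of that correlation step (hypothetical names; the real logic lives in Go in `GetContainerID()` and the workloadmeta store) shows why an empty container identifier produces the "pod not found" / "Tags not found" failures above:

```python
# Hypothetical stand-in for the workloadmeta store: container ID -> tags.
WORKLOADMETA = {
    "4e2e7b228584": ["kube_container_name:cilium-agent",
                     "pod_name:cilium-tstw9"],
}

def tags_for_sample(labels: dict):
    """Mimic the kubelet check: resolve tags via the container ID.

    cAdvisor's `name` label carries the container ID for regular pods.
    For a gVisor pod-level aggregate it is "", so the lookup fails and
    the sample is dropped instead of being emitted with pod-level tags.
    """
    container_id = labels.get("name", "")
    if not container_id:
        return None  # -> "pod not found for id:, name:..., namespace:..."
    return WORKLOADMETA.get(container_id)

print(tags_for_sample({"name": "4e2e7b228584", "pod": "cilium-tstw9"}))
print(tags_for_sample({"name": "", "pod": "zcs-agent-xyz"}))  # None
```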

3. Prometheus succeeds because it doesn't correlate

The prometheus-kubelet ServiceMonitor uses honorLabels: true and accepts metrics as-is without requiring container ID correlation. This is why Grafana dashboards show gVisor pod metrics while Datadog doesn't.
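For comparison, the relevant part of the prometheus-kubelet ServiceMonitor looks roughly like this (a sketch using Prometheus Operator field names; the exact manifest in our cluster may differ):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prometheus-kubelet
spec:
  endpoints:
    - port: https-metrics
      path: /metrics/cadvisor
      honorLabels: true  # keep scraped pod/namespace labels as-is;
                         # no container-ID correlation is performed
```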

Reproduction Steps

  1. Create a Kubernetes cluster with gVisor RuntimeClass configured
  2. Deploy a pod using the gVisor RuntimeClass
  3. Deploy Datadog agent with default kubelet check
  4. Observe that kubernetes.cpu.usage.total and kubernetes.memory.usage are missing for gVisor pods
  5. Verify data exists by curling cAdvisor directly:
    kubectl exec -n datadog <agent-pod> -c agent -- sh -c \
      'curl -sk "https://$DD_KUBERNETES_KUBELET_HOST:10250/metrics/cadvisor" \
      -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
      | grep container_cpu_usage_seconds_total | grep <gvisor-namespace>'

Workaround

We're currently using the OpenMetrics integration to scrape /metrics/cadvisor directly, which works, but:

  • Emits metrics with different names (cadvisor.container_cpu_usage_seconds_total vs kubernetes.cpu.usage.total)
  • Doesn't integrate with Datadog's built-in Kubernetes dashboards and monitors
  • Requires custom dashboards/monitors for gVisor workloads
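The workaround instance looks roughly like this (a sketch following the Datadog OpenMetrics check schema; the metric list is abbreviated and auth/TLS options are omitted, so treat field values as illustrative):

```yaml
instances:
  - openmetrics_endpoint: "https://%%host%%:10250/metrics/cadvisor"
    tls_verify: false
    namespace: cadvisor  # metrics are emitted as cadvisor.<metric_name>,
                         # not under the kubernetes.* names
    metrics:
      - container_cpu_usage_seconds_total
      - container_memory_working_set_bytes
```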

