Agent version
7.x
Bug Report
Description
The kubelet check fails to collect actual resource usage metrics (kubernetes.cpu.usage.total, kubernetes.memory.usage, etc.) for pods running on gVisor nodes, while successfully collecting them for regular containerd pods on the same cluster.
Prometheus/Grafana successfully collects these same metrics from the same cAdvisor endpoint, so the data is available - the issue is specific to how the Datadog agent processes it.
Environment
- Datadog Agent version: 7.x (Helm chart v3.115.0)
- Kubernetes version: 1.32 (EKS)
- Container runtime: containerd with gVisor (runsc) RuntimeClass
- Cloud provider: AWS EKS
Expected Behavior
The kubelet check should collect kubernetes.cpu.usage.total, kubernetes.memory.usage, kubernetes.memory.working_set, etc. for all pods, including those running on gVisor nodes.
Actual Behavior
For gVisor pods:
- ✅
kubernetes.cpu.limits,kubernetes.cpu.requests- collected (from pod spec) - ✅
kubernetes.memory.limits,kubernetes.memory.requests- collected (from pod spec) - ❌
kubernetes.cpu.usage.total- NOT collected - ❌
kubernetes.memory.usage,kubernetes.memory.working_set,kubernetes.memory.rss- NOT collected
For regular pods on the same cluster: all metrics collected correctly.
Root Cause Analysis
1. gVisor cAdvisor output structure
gVisor runs containers inside a sandbox, making individual container cgroups invisible to the host. cAdvisor returns pod-level aggregate metrics with empty container identifiers:
Regular pod (has per-container metrics):

```
container_cpu_usage_seconds_total{container="cilium-agent",cpu="total",name="4e2e7b228584...",namespace="kube-system",pod="cilium-tstw9"} 2332.54
```

gVisor pod (only pod-level aggregate):

```
container_cpu_usage_seconds_total{container="",cpu="total",name="",namespace="console",pod="zcs-agent-xyz"} 767.62
```
2. Kubelet check requires container ID correlation
Debug logs show the kubelet check failing to correlate pod-level metrics:
```
pod not found for id:, name:zcs-agent-xyz, namespace:console
Tags not found for container: console/zcs-agent-xyz/main
```
The code path in pkg/collector/corechecks/containers/kubelet/common/pod.go:175 (GetContainerID()) expects a container ID to match against the workloadmeta store. When cAdvisor returns container="", there's no ID to match.
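To make the failure mode concrete, here is a minimal, self-contained Go sketch of that correlation step. It is not the agent's actual code: `metricSample`, `tagsByContainerID`, and `resolveTags` are hypothetical stand-ins for the parsed cAdvisor sample and the workloadmeta lookup.

```go
package main

import "fmt"

// metricSample is a hypothetical stand-in for a parsed cAdvisor sample.
type metricSample struct {
	ContainerName string // value of the "container" label
	ContainerID   string // value of the "name"/"id" label
	Pod           string
	Namespace     string
}

// tagsByContainerID mimics a workloadmeta-style store keyed by container ID.
var tagsByContainerID = map[string][]string{
	"4e2e7b228584": {"kube_namespace:kube-system", "pod_name:cilium-tstw9", "kube_container_name:cilium-agent"},
}

// resolveTags mirrors the correlation step: with an empty container ID there is
// nothing to look up, so the sample is dropped rather than emitted.
func resolveTags(s metricSample) ([]string, bool) {
	if s.ContainerID == "" {
		return nil, false // gVisor pod-level aggregate: container="" and name=""
	}
	tags, ok := tagsByContainerID[s.ContainerID]
	return tags, ok
}

func main() {
	regular := metricSample{ContainerName: "cilium-agent", ContainerID: "4e2e7b228584", Pod: "cilium-tstw9", Namespace: "kube-system"}
	gvisor := metricSample{Pod: "zcs-agent-xyz", Namespace: "console"} // empty container labels

	for _, s := range []metricSample{regular, gvisor} {
		if tags, ok := resolveTags(s); ok {
			fmt.Printf("emit %s/%s with tags %v\n", s.Namespace, s.Pod, tags)
		} else {
			fmt.Printf("drop %s/%s: no container ID to correlate\n", s.Namespace, s.Pod)
		}
	}
}
```

Running the sketch emits the regular pod's sample and drops the gVisor pod's, mirroring the debug logs above.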
3. Prometheus succeeds because it doesn't correlate
The prometheus-kubelet ServiceMonitor uses honorLabels: true and accepts metrics as-is without requiring container ID correlation. This is why Grafana dashboards show gVisor pod metrics while Datadog doesn't.
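For contrast, the pod-level aggregate still carries `namespace` and `pod` labels, which is all Prometheus needs to attribute it. The illustrative sketch below (hypothetical `podKey`/`tagsByPod`, not agent code) shows attribution keyed on those labels alone, with no container ID involved.

```go
package main

import "fmt"

// podKey and tagsByPod are hypothetical; they show attribution keyed on
// namespace/pod labels alone, which is all the gVisor aggregate sample carries.
type podKey struct{ Namespace, Pod string }

var tagsByPod = map[podKey][]string{
	{Namespace: "console", Pod: "zcs-agent-xyz"}: {"kube_namespace:console", "pod_name:zcs-agent-xyz"},
}

func main() {
	// The pod-level aggregate has container="" but still carries namespace and
	// pod labels, so a pod-keyed lookup can attribute it without a container ID.
	if tags, ok := tagsByPod[podKey{Namespace: "console", Pod: "zcs-agent-xyz"}]; ok {
		fmt.Println("emit pod-level sample with tags", tags)
	}
}
```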
Reproduction Steps
- Create a Kubernetes cluster with gVisor RuntimeClass configured
- Deploy a pod using the gVisor RuntimeClass
- Deploy Datadog agent with default kubelet check
- Observe that `kubernetes.cpu.usage.total` and `kubernetes.memory.usage` are missing for gVisor pods
- Verify the data exists by curling cAdvisor directly:

```
kubectl exec -n datadog <agent-pod> -c agent -- sh -c \
  'curl -sk "https://$DD_KUBERNETES_KUBELET_HOST:10250/metrics/cadvisor" \
  -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  | grep container_cpu_usage_seconds_total | grep <gvisor-namespace>'
```
Workaround
We're currently using the OpenMetrics integration to scrape /metrics/cadvisor directly, which works but:
- Emits metrics with different names (`cadvisor.container_cpu_usage_seconds_total` vs `kubernetes.cpu.usage.total`)
- Doesn't integrate with Datadog's built-in Kubernetes dashboards and monitors
- Requires custom dashboards/monitors for gVisor workloads
Reproduction Steps
No response
Agent configuration
No response
Operating System
No response
Other environment details
No response