increase chan size to fix oom event lost #386
base: main
Conversation
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
 func (c *Manager) EventChan() (<-chan Event, <-chan error) {
-	ec := make(chan Event, 1)
+	ec := make(chan Event, 16)
why 16..
It is an empirical value; I tested it many times. I'm not sure. @mikebrow
Or we can try:
go func() {
	ec <- Event{
		Low:     out["low"],
		High:    out["high"],
		Max:     out["max"],
		OOM:     out["oom"],
		OOMKill: out["oom_kill"],
	}
}()
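For context, here is a minimal, self-contained sketch of the pattern suggested above: sending from a separate goroutine so the memory.events poller is never blocked by a slow consumer. The Event field names and the map keys come from the snippet in this thread; everything else (sendAsync, the main function) is illustrative and not the code under review.

package main

import "fmt"

// Event mirrors the fields used in the snippet above.
type Event struct {
	Low, High, Max, OOM, OOMKill uint64
}

// sendAsync sends the event from its own goroutine, so the caller that is
// polling memory.events is never blocked by a slow consumer. The tradeoff is
// that events may be delivered out of order, and goroutines pile up if the
// consumer never reads.
func sendAsync(ec chan<- Event, out map[string]uint64) {
	go func() {
		ec <- Event{
			Low:     out["low"],
			High:    out["high"],
			Max:     out["max"],
			OOM:     out["oom"],
			OOMKill: out["oom_kill"],
		}
	}()
}

func main() {
	ec := make(chan Event, 16) // buffered, as in the diff above
	sendAsync(ec, map[string]uint64{"oom": 1, "oom_kill": 1})
	fmt.Printf("%+v\n", <-ec)
}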
I think the event loss is not caused by the channel size; it's a race condition on the shim side.
The shim sets GOMAXPROCS=4, and in the CI action critest runs 8 cases in parallel. So, ideally, it will run 8 pods at the same time on a 4-CPU-core node. If the shim gets the exit event before we read the oom event, we can lose the event.
I am thinking we should drain the select-oom-event goroutine before sending the exit event on the shim side. Let me do some performance or density tests for that. I will update it later.
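A minimal sketch of the "drain before exit" idea described above: wait for the oom-event goroutine to flush whatever it has already read from memory.events before publishing the task exit event. The names here (oomEvents, the publish step) are hypothetical stand-ins, not the shim's real API.

package main

import (
	"fmt"
	"sync"
)

func main() {
	oomEvents := make(chan string, 16)
	var wg sync.WaitGroup

	// oom-monitoring goroutine: reads events and forwards them.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for ev := range oomEvents {
			fmt.Println("forward oom event:", ev)
		}
	}()

	// Simulated oom event observed just before the task exits.
	oomEvents <- "oom_kill=1"

	// Drain: close the source and wait for the goroutine to finish
	// forwarding before the exit event goes out, so the oom event
	// cannot be lost behind the exit.
	close(oomEvents)
	wg.Wait()
	fmt.Println("publish task exit event")
}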
fair.. there may be another window where we get the task exit event on the CRI side first, then a status request, and report the container as exited with the error reason.. then we receive the oom event and update the container exit reason.. then if they ask again we give status with the oom reason? The two events use two checkpoints, both protected by the same container store global mutex, and because of this lock and the storage locks the window for the racing oom test is "tight".
Thinking we might want to check whether there is an exit reason queued up before reporting status while it's exiting.. same for the generateAndSendContainerEvent() call to the kubelet at the bottom of the task exit handler.. if we don't get the oom event first, we have a window where we report the task exit with no reason.
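As a rough illustration of the "check for a queued reason before inventing one" idea in the comment above, here is a sketch. The Status fields and reason strings echo the toCRIContainerStatus snippet quoted below, but the pendingOOM flag and the exitReason helper are hypothetical, not the CRI plugin's actual code.

package main

import "fmt"

// Status is a stand-in for the container status record discussed here.
type Status struct {
	ExitCode int32
	Reason   string
}

func exitReason(s Status, pendingOOM bool) string {
	if s.Reason != "" {
		return s.Reason // a real reason is already recorded
	}
	if pendingOOM {
		return "OOMKilled" // an oom event is queued but not yet applied
	}
	if s.ExitCode == 0 {
		return "Completed"
	}
	return "Error"
}

func main() {
	fmt.Println(exitReason(Status{ExitCode: 137}, true))  // OOMKilled
	fmt.Println(exitReason(Status{ExitCode: 137}, false)) // Error
}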
worse.. when we get the container status:
func toCRIContainerStatus(ctx context.Context, container containerstore.Container, spec *runtime.ImageSpec, imageRef string) (*runtime.ContainerStatus, error) {
	meta := container.Metadata
	status := container.Status.Get()
	reason := status.Reason
	if status.State() == runtime.ContainerState_CONTAINER_EXITED && reason == "" {
		if status.ExitCode == 0 {
			reason = completeExitReason
		} else {
			reason = errorExitReason
		}
	}
	...
If the container has exited but the reason is empty, we invent an exit reason based on whether the exit code is non-zero.
I added some logs and found that if I increase the size, there is less chance of losing the oom event. The oom event really happened, but containerd failed to catch it.
The final values I read from memory.events are:
{Low:0x0, High:0x0, Max:0x1, OOM:0x0, OOMKill:0x0} c.id cad5a1e0473717dd873246764c857ecc7dbb3630516bba093c5700b246b6a282
@fuweid @mikebrow
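For reference, a small sketch (not the library's actual parser) of how the counters quoted above are laid out in cgroup v2 memory.events: the file is a list of "key value" lines whose keys (low, high, max, oom, oom_kill) correspond to the Event fields discussed earlier. The function and sample string here are illustrative.

package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents turns memory.events contents into a key/value map.
func parseMemoryEvents(contents string) map[string]uint64 {
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			out[fields[0]] = v
		}
	}
	return out
}

func main() {
	sample := "low 0\nhigh 0\nmax 1\noom 0\noom_kill 0\n"
	fmt.Println(parseMemoryEvents(sample)) // map[high:0 low:0 max:1 oom:0 oom_kill:0]
}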
Fixes containerd/containerd#12681 (CI failure): reduce the probability of losing oom events.