
Conversation

Contributor @ningmingxiao commented Dec 16, 2025

@fuweid @mikebrow
Fixes containerd/containerd#12681 (CI failure) by reducing the probability of losing OOM events.

Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>

func (c *Manager) EventChan() (<-chan Event, <-chan error) {
-	ec := make(chan Event, 1)
+	ec := make(chan Event, 16)
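For intuition, here is a minimal standalone sketch of why a larger buffer narrows the loss window when the sender refuses to block. This is not the actual containerd/cgroups producer, whose send path may differ; it only illustrates the buffered-channel behavior:

	package main

	import "fmt"

	// Event mirrors the shape of the memory.events payload; the
	// non-blocking producer below is a simplified stand-in for
	// illustration only.
	type Event struct {
		Low, High, Max, OOM, OOMKill uint64
	}

	func main() {
		ec := make(chan Event, 1) // try 16: no drops for this burst

		// If the consumer is behind, a non-blocking send drops the
		// event as soon as the buffer is full.
		dropped := 0
		for i := 0; i < 8; i++ {
			select {
			case ec <- Event{OOMKill: 1}:
			default:
				dropped++
			}
		}
		fmt.Println("dropped:", dropped) // 7 with buffer 1, 0 with buffer 16
	}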
Member commented
why 16..

Contributor (author) @ningmingxiao commented

It is an empirical value; I tested it many times. I'm not sure. @mikebrow

Contributor (author) @ningmingxiao commented

Or we can try:

	go func() {
		ec <- Event{
			Low:     out["low"],
			High:    out["high"],
			Max:     out["max"],
			OOM:     out["oom"],
			OOMKill: out["oom_kill"],
		}
	}()
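If we go that way, one thing to weigh (generic Go behavior, sketched below; not containerd code): a goroutine per event never blocks the reader loop, but the scheduler decides the send order, so events can arrive out of sequence:

	package main

	import (
		"fmt"
		"sync"
	)

	func main() {
		ec := make(chan int)
		var wg sync.WaitGroup

		// One goroutine per "event": the producer never blocks,
		// but nothing guarantees the order the sends land in.
		for i := 0; i < 5; i++ {
			wg.Add(1)
			go func(n int) {
				defer wg.Done()
				ec <- n
			}(i)
		}
		go func() { wg.Wait(); close(ec) }()

		for n := range ec {
			fmt.Println(n) // some permutation of 0..4
		}
	}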

Member @mikebrow commented Dec 16, 2025

@fuweid when you moved from unbuffered to buffered size 1 for this in #374, did you consider a buffer size > 1? Are there any issues to consider w.r.t. ordering of events, or closing the channel before it's emptied, etc.?

Member @fuweid commented Dec 16, 2025

I think the event loss is not caused on the channel side; it's a race condition on the shim side.

The shim sets GOMAXPROCS=4, and in the CI action critest runs 8 cases in parallel. So, ideally, it runs 8 pods at the same time on a node with 4 CPU cores. If the shim gets the exit event before we read the oom event, we can lose the event.

I am thinking we should drain the select-oom-event goroutine before sending the exit event on the shim side. Let me do some performance or density tests for that; I will update it later.
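A minimal sketch of that draining idea; watchOOM and publishExit are hypothetical placeholders, not the real shim API:

	package main

	import (
		"context"
		"fmt"
	)

	// watchOOM and publishExit stand in for the shim's OOM watcher
	// and exit-event publisher; both names are hypothetical.
	func watchOOM(ctx context.Context)    { fmt.Println("final memory.events read done") }
	func publishExit(ctx context.Context) { fmt.Println("task exit event sent") }

	func main() {
		ctx := context.Background()

		// oomDone is closed only after the watcher's final read, so
		// the exit event can never overtake a pending OOM event.
		oomDone := make(chan struct{})
		go func() {
			defer close(oomDone)
			watchOOM(ctx)
		}()

		// ... task process exits ...
		<-oomDone        // drain the OOM watcher first
		publishExit(ctx) // only then report the exit
	}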

Member commented

fair.. there may be another window where we get the task exit event on the CRI side first, then a request for status, and report it as exited with an error reason.. then receive the oom event and update the container's exit reason.. then if they ask again we give status with the OOM reason? The two events use up two checkpoints, both protected by the same container store global mutex; that lock plus the storage locks make the window for the racing oom test "tight".

Thinking we might want to check whether there is an exit reason queued up before reporting status while it's exiting.. same for the generateAndSendContainerEvent() call to the kubelet at the bottom of the task exit handler.. if we don't get the oom event first, we have a window where we report the task exit with no reason. Rough sketch below.
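Roughly what that check could look like; everything here is a sketch with made-up names (pendingOOMReason in particular is hypothetical), not the CRI plugin's actual code:

	package main

	import "fmt"

	// Constants and types below are stand-ins modeled on the CRI
	// plugin's shape; this is a sketch of the ordering check only.
	const (
		completeExitReason = "Completed"
		errorExitReason    = "Error"
	)

	type ContainerStatus struct {
		ID       string
		Reason   string
		ExitCode int32
	}

	// pendingOOMReason is hypothetical: it would look up an OOM
	// event that arrived for this container but has not been
	// applied to its stored status yet.
	func pendingOOMReason(id string) (string, bool) { return "OOMKilled", true }

	// exitReasonFor prefers a queued OOM reason over the invented
	// Completed/Error fallback, closing the reporting window.
	func exitReasonFor(s ContainerStatus) string {
		if s.Reason != "" {
			return s.Reason
		}
		if r, ok := pendingOOMReason(s.ID); ok {
			return r
		}
		if s.ExitCode == 0 {
			return completeExitReason
		}
		return errorExitReason
	}

	func main() {
		fmt.Println(exitReasonFor(ContainerStatus{ID: "c1", ExitCode: 137}))
	}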

Member @mikebrow commented Dec 16, 2025

worse.. when we get the container status:

func toCRIContainerStatus(ctx context.Context, container containerstore.Container, spec *runtime.ImageSpec, imageRef string) (*runtime.ContainerStatus, error) {
	meta := container.Metadata
	status := container.Status.Get()
	reason := status.Reason
	if status.State() == runtime.ContainerState_CONTAINER_EXITED && reason == "" {
		if status.ExitCode == 0 {
			reason = completeExitReason
		} else {
			reason = errorExitReason
		}
	}
...

if exited but the reason is empty.. we invent an exit reason: completeExitReason for exit code 0, errorExitReason for a non-zero exit code.

Contributor (author) @ningmingxiao commented Dec 16, 2025

I added some logging and found that if I increase the buffer size, there is less chance of losing the oom event. The oom event really happened, but containerd failed to catch it.

Contributor (author) @ningmingxiao commented Dec 16, 2025

The final values I read from memory.events are:

{Low:0x0, High:0x0, Max:0x1, OOM:0x0, OOMKill:0x0} c.id cad5a1e0473717dd873246764c857ecc7dbb3630516bba093c5700b246b6a282]
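For reference, cgroup v2 memory.events is a flat file of "key value" lines (low, high, max, oom, oom_kill), one counter per line. A small standalone reader, independent of the containerd code, could look like:

	package main

	import (
		"bufio"
		"fmt"
		"os"
		"strconv"
		"strings"
	)

	// readMemoryEvents parses a cgroup v2 memory.events file, whose
	// documented format is one "key value" pair per line, e.g.
	//
	//	low 0
	//	high 0
	//	max 1
	//	oom 0
	//	oom_kill 0
	func readMemoryEvents(path string) (map[string]uint64, error) {
		f, err := os.Open(path)
		if err != nil {
			return nil, err
		}
		defer f.Close()

		out := make(map[string]uint64)
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			fields := strings.Fields(sc.Text())
			if len(fields) != 2 {
				continue
			}
			v, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return nil, err
			}
			out[fields[0]] = v
		}
		return out, sc.Err()
	}

	func main() {
		// The path is an example; substitute the container's cgroup.
		ev, err := readMemoryEvents("/sys/fs/cgroup/mygroup/memory.events")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%#v\n", ev)
	}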
