increase chan size to fix oom event lost #386
base: main
Conversation
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
 func (c *Manager) EventChan() (<-chan Event, <-chan error) {
-	ec := make(chan Event, 1)
+	ec := make(chan Event, 16)
why 16..
It is an empirical value; I tested it many times. I'm not sure. @mikebrow
Or we can try:
go func() {
	ec <- Event{
		Low:     out["low"],
		High:    out["high"],
		Max:     out["max"],
		OOM:     out["oom"],
		OOMKill: out["oom_kill"],
	}
}()
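For context, here is a minimal, self-contained sketch of the pattern suggested above: sending from a separate goroutine so the memory.events poller is never blocked by a slow consumer. The Event field names and the map keys come from the snippet in this thread; everything else (sendAsync, the main function) is illustrative and not the code under review.

package main

import "fmt"

// Event mirrors the fields used in the snippet above.
type Event struct {
	Low, High, Max, OOM, OOMKill uint64
}

// sendAsync sends the event from its own goroutine, so the caller that is
// polling memory.events is never blocked by a slow consumer. The tradeoff is
// that events may be delivered out of order, and goroutines pile up if the
// consumer never reads.
func sendAsync(ec chan<- Event, out map[string]uint64) {
	go func() {
		ec <- Event{
			Low:     out["low"],
			High:    out["high"],
			Max:     out["max"],
			OOM:     out["oom"],
			OOMKill: out["oom_kill"],
		}
	}()
}

func main() {
	ec := make(chan Event, 16) // buffered, as in the diff above
	sendAsync(ec, map[string]uint64{"oom": 1, "oom_kill": 1})
	fmt.Printf("%+v\n", <-ec)
}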
I think the event loss is not caused by the channel size; it's a race condition on the shim side.
The shim sets GOMAXPROCS=4, and in the CI action critest runs 8 cases in parallel. So, ideally, it will run 8 pods at the same time on a 4-CPU-core node. If the shim gets the exit event before we read the oom event, we can lose the event.
I am thinking we should drain the select-oom-event goroutine before sending the exit event on the shim side. Let me do some performance or density tests for that. I will update it later.
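A minimal sketch of the "drain before exit" idea described above: wait for the oom-event goroutine to flush whatever it has already read from memory.events before publishing the task exit event. The names here (oomEvents, the publish step) are hypothetical stand-ins, not the shim's real API.

package main

import (
	"fmt"
	"sync"
)

func main() {
	oomEvents := make(chan string, 16)
	var wg sync.WaitGroup

	// oom-monitoring goroutine: reads events and forwards them.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for ev := range oomEvents {
			fmt.Println("forward oom event:", ev)
		}
	}()

	// Simulated oom event observed just before the task exits.
	oomEvents <- "oom_kill=1"

	// Drain: close the source and wait for the goroutine to finish
	// forwarding before the exit event goes out, so the oom event
	// cannot be lost behind the exit.
	close(oomEvents)
	wg.Wait()
	fmt.Println("publish task exit event")
}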
fair.. there may be another window where we get the task exit event on the CRI side first, then a status request, and report the container as exited with the error reason.. then we receive the oom event and update the container exit reason.. then if they ask again we give status with the oom reason? The two events use two checkpoints, both protected by the same container store global mutex, and because of this lock and the storage locks the window for the racing oom test is "tight".
Thinking we might want to check whether there is an exit reason queued up before reporting status while it's exiting.. same for the generateAndSendContainerEvent() call to the kubelet at the bottom of the task exit handler.. if we don't get the oom event first, we have a window where we report the task exit with no reason.
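As a rough illustration of the "check for a queued reason before inventing one" idea in the comment above, here is a sketch. The Status fields and reason strings echo the toCRIContainerStatus snippet quoted below, but the pendingOOM flag and the exitReason helper are hypothetical, not the CRI plugin's actual code.

package main

import "fmt"

// Status is a stand-in for the container status record discussed here.
type Status struct {
	ExitCode int32
	Reason   string
}

func exitReason(s Status, pendingOOM bool) string {
	if s.Reason != "" {
		return s.Reason // a real reason is already recorded
	}
	if pendingOOM {
		return "OOMKilled" // an oom event is queued but not yet applied
	}
	if s.ExitCode == 0 {
		return "Completed"
	}
	return "Error"
}

func main() {
	fmt.Println(exitReason(Status{ExitCode: 137}, true))  // OOMKilled
	fmt.Println(exitReason(Status{ExitCode: 137}, false)) // Error
}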
worse.. when we get the container status:
func toCRIContainerStatus(ctx context.Context, container containerstore.Container, spec *runtime.ImageSpec, imageRef string) (*runtime.ContainerStatus, error) {
	meta := container.Metadata
	status := container.Status.Get()
	reason := status.Reason
	if status.State() == runtime.ContainerState_CONTAINER_EXITED && reason == "" {
		if status.ExitCode == 0 {
			reason = completeExitReason
		} else {
			reason = errorExitReason
		}
	}
	...
If the container has exited but the reason is empty, we invent an exit reason based on whether the exit code is non-zero.
I added some logs and found that if I increase the size, there is less chance of losing the oom event. The oom event really happened, but containerd failed to catch it.
The final values I read from memory.events are:
{Low:0x0, High:0x0, Max:0x1, OOM:0x0, OOMKill:0x0} c.id cad5a1e0473717dd873246764c857ecc7dbb3630516bba093c5700b246b6a282
@fuweid @mikebrow
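For reference, a small sketch (not the library's actual parser) of how the counters quoted above are laid out in cgroup v2 memory.events: the file is a list of "key value" lines whose keys (low, high, max, oom, oom_kill) correspond to the Event fields discussed earlier. The function and sample string here are illustrative.

package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemoryEvents turns memory.events contents into a key/value map.
func parseMemoryEvents(contents string) map[string]uint64 {
	out := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(contents))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			out[fields[0]] = v
		}
	}
	return out
}

func main() {
	sample := "low 0\nhigh 0\nmax 1\noom 0\noom_kill 0\n"
	fmt.Println(parseMemoryEvents(sample)) // map[high:0 low:0 max:1 oom:0 oom_kill:0]
}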
Fixes containerd/containerd#12681 (CI failure): reduce the probability of losing oom events.