Improve performance of Device Segmented Reduce on large and very large segment sizes #6865

@srinivasyadav18

Description

The current implementation of device_segmented_reduce reaches only up to 30% bandwidth utilization on average for segment sizes between 1M and 100M. We could possibly optimize it with a two-phase segmented reduction that uses multiple blocks to reduce a single segment.
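
To make the two-phase idea concrete, here is a minimal CPU-side sketch of the decomposition. It is not the CUB implementation and not the optimization measured in this issue. The names `two_phase_segmented_sum` and `chunk_size` are hypothetical, and the offsets layout (`num_segments + 1` entries) is an assumption made for illustration. Each chunk in phase 1 stands in for the work one thread block would do on the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sums each segment of `data` in two phases.
// `offsets` holds num_segments + 1 entries; segment s spans
// [offsets[s], offsets[s + 1]). `chunk_size` (assumed > 0) models the
// number of elements a single block would reduce in phase 1.
std::vector<long long> two_phase_segmented_sum(
    const std::vector<int>& data,
    const std::vector<std::size_t>& offsets,
    std::size_t chunk_size)
{
  if (offsets.size() < 2 || chunk_size == 0) return {};
  const std::size_t num_segments = offsets.size() - 1;

  // Work out how many partial results each segment needs, so that a large
  // segment is split across several chunks ("blocks").
  std::vector<std::size_t> partial_offsets(num_segments + 1, 0);
  for (std::size_t s = 0; s < num_segments; ++s) {
    const std::size_t len = offsets[s + 1] - offsets[s];
    const std::size_t chunks = (len + chunk_size - 1) / chunk_size;
    partial_offsets[s + 1] = partial_offsets[s] + chunks;
  }

  // Phase 1: every (segment, chunk) pair is reduced independently.
  // On a GPU these iterations would run as separate thread blocks.
  std::vector<long long> partials(partial_offsets.back(), 0);
  for (std::size_t s = 0; s < num_segments; ++s) {
    const std::size_t chunks = partial_offsets[s + 1] - partial_offsets[s];
    for (std::size_t c = 0; c < chunks; ++c) {
      const std::size_t begin = offsets[s] + c * chunk_size;
      const std::size_t end = std::min(begin + chunk_size, offsets[s + 1]);
      partials[partial_offsets[s] + c] = std::accumulate(
          data.begin() + static_cast<std::ptrdiff_t>(begin),
          data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
    }
  }

  // Phase 2: combine the partials that belong to each segment.
  std::vector<long long> result(num_segments, 0);
  for (std::size_t s = 0; s < num_segments; ++s) {
    result[s] = std::accumulate(
        partials.begin() + static_cast<std::ptrdiff_t>(partial_offsets[s]),
        partials.begin() + static_cast<std::ptrdiff_t>(partial_offsets[s + 1]),
        0LL);
  }
  return result;
}
```

For example, calling `two_phase_segmented_sum(data, {0, 3, 3, 10}, 4)` on the values 1 through 10 splits the last segment (7 elements) into two chunks before combining them. On a GPU, phase 2 would be a second, much smaller segmented reduction over the partials.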

Bandwidth utilization of current implementation on H100:
Image

A quick optimization shows promising results:

Image

This yields speedups of:

Image

Projects

Status

In Progress
