Improve performance of Device Segmented Reduce on large and very large segment sizes #6865

The current implementation of device_segmented_reduce reaches only up to 30% bandwidth utilization on average for segment sizes between 1M and 100M. We could possibly optimize it with a two-phase segmented reduction that uses multiple blocks to reduce a single segment.
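To make the two-phase idea concrete, here is a minimal host-side C++ sketch. It is not the CUB implementation and not the optimization measured in this issue; it only models the decomposition. The function name `two_phase_segmented_sum`, the `chunk_size` parameter, and the use of an integer sum are assumptions made for illustration. Each (segment, chunk) pair stands in for one thread block in phase 1, and phase 2 combines the per-chunk partials of each segment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model of a two-phase segmented sum.
// `offsets` has num_segments + 1 entries; segment s covers
// data[offsets[s]] .. data[offsets[s + 1] - 1].
std::vector<std::int64_t> two_phase_segmented_sum(
    const std::vector<std::int32_t>& data,
    const std::vector<std::size_t>& offsets,
    std::size_t chunk_size)  // items handled by one simulated block (assumption)
{
  assert(chunk_size > 0 && !offsets.empty());
  const std::size_t num_segments = offsets.size() - 1;

  // Decide how many chunks ("blocks") each segment gets and where its
  // partial results live in the temporary buffer.
  std::vector<std::size_t> partial_begin(num_segments + 1, 0);
  for (std::size_t s = 0; s < num_segments; ++s) {
    const std::size_t len = offsets[s + 1] - offsets[s];
    const std::size_t chunks = (len + chunk_size - 1) / chunk_size;
    partial_begin[s + 1] = partial_begin[s] + chunks;
  }

  // Phase 1: every chunk reduces its slice of the segment independently.
  // On a GPU, each iteration of the inner loop would be one thread block.
  std::vector<std::int64_t> partials(partial_begin.back(), 0);
  for (std::size_t s = 0; s < num_segments; ++s) {
    const std::size_t chunks = partial_begin[s + 1] - partial_begin[s];
    for (std::size_t c = 0; c < chunks; ++c) {
      const std::size_t begin = offsets[s] + c * chunk_size;
      const std::size_t end = std::min(begin + chunk_size, offsets[s + 1]);
      std::int64_t acc = 0;
      for (std::size_t i = begin; i < end; ++i) {
        acc += data[i];
      }
      partials[partial_begin[s] + c] = acc;
    }
  }

  // Phase 2: combine the partials of each segment into its final result.
  // Empty segments have no partials and yield 0.
  std::vector<std::int64_t> result(num_segments, 0);
  for (std::size_t s = 0; s < num_segments; ++s) {
    for (std::size_t p = partial_begin[s]; p < partial_begin[s + 1]; ++p) {
      result[s] += partials[p];
    }
  }
  return result;
}
```

In a device version, the number of chunks per segment would presumably be chosen so that a single large segment is spread over enough blocks to saturate memory bandwidth, while small segments keep a single block and skip the second phase.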

Bandwidth utilization of current implementation on H100:
<img width="1200" height="800" alt="Image" src="https://github.com/user-attachments/assets/f7e4b031-ce3f-4197-a012-8990096eb069" />

Some quick optimizations show promising results:

<img width="1200" height="800" alt="Image" src="https://github.com/user-attachments/assets/b8b0ed3b-f6f6-4d1e-9480-201cbc68cda4" />

This yields the following speedups:

<img width="1200" height="800" alt="Image" src="https://github.com/user-attachments/assets/17ffb214-78f7-4bd7-8ba1-1371c146dad9" />


