The current implementation of `device_segmented_reduce` reaches only up to 30% bandwidth utilization on average for segment sizes between 1M and 100M. We could possibly optimize it with a two-phase segmented reduction, in which multiple blocks cooperate to reduce a single segment.
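
To make the two-phase idea concrete, here is a minimal sketch, not the CUB implementation and not the optimized version measured in this issue. It assumes `float` sums, a fixed number of blocks per segment, and at most 65535 segments (the `gridDim.y` limit). All names (`partial_reduce`, `final_reduce`, `two_phase_segmented_sum`, `kThreads`) are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block (hypothetical choice, power of two)

// Block-wide tree reduction in shared memory; every thread returns the block total.
__device__ float block_sum(float v) {
  __shared__ float smem[kThreads];
  smem[threadIdx.x] = v;
  __syncthreads();
  for (int s = kThreads / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
    __syncthreads();
  }
  return smem[0];
}

// Phase 1: grid = (blocks_per_segment, num_segments).
// Each block reduces a strided slice of one segment into one partial sum.
__global__ void partial_reduce(const float* in, const int* offsets, float* partials) {
  const int seg = blockIdx.y;
  const int begin = offsets[seg];
  const int end = offsets[seg + 1];
  float acc = 0.0f;
  for (int i = begin + blockIdx.x * kThreads + threadIdx.x; i < end;
       i += gridDim.x * kThreads) {
    acc += in[i];
  }
  acc = block_sum(acc);
  if (threadIdx.x == 0) partials[seg * gridDim.x + blockIdx.x] = acc;
}

// Phase 2: one block per segment combines that segment's partial sums.
__global__ void final_reduce(const float* partials, int blocks_per_segment, float* out) {
  const int seg = blockIdx.x;
  float acc = 0.0f;
  for (int i = threadIdx.x; i < blocks_per_segment; i += kThreads) {
    acc += partials[seg * blocks_per_segment + i];
  }
  acc = block_sum(acc);
  if (threadIdx.x == 0) out[seg] = acc;
}

// Host wrapper (hypothetical): offsets has num_segments + 1 entries.
std::vector<float> two_phase_segmented_sum(const std::vector<float>& data,
                                           const std::vector<int>& offsets,
                                           int blocks_per_segment) {
  const int num_segments = static_cast<int>(offsets.size()) - 1;
  float *d_in = nullptr, *d_partials = nullptr, *d_out = nullptr;
  int* d_offsets = nullptr;
  cudaMalloc(&d_in, data.size() * sizeof(float));
  cudaMalloc(&d_offsets, offsets.size() * sizeof(int));
  cudaMalloc(&d_partials, sizeof(float) * num_segments * blocks_per_segment);
  cudaMalloc(&d_out, sizeof(float) * num_segments);
  cudaMemcpy(d_in, data.data(), data.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int),
             cudaMemcpyHostToDevice);

  partial_reduce<<<dim3(blocks_per_segment, num_segments), kThreads>>>(
      d_in, d_offsets, d_partials);
  final_reduce<<<num_segments, kThreads>>>(d_partials, blocks_per_segment, d_out);

  std::vector<float> result(num_segments);
  // cudaMemcpy on the default stream waits for both kernels to finish.
  cudaMemcpy(result.data(), d_out, sizeof(float) * num_segments, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_offsets);
  cudaFree(d_partials);
  cudaFree(d_out);
  return result;
}
```

With one block per segment, a handful of very large segments cannot occupy all SMs. Splitting each segment across `blocks_per_segment` blocks exposes more parallelism, at the cost of a second, much smaller kernel launch and a temporary buffer for the partial sums. A real implementation would likely choose the split per segment size rather than use a fixed count.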
Bandwidth utilization of the current implementation on H100:

A quick optimization experiment shows promising results:
This leaves us with speedups of:
