Tags: AIEdX/pytorch
Update on "[2/2] Intel GPU Runtime Upstreaming for Generator" cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 [ghstack-poisoned]
Update on "[1/2] Intel GPU Runtime Upstreaming for Generator"
Update on "Intel GPU Runtime Upstreaming for Guard"

# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), the fifth runtime component we would like to upstream is `Guard`. This PR covers both the device guard and the stream guard.

# Design
The device guard is used mainly by the op dispatcher in PyTorch. PyTorch already provides a device guard abstraction, `c10::impl::DeviceGuardImplInterface`. In our design, we introduce an `XPUGuardImpl` class that inherits from `c10::impl::DeviceGuardImplInterface`, and register it with PyTorch once the device switch management mechanism is implemented in `XPUGuardImpl`. In addition, we introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`, all following the design of their CUDA counterparts. The corresponding C++ files are placed in the `c10/xpu/` folder.

# Additional Context
No `Guard` code needs to be added to the PyTorch frontend.
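The impl-plus-guard pattern described above can be sketched as follows. This is an illustrative Python reduction only, not the actual C10 code: a backend implements a small device-management interface, registers it, and a generic guard switches and restores the device through whichever impl is registered. All names here are hypothetical stand-ins for `c10::impl::DeviceGuardImplInterface`, `XPUGuardImpl`, and `XPUGuard`.

```python
class DeviceGuardImplInterface:
    """Toy stand-in for c10::impl::DeviceGuardImplInterface."""
    def get_device(self):
        raise NotImplementedError

    def set_device(self, index):
        raise NotImplementedError

    def exchange_device(self, index):
        """Switch to `index`, returning the previously current device."""
        prev = self.get_device()
        self.set_device(index)
        return prev


class XPUGuardImpl(DeviceGuardImplInterface):
    """Toy stand-in for the XPU device switch management mechanism."""
    def __init__(self):
        self._current = 0

    def get_device(self):
        return self._current

    def set_device(self, index):
        self._current = index


_registry = {}  # device type -> registered guard impl

def register_guard_impl(device_type, impl):
    _registry[device_type] = impl


class DeviceGuard:
    """Switch device on entry, restore on exit (RAII in the C++ original)."""
    def __init__(self, device_type, index):
        self._impl = _registry[device_type]
        self._index = index

    def __enter__(self):
        self._prev = self._impl.exchange_device(self._index)
        return self

    def __exit__(self, *exc):
        self._impl.set_device(self._prev)
        return False
```

The key property, as in the C++ design, is that the generic guard never touches backend state directly; it only goes through the registered impl.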
Update on "Intel GPU Runtime Upstreaming for Device Allocator"

# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](pytorch#116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU, and, following our design, prepare to generalize `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class that inherits from `c10::Allocator`. `XPUAllocator` is a manager for `DeviceCachingAllocator`, a per-device implementation of the caching mechanism that manages already cached or newly allocated memory. The caching mechanism is similar to that of other backends, such as CUDA. The design can be visualized as below.
<p align="center"> <img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218"> </p>

# Additional Context
We are implementing our design gradually. This PR covers the device `Allocator` dedicated to XPU; the second PR will cover the host `Allocator`. Beyond these PRs, we plan to make the device `Allocator` device-agnostic in another PR. In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR that generalizes `Allocator`.
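The manager/per-device split described above can be sketched like this. This is a toy Python illustration only: the raw "allocation" is simulated with ids, and all names are hypothetical stand-ins for `XPUAllocator` and `DeviceCachingAllocator`; the real code manages actual device memory in C++.

```python
from collections import defaultdict

class DeviceCachingAllocator:
    """Per-device cache: freed blocks are kept and reused by size
    instead of being returned to the device."""
    def __init__(self):
        self._free = defaultdict(list)  # size -> cached block ids
        self._next_id = 0

    def malloc(self, size):
        if self._free[size]:            # reuse a cached block if possible
            return self._free[size].pop()
        self._next_id += 1              # otherwise "allocate" a fresh one
        return (self._next_id, size)

    def free(self, block):
        _, size = block
        self._free[size].append(block)  # cache instead of releasing


class XPUAllocator:
    """Manager that routes each request to its device's caching allocator."""
    def __init__(self, device_count):
        self._per_device = [DeviceCachingAllocator()
                            for _ in range(device_count)]

    def malloc(self, device, size):
        return self._per_device[device].malloc(size)

    def free(self, device, block):
        self._per_device[device].free(block)
```

Freeing then re-requesting the same size on the same device returns the cached block, which is the point of the caching mechanism: the expensive device allocation happens only on a cache miss.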
Update on "Intel GPU Runtime Upstreaming for Event"

# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation being executed. In some circumstances, `Event` allows fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It is created lazily on an XPU device when an `XPUStream` is first recorded. An `XPUEvent` can wait for another `XPUEvent`, or for all the kernels submitted to an `XPUStream` to complete. Aligned with the other backends, the C++ files related to `Event` are placed in the `aten/src/ATen/xpu` folder. On the frontend, the `XPUEvent` runtime API is bound to Python as `torch.xpu.Event`; the corresponding C++ code is placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. `XPUEvent` also does not support IPC across processes. For the other parts, we have almost a 1:1 mapping with CUDA.
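The lazy-creation and record/query semantics described above can be sketched with a toy simulation. This is an illustrative Python sketch only: `ToyStream` and `ToyEvent` are hypothetical stand-ins, with kernel completion faked by counters rather than a real SYCL event.

```python
class ToyStream:
    """Stand-in for XPUStream: tracks how many kernels were submitted
    and how many have completed."""
    def __init__(self, device=0):
        self.device = device
        self.submitted = 0
        self.completed = 0

    def launch(self):
        self.submitted += 1

    def synchronize(self):
        self.completed = self.submitted


class ToyEvent:
    """Stand-in for XPUEvent: the underlying event is created lazily,
    on the stream's device, at the first record()."""
    def __init__(self):
        self.device = None   # no underlying event until first record
        self._stream = None
        self._marker = 0

    def record(self, stream):
        if self.device is None:
            self.device = stream.device   # lazy creation on first record
        self._stream = stream
        self._marker = stream.submitted   # capture current stream position

    def query(self):
        """True once all work captured at record() has completed."""
        return (self._stream is None
                or self._stream.completed >= self._marker)
```

The sketch captures two properties from the design: an event has no device until it is first recorded, and querying it answers only for the work that had been submitted at record time.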
Update on "[2/2] Intel GPU Runtime Upstreaming for Stream"

# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](pytorch#117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), this second PR covers the changes under the Python frontend.

# Design
This PR primarily offers stream-related APIs, including:
- `torch.xpu.StreamContext`
- `torch.xpu.current_stream`
- `torch.xpu.set_stream`
- `torch.xpu.synchronize`
- `torch._C._xpu_getCurrentRawStream`

# Additional Context
We will implement functions such as `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which is related to `Event`.
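The intended behavior of a stream context can be sketched as below. This is an illustrative stand-in only, not the real `torch.xpu.StreamContext`: the module-level current-stream state and the string "streams" are hypothetical, but the enter/exit contract (swap the current stream in, restore the previous one on exit) mirrors the API's purpose.

```python
_current_stream = "default"   # fake module-level current-stream state

def current_stream():
    return _current_stream

def set_stream(stream):
    global _current_stream
    _current_stream = stream


class StreamContext:
    """Select `stream` as current on entry; restore the previous
    current stream on exit, even if an exception is raised."""
    def __init__(self, stream):
        self.stream = stream

    def __enter__(self):
        self.prev = current_stream()
        set_stream(self.stream)
        return self.stream

    def __exit__(self, *exc):
        set_stream(self.prev)
        return False
```

Usage follows the familiar `with` pattern: work issued inside the block targets the selected stream, and the previous stream becomes current again afterwards.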
Update on "[1/2] Intel GPU Runtime Upstreaming for Stream"

# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), the second runtime component we would like to upstream is `Stream`, which contains the stream management functions of Intel GPU's runtime. To facilitate code review, we split the changes into 2 PRs. This is one of the 2 PRs and covers the changes under `c10`.

# Design
An Intel GPU stream is a wrapper around a SYCL queue, which schedules kernels on a SYCL device. In our design, we maintain a SYCL queue pool containing 32 queues per device; when a queue is requested, one of these queues is returned round-robin. The corresponding C++ files related to `Stream` are placed in the `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, including:
- `XPUStream getStreamFromPool`
- `XPUStream getCurrentXPUStream`
- `void setCurrentXPUStream`
- `void device_synchronize`

# Additional Context
In our plan, 2 PRs will be submitted to PyTorch for `Stream`:
1. c10
2. Python frontend
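The per-device round-robin pool described above can be sketched as follows. This is an illustrative Python reduction only: real `XPUStream`s wrap SYCL queues, while here the "queues" are plain strings and all names are hypothetical.

```python
POOL_SIZE = 32  # the design keeps 32 queues per device

class StreamPool:
    """One pool of POOL_SIZE queue slots per device, handed out
    round-robin on each request."""
    def __init__(self, device_count):
        self._pools = [[f"dev{d}-q{i}" for i in range(POOL_SIZE)]
                       for d in range(device_count)]
        self._next = [0] * device_count   # per-device round-robin cursor

    def get_stream_from_pool(self, device):
        i = self._next[device]
        self._next[device] = (i + 1) % POOL_SIZE   # advance, wrapping at 32
        return self._pools[device][i]
```

Because the cursor wraps modulo the pool size, the 33rd request on a device returns the same queue as the 1st; queues are reused rather than created per request.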
Update on "[4/4] Intel GPU Runtime Upstreaming for Device"

# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](pytorch#116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), this last PR covers the changes for lazy initialization.

# Design
This PR primarily offers multi-processing support via lazy initialization: the runtime is initialized lazily, so XPU is not initialized until it is first accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.

# Additional Context
We adopt a design similar to CUDA's, so some code is shared with CUDA.
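The device-agnostic lazy-initialization idea can be sketched like this. This is an illustrative stand-in only, in the spirit of `device_lazy_init` / `maybe_initialize_device`; the registration function and backend names are hypothetical.

```python
_initialized = set()   # backends whose init has already run
_init_fns = {}         # backend name -> init callable

def register_backend(name, init_fn):
    """Register a backend's (hypothetical) init routine."""
    _init_fns[name] = init_fn

def maybe_initialize_device(name):
    """Run the backend's init exactly once, on first access.
    Subsequent calls are no-ops, so any API entry point can call
    this without worrying about double initialization."""
    if name not in _initialized:
        _init_fns[name]()
        _initialized.add(name)
```

Being device-agnostic here simply means the dispatch is keyed by backend name: the same helper serves CUDA, XPU, or any future backend that registers an init routine.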
Update on "[3/4] Intel GPU Runtime Upstreaming for Device"

# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](pytorch#116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend, including:
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu._DeviceGuard`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`

# Additional Context
We will implement support for lazy initialization in the next PR.
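The intended semantics of a few of the APIs above can be sketched with toy stand-ins. This is illustrative only: the device inventory is faked with a list, and all names and property values are hypothetical, not the real `torch.xpu` implementation.

```python
# Fake two-device inventory; real code queries the XPU runtime.
_DEVICES = [{"name": "Intel GPU 0"}, {"name": "Intel GPU 1"}]
_current = 0   # fake current-device state

def is_available():
    """True if at least one device is present."""
    return device_count() > 0

def device_count():
    return len(_DEVICES)

def current_device():
    return _current

def set_device(index):
    """Make `index` the current device, validating the index."""
    global _current
    if not 0 <= index < device_count():
        raise ValueError(f"invalid device index {index}")
    _current = index

def get_device_name(index=None):
    """Name of `index`, defaulting to the current device."""
    i = index if index is not None else _current
    return _DEVICES[i]["name"]
```

The defaulting behavior of `get_device_name` (fall back to the current device when no index is given) follows the convention of the CUDA counterparts these APIs mirror.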