
Tags: AIEdX/pytorch


ciflow/xpu/118613

Update on "[2/2] Intel GPU Runtime Upstreaming for Generator"

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/118528

Update on "[1/2] Intel GPU Runtime Upstreaming for Generator"

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/118523

Update on "Intel GPU Runtime Upstreaming for Guard"


# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), the fifth runtime component we would like to upstream is `Guard`. This PR covers both the device guard and the stream guard.

# Design
The device guard is mainly used by the op dispatcher in PyTorch. PyTorch already has a device guard abstraction, `c10::impl::DeviceGuardImplInterface`. In our design, we introduce an `XPUGuardImpl` class that inherits from `c10::impl::DeviceGuardImplInterface`, and register it with PyTorch after implementing the device-switch management mechanism in `XPUGuardImpl`. Besides, we introduce `XPUGuard`, `OptionalXPUGuard`, `XPUStreamGuard`, and `OptionalXPUStreamGuard`, all of which follow the design of their CUDA counterparts. The corresponding C++ files will be placed in the `c10/xpu` folder.
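For intuition, a device guard saves the current device, switches to the target device, and restores the original device when it goes out of scope, even on error. A minimal Python sketch of this RAII-style pattern (illustrative only; the real guards are C++ classes in `c10/xpu`, and the helper names below are hypothetical):

```python
from contextlib import contextmanager

_current_device = 0  # hypothetical stand-in for the runtime's device state

def set_device(index: int) -> None:
    global _current_device
    _current_device = index

@contextmanager
def device_guard(index: int):
    """Switch to device `index`, then restore the previous device on exit."""
    previous = _current_device
    set_device(index)
    try:
        yield
    finally:
        set_device(previous)  # restored even if the guarded code raises

with device_guard(1):
    pass  # ops dispatched here would run on device 1
# the original device is restored here
```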

# Additional Context
No `Guard` code needs to be added to the PyTorch frontend.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/118091

Update on "Intel GPU Runtime Upstreaming for Device Allocator"


# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](pytorch#116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we will prepare to generalize the `Allocator` in parallel.

# Design
In the current design, XPU uses an `XPUAllocator` class that inherits from `c10::Allocator`. `XPUAllocator` is a manager that handles `DeviceCachingAllocator`, a per-device implementation of the caching mechanism that manages both already-cached and newly allocated memory. The caching mechanism is similar to that of other backends, such as CUDA. The design is visualized below.
![XPUAllocator design diagram](https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218)
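As a rough illustration of the caching mechanism, each per-device allocator keeps freed blocks in a pool and reuses them before asking the device runtime for new memory, while the manager routes each request to the right device. A minimal Python sketch (illustrative only; the real `XPUAllocator` and `DeviceCachingAllocator` are C++, and the methods below are hypothetical):

```python
from collections import defaultdict

class DeviceCachingAllocator:
    """Per-device cache: reuse freed blocks before allocating new ones."""

    def __init__(self, device_index: int):
        self.device_index = device_index
        self.free_blocks = defaultdict(list)  # block size -> cached blocks

    def allocate(self, size: int):
        if self.free_blocks[size]:
            return self.free_blocks[size].pop()  # reuse a cached block
        return bytearray(size)  # stand-in for a real device allocation

    def free(self, block, size: int) -> None:
        self.free_blocks[size].append(block)  # cache instead of releasing


class XPUAllocator:
    """Manager routing requests to one DeviceCachingAllocator per device."""

    def __init__(self, device_count: int):
        self.per_device = [DeviceCachingAllocator(i) for i in range(device_count)]

    def allocate(self, device_index: int, size: int):
        return self.per_device[device_index].allocate(size)
```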

# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU; a second PR will cover the host `Allocator`.
Beyond these PRs, we plan to make the device `Allocator` device-agnostic in another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features in the subsequent PR that generalizes the `Allocator`.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/117734

Update on "Intel GPU Runtime Upstreaming for Event"


# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation being executed. In some circumstances, `Event` gives us fine-grained control over operation execution.

# Design
`XPUEvent` is a movable but not copyable wrapper around a SYCL event. It is created lazily on an XPU device when recording an `XPUStream`. An `XPUEvent` can wait for another `XPUEvent`, or for all the kernels submitted to an `XPUStream` to complete. Aligning with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For the frontend, the `XPUEvent` runtime API will be bound to Python as `torch.xpu.Event`; the corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`.
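A hypothetical usage sketch, assuming an XPU-enabled build and that the `torch.xpu.Event` API mirrors `torch.cuda.Event` (`record`/`wait`/`synchronize`; `elapsed_time` is deliberately not used, per the note below):

```python
import torch

stream = torch.xpu.current_stream()
event = torch.xpu.Event()

event.record(stream)  # mark a point in the stream's queue of submitted work
event.wait()          # make the current stream wait for the event
event.synchronize()   # block the host until the recorded work completes
```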

# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`; we will add support for it soon. `XPUEvent` also doesn't support IPC across processes. Other than that, we have almost a 1:1 mapping with CUDA.


cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/117619

Update on "[2/2] Intel GPU Runtime Upstreaming for Stream"


# Motivation
Following [[1/2] Intel GPU Runtime Upstreaming for Stream](pytorch#117611), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), this second PR covers the changes to the Python frontend.

# Design
This PR primarily offers stream-related APIs (see the usage sketch after the list), including
 - `torch.xpu.StreamContext`
 - `torch.xpu.current_stream`
 - `torch.xpu.set_stream`
 - `torch.xpu.synchronize`
 - `torch._C._xpu_getCurrentRawStream`
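A minimal usage sketch of these APIs (assuming an XPU-enabled build with at least one XPU device):

```python
import torch

s = torch.xpu.current_stream()    # the stream the current device submits to

with torch.xpu.StreamContext(s):  # run a region of work on a chosen stream
    pass                          # kernels launched here go to `s`

torch.xpu.set_stream(s)           # or switch the current stream directly
torch.xpu.synchronize()           # block until all streams on the device finish
```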

# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which is related to `Event`.


cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/117611

Update on "[1/2] Intel GPU Runtime Upstreaming for Stream"


# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), the second runtime component we would like to upstream is `Stream`, which contains the stream management functions of Intel GPU's runtime. To facilitate code review, we split the changes into 2 PRs. This is the first of the two and covers the changes under `c10`.

# Design
An Intel GPU stream is a wrapper around a SYCL queue, which schedules kernels on a SYCL device. In our design, we maintain a pool of 32 SYCL queues per device; when a stream is requested, one of these queues is returned in round-robin order (see the sketch after the API list below). The corresponding C++ files related to `Stream` will be placed in the `c10/xpu` folder. We provide the `c10::xpu::XPUStream` APIs, such as
 - `XPUStream getStreamFromPool`
 - `XPUStream getCurrentXPUStream`
 - `void setCurrentXPUStream`
 - `void device_synchronize`
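A rough sketch of the round-robin pool mechanism (illustrative only; the real implementation is C++ in `c10/xpu`, and the names below are hypothetical):

```python
import itertools

QUEUES_PER_DEVICE = 32  # pool size per device, matching the design above

class StreamPool:
    """Per-device pool handing out pre-created queues in round-robin order."""

    def __init__(self, device_index: int):
        # Stand-ins for the 32 pre-created SYCL queues on this device.
        self.queues = [f"queue-{device_index}-{i}" for i in range(QUEUES_PER_DEVICE)]
        self._next = itertools.cycle(range(QUEUES_PER_DEVICE))

    def get_stream_from_pool(self) -> str:
        return self.queues[next(self._next)]  # round-robin selection

pool = StreamPool(device_index=0)
assert pool.get_stream_from_pool() == "queue-0-0"
assert pool.get_stream_from_pool() == "queue-0-1"
```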

# Additional Context
In our plan, 2 PRs will be submitted to PyTorch for `Stream`:
1. the `c10` changes (this PR);
2. the Python frontend changes.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/116869

Update on "[4/4] Intel GPU Runtime Upstreaming for Device"


# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](pytorch#116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), this last PR covers lazy initialization.

# Design
This PR primarily offers support for multi-processing via lazy initialization: we initialize the runtime lazily, deferring XPU initialization until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend, and change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.
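For intuition, a device-agnostic lazy-init gate initializes a backend at most once, on the first access. A minimal Python sketch (illustrative only; the real `device_lazy_init` lives in the C++ layer, and `_backend_init` below is a hypothetical hook):

```python
import threading

_initialized: dict[str, bool] = {}  # device type ("cuda", "xpu", ...) -> done?
_lock = threading.Lock()

def device_lazy_init(device_type: str) -> None:
    """Initialize the given backend exactly once, on first use."""
    with _lock:
        if not _initialized.get(device_type):
            _backend_init(device_type)  # hypothetical backend-specific init
            _initialized[device_type] = True

def maybe_initialize_device(device_type: str) -> None:
    # Called on first access to a device, e.g. before creating a tensor on it.
    device_lazy_init(device_type)

def _backend_init(device_type: str) -> None:
    print(f"initializing {device_type} runtime")
```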

# Additional Context
We adopt a design similar to CUDA's, so we share some code with CUDA.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/xpu/116850

Update on "[3/4] Intel GPU Runtime Upstreaming for Device"


# Motivation
Following [[1/4] Intel GPU Runtime Upstreaming for Device](pytorch#116019), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](pytorch#114842), this third PR covers the changes under `libtorch_python`.

# Design
This PR primarily offers device-related APIs in the Python frontend (see the usage sketch after the list), including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu._DeviceGuard`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
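A minimal usage sketch of the listed APIs (assuming an XPU-enabled build with at least one XPU device):

```python
import torch

if torch.xpu.is_available():
    print("devices:", torch.xpu.device_count())
    print("current:", torch.xpu.current_device())
    print("name:", torch.xpu.get_device_name(0))
    print("capability:", torch.xpu.get_device_capability(0))
    print("properties:", torch.xpu.get_device_properties(0))

    torch.xpu.set_device(0)    # switch the current device

    with torch.xpu.device(0):  # or switch temporarily via a context manager
        pass                   # work here runs on device 0
```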

# Additional Context
We will implement support for lazy initialization in the next PR.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

[ghstack-poisoned]

ciflow/trunk/118823

fix dup import