# Artifact for the Paper "Colocating ML Inference and Training with Fast GPU Memory Handover"
- Project Structure
- Hardware Requirements
- Build and Install
- Run and Evaluate
## Project Structure

```
$ tree --dirsfirst -L 2 .
├── client
├── cmake # CMake helper files
├── common # Common libraries for inference/training
├── environment # Docker and conda environment files
├── eval
│ ├── runner # Automatic evaluation runner
│ └── ... # Evaluation scripts for test cases
├── log # Running logs
├── proto # gRPC proto
├── pytorch # PyTorch plugin
├── scripts
├── server # Inference server
│ ├── models # Contains inference models
│ └── ...
├── train # PyTorch training scripts
├── third_party/mpool... # GPU memory pool
└── ...
```

## Hardware Requirements

- 4 x NVIDIA V100 (16GB)
- 1 x NVIDIA A100 (80GB)
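
To check which GPUs are visible on the machine, you can run, for example:

```
# Lists each GPU's model name and total memory.
nvidia-smi --query-gpu=name,memory.total --format=csv
```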
## Build and Install

### Option 1: Pull from Docker Hub
Pull the pre-built Docker images from Docker Hub. The script ./scripts/docker.sh is provided as a wrapper for Docker commands.
```
docker pull siriusinftra/sirius:latest
docker pull siriusinftra/triton-trt-um:latest  # Triton TensorRT UM backend
bash ./scripts/docker.sh
```

The project is located at /gpu-col within the Docker container. TVM and Triton models are pre-installed in this image.
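
The wrapper script is the supported entry point. If you prefer to start a container manually, a minimal sketch (the exact flags used by `./scripts/docker.sh` may differ) is:

```
# Hypothetical manual invocation; ./scripts/docker.sh encapsulates the supported flags.
docker run --gpus all --ipc=host -it siriusinftra/sirius:latest /bin/bash
# Inside the container, the project is located at /gpu-col.
```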
Before running the system, activate the conda environment (e.g., conda activate colserve).
To evaluate Sirius, refer to Run Benchmark and Artifact Evaluation for more details.
### Option 2: Build from Dockerfile
- Clone the repository and build the Docker image. The `build_docker.sh` script will clone dependencies into `inftra-docker-build`, which serves as the Docker build context.
- [Optional] Copy TVM and Triton models to `inftra-docker-build/tvm-models` and `inftra-docker-build/triton-models`, respectively. These will be copied into the Docker image.
```
git clone --recurse-submodules git@github.com:SiriusInfTra/Sirius.git gpu-col
bash ./gpu-col/scripts/build_docker.sh
```

- Build the Triton TensorRT UM Docker image.

```
bash ./gpu-col/scripts/build_triton_trt_um_docker.sh
```

Software Requirements: cmake>=3.24, gcc>=9.4, nvcc>=11.6, ninja
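
Before building, you can confirm the host toolchain meets these requirements:

```
cmake --version   # expect >= 3.24
gcc --version     # expect >= 9.4
nvcc --version    # expect >= 11.6
ninja --version
```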
Create Environment and Build System:
- Prepare a new conda environment, install Python packages, and then clone the repository.
```
conda create -n colserve python=3.12
conda activate colserve
conda install -y conda-forge::python-devtools nvitop conda-forge::c-ares
pip install -r environment/requirements.txt

export SIRIUS_HOME=/path/to/clone/repo
git clone --recurse-submodules git@github.com:SiriusInfTra/Sirius.git $SIRIUS_HOME
```

- Install `Boost>=1.80` by compiling from source (Boost installed via apt/conda might require a higher GCC version).
```
export BOOST_HOME=/path/to/install/boost
$SIRIUS_HOME/scripts/install_boost.sh $BOOST_HOME
```
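
As a sanity check (not part of the build scripts, and assuming headers were installed under `$BOOST_HOME/include`), you can confirm which Boost version was installed:

```
# Prints the Boost version macro, e.g. "1_80".
grep BOOST_LIB_VERSION $BOOST_HOME/include/boost/version.hpp
```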
- Clone and build TVM for inference, and PyTorch and TorchVision for training. Ensure the CUDA backend is enabled. Pay attention to the PyTorch `GLIBCXX_USE_CXX11_ABI` flag, which can cause ABI issues. To accelerate the build, set the `TORCH_CUDA_ARCH_LIST` flag to your GPU's compute capability (e.g., `TORCH_CUDA_ARCH_LIST=7.0` for V100).
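
For example, once PyTorch is installed in the `colserve` environment, you can check its ABI and CUDA settings so Sirius is built with a matching flag:

```
# True means PyTorch was built with the CXX11 ABI; also prints the CUDA version it targets.
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI, torch.version.cuda)"
```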
- Set the `TVM_HOME` environment variable. Verify by running `echo $TVM_HOME` and `echo $CONDA_PREFIX`. Then, configure CMake.
```
export TVM_HOME=/path/to/tvm
export TORCH_HOME=/path/to/pytorch
export BOOST_HOME=/path/to/boost
$SIRIUS_HOME/scripts/build_sirius.sh $SIRIUS_HOME $TVM_HOME $TORCH_HOME $BOOST_HOME
```

- [Only required for Triton UM+MPS] Set up the Triton TensorRT backend with Unified Memory support: clone and build the Triton TensorRT UM backend.
```
export TRITON_TRT_UM_HOME=/path/to/triton_tensorrt_um
export TRITON_TRT_INSTALL_HOME=/path/to/triton_tensorrt_um_install  # e.g., $SIRIUS_HOME/triton/tensorrt_um/install
bash $SIRIUS_HOME/scripts/build_triton_trt_um.sh $TRITON_TRT_UM_HOME $TRITON_TRT_INSTALL_HOME
```

- [Only required for LLM] Install vLLM by compiling from source: clone and build xFormer and vLLM.
```
export VLLM_HOME=/path/to/vllm
export XFORMER_HOME=/path/to/xformer
bash $SIRIUS_HOME/scripts/build_vllm.sh $VLLM_HOME $XFORMER_HOME
```

### TVM Models
Compile models using TVM (refer to `./util/prepare_model_store`). TVM models (i.e., `mod.json`, `mod.params`, and `mod.so`) are stored in `server/models`, as shown below.
```
server/models
├── densenet161-b1
├── distilbert_base-b1
├── distilgpt2-b1
├── efficientnet_v2_s-b1
├── efficientvit_b2-b1
└── resnet152-b1
```
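
For illustration only (the authoritative pipeline is `./util/prepare_model_store`), a single model could be compiled with TVM's command-line driver roughly as follows; the ONNX file name is a placeholder:

```
# Hypothetical sketch: compile an ONNX export with tvmc, targeting CUDA.
python -m tvm.driver.tvmc compile --target "cuda" \
    --output resnet152-b1.tar resnet152-b1.onnx
# The resulting tar holds mod.json, mod.params, and mod.so, which Sirius expects
# under server/models/<model>-b<batch>/.
mkdir -p server/models/resnet152-b1
tar -xf resnet152-b1.tar -C server/models/resnet152-b1/
```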
### Triton Models
Compile Triton models using TensorRT (refer to `./util/onnx`). Triton models are stored in `server/triton_models`. Each model has a directory containing the Triton compiled model (`model.plan` and `config.pbtxt`), as shown below.
```
server/triton_models
├── densenet161
├── distilbert_base
├── distilgpt2
├── efficientnet_v2_s
├── efficientvit_b2
├── resnet152
│   ├── 1
│   │   └── model.plan
│   └── config.pbtxt
└── config.conf
```
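
As an illustrative sketch only (the `./util/onnx` scripts are the supported path), a TensorRT engine for one model could be produced from its ONNX export with `trtexec`:

```
# Hypothetical example: build a TensorRT engine from an ONNX export of resnet152.
# Shapes and precision must match the model's config.pbtxt.
trtexec --onnx=resnet152.onnx \
        --saveEngine=server/triton_models/resnet152/1/model.plan
```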
`config.conf` is used to configure the memory usage (in MiB) for each model.
```
resnet152 = 345
distilgpt2 = 349
efficientvit_b2 = 143
efficientnet_v2_s = 114
densenet161 = 107
distilbert_base = 278
```

### LLM
Download the LLM weights (e.g., Llama 2 and Qwen2) from Hugging Face:
```
from transformers import AutoConfig, AutoModelForCausalLM

# Llama-2-13B (access to the gated meta-llama repository is required)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

# Qwen2-0.5B
config = AutoConfig.from_pretrained('Qwen/Qwen2-0.5B')
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2-0.5B', config=config)
```

## Run and Evaluate

The evaluation is fully automated by `./eval/runner`, which launches GPU MPS, Sirius's inference server, PyTorch training tasks, and inference workloads.
For example, to evaluate Sirius with the Light workload:
```
source ./scripts/set_cuda_device.sh 0
python eval/overall_v2.py --uniform-v2 --uniform-v2-wkld-types NormalLight \
    --sirius --skip-set-mps-pct
```

The evaluation results will be saved in a directory like `log/overall-uniform-v2-1gpu-YYYYMMDD-HHMM/colsys-NormalLight`.
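
For example, to take a quick look at the generated results (the timestamped directory name will differ):

```
ls log/overall-uniform-v2-1gpu-*/colsys-NormalLight
```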
Please refer to ./artifact-evaluation/README.md for more details on the artifact evaluation process.