Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, InfiniBand adapters and other devices through the device plugin framework. However, configuring and managing nodes with these hardware resources requires configuring multiple software components such as drivers, container runtimes and other libraries, which is difficult and error-prone. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs. These components include the NVIDIA drivers (to enable CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling, DCGM-based monitoring and others.
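As a brief illustration of how workloads consume GPUs once these components are in place, the sketch below requests a GPU through the `nvidia.com/gpu` extended resource advertised by the device plugin. The pod name, image tag and resource count are arbitrary examples, not prescribed values:

```shell
# Minimal sketch: a pod that requests one GPU via the extended resource
# advertised by the NVIDIA device plugin. Pod name and image tag are
# arbitrary examples; any CUDA-enabled image works.
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```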
The GPU Operator allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster. Instead of provisioning a special OS image for GPU nodes, administrators can use a standard OS image for both CPU and GPU nodes and rely on the GPU Operator to provision the required software components for GPUs.
Note that the GPU Operator is particularly useful for scenarios where the Kubernetes cluster needs to scale quickly - for example, provisioning additional GPU nodes in the cloud or on-prem and managing the lifecycle of the underlying software components. Since the GPU Operator runs everything as containers, including the NVIDIA drivers, administrators can easily swap components simply by starting or stopping containers.
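As a sketch of a typical installation (the exact chart options and versions are covered in the official documentation, so treat this as illustrative rather than authoritative), the operator is commonly deployed from its Helm chart:

```shell
# Illustrative Helm-based installation; consult the official documentation
# for the current chart version and configurable values.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Deploy the operator into its own namespace. Once running, it labels GPU
# nodes and rolls out the driver, container toolkit, device plugin and
# DCGM exporter as containers on those nodes.
helm install --wait gpu-operator \
  --namespace gpu-operator --create-namespace \
  nvidia/gpu-operator
```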
For information on platform support and getting started, visit the official documentation repository.
How to easily use GPUs on Kubernetes
Read the document on contributions. You can contribute by opening a pull request.
Please open an issue on the GitHub project for any questions. Your feedback is appreciated.
When the operator configures a node for `vm-vgpu` workloads, it now reports the license state directly through the Kubernetes API:
- Each vGPU node receives an annotation `nvidia.com/vgpu-license-statuses` that contains a JSON snapshot of the most recent `nvidia-smi vgpu -q` output, including per-device status and expiry timestamps (an illustrative example follows this list).
- The `ClusterPolicy` resource exposes a `Licensed` condition that summarizes the state of every vGPU node. It turns `False` if any device is unlicensed or nearing expiry, and `Unknown` if data from the node-status-exporter is missing.
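Purely as an illustrative sketch, the annotation payload might look like the snippet below. The field names here are hypothetical; the real snapshot mirrors whatever `nvidia-smi vgpu -q` reports on that node:

```json
[
  {"device": "GPU-00000000:3B:00.0", "status": "Licensed", "expiry": "2025-07-01T00:00:00Z"},
  {"device": "GPU-00000000:AF:00.0", "status": "Unlicensed", "expiry": null}
]
```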
You can inspect the node-level data with:
```shell
kubectl get node <name> -o jsonpath='{.metadata.annotations.nvidia\.com/vgpu-license-statuses}'
```

and the cluster-level summary through:

```shell
kubectl get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.conditions[?(@.type=="Licensed")]}'
```

This makes it easier for users and automation to diagnose misconfigured or expired licenses without shelling into the node.
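For automation, one straightforward pattern is to gate on the cluster-level condition and fall back to the per-node annotation only for troubleshooting. A minimal sketch, assuming the `gpu-cluster-policy` name used above:

```shell
#!/usr/bin/env sh
# Fail fast if the cluster-wide Licensed condition is not True, then dump
# the per-node license annotations to aid troubleshooting.
status=$(kubectl get clusterpolicy gpu-cluster-policy \
  -o jsonpath='{.status.conditions[?(@.type=="Licensed")].status}')

if [ "$status" != "True" ]; then
  echo "vGPU licensing problem detected (Licensed=${status:-missing})" >&2
  # Print each node name alongside its license-status annotation.
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.nvidia\.com/vgpu-license-statuses}{"\n"}{end}'
  exit 1
fi
```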
