Releases · NVIDIA/cloudai

07 Jan 15:38

amaslenn

v1.5.beta7

7b63c79

v1.5.beta7 Pre-release

What's Changed

M bridge Documentation by @srivatsankrishnan in #765
Remove hardcoded --distribution=arbitrary by @juntaowww in #766
M bridge updates by @srivatsankrishnan in #767

New Contributors

@juntaowww made their first contribution in #766

Full Changelog: v1.5.beta6...v1.5.beta7

Contributors

srivatsankrishnan and juntaowww

23 Dec 16:35

srivatsankrishnan

v1.5.beta6

99f9158

v1.5.beta6 Pre-release

What's Changed

Update codeowners by @srivatsankrishnan in #717
Aiconfig by @srivatsankrishnan in #760
Rula review by @RulaHallak in #761
Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
Add workload for OSU Micro Benchmark by @allkoow in #742
Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
Megatron Bridge in CloudAI by @srivatsankrishnan in #764

New Contributors

@allkoow made their first contribution in #742

Full Changelog: v1.5.beta5...v1.5.beta6

Contributors

amaslenn, srivatsankrishnan, and 2 other contributors

17 Dec 18:10

amaslenn

v1.5.beta5

b9ff078

v1.5.beta5 Pre-release

What's Changed

UCC add file generator by @yaeliyac in #747
Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
Fix empty table if not all results are available by @amaslenn in #753
Ensure reports order by @amaslenn in #754
Update documentation on Dynamo k8s multi node by @amaslenn in #749
Fix bokeh charts generation by @amaslenn in #755
Enhancements for Dynamo with k8s by @amaslenn in #752
Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
Describe global options for cloudai CLI by @amaslenn in #758

Full Changelog: v1.5.beta4...v1.5.beta5

Contributors

amaslenn and yaeliyac

10 Dec 16:08

amaslenn

v1.5.beta4

8e26c01

v1.5.beta4 Pre-release

What's Changed

Add new installable type: HF model by @amaslenn in #735
Add extra_srun_args on TestRun level by @amaslenn in #734
Dynamo pass/fail and slurm example by @amaslenn in #736
Add support for HF model in K8s by @amaslenn in #737
Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
Fine tune CodeRabbit reviews by @amaslenn in #740
Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
Generate reports in dry-run by @amaslenn in #741
Update documentation by @amaslenn in #743
Simplify Dynamo slurm configuration by @amaslenn in #745

Full Changelog: v1.5.beta3...v1.5.beta4

Contributors

amaslenn

03 Dec 16:01

amaslenn

v1.5.beta3

4e9c340

v1.5.beta3 Pre-release

What's Changed

Print scenario status table at the end of a run by @amaslenn in #730
Always set number of nodes for srun cmd by @amaslenn in #729
Convert base System into pydantic model by @amaslenn in #732
Add HF home dir property inside System model by @amaslenn in #733

Full Changelog: v1.5.beta2...v1.5.beta3

Contributors

amaslenn

26 Nov 09:33

amaslenn

v1.5.beta2

a48d097

v1.5.beta2 Pre-release

What's Changed

Fix NameError for K8s batch run by @amaslenn in #721
Add DDLB workload by @nsarka in #711
Updates for Dynamo over K8s by @amaslenn in #724
Fixed an issue where using dependencies could result in an infinite loop by @amaslenn in #725
Report results dir to users as early as possible by @amaslenn in #726
Configure AI code review tools by @amaslenn in #728
Kill and wait for ETCD process to be gone by @amaslenn in #727
DeepEP benchmark by @ybenvidia in #723

New Contributors

@nsarka made their first contribution in #711

Full Changelog: v1.5.beta1...v1.5.beta2

Contributors

amaslenn, nsarka, and ybenvidia

12 Nov 18:22

amaslenn

v1.5.beta1

453f185

v1.5.beta1 Pre-release

What's Changed

Remove DeepEP callback for llama4 by @aahouzi in #712
Run tests for several py versions by @amaslenn in #713
Bump fallback version to v1.5 and upgrade dependencies by @amaslenn in #714
Small enhancements by @amaslenn in #715
Simplify internal hierarchy of classes by @amaslenn in #716
Update documentation by @amaslenn in #718

Full Changelog: v1.4.0...v1.5.beta1

Contributors

amaslenn and aahouzi

28 Oct 13:38

amaslenn

v1.4.0

71dfd89

v1.4.0 Latest

Highlights

1. GB300 Support for Common Configs

CloudAI and example configurations now support GB300 systems, expanding hardware compatibility.

2. AI Dynamo with Kubernetes Support (Alpha)

Added Kubernetes SPCx support for AI Dynamo workload
Enhanced container orchestration capabilities

3. New Model and Workload Support

Qwen Recipe Support: Added comprehensive Qwen model recipe support
SGLang Backend: Added SGLang backend support for DeepSeekR1 model
NIXL KVBench: New NIXL KVBench workload with full Slurm integration

4. Advanced AI Dynamo Capabilities

Multi-worker GPU Slicing: Multi-worker-per-node GPU slicing with dynamic allocation
Explicit Node Assignment: Dedicated node assignment for prefill and decode workers
Shell Script Entry Point: Shell script-based entry point (replacing Python implementation)
Environment Validation: Environment validation during startup sequence
Error Handling: Error detection and retry mechanism for worker failures

5. Agent System Enhancements

Plugin System: Load agents from entrypoints for extensibility
Custom Reward Functions: Custom reward functions with latency & throughput metrics handling
Configurable Rewards: Configurable agent reward functions

6. Enhanced Documentation

Sphinx Framework: Sphinx-based documentation framework with GitHub Pages deployment
Comprehensive Workload Docs: Complete documentation for Bash, NCCL, UCC, and other workloads
Interactive Features: Copy-to-clipboard support for code snippets

7. CLI Modernization

Click Framework: Complete CLI re-implementation using Click framework (replacing argparse)
Improved Usability: Made --tests-dir optional for better user experience
Simplified Commands: Removed unnecessary install/uninstall command options

8. NIXL Workload Improvements

Per-rank Environment Variables: Enhanced per-rank environment variable support for NIXL Perftest
Multi-backend Support: Support for non-UCX backends with multiple backend options
Enhanced Output Parsing: Improved output parsing for noisy/multi-format output
Container Management: etcd now managed from NIXL container

9. Reporting & Metrics

NCCL Comparison Reports: NCCL comparison reports with latency metrics
Reusable Framework: Reusable comparison report framework for NIXL
Multi-section CSV: Multi-section CSV format handling for AI Dynamo
Configurable Reports: Configurable reports via scenario configuration
DSE Trajectory Support: DSE trajectory support for single-sbatch mode

10. Development & Maintenance

Environment Management: Added uv.lock for persistent environment management
Code Cleanup: Removed deprecated NemoLauncher-based configurations
Refactoring: NeMo recipes refactoring for better maintainability

11. Reliability & Error Handling

Output Directory Handling: Improved output directory error handling (permissions, read-only filesystem)
Graceful Error Handling: Graceful handling of missing tests with MissingTestError
Environment Preservation: Better environment variable preservation (order maintained from system schema)
Debug Logging: Enhanced debug logging for system config parsing errors

What's Changed (details)

Support custom matgen args and set valid ppn by @amaslenn in #612
Fix gres related directives for single sbatch mode by @amaslenn in #613
Preserve the order of environment variables specified in the system schema by @TaekyungHeo in #616
Update docker_image_url separator from colon to hash by @TaekyungHeo in #621
Bump default version to v1.4 by @amaslenn in #622
Update USER_GUIDE.md by @TaekyungHeo in #623
Update doc/ai_dynamo.md by @TaekyungHeo in #624
Update conf/common/test/nemo_run_llama3_8b.toml by @TaekyungHeo in #625
Replace PyTorch image tag from 24.02-py3 to 25.06-py3 in all conf TOML files by @TaekyungHeo in #627
Update doc/ai_dynamo.md by @TaekyungHeo in #628
Replace Nemo image tag from 24.12.rc3 to 25.04.rc2 in all conf TOML files by @TaekyungHeo in #626
Improve prepare_output_dir error handling for permissions and read-only fs by @TaekyungHeo in #629
Improve prepare_output_dir error handling for permissions and read-only fs (continued) by @TaekyungHeo in #631
Add GitRepo support to KubernetesInstaller with install/uninstall logic by @TaekyungHeo in #634
Use a shell script as the entry point for AI Dynamo by @TaekyungHeo in #615
Handle multi-section CSV format in AI Dynamo report generation by @TaekyungHeo in #620
Add multi-worker-per-node GPU slicing support with dynamic allocation by @TaekyungHeo in #636
Log mapping between AI Dynamo nodes and roles by @TaekyungHeo in #617
Updates for SlurmContainer workload by @amaslenn in #638
Handle missing tests gracefully by adding MissingTestError to avoid backtrace by @TaekyungHeo in #640
Clean up src/cloudai/workloads/ai_dynamo/ai_dynamo.sh by @TaekyungHeo in #639
Preserve installables' state during apply_params_set() by @amaslenn in #643
Control which env vars are dumped for per-rank evaluation by @amaslenn in #642
Align extra_env_vars definition in test and scenario by @amaslenn in #644
Update USER_GUIDE.md by @TaekyungHeo in #646
Add latency metric reporting for NCCL by @amaslenn in #645
Support for DeepSeekR1 model with SGLang / AI Dynamo by @TaekyungHeo in #641
Support mounting any JSON files for --dynamo-deepep-config by @TaekyungHeo in #650
Set tp-size and dp-size from args if provided, else use total_gpus by @TaekyungHeo in #649
Add environment validation to startup sequence by @TaekyungHeo in #651
Follow-up for PR641 (Support for DeepSeekR1 model with SGLang / AI Dynamo) by @TaekyungHeo in #653
Reorder the functions in ai_dynamo.sh for improved maintainability by @TaekyungHeo in #654
Refactor GPU count to use _gpus_per_node in vllm and env validation by @TaekyungHeo in #657
Mount huggingface_home_container_path unconditionally by @TaekyungHeo in #655
Refactor nodelist validation to check DYNAMO_NODELIST only if both args empty by @TaekyungHeo in #658
Comparison report for NCCL workloads by @amaslenn in #656
Support explicit node assignment for prefill and decode workers by @TaekyungHeo in #647
Configure reports via scenario config by @amaslenn in #661
Handle CancelledError gracefully during job cleanup by @TaekyungHeo in #662
Small housekeeping updates by @amaslenn in #663
nemo recipes refactor by @malay-nagda in #633
Re-use comparison report for NIXL by @amaslenn in #664
Handle single-sbatch metadata layout in report by @amaslenn in #666
Follow-up for PR647 (Support explicit node assignment for prefill and decode workers) by @TaekyungHeo in #665
Support two NIXL bench output formats by @amaslenn in #668
Add error detection and retry mechanism for worker failures by @TaekyungHeo in #659
Use single source of data for reporting and NIXL pass/fail by @amaslenn in #670
Write trajectory file for DSE jobs in single-sbatch mode by @amaslenn in #671
Update NIXL bench command generation logic by @amaslenn in #673
Make agent_reward_function configurable by @TaekyungHeo in #675
Add custom reward functions with latency & throughput metrics handling by @TaekyungHeo in #674
Auto install missing components for workloads in run mode by @amaslenn in #676
Fix step idx in single-sbatch trajectory by @amaslenn in #678
Get rid of strict validation for configs by @amaslenn in #680
Update Nemo image reference to nvcr.io#nvidia/nemo:25.07 in all toml files by @TaekyungHeo in #681
Update NIXL perftest workload by @amaslenn in #679
Increase time limit to 60...
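
The separator change in #621 means image references are written as registry, hash, then image path, as in the `nvcr.io#nvidia/nemo:25.07` value mentioned in #681. The sketch below only illustrates how such a string splits into its two parts; `split_image_url` is a hypothetical helper, not a function from CloudAI.

```python
def split_image_url(url: str) -> tuple[str, str]:
    """Split 'registry#image:tag' into (registry, image_with_tag).

    A URL without '#' is returned with an empty registry part.
    """
    registry, sep, image = url.partition("#")
    if not sep:
        return "", url
    return registry, image


if __name__ == "__main__":
    print(split_image_url("nvcr.io#nvidia/nemo:25.07"))
    # prints ('nvcr.io', 'nvidia/nemo:25.07')
```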

Contributors

amaslenn, TaekyungHeo, and 5 other contributors

16 Oct 16:11

amaslenn

v1.4.rc2

71dfd89

v1.4.rc2 Pre-release

What's Changed

Updated Python executable path in NIXL KVBench workloads by @Bohatchuk in #709
Update docs by @amaslenn in #710
ucc_perftest_add_gen_and_a2av by @yaeliyac in #692

New Contributors

@Bohatchuk made their first contribution in #709
@yaeliyac made their first contribution in #692

Full Changelog: v1.4.rc1...v1.4.rc2

Contributors

amaslenn, Bohatchuk, and yaeliyac

15 Oct 14:54

amaslenn

v1.4.rc1

207d489

v1.4.rc1 Pre-release

What's Changed

Add llama4 Recipe by @aahouzi in #708
Update NemoRun configs by @amaslenn in #706

Full Changelog: v1.4.beta25...v1.4.rc1

Contributors

amaslenn and aahouzi
