Releases: NVIDIA/cloudai

v1.5.beta7

07 Jan 15:38
7b63c79

v1.5.beta7 Pre-release

Full Changelog: v1.5.beta6...v1.5.beta7

v1.5.beta6

23 Dec 16:35
99f9158

v1.5.beta6 Pre-release

Full Changelog: v1.5.beta5...v1.5.beta6

v1.5.beta5

17 Dec 18:10
b9ff078

v1.5.beta5 Pre-release

Full Changelog: v1.5.beta4...v1.5.beta5

v1.5.beta4

10 Dec 16:08
8e26c01

v1.5.beta4 Pre-release

Full Changelog: v1.5.beta3...v1.5.beta4

v1.5.beta3

03 Dec 16:01
4e9c340

v1.5.beta3 Pre-release

Full Changelog: v1.5.beta2...v1.5.beta3

v1.5.beta2

26 Nov 09:33
a48d097

v1.5.beta2 Pre-release

Full Changelog: v1.5.beta1...v1.5.beta2

v1.5.beta1

12 Nov 18:22
453f185

v1.5.beta1 Pre-release

Full Changelog: v1.4.0...v1.5.beta1

v1.4.0

28 Oct 13:38
71dfd89

Highlights

1. GB300 Support for Common Configs

CloudAI and example configurations now support GB300 systems, expanding hardware compatibility.

2. AI Dynamo with Kubernetes Support (Alpha)

  • Added Kubernetes SPCx support for the AI Dynamo workload
  • Enhanced container orchestration capabilities

3. New Model and Workload Support

  • Qwen Recipe Support: Added comprehensive Qwen model recipe support
  • SGLang Backend: Added SGLang backend support for DeepSeekR1 model
  • NIXL KVBench: New NIXL KVBench workload with full Slurm integration

4. Advanced AI Dynamo Capabilities

  • Multi-worker GPU Slicing: Multi-worker-per-node GPU slicing with dynamic allocation
  • Explicit Node Assignment: Dedicated node assignment for prefill and decode workers
  • Shell Script Entry Point: Shell script-based entry point (replacing Python implementation)
  • Environment Validation: Environment validation during startup sequence
  • Error Handling: Error detection and retry mechanism for worker failures
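
The error detection and retry item above can be pictured with a small Python sketch. The release notes do not describe the actual mechanism (the AI Dynamo entry point is a shell script), so `run_with_retry`, `max_attempts`, and `delay_s` are hypothetical names used only for illustration, not CloudAI APIs.

```python
import time
from typing import Callable, Optional


def run_with_retry(
    start_worker: Callable[[], int],
    max_attempts: int = 3,
    delay_s: float = 0.0,
) -> Optional[int]:
    """Call start_worker until it returns exit code 0 or attempts run out.

    Returns the 1-based attempt number that succeeded, or None if all failed.
    All names here are hypothetical; this is not the CloudAI implementation.
    """
    for attempt in range(1, max_attempts + 1):
        exit_code = start_worker()
        if exit_code == 0:
            return attempt
        if attempt < max_attempts:
            # Wait before the next attempt; skipped after the final failure.
            time.sleep(delay_s)
    return None


# Hypothetical worker that fails twice before succeeding.
_codes = iter([1, 1, 0])
print(run_with_retry(lambda: next(_codes)))
```

A bounded attempt count keeps a permanently failing worker from blocking the job indefinitely; a real launcher would likely also inspect logs or process state rather than only an exit code.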

5. Agent System Enhancements

  • Plugin System: Load agents from entrypoints for extensibility
  • Custom Reward Functions: Custom reward functions with latency & throughput metrics handling
  • Configurable Rewards: Configurable agent reward functions
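
Loading plugins from entry points, as the Plugin System item describes, is commonly done with the standard library's `importlib.metadata`. The sketch below shows that general pattern only; the group name `cloudai.agents` and the function `load_agents` are assumptions for illustration, and the release notes do not state which group name CloudAI actually uses.

```python
from importlib.metadata import entry_points
from typing import Any, Dict


def load_agents(group: str = "cloudai.agents") -> Dict[str, Any]:
    """Return {entry point name: loaded object} for every plugin in `group`.

    The default group name is hypothetical, not confirmed by the release notes.
    """
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a plain dict.
    if hasattr(eps, "select"):
        selected = eps.select(group=group)
    else:
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


# A group that no installed package registers yields an empty mapping.
print(load_agents("example.nonexistent.group"))
```

With this pattern, a third-party package only needs to declare an entry point in its packaging metadata to be discovered, so no change to the host application is required.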

6. Enhanced Documentation

  • Sphinx Framework: Sphinx-based documentation framework with GitHub Pages deployment
  • Comprehensive Workload Docs: Complete documentation for Bash, NCCL, UCC, and other workloads
  • Interactive Features: Copy-to-clipboard support for code snippets

7. CLI Modernization

  • Click Framework: Complete CLI re-implementation using the Click framework (replacing argparse)
  • Improved Usability: Made --tests-dir optional for better user experience
  • Simplified Commands: Removed unnecessary install/uninstall command options

8. NIXL Workload Improvements

  • Per-rank Environment Variables: Enhanced per-rank environment variable support for NIXL Perftest
  • Multi-backend Support: Support for non-UCX backends with multiple backend options
  • Enhanced Output Parsing: Improved output parsing for noisy/multi-format output
  • Container Management: etcd now managed from NIXL container

9. Reporting & Metrics

  • NCCL Comparison Reports: NCCL comparison reports with latency metrics
  • Reusable Framework: Reusable comparison report framework for NIXL
  • Multi-section CSV: Multi-section CSV format handling for AI Dynamo
  • Configurable Reports: Configurable reports via scenario configuration
  • DSE Trajectory Support: DSE trajectory support for single-sbatch mode
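
To illustrate what handling a multi-section CSV can involve, here is a minimal Python sketch. It assumes that sections are separated by blank lines and that each section starts with its own header row; the release notes do not specify the AI Dynamo file layout, and `parse_sections` and the sample data are invented for the example.

```python
import csv
import io
from typing import Dict, List


def parse_sections(text: str) -> List[List[Dict[str, str]]]:
    """Split CSV text on blank lines and parse each section with its own header.

    Assumed layout (not confirmed by the release notes): blank-line separators,
    one header row per section.
    """
    sections: List[List[Dict[str, str]]] = []
    chunk: List[str] = []
    for line in text.splitlines() + [""]:  # trailing "" flushes the last section
        if line.strip():
            chunk.append(line)
        elif chunk:
            sections.append(list(csv.DictReader(io.StringIO("\n".join(chunk)))))
            chunk = []
    return sections


# Invented sample with two sections that use different columns.
sample = "metric,avg\nttft_ms,12.5\n\nmetric,value\nthroughput,830\n"
for rows in parse_sections(sample):
    print(rows)
```

Parsing each section separately avoids forcing rows with different columns under a single header, which is the usual failure mode when such a file is read as one table.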

10. Development & Maintenance

  • Environment Management: Added uv.lock for persistent environment management
  • Code Cleanup: Removed deprecated NemoLauncher-based configurations
  • Refactoring: NeMo recipes refactoring for better maintainability

11. Reliability & Error Handling

  • Output Directory Handling: Improved output directory error handling (permissions, read-only filesystem)
  • Graceful Error Handling: Graceful handling of missing tests with MissingTestError
  • Environment Preservation: Better environment variable preservation (order maintained from system schema)
  • Debug Logging: Enhanced debug logging for system config parsing errors
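
The output directory item above distinguishes permission problems from read-only filesystems. A minimal Python sketch of that kind of handling follows. The function name `prepare_output_dir` appears in the changelog below, but this body, its signature, and its log messages are assumptions and not the project's code.

```python
import errno
import logging
import tempfile
from pathlib import Path
from typing import Optional


def prepare_output_dir(path: Path) -> Optional[Path]:
    """Create `path` if needed; log a readable message and return None on failure.

    Hypothetical body written for illustration only.
    """
    try:
        path.mkdir(parents=True, exist_ok=True)
    except PermissionError:
        logging.error("No permission to create output directory: %s", path)
        return None
    except OSError as e:
        if e.errno == errno.EROFS:
            logging.error("Output directory is on a read-only filesystem: %s", path)
        else:
            logging.error("Cannot create output directory %s: %s", path, e)
        return None
    return path


with tempfile.TemporaryDirectory() as tmp:
    created = prepare_output_dir(Path(tmp) / "results")
    print(created is not None)
```

`PermissionError` is caught before the generic `OSError` because it is a subclass; reversing the order would hide the more specific message.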

What's Changed (details)

  • Support custom matgen args and set valid ppn by @amaslenn in #612
  • Fix gres related directives for single sbatch mode by @amaslenn in #613
  • Preserve the order of environment variables specified in the system schema by @TaekyungHeo in #616
  • Update docker_image_url separator from colon to hash by @TaekyungHeo in #621
  • Bump default version to v1.4 by @amaslenn in #622
  • Update USER_GUIDE.md by @TaekyungHeo in #623
  • Update doc/ai_dynamo.md by @TaekyungHeo in #624
  • Update conf/common/test/nemo_run_llama3_8b.toml by @TaekyungHeo in #625
  • Replace PyTorch image tag from 24.02-py3 to 25.06-py3 in all conf TOML files by @TaekyungHeo in #627
  • Update doc/ai_dynamo.md by @TaekyungHeo in #628
  • Replace Nemo image tag from 24.12.rc3 to 25.04.rc2 in all conf TOML files by @TaekyungHeo in #626
  • Improve prepare_output_dir error handling for permissions and read-only fs by @TaekyungHeo in #629
  • Improve prepare_output_dir error handling for permissions and read-only fs (continued) by @TaekyungHeo in #631
  • Add GitRepo support to KubernetesInstaller with install/uninstall logic by @TaekyungHeo in #634
  • Use a shell script as the entry point for AI Dynamo by @TaekyungHeo in #615
  • Handle multi-section CSV format in AI Dynamo report generation by @TaekyungHeo in #620
  • Add multi-worker-per-node GPU slicing support with dynamic allocation by @TaekyungHeo in #636
  • Log mapping between AI Dynamo nodes and roles by @TaekyungHeo in #617
  • Updates for SlurmContainer workload by @amaslenn in #638
  • Handle missing tests gracefully by adding MissingTestError to avoid backtrace by @TaekyungHeo in #640
  • Clean up src/cloudai/workloads/ai_dynamo/ai_dynamo.sh by @TaekyungHeo in #639
  • Preserve installables' state during apply_params_set() by @amaslenn in #643
  • Control which env vars are dumped for per-rank evaluation by @amaslenn in #642
  • Align extra_env_vars definition in test and scenario by @amaslenn in #644
  • Update USER_GUIDE.md by @TaekyungHeo in #646
  • Add latency metric reporting for NCCL by @amaslenn in #645
  • Support for DeepSeekR1 model with SGLang / AI Dynamo by @TaekyungHeo in #641
  • Support mounting any JSON files for --dynamo-deepep-config by @TaekyungHeo in #650
  • Set tp-size and dp-size from args if provided, else use total_gpus by @TaekyungHeo in #649
  • Add environment validation to startup sequence by @TaekyungHeo in #651
  • Follow-up for PR641 (Support for DeepSeekR1 model with SGLang / AI Dynamo) by @TaekyungHeo in #653
  • Reorder the functions in ai_dynamo.sh for improved maintainability by @TaekyungHeo in #654
  • Refactor GPU count to use _gpus_per_node in vllm and env validation by @TaekyungHeo in #657
  • Mount huggingface_home_container_path unconditionally by @TaekyungHeo in #655
  • Refactor nodelist validation to check DYNAMO_NODELIST only if both args empty by @TaekyungHeo in #658
  • Comparison report for NCCL workloads by @amaslenn in #656
  • Support explicit node assignment for prefill and decode workers by @TaekyungHeo in #647
  • Configure reports via scenario config by @amaslenn in #661
  • Handle CancelledError gracefully during job cleanup by @TaekyungHeo in #662
  • Small housekeeping updates by @amaslenn in #663
  • nemo recipes refactor by @malay-nagda in #633
  • Re-use comparison report for NIXL by @amaslenn in #664
  • Handle single-sbatch metadata layout in report by @amaslenn in #666
  • Follow-up for PR647 (Support explicit node assignment for prefill and decode workers) by @TaekyungHeo in #665
  • Support two NIXL bench output formats by @amaslenn in #668
  • Add error detection and retry mechanism for worker failures by @TaekyungHeo in #659
  • Use single source of data for reporting and NIXL pass/fail by @amaslenn in #670
  • Write trajectory file for DSE jobs in single-sbatch mode by @amaslenn in #671
  • Update NIXL bench command generation logic by @amaslenn in #673
  • Make agent_reward_function configurable by @TaekyungHeo in #675
  • Add custom reward functions with latency & throughput metrics handling by @TaekyungHeo in #674
  • Auto install missing components for workloads in run mode by @amaslenn in #676
  • Fix step idx in single-sbatch trajectory by @amaslenn in #678
  • Get rid of strict validation for configs by @amaslenn in #680
  • Update Nemo image reference to nvcr.io#nvidia/nemo:25.07 in all toml files by @TaekyungHeo in #681
  • Update NIXL perftest workload by @amaslenn in #679
  • Increase time limit to 60...

v1.4.rc2

16 Oct 16:11
71dfd89

v1.4.rc2 Pre-release

Full Changelog: v1.4.rc1...v1.4.rc2

v1.4.rc1

15 Oct 14:54
207d489

v1.4.rc1 Pre-release

Full Changelog: v1.4.beta25...v1.4.rc1