Releases: NVIDIA/cloudai
v1.5.beta7
What's Changed
- M bridge Documentation by @srivatsankrishnan in #765
- Remove hardcoded --distribution=arbitrary by @juntaowww in #766
- M bridge updates by @srivatsankrishnan in #767
New Contributors
- @juntaowww made their first contribution in #766
Full Changelog: v1.5.beta6...v1.5.beta7
v1.5.beta6
What's Changed
- Update codeowners by @srivatsankrishnan in #717
- Aiconfig by @srivatsankrishnan in #760
- Rula review by @RulaHallak in #761
- Automatically install sshd for NCCL k8s workers if not available by @amaslenn in #759
- Add workload for OSU Micro Benchmark by @allkoow in #742
- Rename field model_config to model_cfg in NIXLKVBench workload by @allkoow in #763
- Megatron Bridge in CloudAI by @srivatsankrishnan in #764
Full Changelog: v1.5.beta5...v1.5.beta6
v1.5.beta5
What's Changed
- UCC add file generator by @yaeliyac in #747
- Do not set -N/--nodes if nodelist is specified by @amaslenn in #746
- Use genai-perf from Dynamo container when running k8s by @amaslenn in #748
- Fix empty table if not all results are available by @amaslenn in #753
- Ensure reports order by @amaslenn in #754
- Update documentation on Dynamo k8s multi node by @amaslenn in #749
- Fix bokeh charts generation by @amaslenn in #755
- Enhancements for Dynamo with k8s by @amaslenn in #752
- Fix a crash during dry-run for Dynamo scenario by @amaslenn in #757
- Describe global options for cloudai CLI by @amaslenn in #758
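The -N/--nodes change in this release (#746) can be pictured with a minimal sketch. The helper name build_srun_args and its parameters are hypothetical and are not taken from the CloudAI code base; only the srun options -N/--nodes and --nodelist are real Slurm flags.

```python
from typing import List, Optional


def build_srun_args(num_nodes: int, nodelist: Optional[List[str]] = None) -> List[str]:
    """Build node-selection arguments for srun (hypothetical helper, not CloudAI code)."""
    if nodelist:
        # An explicit node list already determines which nodes are used,
        # so the node count flag is left out.
        return [f"--nodelist={','.join(nodelist)}"]
    # No explicit nodes were requested: ask Slurm for num_nodes nodes.
    return [f"-N{num_nodes}"]
```

With a node list such as ["node01", "node02"], the sketch emits only --nodelist=node01,node02; without one, it falls back to -N.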
Full Changelog: v1.5.beta4...v1.5.beta5
v1.5.beta4
What's Changed
- Add new installable type: HF model by @amaslenn in #735
- Add extra_srun_args on TestRun level by @amaslenn in #734
- Dynamo pass/fail and slurm example by @amaslenn in #736
- Add support for HF model in K8s by @amaslenn in #737
- Configure Dynamo k8s based on TOML, not an extra config by @amaslenn in #738
- Fine tune CodeRabbit reviews by @amaslenn in #740
- Expand K8s Dynamo support to disagg and multinode by @amaslenn in #739
- Generate reports in dry-run by @amaslenn in #741
- Update documentation by @amaslenn in #743
- Simplify Dynamo slurm configuration by @amaslenn in #745
Full Changelog: v1.5.beta3...v1.5.beta4
v1.5.beta3
What's Changed
- Print scenario status table at the end of a run by @amaslenn in #730
- Always set number of nodes for srun cmd by @amaslenn in #729
- Convert base System into pydantic model by @amaslenn in #732
- Add HF home dir property inside System model by @amaslenn in #733
Full Changelog: v1.5.beta2...v1.5.beta3
v1.5.beta2
What's Changed
- Fix NameError for K8s batch run by @amaslenn in #721
- Add DDLB workload by @nsarka in #711
- Updates for Dynamo over K8s by @amaslenn in #724
- Fixed an issue where using dependencies could result in an infinite loop by @amaslenn in #725
- Report results dir to users as early as possible by @amaslenn in #726
- Configure AI code review tools by @amaslenn in #728
- Kill and wait for ETCD process to be gone by @amaslenn in #727
- DeepEP benchmark by @ybenvidia in #723
Full Changelog: v1.5.beta1...v1.5.beta2
v1.5.beta1
What's Changed
- Remove DeepEP callback for llama4 by @aahouzi in #712
- Run tests for several py versions by @amaslenn in #713
- Bump fallback version to v1.5 and upgrade dependencies by @amaslenn in #714
- Small enhancements by @amaslenn in #715
- Simplify internal hierarchy of classes by @amaslenn in #716
- Update documentation by @amaslenn in #718
Full Changelog: v1.4.0...v1.5.beta1
v1.4.0
Highlights
1. GB300 Support for Common Configs
CloudAI and example configurations now support GB300 systems, expanding hardware compatibility.
2. AI Dynamo with Kubernetes Support (Alpha)
- Added Kubernetes SPCx support for AI Dynamo workload
- Enhanced container orchestration capabilities
3. New Model and Workload Support
- Qwen Recipe Support: Added comprehensive Qwen model recipe support
- SGLang Backend: Added SGLang backend support for DeepSeekR1 model
- NIXL KVBench: New NIXL KVBench workload with full Slurm integration
4. Advanced AI Dynamo Capabilities
- Multi-worker GPU Slicing: Multi-worker-per-node GPU slicing with dynamic allocation
- Explicit Node Assignment: Dedicated node assignment for prefill and decode workers
- Shell Script Entry Point: Shell script-based entry point (replacing Python implementation)
- Environment Validation: Environment validation during startup sequence
- Error Handling: Error detection and retry mechanism for worker failures
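As an illustration of multi-worker-per-node GPU slicing, the following sketch splits the GPUs of one node evenly among several workers. The function name slice_gpus and the even-split policy are assumptions made for this example; the release notes do not describe how CloudAI allocates GPUs internally. Each returned string has the form accepted by the CUDA_VISIBLE_DEVICES environment variable.

```python
from typing import List


def slice_gpus(gpus_per_node: int, workers_per_node: int) -> List[str]:
    """Return one comma-separated GPU index list per worker (illustrative only)."""
    if workers_per_node <= 0 or gpus_per_node % workers_per_node != 0:
        raise ValueError("GPUs must divide evenly among workers")
    per_worker = gpus_per_node // workers_per_node
    # Worker w receives the contiguous index range [w * per_worker, (w + 1) * per_worker).
    return [
        ",".join(str(g) for g in range(w * per_worker, (w + 1) * per_worker))
        for w in range(workers_per_node)
    ]
```

For an 8-GPU node with 4 workers, each worker would see two GPUs, for example "0,1" for the first worker.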
5. Agent System Enhancements
- Plugin System: Load agents from entrypoints for extensibility
- Custom Reward Functions: Custom reward functions with latency & throughput metrics handling
- Configurable Rewards: Configurable agent reward functions
6. Enhanced Documentation
- Sphinx Framework: Sphinx-based documentation framework with GitHub Pages deployment
- Comprehensive Workload Docs: Complete documentation for Bash, NCCL, UCC, and other workloads
- Interactive Features: Copy-to-clipboard support for code snippets
7. CLI Modernization
- Click Framework: Complete CLI re-implementation using Click framework (replacing argparse)
- Improved Usability: Made --tests-dir optional for better user experience
- Simplified Commands: Removed unnecessary install/uninstall command options
8. NIXL Workload Improvements
- Per-rank Environment Variables: Enhanced per-rank environment variable support for NIXL Perftest
- Multi-backend Support: Support for non-UCX backends with multiple backend options
- Enhanced Output Parsing: Improved output parsing for noisy/multi-format output
- Container Management: etcd now managed from NIXL container
9. Reporting & Metrics
- NCCL Comparison Reports: NCCL comparison reports with latency metrics
- Reusable Framework: Reusable comparison report framework for NIXL
- Multi-section CSV: Multi-section CSV format handling for AI Dynamo
- Configurable Reports: Configurable reports via scenario configuration
- DSE Trajectory Support: DSE trajectory support for single-sbatch mode
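To make the multi-section CSV item concrete, here is a minimal parsing sketch. It assumes a layout in which sections are separated by blank lines and each section starts with its own header row; that layout and the name parse_sections are assumptions for illustration, not a description of the actual AI Dynamo output or of CloudAI's parser.

```python
import csv
import io
from typing import Dict, List


def parse_sections(text: str) -> List[List[Dict[str, str]]]:
    """Split blank-line-separated CSV sections and parse each one on its own."""
    sections: List[List[Dict[str, str]]] = []
    block: List[str] = []
    # The trailing empty string acts as a sentinel that flushes the last block.
    for line in text.splitlines() + [""]:
        if line.strip():
            block.append(line)
        elif block:
            # The first line of each block is used as that section's header row.
            sections.append(list(csv.DictReader(io.StringIO("\n".join(block)))))
            block = []
    return sections
```

Each section is returned as a list of row dictionaries keyed by that section's own header, so sections with different columns do not interfere with one another.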
10. Development & Maintenance
- Environment Management: Added uv.lock for persistent environment management
- Code Cleanup: Removed deprecated NemoLauncher-based configurations
- Refactoring: NeMo recipes refactoring for better maintainability
11. Reliability & Error Handling
- Output Directory Handling: Improved output directory error handling (permissions, read-only filesystem)
- Graceful Error Handling: Graceful handling of missing tests with MissingTestError
- Environment Preservation: Better environment variable preservation (order maintained from system schema)
- Debug Logging: Enhanced debug logging for system config parsing errors
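The output directory handling described above can be sketched as follows. The name prepare_output_dir appears in the change list for this release, but the body here is an assumption written for illustration: it shows one way to report permission and read-only filesystem errors without a traceback, and is not CloudAI's implementation.

```python
import errno
import logging
from pathlib import Path
from typing import Optional


def prepare_output_dir(path: Path) -> Optional[Path]:
    """Create the output directory, returning None on expected failures."""
    try:
        path.mkdir(parents=True, exist_ok=True)
    except PermissionError:
        logging.error("No permission to create output directory: %s", path)
        return None
    except OSError as exc:
        if exc.errno == errno.EROFS:
            logging.error("Output directory is on a read-only filesystem: %s", path)
            return None
        # Anything else is unexpected, so let the caller see it.
        raise
    return path
```

Returning None lets the caller stop early with a clear log message, while unrelated OS errors still propagate.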
What's Changed (details)
- Support custom matgen args and set valid ppn by @amaslenn in #612
- Fix gres related directives for single sbatch mode by @amaslenn in #613
- Preserve the order of environment variables specified in the system schema by @TaekyungHeo in #616
- Update docker_image_url separator from colon to hash by @TaekyungHeo in #621
- Bump default version to v1.4 by @amaslenn in #622
- Update USER_GUIDE.md by @TaekyungHeo in #623
- Update doc/ai_dynamo.md by @TaekyungHeo in #624
- Update conf/common/test/nemo_run_llama3_8b.toml by @TaekyungHeo in #625
- Replace PyTorch image tag from 24.02-py3 to 25.06-py3 in all conf TOML files by @TaekyungHeo in #627
- Update doc/ai_dynamo.md by @TaekyungHeo in #628
- Replace Nemo image tag from 24.12.rc3 to 25.04.rc2 in all conf TOML files by @TaekyungHeo in #626
- Improve prepare_output_dir error handling for permissions and read-only fs by @TaekyungHeo in #629
- Improve prepare_output_dir error handling for permissions and read-only fs (continued) by @TaekyungHeo in #631
- Add GitRepo support to KubernetesInstaller with install/uninstall logic by @TaekyungHeo in #634
- Use a shell script as the entry point for AI Dynamo by @TaekyungHeo in #615
- Handle multi-section CSV format in AI Dynamo report generation by @TaekyungHeo in #620
- Add multi-worker-per-node GPU slicing support with dynamic allocation by @TaekyungHeo in #636
- Log mapping between AI Dynamo nodes and roles by @TaekyungHeo in #617
- Updates for SlurmContainer workload by @amaslenn in #638
- Handle missing tests gracefully by adding MissingTestError to avoid backtrace by @TaekyungHeo in #640
- Clean up src/cloudai/workloads/ai_dynamo/ai_dynamo.sh by @TaekyungHeo in #639
- Preserve installables' state during apply_params_set() by @amaslenn in #643
- Control which env vars are dumped for per-rank evaluation by @amaslenn in #642
- Align extra_env_vars definition in test and scenario by @amaslenn in #644
- Update USER_GUIDE.md by @TaekyungHeo in #646
- Add latency metric reporting for NCCL by @amaslenn in #645
- Support for DeepSeekR1 model with SGLang / AI Dynamo by @TaekyungHeo in #641
- Support mounting any JSON files for --dynamo-deepep-config by @TaekyungHeo in #650
- Set tp-size and dp-size from args if provided, else use total_gpus by @TaekyungHeo in #649
- Add environment validation to startup sequence by @TaekyungHeo in #651
- Follow-up for PR641 (Support for DeepSeekR1 model with SGLang / AI Dynamo) by @TaekyungHeo in #653
- Reorder the functions in ai_dynamo.sh for improved maintainability by @TaekyungHeo in #654
- Refactor GPU count to use _gpus_per_node in vllm and env validation by @TaekyungHeo in #657
- Mount huggingface_home_container_path unconditionally by @TaekyungHeo in #655
- Refactor nodelist validation to check DYNAMO_NODELIST only if both args empty by @TaekyungHeo in #658
- Comparison report for NCCL workloads by @amaslenn in #656
- Support explicit node assignment for prefill and decode workers by @TaekyungHeo in #647
- Configure reports via scenario config by @amaslenn in #661
- Handle CancelledError gracefully during job cleanup by @TaekyungHeo in #662
- Small housekeeping updates by @amaslenn in #663
- nemo recipes refactor by @malay-nagda in #633
- Re-use comparison report for NIXL by @amaslenn in #664
- Handle single-sbatch metadata layout in report by @amaslenn in #666
- Follow-up for PR647 (Support explicit node assignment for prefill and decode workers) by @TaekyungHeo in #665
- Support two NIXL bench output formats by @amaslenn in #668
- Add error detection and retry mechanism for worker failures by @TaekyungHeo in #659
- Use single source of data for reporting and NIXL pass/fail by @amaslenn in #670
- Write trajectory file for DSE jobs in single-sbatch mode by @amaslenn in #671
- Update NIXL bench command generation logic by @amaslenn in #673
- Make agent_reward_function configurable by @TaekyungHeo in #675
- Add custom reward functions with latency & throughput metrics handling by @TaekyungHeo in #674
- Auto install missing components for workloads in run mode by @amaslenn in #676
- Fix step idx in single-sbatch trajectory by @amaslenn in #678
- Get rid of strict validation for configs by @amaslenn in #680
- Update Nemo image reference to nvcr.io#nvidia/nemo:25.07 in all toml files by @TaekyungHeo in #681
- Update NIXL perftest workload by @amaslenn in #679
- Increase time limit to 60...
v1.4.rc2
What's Changed
- Updated Python executable path in NIXL KVBench workloads by @Bohatchuk in #709
- Update docs by @amaslenn in #710
- ucc_perftest_add_gen_and_a2av by @yaeliyac in #692
New Contributors
- @Bohatchuk made their first contribution in #709
- @yaeliyac made their first contribution in #692
Full Changelog: v1.4.rc1...v1.4.rc2