release v2.8 rocm #428
Draft: VeeraRajasekhar wants to merge 21 commits into release_v2.8_rocm from veergopu/release_v2.8_rocm.
Conversation
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* remove import jax.extend.ffi

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* first draft; debug plan failure
* debug uid error
* tweak params
* add grad in output
* clean up prints
* fix prints in test
* apply review suggestions (several commits, co-authored by Chen Cui)
* address review comments
* fix unfused grad; add softmax_type; add sink to bwd
* fix padding mask; add swa tests; remove requires_grad for off-by-one
* update FE
* fix indent
* fix non-determinism and shapes
* clean up prints
* add GQA
* add CP A2A; dq/dk mismatches
* fix CP A2A; need cleaner solution
* fix CP A2A; pending cudnn kernel change
* minor fixes
* fix world size in unit test; avoid thd format
* fix kernel_backend, dtype in unit test; fix head_dim for FP8 Hopper
* fix thd logic
* fix fp8 context
* tweak CP logging
* allow no_mask/padding for SWA(left,0)
* Revert "allow no_mask/padding for SWA(left,0)" (reverts commit 08b4ccc67a08b6882080b06aa715f541bb832aca)
* add softmax_type to Jax
* add cuDNN version control
* prettify tests
* skip 9.13 for MLA, non 192/128
* rename compare_with_error
* small cleanups and improvements
* fix minor CI failures
* force sink/dsink to be float32
* switch FE to GH FE
* return to GH TE main FE commit
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* update FE to 1.14.1
* clean up before CI
* fix lint
* bump up cudnn version
* add backend selection guard for unit tests
* add docstring for softmax type enums in C

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… (#2169)

* Add pytest xml report for debug unittest and onnx unittest, and remove the duplicated test line in qa/L0_pytorch_debug_unittest/test.sh

Signed-off-by: erindai <shengfangd@nvidia.com>
* Adding Amax Primitive and related args.
* Enable local-amax for current-scaling and optionally run AR across FSDP/TP/SP.
* Adding doc for Amax Primitive.
* Fix the function name conflict.
* Modifications as feedback suggested.
* Fix errors from lint.
* Fix the wrong amax-scope in the bwd.
* Added more description for amax-scope.
* Fix the wrong attribute name.
* Keep dim for AmaxCalculation.
* Remove keepDim and add shardy_rule.
* Fix shardy_rule.
* Remove extra-collective bytes from ref_coll_count due to local amax.

Signed-off-by: Ming Huang <mingh@nvidia.com>
Signed-off-by: Ming-Xu Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
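The amax commits above concern computing a per-device ("local") amax and optionally all-reducing it across FSDP/TP/SP shards. As an illustrative, library-independent sketch (the helper names here are hypothetical, not Transformer Engine APIs), the reduction amounts to a max over per-shard maxima:

```python
def local_amax(shard):
    """Per-device amax: the largest absolute value in the local shard."""
    return max(abs(v) for v in shard)

def global_amax(shards):
    """Emulate an all-reduce(max) of local amaxes across data/tensor shards."""
    return max(local_amax(s) for s in shards)

# Two shards of one logical tensor; the reduced amax equals the full tensor's.
shards = [[0.5, -3.0], [1.0, 2.0]]
print(global_amax(shards))  # 3.0
```

Skipping the all-reduce (pure local amax) trades a collective per step for per-shard scales, which is the extra-collective accounting the last commit adjusts.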
* Rework shardy rules
* WAR for compound factor=1

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
update jax requirements

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix (repeated across four commits)

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
fix xml file name

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* init cgemm + unit tests
* UB bootstrap with NCCL, no MPI dependency
* add NVLINK-P2P check + error message
* skip tests if no NVLINK available
* use std::vector to store ncclComm_t
* update misuse of TP warning

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…mm` (#2210)

* add xml export for test_multiprocessing_encoder and test_cgemm

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Address tolerance check for current scaling dact

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add NVFP4 recipe
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Add MathDx dependency to GitHub builds
* Suggestions from GitHub Copilot
* Move 2x shape logic from core to PyTorch
* Fix compilation errors with CUDA 12.1
* SM 70 is not supported in CUDA 13
* Typo
* Revert "Move 2x shape logic from core to PyTorch" (reverts commit f8b2a2d0111d9af690b43bb98ae448d9a430a185)
* Added dequantize kernel for FP4
* Fix linter warning
* Add NVFP4 support with fusible ops: use logical tensor dims for PyTorch NVFP4 tensors, temporarily add unfused dequantize impl, fix bug where NVFP4 recipe was not configurable
* Fix logic for 2x shapes and move to PyTorch
* Fix CG test model config
* Debug NVFP4 tensor size function
* Proper handling of the RNG state
* Test SR properly
* Fix workspace size for GEMM heuristic
* Fix compile error in C++ NVFP4 test; fix some numeric errors when blocks are all zero
* fix distributed test problem shape
* proper assert dim for low precision AG TP
* clean up duplicated code in nvfp4_utils.cuh
* lint
* pylint: disable=unused-argument
* `nvte_cublas_gemm_v2` to take alpha pointer (#12): make nvte_cublas_gemm_v2 take alpha/beta pointers; users are expected to pass a valid C_tensor; fix typos; API takes const float* alpha
* Minor tweaks: support arbitrary beta scales; increase workspace to be aligned to 128 bytes
* Debug IMA with alpha pointer
* Support fused amax kernels with NVFP4 quantization
* Disable fused amax with cuDNN LayerNorm kernel
* Add NVFP4 cases to distributed tests for TE ops
* Change assert to NVTE_CHECK in the hadamard cast fusion
* Fix compile error
* Use global thread IDs for Philox subsequences
* Add shape checks for NVFP4 cast kernels
* Do not fuse amax if cuDNN normalization is forced by envvar

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Frank Sun <frsun@nvidia.com>
Co-authored-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: Evgeny Tsykunov <etsykunov@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
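The NVFP4 recipe work above quantizes tensors to the FP4 E2M1 format, whose representable magnitudes form a small fixed grid with a maximum of 6.0. As an illustrative sketch of the general idea (the helper names are hypothetical, not the Transformer Engine kernels, which additionally use block scales and stochastic rounding):

```python
# Representable magnitudes of the FP4 E2M1 format used by NVFP4.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(xs, amax):
    """Scale so amax maps to the largest E2M1 value, then round to the grid."""
    scale = amax / 6.0
    q = []
    for x in xs:
        y = x / scale
        # nearest grid magnitude, sign restored afterwards
        mag = min(E2M1_GRID, key=lambda g: abs(abs(y) - g))
        q.append(mag if y >= 0 else -mag)
    return q, scale

def dequantize_fp4(q, scale):
    """Invert the scaling; grid rounding error remains."""
    return [v * scale for v in q]

q, s = quantize_fp4([-6.0, 1.4, 0.2], amax=6.0)
print(q, s)  # [-6.0, 1.5, 0.0] 1.0
```

The coarse grid is why the commits above also add a dequantize kernel and careful handling of all-zero blocks.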
* Fix the segfault in the nvfp4 quantization
* [pre-commit.ci] auto fixes from pre-commit.com hooks

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* debug existing usage Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8_dpa Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reimplement fp8_dpa Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more clean up Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE develop Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * redesign CS; need cleanup Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up s/dP quantizers Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * return dP to DS Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * improve quantizer_helper; tweak dP DS/CS logic Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * debug CP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up non-CP; debug dq/dk mismatches Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor success with CP; need to remove debug info Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove debug info 
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disable fp8 output for fp8_mha + CS Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add output_tensor_type to FADescriptor Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes for CP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove print Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more fixes for non-CP and CP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable non-determinism for blackwell Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix indent; remove print Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * switch from create_tensor_from_data to make_like Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable a2a+p2p for CS CP and require additional cp_group_global Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * condense tests; only create dist groups once Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * consolidate CP P2P per-tile calls for fwd/bwd and fused/flash Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix flash-attn from last commit Signed-off-by: Charlene Yang 
<8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes for previous commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix attn_mask_type in f16 causal Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert bb6a0a59 temporarily Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reenable comparison for some tensors in CP tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix dbias for fused attn CP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up prints/comments and add back NVTE_CS_dP_SCALE Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first attempt at mixed DS/CS reduction Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for last commit for mixed DS/CS reduction Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove prints from 69639024 Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix DS recipe for dP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_DPA_FORCE_DS to force DS for all DPA tensors, not just dP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix NVTE_DPA_FORCE_DS and add NVTE_PRINT Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix last commit Signed-off-by: Charlene Yang 
<8636796+cyanguwa@users.noreply.github.com> * modify DS recipe for MLPerf Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reduce only over TP group; need to think about CP group later Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * streamline fake_recipe/quantizer generation; allow NVTE_DPA_Fixed_Scales or DS-update S/dP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add more print: NVTE_LAYER_NUMBER Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * split S/dP in env vars: NVTE_DPA_Fix_S_Scale and NVTE_DPA_Fix_dP_Scale Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix autocast_key for DS Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_REPEAT_in_F16 to repeat FP8 fwd/bwd passes Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 CS to UnfusedDPA; unsuccessful; does not affect other backends Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * temporary: print min/max and save tensors for debugging Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * emulate q/dq+bf16 with NVTE_Emulate_in_F16; add NVTE_DPA_FORCE_MXFP8 for MXFP8 q/dq Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add RHT to BMM1 with NVTE_RHT_BMM1 for the size Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * re-enable fused attn in dpa_fp8_vs_f16 test; changed during unfused attn implementation Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_FP8_CS_POWER_OF_2, 
NVTE_DPA_FORCE_BLOCKFP8, NVTE_Emulate_QDQ_QKV, NVTE_Emulate_QDQ_O, NVTE_Emulate_QDQ_dO Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add F16 O support for FP8 kernels Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to TE FE commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * return to FE develop Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tidy up; untested Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for last commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes and improvements for last commit Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * more minor fixes and improvements Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * more small fixes/improvements; mostly for CP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CS/DS recipe switch in DPA Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * avoid quantizing/saving of O when CS bwd uses F16 O Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move fp8_autocast(fp8_recipe) print 
to utils.py Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug logging to unit tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back prints of quantizers/layer_number for debugging Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable amax reduction for both CS and DS tensors Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix NVTE_FP8_DPA_BWD=0 for CP Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last commit for F16 fwd/bwd a2a+p2p Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * small fixes for float8_current_scaling(), nominal types, and unruly d_out types Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8_output in MHA and some CP tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes to CP tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for CP A2A Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clamp input data in tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove rmse and tighten atol/rtol for tests Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * restructure fp8_recipes Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix linter Signed-off-by: 
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Revert "remove rmse and tighten atol/rtol for tests". This reverts commit 15dba6a59a5323d414f02cf22f099cb00d880532. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* more fixes for linter. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix fp8 recipe changes for F16 code path. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* revert to FE on main to help with merges. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* switch back to FE develop after merge. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* update FE develop commit. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix last merge. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* revert to GitHub FE 1.14.1. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* update FE to its latest main. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* minor fix for A2A. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix last commit for A2A DS. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove memset for BSHD/SBHD FP8. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* remove concat for qkv quantization in CS. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* improve/simplify the logic for last commit. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* add nominal_type for UnfusedDPA FP8 EmuFunc. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* WIP: update env vars for DPA recipes. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* fix last commit. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* fix typo in last commit. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix DS recipe creation for NVFP4 global recipe. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* replace python max with torch.maximum. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* fix linter. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* fix CP A2A for FA. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* reduce prints in print_quantizers. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* add FP8 env vars to NVTE_DEBUG prints. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* add reduce_amax to DS repr. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* separate fp8_dpa/fp8_mha in CP tests; fix A2A for them; add f16_O tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* address some reviews. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* make data optional in create_hp_tensor_with_amax. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* minor fix for comments in bwd. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* print cudnn version in attn tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* disable CS for Hopper. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* alternative tests to reduce CI time. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* make NVTE_DPA_FP8CS_O_in_F16 default to 1. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove _fp8 variables to avoid confusion. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* return to requiring two cp_groups for a2a+p2p. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* replace NVTE_PRINT with NVTE_DEBUG/_LEVEL for quantizer prints. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* provide a basic set of tests for CP. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* fix the last merge with nvfp4 PR. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* disable for Hopper. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix fp8 backend selection for Hopper. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* reduce CP CI to essential tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* minor fix to CP test. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* fix recipe logic in tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* revert to concat for qkv quantization. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* remove cudnn version in qa scripts. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
--------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
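One of the commits above replaces Python's built-in `max` with `torch.maximum`. The distinction matters for tensors: the built-in `max(a, b)` ends up coercing an elementwise comparison result to a single boolean, which raises for tensors with more than one element, while `torch.maximum` compares elementwise and participates in autograd. A minimal sketch of the difference (illustrative only, not code from this PR):

```python
import torch

a = torch.tensor([1.0, 5.0, 3.0])
b = torch.tensor([4.0, 2.0, 6.0])

# Built-in max(a, b) evaluates `b > a`, producing a boolean tensor,
# then calls bool() on it -- ambiguous for multi-element tensors.
try:
    max(a, b)
except RuntimeError as exc:
    print("built-in max failed:", exc)

# torch.maximum compares elementwise (and supports broadcasting/autograd).
print(torch.maximum(a, b))  # tensor([4., 5., 6.])
```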
Load modules during initialize.
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>
* Fix the cublas workspace alignment. Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* Fix. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
--------
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
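The first commit above fixes the alignment of the cuBLAS workspace buffer. The standard technique for carving an aligned sub-buffer out of a raw allocation is to round the address up to the next multiple of the required alignment (a power of two; cuBLAS workspaces are commonly 256-byte aligned). A sketch of that arithmetic; the helper name and the 256-byte figure are illustrative assumptions, not taken from the actual patch:

```python
def align_up(addr: int, alignment: int) -> int:
    """Round `addr` up to the next multiple of `alignment`.

    `alignment` must be a power of two, so the round-up can be done
    with a single add-and-mask instead of a division.
    """
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    return (addr + alignment - 1) & ~(alignment - 1)

# Aligning arbitrary addresses to a 256-byte boundary:
print(align_up(1000, 256))  # 1024
print(align_up(1024, 256))  # 1024 (already aligned, unchanged)
```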
* Make sure to set usages for linear op quantizers before forward. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Avoid unsupported case for fused dbias+quantize kernel. Hopper does not support dbias + FP8 cast without FP8 transpose. Signed-off-by: Tim Moon <tmoon@nvidia.com>
--------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: