@VeeraRajasekhar
Contributor

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 21 commits January 19, 2026 19:06
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
* remove import jax.extend.ffi

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* first draft; debug plan failure

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* debug uid error

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* tweak params

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add grad in output

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix prints in test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* address review comments

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix unfused grad; add softmax_type; add sink to bwd

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix padding mask; add swa tests; remove requires_grad for off-by-one

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix indent

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix non-determinism and shapes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add GQA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add CP A2A; dq/dk mismatches

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix CP A2A; need cleaner solution

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix CP A2A; pending cudnn kernel change

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix world size in unit test; avoid thd format

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix kernel_backend, dtype in unit test; fix head_dim for FP8 Hopper

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix thd logic

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fp8 context

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* tweak CP logging

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* allow no_mask/padding for SWA(left,0)

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "allow no_mask/padding for SWA(left,0)"

This reverts commit 08b4ccc67a08b6882080b06aa715f541bb832aca.

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add softmax_type to Jax

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add cuDNN version control

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* prettify tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* skip 9.13 for MLA, non 192/128

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* rename compare_with_error

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* small cleanups and improvements

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix minor CI failures

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* force sink/dsink to be float32

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* switch FE to GH FE

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* return to GH TE main FE commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FE to 1.14.1

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up before CI

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix lint

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* bump up cudnn version

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add backend selection guard for unit tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add docstring for softmax type enums in C

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
… (#2169)

* Add pytest xml report for debug unittest and onnx unittest, and remove the duplicated test line in qa/L0_pytorch_debug_unittest/test.sh

---------

Signed-off-by: erindai <shengfangd@nvidia.com>
* Adding Amax Primitive and related args.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Enable local-amax for current-scaling and optionally run AR across FSDP/TP/SP.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Adding doc for Amax Primitive.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the function name conflict.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Modification as feedback suggested.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix errors from lint.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the wrong amax-scope in the bwd.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Added more description for amax-scope

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix the wrong attribute name.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Keep dim for AmaxCalculation.

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove keepDim and add shardy_rule

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Fix shardy_rule

Signed-off-by: Ming Huang <mingh@nvidia.com>

* Remove extra-collective bytes from ref_coll_count due to local amax.

Signed-off-by: Ming Huang <mingh@nvidia.com>

---------

Signed-off-by: Ming Huang <mingh@nvidia.com>
Signed-off-by: Ming-Xu Huang <mingh@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Rework shardy rules

* WAR for compound factor=1

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
update jax requirements

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
fix xml file name

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* init cgemm + unit tests

* UB bootstrap with NCCL, no MPI dependency

* add NVLINK-P2P check + error message

* skip tests if no NVLINK available

* use std::vector to store ncclComm_t

* update misuse of TP warning

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
…mm` (#2210)

* add xml export for test_multiprocessing_encoder and test_cgemm

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Address tolerance check for current scaling dact

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add NVFP4 recipe

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Frank Sun <frsun@nvidia.com>
Co-authored-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: Evgeny Tsykunov <etsykunov@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add MathDx dependency to GitHub builds

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Suggestions from GitHub Copilot

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Move 2x shape logic from core to PyTorch

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix compilation errors with CUDA 12.1

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* SM 70 is not supported in CUDA 13

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Typo

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

* Revert "Move 2x shape logic from core to PyTorch"

This reverts commit f8b2a2d0111d9af690b43bb98ae448d9a430a185.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Added dequantize kernel for FP4

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix linter warning

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add NVFP4 support with fusible ops

Use logical tensor dims for PyTorch NVFP4 tensors. Temporarily add unfused dequantize impl. Fix bug where NVFP4 recipe was not configurable.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Fix logic for 2x shapes and move to PyTorch

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix CG test model config

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Debug NVFP4 tensor size function

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Proper handling of the RNG state

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Test SR properly

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix workspace size for GEMM heuristic.

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Fix compile error in C++ NVFP4 test

Some numeric errors when blocks are all zero.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* fix distributed test problem shape

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* proper assert dim for low precision AG TP

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* clean up duplicated code in nvfp4_utils.cuh

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* lint

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* pylint: disable=unused-argument

Signed-off-by: zhongboz <zhongboz@nvidia.com>

* `nvte_cublas_gemm_v2` to take alpha pointer (#12)

* make nvte_cublas_gemm_v2 to take alpha/beta pointers

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* users are expected to pass a valid C_tensor

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* typos

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* API to have const float* alpha

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* Minor tweaks

Support arbitrary beta scales. Increase workspace to be aligned to 128 bytes.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug IMA with alpha pointer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Support fused amax kernels with NVFP4 quantization

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Disable fused amax with cuDNN LayerNorm kernel

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add NVFP4 cases to distributed tests for TE ops

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Change assert to NVTE_CHECK in the hadamard cast fusion

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* Fix compile error

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Use global thread IDs for Philox subsequences

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add shape checks for NVFP4 cast kernels

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Do not fuse amax if cuDNN normalization is forced by envvar

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

---------

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Frank Sun <frsun@nvidia.com>
Co-authored-by: Oleg Goncharov <ogoncharov@nvidia.com>
Co-authored-by: Zhongbo Zhu <zhongboz@nvidia.com>
Co-authored-by: Evgeny Tsykunov <etsykunov@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Fix the segfault in the nvfp4 quantization

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* debug existing usage

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fp8_dpa

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reimplement fp8_dpa

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more clean up

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE develop

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* redesign CS; need cleanup

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up s/dP quantizers

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* return dP to DS

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* improve quantizer_helper; tweak dP DS/CS logic

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* debug CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update FE commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up non-CP; debug dq/dk mismatches

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor success with CP; need to remove debug info

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove debug info

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable fp8 output for fp8_mha + CS

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add output_tensor_type to FADescriptor

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes for CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove print

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more fixes for non-CP and CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable non-determinism for blackwell

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix indent; remove print

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* switch from create_tensor_from_data to make_like

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable a2a+p2p for CS CP and require additional cp_group_global

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* condense tests; only create dist groups once

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* consolidate CP P2P per-tile calls for fwd/bwd and fused/flash

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix flash-attn from last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes for previous commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix attn_mask_type in f16 causal

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert bb6a0a59 temporarily

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reenable comparison for some tensors in CP tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix dbias for fused attn CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clean up prints/comments and add back NVTE_CS_dP_SCALE

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* first attempt at mixed DS/CS reduction

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix for last commit for mixed DS/CS reduction

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove prints from 69639024

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix DS recipe for dP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add NVTE_DPA_FORCE_DS to force DS for all DPA tensors, not just dP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix NVTE_DPA_FORCE_DS and add NVTE_PRINT

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* modify DS recipe for MLPerf

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reduce only over TP group; need to think about CP group later

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* streamline fake_recipe/quantizer generation; allow NVTE_DPA_Fixed_Scales or DS-update S/dP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more print: NVTE_LAYER_NUMBER

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* split S/dP in env vars: NVTE_DPA_Fix_S_Scale and NVTE_DPA_Fix_dP_Scale

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix autocast_key for DS

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add NVTE_REPEAT_in_F16 to repeat FP8 fwd/bwd passes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add FP8 CS to UnfusedDPA; unsuccessful; does not affect other backends

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* temporary: print min/max and save tensors for debugging

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* emulate q/dq+bf16 with NVTE_Emulate_in_F16; add NVTE_DPA_FORCE_MXFP8 for MXFP8 q/dq

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add RHT to BMM1 with NVTE_RHT_BMM1 for the size

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* re-enable fused attn in dpa_fp8_vs_f16 test; changed during unfused attn implementation

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add NVTE_FP8_CS_POWER_OF_2, NVTE_DPA_FORCE_BLOCKFP8, NVTE_Emulate_QDQ_QKV, NVTE_Emulate_QDQ_O, NVTE_Emulate_QDQ_dO

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add F16 O support for FP8 kernels

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert to TE FE commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* return to FE develop

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* tidy up; untested

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix for last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes and improvements for last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more minor fixes and improvements

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more small fixes/improvements; mostly for CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CS/DS recipe switch in DPA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* avoid quantizing/saving of O when CS bwd uses F16 O

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move fp8_autocast(fp8_recipe) print to utils.py

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add debug logging to unit tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add back prints of quantizers/layer_number for debugging

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable amax reduction for both CS and DS tensors

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix NVTE_FP8_DPA_BWD=0 for CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix last commit for F16 fwd/bwd a2a+p2p

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* small fixes for float8_current_scaling(), nominal types, and unruly d_out types

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fp8_output in MHA and some CP tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fixes to CP tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fixes for CP A2A

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* clamp input data in tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove rmse and tighten atol/rtol for tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restructure fp8_recipes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix linter

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* Revert "remove rmse and tighten atol/rtol for tests"

This reverts commit 15dba6a59a5323d414f02cf22f099cb00d880532.

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* more fixes for linter

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fp8 recipe changes for F16 code path

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert to FE on main to help with merges

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* switch back to FE develop after merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE develop commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix last merge

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert to GitHub FE 1.14.1

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* update FE to its latest main

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix for A2A

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix last commit for A2A DS

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove memset for BSHD/SBHD FP8

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove concat for qkv quantization in CS

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* improve/simplify the logic for last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add nominal_type for UnfusedDPA FP8 EmuFunc

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* WIP: update env vars for DPA recipes

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo in last commit

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix DS recipe creation for NVFP4 global recipe

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace python max with torch.maximum

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix linter

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CP A2A for FA

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reduce prints in print_quantizers

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add FP8 env vars to NVTE_DEBUG prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* add reduce_amax to DS repr

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* separate fp8_dpa/fp8_mha in CP tests; fix A2A for them; add f16_O tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address some reviews

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make data optional in create_hp_tensor_with_amax

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* minor fix for comments in bwd

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* print cudnn version in attn tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable CS for Hopper

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* alternative tests to reduce CI time

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make NVTE_DPA_FP8CS_O_in_F16 default to 1

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove _fp8 variables to avoid confusion

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* return to requiring two cp_groups for a2a+p2p

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* replace NVTE_PRINT with NVTE_DEBUG/_LEVEL for quantizer prints

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* provide a basic set of tests for CP

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix the last merge with nvfp4 PR

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disable for Hopper

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix fp8 backend selection for Hopper

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* reduce CP CI to essential tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix to CP test

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* fix recipe logic in tests

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* revert to concat for qkv quantization

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

* remove cudnn version in qa scripts

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Load modules during initialize

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: JAX Toolbox <jax@nvidia.com>
* Fix the cublas workspace alignment

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Signed-off-by: Przemyslaw Tredak <ptrendx@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Make sure to set usages for linear op quantizers before forward

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Avoid unsupported case for fused dbias+quantize kernel

Hopper does not support dbias + FP8 cast without FP8 transpose.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Tim Moon <tmoon@nvidia.com>