A simple toolkit for implementing and verifying LLM kernels.
- FP8 blockwise GEMM
- INT8 GEMM
- W4A8 GEMM (Triton)
- INT4 weight pack/unpack
- W4A16 GEMM (CUDA, simple Marlin)
- W4A8 GEMM (CUDA, simple QServe)
- FP4/FP6/FP8 fake-quantization functions
- Multiple communication strategies (All-to-All, AllGather)
- Group GEMM acceleration
- Quantized Group GEMM
- SageAttention
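
As an illustration of the INT4 weight pack/unpack item above, here is a minimal NumPy sketch of what such a step can look like. It is not the toolkit's own code: the function names `pack_int4` and `unpack_int4` are hypothetical, and the layout (two signed 4-bit values per byte, even index in the low nibble) is an assumption. Real kernels such as Marlin typically use a different, interleaved layout tuned to the GPU.

```python
import numpy as np


def pack_int4(w: np.ndarray) -> np.ndarray:
    """Pack signed int4 values in [-8, 7] pairwise into uint8 along the last axis.

    Assumed layout (hypothetical): even index -> low nibble, odd index -> high nibble.
    """
    if w.shape[-1] % 2 != 0:
        raise ValueError("last dimension must be even")
    if w.min() < -8 or w.max() > 7:
        raise ValueError("values must lie in [-8, 7]")
    # Keep only the two's-complement low nibble of each value.
    nibbles = (w.astype(np.int16) & 0xF).astype(np.uint8)
    lo = nibbles[..., 0::2]
    hi = nibbles[..., 1::2]
    return lo | (hi << 4)


def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover signed int4 values as an int8 array."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend: nibbles 8..15 represent -8..-1.
    lo = np.where(lo >= 8, lo - 16, lo).astype(np.int8)
    hi = np.where(hi >= 8, hi - 16, hi).astype(np.int8)
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = lo
    out[..., 1::2] = hi
    return out


if __name__ == "__main__":
    w = np.array([[-8, 7, -1, 0], [3, -4, 5, -6]], dtype=np.int8)
    packed = pack_int4(w)
    restored = unpack_int4(packed)
    print(packed.shape, packed.dtype)
    print(np.array_equal(restored, w))
```

A round trip through `pack_int4` and `unpack_int4` should return the original weights, which is a convenient first check before comparing a packed-weight GEMM against a reference.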