lnigam
7b8443ac78
ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32) (#22286)
...
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)
Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and a tile .cu file for Mistral Small 4 (head sizes 320/256), restricted to
ncols2=32 to support only a GQA ratio of 32.
* Add a check to return BEST_FATTN_KERNEL_NONE when GQA != 32
* Apply suggestions from code review
Address review comments
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments and make the kernel config default to DKQ=512, DV=512 instead of DKQ=256, DV=256
* Fix bug with sinks=1: with ncols=32, two warp groups are created but the sinks index is the same (0,...,15) for both groups, so with sinks=1 the output does not match the CPU output. Added sink_base as the base index for each warp group (threadIdx.y / np); see the sketch after this entry.
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-28 21:37:35 +02:00
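
A minimal CUDA sketch of the sink-indexing fix described above; the variable names here (np = warps per warp group, ncols = 32, j = this thread's column index within its group) are assumptions for illustration, not the exact identifiers in fattn-mma-f16.cuh:

    // With ncols=32 the block is split into two warp groups, each owning
    // half of the columns. Without an offset both groups read sink logits
    // 0..15; sink_base shifts the second group to 16..31.
    const int warp_group = threadIdx.y / np;         // 0 or 1 with two warp groups
    const int sink_base  = warp_group * (ncols / 2); // 0 for group 0, 16 for group 1
    const float sink     = sinks[sink_base + j];     // was sinks[j] for both groups
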
Johannes Gäßler
4eac5b4509
CUDA: refactor mma data loading for AMD (#22051)
...
* CUDA: refactor mma data loading for AMD
* fix CDNA MMQ occupancy
* fix CDNA3 mma
* fix RDNA3 compile
2026-04-19 18:26:59 +02:00
Anav Prasad
88458164c7
CUDA: Add Flash Attention Support for Head Dimension 512 (#20998)
...
* flash attention support for head dimension 512 added
* FA D=512 - match 576 configs, limit ncols2, revert vec cap
* fix HIP tile kernel build for D=512
* fix HIP tile kernel occupancy for D=512 on AMD
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix tile FA compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-01 09:07:24 +02:00
Marcel Petrick
92f7da00b4
chore : correct typos [no ci] (#20041)
...
* fix(docs): correct typos found during code review
Non-functional changes only:
- Fixed minor spelling mistakes in comments
- Corrected typos in user-facing strings
- No variables, logic, or functional code was modified.
Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
* Update docs/backend/CANN.md
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8"
This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256.
* Update tests/test-backend-ops.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-backend-ops.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Signed-off-by: Marcel Petrick <mail@marcelpetrick.it>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-05 08:50:21 +01:00
Jayant Lohia
ecbcb7ea9d
CUDA: add CDNA3 MFMA support for flash attention MMA kernel (#19806)
...
* CUDA: add CDNA3 MFMA support for flash attention MMA kernel
Add MI300X (gfx942) MFMA tensor core flash attention using
v_mfma_f32_16x16x16_f16 (FP16 in, FP32 accumulate).
- Add FATTN_WARP_SIZE=64 for CDNA wavefront64
- Add CDNA config for head sizes 64, 80, 96, 112, 128
- Add FP16 MFMA intrinsic path in mma.cuh
- Add manual V transpose load for MFMA register layout
- Route CDNA to MMA for prompt processing, VEC for token generation
- Fix Q loading and combine stride granularity for non-power-of-2 heads
Benchmarks (Qwen2.5-1.5B Q4_K_M, MI300X):
pp512 +7%, pp1024 +13%, pp2048 +23%, pp4096 +39%
tg128 -10% (FA overhead, VEC used for both)
All 2480 flash attention tests pass.
Ref: https://github.com/ggml-org/llama.cpp/issues/17917
* address review: replace FATTN_WARP_SIZE with constexpr, improve dispatch
- Replace #define FATTN_WARP_SIZE with constexpr int warp_size =
ggml_cuda_get_physical_warp_size() in each device function
- Use ne[1]*gqa_ratio threshold for MMA vs tile dispatch. Benchmarked
crossover on MI300X @ d32768 with power-of-2 GQA models:
hsk=64 (Llama 1B, gqa=4): MMA wins at eff >= 128 (+11%)
hsk=128 (Llama 3B, gqa=4): MMA wins at eff >= 128 (+4%)
Unified threshold: eff_nq >= 128 for all head sizes.
- Remove VEC fallback; small batches fall through to tile kernel
* Update ggml/src/ggml-cuda/fattn.cu
* use ggml_cuda_info().devices warp_size instead of hardcoded check
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-02-27 19:37:26 +01:00
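
A hedged sketch of the dispatch change above: ggml_cuda_get_physical_warp_size and the eff_nq >= 128 crossover come from the commit, while the helper name and exact call site are illustrative.

    // Per device function, instead of a global #define:
    // constexpr int warp_size = ggml_cuda_get_physical_warp_size(); // 64 on CDNA, 32 on NVIDIA

    // MMA vs. tile dispatch: MMA wins once the effective number of query
    // columns (Q rows * GQA ratio) reaches 128, for all head sizes.
    static bool cdna3_prefer_mma(const int64_t ne1, const int64_t gqa_ratio) {
        const int64_t eff_nq = ne1 * gqa_ratio;
        return eff_nq >= 128; // below this, small batches fall through to the tile kernel
    }
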
Johannes Gäßler
b0311c16d2
CUDA: fix padding of GQA to power of 2 in FA (#19115)
2026-01-26 23:24:58 +01:00
Johannes Gäßler
0c21677e43
CUDA: faster FA for GQA > 1 but not power of 2 (#19092)
2026-01-25 21:19:47 +01:00
Johannes Gäßler
8f91ca54ec
CUDA: re-use MLA K data for V in MMA FA (#19057)
2026-01-24 10:09:36 +01:00
Georgi Gerganov
a5eaa1d6a3
mla : make the V tensor a view of K (#18986)
...
* mla : pass V as a view of K to the FA op
* cuda : adjust mla logic to new layout
* kv-cache : fix rope shift
* tests : remove comment
* cuda : fix reusable_cutoff
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-22 22:09:01 +02:00
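
A hedged sketch of the layout change above using the public ggml_view_3d API; the tensor names and dimension variables are illustrative, not the exact code in llama.cpp:

    // With MLA, each V row is the first n_embd_head_v elements of the
    // corresponding K row, so V can be a zero-copy view into K's buffer:
    struct ggml_tensor * v = ggml_view_3d(ctx, k,
            n_embd_head_v, n_kv, n_head_kv, // narrower rows than K
            k->nb[1], k->nb[2],             // reuse K's strides: V rows alias K rows
            0);                             // offset 0: V rows are prefixes of K rows

This aliasing is what the CUDA commit above ("re-use MLA K data for V in MMA FA") exploits to skip loading V separately.
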
Aman Gupta
b70d251076
CUDA: add gqa_ratio 4 for GLM 4.7 flash (#18953)
2026-01-22 18:51:53 +08:00
yulo
ea4a321f2a
HIP: add fattn-mma-f16 for RDNA4 (#18481)
...
* finish VQ mma
* flash_attn_ext_f16_iter
* KQ_rowsum
* correct exp
* fix scale error
* fix softmax scale
* fix softmax scale
* enable fattn on cpu side
* fix random error
* disable fattn-mma-f16 on rdna3
* fix wrong col for rdna
* use identity mat to transpose
* resolve conflicts
* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B
* fix volta compile error
* align rdna4 policy for fattn
* adjust fattn policy
* adjust kernel selection logic
* update as the review comments
* keep fattn-wmma logic
* adjust kernel selection logic
---------
Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-13 13:52:16 +01:00
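
The "use identity mat to transpose" bullet above refers to a standard matrix-core trick: the MMA unit computes D = A·B + C with fixed but different register layouts for the input operands and the accumulator, so multiplying a tile by the identity re-lays it out without a trip through shared memory. Schematically, with A = I and C = 0:

    D = I·B + 0 = B

The values of B are unchanged, but D is delivered in the accumulator layout, which here serves as the transposed layout the kernel needs next.
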
Johannes Gäßler
ecc343de63
CUDA: fix KQ max calculation (#18487)
2025-12-31 09:37:00 +01:00
Johannes Gäßler
482211438d
CUDA: fix overflow in MMA kernel without stream-k (#17939)
2025-12-12 17:43:58 +01:00
Johannes Gäßler
17f7f4baad
CUDA: fix unpadded strides in MMA FA kernel (#17891)
2025-12-10 12:39:56 +01:00
Johannes Gäßler
e95d0bc8fd
CUDA: fix FA VKQ accumulator overflow (#17746)
2025-12-05 09:18:10 +01:00
Johannes Gäßler
2e1c9cd814
CUDA: generalized (mma) FA, add Volta support (#17505)
...
* CUDA: generalized (mma) FA, add Volta support
* use struct for MMA FA kernel config
---------
Co-authored-by: Aman Gupta <aman>
2025-12-03 16:57:05 +01:00
R0CKSTAR
8ad038c0fd
musa: add GGML_UNUSED_VARS (#15446)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-21 11:06:05 +08:00
R0CKSTAR
a094f38143
musa: fix build warnings (#15258)
...
* musa: fix build warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-20 10:17:37 +08:00
Johannes Gäßler
1425f587a8
CUDA: attention sinks for mma FlashAttention (#15157)
2025-08-08 08:19:58 +02:00
Johannes Gäßler
1d72c84188
CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16 (#15131)
...
* CUDA: GEMM for FP32/FP16/BF16 and ne11 <= 16
2025-08-07 10:53:21 +02:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
...
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
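
The attention-sinks mechanism added here gives each head a learned sink logit that joins the softmax normalization but maps to no value vector, so some attention mass can be absorbed instead of being forced onto real tokens. A minimal self-contained sketch (the function name and layout are illustrative; the real implementations live in the CPU/CUDA/Metal/Vulkan softmax and FA paths):

    #include <math.h>

    // Softmax over n logits with one extra sink logit in the denominator.
    // The sink keeps probability mass 1 - sum(p) and contributes no value.
    void softmax_with_sink(float * p, const float * logits, int n, float sink) {
        float m = sink;
        for (int i = 0; i < n; ++i) m = fmaxf(m, logits[i]);
        float denom = expf(sink - m);
        for (int i = 0; i < n; ++i) denom += expf(logits[i] - m);
        for (int i = 0; i < n; ++i) p[i] = expf(logits[i] - m) / denom;
    }
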
Johannes Gäßler
92b8810ec7
CUDA: skip masked KV slices for all FA kernels (#14924)
2025-07-30 15:46:13 +02:00
uvos
aa79524c51
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (#14945)
2025-07-29 20:23:04 +02:00
R0CKSTAR
9b8f3c6c77
musa: fix build warnings (unused variable) (#14869)
...
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-07-26 10:36:02 +08:00
Johannes Gäßler
a86f52b285
CUDA: fix overflow in FA, tune performance (#14840)
2025-07-23 21:43:25 +02:00
Georgi Gerganov
225e7a1438
llama : add high-throughput mode (#14363)
...
* kv-cache : prepare K/V buffers for separation
ggml-ci
* batched-bench : fix oob write
ggml-ci
* llama : add "virtual sequences"
ggml-ci
* llama : use "stream" vs "virtual sequence"
ggml-ci
* graph : fix stream splitting when KV cache is not used
ggml-ci
* kv-cache : add multi-stream save/load support
ggml-ci
* llama : add "--attn-streams" flag
ggml-ci
* kv-cache : fix handling when find_slot fails
ggml-ci
* kv-cache : restore find_slot impl
ggml-ci
* kv-cache : add comments
* kv-cache : add bounds checks for sequence id
ggml-ci
* cont : add n_seq_max to batch allocr
ggml-ci
* kv-cache : perform stream copies lazily after llama_synchronize
ggml-ci
* kv-cache : avoid throwing exceptions across the C boundary
ggml-ci
* CUDA: 4D FlashAttention support (#14628)
* CUDA: 4D FlashAttention support
* CUDA: fix WMMA FA kernel
* llama : rename attn_streams -> kv_unified
ggml-ci
* common : rename kv_split -> kv_unified
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Johannes Gäßler
12a81af45f
CUDA: broadcasting for FlashAttention mask (#14500)
2025-07-02 15:48:33 +03:00
Johannes Gäßler
0b4be4c435
CUDA: fix FTZ in FA for Gemma 3 (#13991)
2025-06-04 08:57:05 +02:00
Johannes Gäßler
e562eece7c
CUDA: fix typo in FlashAttention code (#13926)
2025-05-30 21:22:03 +02:00
R0CKSTAR
33983057d0
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
...
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-21 09:58:49 +08:00
Johannes Gäßler
6da34fa276
CUDA: faster Deepseek FA, add Turing support (#13435)
2025-05-14 16:08:20 +02:00
Johannes Gäßler
95e18884fc
CUDA: fix misaligned synchronization in FA (#13469)
2025-05-12 10:51:21 +02:00
Johannes Gäßler
0208355f42
CUDA: fix race conditions in FlashAttention kernels (#13438)
2025-05-10 22:22:48 +02:00
Johannes Gäßler
d8919424f1
CUDA: fix FlashAttention on Turing (#13415)
2025-05-10 09:16:52 +02:00
Johannes Gäßler
0cf6725e9f
CUDA: FA support for Deepseek (Ampere or newer) (#13306)
...
* CUDA: FA support for Deepseek (Ampere or newer)
* do loop unrolling via C++ template
2025-05-09 13:34:58 +02:00
R0CKSTAR
492d7f1ff7
musa: fix all warnings, re-enable -DLLAMA_FATAL_WARNINGS=ON in ci and update doc (#12611)
...
* musa: fix all warnings
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: enable -DLLAMA_FATAL_WARNINGS=ON in run.sh
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: update ci doc (install ccache)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* fix Windows build issue
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-03-30 10:59:38 +02:00
Gaurav Garg
517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
...
- Determine the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use it to choose the optimal parallel_blocks value (sketched below).
- Prefer vector flash attention kernels over the MMA kernel for BS=1
Fixes issue: #12182
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
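
A hedged sketch of the occupancy-driven choice described above; cudaOccupancyMaxActiveBlocksPerMultiprocessor is the real CUDA runtime API, while the kernel symbol, n_tiles, and the doubling heuristic are illustrative:

    int blocks_per_sm = 0;
    CUDA_CHECK(cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, fattn_vec_kernel, block_size, shared_mem_bytes));
    const int max_blocks = blocks_per_sm * n_sms; // n_sms from cudaDeviceProp

    // For BS=1 there is only one sequence; split its KV range across more
    // parallel blocks until the GPU is saturated, then reduce the partials.
    int parallel_blocks = 1;
    while (n_tiles * parallel_blocks * 2 <= max_blocks) {
        parallel_blocks *= 2;
    }
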
Johannes Gäßler
a28e0d5eb1
CUDA: add option to compile without FlashAttention (#12025)
2025-02-22 20:44:34 +01:00
Johannes Gäßler
5fa07c2f93
CUDA: optimize FA for GQA + large batches (#12014)
2025-02-22 12:20:17 +01:00
Johannes Gäßler
73e2ed3ce3
CUDA: use async data loading for FlashAttention (#11894)
...
* CUDA: use async data loading for FlashAttention
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-17 14:03:24 +01:00
Johannes Gäßler
864a0b67a6
CUDA: use mma PTX instructions for FlashAttention (#11583)
...
* CUDA: use mma PTX instructions for FlashAttention
* __shfl_sync workaround for movmatrix
* add __shfl_sync to HIP
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-02 19:31:09 +01:00