llama.cpp

Author	SHA1	Message	Date
hipudding	c3e08f4700	CANN: add new ops, optimize existing ops (#21204 ) New operators: - GGML_OP_SET: implement via aclnnInplaceCopy on target region - GGML_OP_CUMSUM: implement via aclnnCumsum - GGML_OP_FILL: implement via aclnnInplaceFillScalar - GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides - GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets - GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve - GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus Optimizations: - GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions now checked inside each function rather than at the call site - CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→ ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call - L2_NORM: fix in-place ClampMin on norm result (was clamping wrong tensor); add eps clamping before division to avoid divide-by-zero - PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies - GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor helper into gather_batched lambda with batch loop inlined - SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor helper into scatter_batched lambda with batch loop inlined - OUT_PROD: replace O(ne[3]ne[2]ne[1]) Ger+InplaceAdd loop with per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3 - backend memset_tensor: implement via aclrtMemset (was NULL) Bug fixes: - COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0 - ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679) - graph cache op_params matching: compare full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache	2026-04-28 09:27:22 +03:00
hipudding	632219af73	CANN: fix multi-thread set_tensor race conditions (#20151 ) * CANN: fix multi-thread set_tensor race conditions When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues: 1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data. 2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data. 3. The global g_nz_workspaces array had unprotected concurrent access. Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes. Add per-device mutex to g_nz_workspaces to prevent data races. * CANN: fix L2_NORM ignoring eps parameter The L2_NORM implementation was not using the eps parameter from op_params, causing incorrect results when eps is large (e.g. 10.0). The CPU reference computes scale = 1/fmaxf(norm, eps), so add a Clamp step to clamp the norm to at least eps before dividing. * ggml/cann: compare op_params for POOL_2D in ACL graph cache matching When ACL graph mode is enabled, the graph LRU cache checks whether a cached graph matches the current computation graph. Previously, GGML_OP_POOL_2D was not included in the op_params comparison, so two POOL_2D nodes with different pooling parameters (kernel size, stride, padding) but identical tensor shapes and addresses could incorrectly reuse a cached graph, leading to wrong results or aclnn errors. Add GGML_OP_POOL_2D to the list of ops that require op_params matching in ggml_graph_node_properties::has_matching_properties(). * cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison The ACL graph LRU cache was incorrectly reusing cached graphs for operations with different tensor types or op_params, causing test failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD, RMS_NORM_MUL_ADD, and ADD_RMS_NORM. Changes: - Add node_type and src_type[] fields to ggml_graph_node_properties so the cache can distinguish tensors with different types but identical ne/nb (e.g. f16 and bf16 both have 2-byte elements) - Compare op_params unconditionally for all ops instead of only for SCALE/UNARY/GLU/ROPE/POOL_2D	2026-03-31 17:00:51 +03:00
Chenguang Li	07ff000551	CANN: add RoPE cache preload before ACL graph capture (#20747 ) ACL graph capture disallows host-to-device memcpy and device memory malloc/free on the captured stream. Pre-load the RoPE cache before capture so that: - Host-to-device copies and allocations run on the non-captured stream - Cache metadata is populated and memory pool is warmed up - During capture, only on-device computations are recorded; host-side and allocation branches are skipped	2026-03-23 15:24:06 +08:00
hipudding	1af9dab32b	CANN: add BF16 support for core operators (#20152 ) * CANN: add BF16 support for core operators Add BF16 (bfloat16) type support to the CANN backend for the following operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and OUT_PROD. This enables BF16 models to run on Ascend NPUs. * CANN: skip NZ weight format for BF16 and add 310P compile guards NZ weight format conversion does not support BF16 tensors, skip it in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P guards for all BF16 operator support since 310P does not support BF16.	2026-03-20 17:08:39 +08:00
Chenguang Li	7f2cbd9a4d	CANN: handle in-place ROPE on non-contiguous f32 tensors (#20274 ) RotaryPositionEmbedding on CANN fails when src and dst share the same non-contiguous buffer (inplace + view), because the operator overwrites source data before it is fully read. Add a branch that detects this case and uses contiguous temporary buffers: copy src to temp, run ROPE into another temp, then copy back to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1, inplace=1). Signed-off-by: noemotiovon <757486878@qq.com>	2026-03-19 14:05:01 +08:00
Chenguang Li	07ba6d275b	CANN: support flash attention for head dim not multiple of 16, fix ALiBi slope offset (#20031 ) - Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2, then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp). - Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with 48 heads); fixes buffer overflow and large numerical errors in those cases.	2026-03-19 11:02:42 +08:00
hipudding	52e38faf8c	CANN: implement quantized MUL_MAT_ID for MoE models (#19228 ) Implement ggml_cann_mul_mat_id_quant function to support quantized matrix multiplication for Mixture of Experts (MoE) architectures on CANN backend. Key features: - Support Q4_0 and Q8_0 quantized weight formats - Use IndexSelect to dynamically route expert-specific weights based on indices - Leverage WeightQuantBatchMatmulV2 for efficient quantized computation - Handle automatic F16 type conversion for hardware compatibility - Support both per-expert and broadcast input modes Implementation details: - Extract expert weights and scales using CANN IndexSelect operation - Process each batch and expert combination independently - Create proper tensor views with correct stride for matmul operations - Automatic input/output type casting to/from F16 as needed Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).	2026-02-10 14:18:59 +08:00
Christian Kastner	7a4ca3cbd9	docs : Minor cleanups (#19252 ) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-02 08:38:55 +02:00
hipudding	baa4ba0aec	CANN: support gated linear attn (#18653 ) * CANN: support gated linear attn This change adds support for the GGML_OP_GATED_LINEAR_ATTN operator. The feature was implemented by YushengZhao. Because the previous submission was based on an outdated codebase, this PR was rebased to merge. Co-authored-by: YushengZhao <yusheng.chao@outlook.com> Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: optimize OP gla Optimize gla for high preformance * Remove unused comments --------- Co-authored-by: 赵禹昇 <2501112001@cninfer02.localdomain> Co-authored-by: YushengZhao <yusheng.chao@outlook.com>	2026-01-16 16:18:49 +08:00
hipudding	3333951d86	CANN: Fix rename for get_env (#18652 ) In #18624, get_env in ggml-cann was renamed to get_env_as_lowercase to accurately reflect the function’s behavior and reduce the chance of misuse. However, the update missed renaming call sites in other files. This commit fixes that oversight.	2026-01-07 16:11:31 +08:00
Chenguang Li	67e3f6f601	CANN: add operator fusion support for ADD + RMS_NORM (#17512 ) This commit implements operator fusion for ADD + RMS_NORM operations in the CANN backend to reduce memory access overhead and improve performance. The fusion is controlled by the GGML_CANN_OPERATOR_FUSION environment variable (default: false). Changes: - Implement ggml_cann_op_add_rms_norm_fused() using ACLNN AddRmsNorm - Add ggml_cann_can_fuse() to check fusion eligibility - Integrate fusion logic into computation graph evaluation - Add test cases for ADD + RMS_NORM fusion - Update documentation with new environment variable The fusion combines ADD and RMS_NORM into a single kernel call, which is more efficient than executing them separately.	2026-01-05 15:38:18 +08:00
0Marble	b07cda687c	CANN: implement the SSM_CONV operator (#17737 ) * CANN: implement SSM_CONV operator Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com> Co-authored-by: Sujin Kang, <waterjin326@gmail.com> * CANN: remove custom error limit for SSM_CONV * CANN: merge SSM_CONV tensor shape/strides into one line --------- Co-authored-by: Sujin Kang, <waterjin326@gmail.com>	2025-12-26 09:12:04 +08:00
Penglin Cai	e68c19b0fd	CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934 ) * CONV_TRANSPOSE_1D kernel_size>255 * remove condition check * fix the bug of type conversion * removing trailing whitespaces * fix: return true in the switch case	2025-12-25 16:46:09 +08:00
TianHao324	cf2ffc02bc	CANN: Uses yarn_ramp cache in ROPE (#17725 )	2025-12-24 14:55:33 +08:00
Chenguang Li	ca709e427b	CANN: add support for partial RoPE and Vision mode (#17543 ) * cann: add support for partial RoPE and Vision mode Add support for two important RoPE variants: partial rotation (rope_dims < ne0) and Vision mode rotation. 1. Support for partial RoPE (rope_dims < ne0): - Split tensor into head (first rope_dims dimensions) and tail portions - Apply rotation only to head portion using RotaryPositionEmbedding operator - Copy unrotated tail portion directly from source to destination - Handle both contiguous and non-contiguous tensor layouts 2. Support for Vision mode (GGML_ROPE_TYPE_VISION): - Set rope_dims = ne0 for Vision mode to rotate entire tensor - Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2) - No tail handling needed since entire tensor is rotated Implementation details: - Use has_tail flag to determine execution path: head/tail splitting when rope_dims < ne0, or full tensor rotation when rope_dims == ne0 - Support both F32 and F16 data types with intermediate F32 conversion - Copy non-contiguous tensors to contiguous buffers before calling RotaryPositionEmbedding operator for compatibility - Improve cache invalidation logic to include rope_dims and indep_sects parameters These enhancements enable CANN backend to handle various RoPE configurations used in modern vision-language models and models with partial rotation. * cann: fix review comment	2025-12-09 17:53:23 +08:00
hipudding	eeb5605de2	CANN: Add MROPE and IMROPE support (#17401 ) * CANN: ROPE supports both MROPE and IMROPE. 1. Optimize the caching logic of rope_cache_init. 2. Add support for mRoPE and i-mRoPE. Note that on Ascend 910B devices, it is necessary to disable FA in CLIP and disable NZ-format conversion. These two issues are still under investigation. * Resolve review comments	2025-11-26 16:44:19 +08:00
TianHao324	064c90d843	CANN: supports out_prod operator for F32 and F16 (#17406 ) Co-authored-by: tianhao <tianhao42@huawei.com>	2025-11-25 17:39:06 +08:00
Chenguang Li	bc4064cfea	CANN: fix acl_tensor_ptr usage in ASCEND_310P ROPE (#17347 ) * cann: fix acl_tensor_ptr usage in ASCEND_310P ROPE implementation Fix compilation errors in the ASCEND_310P-specific ROPE operation code by adding .get() calls when passing acl_tensor_ptr smart pointers to functions expecting raw aclTensor* pointers. This fixes the code that was missed in the previous refactoring commit (8981848) which changed ggml_cann_create_tensor() return type from aclTensor* to acl_tensor_ptr. * cann: format code	2025-11-18 16:41:52 +08:00
hipudding	2376b7758c	CANN: Use smart pointers to manage ACL objects (#17238 ) * CANN: Use smart pointers to manage ACL objects Previously, ACL objects were managed via manual destruction, which led to multiple memory-leak issues during runtime. This patch replaces manual memory management with smart pointers so that ACL objects are properly released and ownership is clearly defined. Note that the ownership of an ACL object belongs to the function that creates it. Other internal functions should operate on these ACL objects using raw pointers to avoid unintended ownership transfers. Additionally, since aclTensorList automatically frees its contained aclTensor objects, any aclTensor added to a tensor list must release ownership to avoid double free operations. This PR also removes the asynchronous task submission mechanism. Due to changes in recent CANN versions, tiling time has significantly decreased. Even with a dual-thread submission model, the dispatch overhead still falls on the critical path, making async submission less beneficial. Moreover, aclGraph support provides a much better path to reducing operator dispatch latency. * CANN: resolve review comments	2025-11-17 08:43:59 +08:00
TecJesh	97d5117217	CANN: Add cross_entropy_loss op support (#16886 ) * update L2_NORM op support * update L2_NORM op support * remove extra whitespace * cann: update cross_entropy_loss op support * remove trailing whitespaces * rebase the latest code in the main repository and remove the l2_norm operator that already exists in another pull request. * undo the l2_norm operator deletion	2025-11-13 09:39:51 +08:00
TecJesh	655cddd174	CANN: Add L2_NORM op support (#16856 ) * update L2_NORM op support * update L2_NORM op support * remove extra whitespace	2025-11-12 15:11:42 +08:00
Chenguang Li	3479efd112	CANN: Improve device ID handling and aclnnArange checks (#16752 ) * cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var	2025-10-28 10:54:53 +08:00
Chenguang Li	7a50cf388a	CANN: format code using .clang-format (#15863 ) This commit applies .clang-format rules to all source files under the ggml-cann directory to ensure consistent coding style and readability. The .clang-format option `SortIncludes: false` has been set to disable automatic reordering of include directives. No functional changes are introduced. Co-authored-by: hipudding <huafengchun@gmail.com>	2025-10-16 16:41:11 +08:00
Chenguang Li	56fc38b965	CANN: fix CPU memory leak in CANN backend (#16549 ) This commit fixes a CPU-side memory leak issue in the CANN backend, which occurred when intermediate aclTensorList objects were not properly released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), leading to increasing host memory usage over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.	2025-10-13 17:01:24 +08:00
hipudding	f9bc66c3eb	CANN: Update several operators to support FP16 data format (#16251 ) Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation, and then cast back to FP32 after computation, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less workload compared to FP32, leading to noticeable efficiency improvements. In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5b model shows correct accuracy and about 10% performance gain in concurrent scenarios. Co-authored-by: noemotiovon <757486878@qq.com>	2025-10-13 08:52:22 +08:00
Chenguang Li	10d8b2b6b0	CANN: Add ROPE sin/cos cache for reuse (#15912 ) * CANN: Add ROPE sin/cos cache for reuse Introduce sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match. - Added sin_cache / cos_cache pointers and position_length tracking - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox) - Accelerates ROPE by eliminating repeated sin/cos generation This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency. Co-authored-by: hipudding <huafengchun@gmail.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-10 18:42:00 +08:00
leejet	0a1b3982cd	ggml: add ops for WAN video model (cuda && cpu) (#15669 ) * add conv3d support * add ggml_pad_ext for cpu & cuda backend * cuda/cpu: add im2col_3d support * cuda: make im2col a little faster * fix cuda pad/scale/im2col3d * make im2col_3d faster * gguf: support loading tensors which n_dims > GGML_MAX_DIMS * fix cuda get_rows * avoid ggml_conv_3d conflict * correct GGML_OP_COUNT assertion * avoid build failure * avoid build failure on MacOS * cuda: remove unnecessary MIN define * fix cpu im2col_3d * adjust the code style * cuda: use simpler loop in get_rows * add test_im2col_3d to test-backend-ops * test-backend-ops.cpp: remove trailing whitespace * cpu: im2col_3d support non continuous src Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * fix test_im2col_3d * remove unused variables * cuda: get_rows: dfloat2 -> float2 * add test_pad_ext to test-backend-ops.cpp * add gguf_init_from_file_ext impl * Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS" This reverts commit d8377a0a37f314bd3713fe043b4333ad661610c1. * Revert "add gguf_init_from_file_ext impl" This reverts commit d9f1d13208c68ef83b3538201ac7f31614fb1994. * update ggml_backend_vk_device_supports_op * fix ggml_backend_vk_device_supports_op * update other backend supports op for ggml_pad_ext * metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-09-04 10:38:49 +02:00
hipudding	5421f63ab0	CANN: Fix precision issue on 310I DUO multi-devices (#15784 )	2025-09-04 15:12:30 +08:00
Chenguang Li	239b60e898	CANN: fix acl_rstd allocation size in ggml_cann_rms_norm (#15760 ) Fixes #15330 Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation. Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>	2025-09-04 11:03:02 +08:00
hipudding	f6da8cb86a	CANN: Mask unsupported TRANSPOSE_1D operator (#15733 ) CANN currently does not support kernels larger than 255. This change disables such cases.	2025-09-03 14:08:22 +08:00
Chenguang Li	8a2234ea0c	CANN: Fix type float_t to float (#15736 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-03 10:43:53 +08:00
hipudding	9961d244f2	CANN: Resolve soft_max precision issue (#15730 ) Previously, the slope tensor was set to fp16 to improve efficiency. While this worked correctly in FA, it caused precision issues in soft_max. This change applies different data types for different operators to balance both accuracy and performance.	2025-09-02 17:12:37 +08:00
hipudding	ef2af57ddf	CANN: Support ext_factor in rope (#15710 )	2025-09-02 14:05:23 +08:00
hipudding	b9382c3877	CANN: Optimize MUL_MAT_ID (#15658 )	2025-09-01 08:57:23 +08:00
hipudding	3dc7397a27	CANN: fix RoPE cache issue on multi-device (#15629 ) * CANN: fix RoPE cache issue on multi-device RoPE cache only needs to be computed once per token. However, in multi-device scenarios, not every device starts computation from layer 0, which may lead to unallocated memory issues and precision errors. This commit records the first layer of each device to avoid the above issues. * CANN: Optimize first-layer detection method * CANN: Remove trailing whitespace * CANN: Only cache the data that can be determined as unchanged through the parameters. * CANN: Update function comment	2025-09-01 08:57:00 +08:00
Chenguang Li	1e7489745a	CANN: refactor mask handling and improve performance in FA (#15561 ) * CANN(flash-attn): refactor mask handling and improve performance 1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode. 2. Optimized performance in non-alibi scenarios by reducing one repeat operation. 3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16. Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: fix review Signed-off-by: noemotiovon <757486878@qq.com> * [CANN]: Optimization FA BNSD to BSND Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-27 17:21:41 +08:00
Chenguang Li	c247d06f38	CANN: ROPE cache sin/cos repeat (#15501 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-25 10:32:21 +08:00
Chenguang Li	a0f98dd604	CANN: Optimize RMS_NORM using cache (#15419 ) * [CANN] Optimize RMS_NORM using cache Signed-off-by: noemotiovon <757486878@qq.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> * fix review comment Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-22 14:12:07 +08:00
SHUAI YANG	a6d3cfe7fa	CANN: optimize rope operator (#15335 ) * optimize rope ops * amendment * delete trailing whitespace * change the variable name	2025-08-19 21:28:22 +08:00
Chenguang Li	bbd57b7eaf	CANN: GGML_OP_CPY optimization (#15070 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-08-12 16:12:13 +08:00
hipudding	be48528b06	CANN: Add broadcast for softmax and FA (#15208 ) * refactor softmax * fix fa * fix mask shape * format * add comments * Remove whitespace	2025-08-11 22:50:31 +08:00
hipudding	11490b3672	CANN: Improve loading efficiency after converting weights to NZ format. (#14985 ) * CANN: Improve loading efficiency after converting weights to NZ format. * CANN: fix typo	2025-07-31 19:47:20 +08:00
hipudding	204f2cf168	CANN: Add ggml_set_rows (#14943 )	2025-07-29 22:36:43 +08:00
hipudding	11dd5a44eb	CANN: Implement GLU ops (#14884 ) Implement REGLU, GEGLU, SWIGLU ops according to #14158	2025-07-26 17:56:18 +08:00
chen fan	14c28dfc50	CANN: weight format to NZ for Ascend310P3 (#14407 ) * weight format to nz for 310p * remove quant weight format to nz * clean code * fix * make the conditions for converting weights to NZ format consistent * clean code	2025-07-23 11:58:00 +08:00
luyhcsu	499a8f5a78	CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002 ) Co-authored-by: luyuhong <luyuhong@kylinos.cn>	2025-07-04 11:50:07 +08:00
Chenguang Li	343b6e94b6	CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411 ) * [CANN]update to aclnnGroupedMatmulV2 Signed-off-by: noemotiovon <757486878@qq.com> * Support MUL_MAT_ID on 310p Signed-off-by: noemotiovon <757486878@qq.com> * fix editorconfig Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-07-01 16:47:30 +08:00
Bizhao Shi	2d38b6e400	CANN: Add the basic supports of Flash Attention kernel (#13627 ) * cann: add the basic FA support * cann: update the readme * cann: update the FlashAttention with PSEShift * cann: update the input parameters in FA * cann: update the alibi with max_bias * cann: add the constrints of softcap * cann: update the docs CANN.md * cann: update the docs CANN.md * cann: fix typo of CANN.md * cann: add some comments and update the CANN.md * cann: update the CANN.md * cann: update the inner precise for fusedInferAttention * cann: update the constraints of flash_attn_ext on ggml-cann.cpp * cann: clean the whitespace * cann: clean the whitespace * cann: add a new endline	2025-05-26 10:20:18 +08:00
Chenguang Li	faaaff5f94	CANN: Support MUL_MAT_ID for q8_0 and q4_0 (#13705 ) * [CANN]Support MUL_MAT_ID Q8 && Q4 Signed-off-by: noemotiovon <757486878@qq.com> * codestyle adjustment Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-23 16:47:53 +08:00
Chenguang Li	33d7aed4a8	CANN: Support MOE Model MUL_MAT_ID (#13042 ) Signed-off-by: noemotiovon <757486878@qq.com>	2025-05-19 14:21:17 +08:00

1 2

73 Commits