llama.cpp

Author	SHA1	Message	Date
Michael Wand	fc2b0053ff	ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196 )	2026-04-29 06:47:42 +08:00
lnigam	7b8443ac78	ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286 ) * ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32) Adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to ncols2=32 to support GQA ratio 32 only. * Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32 * Apply suggestions from code review Address review comments Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256 * Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-04-28 21:37:35 +02:00
Daniel Bevenius	5d56effdee	convert : add support for Nemotron Nano 3 Omni (#22481 ) This commit adds support for NVIDIA Nemotron Nano 3 Omni model enabling this model to be converted to GGUF.	2026-04-28 19:17:57 +02:00
Jillis ter Hove	52e5f0a5c1	common : re-arm reasoning budget after DONE on new <think> (#22323 ) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).	2026-04-28 19:15:36 +02:00
Matt Corallo	f9f33654a6	vulkan: Coalesce Q4_K/Q5_K scale loads (#21751 ) Some SPIR-V compilers (notably mesa) don't handle the current vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well. While reading three `u8`s from the 12-byte scale array should (at least on some hardware) result in loading the full 12 bytes in a single LOAD followed by whatever extraction is needed, at least the ANV Intel driver really can't practically perform this optimization. `mesa`'s unsigned upper bound logic doesn't handle tracking bounds through ternary, resulting in the `(is < 4) ? ... : is - 4` having an infinite upper bound (as it cannot prove `is - 4` doesn't underflow). While this could still be rectified if mesa looked at the array bounds, it currently doesn't and `glslc` currently emits SPIR-V that doesn't allow for this optimization anyway (though maybe it will at some point, see https://github.com/KhronosGroup/glslang/issues/4206). In mul_mat_vecq we took a different approach to loading the same fields. We read the first two bytes we needed from `scale` then took a branch before deciding whether we needed to read a third byte. In mesa this did, indeed, lead to a top-level branch with conditional loads. As such these loads ended up not being coalesced either (at least in the ANV driver) resulting in additional instructions in our hot loop. Instead, here, we go ahead and force loading the full 12 bytes and extract the bits we need from the packed-u32s instead. In mul_mat there's a few less ternaries and only one extra shift, so even on drivers that did optimize the previous loads properly the only material change should be pulling a few extra bytes into registers (which on most hardware won't cost anything anyway, though ironically on Intel it theoretically could). In mul_mat_vecq this requires a bit of extra math and may read bytes from the u32 that weren't needed, but it seems likely avoiding the branch is a win on most platforms. On Intel Xe2/mesa 26.0.4 with the optimizations from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162, for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l: * Instruction Count: 2753 -> 2688 * SEND Count: 269 -> 261 * Cycle Count: 273976 -> 266138 * Max live registers: 248 -> 246 * Non SSA regs after NIR: 381 -> 382 for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l: * Instruction Count: 2767 -> 2702 * SEND Count: 271 -> 263 * Cycle Count: 274140 -> 268144 * Max live registers: 248 -> 246 * Non SSA regs after NIR: 381 -> 382 for shader mul_mat_vec_id_q4_k_q8_1_f32: * Instruction Count: 1930 -> 1646 * SEND Count: 116 -> 71 * Cycle Count: 1348306 -> 843350 * Max live registers: 78 -> 84 * Non SSA regs after NIR: 300 -> 135 for shader mul_mat_vec_id_q5_k_q8_1_f32: * Instruction Count: 2207 -> 1922 * SEND Count: 131 -> 86 * Cycle Count: 1392012 -> 1037836 * Max live registers: 90 -> 90 * Non SSA regs after NIR: 300 -> 135 for shader mul_mat_vec_q4_k_q8_1_f32: * Instruction Count: 2029 -> 1749 * SEND Count: 111 -> 66 * Cycle Count: 1347278 -> 840118 * Max live registers: 74 -> 80 * Non SSA regs after NIR: 299 -> 134 for shader mul_mat_vec_q5_k_q8_1_f32: * Instruction Count: 2307 -> 2022 * SEND Count: 126 -> 81 * Cycle Count: 1379820 -> 954042 * Max live registers: 86 -> 86 * Non SSA regs after NIR: 299 -> 134 On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL: * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%) * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%) * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%) On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S: * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%) * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%) * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%) On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with -sm layer (note that -sm tensor improvements will naturally be less): * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%) * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%) * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)	2026-04-28 17:31:04 +02:00
Reese Levine	98bb57916a	ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456 ) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting	2026-04-28 07:27:17 -07:00
Aleksander Grygier	f42e29fdf1	webui: Server tools (#21237 ) * wip: server_tools * feat: Integrate with `/tools` endpoint * feat: Builtin + MCP + JSON Schema Tools WIP * refactor * displayName -> display_name * snake_case everywhere * rm redundant field * feat: Improvements * chore: update webui build output * refactor: Updates after server updates * chore: update webui build output * change arg to --tools all * feat: UI improvements * chore: update webui build output * add readme mention * llama-gen-docs * chore: update webui build output * chore: update webui build output * chore: update webui build output * feat: Reorganize settings sections * feat: Separate dialogs for MCP Servers Settings and Import/Export * feat: WIP * feat: WIP * feat: WIP * feat: WIP * feat: WIP * feat: WIP * WIP on allozaur/20677-webui-server-tools * feat: UI improvements * chore: Update package lock * chore: Run `npm audit fix` * feat: UI WIP * feat: UI * refactor: Desktop Icon Strip DRY * feat: Cleaner rendering and transition for ChatScreen * feat: UI improvements * feat: UI improvement * feat: Remove MCP Server "enable" switch from Tools submenu * chore: Run `npm audit fix` * feat: WIP * feat: Logic improvements * refactor: Cleanup * refactor: DRY * test: Fix Chat Sidebar UI Tests * chore: Update package lock * refactor: Cleanup * feat: Chat Message Action Card with Continue and Permission flow implementations * feat: Add agentic steering messages, draft messages and improve chat UX * fix: Search results UI * test: Fix unit test * feat: UI/UX improvements * refactor: Simplify `useToolsPanel` access in components * feat: Implement Processing Info Context API * feat: Implement 'Go back to chat' functionality for settings * feat: Enhance MCP Server management in Chat Form Attachments * style: Minor UI and branding adjustments * chore: Update webui static build output * chore: Formatting, linting & type checks * feat: Draft messages logic * feat: UI improvements * feat: Steering Messages improvements * refactor: Cleanup * refactor: Cleanup * feat: Improve UI * refactor: Settings navigation hook * refactor: DRY code * refactor: DRY ChatMessageUser UI components * refactor: Desktop Icon Strip DRY * refactor: Tools & permissions * fix: Navigation condition * refactor: Cleanup * refactor: Cleanup * refactor: Cleanup * fix: preserve reasoning_content in agentic flow --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-04-28 14:35:49 +03:00
Jeff Bolz	19821178be	vulkan: add barrier after writetimestamp (#21865 )	2026-04-28 12:28:12 +02:00
Emil Askerov	698d19b93c	ggml: improve SPIR-V headers detection with __has_include (#21918 ) * ggml: improve SPIR-V headers detection with __has_include while preserving original _WIN32 logic * Address review comments: fix fallback logic and add FreeBSD support * Remove spirv_cross fallback as per review * Remove redundant __has_include check	2026-04-28 12:19:06 +02:00
Adrien Gallouët	50494a2800	ggml : skip already registered backends and devices (#22296 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-28 10:02:32 +03:00
Adrien Gallouët	d530d6e7a2	ggml : revert to -lm linking instead of find_library (#22355 ) * ggml : revert to -lm linking instead of find_library `find_library(MATH_LIBRARY m)` was introduced recently, but it breaks CUDA compilation with GGML_STATIC. I could not find any valid use case where we would prefer `find_library` over the standard `-lm` approach. This commit is also meant to start a discussion if there is a valid reason to keep `find_library(MATH_LIBRARY m)`, we should clarify what problem it was solving and find an alternative fix that does not break CUDA with GGML_STATIC. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : use MATH_LIBRARY only if defined Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : fix initial broken condition Signed-off-by: Adrien Gallouët <angt@huggingface.co> * ggml : always respect MATH_LIBRARY when defined Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-28 09:56:02 +03:00
hipudding	c3e08f4700	CANN: add new ops, optimize existing ops (#21204 ) New operators: - GGML_OP_SET: implement via aclnnInplaceCopy on target region - GGML_OP_CUMSUM: implement via aclnnCumsum - GGML_OP_FILL: implement via aclnnInplaceFillScalar - GGML_OP_DIAG: implement via aclnnInplaceCopy on diagonal strides - GGML_OP_TRI (lower/lower_diag/upper_diag/upper): implement via aclnnTril(-1/0) and aclnnTriu(0/1) with appropriate diagonal offsets - GGML_OP_SOLVE_TRI: implement via aclnnTriangularSolve - GGML_UNARY_OP_SOFTPLUS: implement via aclnnSoftplus Optimizations: - GLU (SwiGLU/GeGLU/GeGLU_ERF/GeGLU_QUICK): fuse with aclnnSwiGlu / aclnnGeGluV3 when applicable; fallback conditions now checked inside each function rather than at the call site - CROSS_ENTROPY_LOSS: replace 5-kernel sequence (LogSoftmax→Mul→ ReduceSum×2→Muls) with single aclnnSoftmaxCrossEntropyWithLogits call - L2_NORM: fix in-place ClampMin on norm result (was clamping wrong tensor); add eps clamping before division to avoid divide-by-zero - PAD_REFLECT_1D: eliminate per-ne[3] loop; assert contiguity and call ReflectionPad1d once on the full 4-D view; remove redundant nb copies - GET_ROWS: replace IndexSelect with GatherV2 per batch slice; refactor helper into gather_batched lambda with batch loop inlined - SET_ROWS: replace IndexCopy with InplaceIndexCopy per batch slice; refactor helper into scatter_batched lambda with batch loop inlined - OUT_PROD: replace O(ne[3]ne[2]ne[1]) Ger+InplaceAdd loop with per-slice Matmul loop (src0 @ src1^T); handles strided-broadcast batch dims where ne02/ne03 may differ from ne2/ne3 - backend memset_tensor: implement via aclrtMemset (was NULL) Bug fixes: - COUNT_EQUAL: use non-inplace EqTensor into a same-type temporary buffer instead of InplaceEqTensor, avoiding corruption of src0 - ACL graph cache (USE_ACL_GRAPH): restore node_type and src_type[] fields in ggml_graph_node_properties; has_matching_properties() was missing type checks, causing F16 and BF16 tensors (same nb[0]=2) to incorrectly share cached graphs and produce wrong results (ERR≈679) - graph cache op_params matching: compare full GGML_MAX_OP_PARAMS bytes so that ops differing only in parameters are not incorrectly replayed from cache	2026-04-28 09:27:22 +03:00
Georgi Gerganov	14e733e36f	spec : refactor params (#22397 ) * spec : refactor params * cont : fix * cont : rename "sparam" to "sampling" * cont : add spec params category * cont : add info about removed arguments * cont : skip param length check for spec params * cont : adapt server tests	2026-04-28 09:07:33 +03:00
Aman Gupta	516e8d7a8a	server: use pos_next instead of n_tokens for m-rope (#22439 )	2026-04-28 08:41:00 +03:00
Rithik Sharma	434b2a1ff6	ggml-webgpu: add Q1_0 support (#22374 ) * add fast matmul matvec q1_0 kernel * ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init	2026-04-27 15:50:59 -07:00
tha80	983ca8992e	server: (router) Forward form-data to model server (Fixes #22044 ) (#22118 ) * This commit enables the router to forward form-data to model server. Fixes #22044 (enabling to use the /v1/audio/transcriptions in router mode) * * Applied the suggestion from Copilots first comment: using the non-throwing json::parse overload. * Addressed Copilots third comment by extending the files representation to also include filename and content-type * Addressed Copilots fourth comment by making the RNG thread_local * Changed variable body from std::string to std::ostringstream in build_multipart_body as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053 * Added sanitize_field lambda in build_multipart_body for key, filename and content_type as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647 * explicitly checking if value/item is string before calling value/item.get<std::string>() as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279 * Added double quote to the sanitize lambda and throw on json parse failure --------- Co-authored-by: Ralph Paßgang <ralph@trust-it.de>	2026-04-27 23:55:00 +02:00
Rithik Sharma	665abc6097	add fast mat-vec kernels for i-quants (#22344 )	2026-04-27 08:25:45 -07:00
Igor Rudenko	4414c04b9a	Additional test for common/gemma4 : handle parsing edge cases (#22420 ) * Additional test for common/gemma4 : handle parsing edge cases * Move tests to Gemma 4 test group	2026-04-27 16:36:59 +02:00
unraido	ceaf47c4b1	fix: rpc-server cache may not work in Windows environments (#22394 ) * fix: create directory and log cache file name. * Remove GGML_LOG_INFO conditional compilation. --------- Co-authored-by: kotaro <kotaro.kusunoki@gmail.com>	2026-04-27 17:25:09 +03:00
rankaiyx	42401c72b8	Fix type casting for unaccounted memory calculation (#22424 )	2026-04-27 14:31:13 +02:00
Georgi Gerganov	e940b3d468	download : prefer q8_0 when q4_k not available (#22428 )	2026-04-27 14:30:29 +02:00
ynankani	0f1bb602dd	model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421 ) Signed-off-by: Yash Nankani <ynankani@nvidia.com>	2026-04-27 09:58:48 +02:00
Sigbjørn Skjæret	d13540becd	convert : remove input_scale for dequantized fp8 modelopt (#22356 )	2026-04-27 08:45:01 +02:00
Adrien Gallouët	f84270ea10	ggml : use 64 bytes aligned tile buffers (#21058 ) \| Model \| Test \| t/s OLD \| t/s NEW \| Speedup \| \|:---------------------------------\|:-------\|----------:\|----------:\|----------:\| \| qwen35 0.8B BF16 \| pp512 \| 584.59 \| 595.41 \| 1.02 \| \| qwen35 0.8B BF16 \| tg128 \| 52.23 \| 52.82 \| 1.01 \| \| qwen35 0.8B IQ2_M - 2.7 bpw \| pp512 \| 260.64 \| 261.70 \| 1.00 \| \| qwen35 0.8B IQ2_M - 2.7 bpw \| tg128 \| 81.17 \| 80.89 \| 1.00 \| \| qwen35 0.8B IQ2_XXS - 2.0625 bpw \| pp512 \| 302.36 \| 302.56 \| 1.00 \| \| qwen35 0.8B IQ2_XXS - 2.0625 bpw \| tg128 \| 84.93 \| 85.12 \| 1.00 \| \| qwen35 0.8B IQ3_XXS - 3.0625 bpw \| pp512 \| 263.22 \| 260.01 \| 0.99 \| \| qwen35 0.8B IQ3_XXS - 3.0625 bpw \| tg128 \| 80.29 \| 78.94 \| 0.98 \| \| qwen35 0.8B IQ4_NL - 4.5 bpw \| pp512 \| 728.65 \| 742.09 \| 1.02 \| \| qwen35 0.8B IQ4_NL - 4.5 bpw \| tg128 \| 82.39 \| 84.46 \| 1.03 \| \| qwen35 0.8B IQ4_XS - 4.25 bpw \| pp512 \| 681.33 \| 677.06 \| 0.99 \| \| qwen35 0.8B IQ4_XS - 4.25 bpw \| tg128 \| 80.18 \| 79.28 \| 0.99 \| \| qwen35 0.8B Q2_K_M \| pp512 \| 413.28 \| 415.94 \| 1.01 \| \| qwen35 0.8B Q2_K_M \| tg128 \| 81.90 \| 82.78 \| 1.01 \| \| qwen35 0.8B Q3_K_M \| pp512 \| 493.17 \| 495.08 \| 1.00 \| \| qwen35 0.8B Q3_K_M \| tg128 \| 82.75 \| 83.23 \| 1.01 \| \| qwen35 0.8B Q3_K_S \| pp512 \| 429.35 \| 427.64 \| 1.00 \| \| qwen35 0.8B Q3_K_S \| tg128 \| 86.69 \| 87.02 \| 1.00 \| \| qwen35 0.8B Q4_0 \| pp512 \| 783.46 \| 782.32 \| 1.00 \| \| qwen35 0.8B Q4_0 \| tg128 \| 88.23 \| 87.90 \| 1.00 \| \| qwen35 0.8B Q4_1 \| pp512 \| 741.71 \| 729.76 \| 0.98 \| \| qwen35 0.8B Q4_1 \| tg128 \| 85.44 \| 86.01 \| 1.01 \| \| qwen35 0.8B Q4_K_M \| pp512 \| 676.24 \| 681.31 \| 1.01 \| \| qwen35 0.8B Q4_K_M \| tg128 \| 76.59 \| 77.06 \| 1.01 \| \| qwen35 0.8B Q4_K_S \| pp512 \| 683.12 \| 688.81 \| 1.01 \| \| qwen35 0.8B Q4_K_S \| tg128 \| 80.50 \| 81.19 \| 1.01 \| \| qwen35 0.8B Q5_K_M \| pp512 \| 635.33 \| 642.11 \| 1.01 \| \| qwen35 0.8B Q5_K_M \| tg128 \| 72.07 \| 72.49 \| 1.01 \| \| qwen35 0.8B Q5_K_S \| pp512 \| 660.95 \| 658.18 \| 1.00 \| \| qwen35 0.8B Q5_K_S \| tg128 \| 72.19 \| 72.95 \| 1.01 \| \| qwen35 0.8B Q6_K \| pp512 \| 647.97 \| 638.84 \| 0.99 \| \| qwen35 0.8B Q6_K \| tg128 \| 72.83 \| 72.49 \| 1.00 \| \| qwen35 0.8B Q8_0 \| pp512 \| 805.01 \| 785.49 \| 0.98 \| \| qwen35 0.8B Q8_0 \| tg128 \| 70.10 \| 70.13 \| 1.00 \| Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-27 09:30:55 +03:00
Max Krasnyansky	5594d13224	common: fix missing exports in llama-common (#22340 ) * common: refactor common/debug to move abort_on_nan into base_callback_data Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO. It should just be a member of the base_callback_data instead. * cont : cleanup * common : use pimpl in debug.h to reduce header dependencies Move common_debug_cb_user_data's data members (std::regex, std::vector<uint8_t>) into a private impl struct in debug.cpp. This removes the includes of common.h and <regex> from debug.h, reducing transitive dependencies for any translation unit that includes the header. Assisted-by: llama.cpp:local pi --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-27 08:06:39 +03:00
Georgi Gerganov	f535774325	pr2wt : symlink .pi (#22386 )	2026-04-26 19:49:26 +03:00
Rithik Sharma	06a811d085	add performance-portable tuning for register-tile and subgroup matmul (#22241 )	2026-04-26 09:26:28 -07:00
Gaurav Garg	78433f606f	Fix recurrent state serialization for partial reads and writes (#22362 ) The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.	2026-04-26 13:34:40 +02:00
Johannes Gäßler	7ec36aa861	Github: set meta backend code owner (#22388 )	2026-04-26 13:34:13 +02:00
Oliver Simons	b1a5bd4e0c	CUDA: better coalesce data-access for contiguous concat (#22330 ) Also, distribute all elements across CTAs evenly instead of launching one CTA per dim	2026-04-26 09:21:45 +02:00
Sigbjørn Skjæret	0c6ee1cade	ggml-cpu : re-enable fast gelu_quick_f16 (#22339 )	2026-04-26 09:28:14 +03:00
Eve	2dd84169d1	ggml-cpu: optimize avx2 q6_k (#22345 )	2026-04-26 09:27:50 +03:00
lhez	f454bd7eb8	opencl: add iq4_nl support (#22272 ) * opencl: add general support for iq4_nl * opencl: add iq4_nl gemm/gemv for adreno * opencl: pack 2 lut entries into a uint	2026-04-25 21:21:58 -07:00
Trivikram Reddy	b760272f1a	hexagon: guard HMX clock request for v75+ platforms (#22377 )	2026-04-25 17:58:26 -07:00
Piotr Wilkin (ilintar)	dcad77cc3b	chat: fix handling of space in reasoning markers (#22353 ) * chat: fix handling of space in reasoning markers * fix tests * whitespace	2026-04-25 21:24:13 +02:00
Georgi Gerganov	98dc1418ea	spec : fix vocab compat checks (#22358 )	2026-04-25 20:11:35 +03:00
Johannes Gäßler	9725a313be	CUDA: reduce MMQ stream-k overhead (#22298 ) * CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc	2026-04-25 14:15:03 +02:00
Developer-Ecosystem-Engineering	d1649047a3	metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962 ) * Optimize Metal Tensor API usage for matmul2d Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure. * cont : cleanup * cont : cleanup * cont : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-25 15:14:28 +03:00
ddh0	9d34231bb8	llama-quant : default ftype param `Q5_1` --> `Q8_0` (#20828 ) Change the default `ftype` in `llama_model_quantize_params` from `LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`. In case some external program naively uses the default quantization params, we should probably default to a known-good type like Q8_0 rather than Q5_1, which is rather old.	2026-04-25 09:25:35 +03:00
Georgi Gerganov	8ea8fee966	gitignore : add .pi + personal SYSTEM.md (#22316 ) * gitignore : add .pi + personal SYSTEM.md * cont : fix requirements heading in PR template * cont : shorten line	2026-04-25 09:20:45 +03:00
Neo Zhang	eddd7a13a5	[SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291 ) * opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter	2026-04-25 09:20:14 +03:00
Reese Levine	dd2914dc81	ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327 ) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support	2026-04-25 09:18:15 +03:00
Piotr Wilkin (ilintar)	0adede866d	parser: fix structured output bug (#22302 ) * fix very stupid structured output bug * Things just cannot be too easy.	2026-04-24 23:19:55 +02:00
Trivikram Reddy	361fe72acb	Hexagon: Bump HMX Frequency to Max Corner (#22334 ) * hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg	2026-04-24 13:55:17 -07:00
Shreya Jain	a702f39597	CI Snapdragon: Switch ubuntu-latest to ubuntu-slim runner (#22303 ) * switch ubuntu-latest to ubuntu-slim * Fix the path for upload so CI doesn't fail * Update .github/workflows/build-and-test-snapdragon.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use -slim image for key check and consistent naming for artifact dir Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com> * Remove check-secret extra job * move QDC key check for Run QDC jobs step specifically * add a step before to check the secret for qdc jobs --------- Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-24 21:21:36 +02:00
Zheyuan Chen	13d36cf891	ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199 ) * ggml-webgpu: add tile flash attention fallback * ggml-webgpu: add new fields and discard usage of mnk for tile version * ggml-webgpu: modify the vec path to discard the mnk parameter * ggml-webgpu: enable flash attention vec and tile version for broswer * ggml-webgpu: stagging KV for flash attention tile version * formatting * turn on subgroup uniformity check * remove Q_TILE as it is always 1 for vec path * make row_max and exp_sum to local register * make different bindings with same underlying buffer to have the same usage flags * move path selection into the shader library and have the host consume a single flash-attn decision object. * turn off skip_validation and address buffer overlapping when nwg==1 * formatting * merge binding when kv overlap	2026-04-24 10:39:09 -07:00
Mengsheng Wu	f65bc34c68	hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (#22306 )	2026-04-24 09:21:33 -07:00
Georgi Gerganov	15fa3c493b	metal : print GPU description (#22318 )	2026-04-24 13:56:03 +03:00
Adrien Gallouët	dc80c5252a	common : fix jinja warnings with clang 21 (#22313 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-24 12:36:02 +02:00
Georgi Gerganov	e583f3b4f5	ggml : minor coding style (#22308 )	2026-04-24 11:02:00 +03:00

1 2 3 4 5 ...

8967 Commits