llama.cpp

Author	SHA1	Message	Date
Aman Gupta	516e8d7a8a	server: use pos_next instead of n_tokens for m-rope (#22439 )	2026-04-28 08:41:00 +03:00
Rithik Sharma	434b2a1ff6	ggml-webgpu: add Q1_0 support (#22374 ) * add fast matmul matvec q1_0 kernel * ggml-webgpu: drop redundant zero-fills in Q1_0 shmem init	2026-04-27 15:50:59 -07:00
tha80	983ca8992e	server: (router) Forward form-data to model server (Fixes #22044 ) (#22118 ) * This commit enables the router to forward form-data to model server. Fixes #22044 (enabling to use the /v1/audio/transcriptions in router mode) * * Applied the suggestion from Copilots first comment: using the non-throwing json::parse overload. * Addressed Copilots third comment by extending the files representation to also include filename and content-type * Addressed Copilots fourth comment by making the RNG thread_local * Changed variable body from std::string to std::ostringstream in build_multipart_body as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053 * Added sanitize_field lambda in build_multipart_body for key, filename and content_type as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647 * explicitly checking if value/item is string before calling value/item.get<std::string>() as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279 * Added double quote to the sanitize lambda and throw on json parse failure --------- Co-authored-by: Ralph Paßgang <ralph@trust-it.de>	2026-04-27 23:55:00 +02:00
Rithik Sharma	665abc6097	add fast mat-vec kernels for i-quants (#22344 )	2026-04-27 08:25:45 -07:00
Igor Rudenko	4414c04b9a	Additional test for common/gemma4 : handle parsing edge cases (#22420 ) * Additional test for common/gemma4 : handle parsing edge cases * Move tests to Gemma 4 test group	2026-04-27 16:36:59 +02:00
unraido	ceaf47c4b1	fix: rpc-server cache may not work in Windows environments (#22394 ) * fix: create directory and log cache file name. * Remove GGML_LOG_INFO conditional compilation. --------- Co-authored-by: kotaro <kotaro.kusunoki@gmail.com>	2026-04-27 17:25:09 +03:00
rankaiyx	42401c72b8	Fix type casting for unaccounted memory calculation (#22424 )	2026-04-27 14:31:13 +02:00
Georgi Gerganov	e940b3d468	download : prefer q8_0 when q4_k not available (#22428 )	2026-04-27 14:30:29 +02:00
ynankani	0f1bb602dd	model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421 ) Signed-off-by: Yash Nankani <ynankani@nvidia.com>	2026-04-27 09:58:48 +02:00
Sigbjørn Skjæret	d13540becd	convert : remove input_scale for dequantized fp8 modelopt (#22356 )	2026-04-27 08:45:01 +02:00
Adrien Gallouët	f84270ea10	ggml : use 64 bytes aligned tile buffers (#21058 ) \| Model \| Test \| t/s OLD \| t/s NEW \| Speedup \| \|:---------------------------------\|:-------\|----------:\|----------:\|----------:\| \| qwen35 0.8B BF16 \| pp512 \| 584.59 \| 595.41 \| 1.02 \| \| qwen35 0.8B BF16 \| tg128 \| 52.23 \| 52.82 \| 1.01 \| \| qwen35 0.8B IQ2_M - 2.7 bpw \| pp512 \| 260.64 \| 261.70 \| 1.00 \| \| qwen35 0.8B IQ2_M - 2.7 bpw \| tg128 \| 81.17 \| 80.89 \| 1.00 \| \| qwen35 0.8B IQ2_XXS - 2.0625 bpw \| pp512 \| 302.36 \| 302.56 \| 1.00 \| \| qwen35 0.8B IQ2_XXS - 2.0625 bpw \| tg128 \| 84.93 \| 85.12 \| 1.00 \| \| qwen35 0.8B IQ3_XXS - 3.0625 bpw \| pp512 \| 263.22 \| 260.01 \| 0.99 \| \| qwen35 0.8B IQ3_XXS - 3.0625 bpw \| tg128 \| 80.29 \| 78.94 \| 0.98 \| \| qwen35 0.8B IQ4_NL - 4.5 bpw \| pp512 \| 728.65 \| 742.09 \| 1.02 \| \| qwen35 0.8B IQ4_NL - 4.5 bpw \| tg128 \| 82.39 \| 84.46 \| 1.03 \| \| qwen35 0.8B IQ4_XS - 4.25 bpw \| pp512 \| 681.33 \| 677.06 \| 0.99 \| \| qwen35 0.8B IQ4_XS - 4.25 bpw \| tg128 \| 80.18 \| 79.28 \| 0.99 \| \| qwen35 0.8B Q2_K_M \| pp512 \| 413.28 \| 415.94 \| 1.01 \| \| qwen35 0.8B Q2_K_M \| tg128 \| 81.90 \| 82.78 \| 1.01 \| \| qwen35 0.8B Q3_K_M \| pp512 \| 493.17 \| 495.08 \| 1.00 \| \| qwen35 0.8B Q3_K_M \| tg128 \| 82.75 \| 83.23 \| 1.01 \| \| qwen35 0.8B Q3_K_S \| pp512 \| 429.35 \| 427.64 \| 1.00 \| \| qwen35 0.8B Q3_K_S \| tg128 \| 86.69 \| 87.02 \| 1.00 \| \| qwen35 0.8B Q4_0 \| pp512 \| 783.46 \| 782.32 \| 1.00 \| \| qwen35 0.8B Q4_0 \| tg128 \| 88.23 \| 87.90 \| 1.00 \| \| qwen35 0.8B Q4_1 \| pp512 \| 741.71 \| 729.76 \| 0.98 \| \| qwen35 0.8B Q4_1 \| tg128 \| 85.44 \| 86.01 \| 1.01 \| \| qwen35 0.8B Q4_K_M \| pp512 \| 676.24 \| 681.31 \| 1.01 \| \| qwen35 0.8B Q4_K_M \| tg128 \| 76.59 \| 77.06 \| 1.01 \| \| qwen35 0.8B Q4_K_S \| pp512 \| 683.12 \| 688.81 \| 1.01 \| \| qwen35 0.8B Q4_K_S \| tg128 \| 80.50 \| 81.19 \| 1.01 \| \| qwen35 0.8B Q5_K_M \| pp512 \| 635.33 \| 642.11 \| 1.01 \| \| qwen35 0.8B Q5_K_M \| tg128 \| 72.07 \| 72.49 \| 1.01 \| \| qwen35 0.8B Q5_K_S \| pp512 \| 660.95 \| 658.18 \| 1.00 \| \| qwen35 0.8B Q5_K_S \| tg128 \| 72.19 \| 72.95 \| 1.01 \| \| qwen35 0.8B Q6_K \| pp512 \| 647.97 \| 638.84 \| 0.99 \| \| qwen35 0.8B Q6_K \| tg128 \| 72.83 \| 72.49 \| 1.00 \| \| qwen35 0.8B Q8_0 \| pp512 \| 805.01 \| 785.49 \| 0.98 \| \| qwen35 0.8B Q8_0 \| tg128 \| 70.10 \| 70.13 \| 1.00 \| Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-27 09:30:55 +03:00
Max Krasnyansky	5594d13224	common: fix missing exports in llama-common (#22340 ) * common: refactor common/debug to move abort_on_nan into base_callback_data Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO. It should just be a member of the base_callback_data instead. * cont : cleanup * common : use pimpl in debug.h to reduce header dependencies Move common_debug_cb_user_data's data members (std::regex, std::vector<uint8_t>) into a private impl struct in debug.cpp. This removes the includes of common.h and <regex> from debug.h, reducing transitive dependencies for any translation unit that includes the header. Assisted-by: llama.cpp:local pi --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-27 08:06:39 +03:00
Georgi Gerganov	f535774325	pr2wt : symlink .pi (#22386 )	2026-04-26 19:49:26 +03:00
Rithik Sharma	06a811d085	add performance-portable tuning for register-tile and subgroup matmul (#22241 )	2026-04-26 09:26:28 -07:00
Gaurav Garg	78433f606f	Fix recurrent state serialization for partial reads and writes (#22362 ) The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.	2026-04-26 13:34:40 +02:00
Johannes Gäßler	7ec36aa861	Github: set meta backend code owner (#22388 )	2026-04-26 13:34:13 +02:00
Oliver Simons	b1a5bd4e0c	CUDA: better coalesce data-access for contiguous concat (#22330 ) Also, distribute all elements across CTAs evenly instead of launching one CTA per dim	2026-04-26 09:21:45 +02:00
Sigbjørn Skjæret	0c6ee1cade	ggml-cpu : re-enable fast gelu_quick_f16 (#22339 )	2026-04-26 09:28:14 +03:00
Eve	2dd84169d1	ggml-cpu: optimize avx2 q6_k (#22345 )	2026-04-26 09:27:50 +03:00
lhez	f454bd7eb8	opencl: add iq4_nl support (#22272 ) * opencl: add general support for iq4_nl * opencl: add iq4_nl gemm/gemv for adreno * opencl: pack 2 lut entries into a uint	2026-04-25 21:21:58 -07:00
Trivikram Reddy	b760272f1a	hexagon: guard HMX clock request for v75+ platforms (#22377 )	2026-04-25 17:58:26 -07:00
Piotr Wilkin (ilintar)	dcad77cc3b	chat: fix handling of space in reasoning markers (#22353 ) * chat: fix handling of space in reasoning markers * fix tests * whitespace	2026-04-25 21:24:13 +02:00
Georgi Gerganov	98dc1418ea	spec : fix vocab compat checks (#22358 )	2026-04-25 20:11:35 +03:00
Johannes Gäßler	9725a313be	CUDA: reduce MMQ stream-k overhead (#22298 ) * CUDA: reduce MMQ stream-k overhead * use 32 bit integers for kbc	2026-04-25 14:15:03 +02:00
Developer-Ecosystem-Engineering	d1649047a3	metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962 ) * Optimize Metal Tensor API usage for matmul2d Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR. The legacy simdgroup_matrix kernel is preserved under #else. Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure. * cont : cleanup * cont : cleanup * cont : cleanup --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-25 15:14:28 +03:00
ddh0	9d34231bb8	llama-quant : default ftype param `Q5_1` --> `Q8_0` (#20828 ) Change the default `ftype` in `llama_model_quantize_params` from `LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`. In case some external program naively uses the default quantization params, we should probably default to a known-good type like Q8_0 rather than Q5_1, which is rather old.	2026-04-25 09:25:35 +03:00
Georgi Gerganov	8ea8fee966	gitignore : add .pi + personal SYSTEM.md (#22316 ) * gitignore : add .pi + personal SYSTEM.md * cont : fix requirements heading in PR template * cont : shorten line	2026-04-25 09:20:45 +03:00
Neo Zhang	eddd7a13a5	[SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291 ) * opt arc770 for Q4_0 * add for Q4_0 * update the script * add help script for windows * update guide * fix format issue * convert from dos to unix for format issue * fix missed -sm parameter	2026-04-25 09:20:14 +03:00
Reese Levine	dd2914dc81	ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327 ) * Implement ssm_scan * Remove blocking in graph_compute and check for set rows * Fix bindings * Update op support	2026-04-25 09:18:15 +03:00
Piotr Wilkin (ilintar)	0adede866d	parser: fix structured output bug (#22302 ) * fix very stupid structured output bug * Things just cannot be too easy.	2026-04-24 23:19:55 +02:00
Trivikram Reddy	361fe72acb	Hexagon: Bump HMX Frequency to Max Corner (#22334 ) * hexagon: bump HMX freq to max corner * hex-mm: fix error in log msg	2026-04-24 13:55:17 -07:00
Shreya Jain	a702f39597	CI Snapdragon: Switch ubuntu-latest to ubuntu-slim runner (#22303 ) * switch ubuntu-latest to ubuntu-slim * Fix the path for upload so CI doesn't fail * Update .github/workflows/build-and-test-snapdragon.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use -slim image for key check and consistent naming for artifact dir Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com> * Remove check-secret extra job * move QDC key check for Run QDC jobs step specifically * add a step before to check the secret for qdc jobs --------- Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-24 21:21:36 +02:00
Zheyuan Chen	13d36cf891	ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199 ) * ggml-webgpu: add tile flash attention fallback * ggml-webgpu: add new fields and discard usage of mnk for tile version * ggml-webgpu: modify the vec path to discard the mnk parameter * ggml-webgpu: enable flash attention vec and tile version for browser * ggml-webgpu: staging KV for flash attention tile version * formatting * turn on subgroup uniformity check * remove Q_TILE as it is always 1 for vec path * make row_max and exp_sum to local register * make different bindings with same underlying buffer to have the same usage flags * move path selection into the shader library and have the host consume a single flash-attn decision object. * turn off skip_validation and address buffer overlapping when nwg==1 * formatting * merge binding when kv overlap	2026-04-24 10:39:09 -07:00
Mengsheng Wu	f65bc34c68	hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (#22306 )	2026-04-24 09:21:33 -07:00
Georgi Gerganov	15fa3c493b	metal : print GPU description (#22318 )	2026-04-24 13:56:03 +03:00
Adrien Gallouët	dc80c5252a	common : fix jinja warnings with clang 21 (#22313 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-24 12:36:02 +02:00
Georgi Gerganov	e583f3b4f5	ggml : minor coding style (#22308 )	2026-04-24 11:02:00 +03:00
Georgi Gerganov	017f090442	jinja : remove unused header (#22310 )	2026-04-24 11:01:46 +03:00
Georgi Gerganov	ffdd983fb8	server : fix swa-full logic (#22288 )	2026-04-24 10:17:37 +03:00
Yes You Can Have Your Own	793d0a7931	server: rename debug tags to match --cache-idle-slots naming (#22292 )	2026-04-24 09:28:44 +03:00
Mengsheng Wu	8bc492ebb4	hexagon: add SOLVE_TRI op (#21974 ) * hexagon: add SOLVE_TRI op * ggml: fix TODO description for solve_tri * hexagon: rm unused variable/function warnings * hexagon: chunk vs batch processing for better thread utilization * hexagon: vectorize partial f32 loads * hexagon: move HVX f32 add/sub/mul wrappers to hvx-base.h --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>	2026-04-23 18:39:13 -07:00
Chen Yuan	e5f070a1dc	fix(shader): handle the buffer aliasing for rms fuse (#22266 )	2026-04-23 16:32:59 -07:00
Ethan Turner	fa0b8a70a8	cli: Remove redundant local sampling variables (#20429 ) (#22264 ) This change implements the third requested change in issue 20429. Because defaults.sampling contains the reasoning budget token count and the reasoning budget message, it's not necessary to assign them to struct variables.	2026-04-24 00:53:23 +02:00
Max Krasnyansky	5d2b52d80d	hexagon: add support for basic and extended Op profiling (#22269 ) * hexagon: restore HTP_OPMASK_QUEUE * hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul * hex-prof: restore op profiling * hex-prof: enable PMU * hexagon: simplify and improve op-queuing with full profiling support Add separate profile descriptors. * hexagon: remove opsync and rename opmask into opstage opsync is no longer needed since the profiler is fully async now. opmask name was confusing and opstage is more accurate. * hexagon: refactor opbatch queue handling * hexagon: add iface hooks for enabling profiler from the host Also move all the PMU setup stuff out of the hex-utils since it's not intended for normal use. * hexagon: make profiler mode configurable On older devices getting PMU counters is expensive so it's now optional. * hexagon: add support for setting profiler pmu events from env * hexagon: simplify profiler output (no need to print buffs, etc) * hexagon: simplify pmu counter formatting * hexagon: add a simple profile post-proc tool * hex-prof: add support for reading logs from stdin * hexagon: document GGML_HEXAGON_PROFILE * hex-prof: update default width for dims field * hex-prof: fix linter warnings and errors * Update ggml/src/ggml-hexagon/htp/htp-ops.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/snapdragon/ggml-hexagon-profile.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-23 14:17:21 -07:00
Shreya Jain	187a456370	Enable testing on Snapdragon devices (#21051 ) * Add the tests that we want to run on external CI * remove extra files * Fixes python issues, remove the deadlock on CI * remove unnecessary changes * use override to ty.toml * fix pre-commit and try tests with secret in external repo not upstream * skip if key is unavailable * Fix feedback * switch hexagon to snapdragon * cleanup * fix secrets * remove the copyrights at the top of the files	2026-04-23 13:08:10 -07:00
srkizer	185cbff6f1	server : convert_anthropic_to_oai: also copy chat_template_kwargs (#22154 )	2026-04-23 13:32:46 -05:00
Song Li	c78fb909b2	server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) (#22267 ) * server: clamp n_discard to non-negative at JSON parse boundary (CVE-2026-21869) A negative n_discard from client JSON causes heap-buffer-overflow in update_slots() context-shift loop (CWE-787, CVSS 8.8). Clamp to 0 at ingress; n_discard=0 already triggers auto-discard (n_left/2). Ref: GHSA-8947-pfff-2f3c * cont : cleaner * cont : cleanerer * cont : cleanest --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-23 18:39:07 +02:00
Adrien Gallouët	12568ca8c8	vendor : update LibreSSL to 4.3.1 (#22285 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-04-23 17:45:56 +02:00
kvc0	c807c6e3b0	server: (anthropic API) fix prefix caching (#21793 ) When testing claude code against llama.cpp, I noticed that only n_past 18577 was used even when context was 60k or more. The log in llama-server says: ``` slot update_slots: id 3 \| task 10342 \| old: ... ; cch= \| defa0;You are slot update_slots: id 3 \| task 10342 \| new: ... ; cch= \| 1c8b4; ``` I observed that the cch value changed every time. Reading about that, the x-anthropic-billing-header system message seems to be specially handled inside of the anthropic api. I could remove it, but there is a meaningful string sometimes included at the end. So instead, I just replace the changing cch checksum with fffff. I'm treating this as an anthropic message body API detail - I think this is the right way to do this, but by all means please correct me! It's always 5 hexadecimal characters, but I've written the replacement defensively in case they change the protocol.	2026-04-23 17:45:02 +02:00
Sigbjørn Skjæret	0949beb5a3	fix build number for sycl release (#22283 )	2026-04-23 21:38:58 +08:00
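
Note on c78fb909b2 (CVE-2026-21869): the commit message says a negative n_discard from client JSON is clamped to 0 at the parse boundary, and that n_discard=0 already falls back to the automatic n_left/2 discard. A minimal sketch of that idea follows. The function names sanitize_n_discard and effective_n_discard are hypothetical and are not taken from the server code; only the clamp-to-zero and n_left/2 behaviour come from the commit message.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical helper: clamp a client-supplied n_discard at ingress so that
// no negative value ever reaches the context-shift logic.
inline int32_t sanitize_n_discard(int64_t requested) {
    const int64_t hi = std::numeric_limits<int32_t>::max();
    return static_cast<int32_t>(std::clamp<int64_t>(requested, 0, hi));
}

// Hypothetical helper: 0 means "auto", which the commit message describes
// as discarding half of the remaining tokens (n_left / 2).
inline int32_t effective_n_discard(int32_t n_discard, int32_t n_left) {
    return n_discard > 0 ? n_discard : n_left / 2;
}
```

With these two helpers, a request carrying n_discard = -1 and n_left = 64 ends up discarding 32 tokens rather than indexing out of bounds.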

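Note on c807c6e3b0 (anthropic API prefix caching): the commit message describes replacing the changing cch checksum with fffff so that otherwise identical prompts share a cache prefix, and says the replacement is written defensively in case the length changes from 5 hexadecimal characters. The sketch below illustrates that approach under assumptions: the exact `cch=<hex>;` layout and the name normalize_cch are guesses made for illustration, not the code from the pull request.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrite any "cch=<hex digits>;" token to a fixed
// value so the prompt text is stable across requests. The pattern accepts
// one or more hex digits rather than exactly five, mirroring the
// "defensive" replacement mentioned in the commit message.
inline std::string normalize_cch(const std::string & text) {
    static const std::regex re("cch=[0-9a-fA-F]+;");
    return std::regex_replace(text, re, "cch=fffff;");
}
```

Two system messages that differ only in the checksum then normalize to the same string, so their token prefixes can match.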
8954 Commits