llama.cpp

Author	SHA1	Message	Date
Xuan-Son Nguyen	9998d88bc8	mtmd: correct mtmd_decode_use_mrope() (#22188 )	2026-04-21 10:53:37 +02:00
Georgi Gerganov	cd03ec7642	llama-ext : fix exports (#22202 )	2026-04-21 11:04:46 +03:00
Georgi Gerganov	4889afba5f	sync : ggml	2026-04-21 11:04:21 +03:00
Georgi Gerganov	041fe83d74	ggml : bump version to 0.10.0 (ggml/1463)	2026-04-21 11:04:21 +03:00
Georgi Gerganov	cfe9838d26	fit-params : refactor + add option to output estimated memory per device (#22171 ) * fit-params : add option to output estimated memory per device * cont : minor * cont : refactor * cont : move fit params implementation to libcommon * cont : header * cont : headers * cont : codeowners	2026-04-21 09:54:36 +03:00
xris99	ff6b1062af	server : fix hardcoded proxy connection timeout in router mode (#18760 ) (#22003 ) Fixes: https://github.com/ggml-org/llama.cpp/issues/18760 Co-authored-by: Christian <christian@example.com>	2026-04-21 06:41:14 +02:00
leonardHONG	97895129e5	ggml-cuda: flush legacy pool on OOM and retry (#22155 ) * ggml-cuda: flush legacy pool on OOM and retry Signed-off-by: 梁厚宏 <2695316095@qq.com> * Address review comments: add explicit sync, update destructor, clean up MUSA macros Signed-off-by: 梁厚宏 <2695316095@qq.com> --------- Signed-off-by: 梁厚宏 <2695316095@qq.com>	2026-04-20 23:30:38 +02:00
Xuan-Son Nguyen	86f8daacfe	mtmd: correct get_n_pos / get_decoder_pos (#22175 )	2026-04-20 23:29:19 +02:00
Georgi Gerganov	cf8b0dbda9	server : remove /api endpoints (#22165 ) * server : remove /api endpoints * cont : remove /api/tags	2026-04-20 20:41:19 +03:00
Gaurav Garg	fd6ae4ca1c	Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129 ) * Fix delayed AllReduce on Gemma-4 MoE Skip forward past nodes that don't consume the current one, and allow a chain of MULs. * Check for all sources before skipping nodes * Address review comments	2026-04-20 18:25:39 +02:00
Johannes Gäßler	fb19f94c71	TP: fix 0-sized tensor slices, AllReduce fallback (#21808 ) * TP: fix 0-sized tensor slices, AllReduce fallback * fix layer structure <-> GPU count aliasing * add missing std::fill * fix CUDA device set, max ggml ctx size	2026-04-20 18:09:39 +02:00
pl752	7f251fdbce	ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636 ) * Implemented optimized q1_0 dot for x86 and generic * Removed redundant helper definition * Removed two redundant instructions from AVX q1_0 dot * Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback * Style cleanup around AVX q1_0 dot * Replaced explicitly unrolled blocks with inner for loop for q1_0 * Replaced scalar ARM q1_0 impl with new generic one	2026-04-20 19:02:54 +03:00
neha-ha	a6cc43c286	ggml-webgpu: updated matrix-vector multiplication (#21738 ) * merged properly, but slow q3_k and q5_k with u32 indexing * Start on new mat-vec * New format float paths working * Working q4_0 * Work on remaining legacy q-types * port k-quants to new matvec * remove old shader * Remove old constants, format * remove accidental file --------- Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-04-20 07:37:17 -07:00
Xuan-Son Nguyen	a678916623	mtmd: refactor mtmd_decode_use_mrope (#22161 )	2026-04-20 14:45:11 +02:00
SamareshSingh	81df3f7cfa	fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102 ) * llama: fix crash in print_info for GLM-DSA when vocab_only is set * addressed code review comments * cont : simplify --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-20 10:32:46 +03:00
Georgi Gerganov	de71b5f81c	server : refactor "use checkpoint" logic (#22114 )	2026-04-20 08:42:37 +03:00
Katostrofik	788fcbc5dd	[SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035 ) * [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K asserted that block_num_y was a multiple of 16 subgroups. Models with a vocab size not divisible by 16 (for example HY-MT at 120818) aborted on model load when the output projection tripped the assert. I replaced the assert with padding: block_num_y now rounds up to a whole number of subgroup-sized workgroups. The kernel already has the row bounds check (`if (row >= nrows) return;`) so the extra padded threads early-exit cleanly. Row values are uniform across a subgroup so the collective reduce stays safe. For aligned vocab sizes the padded block_num_y equals the old value, so the kernel launch is identical and there is no regression. Thanks to @arthw for flagging the relationship to #21527. Fixes #22020. AI assisted coding, tested on Intel B70 hardware. * sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel target where WARP_SIZE is 16, but makes the relationship to subgroup size explicit. Per review by @NeoZhangJianyu on #22035. Assisted by Claude.	2026-04-20 08:39:45 +03:00
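Note on 788fcbc5dd: the padding described in that message can be sketched on the CPU as below. This is an illustrative sketch only, not the SYCL code from the commit. The names `round_up` and `count_rows_processed` are hypothetical, and the loop stands in for a kernel launch in which each padded "thread" performs the `if (row >= nrows) return;` bounds check quoted in the message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round n up to the next multiple of `multiple`.
constexpr std::int64_t round_up(std::int64_t n, std::int64_t multiple) {
    return ((n + multiple - 1) / multiple) * multiple;
}

// CPU stand-in for a padded launch: one "thread" per padded row.
// Rows at or past nrows exit early, mirroring the kernel's bounds check,
// so only the real rows are processed.
std::int64_t count_rows_processed(std::int64_t nrows, std::int64_t subgroup_size) {
    assert(subgroup_size > 0);
    const std::int64_t padded = round_up(nrows, subgroup_size);
    std::int64_t processed = 0;
    for (std::int64_t row = 0; row < padded; ++row) {
        if (row >= nrows) {
            continue; // padded thread: nothing to do
        }
        ++processed;
    }
    return processed;
}
```

With a subgroup size of 16, the unaligned vocab size 120818 from the message pads to 120832, while an already aligned size is left unchanged, which matches the message's claim that aligned launches stay identical.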
Yes You Can Have Your Own	9d49acb2a7	server: rename --clear-idle to --cache-idle-slots (#21741 )	2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	e365e658f0	vendor : update cpp-httplib to 0.42.0 (#21781 )	2026-04-20 06:41:43 +08:00
Johannes Gäßler	4eac5b4509	CUDA: refactor mma data loading for AMD (#22051 ) * CUDA: refactor mma data loading for AMD * fix CDNA MMQ occupancy * fix CDNA3 mma * fix RDNA3 compile	2026-04-19 18:26:59 +02:00
Aldehir Rojas	d5b780a676	common/autoparser : allow space after tool call (#22073 )	2026-04-19 13:28:35 +02:00
uvos	471540ae8a	HIP: Remove unesscary NCCL_CHECK (#21914 )	2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen	19124078be	mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082 ) * mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos * fix build	2026-04-19 11:57:21 +02:00
Gaurav Garg	bcdcc1044f	ggml : reduce CPU overhead in meta backend (#22041 ) * cache subgraph splits when cgraph is unchanged Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively. Assign uid to every sub-graph so that CUDA's fast uid check path hits too. * Address review comments * Keep the scope as is * Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code. * Address review comments * Update ggml/src/ggml-backend-meta.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-backend-meta.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-04-19 12:48:35 +03:00
Sigbjørn Skjæret	037bfe38d0	ci : install spirv-headers for vulkan-cross (#22109 )	2026-04-19 10:32:08 +03:00
Dowon	8685e7b075	convert : support sentence-transformer 5.4 config files (#22087 ) * convert : support sentence-transformer 5.4 config files * fix: embeddinggemma * fix: mapping Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix: pooling_mode Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-19 10:25:39 +03:00
texasich	09b4efa95f	cmake: remove CMP0194 policy to restore MSVC builds (#21934 ) #21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds. Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform. Reported-by: oobabooga Refs: #21630 Co-authored-by: texasich <texasich@users.noreply.github.com>	2026-04-19 10:25:05 +03:00
Sascha Rogmann	455d8e4be8	server : speculative checkpointing (#19493 ) * server : speculative decoding using checkpoints * server : fix draft check with checkpoints * server : rename spec vars * server : log levels * server : refactored spec logic to speculative.cpp * server : renamed spec checkpoints option * server : fix spec checkpoints, logging * speculative : checkpoints with draft model, logging * server : n_tokens_cur and create_checkpoint in draft * server : fix server_speculative_callback (slot.id) * spec : fix ngram-map/begin idx_last_check * spec : init ckpt (begin() wasn't called) * chore: update webui build output * server : restore sampler in spec checkpoint and clear mem * cont : avoid --spec-use-checkpoints argument * cont : remove server_prompt_checkpoint_with_size * spec : rename (leave_draft_state) * cont : clean-up * cont : do not ignore partial drafts even if the are short * cont : spec callback owned by session * cont : simplify * cont : avoid empty speculative session * cont : simplify * cont : simplify * cont : enable mtmd speculative decoding * cont : keep the spec sampler alive * cont : simplify * cont : fix nullptr deref + draft checkpoints * cont : remove common_speculative_accept_response * cont : remove callback * cont : simplify * cont : minor * cont : simplify * cont : fix accepted number --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-19 10:24:06 +03:00
Radoslav Gerganov	91fef95362	rpc : refactor the RPC transport (#21998 ) * rpc : refactor the RPC transport Move all transport related code into a separate file and use the socket_t interface to hide all transport implementation details. * fix win32 * better socket_t construction	2026-04-19 10:21:53 +03:00
Cetarthoriphros	9e5647affa	server: Expose `media_tag` on /props endpoint. (#22028 )	2026-04-19 00:27:17 +02:00
Sigbjørn Skjæret	4f02d47339	model : refactor bias tensor variable names (#22079 ) * refactor bias tensor variable names * use create_tensor_qkv for jina-bert-v2	2026-04-18 20:12:00 +02:00
Sigbjørn Skjæret	23b8cc4991	android : libcommon -> libllama-common (#22076 )	2026-04-18 11:19:40 +02:00
SamareshSingh	59accc8863	ggml-backend-meta: add multi-segment read support in get_tensor (#22063 )	2026-04-18 10:04:51 +02:00
Sigbjørn Skjæret	83d58e02fc	ci : free disk space for rocm release (#22012 )	2026-04-18 09:37:30 +02:00
Sigbjørn Skjæret	89a5474f0e	convert : fix (ignore for now) typings errors (#22002 )	2026-04-18 09:36:41 +02:00
Johannes Gäßler	fd1c0ec3f0	llama: fit ctx size for CPU only (#21568 )	2026-04-18 08:16:04 +02:00
Reese Levine	45cac7ca70	ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (#21052 ) * Update workflows to remove dependence on llvmpipe * Try setting Dawn_DIR * remove c++20 initializers * Move to proper guid * Try avoiding segfaults on vulkan backend process exit * Remove compiler warnings on parameter casting * Fix soft_max and update reg_tile accumulation to f32 for better precision * Refactor flash_attn a bit * remove c++20 initializers and format * Increase div precision for NVIDIA * revert div precision and comment out ggml-ci node for now * Formatting * Try debugging on a failing CI node * Revert "Try debugging on a failing CI node" This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.	2026-04-17 09:17:11 -07:00
Aman Gupta	b94050e896	CUDA: use LRU based eviction for cuda graphs (#21611 ) * CUDA: use a ring-buffer for cuda graphs * bump limit to 128 * use LRU eviction * better naming * do periodic clean-up	2026-04-17 23:24:21 +08:00
Yuri Khrustalev	a279d0f0f4	ci : add android arm64 build and release (#21647 ) * server: respect the ignore eos flag * ci: add android arm64 build and release * patch * pin android-setup actions to v4 * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * lf in the suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-17 11:32:24 +02:00
65a	268d61e178	mtmd: add missing struct tag (#22023 )	2026-04-17 10:48:33 +02:00
Georgi Gerganov	6990e2f1f7	libs : rename libcommon -> libllama-common (#21936 ) * cmake : allow libcommon to be shared * cmake : rename libcommon to libllama-common * cont : set -fPIC for httplib * cont : export all symbols * cont : fix build_info exports * libs : add libllama-common-base * log : add common_log_get_verbosity_thold()	2026-04-17 11:11:46 +03:00
Eric Zhang	fcc7508759	model : Gemma4 model type detection (#22027 ) * model : Gemma4 model type detection * model : Gemma4 model type detection	2026-04-17 10:07:11 +02:00
lhez	5e6c0e18b6	opencl: refactor q8_0 set_tensor and mul_mat host side dispatch for Adreno (#21938 ) * opencl: refactor q8_0 gemm/gemv Adreno dispatch * opencl: refactor q8_0 set_tensor * opencl: fix whitespace	2026-04-16 22:28:33 -07:00
Sigbjørn Skjæret	30dce2cf29	cli : use get_media_marker (#22017 )	2026-04-17 00:12:31 +02:00
Xuan-Son Nguyen	089dd41fe3	cmake: use glob to collect src/models sources (#22005 )	2026-04-16 23:25:16 +02:00
nullname	85dde8dc4a	hexagon: optimize HMX matmul operations (#21071 ) * optimize hmx_mat_mul functions by calculating row and column tiles upfront * refactor core_dot_chunk_fp16 to use size_t for tile counts and improve readability * wip * set scale outside of loop * wip * refactor core_mma_chunk_fp16 and mat_mul_qk_0_d16a32 to use size_t for tile counts * wip * wip * refactor transfer_output_chunk_fp16_to_fp32 to use size_t for dimensions * refactor core_dot_chunk_fp16 to use size_t for tile row stride calculation * wip * refactor hmx_mat_mul functions to use hvx_vec_splat_f16 for column scales initialization * refactor hmx_mat_mul_permuted_w16a32_batched to streamline scale setting and locking * refactor core_dot_chunk_fp16 to improve tile stride calculations for output * refactor hmx_mat_mul functions to use Q6_V_vsplat_R for column scales initialization * fix compiling error * wip * optimize row and column tile indexing in core_mma_chunk_fp16 function * wip * Revert "wip" This reverts commit cde679eff79c4a28dd2d89d32f710015e09592b6. * Add size limit check for HAP_mmap in htp_iface_mmap and drop_mmap functions * wip	2026-04-16 13:48:34 -07:00
Xuan-Son Nguyen	4fbdabdc61	model: using single llm_build per arch (#21970 ) * model: using single llm_build per arch * fix merge * nits	2026-04-16 21:10:22 +02:00
shaofeiqi	e45dbdece8	opencl: add q5_K gemm and gemv kernels for Adreno (#21595 )	2026-04-16 12:08:33 -07:00
Pascal	4adac43f6f	server: tests: fetch random media marker via /apply-template (#21962 ) (#21980 ) * server: tests: fetch random media marker via /apply-template (#21962 fix) * server: allow pinning media marker via LLAMA_MEDIA_MARKER env var get_media_marker() checks LLAMA_MEDIA_MARKER at first call and uses it as-is if set, falling back to the random marker otherwise. Tests no longer need to fetch the marker dynamically via /apply-template: the fixture sets LLAMA_MEDIA_MARKER=<__media__> so the hardcoded prompts work as before. Address review feedback from ngxson * server: make get_media_marker() thread-safe via magic statics Use a C++11 static local with a lambda initializer instead of a global static with an empty-check. The runtime guarantees initialization exactly once without explicit locking. Address review feedback from ggerganov * nits * nits	2026-04-16 20:46:21 +03:00
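Note on 4adac43f6f: the "magic statics" pattern named in that message can be sketched as follows. This is a minimal sketch under stated assumptions, not the function from the commit. The name `media_marker_sketch` and the format of the random fallback marker are hypothetical; only the `LLAMA_MEDIA_MARKER` variable and the "use it as-is if set, otherwise fall back to a random marker" behavior come from the message.

```cpp
#include <cassert>
#include <cstdlib>
#include <random>
#include <string>

// Hypothetical stand-in for get_media_marker().
// A function-local static is initialized exactly once, even when several
// threads make the first call concurrently (guaranteed since C++11),
// so no explicit lock or "is it empty yet?" check is needed.
const std::string & media_marker_sketch() {
    static const std::string marker = []() -> std::string {
        // Use the environment override as-is when it is set and non-empty.
        if (const char * env = std::getenv("LLAMA_MEDIA_MARKER")) {
            if (*env != '\0') {
                return std::string(env);
            }
        }
        // Assumed fallback format: a marker with a random numeric suffix.
        std::random_device rd;
        return "<__media_" + std::to_string(rd()) + "__>";
    }();
    assert(!marker.empty());
    return marker;
}
```

Every call returns a reference to the same string, so a value pinned through the environment (as the test fixture in the message does) stays stable for the lifetime of the process.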
PikaPikachu	9db77a020c	model : refactor QKV into common build_qkv and create_tensor_qkv helpers (#21245 ) * model : refactor QKV into common build_qkv and create_tensor_qkv helpers * model : extend build_qkv to bert/mpt/dbrx/olmo/lfm2/nemotron-h/granite-hybrid/gemma3n-iswa/t5-dec and fix wqkv_s	2026-04-16 17:41:34 +02:00

8869 Commits