Commit Graph

8982 Commits

Author SHA1 Message Date
Chen Yuan ca7f7b7b94 ggml-webgpu(shader): support conv2d kernels. (#21964)
* ggml(webgpu): fix the busy-polls in Emscripten  in the waitAny after #20618, and remove the busy webgpu log

* Merge with upstream

* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants

* Update Unary wgsl EXP and EXPM1 for f16 stability

* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization

* Fix numerical percision for unary sqrt when working with f16

* Fix NaN canonicalization for packed integers using f16

* Update err threshold for binary div ops when using f16

* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend

* clean: uncomment existing code logs

* clean: clean the unncessary debug info

* Refactor and generalize dequant helpers

* Remove deprecated quant structs

* Refactor shader defines to reduce repetition

* Remove error override for F16 type

* fix: fix the accidential removal of the proper initialization of ctx

* clean: clean legacy and format code

* fix: did not modify tests ops

* shader(conv2d): add conv2d shader kernels and pass f32 and f16 tests

* shader(conv2d): fix the out of bounds memory access in the weight indexing

* shader(conv2d): clean unused variables and optimize the computation

* merge: use the new entries function

* clean: address the formatting issues

* clean: address the warning issues

* clear: clean the shader editorconfig-checker issues

* clear: clean the shader editorconfig-checker with utf-8

---------

Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
2026-04-21 20:18:57 -07:00
Aparna M P 0dedb9ef7a hexagon: add support for FILL op (#22198)
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-04-21 16:24:20 -07:00
Masashi Yoshimura 2799d933b5 ggml-webgpu: reset CPU/GPU profiling time when freeing context (#22050)
* Reset the CPU/GPU profiling time when freeing context.

* move GPU profiling time from global context to webgpu_context.
2026-04-21 16:05:21 -07:00
Xuan-Son Nguyen 04fe84b69d server: allow cancel loading model (#21814) 2026-04-22 00:26:09 +02:00
Shreya Jain 5a4cd6741f Hexagon: DAIG op (#22195)
* hexagon: Add DIAG op

* hexagon: add HVX support and DMA double buffering

* hexagon: fix fatal error

* hexagon: remove as many pragma(s) as possible
2026-04-21 14:16:04 -07:00
Mengsheng Wu 2248799a58 hexagon: fix missing v79 entry in libggml-htp.inf (#22194) 2026-04-21 13:53:44 -07:00
Paul Dubs 72d693e4fb spec : reset i_last when low acceptance streak occurs (#22168)
By resetting i_last to zero, we will include the current context when rebuilding the speculative map.
2026-04-21 21:29:07 +03:00
Kwa Jie Hao 98d2d2884e mtmd: Add support for Reka Edge 2603 (#21616)
* feat: (vocab) fix stray text appended in llama_decode_text

Remove accidental concatenation of the full `text` string when
formatting UNK_BYTE hex escapes. Only the closing "]" should be appended.

* feat(mtmd): add Yasa2 vision encoder support

Add a Yasa2 (ConvNeXtV2-based) vision encoder for reka-edge:
- Register PROJECTOR_TYPE_YASA2 and tensor name definitions
- Add yasa2_block/yasa2_stage model structs
- Implement graph builder with ConvNeXt stages, GRN, adaptive pooling
- Wire into clip.cpp switch statements and mtmd.cpp init_vision
- Use mtmd_image_preprocessor_fixed_size for image preprocessing

* feat(chat): add reka-edge template handler (tools, thinking)

- Add chat-reka.cpp/h implementing PEG-based parser for reka-edge format
- Add Reka-Edge.jinja chat template
- Detect reka-edge template in try_specialized_template()
- Add LLAMA_EXAMPLE_MTMD to chat-template-file arg

* feat: add reka vlm to gguf conversion script

Converts Reka Yasa2 hf checkpoints to GGUF format:
- Text decoder: Llama-arch with tiktoken/BPE vocab
- Mmproj (--mmproj): ConvNeXt vision backbone + language_projection
- Generates 2D sincos positional embeddings for vision encoder

* test: add Reka Edge chat template and parser tests

- test-chat-template: oracle tests comparing Jinja engine output vs
  common_chat_templates_apply for text, tools, thinking, images, video
- test-chat: PEG parser tests for Reka Edge format, round-trip tests
  for image/video content parts, common path integration tests

* scripts: add Reka Edge mixed quantization helper

Q4_0 base quantization with Q8_0 override for the last 8 transformer
blocks (layers 24-31) via --tensor-type regex.

* fix: adapt chat-reka and tests to upstream API

- Use autoparser::generation_params (not templates_params)
- Add p.prefix(generation_prompt) to PEG parser
- Simplify reasoning parser to match LFM2 pattern
- Remove image/video oracle tests (unsupported by oaicompat parser;
  no other multimodal models test this path)

* fix: avoid duplicate tensor loading in yasa2 vision encoder

TN_YASA_PATCH_W and TN_PATCH_EMBD both resolve to "v.patch_embd.weight",
causing the same tensor to be loaded twice into ctx_data and overflowing
the memory pool. Reuse the tensors already loaded by the common section.

* chore: update image pre-processing settings

The reka-edge model depends on the following settings in an older
fork of llama.cpp:
1. Fixed square resize
2. BICUBIC
3. add_padding=false

In current llama.cpp, this means setting:
- image_resize_algo = RESIZE_ALGO_BICUBIC
- image_resize_pad = false

* chore: remove reka gguf conversion script

* chore: remove reka quantization script

* chore: remove unnecessary changes from PR scope

This commit removes a couple of unnecessary changes for the PR scope:
1. BPE decoder bug fix - this affects reka edge because there's a bug
in our tokenization that doesn't represent <think> tokens as special
tokens. However this isn't meant to be a thinking model so when run
with --reasoning off the edge case does not affect us

2. --chat-template-file support from llama-mtmd-cli - the focus is on
llama-server and the reka edge gguf contains the necessary metadata
to detect the chat template

3. reka edge oracle test cases - no other model has similar test cases,
so I removed it for standardization

* chore: remove unnecessary ggml_cast

This commit removes unnecessary ggml_cast after updating the
reka vlm -> gguf conversion script on hugging face.

* chore: remove redundant code

* chore: remove unnecessary ggml_cont calls

This commit removes all ggml_cont calls except the four that
precede ggml_reshape_3d/ggml_reshape_4d. Those are necessary
because ggml_reshape recomputes strides assuming contiguous
layout and asserts ggml_is_contiguous.

Other operations (ggml_mean, ggml_add, ggml_mul etc.) use
stride-based indexing and handle non-contiguous inputs
correctly and so we are ok to remove ggml_cont for those.

* chore: remove unnecessary ggml_repeat calls

This commit removes unnecessary ggml_repeat calls because the underlying
ops already broadcast automatically.

Every ggml_repeat in yasa2.cpp was expanding a smaller tensor to match
a larger one's shape before passing both to an elementwise op (ggml_add,
ggml_sub, ggml_mul, or ggml_div). This is unnecessary because all four
of these ops already support broadcasting internally.

* chore: restore ggml_cont needed for cpu operations

* refactor: locate reka chat template handler in chat.cpp

* chore: remove unnecessary warmup tokens

* chore: add code comments on image_resize_pad

* chore: remove custom reka parsing code

* chore: revert common/chat.cpp

* Uncomment debug logging for PEG input parsing

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-04-21 20:02:49 +02:00
Georgi Gerganov 84652b80cf arg : add --spec-default (#22223) 2026-04-21 19:52:02 +03:00
Zijun Yu 52f1096f21 openvino: driver setup, CI split, thread safety, and NPU optimizations (#21944)
* Thread safety per request only

* Fix ROPE yarn case

* Fix sticky stateful config

* Use i4/i8 directly for symmetric quant

* Use weightless caching

* Add WeightlessCacheAttribute to reduce NPU memory usage

* Gelu tanh support (#125)

* Imrope support (#126)

* fix(openvino): explicit ov::Tensor frees in ggml_backend_openvino_free

* add GPU,NPU support in OV Dockerfile

* add build-openvino.yml ci

* Fix sticky stateful config

* add concurrency to ov-gpu ci runs. Move OV CI to build-openvino.yml

* fix thread-safety of shared runtime context

* rope type abstraction for frontend translations

* fix editorconfig

---------

Co-authored-by: Mustafa Cavus <mustafa.cavus@intel.com>
Co-authored-by: Dan Hoffman <dhoff749@gmail.com>
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
2026-04-21 18:58:34 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 606fa42f5d vendor : update cpp-httplib to 0.43.1 (#22143)
* vendor : update cpp-httplib to 0.43.0

* vendor : update cpp-httplib to 0.43.0
2026-04-21 22:45:48 +08:00
Georgi Gerganov 7fc1c4ef78 metal : workaround macOS GPU interactivity watchdog (#22216) 2026-04-21 17:24:55 +03:00
Jeff Bolz 82209efb7e vulkan: Support F16 OP_FILL (#22177) 2026-04-21 11:01:56 +02:00
Xuan-Son Nguyen 9998d88bc8 mtmd: correct mtmd_decode_use_mrope() (#22188) 2026-04-21 10:53:37 +02:00
Georgi Gerganov cd03ec7642 llama-ext : fix exports (#22202) 2026-04-21 11:04:46 +03:00
Georgi Gerganov 4889afba5f sync : ggml 2026-04-21 11:04:21 +03:00
Georgi Gerganov 041fe83d74 ggml : bump version to 0.10.0 (ggml/1463) 2026-04-21 11:04:21 +03:00
Georgi Gerganov cfe9838d26 fit-params : refactor + add option to output estimated memory per device (#22171)
* fit-params : add option to output estimated memory per device

* cont : minor

* cont : refactor

* cont : move fit params implementation to libcommon

* cont : header

* cont : headers

* cont : codeowners
2026-04-21 09:54:36 +03:00
xris99 ff6b1062af server : fix hardcoded proxy connection timeout in router mode (#18760) (#22003)
Fixes: https://github.com/ggml-org/llama.cpp/issues/18760

Co-authored-by: Christian <christian@example.com>
2026-04-21 06:41:14 +02:00
leonardHONG 97895129e5 ggml-cuda: flush legacy pool on OOM and retry (#22155)
* ggml-cuda: flush legacy pool on OOM and retry

Signed-off-by: 梁厚宏 <2695316095@qq.com>

* Address review comments: add explicit sync, update destructor, clean up MUSA macros

Signed-off-by: 梁厚宏 <2695316095@qq.com>

---------

Signed-off-by: 梁厚宏 <2695316095@qq.com>
2026-04-20 23:30:38 +02:00
Xuan-Son Nguyen 86f8daacfe mtmd: correct get_n_pos / get_decoder_pos (#22175) 2026-04-20 23:29:19 +02:00
Georgi Gerganov cf8b0dbda9 server : remove /api endpoints (#22165)
* server : remove /api endpoints

* cont : remove /api/tags
2026-04-20 20:41:19 +03:00
Gaurav Garg fd6ae4ca1c Tensor-parallel: Fix delayed AllReduce on Gemma-4 MoE (#22129)
* Fix delayed AllReduce on Gemma-4 MoE

Skip forward past nodes that don't consume the current one, and allow a chain of MULs.

* Check for all sources before skipping nodes

* Address review comments
2026-04-20 18:25:39 +02:00
Johannes Gäßler fb19f94c71 TP: fix 0-sized tensor slices, AllReduce fallback (#21808)
* TP: fix 0-sized tensor slices, AllReduce fallback

* fix layer structure <-> GPU count aliasing

* add missing std::fill

* fix CUDA device set, max ggml ctx size
2026-04-20 18:09:39 +02:00
pl752 7f251fdbce ggml-cpu: Optimized x86 and generic cpu q1_0 dot (follow up) (#21636)
* Implemented optimized q1_0 dot for x86 and generic

* Removed redundant helper definition

* Removed two redundant instructions from AVX q1_0 dot

* Fixed inconsistency with fp16 conversion for generic q1_0 dot and deduplicated generic fallback

* Style cleanup around AVX q1_0 dot

* Replaced explicitly unrolled blocks with inner for loop for q1_0

* Replaced scalar ARM q1_0 impl with new generic one
2026-04-20 19:02:54 +03:00
neha-ha a6cc43c286 ggml-webgpu: updated matrix-vector multiplication (#21738)
* merged properly, but slow q3_k and q5_k with u32 indexing

* Start on new mat-vec

* New format float paths working

* Working q4_0

* Work on remaining legacy q-types

* port k-quants to new matvec

* remove old shader

* Remove old constants, format

* remove accidental file

---------

Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-04-20 07:37:17 -07:00
Xuan-Son Nguyen a678916623 mtmd: refactor mtmd_decode_use_mrope (#22161) 2026-04-20 14:45:11 +02:00
SamareshSingh 81df3f7cfa fix: GLM-DSA crash in llama-tokenize when using vocab_only (#22102)
* llama: fix crash in print_info for GLM-DSA when vocab_only is set

* addressed code review comments

* cont : simplify

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-20 10:32:46 +03:00
Georgi Gerganov de71b5f81c server : refactor "use checkpoint" logic (#22114) 2026-04-20 08:42:37 +03:00
Katostrofik 788fcbc5dd [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes (#22035)
* [SYCL] Fix reorder MMVQ assert on unaligned vocab sizes

The reorder mul_mat_vec_q dispatchers for Q4_0, Q8_0, Q4_K, and Q6_K
asserted that block_num_y was a multiple of 16 subgroups. Models with
a vocab size not divisible by 16 (for example HY-MT at 120818) aborted
on model load when the output projection tripped the assert.

I replaced the assert with padding: block_num_y now rounds up to a
whole number of subgroup-sized workgroups. The kernel already has the
row bounds check (`if (row >= nrows) return;`) so the extra padded
threads early-exit cleanly. Row values are uniform across a subgroup
so the collective reduce stays safe.

For aligned vocab sizes the padded block_num_y equals the old value,
so the kernel launch is identical and there is no regression.

Thanks to @arthw for flagging the relationship to #21527.

Fixes #22020.

AI assisted coding, tested on Intel B70 hardware.

* sycl: use WARP_SIZE for num_subgroups in reorder MMVQ launches

Replaces the hardcoded 16 with WARP_SIZE in the four reorder_mul_mat_vec
launch helpers (Q4_0, Q8_0, Q4_K, Q6_K). Compile-time no-op on the Intel
target where WARP_SIZE is 16, but makes the relationship to subgroup
size explicit. Per review by @NeoZhangJianyu on #22035.

Assisted by Claude.
2026-04-20 08:39:45 +03:00
Yes You Can Have Your Own 9d49acb2a7 server: rename --clear-idle to --cache-idle-slots (#21741) 2026-04-20 08:30:24 +03:00
Alessandro de Oliveira Faria (A.K.A.CABELO) e365e658f0 vendor : update cpp-httplib to 0.42.0 (#21781) 2026-04-20 06:41:43 +08:00
Johannes Gäßler 4eac5b4509 CUDA: refactor mma data loading for AMD (#22051)
* CUDA: refactor mma data loading for AMD

* fix CDNA MMQ occupancy

* fix CDNA3 mma

* fix RDNA3 compile
2026-04-19 18:26:59 +02:00
Aldehir Rojas d5b780a676 common/autoparser : allow space after tool call (#22073) 2026-04-19 13:28:35 +02:00
uvos 471540ae8a HIP: Remove unesscary NCCL_CHECK (#21914) 2026-04-19 12:59:44 +02:00
Xuan-Son Nguyen 19124078be mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos (breaking change) (#22082)
* mtmd: add pos_0 to mtmd_image_tokens_get_decoder_pos

* fix build
2026-04-19 11:57:21 +02:00
Gaurav Garg bcdcc1044f ggml : reduce CPU overhead in meta backend (#22041)
* cache subgraph splits when cgraph is unchanged

Skip per-call subgraph construction in ggml_backend_meta_graph_compute when the same ggml_cgraph is used consecutively.

Assign uid to every sub-graph so that CUDA's fast uid check path hits too.

* Address review comments

* Keep the scope as is

* Rename last_uid and last_n_subgraphs field. Remove last_max_tmp_size field. Refactor code.

* Address review comments

* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-backend-meta.cpp

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-04-19 12:48:35 +03:00
Sigbjørn Skjæret 037bfe38d0 ci : install spirv-headers for vulkan-cross (#22109) 2026-04-19 10:32:08 +03:00
Dowon 8685e7b075 convert : support sentence-transformer 5.4 config files (#22087)
* convert : support sentence-transformer 5.4 config files

* fix: embeddinggemma

* fix: mapping

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix: pooling_mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-04-19 10:25:39 +03:00
texasich 09b4efa95f cmake: remove CMP0194 policy to restore MSVC builds (#21934)
#21630 added the CMP0194 NEW policy to silence a CMake warning, but on Windows runners it caused CMake to prefer the MinGW toolchain for ASM and broke MSVC builds.

Reverting only that policy block restores the previous working behavior. The CMake 4.1+ warning comes back, but that is cosmetic and does not break any platform.

Reported-by: oobabooga

Refs: #21630

Co-authored-by: texasich <texasich@users.noreply.github.com>
2026-04-19 10:25:05 +03:00
Sascha Rogmann 455d8e4be8 server : speculative checkpointing (#19493)
* server : speculative decoding using checkpoints

* server : fix draft check with checkpoints

* server : rename spec vars

* server : log levels

* server : refactored spec logic to speculative.cpp

* server : renamed spec checkpoints option

* server : fix spec checkpoints, logging

* speculative : checkpoints with draft model, logging

* server : n_tokens_cur and create_checkpoint in draft

* server : fix server_speculative_callback (slot.id)

* spec : fix ngram-map/begin idx_last_check

* spec : init ckpt (begin() wasn't called)

* chore: update webui build output

* server : restore sampler in spec checkpoint and clear mem

* cont : avoid --spec-use-checkpoints argument

* cont : remove server_prompt_checkpoint_with_size

* spec : rename (leave_draft_state)

* cont : clean-up

* cont : do not ignore partial drafts even if the are short

* cont : spec callback owned by session

* cont : simplify

* cont : avoid empty speculative session

* cont : simplify

* cont : simplify

* cont : enable mtmd speculative decoding

* cont : keep the spec sampler alive

* cont : simplify

* cont : fix nullptr deref + draft checkpoints

* cont : remove common_speculative_accept_response

* cont : remove callback

* cont : simplify

* cont : minor

* cont : simplify

* cont : fix accepted number

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-04-19 10:24:06 +03:00
Radoslav Gerganov 91fef95362 rpc : refactor the RPC transport (#21998)
* rpc : refactor the RPC transport

Move all transport related code into a separate file and use the
socket_t interface to hide all transport implementation details.

* fix win32

* better socket_t construction
2026-04-19 10:21:53 +03:00
Cetarthoriphros 9e5647affa server: Expose media_tag on /props endpoint. (#22028) 2026-04-19 00:27:17 +02:00
Sigbjørn Skjæret 4f02d47339 model : refactor bias tensor variable names (#22079)
* refactor bias tensor variable names

* use create_tensor_qkv for jina-bert-v2
2026-04-18 20:12:00 +02:00
Sigbjørn Skjæret 23b8cc4991 android : libcommon -> libllama-common (#22076) 2026-04-18 11:19:40 +02:00
SamareshSingh 59accc8863 ggml-backend-meta: add multi-segment read support in get_tensor (#22063) 2026-04-18 10:04:51 +02:00
Sigbjørn Skjæret 83d58e02fc ci : free disk space for rocm release (#22012) 2026-04-18 09:37:30 +02:00
Sigbjørn Skjæret 89a5474f0e convert : fix (ignore for now) typings errors (#22002) 2026-04-18 09:36:41 +02:00
Johannes Gäßler fd1c0ec3f0 llama: fit ctx size for CPU only (#21568) 2026-04-18 08:16:04 +02:00
Reese Levine 45cac7ca70 ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (#21052)
* Update workflows to remove dependence on llvmpipe

* Try setting Dawn_DIR

* remove c++20 initializers

* Move to proper guid

* Try avoiding segfaults on vulkan backend process exit

* Remove compiler warnings on parameter casting

* Fix soft_max and update reg_tile accumulation to f32 for better precision

* Refactor flash_attn a bit

* remove c++20 initializers and format

* Increase div precision for NVIDIA

* revert div precision and comment out ggml-ci node for now

* Formatting

* Try debugging on a failing CI node

* Revert "Try debugging on a failing CI node"

This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.
2026-04-17 09:17:11 -07:00