Commit Graph

  • 06f05e71c1 [metal] wire contiguous Q4_0 kernel into dispatch (#29) fix/29-q40-contig-reads Kaloyan Nikolov 2026-04-30 22:38:37 +02:00
  • 8c532835be [metal] extend bin op fusion to MUL/SUB/DIV chains (#28) (#38) master sleepy 2026-04-30 21:03:14 +02:00
  • eeb79b026b [metal] extend bin op fusion to MUL/SUB/DIV chains (#28) Kaloyan Nikolov 2026-04-30 20:14:12 +02:00
  • 222626cfdc [docs] add GIT.md with workflow and agent instructions Kaloyan Nikolov 2026-04-30 18:11:44 +02:00
  • 683c5acb90 spec : disacard last drafted token with low prob (#22506) Georgi Gerganov 2026-04-29 17:00:00 +03:00
  • b1d5f5b449 sync : ggml Georgi Gerganov 2026-04-29 16:43:08 +03:00
  • 4b221b7f1e ggml : bump version to 0.10.1 (ggml/1469) Georgi Gerganov 2026-04-29 16:41:45 +03:00
  • 59237bfbbc webui: fix slow mic stop and WAV encode (#22480) Pascal 2026-04-29 12:58:35 +02:00
  • 1cbc846eba ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (#22293) shalinib-ibm 2026-04-29 16:02:40 +05:30
  • 3142f1dbb9 ggml-cuda: refactor fusion code (#22468) Aman Gupta 2026-04-29 16:19:33 +08:00
  • b5c4227dc6 ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME (#22317) b8972 qiurui144 2026-04-29 15:59:21 +08:00
  • d6a5094004 ggml-webgpu: Fix bug in FlashAttention support check (#22492) b8971 Reese Levine 2026-04-29 00:59:00 -07:00
  • 7b95ea5d11 common: Intentionally leak logger instance to fix hanging on Windows (#22273) b8970 Masato Nakasaka 2026-04-29 16:58:43 +09:00
  • bdc9c743a5 ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (#21916) b8969 hrushitfujitsu 2026-04-29 13:27:37 +05:30
  • 739393beeb TP: fix delayed AllReduce + zero-sized slices (#22489) Johannes Gäßler 2026-04-29 08:55:07 +02:00
  • fc2b0053ff ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (#22196) b8967 Michael Wand 2026-04-28 15:47:42 -07:00
  • 7b8443ac78 ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (… (#22286) b8966 lnigam 2026-04-29 01:07:35 +05:30
  • 5d56effdee convert : add support for Nemotron Nano 3 Omni (#22481) Daniel Bevenius 2026-04-28 19:17:57 +02:00
  • 52e5f0a5c1 common : re-arm reasoning budget after DONE on new <think> (#22323) b8964 Jillis ter Hove 2026-04-28 19:15:36 +02:00
  • f9f33654a6 vulkan: Coalesce Q4_K/Q5_K scale loads (#21751) b8963 Matt Corallo 2026-04-28 15:31:04 +00:00
  • 98bb57916a ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing logic (#22456) b8962 Reese Levine 2026-04-28 07:27:17 -07:00
  • f42e29fdf1 webui: Server tools (#21237) Aleksander Grygier 2026-04-28 14:35:49 +03:00
  • 19821178be vulkan: add barrier after writetimestamp (#21865) b8960 Jeff Bolz 2026-04-28 12:28:12 +02:00
  • 698d19b93c ggml: improve SPIR-V headers detection with __has_include (#21918) Emil Askerov 2026-04-28 13:19:06 +03:00
  • 50494a2800 ggml : skip already registered backends and devices (#22296) b8958 Adrien Gallouët 2026-04-28 09:02:32 +02:00
  • d530d6e7a2 ggml : revert to -lm linking instead of find_library (#22355) b8957 Adrien Gallouët 2026-04-28 08:56:02 +02:00
  • c3e08f4700 CANN: add new ops, optimize existing ops (#21204) b8956 hipudding 2026-04-28 14:27:22 +08:00
  • 14e733e36f spec : refactor params (#22397) b8955 Georgi Gerganov 2026-04-28 09:07:33 +03:00
  • 516e8d7a8a server: use pos_next instead of n_tokens for m-rope (#22439) b8954 Aman Gupta 2026-04-28 13:41:00 +08:00
  • 434b2a1ff6 ggml-webgpu: add Q1_0 support (#22374) b8953 Rithik Sharma 2026-04-27 15:50:59 -07:00
  • 983ca8992e server: (router) Forward form-data to model server (Fixes #22044) (#22118) b8952 tha80 2026-04-27 23:55:00 +02:00
  • 665abc6097 add fast mat-vec kernels for i-quants (#22344) b8951 Rithik Sharma 2026-04-27 08:25:45 -07:00
  • 4414c04b9a Additional test for common/gemma4 : handle parsing edge cases (#22420) b8950 Igor Rudenko 2026-04-27 17:36:59 +03:00
  • ceaf47c4b1 fix: rpc-server cache may not work in Windows environments (#22394) b8949 unraido 2026-04-27 23:25:09 +09:00
  • 42401c72b8 Fix type casting for unaccounted memory calculation (#22424) b8948 rankaiyx 2026-04-27 20:31:13 +08:00
  • e940b3d468 download : prefer q8_0 when q4_k not available (#22428) b8947 Georgi Gerganov 2026-04-27 15:30:29 +03:00
  • 0f1bb602dd model : remove duplicate wo_s scale after build_attn (Qwen3, LLaMA) (#22421) b8946 ynankani 2026-04-27 07:58:48 +00:00
  • d13540becd convert : remove input_scale for dequantized fp8 modelopt (#22356) Sigbjørn Skjæret 2026-04-27 08:45:01 +02:00
  • f84270ea10 ggml : use 64 bytes aligned tile buffers (#21058) b8944 Adrien Gallouët 2026-04-27 08:30:55 +02:00
  • 5594d13224 common: fix missing exports in llama-common (#22340) b8943 Max Krasnyansky 2026-04-26 22:06:39 -07:00
  • f535774325 pr2wt : symlink .pi (#22386) Georgi Gerganov 2026-04-26 19:49:26 +03:00
  • 06a811d085 add performance-portable tuning for register-tile and subgroup matmul (#22241) b8941 Rithik Sharma 2026-04-26 09:26:28 -07:00
  • 78433f606f Fix recurrent state serialization for partial reads and writes (#22362) b8940 Gaurav Garg 2026-04-26 17:04:40 +05:30
  • 7ec36aa861 Github: set meta backend code owner (#22388) Johannes Gäßler 2026-04-26 13:34:13 +02:00
  • b1a5bd4e0c CUDA: better coalesce data-access for contiguous concat (#22330) Oliver Simons 2026-04-26 09:21:45 +02:00
  • 0c6ee1cade ggml-cpu : re-enable fast gelu_quick_f16 (#22339) b8937 Sigbjørn Skjæret 2026-04-26 08:28:14 +02:00
  • 2dd84169d1 ggml-cpu: optimize avx2 q6_k (#22345) b8936 Eve 2026-04-26 06:27:50 +00:00
  • f454bd7eb8 opencl: add iq4_nl support (#22272) b8935 lhez 2026-04-25 21:21:58 -07:00
  • b760272f1a hexagon: guard HMX clock request for v75+ platforms (#22377) b8934 Trivikram Reddy 2026-04-25 19:58:26 -05:00
  • dcad77cc3b chat: fix handling of space in reasoning markers (#22353) b8933 Piotr Wilkin (ilintar) 2026-04-25 21:24:13 +02:00
  • 98dc1418ea spec : fix vocab compat checks (#22358) Georgi Gerganov 2026-04-25 20:11:35 +03:00
  • 9725a313be CUDA: reduce MMQ stream-k overhead (#22298) b8931 Johannes Gäßler 2026-04-25 14:15:03 +02:00
  • d1649047a3 metal : optimize Metal Tensor API usage for GGML_OP_MUL_MAT (#20962) Developer-Ecosystem-Engineering 2026-04-25 05:14:28 -07:00
  • 9d34231bb8 llama-quant : default ftype param Q5_1 --> Q8_0 (#20828) b8929 ddh0 2026-04-25 01:25:35 -05:00
  • 8ea8fee966 gitignore : add .pi + personal SYSTEM.md (#22316) Georgi Gerganov 2026-04-25 09:20:45 +03:00
  • eddd7a13a5 [SYCL] Optimize Q4_0 mul_mat for Arc770, add scripts (#22291) b8927 Neo Zhang 2026-04-25 14:20:14 +08:00
  • dd2914dc81 ggml-webgpu: support for SSM_SCAN and disable set_rows error checking (#22327) b8926 Reese Levine 2026-04-24 23:18:15 -07:00
  • 0adede866d parser: fix structured output bug (#22302) b8925 Piotr Wilkin (ilintar) 2026-04-24 23:19:55 +02:00
  • 361fe72acb Hexagon: Bump HMX Frequency to Max Corner (#22334) b8924 Trivikram Reddy 2026-04-24 15:55:17 -05:00
  • a702f39597 CI Snapdragon: Switch ubuntu-latest to ubuntu-slim runner (#22303) Shreya Jain 2026-04-24 12:21:36 -07:00
  • 13d36cf891 ggml-webgpu: enable FLASH_ATTN_EXT on browser without subgroup matrix (#22199) b8922 Zheyuan Chen 2026-04-24 10:39:09 -07:00
  • f65bc34c68 hexagon: use DIRID 13 in libggml-htp.inf for modern InfVerif (#22306) Mengsheng Wu 2026-04-25 00:21:33 +08:00
  • 15fa3c493b metal : print GPU description (#22318) b8920 Georgi Gerganov 2026-04-24 13:56:03 +03:00
  • dc80c5252a common : fix jinja warnings with clang 21 (#22313) b8919 Adrien Gallouët 2026-04-24 12:36:02 +02:00
  • e583f3b4f5 ggml : minor coding style (#22308) b8918 Georgi Gerganov 2026-04-24 11:02:00 +03:00
  • 017f090442 jinja : remove unused header (#22310) b8917 Georgi Gerganov 2026-04-24 11:01:46 +03:00
  • ffdd983fb8 server : fix swa-full logic (#22288) b8916 Georgi Gerganov 2026-04-24 10:17:37 +03:00
  • 793d0a7931 server: rename debug tags to match --cache-idle-slots naming (#22292) Yes You Can Have Your Own 2026-04-24 09:28:44 +03:00
  • 8bc492ebb4 hexagon: add SOLVE_TRI op (#21974) b8914 Mengsheng Wu 2026-04-24 09:39:13 +08:00
  • e5f070a1dc fix(shader): handle the buffer aliasing for rms fuse (#22266) b8913 Chen Yuan 2026-04-23 19:32:59 -04:00
  • fa0b8a70a8 cli: Remove redundant local sampling variables (#20429) (#22264) b8912 Ethan Turner 2026-04-23 15:53:23 -07:00
  • 5d2b52d80d hexagon: add support for basic and extended Op profiling (#22269) b8911 Max Krasnyansky 2026-04-23 14:17:21 -07:00
  • 187a456370 Enable testing on Snapdragon devices (#21051) Shreya Jain 2026-04-23 13:08:10 -07:00
  • 185cbff6f1 server : convert_anthropic_to_oai: also copy chat_template_kwargs (#22154) b8909 srkizer 2026-04-24 03:32:46 +09:00
  • c78fb909b2 server: fix heap-buffer-overflow from negative n_discard (CVE-2026-21869) (#22267) b8908 Song Li 2026-04-23 12:39:07 -04:00
  • 12568ca8c8 vendor : update LibreSSL to 4.3.1 (#22285) b8907 Adrien Gallouët 2026-04-23 17:45:56 +02:00
  • c807c6e3b0 server: (anthropic API) fix prefix caching (#21793) b8906 kvc0 2026-04-23 08:45:02 -07:00
  • 0949beb5a3 fix build number for sycl release (#22283) b8905 Sigbjørn Skjæret 2026-04-23 15:38:58 +02:00
  • 9012c50fc8 model-conversion : fix mmproj output file name [no ci] (#22274) Daniel Bevenius 2026-04-23 15:07:38 +02:00
  • 0dd7f915fd cli : cleanup auto-completion code (#21745) Matthias Straka 2026-04-23 15:03:28 +02:00
  • 550d684bd1 server: Enable transcriptions API for LFM2-Audio (#22000) b8902 Tarek Dakhran 2026-04-23 10:47:26 +02:00
  • 8635e221c8 metal : fix event synchronization (#22260) b8901 Georgi Gerganov 2026-04-23 08:22:49 +03:00
  • 930e0210d1 gitignore: add AGENTS.local.md (#22246) Georgi Gerganov 2026-04-23 08:22:24 +03:00
  • 96c1db26c4 ggml-base: use MATH_LIBRARY variable instead of hardcoded 'm' (#22239) Georgi Gerganov 2026-04-23 08:22:08 +03:00
  • 4ead6fd957 [SYCL] Update oneapi 2025.3.3, Seperate SYCL build, release Ubuntu 24 package. (#22078) Neo Zhang Jianyu 2026-04-23 13:21:36 +08:00
  • 5eaee65384 convert : Handle ModelOpt produced mixed precision model during convert to GGUF (#22247) ynankani 2026-04-23 05:19:51 +00:00
  • 60b68a6279 sycl : fused MoE mul_mat_vec_q for TG (#21920) abotsis 2026-04-22 23:18:56 -06:00
  • b76429a69c ggml-webgpu: add support for im2col (#22259) b8895 Chen Yuan 2026-04-22 23:17:41 -04:00
  • 86db42e97f CUDA: fuse relu + sqr (#22249) Anav Prasad 2026-04-23 02:28:56 +00:00
  • 6217b49583 HIP: flip GGML_HIP_GRAPHS to default on (#22254) b8893 uvos 2026-04-23 02:34:31 +02:00
  • 0d0764dfd2 [WebGPU] Implement async tensor api and event api (#22099) b8892 Nikhil Jain 2026-04-22 10:52:01 -07:00
  • 6da7168312 ggml-webgpu: Add fused RMS_NORM + MUL (#21983) b8891 Masashi Yoshimura 2026-04-23 02:51:40 +09:00
  • 8bccdbbff9 chat: fix parallel_tool_calls default setting based on model capabilities, add tests for parallel tool calls and structured outputs (#22217) b8890 Piotr Wilkin (ilintar) 2026-04-22 18:10:56 +02:00
  • bcb5eeb645 speculative-simple : add checkpoint support (#22227) b8889 Georgi Gerganov 2026-04-22 15:44:45 +03:00
  • 225088ea76 sycl: Improve mul_mat_id memory efficiency and add BF16 fast path (#22119) b8888 Akarshan Biswas 2026-04-22 18:02:56 +05:30
  • 82d3f4d3b2 mtmd: also support LLAMA_ROPE_TYPE_NONE (#22242) b8887 Xuan-Son Nguyen 2026-04-22 12:16:29 +02:00
  • 17f6245168 server: ignore reasoning content from transcription api (#21905) b8886 Xuan-Son Nguyen 2026-04-22 12:10:50 +02:00
  • 7bfe60fdf9 mtmd, llama : Update HunyuanVL vision-language model support (#22037) b8885 manayang 2026-04-22 17:58:43 +08:00
  • 750579ff14 common: Refactoring sampler parameters (#20429) (#22233) b8884 Ethan Turner 2026-04-22 01:40:19 -07:00
  • 134d6e54d4 common/chat, server: refactor, move all conversion functions to common, add tests (#20690) b8883 Piotr Wilkin (ilintar) 2026-04-22 10:28:45 +02:00