* ggml-webgpu: add tile flash attention fallback
* ggml-webgpu: add new fields and discard usage of mnk for tile version
* ggml-webgpu: modify the vec path to discard the mnk parameter
* ggml-webgpu: enable flash attention vec and tile version for broswer
* ggml-webgpu: stagging KV for flash attention tile version
* formatting
* turn on subgroup uniformity check
* remove Q_TILE as it is always 1 for vec path
* make row_max and exp_sum to local register
* make different bindings with same underlying buffer to have the same usage flags
* move path selection into the shader library and have the host consume a single flash-attn decision object.
* turn off skip_validation and address buffer overlapping when nwg==1
* formatting
* merge binding when kv overlap
* shader(im2col): implement the im2col shader
* shader(im2col): clean the formatting issues
* shader(im2col): clean the editorconfig checker warning
* fix(shader): address the workgroup issues of im2col and conv2d
* Only run webgpu CI on my fork
* Implement set_tensor_async
* Implement synchronize api
* Implement event creation and deletion API
* Cleanup
* Cleanup
* Comment out jobs for local CI run
* Add webgpu only workflow
* Delete .github/workflows/build-webgpu.yml
* Cleanup
* Cleanup
* Update API with function handlers
* Run clang-format
* Replace one-shot buffer with a direct queue.WriteBuffer using the buffer context
* fused rms_norm_mul + mul
* Add GGML_WEBGPU_DISABLE_FUSION for being able to disable kernel fusion.
* Decouple num_fused_ops from webgpu_context; misc cleanup
* Fix eps handling and remove disable_fusion.
* Fix not to use c++20 initializers.
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization
* Fix numerical percision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unncessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidential removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops
* shader(conv2d): add conv2d shader kernels and pass f32 and f16 tests
* shader(conv2d): fix the out of bounds memory access in the weight indexing
* shader(conv2d): clean unused variables and optimize the computation
* merge: use the new entries function
* clean: address the formatting issues
* clean: address the warning issues
* clear: clean the shader editorconfig-checker issues
* clear: clean the shader editorconfig-checker with utf-8
---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
* merged properly, but slow q3_k and q5_k with u32 indexing
* Start on new mat-vec
* New format float paths working
* Working q4_0
* Work on remaining legacy q-types
* port k-quants to new matvec
* remove old shader
* Remove old constants, format
* remove accidental file
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Update workflows to remove dependence on llvmpipe
* Try setting Dawn_DIR
* remove c++20 initializers
* Move to proper guid
* Try avoiding segfaults on vulkan backend process exit
* Remove compiler warnings on parameter casting
* Fix soft_max and update reg_tile accumulation to f32 for better precision
* Refactor flash_attn a bit
* remove c++20 initializers and format
* Increase div precision for NVIDIA
* revert div precision and comment out ggml-ci node for now
* Formatting
* Try debugging on a failing CI node
* Revert "Try debugging on a failing CI node"
This reverts commit 1971e33cba919915e12bcfd5828abfbd54ca942e.
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
* Move to a single query set for GPU profiling
* Move to batching compute passes when not profiling
* Refactor build_multi
* remove iOS throttling now that we're batching compute passes
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization
* Fix numerical percision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unncessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidential removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops
---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* Add parameters for different browsers in-flight submissions
* Update handling of batch size too
* Throttle ios as much as possible
* Increase timeout for llvm-pipe testing
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
* K quant speedup (#20)
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Move towards writeBuffer for params
* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups
* Remove extra file
* Formatting
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Enable tmate debugging for investigating thread safety issue
* Refactor wait and submit to operate on vector<wgpu::FutureWaitInfo>, and fix wait to delete only the future that is completed.
* Cleanup
* Remove clear change and run clang-format
* Cleanup
* ggml-webgpu: fix workgroup dispatch limit for large batch sizes
WebGPU limits workgroup sizes to 65535 per dimension. Large MUL_MAT
operations with batch sizes exceedeing this limi would fail.
* add compute_2d_workgroups() helper to split total workgroup ID across
X/Y dimensions
* update mul_mat_reg_tile.wgsl to reconstruct linear workgroup ID from 2D
dispatch
* update mul_mat_subgroup_matrix.wgsl to reconstruct linear workgroup ID
from 2D dispatch
* update mul_mat.wgsl to compute global index from 2D workgroup
coordinates
* refactor all three mul_mat dispatch paths to use the shared helper
* ggml-webgpu: add bounds checking for over-dispatched workgroups
2D workgroup dispatch can over-dispatch when total workgroups don't
divide evenly into the 65535 per-dimension limit. Extra workgroups
would compute invalid batch indices, causing memory corruption.
* add batch_idx bound check to mul_mat_reg_tile.wgsl and
mul_mat_subgroup_matrix.wgsl to prevent over-dispatched workgroups
from accessing invalid memory
* fixes test failures with large batch sizes (eg., bs=[128, 1024])
* ggml-webgpu: add back TODO for spliting large sizes into batches
* Optimize 2d workgroup provisioning
* Set some parameters that increase speed
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Allow webgpu_buf_pool to resize if needed, remove inflight_threads, and replace inflight_threads with num_kernels for submission
* Run clang-format
* Keep track of num batched kernels that have not been submitted yet
* Run clang-format
* Increase buf pool max size
* Increase param buf pool init size
* Remove webgpu buf pool resizing
* Merge with master
* Add buffer pool growth
* Move buffer pool growth outside of lock
* Reduce max pool size to 32
* Run clang-format
* Only resize param buf pool
* ggml-webgpu: Add binary op support for overlapping and non-contiguous.
* Add newline to binary.wgsl
* Append the test of binary op for src overlapping to test_bin_bcast.
* Remove unnecessary newline.
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* flashattention and matrix multiplication moved to new format
* clean up preprocessing
* Formatting
* remove duplicate constants
* Split large shaders into multiple static strings
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Fix memory leaks in shader lib, backend, backend_context, buffer_context, and webgpu_buf_pool
* Free pools
* Cleanup
* More cleanup
* Run clang-format
* Fix arg-parser and tokenizer test errors that free an unallocated buffer
* Fix device lost callback to not print on device teardown
* Fix include and run clang-format
* remove unused unused
* Update binary ops
---------
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1=dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for inplace & overlap case in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases
Remove src0==src1 binary overlap case that never occurs in practice.
* keep INPLACE (src0==dst), OVERLAP (src1==dst), DEFAULT
* remove unused src0==src1 and all-same variant
* refactor wgsl to eliminate duplication
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
* FlashAttention (#13)
* Add inplace softmax
* Move rms_norm to split row approach
* Update debug for supports_op
* clean up debug statements
* neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though
* neg passes backend test
* unary operators pass ggml tests
* rms_norm double declaration bug atoned
* abides by editor-config
* removed vestigial files
* fixed autoconfig
* All operators (inlcluding xielu) working
* removed unnecesarry checking if node->src[1] exists for unary operators
* responded and dealt with PR comments
* implemented REPL_Template support and removed bug in unary operators kernel
* formatted embed wgsl and ggml-webgpu.cpp
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Refactored pipelines and workgroup calculations (#10)
* refactored pipelines
* refactored workgroup calculation
* removed commented out block of prior maps
* Clean up ceiling division pattern
---------
Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on flash attention
* Shader structure set up (many bugs still)
* debugging
* Working first test
* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32
* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling
* Start work on integrating pre-wgsl
* Separate structs/initial shader compilation library into separate files
* Work on compilation choices for flashattention
* Work on subgroup matrix/tile size portability
* subgroup size agnostic online softmax
* Cleanups, quantization types
* more cleanup
* fix wasm build
* Refactor flashattention to increase parallelism, use direct loads for KV in somce cases
* Checkpoint
* formatting
* Update to account for default kv cache padding
* formatting shader
* Add workflow for ggml-ci webgpu
* Try passing absolute path to dawn in ggml-ci
* Avoid error on device destruction, add todos for proper cleanup
* Fix unused warning
* Forgot one parameter unused
* Move some flashattn computation to f32 for correctness
* ggml-webgpu: add CEIL operation support
Add support for the CEIL unary operation in the WebGPU backend:
- Add CEIL_FUNC shader template in unary_op.wgsl
- Add 4 shader variants (f32, f16, inplace versions)
- Initialize CEIL pipelines in ggml-webgpu.cpp
- Register CEIL in supports_op function
* docs: update WebGPU ops support for CEIL