* This commit enables the router to forward form-data to model server.
Fixes#22044 (enabling to use the /v1/audio/transcriptions in router mode)
* * Applied the suggestion from Copilots first comment: using the non-throwing json::parse overload.
* Addressed Copilots third comment by extending the files representation to also include filename and content-type
* Addressed Copilots fourth comment by making the RNG thread_local
* Changed variable body from std::string to std::ostringstream in build_multipart_body
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127099053
* Added sanitize_field lambda in build_multipart_body for key, filename and content_type
as suggested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127104647
* explicitly checking if value/item is string before calling value/item.get<std::string>()
as requested by ngxson in https://github.com/ggml-org/llama.cpp/pull/22118#discussion_r3127111279
* Added double quote to the sanitize lambda and throw on json parse failure
---------
Co-authored-by: Ralph Paßgang <ralph@trust-it.de>
* common: refactor common/debug to move abort_on_nan into base_callback_data
Passing bool abort_on_nan as template parameter for common_debug_cb_eval is unnecessary and creates an issue with LTO.
It should just be a member of the base_callback_data instead.
* cont : cleanup
* common : use pimpl in debug.h to reduce header dependencies
Move common_debug_cb_user_data's data members (std::regex,
std::vector<uint8_t>) into a private impl struct in debug.cpp.
This removes the includes of common.h and <regex> from debug.h,
reducing transitive dependencies for any translation unit that
includes the header.
Assisted-by: llama.cpp:local pi
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The previous code worked only for full tensor reads and writes and was hitting `GGML_ASSERT(size == ggml_nbytes(tensor)); ` assert when tested with llama-server.
* Optimize Metal Tensor API usage for matmul2d
Separates the Metal Tensor API (matmul2d) path in kernel_mul_mm into its own standalone kernel, gated by GGML_METAL_HAS_TENSOR.
The legacy simdgroup_matrix kernel is preserved under #else.
Previously both paths were interleaved via #ifdef blocks within a single kernel, forcing the tensor path to share the legacy kernel's data layout and threadgroup memory scheme. Splitting the kernel enabled memory and dispatch optimizations that weren't possible when the two paths shared code structure.
* cont : cleanup
* cont : cleanup
* cont : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Change the default `ftype` in `llama_model_quantize_params` from
`LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`.
In case some external program naively uses the default quantization
params, we should probably default to a known-good type like Q8_0 rather
than Q5_1, which is rather old.
* opt arc770 for Q4_0
* add for Q4_0
* update the script
* add help script for windows
* update guide
* fix format issue
* convert from dos to unix for format issue
* fix missed -sm parameter
* switch ubuntu-latest to ubuntu-slim
* Fix the path for upload so CI doesn't fail
* Update .github/workflows/build-and-test-snapdragon.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use -slim image for key check and consistent naming for artifact dir
Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* Remove check-secret extra job
* move QDC key check for Run QDC jobs step specifically
* add a step before to check the secret for qdc jobs
---------
Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml-webgpu: add tile flash attention fallback
* ggml-webgpu: add new fields and discard usage of mnk for tile version
* ggml-webgpu: modify the vec path to discard the mnk parameter
* ggml-webgpu: enable flash attention vec and tile version for broswer
* ggml-webgpu: stagging KV for flash attention tile version
* formatting
* turn on subgroup uniformity check
* remove Q_TILE as it is always 1 for vec path
* make row_max and exp_sum to local register
* make different bindings with same underlying buffer to have the same usage flags
* move path selection into the shader library and have the host consume a single flash-attn decision object.
* turn off skip_validation and address buffer overlapping when nwg==1
* formatting
* merge binding when kv overlap
This change implements the third requested change in issue 20429.
Because defaults.sampling contains the reasoning budget token count and
the reasoning budget message, it's not necessary to assign them to
struct variables.
* hexagon: restore HTP_OPMASK_QUEUE
* hexagon: honor OPMASK_SKIP_COMPUTE in hmx-matmul
* hex-prof: restore op profiling
* hex-prof: enable PMU
* hexagon: simplify and improve op-queuing with full profiling support
Add separate profile descriptors.
* hexagon: remove opsync and rename opmask into opstage
opsync is no longer needed since the profiler is fully async now.
opmask name was confusing and opstage is more accurate.
* hexagon: refactor opbatch queue handling
* hexagon: add iface hooks for enabling profiler from the host
Also move all the PMU setup stuff out of the hex-utils since it's not inteded for normal use.
* hexagon: make profiler mode configurable
On older devices getting PMU counters is expensive so it's now optional.
* hexagon: add support for setting profiler pmu events from env
* hexagon: simplify profiler output (no need to print buffs, etc)
* hexagon: simplify pmu counter formating
* hexagon: add a simple profile post-proc tool
* hex-prof: add support for reading logs from stdin
* hexagon: document GGML_HEXAGON_PROFILE
* hex-prof: update default width for dims field
* hex-prof: fix linter warnings and errors
* Update ggml/src/ggml-hexagon/htp/htp-ops.h
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update scripts/snapdragon/ggml-hexagon-profile.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add the tests that we want to run on external CI
* remove extra files
* Fixes python issues, reove the deadlock on CI
* remove unecessary changes
* use override to ty.toml
* fix pre-commit and try tests with secret in external repo not upstream
* skip if key is unavailable
* Fix feedback
* switch hexagon to snapdragon
* cleanup
* fix secrets
* remove the copyrights at the top of the files
When testing claude code against llama.cpp, I noticed that only
n_past 18577 was used even when context was 60k or more. The log
in llama-server says:
```
slot update_slots: id 3 | task 10342 | old: ... ; cch= | defa0;You are
slot update_slots: id 3 | task 10342 | new: ... ; cch= | 1c8b4;
```
I observed that the cch value changed every time. Reading about that,
the x-anthropic-billing-header system message seems to be specially
handled inside of the anthropic api. I could remove it, but there
is a meaningful string sometimes included at the end. So instead,
I just replace the changing cch checksum with fffff.
I'm treating this as an anthropic message body API detail - I think this
is the right way to do this, but by all means please correct me!
It's always 5 hexadecimal characters, but I've written the replacement
defensively in case they change the protocol.