docs: more extensive RoPE documentation [no ci] (#21953)
* more extensive ggml_rope documentation
* add more docs
* nits
@@ -130,6 +130,23 @@ Note:
- Adding a model-specific API or CLI is an anti-pattern in `libmtmd`. The goal of `libmtmd` is to provide an easy-to-use, model-agnostic library for multimodal pipelines.
- In most cases, `llama-mtmd-cli` should not be modified. If a model requires a specific prompt, either let the user provide it or bake it into the Jinja chat template.
## Tips and tricks
### Working with ggml_rope_ext
PyTorch implementations usually prefer explicitly calculating the `freq_cis`/`sin`/`cos` components. In llama.cpp, however, most RoPE operations can be handled via `ggml_rope_ext`, which does not require a sin/cos matrix. This saves memory and allows the GGML RoPE kernel to be fused with other ops.
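To make the contrast concrete, here is a minimal Python sketch (illustrative only; assumed conventions rather than actual llama.cpp or PyTorch code). The explicit approach materializes an `[n_pos, n_dims/2]` table of cos/sin pairs up front, while the table-free approach recomputes the same `theta` from the position and `freq_base` at the point of use:

```python
import math

def rope_pair_table(n_pos, n_dims, freq_base=10000.0):
    # explicit precompute: an [n_pos, n_dims/2] table of (cos, sin) pairs,
    # as PyTorch implementations typically do with freq_cis
    table = []
    for pos in range(n_pos):
        row = []
        for i in range(n_dims // 2):
            theta = pos * freq_base ** (-2.0 * i / n_dims)
            row.append((math.cos(theta), math.sin(theta)))
        table.append(row)
    return table

def rotate_on_the_fly(x_pair, pos, i, n_dims, freq_base=10000.0):
    # table-free: derive theta directly from (pos, i), no sin/cos matrix needed
    theta = pos * freq_base ** (-2.0 * i / n_dims)
    a, b = x_pair
    return (a * math.cos(theta) - b * math.sin(theta),
            a * math.sin(theta) + b * math.cos(theta))

# both paths rotate the pair (1.0, 0.0) at pos=3, dim pair i=1 identically
table = rope_pair_table(n_pos=4, n_dims=8)
c, s = table[3][1]
assert rotate_on_the_fly((1.0, 0.0), 3, 1, 8) == (c, s)
```

The table grows with the context length, which is exactly the memory the table-free formulation avoids.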
However, since `ggml_rope_ext` only provides a subset of the RoPE implementations that models use, converting models from PyTorch to llama.cpp may require some creative adaptations.
For more information about `ggml_rope_ext`, please refer to the in-code documentation in `ggml.h`.
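One detail documented there that matters when porting is pair ordering: with `GGML_ROPE_TYPE_NORMAL` the rotated pair is `(x[2i], x[2i+1])` (interleaved), while with `GGML_ROPE_TYPE_NEOX` it is `(x[i], x[i + n_dims/2])` (split halves). A minimal Python sketch of the two pairings (illustrative only, not the actual kernel):

```python
import math

def rope_normal(x, pos, n_dims, freq_base=10000.0):
    # GGML_ROPE_TYPE_NORMAL: pair (x[2i], x[2i+1]) is rotated by theta_i
    out = list(x)
    for i in range(n_dims // 2):
        theta = pos * freq_base ** (-2.0 * i / n_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i]     = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_neox(x, pos, n_dims, freq_base=10000.0):
    # GGML_ROPE_TYPE_NEOX: pair (x[i], x[i + n_dims/2]) is rotated by theta_i
    out = list(x)
    half = n_dims // 2
    for i in range(half):
        theta = pos * freq_base ** (-2.0 * i / n_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i]        = a * c - b * s
        out[i + half] = a * s + b * c
    return out

# dims beyond n_dims are left unrotated in both modes
head = [1.0] * 8
assert rope_normal(head, pos=5, n_dims=4)[4:] == [1.0] * 4
assert rope_neox(head, pos=5, n_dims=4)[4:] == [1.0] * 4
```

The untouched trailing dims correspond to the `0` entries in the layout diagrams in `ggml.h`.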
Examples:
- `libmtmd` implements 2D RoPE with `GGML_ROPE_TYPE_NORMAL` ordering by splitting the input tensor in half, applying `ggml_rope_ext` separately to each half, then joining them back together using `ggml_concat`.
- The [Kimi-K2.5](https://github.com/ggml-org/llama.cpp/pull/19170) vision encoder uses vision RoPE with interleaved frequencies. The weights must be permuted during conversion in order to reuse the `build_rope_2d()` function.
- [Gemma 4](https://github.com/ggml-org/llama.cpp/pull/21309) uses "proportional" RoPE. We employ a trick where `rope_freqs` is set to a very large value in the last dimensions to prevent those dimensions from being rotated. See the `Gemma4Model` class in `convert_hf_to_gguf.py`.
- Some models require scaling the input position. For example, `[0, 1, 2, ...]` becomes `[0, 0.5, 1, ...]`. In this case, you can provide the scaling via `freq_scale = 0.5f`.
- Some models use learned RoPE frequencies instead of relying on `powf(freq_base, -2.0 * i / n_dims)`. In this case, you can provide the learned frequencies via the `rope_freqs` tensor (corresponding to the `c` argument in `ggml_rope_ext`), then set `freq_base = 1.0f`. An important note is that `rope_freqs` in GGML is the **inverse** (`theta = pos[i] / rope_freqs`), so you may need to invert `rope_freqs` during conversion.
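The last two tricks can be sanity-checked numerically. Here is a small Python sketch (a sketch only; `theta_ggml` is an illustrative name mirroring the `ggml_rope_ext` pseudo-code, and `inv_freq` is the name commonly found in PyTorch checkpoints, not a llama.cpp identifier):

```python
import math

def theta_ggml(pos, i, n_dims, freq_base=10000.0, freq_scale=1.0, rope_freqs=None):
    # mirrors the ggml_rope_ext pseudo-code: theta is divided by the
    # (optional) rope_freqs factor, then scaled by freq_scale
    theta = pos * freq_base ** (-2.0 * i / n_dims)
    if rope_freqs is not None:
        theta /= rope_freqs[i]
    return theta * freq_scale

n_dims = 8

# trick 1: positions [0, 0.5, 1, ...] are equivalent to freq_scale = 0.5
for pos in range(4):
    for i in range(n_dims // 2):
        assert math.isclose(theta_ggml(pos, i, n_dims, freq_scale=0.5),
                            theta_ggml(pos * 0.5, i, n_dims))

# trick 2: learned PyTorch-style frequencies (theta = pos * inv_freq[i]) are
# stored inverted as rope_freqs, with freq_base = 1.0
inv_freq = [0.25, 0.5, 1.0, 2.0]          # hypothetical learned values
rope_freqs = [1.0 / f for f in inv_freq]  # inverted during conversion
for pos in range(4):
    for i in range(n_dims // 2):
        assert math.isclose(theta_ggml(pos, i, n_dims, freq_base=1.0,
                                       rope_freqs=rope_freqs),
                            pos * inv_freq[i])
```

Since `freq_base = 1.0` makes the `powf` term equal 1, the division by `rope_freqs` carries the entire frequency schedule, which is why the learned values must be inverted at conversion time.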
## GGUF specification
https://github.com/ggml-org/ggml/blob/master/docs/gguf.md
@@ -1773,8 +1773,32 @@ extern "C" {
int n_dims,
int mode);
// RoPE operations with extended options
// a is the input tensor to apply RoPE to, shape [n_embd, n_head, n_token]
// b is an int32 vector with size n_token, containing the position of each token
// c is the freq factors tensor (e.g. phi3-128k), optional
// mode can be GGML_ROPE_TYPE_NORMAL or NEOX; for the MROPE and VISION modes, use ggml_rope_multi
//
// pseudo-code for computing theta (for the token at position pos = b[token_idx]):
//   for i in [0, n_dims/2):
//       theta[i] = pos * powf(freq_base, -2.0 * i / n_dims);
//       theta[i] = theta[i] / c[i];          # if c is provided, divide theta by c
//       theta[i] = rope_yarn(theta[i], ...); # note: theta = theta * freq_scale is applied here
//
// the other params are used by YaRN RoPE scaling; these default values disable YaRN:
//   freq_scale  = 1.0f
//   ext_factor  = 0.0f
//   attn_factor = 1.0f
//   beta_fast   = 0.0f
//   beta_slow   = 0.0f
//
// example (marking: c = cos, s = sin, 0 = unrotated):
//   given a single head with size = 8      --> [00000000]
//   GGML_ROPE_TYPE_NORMAL with n_dims = 4  --> [cscs0000]
//   GGML_ROPE_TYPE_NORMAL with n_dims = 8  --> [cscscscs]
//   GGML_ROPE_TYPE_NEOX   with n_dims = 4  --> [ccss0000]
//   GGML_ROPE_TYPE_NEOX   with n_dims = 8  --> [ccccssss]
GGML_API struct ggml_tensor * ggml_rope_ext(
struct ggml_context * ctx,
struct ggml_tensor * a,
@@ -1790,6 +1814,36 @@ extern "C" {
float beta_fast,
float beta_slow);

// multi-dimensional RoPE, for Qwen-VL and similar vision models
// mode can be VISION, MROPE, or IMROPE; it cannot be combined with NORMAL or NEOX
// sections specify how many dimensions to rotate in each section:
//   a section length is the number of cos/sin pairs, NOT the number of dims
//   (i.e. the sum of the 4 sections is expected to be n_dims/2)
//   trailing sections can be 0, meaning they are ignored
// all other options are identical to ggml_rope_ext
//
// important notes:
// - NEOX ordering is automatically applied and cannot be disabled for MROPE and VISION
//   if you need normal ordering, there are 2 methods:
//     (1) split the tensor manually using ggml_view
//     (2) permute the weights upon conversion
// - for VISION, n_dims must be head_size/2
//
// example M-RoPE:
//   given sections = [t=4, y=2, x=2, 0]
//   given a single head with size = 18     --> [000000000000000000]
//   GGML_ROPE_TYPE_MROPE  with n_dims = 16 --> [ttttyyxxttttyyxx00] (cos/sin are applied in NEOX ordering)
//   GGML_ROPE_TYPE_IMROPE with n_dims = 16 --> [ttyxttyxttyxttyx00] (interleaved M-RoPE, still NEOX ordering)
//   note: the theta for each dim is computed the same way as in ggml_rope_ext, regardless of the section
//         in other words, the idx used for theta is [0, 1, 2, ..., n_dims/2), not reset for each section
//
// example vision RoPE:
//   given sections = [y=4, x=4, 0, 0] (last 2 sections are ignored)
//   given a single head with size = 8      --> [00000000]
//   GGML_ROPE_TYPE_VISION with n_dims = 4  --> [yyyyxxxx]
//   other values of n_dims are untested; behavior is undefined
//   note: unlike MROPE, the theta for each dim is computed differently for each section
//         in other words, the idx used for theta is [0123] for the y section, then [0123] for the x section
GGML_API struct ggml_tensor * ggml_rope_multi(
struct ggml_context * ctx,
struct ggml_tensor * a,