server: add presets (config) when using multiple models (#17859)
* llama-server: recursive GGUF loading

  Replace flat directory scan with recursive traversal using
  std::filesystem::recursive_directory_iterator. Support for nested
  vendor/model layouts (e.g. vendor/model/*.gguf). Model name now reflects
  the relative path within --models-dir instead of just the filename.
  Aggregate files by parent directory via std::map before constructing
  local_model

* server : router config POC (INI-based per-model settings)

* server: address review feedback from @aldehir and @ngxson

  PEG parser usage improvements:
  - Simplify parser instantiation (remove arena indirection)
  - Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
  - Fix last line without newline bug (+ operator instead of <<)
  - Remove redundant end position check

  Feature scope:
  - Remove auto-reload feature (will be separate PR per @ngxson)
  - Keep config.ini auto-creation and template generation
  - Preserve per-model customization logic

  Co-authored-by: aldehir <aldehir@users.noreply.github.com>
  Co-authored-by: ngxson <ngxson@users.noreply.github.com>

* server: adopt aldehir's line-oriented PEG parser

  Complete rewrite of INI parser grammar and visitor:
  - Use p.chars(), p.negate(), p.any() instead of p.until()
  - Support end-of-line comments (key=value # comment)
  - Handle EOF without trailing newline correctly
  - Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
  - Simplified visitor (no pending state, no trim needed)
  - Grammar handles whitespace natively via eol rule

  Business validation preserved:
  - Reject section names starting with LLAMA_ARG_*
  - Accept only keys starting with LLAMA_ARG_*
  - Require explicit section before key-value pairs

  Co-authored-by: aldehir <aldehir@users.noreply.github.com>

* server: fix CLI/env duplication in child processes

  Children now receive minimal CLI args (executable, model, port, alias)
  instead of inheriting all router args. Global settings pass through
  LLAMA_ARG_* environment variables only, eliminating duplicate config
  warnings.

  Fixes: Router args like -ngl, -fa were passed both via CLI and env,
  causing 'will be overwritten' warnings on every child spawn

* add common/preset.cpp

* fix compile

* cont

* allow custom-path models

* add falsey check

* server: fix router model discovery and child process spawning

  - Sanitize model names: replace / and \ with _ for display
  - Recursive directory scan with relative path storage
  - Convert relative paths to absolute when spawning children
  - Filter router control args from child processes
  - Refresh args after port assignment for correct port value
  - Fallback preset lookup for compatibility
  - Fix missing argv[0]: store server binary path before base_args parsing

* Revert "server: fix router model discovery and child process spawning"

  This reverts commit e3832b42eeea7fcb108995966c7584479f745857.

* clarify about "no-" prefix

* correct render_args() to include binary path

* also remove arg LLAMA_ARG_MODELS_PRESET for child

* add co-author for ini parser code

  Co-authored-by: aldehir <hello@alde.dev>

* also set LLAMA_ARG_HOST

* add CHILD_ADDR

* Remove dead code

---------

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: aldehir <hello@alde.dev>
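For illustration, the nested vendor/model layout the first commit describes might look like this (a sketch; directory and file names are made up):

```sh
# build a nested vendor/model tree and point the router at it
mkdir -p models/vendor-a/model-x
cp model-x-Q8_0.gguf models/vendor-a/model-x/

# the recursive scan discovers models/vendor-a/model-x/model-x-Q8_0.gguf;
# the model name reflects the relative path within --models-dir,
# not just the bare filename
llama-server --models-dir ./models
```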
@@ -1369,6 +1369,11 @@ llama-server
### Model sources

There are 3 possible sources for model files:

1. Cached models (controlled by the `LLAMA_CACHE` environment variable)
2. Custom model directory (set via the `--models-dir` argument)
3. Custom preset (set via the `--models-preset` argument)
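
For example, each source can be selected as follows (a minimal sketch; the directory and file paths are illustrative):

```sh
# 1. use cached models (the default)
llama-server

# 2. serve the GGUF files found under a custom directory
llama-server --models-dir ./models

# 3. load per-model settings from a preset file
llama-server --models-preset ./my-models.ini
```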

By default, the router looks for models in the cache. You can add Hugging Face models to the cache with:

```sh
@@ -1413,6 +1418,51 @@ llama-server -ctx 8192 -n 1024 -np 2
Note: model instances inherit both command line arguments and environment variables from the router server.
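
For instance, a global setting passed to the router applies to every model instance it spawns (a sketch; the values are illustrative):

```sh
# both the environment variable and the CLI argument below are
# inherited by each spawned model instance
LLAMA_ARG_N_GPU_LAYERS=99 llama-server --models-dir ./models
```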

Alternatively, you can also add a GGUF-based preset (see the next section).

### Model presets

Model presets allow advanced users to define custom configurations using an `.ini` file:

```sh
llama-server --models-preset ./my-models.ini
```

Each section in the file defines a new preset. Keys within a section correspond to command-line arguments (without the leading dashes). For example, the argument `--n-gpu-layers 123` is written as `n-gpu-layers = 123`.

Short argument forms (e.g., `c`, `ngl`) and environment variable names (e.g., `LLAMA_ARG_N_GPU_LAYERS`) are also supported as keys.

Example:

```ini
version = 1

; If the key corresponds to an existing model on the server,
; this will be used as the default config for that model
[ggml-org/MY-MODEL-GGUF:Q8_0]
; string value
chat-template = chatml
; numeric value
n-gpu-layers = 123
; flag value (for certain flags, you need to use the "no-" prefix for negation)
jinja = true
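; (illustration with a hypothetical flag: a default-on flag "foo"
;  would be disabled with "no-foo = true")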
; shorthand argument (for example, context size)
c = 4096
; environment variable name
LLAMA_ARG_CACHE_RAM = 0
; file paths are relative to the server's CWD:
;   model-draft = ./my-models/draft.gguf
; but it is RECOMMENDED to use an absolute path:
model-draft = /Users/abc/my-models/draft.gguf

; If the key does NOT correspond to an existing model,
; you need to specify at least the model path
[custom_model]
model = /Users/abc/my-awesome-model-Q4_K_M.gguf
```
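
Once the router is running, preset sections should appear as selectable models; for example (assuming the OpenAI-compatible `/v1/models` endpoint and the default port):

```sh
# list all models known to the router, including preset-defined ones
curl http://localhost:8080/v1/models
```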

Note: some arguments are controlled by the router (e.g., host, port, API key, HF repo, model alias). They will be removed or overwritten upon loading.

### Routing requests

Requests are routed according to the requested model name.
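
For example, to target the `custom_model` preset defined above (a sketch; assumes the OpenAI-compatible chat completions endpoint):

```sh
# the "model" field selects which model instance serves the request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "custom_model", "messages": [{"role": "user", "content": "Hello"}]}'
```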