* server : add thinking content blocks to Anthropic Messages API Add support for returning reasoning/thinking content in Anthropic API responses when using models with --reasoning-format deepseek and the thinking parameter enabled. - Non-streaming: adds thinking block before text in content array - Streaming: emits thinking_delta events with correct block indices - Partial streaming: tracks reasoning state across chunks via anthropic_has_reasoning member variable Tested with bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF model. * server : fix Anthropic API streaming for thinking content blocks Add signature field and fix duplicate content_block_start events in Anthropic Messages API streaming responses for reasoning models. * server: refactor Anthropic streaming state to avoid raw pointer Replace raw pointer to task_result_state with direct field copies: - Copy state fields in update() before processing chunk - Use local copies in to_json_anthropic() instead of dereferencing - Pre-compute state updates for next chunk in update() This makes the data flow clearer and avoids unsafe pointer patterns.
Server tests
Python based server tests scenario using pytest.
Tests target GitHub workflows job runners with 4 vCPU.
Note: If the host architecture inference speed is faster than GitHub runners one, parallel scenario may randomly fail.
To mitigate it, you can increase values in n_predict, kv_size.
Install dependencies
pip install -r requirements.txt
Run tests
- Build the server
cd ../../..
cmake -B build
cmake --build build --target llama-server
- Start the test:
./tests.sh
It's possible to override some scenario steps values with environment variables:
| variable | description |
|---|---|
PORT |
context.server_port to set the listening port of the server during scenario, default: 8080 |
LLAMA_SERVER_BIN_PATH |
to change the server binary path, default: ../../../build/bin/llama-server |
DEBUG |
to enable steps and server verbose mode --verbose |
N_GPU_LAYERS |
number of model layers to offload to VRAM -ngl --n-gpu-layers |
LLAMA_CACHE |
by default server tests re-download models to the tmp subfolder. Set this to your cache (e.g. $HOME/Library/Caches/llama.cpp on Mac or $HOME/.cache/llama.cpp on Unix) to avoid this |
To run slow tests (will download many models, make sure to set LLAMA_CACHE if needed):
SLOW_TESTS=1 ./tests.sh
To run with stdout/stderr display in real time (verbose output, but useful for debugging):
DEBUG=1 ./tests.sh -s -v -x
To run all the tests in a file:
./tests.sh unit/test_chat_completion.py -v -x
To run a single test:
./tests.sh unit/test_chat_completion.py::test_invalid_chat_completion_req
Hint: You can compile and run test in single command, useful for local developement:
cmake --build build -j --target llama-server && ./tools/server/tests/tests.sh
To see all available arguments, please refer to pytest documentation
Debugging external llama-server
It can sometimes be useful to run the server in a debugger when invesigating test
failures. To do this, the environment variable DEBUG_EXTERNAL=1 can be set
which will cause the test to skip starting a llama-server itself. Instead, the
server can be started in a debugger.
Example using gdb:
$ gdb --args ../../../build/bin/llama-server \
--host 127.0.0.1 --port 8080 \
--temp 0.8 --seed 42 \
--hf-repo ggml-org/models --hf-file tinyllamas/stories260K.gguf \
--batch-size 32 --no-slots --alias tinyllama-2 --ctx-size 512 \
--parallel 2 --n-predict 64
And a break point can be set in before running:
(gdb) br server.cpp:4604
(gdb) r
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
And then the test in question can be run in another terminal:
(venv) $ env DEBUG_EXTERNAL=1 ./tests.sh unit/test_chat_completion.py -v -x
And this should trigger the breakpoint and allow inspection of the server state in the debugger terminal.