server: /v1/responses (partial) (#18486)
* from previous PR
* Make instruction (system) the first message
* Convert [input_message] (text/image/file)
* Rename convert_responses_to_chatcmpl(body) -> response_body
* Initial tool call support
* Erase instructions field from chatcmpl body
* Feed reasoning texts to chat template
* Use std::vector instead of opaque json array
* Make output_item.added events consistent
* Move `server_task_result_cmpl_partial::update` from header to source
* Match ID of output_item.added and .done events
* Add function_call only if there is no "fc_" prefix
* Add function call output at non-streaming API
* Test if ID is persistent
* Add doc
* Fix style - use trailing comma
* Rewrite state management
* Catch up with upstream/master
* Fix style - "type" is the first item of SSE data
* Explicitly check "instructions" from response_body
* Make lambdas static
* Check if reasoning content exists
* Add `oai_resp_id` to task_result_state (also initialized at ctor), server_task_result_cmpl_partial, and server_task_result_cmpl_final
* Reject `input_file` since it is not supported by chatcmpl
* Add "fc_" prefix to non-streaming function call id, as coderabbit pointed out

---------

Co-authored-by: openingnow <>
44 additions and 1 deletion
@@ -6,7 +6,7 @@ Set of LLM REST APIs and a web UI to interact with llama.cpp.

**Features:**

* LLM inference of F16 and quantized models on GPU and CPU
-* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes
+* [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions, responses, and embeddings routes
* [Anthropic Messages API](https://docs.anthropic.com/en/api/messages) compatible chat completions
* Reranking endpoint (https://github.com/ggml-org/llama.cpp/pull/9510)
* Parallel decoding with multi-user support
@@ -1267,6 +1267,49 @@ This provides information on the performance of the server. It also allows calcu

The total number of tokens in context is equal to `prompt_n + cache_n + predicted_n`

### POST `/v1/responses`: OpenAI-compatible Responses API

*Options:*

See [OpenAI Responses API documentation](https://platform.openai.com/docs/api-reference/responses).

*Examples:*

You can use either the Python `openai` library with appropriate checkpoints:
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1", # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

response = client.responses.create(
    model="gpt-4.1",
    instructions="You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.",
    input="Write a limerick about python exceptions"
)

print(response.output_text)
```

... or raw HTTP requests:

```shell
curl http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer no-key" \
-d '{
"model": "gpt-4.1",
"instructions": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests.",
"input": "Write a limerick about python exceptions"
}'
```
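
Streaming is also available. The sketch below is a minimal example using the same Python `openai` client; it assumes the server emits OpenAI-style Responses streaming events with a leading `"type"` field, as the commit message above suggests (`output_item.added`/`.done` events, `"type"` first in the SSE data):

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

# stream=True yields server-sent events; each event carries a "type" field
# such as "response.output_text.delta" or "response.output_item.done".
stream = client.responses.create(
    model="gpt-4.1",
    input="Write a limerick about python exceptions",
    stream=True,
)

for event in stream:
    # Print incremental text as it arrives; ignore other event types.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```
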
This endpoint works by converting the Responses API request into a Chat Completions request.
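
Conceptually, the mapping looks roughly like the Python sketch below. This is illustrative only: the actual conversion lives in the C++ server code, and the function and variable names here are hypothetical. Per the commit message, `instructions` becomes the leading system message and is erased from the converted body, and `input_file` items are rejected because chat completions cannot express them:

```python
# Illustrative sketch of the request mapping; not the server's actual code.
def responses_to_chatcmpl(response_body: dict) -> dict:
    messages = []
    # "instructions" is turned into the leading system message and then
    # dropped from the converted request body.
    if "instructions" in response_body:
        messages.append({"role": "system", "content": response_body["instructions"]})
    inp = response_body["input"]
    if isinstance(inp, str):
        # A plain string input maps to a single user message.
        messages.append({"role": "user", "content": inp})
    else:
        # Structured input items (text/image) are converted one by one;
        # "input_file" is rejected since chatcmpl does not support it.
        for item in inp:
            if item.get("type") == "input_file":
                raise ValueError("input_file is not supported")
            messages.append(item)
    return {"model": response_body["model"], "messages": messages}
```
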
### POST `/v1/embeddings`: OpenAI-compatible embeddings API

This endpoint requires that the model uses a pooling type other than `none`. The embeddings are normalized using the Euclidean norm.
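
For example, assuming the server was started with an embedding-capable configuration (e.g. the `--embeddings` flag and a pooling mode such as `--pooling mean`), a minimal sketch with the Python `openai` client:

```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required"
)

# The model field is a placeholder; llama.cpp serves whichever model it loaded.
result = client.embeddings.create(
    model="gpt-4.1",
    input="Write a limerick about python exceptions",
)

print(len(result.data[0].embedding))  # dimensionality of the normalized vector
```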