server: support multiple generations from one prompt (OAI "n" option) (#17775)

* backend support

* server: support multiple generations from one prompt (OAI "n" option)

* fix invalid batch

* format oai

* clean up

* disable ctx shift

* add test

* update comments

* fix style

* add n_cmpl to docs [no ci]

* allow using both `n_cmpl` and `n`
Author: Xuan-Son Nguyen
Date: 2025-12-06 15:54:38 +01:00
Committed by: GitHub
Parent: 09c7c50e64
Commit: c42712b056
7 changed files with 146 additions and 19 deletions
@@ -493,6 +493,8 @@ Note for `multimodal_data` in JSON object prompts. This should be an array of st
`n_keep`: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. The number excludes the BOS token.
By default, this value is set to `0`, meaning no tokens are kept. Use `-1` to retain all tokens from the prompt.
`n_cmpl`: Number of completions to generate from a single prompt. If the input contains multiple prompts, the output will contain the number of prompts times `n_cmpl` entries in total. A request sketch is shown after this list.
`stream`: Allows receiving each predicted token in real-time instead of waiting for the completion to finish (uses a different response format). To enable this, set to `true`.
`stop`: Specify a JSON array of stopping strings.
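
As a rough sketch of the new option, the requests below each ask for two completions; they assume a `llama-server` instance listening on the default `localhost:8080`. The `/completion` and `/v1/chat/completions` endpoints and the `prompt`, `n_predict`, `messages`, and `max_tokens` fields are the server's existing API; `n_cmpl` (native endpoint) and the OpenAI-style `n` (OAI-compatible endpoint) are what this commit adds.

```sh
# Native endpoint: generate two completions from one prompt.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Once upon a time",
    "n_predict": 32,
    "n_cmpl": 2
  }'

# OAI-compatible endpoint: the OpenAI "n" field requests the same behavior.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Tell me a short story."}],
    "max_tokens": 32,
    "n": 2
  }'
```

Per the documentation above, a request carrying two prompts with `n_cmpl: 2` would yield four entries in the response.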