server: add auto-sleep after N seconds of idle (#18228)
* implement sleeping at queue level
* implement server-context suspend
* add test
* add docs
* optimization: add fast path
* make sure to free llama_init
* nits
* fix use-after-free
* allow /models to be accessed during sleeping, fix use-after-free
* don't allow accessing /models during sleep, it is not thread-safe
* fix data race on accessing props and model_meta
* small clean up
* trailing whitespace
* rm outdated comments
@@ -2887,6 +2887,16 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
             params.lora_init_without_apply = true;
         }
     ).set_examples({LLAMA_EXAMPLE_SERVER}));
+    add_opt(common_arg(
+        {"--sleep-idle-seconds"}, "SECONDS",
+        string_format("number of seconds of idleness after which the server will sleep (default: %d; -1 = disabled)", params.sleep_idle_seconds),
+        [](common_params & params, int value) {
+            if (value == 0 || value < -1) {
+                throw std::invalid_argument("invalid value: cannot be 0 or less than -1");
+            }
+            params.sleep_idle_seconds = value;
+        }
+    ).set_examples({LLAMA_EXAMPLE_SERVER}));
     add_opt(common_arg(
         {"--simple-io"},
         "use basic IO for better compatibility in subprocesses and limited consoles",
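
The validation rule in the new option handler can be illustrated in isolation. The following is a minimal sketch, not code from this commit: `validate_sleep_idle_seconds` and `sleep_enabled` are hypothetical helper names introduced here for illustration, and the assumption that any positive value enables auto-sleep is inferred from the help text ("-1 = disabled") rather than confirmed by the hunk.

```cpp
#include <stdexcept>

// Hypothetical standalone helper (not part of the commit) that mirrors the
// check in the --sleep-idle-seconds handler: 0 and anything below -1 are
// rejected, while -1 and all positive values are accepted unchanged.
int validate_sleep_idle_seconds(int value) {
    if (value == 0 || value < -1) {
        throw std::invalid_argument("invalid value: cannot be 0 or less than -1");
    }
    return value;
}

// Hypothetical helper, assuming that -1 means "disabled" (per the help text)
// and that any accepted positive value turns auto-sleep on.
bool sleep_enabled(int sleep_idle_seconds) {
    return sleep_idle_seconds > 0;
}
```

Under these assumptions, passing `--sleep-idle-seconds 300` would be accepted and enable sleeping after 300 idle seconds, `--sleep-idle-seconds -1` would be accepted and leave the feature off, and `--sleep-idle-seconds 0` would fail argument parsing with the error message shown in the diff.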