common : introduce composable PEG parser combinators for chat parsing (#17136)
* common : implement parser combinators to simplify chat parsing * add virtual destructor to parser_base * fix memory leak from circular references of rules * implement gbnf grammar building * remove unused private variable * create a base visitor and implement id assignment as a visitor * fix const ref for grammar builder * clean up types, friend classes, and class declarations * remove builder usage from until_parser * Use a counter class to help assign rule ids * cache everything * add short description for each parser * create a type for the root parser * implement repetition parser * Make optional, one_or_more, and zero_or_more subclasses of repetition * improve context constructor * improve until parsing and add benchmarks * remove cached() pattern, cache in parser_base with specialized parsing functions for each parser * improve json parsing performance to better match legacy parsing * fix const auto * it for windows * move id assignment to classes instead of using a visitor * create named rules in the command r7b example * use '.' for any in GBNF * fix parens around choices in gbnf grammar * add convenience operators to turn strings to literals * add free-form operators for const char * to simplify defining literals * simplify test case parser * implement semantic actions * remove groups in favor of actions and a scratchpad * add built in actions for common operations * add actions to command r7b example * use std::default_searcher for platforms that don't have bm * improve parser_type handling and add cast helper * add partial result type to better control when to run actions * fix bug in until() * run actions on partial results by default * use common_chat_msg for result * add qwen3 example wip * trash partial idea and simplify * move action arguments to a struct * implement aho-corasick matcher for until_parser and to build exclusion grammars * use std::string for input, since std::string_view is incompatible with std::regex * Refactor tests * improve qwen3 example * implement sax-style parsing and refactor * fix json string in test * rename classes to use common_chat_ prefix * remove is_ suffix from functions * rename from id_counter to just counter * Final refactored tests * Fix executable name and editorconfig-checker * Third time's the charm... * add trigger parser to begin lazy grammar rule generation * working lazy grammar * refactor json rules now that we check for reachability * reduce pointer usage * print out grammars in example * rename to chat-peg-parser* and common_chat_peg_parser* * Revert unrelated changes * New macros for CMakeLists to enable multi-file compilations * starting unicode support * add unicode support to char_parser * use unparsed args as additional sources * Refactor tests to new harness * Fix CMakeLists * fix rate calculation * add unicode tests * fix trailing whitespace and line endings skip-checks: true * Helpers + rewrite qwen3 with helpers * Fix whitespace * extract unicode functions to separate file * refactor parse unicode function * fix compiler error * improve construction of sequence/choice parsers * be less clever * add make_parser helper function * expand usage of make_parser, alias common_chat_msg_peg_parser_builder to builder in source * lower bench iterations * add unicode support to until_parser * add unicode support to json_string_parser * clean up unicode tests * reduce unicode details to match src/unicode.cpp * simplify even further * remove unused functions * fix type * reformat char class parsing * clean up json string parser * clean up + fix diagnostics * reorder includes * compact builder functions * replace action_parser with capture_parser, rename env to semantics * rename env to semantics * clean up common_chat_parse_context * move type() to below constant * use default constructor for common_chat_peg_parser * make all operators functions for consistency * fix compilation errors in test-optional.cpp * simplify result values * rename json_string_unquoted to json_string_content * Move helper to separate class, add separate explicit and helper classes * Whitespace * Change + to append() * Reformat * Add extra helpers, tests and Minimax example * Add some extra optional debugging prints + real example of how to use them * fix bug in repetitions when min_count = 0 reports failures * dump rule in debug * fix token accumulation and assert parsing never fails * indent debug by depth * use LOG_* in tests so logs sync up with test logs * - Add selective testing - Refactor all messaging to use LOG_ERR - Fix lack of argument / tool name capturing - Temporary fix for double event capture * refactor rule() and introduce ref() * clean up visitor * clean up indirection in root parser w.r.t rules * store shared ptr directly in parser classes * replace aho-corasick automation with a simple trie * Reset prev for qwen3 helper example variant * refactor to use value semantics with std::variant/std::visit * simplify trie_matcher result * fix linting issues * add annotations to rules * revert test workaround * implement serializing the parser * remove redundant parsers * remove tests * gbnf generation fixes * remove LOG_* use in tests * update gbnf tests to test entire grammar * clean up gbnf generation and fix a few bugs * fix typo in test output * remove implicit conversion rules * improve test output * rename trie_matcher to trie * simplify trie to just know if a node is the end of a word * remove common_chat_ prefix and ensure a common_peg_ prefix to all types * rename chat-peg-parser -> peg-parser * promote chat-peg-parser-helper to chat-peg-parser * checkpoint * use a static_assert to ensure we handle every branch * inline trivial peg parser builders * use json strings for now * implement basic and native chat peg parser builders/extractors * resolve refs to their rules * remove packrat caching (for now) * update tests * compare parsers with incremental input * benchmark both complete and incremental parsing * add raw string generation from json schema * add support for string schemas in gbnf generation * fix qwen example to include \n * tidy up example * rename extractor to mapper * rename ast_arena to ast * place basic tests into one * use gbnf_format_literal from json-schema-to-grammar * integrate parser with common/chat and server * clean up schema and serialization * add json-schema raw string tests * clean up json creation and remove capture parser * trim spaces from reasoning and content * clean up redundant rules and comments * rename input_is_complete to is_partial to match rest of project * simplify json rules * remove extraneous file * remove comment * implement += and |= operators * add comments to qwen3 implementation * reorder arguments to common_chat_peg_parse * remove commented outdated tests * add explicit copy constructor * fix operators and constness * wip: update test-chat for qwen3-coder * bring json parser closer to json-schema-to-grammar rules * trim trailing space for most things * fix qwen3 coder rules w.r.t. trailing spaces * group rules * do not trim trailing space from string args * tweak spacing of qwen3 grammar * update qwen3-coder tests * qwen3-coder small fixes * place parser in common_chat_syntax to simplify invocation * use std::set to collect rules to keep order predictable for tests * initialize parser to make certain platforms happy * revert back to std::unordered_set, sort rule names at the end instead * uncomment rest of chat tests * define explicit default constructor * improve arena init and server integration * fix chat test * add json_member() * add a comprehensive native example * clean up example qwen test and add response_format example to native test * make build_peg_parser accept std::function instead of template * change peg parser parameters into const ref * push tool call on tool open for constructed parser * add parsing documentation * clean up some comments * add json schema support to qwen3-coder * add id initializer in tests * remove grammar debug line from qwen3-coder * refactor qwen3-coder to use sequence over operators * only call common_chat_peg_parse if appropriate format * simplify qwen3-coder space handling * revert qwen3-coder implementation * revert json-schema-to-grammar changes * remove unnecessary forward declaration * small adjustment to until_parser * rename C/C++ files to use dashes * codeowners : add aldehir to peg-parser and related files --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
This commit is contained in:
@@ -0,0 +1,288 @@
|
||||
# Parsing Model Output
|
||||
|
||||
The `common` library contains a PEG parser implementation suitable for parsing
|
||||
model output.
|
||||
|
||||
Types with the prefix `common_peg_*` are intended for general use and may have
|
||||
applications beyond parsing model output, such as parsing user-provided regex
|
||||
patterns.
|
||||
|
||||
Types with the prefix `common_chat_peg_*` are specialized helpers for model
|
||||
output.
|
||||
|
||||
The parser features:
|
||||
|
||||
- Partial parsing of streaming input
|
||||
- Built-in JSON parsers
|
||||
- AST generation with semantics via "tagged" nodes
|
||||
|
||||
## Example
|
||||
|
||||
Below is a contrived example demonstrating how to use the PEG parser to parse
|
||||
output from a model that emits arguments as JSON.
|
||||
|
||||
```cpp
|
||||
auto parser = build_chat_peg_native_parser([&](common_chat_peg_native_builder & p) {
|
||||
// Build a choice of all available tools
|
||||
auto tool_choice = p.choice();
|
||||
for (const auto & tool : tools) {
|
||||
const auto & function = tool.at("function");
|
||||
std::string name = function.at("name");
|
||||
const auto & schema = function.at("parameters");
|
||||
|
||||
auto tool_name = p.json_member("name", "\"" + p.literal(name) + "\"");
|
||||
auto tool_args = p.json_member("arguments", p.schema(p.json(), "tool-" + name + "-schema", schema));
|
||||
|
||||
tool_choice |= p.rule("tool-" + name, "{" << tool_name << "," << tool_args << "}");
|
||||
}
|
||||
|
||||
// Define the tool call structure: <tool_call>[{tool}]</tool_call>
|
||||
auto tool_call = p.trigger_rule("tool-call",
|
||||
p.sequence({
|
||||
p.literal("<tool_call>["),
|
||||
tool_choice,
|
||||
p.literal("]</tool_call>")
|
||||
})
|
||||
);
|
||||
|
||||
// Parser accepts content, optionally followed by a tool call
|
||||
return p.sequence({
|
||||
p.content(p.until("<tool_call>")),
|
||||
p.optional(tool_call),
|
||||
p.end()
|
||||
});
|
||||
});
|
||||
```
|
||||
|
||||
For a more complete example, see `test_example_native()` in
|
||||
[tests/test-chat-peg-parser.cpp](tests/test-chat-peg-parser.cpp).
|
||||
|
||||
## Parsers/Combinators
|
||||
|
||||
### Basic Matchers
|
||||
|
||||
- **`eps()`** - Matches nothing and always succeeds (epsilon/empty match)
|
||||
- **`start()`** - Matches the start of input (anchor `^`)
|
||||
- **`end()`** - Matches the end of input (anchor `$`)
|
||||
- **`literal(string)`** - Matches an exact literal string
|
||||
- **`any()`** - Matches any single character (`.`)
|
||||
|
||||
### Combinators
|
||||
|
||||
- **`sequence(...)`** - Matches parsers in order; all must succeed
|
||||
- **`choice(...)`** - Matches the first parser that succeeds from alternatives (ordered choice)
|
||||
- **`one_or_more(p)`** - Matches one or more repetitions (`+`)
|
||||
- **`zero_or_more(p)`** - Matches zero or more repetitions (`*`)
|
||||
- **`optional(p)`** - Matches zero or one occurrence (`?`)
|
||||
- **`repeat(p, min, max)`** - Matches between min and max repetitions (use `-1` for unbounded)
|
||||
- **`repeat(p, n)`** - Matches exactly n repetitions
|
||||
|
||||
### Lookahead
|
||||
|
||||
- **`peek(p)`** - Positive lookahead: succeeds if parser succeeds without consuming input (`&`)
|
||||
- **`negate(p)`** - Negative lookahead: succeeds if parser fails without consuming input (`!`)
|
||||
|
||||
### Character Classes & Utilities
|
||||
|
||||
- **`chars(classes, min, max)`** - Matches repetitions of characters from a character class
|
||||
- **`space()`** - Matches zero or more whitespace characters (space, tab, newline)
|
||||
- **`until(delimiter)`** - Matches characters until delimiter is found (delimiter not consumed)
|
||||
- **`until_one_of(delimiters)`** - Matches characters until any delimiter in the list is found
|
||||
- **`rest()`** - Matches everything remaining (`.*`)
|
||||
|
||||
### JSON Parsers
|
||||
|
||||
- **`json()`** - Complete JSON parser (objects, arrays, strings, numbers, booleans, null)
|
||||
- **`json_object()`** - JSON object parser
|
||||
- **`json_array()`** - JSON array parser
|
||||
- **`json_string()`** - JSON string parser
|
||||
- **`json_number()`** - JSON number parser
|
||||
- **`json_bool()`** - JSON boolean parser
|
||||
- **`json_null()`** - JSON null parser
|
||||
- **`json_string_content()`** - JSON string content without surrounding quotes
|
||||
- **`json_member(key, p)`** - JSON object member with specific key and value parser
|
||||
|
||||
### Grammar Building
|
||||
|
||||
- **`ref(name)`** - Creates a lightweight reference to a named rule (for recursive grammars)
|
||||
- **`rule(name, p, trigger)`** - Creates a named rule and returns a reference
|
||||
- **`trigger_rule(name, p)`** - Creates a trigger rule (entry point for lazy grammar generation)
|
||||
- **`schema(p, name, schema, raw)`** - Wraps parser with JSON schema metadata for grammar generation
|
||||
|
||||
### AST Control
|
||||
|
||||
- **`atomic(p)`** - Prevents AST node creation for partial parses
|
||||
- **`tag(tag, p)`** - Creates AST nodes with semantic tags (multiple nodes can share tags)
|
||||
|
||||
## GBNF Grammar Generation
|
||||
|
||||
The PEG parser also acts as a convenient DSL for generating GBNF grammars, with
|
||||
some exceptions.
|
||||
|
||||
```cpp
|
||||
data.grammar = build_grammar([&](const common_grammar_builder & builder) {
|
||||
foreach_function(params.tools, [&](const json & fn) {
|
||||
builder.resolve_refs(fn.at("parameters"));
|
||||
});
|
||||
parser.build_grammar(builder, data.grammar_lazy);
|
||||
});
|
||||
```
|
||||
|
||||
The notable exception is the `negate(p)` lookahead parser, which cannot be
|
||||
defined as a CFG grammar and therefore does not produce a rule. Its usage
|
||||
should be limited and preferably hidden behind a `schema()` parser. In many
|
||||
cases, `until(delimiter)` or `until_one_of(delimiters)` is a better choice.
|
||||
|
||||
Another limitation is that the PEG parser requires an unambiguous grammar. In
|
||||
contrast, the `llama-grammar` implementation can support ambiguous grammars,
|
||||
though they are difficult to parse.
|
||||
|
||||
### Lazy Grammars
|
||||
|
||||
During lazy grammar generation, only rules reachable from a `trigger_rule(p)`
|
||||
are emitted in the grammar. All trigger rules are added as alternations in the
|
||||
root rule. It is still necessary to define trigger patterns, as the parser has
|
||||
no interaction with the grammar sampling.
|
||||
|
||||
### JSON Schema
|
||||
|
||||
The `schema(p, name, schema, raw)` parser will use the `json-schema-to-grammar`
|
||||
implementation to generate the grammar instead of the underlying parser.
|
||||
|
||||
The `raw` option emits a grammar suitable for a raw string instead of a JSON
|
||||
string. In other words, it won't be wrapped in quotes or require escaping
|
||||
quotes. It should only be used when `type == "string"`.
|
||||
|
||||
The downside is that it can potentially lead to ambiguous grammars. For
|
||||
example, if a user provides the pattern `^.*$`, the following grammar may be
|
||||
generated:
|
||||
|
||||
```
|
||||
root ::= "<arg>" .* "</arg>"
|
||||
```
|
||||
|
||||
This creates an ambiguous grammar that cannot be parsed by the PEG parser. To
|
||||
help mitigate this, if `.*` is found in the pattern, the grammar from the
|
||||
underlying parser will be emitted instead.
|
||||
|
||||
## Common AST Shapes for Chat Parsing
|
||||
|
||||
Most model output can be placed in one of the following categories:
|
||||
|
||||
- Content only
|
||||
- Tool calling with arguments emitted as a single JSON object
|
||||
- Tool calling with arguments emitted as separate entities, either XML
|
||||
(Qwen3-Coder, MiniMax M2) or pseudo-function calls (LFM2)
|
||||
|
||||
To provide broad coverage,
|
||||
[`common/chat-peg-parser.h`](common/chat-peg-parser.h) contains builders and
|
||||
mappers that help create parsers and visitors/extractors for these types. They
|
||||
require parsers to tag nodes to conform to an AST "shape". This normalization
|
||||
makes it easy to extract information and generalize parsing.
|
||||
|
||||
### Simple
|
||||
|
||||
The `common_chat_peg_builder` builds a `simple` parser that supports
|
||||
content-only models with optional reasoning.
|
||||
|
||||
- **`reasoning(p)`** - Tag node for extracting `reasoning_content`
|
||||
- **`content(p)`** - Tag node for extracting `content`
|
||||
|
||||
```cpp
|
||||
build_chat_peg_parser([&](common_chat_peg_parser & p) {
|
||||
return p.sequence({
|
||||
p.optional("<think>" + p.reasoning(p.until("</think>")) + "</think>"),
|
||||
p.content(p.until("<tool_call>")),
|
||||
p.end()
|
||||
});
|
||||
});
|
||||
```
|
||||
|
||||
Use `common_chat_peg_mapper` to extract the content. Note that this is already
|
||||
done for you in `common_chat_peg_parser` when
|
||||
`chat_format == COMMON_CHAT_FORMAT_PEG_SIMPLE`.
|
||||
|
||||
```cpp
|
||||
auto result = parser.parse(ctx);
|
||||
|
||||
common_chat_msg msg;
|
||||
auto mapper = common_chat_peg_mapper(msg);
|
||||
mapper.from_ast(ctx.ast, result);
|
||||
```
|
||||
|
||||
### Native
|
||||
|
||||
The `common_chat_peg_native_builder` builds a `native` parser suitable for
|
||||
models that emit tool arguments as a direct JSON object.
|
||||
|
||||
- **`reasoning(p)`** - Tag node for `reasoning_content`
|
||||
- **`content(p)`** - Tag node for `content`
|
||||
- **`tool(p)`** - Tag entirety of a single tool call
|
||||
- **`tool_open(p)`** - Tag start of a tool call
|
||||
- **`tool_close(p)`** - Tag end of a tool call
|
||||
- **`tool_id(p)`** - Tag the tool call ID (optional)
|
||||
- **`tool_name(p)`** - Tag the tool name
|
||||
- **`tool_args(p)`** - Tag the tool arguments
|
||||
|
||||
```cpp
|
||||
build_chat_peg_native_parser([&](common_chat_peg_native_parser & p) {
|
||||
auto get_weather_tool = p.tool(p.sequence({
|
||||
p.tool_open(p.literal("{")),
|
||||
p.json_member("name", "\"" + p.tool_name(p.literal("get_weather")) + "\""),
|
||||
p.literal(","),
|
||||
p.json_member("arguments", p.tool_args(p.json())),
|
||||
p.tool_close(p.literal("}"))
|
||||
}));
|
||||
|
||||
return p.sequence({
|
||||
p.content(p.until("<tool_call>")),
|
||||
p.literal("<tool_call>"),
|
||||
get_weather_tool,
|
||||
p.literal("</tool_call>"),
|
||||
p.end()
|
||||
});
|
||||
});
|
||||
```
|
||||
|
||||
### Constructed
|
||||
|
||||
The `common_chat_peg_constructed_builder` builds a `constructed` parser
|
||||
suitable for models that emit tool arguments as separate entities, such as XML
|
||||
tags.
|
||||
|
||||
- **`reasoning(p)`** - Tag node for `reasoning_content`
|
||||
- **`content(p)`** - Tag node for `content`
|
||||
- **`tool(p)`** - Tag entirety of a single tool call
|
||||
- **`tool_open(p)`** - Tag start of a tool call
|
||||
- **`tool_close(p)`** - Tag end of a tool call
|
||||
- **`tool_name(p)`** - Tag the tool name
|
||||
- **`tool_arg(p)`** - Tag a complete tool argument (name + value)
|
||||
- **`tool_arg_open(p)`** - Tag start of a tool argument
|
||||
- **`tool_arg_close(p)`** - Tag end of a tool argument
|
||||
- **`tool_arg_name(p)`** - Tag the argument name
|
||||
- **`tool_arg_string_value(p)`** - Tag string value for the argument
|
||||
- **`tool_arg_json_value(p)`** - Tag JSON value for the argument
|
||||
|
||||
```cpp
|
||||
build_chat_peg_constructed_parser([&](common_chat_peg_constructed_builder & p) {
|
||||
auto location_arg = p.tool_arg(
|
||||
p.tool_arg_open("<parameter name=\"" + p.tool_arg_name(p.literal("location")) + "\">"),
|
||||
p.tool_arg_string_value(p.until("</parameter>")),
|
||||
p.tool_arg_close(p.literal("</parameter>"))
|
||||
);
|
||||
|
||||
auto get_weather_tool = p.tool(p.sequence({
|
||||
p.tool_open("<function name=\"" + p.tool_name(p.literal("get_weather")) + "\">"),
|
||||
location_arg,
|
||||
p.tool_close(p.literal("</function>"))
|
||||
}));
|
||||
|
||||
return p.sequence({
|
||||
p.content(p.until("<tool_call>")),
|
||||
p.literal("<tool_call>"),
|
||||
get_weather_tool,
|
||||
p.literal("</tool_call>"),
|
||||
p.end()
|
||||
});
|
||||
});
|
||||
```
|
||||
Reference in New Issue
Block a user