vocab: fix Gemma4 tokenizer (#21343)

* seems to work

* fix case with new line

Co-authored-by: sayap <sokann@gmail.com>

* gemma 4: fix pre tok regex

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: sayap <sokann@gmail.com>
Piotr Wilkin (ilintar)
2026-04-03 10:33:03 +02:00
committed by GitHub
parent 0c58ba3365
commit b069b10ab4
5 changed files with 69 additions and 9 deletions
+1 -1
@@ -108,4 +108,4 @@ uint32_t unicode_tolower(uint32_t cpt);
 bool unicode_cpt_is_han(uint32_t cpt);
-std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs);
+std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs, bool byte_encode = true);
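Note on the signature change: because `byte_encode` defaults to `true`, existing two-argument callers should compile unchanged and keep their previous behaviour, while a new caller can opt out by passing `false`. The sketch below illustrates only that defaulted-parameter pattern. `split_sketch` is a hypothetical stand-in, not the llama.cpp implementation: it splits on spaces instead of applying regexes, and the angle-bracket wrapping is a placeholder for whatever post-processing the real flag controls.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for illustration only (NOT unicode_regex_split).
// Splits `text` on spaces; when `byte_encode` is true, each piece gets a
// placeholder transformation (wrapped in <...>) to mimic an optional
// post-processing step that callers can now disable.
std::vector<std::string> split_sketch(const std::string & text, bool byte_encode = true) {
    std::vector<std::string> out;
    std::string cur;
    for (char c : text) {
        if (c == ' ') {
            if (!cur.empty()) {
                out.push_back(cur);
                cur.clear();
            }
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) {
        out.push_back(cur);
    }
    if (byte_encode) {
        // Placeholder post-processing step, assumed for the sketch.
        for (auto & piece : out) {
            piece = "<" + piece + ">";
        }
    }
    return out;
}
```

An old-style call such as `split_sketch("a b")` still runs the post-processing step, whereas `split_sketch("a b", false)` returns the raw pieces.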