vocab: fix Gemma4 tokenizer (#21343)

* seems to work

* fix case with new line

Co-authored-by: sayap <sokann@gmail.com>

* gemma 4: fix pre tok regex

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: sayap <sokann@gmail.com>
Piotr Wilkin (ilintar)
2026-04-03 10:33:03 +02:00
committed by GitHub
parent 0c58ba3365
commit b069b10ab4
5 changed files with 69 additions and 9 deletions
-3
@@ -7464,9 +7464,6 @@ class Gemma4Model(Gemma3Model):
assert len(tokens) == vocab.vocab_size
-# TODO @ngxson : there are some known (rare) issues with the tokenizer during development
-# but I don't have time to dive into them right now;
-# using a dedicated tokenizer name so that we can fix later without re-converting GGUF
self.gguf_writer.add_tokenizer_model("gemma4")
self.gguf_writer.add_token_list(tokens)
self.gguf_writer.add_token_scores(scores)