fix: eliminate hard-coded vocab definitions to make the Whisper model compatible with custom vocabularies and embedding layer lengths #3555
base: master
Conversation
```diff
 int num_languages() const {
-    return n_vocab - 51765 - (is_multilingual() ? 1 : 0);
+    return n_vocab - token_to_id.size() - 1509;
 }
```
Modified num_languages() Function
The num_languages() function has been redesigned to dynamically calculate the number of supported languages, replacing the original hardcoded logic that relied on fixed vocabulary size thresholds (e.g., 51865 for multilingual models).
Rationale (Aligned with OpenAI Whisper's Tokenizer Design)
Per OpenAI’s official Whisper tokenizer implementation (tokenizer.py#L340-L351):
- Language-specific special tokens (e.g., `<|ja|>`, `<|en|>`) are arranged consecutively between the `<|startoftranscript|>` and `<|translate|>` tokens in the vocabulary.
- The total number of non-language special tokens is fixed at 1509 (1501 timestamp tokens + 8 core functional tokens: `<|endoftext|>`, `<|startoftranscript|>`, `<|translate|>`, `<|transcribe|>`, `<|startoflm|>`, `<|startofprev|>`, `<|nospeech|>`, `<|notimestamps|>`).
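As a sanity check, the formula can be verified against the official models (a sketch; the regular-vocabulary sizes 50256/50257 below are the known sizes of OpenAI's gpt2/multilingual tiktoken files, not values taken from this PR):

```python
# Check: num_languages = n_vocab - common_vocab_size - 1509,
# where common_vocab_size is the number of regular (non-special) tokens.
models = {
    # model: (n_vocab, common_vocab_size)
    "tiny.en (gpt2 vocab)":      (51864, 50256),
    "tiny (multilingual vocab)": (51865, 50257),
    "large-v3 (adds <|yue|>)":   (51866, 50257),
}
for name, (n_vocab, common) in models.items():
    # Matches the old hardcoded n_vocab - 51765 - (is_multilingual ? 1 : 0)
    print(f"{name}: {n_vocab - common - 1509} languages")
# tiny.en: 99, tiny: 99, large-v3: 100
```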
I think the method that calculates the token IDs between `<|startoftext|>` and `<|translate|>` is better, because the sizes of `token_to_id` and `id_to_token` change after the special tokens are loaded in whisper.cpp#L1641-L1672.
| if "<|endoftext|>" in tokens: | ||
| del tokens["<|endoftext|>"] |
When I run recognition with ggml-tiny.en.bin, the result is also empty; this is the same reason the tests fail.
If the last commit was "Migrate from HG dataset into HG model", models that were last generated with this script need to be reconverted; otherwise `<|endoftext|>` will be written into the common tokens.
Just like the test models, GGML models actually converted from OpenAI's official models should not record special tokens in the vocabulary; otherwise the IDs of the subsequent special tokens will be positioned incorrectly.

So you may need to regenerate the empty models used in the tests. Compared to the old ones, the newly generated test models guarantee that special tokens are excluded from the normal vocabulary:

```python
import os
import base64
import struct
import numpy as np

# ggml magic number
GGML_FILE_MAGIC = 0x67676d6c  # "ggml"

# Hyperparameter settings (defaults follow the tiny model configuration)
class HyperParams:
    def __init__(self,
                 n_vocab=51865,
                 n_audio_ctx=1500,
                 n_audio_state=384,
                 n_audio_head=6,
                 n_audio_layer=4,
                 n_text_ctx=448,
                 n_text_state=384,
                 n_text_head=6,
                 n_text_layer=4,
                 n_mels=80):
        self.n_vocab = n_vocab
        self.n_audio_ctx = n_audio_ctx
        self.n_audio_state = n_audio_state
        self.n_audio_head = n_audio_head
        self.n_audio_layer = n_audio_layer
        self.n_text_ctx = n_text_ctx
        self.n_text_state = n_text_state
        self.n_text_head = n_text_head
        self.n_text_layer = n_text_layer
        self.n_mels = n_mels
        self.ftype = True  # True: fp16, False: fp32

def write_ggml_metadata(fout, hparams):
    # write magic number
    fout.write(struct.pack("i", GGML_FILE_MAGIC))
    # write hyperparameters
    fout.write(struct.pack("i", hparams.n_vocab))
    fout.write(struct.pack("i", hparams.n_audio_ctx))
    fout.write(struct.pack("i", hparams.n_audio_state))
    fout.write(struct.pack("i", hparams.n_audio_head))
    fout.write(struct.pack("i", hparams.n_audio_layer))
    fout.write(struct.pack("i", hparams.n_text_ctx))
    fout.write(struct.pack("i", hparams.n_text_state))
    fout.write(struct.pack("i", hparams.n_text_head))
    fout.write(struct.pack("i", hparams.n_text_layer))
    fout.write(struct.pack("i", hparams.n_mels))
    fout.write(struct.pack("i", hparams.ftype))

def write_mel_filters(fout, hparams, mel_filters_path):
    print("loading real Mel filter data...")
    # load the Mel filters from the npz file
    with np.load(mel_filters_path) as f:
        filters = f[f"mel_{hparams.n_mels}"]
    fout.write(struct.pack("i", filters.shape[0]))
    fout.write(struct.pack("i", filters.shape[1]))
    for i in range(filters.shape[0]):
        for j in range(filters.shape[1]):
            fout.write(struct.pack("f", filters[i][j]))

def write_tokenizer(fout, tokenizer_path):
    with open(tokenizer_path, "rb") as f:
        contents = f.read()
    tokens = {base64.b64decode(token): int(rank)
              for token, rank in (line.split() for line in contents.splitlines() if line)}
    # write size of tokenizer
    fout.write(struct.pack("i", len(tokens)))
    # write vocabulary
    for key in tokens:
        fout.write(struct.pack("i", len(key)))
        fout.write(key)

def generate_empty_model(filename, hparams):
    print(f"generate empty model file: {filename}")
    with open(filename, "wb") as f:
        write_ggml_metadata(f, hparams)
        write_mel_filters(f, hparams, "whisper/whisper/assets/mel_filters.npz")
        write_tokenizer(f, f"whisper/whisper/assets/{'gpt2' if hparams.n_vocab < 51865 else 'multilingual'}.tiktoken")
        # ignore the rest of the model

if __name__ == "__main__":
    os.system("git clone https://github.com/openai/whisper.git")
    os.makedirs("empty_models", exist_ok=True)

    # Base models
    generate_empty_model("empty_models/for-tests-ggml-base.bin", HyperParams(
        n_vocab=51865, n_audio_state=512, n_audio_head=8, n_audio_layer=6,
        n_text_state=512, n_text_head=8, n_text_layer=6
    ))
    generate_empty_model("empty_models/for-tests-ggml-base.en.bin", HyperParams(
        n_vocab=51864, n_audio_state=512, n_audio_head=8, n_audio_layer=6,
        n_text_state=512, n_text_head=8, n_text_layer=6
    ))

    # Small models
    generate_empty_model("empty_models/for-tests-ggml-small.bin", HyperParams(
        n_vocab=51865, n_audio_state=768, n_audio_head=12, n_audio_layer=12,
        n_text_state=768, n_text_head=12, n_text_layer=12
    ))
    generate_empty_model("empty_models/for-tests-ggml-small.en.bin", HyperParams(
        n_vocab=51864, n_audio_state=768, n_audio_head=12, n_audio_layer=12,
        n_text_state=768, n_text_head=12, n_text_layer=12
    ))

    # Medium models
    generate_empty_model("empty_models/for-tests-ggml-medium.bin", HyperParams(
        n_vocab=51865, n_audio_state=1024, n_audio_head=16, n_audio_layer=24,
        n_text_state=1024, n_text_head=16, n_text_layer=24
    ))
    generate_empty_model("empty_models/for-tests-ggml-medium.en.bin", HyperParams(
        n_vocab=51864, n_audio_state=1024, n_audio_head=16, n_audio_layer=24,
        n_text_state=1024, n_text_head=16, n_text_layer=24
    ))

    # Large models
    generate_empty_model("empty_models/for-tests-ggml-large.bin", HyperParams(
        n_vocab=51865, n_audio_state=1280, n_audio_head=20, n_audio_layer=32,
        n_text_state=1280, n_text_head=20, n_text_layer=32
    ))
    generate_empty_model("empty_models/for-tests-ggml-large-v3.bin", HyperParams(  # add <|yue|>
        n_vocab=51866, n_audio_state=1280, n_audio_head=20, n_audio_layer=32,
        n_text_state=1280, n_text_head=20, n_text_layer=32
    ))

    # Tiny models
    generate_empty_model("empty_models/for-tests-ggml-tiny.bin", HyperParams(n_vocab=51865))
    generate_empty_model("empty_models/for-tests-ggml-tiny.en.bin", HyperParams(n_vocab=51864))

    # Turbo model (based on large-v3 with optimizations)
    generate_empty_model("empty_models/for-tests-ggml-turbo.bin", HyperParams(  # add <|yue|>
        n_vocab=51866, n_audio_state=1280, n_audio_head=20, n_audio_layer=32,
        n_text_state=1280, n_text_head=20, n_text_layer=32
    ))
```
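To double-check a generated file, the header can be read back in the same order `write_ggml_metadata()` writes it (a quick sketch based only on the script above; the path is one of the files it generates):

```python
import struct

def read_header(path):
    # Read back the magic number and the 11 int32 hyperparameter
    # fields in the exact order write_ggml_metadata() wrote them.
    with open(path, "rb") as f:
        magic = struct.unpack("i", f.read(4))[0]
        assert magic == 0x67676d6c, "not a ggml file"
        names = ["n_vocab", "n_audio_ctx", "n_audio_state", "n_audio_head",
                 "n_audio_layer", "n_text_ctx", "n_text_state", "n_text_head",
                 "n_text_layer", "n_mels", "ftype"]
        return dict(zip(names, struct.unpack("11i", f.read(44))))

print(read_header("empty_models/for-tests-ggml-tiny.en.bin"))
```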
Theoretically, the newly converted test models and GGML models with the special tokens removed are compatible with the old code, because the old code still relies on hard-coded constants to locate the special token IDs. However, the code I submitted calculates the special token IDs from the number of ordinary tokens in the vocabulary and the size of the embedding layer, so if a ggml model does contain special tokens in its vocabulary, the IDs of the subsequent special tokens will be calculated incorrectly.
```cpp
        vocab.token_beg += dt;
    }
    vocab.token_eot = n_vocab;     // <|endoftext|> 50256 for en, 50257 for multilingual, others for custom model
    vocab.token_sot = n_vocab + 1; // <|startoftext|>
```
Sorry, I made a mistake in the comment. It should be `<|startoftranscribe|>`.

Description
This PR addresses the tokenizer index out-of-bounds issue when using custom Whisper models with modified vocabulary sizes, as reported in #3392.
The problem occurs when converting models like efwkjn/whisper-ja-anime-v0.1, efwkjn/whisper-ja-anime-v0.2, and efwkjn/whisper-ja-anime-v0.3 to ggml format using `convert-h5-to-ggml.py` and running them with whisper.cpp. These models have a vocabulary size of 20480 (including special tokens), which differs from the official Whisper models (51864 for monolingual, 51865 for multilingual). The hardcoded special token IDs in whisper.cpp cause index out-of-bounds errors when using these custom models.

Solution
The solution dynamically calculates special token IDs based on the actual vocabulary size and structure, instead of using hardcoded values:
After loading the vocabulary and establishing id-to-token mappings, we determine:
- `vocab.n_vocab`: the total size of the embedding layer
- `common_vocab_size`: the number of regular (non-special) tokens (`size_t common_vocab_size = vocab.token_to_id.size()`)

Following OpenAI's Whisper token arrangement principles (special tokens are placed consecutively after regular tokens), we calculate the ranges:

- `[0, common_vocab_size)`: regular tokens
- `common_vocab_size`: `<|endoftext|>`
- `common_vocab_size + 1`: `<|startoftranscript|>`
- `[common_vocab_size + 2, emb_size - 1507)`: language mark tokens
- `emb_size - 1507`: `<|translate|>`
- `emb_size - 1506`: `<|transcribe|>`
- `emb_size - 1505`: `<|startoflm|>`
- `emb_size - 1504`: `<|startofprev|>`
- `emb_size - 1503`: `<|nospeech|>`
- `emb_size - 1502`: `<|notimestamps|>`
- `[emb_size - 1501, emb_size)`: timestamp tokens (1501 tokens from `<|0.00|>` to `<|30.00|>`)

The total number of non-language special tokens is 1509 (1501 timestamps + 8 other special tokens).
The number of language tokens is calculated as `vocab.n_vocab - common_vocab_size - 1509`.

This approach dynamically adapts to different vocabulary sizes and maintains compatibility with both official and custom Whisper models (see the sketch below).
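To make the arithmetic concrete, here is a minimal Python sketch of the ID layout described above (`emb_size` plays the role of `vocab.n_vocab`; the values in the assertion are the known IDs of the official multilingual models):

```python
def compute_special_ids(emb_size: int, common_vocab_size: int):
    """Derive the special-token IDs from the embedding size and the
    number of regular tokens, per the layout described above."""
    ids = {
        "<|endoftext|>":         common_vocab_size,
        "<|startoftranscript|>": common_vocab_size + 1,
        "<|translate|>":         emb_size - 1507,
        "<|transcribe|>":        emb_size - 1506,
        "<|startoflm|>":         emb_size - 1505,
        "<|startofprev|>":       emb_size - 1504,
        "<|nospeech|>":          emb_size - 1503,
        "<|notimestamps|>":      emb_size - 1502,
        "<|0.00|>":              emb_size - 1501,  # first timestamp token
    }
    num_languages = emb_size - common_vocab_size - 1509
    return ids, num_languages

# Official multilingual tiny: 50257 regular tokens, 51865 embeddings.
ids, n_lang = compute_special_ids(51865, 50257)
assert ids["<|endoftext|>"] == 50257
assert ids["<|translate|>"] == 50358
assert n_lang == 99
```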