I think this is a GPT-OSS-20B model error: the LLM itself generates bad output. That output is then processed by the Harmony decoder, which really just re-formats the tokens against an expected structure. It expects the message to come out of the LLM in a specific format that lets you separate the reasoning from the generated output. If the LLM bungles the generation, the format is all off and the Harmony decoder breaks.
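For context, a well-formed completion in the Harmony format looks roughly like the sketch below: the reasoning goes to an `analysis` channel and the user-facing answer to a `final` channel, each wrapped in special tokens. The channel names come from the published Harmony format; the exact token spelling may vary by version, and the `...` are just placeholders here.

```
<|start|>assistant<|channel|>analysis<|message|>...chain-of-thought...<|end|>
<|start|>assistant<|channel|>final<|message|>...user-visible answer...<|return|>
```

When the model emits junk instead of these delimiters, the decoder has nothing to anchor on and fails.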
From what I see, it breaks because the LLM outputs a long run of "..." tokens. This happens pretty frequently, roughly a 20% chance per message in my case. You could work around it by attempting to decode and simply regenerating the output whenever decoding fails, but 20% is really high. I don't want a fifth of my messages to take twice as long because I have to regenerate them.
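If you do go the retry route, a minimal sketch might look like this. `generate_tokens` and `harmony_decode` are hypothetical placeholders for whatever generation call and Harmony parsing call you actually use, and `HarmonyDecodeError` stands in for whatever exception your parser raises on a malformed completion:

```python
# Minimal retry sketch, under the assumptions stated above.

class HarmonyDecodeError(Exception):
    """Placeholder for the error your Harmony parser raises."""


def generate_with_retry(prompt, generate_tokens, harmony_decode,
                        max_attempts=3):
    """Sample, try to decode, and resample if the structure is broken."""
    last_err = None
    for _ in range(max_attempts):
        tokens = generate_tokens(prompt)  # one full sampling pass
        try:
            # Split reasoning from the final answer; raises if the
            # completion doesn't match the expected Harmony structure.
            return harmony_decode(tokens)
        except HarmonyDecodeError as err:
            last_err = err  # remember the failure, then resample
    raise RuntimeError(
        f"Harmony decoding failed {max_attempts} times in a row"
    ) from last_err
```

For what it's worth, with a ~20% per-message failure rate the expected number of sampling passes is 1/(1 − 0.2) ≈ 1.25, so the average throughput cost is about 25%; the real pain is the tail latency on the messages that do fail.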