Discrepancy in Reproduction: FOCUS baseline drastically underreported, and TokAlign underperforms proper FOCUS implementation

Hi authors, thanks for sharing your work.
I have been trying to reproduce the initial normalized PPL results (without tuning) reported in Table 1, specifically for the Pythia to Gemma vocabulary adaptation. However, I encountered severe discrepancies that raise questions about the validity of the FOCUS baseline used in the paper, as well as the actual effectiveness of the GloVe-based TokAlign method.
Here are my reproduction settings and findings:
1. The "Strawman" FOCUS Baseline (2.7e2 vs. Paper's 3.1e5)
In Table 1, the paper reports a completely collapsed PPL for the FOCUS initialization (e.g., around 3.1e5 for Pythia models on English/Average). A PPL of this magnitude corresponds to a loss above 12, which indicates the model is essentially making random guesses.
I reproduced the FOCUS baseline using the official deepfocus library. To ensure a fair comparison, I trained a target-domain FastText .bin model on the exact same corpus intended for the target vocabulary adaptation.
My Result: The FOCUS initialized model achieved a PPL of roughly 270 (loss ~5.6).
Analysis: The paper's FOCUS baseline appears to be severely under-optimized. It is highly likely that the paper's implementation did not correctly perform the exact match for tokens shared between the source and target vocabularies.
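For reference, the PPL and loss figures quoted in this issue can be cross-checked with a few lines of Python. This is only a sanity-check sketch: it assumes PPL = exp(mean per-token loss in nats) and ignores whatever normalization the paper applies. `ASSUMED_VOCAB_SIZE` is my own illustrative value for the target vocabulary, not a number taken from the paper.

```python
import math


def ppl_to_loss(ppl: float) -> float:
    """Cross-entropy loss in nats, assuming PPL = exp(loss)."""
    return math.log(ppl)


# Assumption: rough size of the target vocabulary, used only to estimate
# the loss of a uniform random guess. Substitute the real size if it differs.
ASSUMED_VOCAB_SIZE = 256_000
uniform_loss = math.log(ASSUMED_VOCAB_SIZE)

# PPL values as quoted in this issue.
quoted_ppl = {
    "FOCUS (paper, Table 1)": 3.1e5,
    "FOCUS (my reproduction)": 2.7e2,
    "TokAlign (paper)": 1.2e2,
    "TokAlign (my reproduction)": 2.5e3,
}

for name, ppl in quoted_ppl.items():
    loss = ppl_to_loss(ppl)
    print(f"{name:28s} PPL={ppl:10.1f}  loss={loss:5.2f}  "
          f"gap to uniform guess={uniform_loss - loss:+.2f}")
```

Under these assumptions, a non-positive gap means the initialization is no better than guessing uniformly over the vocabulary.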
2. TokAlign Underperforms Proper FOCUS (2500 vs. Paper's 120)
When reproducing your TokAlign method using the GloVe co-occurrence matrix approach described in the paper, my initial PPL was around 2500.
Not only does this fail to reach the astonishing 120 reported in the paper, but it also underperforms my properly implemented FOCUS baseline (270).
Theoretical Concern: This result is consistent with how the two embedding methods work. GloVe relies on whole-token co-occurrence and suffers heavily from sparsity when dealing with rare token fragments or OOV words in the new vocabulary. FastText (used by FOCUS), on the other hand, leverages subword n-grams, giving it a strong advantage in smoothing unseen subwords during vocabulary expansion.
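To illustrate the subword point, here is a minimal sketch of FastText-style character n-gram decomposition. It is not the actual FastText implementation (which also hashes n-grams into buckets and sums their vectors); `char_ngrams` is a hypothetical helper, the 3 to 6 n-gram range mirrors what I understand to be the FastText defaults, and the example tokens are made up.

```python
from __future__ import annotations


def char_ngrams(token: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams of a token wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{token}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# A token seen during embedding training and a rare one that was not.
seen = set(char_ngrams("tokenize"))
rare = set(char_ngrams("tokenizer"))

# The rare token still shares many n-grams with the seen one, so a
# subword model can compose a vector for it. A whole-token
# co-occurrence model has no such fallback for a rare token.
shared = sorted(seen & rare)
print(f"{len(shared)} shared n-grams, e.g. {shared[:5]}")
```

A whole-token method such as GloVe would treat the two tokens above as unrelated rows of the co-occurrence matrix, which is where the sparsity problem comes from.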
