
Discrepancy in Reproduction: FOCUS baseline drastically underreported, and TokAlign underperforms proper FOCUS implementation #2

@AIbbbb


Hi authors, thanks for sharing your work.
I have been trying to reproduce the initial normalized PPL results (without tuning) reported in Table 1, specifically for the Pythia to Gemma vocabulary adaptation. However, I encountered severe discrepancies that raise questions about the validity of the FOCUS baseline used in the paper, as well as the actual effectiveness of the GloVe-based TokAlign method.
Here are my reproduction settings and findings:

  1. The "Strawman" FOCUS Baseline (2.7e2 vs. Paper's 3.1e5)
    In Table 1, the paper reports a completely collapsed PPL for the FOCUS initialization (e.g., around 3.1e5 for Pythia models on English/Average). A loss of this magnitude (>12) indicates the model is essentially making random guesses. I reproduced the FOCUS baseline using the official deepfocus library. To ensure a fair comparison, I trained a target-domain FastText .bin model using the exact same corpus intended for the target vocabulary adaptation.
    My Result: The FOCUS-initialized model achieved a PPL of roughly 270 (loss ~5.6).
    Analysis: The paper's FOCUS baseline seems to be severely under-optimized. It is highly likely that the paper's implementation failed to properly perform the exact match for shared tokens.
  2. TokAlign Underperforms Proper FOCUS (2500 vs. Paper's 120)
    When reproducing your TokAlign method using the GloVe co-occurrence matrix approach described in the paper, my initial PPL was around 2500. Not only does this fail to reach the astonishing 120 reported in the paper, but it actually underperforms my properly implemented FOCUS baseline (270).
    Theoretical Concern: This makes theoretical sense. GloVe relies on whole-token co-occurrence and suffers heavily from sparsity when dealing with rare token fragments or OOV words in the new vocabulary. FastText (used by FOCUS), on the other hand, leverages subword n-grams, giving it a strong advantage in smoothing unseen subwords during vocabulary expansion.
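
For reference, the PPL and loss figures quoted above can be cross-checked with a short script. This is a minimal sketch, assuming PPL is the exponential of the mean per-token cross-entropy in nats. The helper names are hypothetical, and the 256,000-token vocabulary size used for the uniform-guess reference is an assumption about the Gemma tokenizer, not a number taken from the paper.

```python
import math


def ppl_to_loss(ppl: float) -> float:
    """Convert perplexity to mean per-token cross-entropy (nats)."""
    return math.log(ppl)


def loss_to_ppl(loss: float) -> float:
    """Convert mean per-token cross-entropy (nats) to perplexity."""
    return math.exp(loss)


# PPL values quoted in this issue (paper's Table 1 vs. my reproduction).
quoted_ppl = {
    "FOCUS (paper)": 3.1e5,
    "FOCUS (reproduced)": 270.0,
    "TokAlign (paper)": 120.0,
    "TokAlign (reproduced)": 2500.0,
}

# ASSUMPTION: target vocabulary of 256,000 tokens (Gemma-sized).
# A model guessing uniformly over V tokens has loss ln(V).
ASSUMED_VOCAB_SIZE = 256_000
uniform_loss = math.log(ASSUMED_VOCAB_SIZE)

for name, ppl in quoted_ppl.items():
    loss = ppl_to_loss(ppl)
    print(f"{name}: PPL {ppl:g} -> loss {loss:.2f} "
          f"(uniform-guess loss {uniform_loss:.2f})")
```

Under these assumptions, the paper's FOCUS number sits at or above the uniform-guess loss, while a PPL of 270 corresponds to the loss of about 5.6 mentioned in item 1.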
