Advice on training a language model? #89

@Mattias421

Hello,

I am trying to train a DFM LM with a uniform source distribution and the KLD loss. I am training on the LibriSpeech text data, which contains around 130,000 sentences. Because this is much smaller than webtext, I have changed the config to reflect the difference in dataset size. My choices have mainly been guided by the hyperparameters and methods used for training autoregressive LMs on LibriSpeech.

I trained the model for about 350k steps with a batch size of 256, reaching a KLD loss of 2.8 on the validation set. Surprisingly, the ELBO for this model is over 600,000, far larger than I was expecting. The model's outputs look similar to samples from a 3-gram language model.
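For a sense of scale, here is a rough back-of-the-envelope sketch of how many passes over the data this run corresponds to. It assumes each batch element is a single sentence (no packing or chunking), which may not match the actual data pipeline; the helper name `epochs_seen` is made up for illustration.

```python
def epochs_seen(steps: int, batch_size: int, dataset_size: int) -> float:
    """Approximate number of passes over the dataset.

    Assumption: one batch element equals one sentence. If sentences are
    packed or chunked into fixed-length blocks, this estimate does not hold.
    """
    return steps * batch_size / dataset_size


# Numbers taken from the post: ~350k steps, batch size 256, ~130,000 sentences.
passes = epochs_seen(steps=350_000, batch_size=256, dataset_size=130_000)
print(f"approx. passes over the data: {passes:.0f}")
```

Under that assumption the model has seen every sentence several hundred times, which may be worth keeping in mind when comparing against webtext-scale training.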

This leads to my question: are there any obvious design considerations for training a DFM LM on a dataset much smaller than webtext?

Additionally, I am slightly confused about the correct way to report the ELBO. The eval script prints `ELBO: {torch.exp(elbo / num_elements)}`, which reminds me more of perplexity than of an ELBO. That said, I am only just beginning to wrap my head around the ELBO presented in the paper XD
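To make the perplexity comparison concrete, here is a minimal sketch of what exponentiating a per-element average does. It assumes that `elbo` in the eval script is a summed bound on the negative log-likelihood in nats and that `num_elements` is the number of tokens it was summed over; I have not verified either. The function names are hypothetical and not part of the repo.

```python
import math


def bound_to_perplexity(total_bound_nats: float, num_elements: int) -> float:
    """exp(total / count): a perplexity-style number, if the total is a
    bound on the negative log-likelihood in nats (assumption)."""
    return math.exp(total_bound_nats / num_elements)


def perplexity_to_nats_per_token(perplexity: float) -> float:
    """Invert the exponentiation to recover the average bound per token."""
    return math.log(perplexity)


# The reported value of about 600,000, read as a perplexity-style number,
# corresponds to this many nats per token under the assumptions above.
nats_per_token = perplexity_to_nats_per_token(600_000)
print(f"{nats_per_token:.2f} nats per token")

# Round trip: 1,000 tokens at that average give back the reported value.
print(f"{bound_to_perplexity(nats_per_token * 1_000, 1_000):.0f}")
```

If those assumptions hold, the printed value is an upper bound on perplexity rather than the ELBO itself, so reporting the per-token value in nats alongside it might be less confusing.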

Any tips and discussion would be greatly appreciated :)
