Advice on training a language model?

Hello,

I am trying to train a DFM LM from a uniform distribution with the KLD loss. I am training on text data from [LibriSpeech](https://www.danielpovey.com/files/2015_icassp_librispeech.pdf), which contains around 130,000 sentences. Since the LibriSpeech text data is much smaller than webtext, I have changed the config to reflect the difference in dataset size. My main choices were guided by the hyperparameters and methods used for training autoregressive LMs on LibriSpeech.

I trained the model for about 350k steps with a batch size of 256, reaching a KLD loss of 2.8 on the validation set. Surprisingly, the ELBO for this model is over 600,000, which is a lot larger than I was expecting. The model's outputs look similar to those of a 3-gram language model.

This leads to my question: are there any obvious design considerations for training a DFM LM on a dataset much smaller than webtext?

Additionally, I am slightly confused about the correct way to report the ELBO. The eval script uses `ELBO: {torch.exp(elbo / num_elements)}`, which reminds me more of perplexity than of an ELBO. That said, I am only just beginning to wrap my head around the ELBO presented in the paper XD
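
To make the perplexity comparison concrete, here is a minimal sketch of the arithmetic. It assumes that `elbo` is a bound summed over all elements, in nats, and that `num_elements` is the number of tokens; both are assumptions about the eval script rather than something confirmed. The function names and the example numbers are hypothetical and only illustrate the scale of the reported value.

```python
import math


def exp_of_mean_bound(total_bound_nats: float, num_elements: int) -> float:
    """Mirror the shape of exp(elbo / num_elements): exponentiate the
    per-element average of a summed bound (assumed to be in nats)."""
    return math.exp(total_bound_nats / num_elements)


def nats_per_element(reported_value: float) -> float:
    """Invert the exponential to recover the per-element average in nats."""
    return math.log(reported_value)


# Hypothetical example: 1000 elements averaging 2.8 nats each.
example = exp_of_mean_bound(2.8 * 1000, 1000)
print(f"exp of a 2.8-nat average: {example:.2f}")

# A reported value of 600000 on the exponentiated scale corresponds to
# roughly 13.3 nats per element under the assumptions above.
print(f"nats per element for 600000: {nats_per_element(600_000):.2f}")
```

If those assumptions hold, the reported number is on the same exponentiated scale as perplexity, so it would be compared against perplexities rather than against the raw KLD loss value.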

Any tips and discussion would be greatly appreciated :)

