A next-word predictor based on variable n-gram contexts.
I used AllenAI's c4 dataset on HuggingFace to train the model. The code supports multiple files, but on a laptop 1 json file takes enough time to train.
Specifically, here is the download link to the file I used to train the model:
https://huggingface.co/datasets/allenai/c4/blob/main/en/c4-train.00000-of-01024.json.gz
This algorithm predicts the next word based on the previous occurances on what the next word is based on the past n words (an n-gram).
By creating a matrix with each row corersponding to a word, we can find and count every occurence of the next word by increasing the column corresponding to the word to 1. That is, for every word
This is a uni-gram matrix. We then increase the context length by
Next, the code finds the best n-gram in order to predict the next single word. This is done by computing the entropy of the probability distribution of an n-gram to find the uncertainty in its probability distribution (and therefore its predictions), followed by using a scoring algorithm to favor more predictable contexts with strong candidates for the next word.
After the max score for the context for each n-gram model is found and the associated probabilities per word are captured, we raise every probability to the power of
Suppose we have the following:
Then we have a time complexity of
Suppose we have the following:
Then we have a space complexity of
After training the model of 200,000 lines of text with a max n-gram of 5, it produces the following outputs:
Given the context 'how is the best', the next 20 words predicted are: possible outcome of any event is the sound sytem that makes the moment live and love in the book she
Given the context 'let a sequence of real numbers', the next 20 words predicted are: for which there is no other custom writing service as flexible and convenient as this one us and uk writers
Given the context 'it will only take', the next 20 words predicted are: a girl out for drinks because it allows them to you are you to implement the best solution to your
Given the context 'my water bottle is', the next 20 words predicted are: designed to be the weakest link in chapter two 20040924 the qt widget gallery has been added to the with
Given the context 'the ipad is', the next 20 words predicted are: a tremendous tool but the fear of stating who can or should be a flat out no absolutely not i
Given the context 'suppose that i said that', the next 20 words predicted are: we dont really specifically celebrate the day anymore valentines day has really turned into a day to simply celebrate our