diff --git a/_posts/2026-02-12-microgpt.markdown b/_posts/2026-02-12-microgpt.markdown
index a8a5d4df..e45a7713 100644
--- a/_posts/2026-02-12-microgpt.markdown
+++ b/_posts/2026-02-12-microgpt.markdown
@@ -88,7 +88,7 @@
 sample 19: alela
 sample 20: anton
 ```
 
- It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.
+It doesn't look like much, but from the perspective of a model like ChatGPT, your conversation with it is just a funny looking "document". When you initialize the document with your prompt, the model's response from its perspective is just a statistical document completion.
 
 ## Tokenizer
@@ -102,7 +102,7 @@
 vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
 print(f"vocab size: {vocab_size}")
 ```
 
-In the code above, we collect all unique characters across the dataset (which are just all the lowercase letters a-z), sort them, and each letter gets an id by its index. Note that the integer values themselves have no meaning at all; each token is just a separate discrete symbol. Instead of 0, 1, 2 they might as well be different emoji. In addition, we create one more special token called `BOS` (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with `BOS` on both sides: `[BOS, e, m, m, a, BOS]`. The model learns that `BOS` initates a new name, and that another `BOS` ends it. Therefore, we have a final vocavulary of 27 (26 possible lowercase characters a-z and +1 for the BOS token).
+In the code above, we collect all unique characters across the dataset (which are just all the lowercase letters a-z), sort them, and each letter gets an id by its index. Note that the integer values themselves have no meaning at all; each token is just a separate discrete symbol. Instead of 0, 1, 2 they might as well be different emoji. In addition, we create one more special token called `BOS` (Beginning of Sequence), which acts as a delimiter: it tells the model "a new document starts/ends here". Later during training, each document gets wrapped with `BOS` on both sides: `[BOS, e, m, m, a, BOS]`. The model learns that `BOS` initiates a new name, and that another `BOS` ends it. Therefore, we have a final vocabulary of 27 (26 possible lowercase characters a-z and +1 for the BOS token).
 
 ## Autograd
@@ -206,7 +206,7 @@ print(b.grad) # tensor(2.)
 ```
 
 This is the same algorithm that PyTorch's `loss.backward()` runs, just on scalars instead of tensors (arrays of scalars) - algorithmically identical, significantly smaller and simpler, but of course a lot less efficient.
 
-Let's spell what the `.backward()` gives us above. Autograd calculated that if `L = a*b + a`, and `a=2` and `b=3`, then `a.grad = 4.0` is telling us about the local influence of `a` on `L`. If you wiggle the inmput `a`, in what direction is `L` changing? Here, the derivative of `L` w.r.t. `a` is 4.0, meaning that if we increase `a` by a tiny amount (say 0.001), `L` would increase by about 4x that (0.004). Similarly, `b.grad = 2.0` means the same nudge to `b` would increase `L` by about 2x that (0.002). In other words, these gradients tell us the direction (positive or negative depending on the sign), and the steepness (the magnitude) of the influence of each individual input on the final output (the loss). This then allows us to interately nudge the parameters of our neural network to lower the loss, and hence improve its predictions.
+Let's spell out what the `.backward()` gives us above. Autograd calculated that if `L = a*b + a`, and `a=2` and `b=3`, then `a.grad = 4.0` is telling us about the local influence of `a` on `L`. If you wiggle the input `a`, in what direction is `L` changing? Here, the derivative of `L` w.r.t. `a` is 4.0, meaning that if we increase `a` by a tiny amount (say 0.001), `L` would increase by about 4x that (0.004). Similarly, `b.grad = 2.0` means the same nudge to `b` would increase `L` by about 2x that (0.002). In other words, these gradients tell us the direction (positive or negative depending on the sign), and the steepness (the magnitude) of the influence of each individual input on the final output (the loss). This then allows us to iteratively nudge the parameters of our neural network to lower the loss, and hence improve its predictions.
 
 ## Parameters
@@ -235,7 +235,7 @@ Each parameter is initialized to a small random number drawn from a Gaussian dis
 
 ## Architecture
 
-The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values from previous positions, and returns logits (scores) over what token the model things should come next in the sequence. We follow GPT-2 with minor simplifications: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU. First, three small helper functions:
+The model architecture is a stateless function: it takes a token, a position, the parameters, and the cached keys/values from previous positions, and returns logits (scores) over what token the model thinks should come next in the sequence. We follow GPT-2 with minor simplifications: RMSNorm instead of LayerNorm, no biases, and ReLU instead of GeLU. First, three small helper functions:
 
 ```python
 def linear(x, w):
@@ -434,7 +434,7 @@ step 13 / 1000 | loss 3.0544
 ...
 ```
 
-Watch it go down from ~3.3 (random) toward ~2.37. The lower this number is, the better the network's predictions already were about what token comes next in the sequence. At the end of training, the knowledge of the stastical patterns of the training token sequences is distilled in the model parameters. Fixing these parameters, we can now generate new, hallucinated names. You'll see (again):
+Watch it go down from ~3.3 (random) toward ~2.37. The lower this number is, the better the network's predictions already were about what token comes next in the sequence. At the end of training, the knowledge of the statistical patterns of the training token sequences is distilled in the model parameters. Fixing these parameters, we can now generate new, hallucinated names. You'll see (again):
 
 ```
 sample 1: kamon
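
As a side note on the `@@ -206,7 +206,7 @@` hunk: the gradient claims it edits (`a.grad = 4.0` and `b.grad = 2.0` for `L = a*b + a` at `a=2`, `b=3`) can be sanity-checked numerically without any autograd at all. This is a standalone sketch, not code from the post; the helper `L` and the step size `eps` are mine:

```python
# Finite-difference check of the gradients discussed in the hunk above.
# L = a*b + a, so analytically dL/da = b + 1 and dL/db = a.
def L(a, b):
    return a * b + a

a, b, eps = 2.0, 3.0, 1e-4

# Wiggle each input by eps and measure how much L moves.
dL_da = (L(a + eps, b) - L(a, b)) / eps  # expect b + 1 = 4.0
dL_db = (L(a, b + eps) - L(a, b)) / eps  # expect a = 2.0

print(round(dL_da, 3), round(dL_db, 3))  # 4.0 2.0
```

This is the same "wiggle the input" interpretation the paragraph gives: nudging `a` by 0.001 moves `L` by about 4x that, matching `a.grad = 4.0`.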