A small C# console project that teaches how a large language model is built, one step at a time.
The repo is aimed at software developers rather than data scientists. Each lesson keeps the mechanics visible and uses tiny Azure and programming-flavoured examples so you can see what the code is doing without needing a GPU cluster or an existential crisis.
The interactive lessons currently cover:
- lesson1 - a count-based bigram model
- lesson2 - replacing counts with learnable weights
- lesson3 - sequence windows and ordered context
- lesson4 - attention
- lesson5 - a tiny transformer-style block
- lesson6 - training, inference, and sampling
- lesson7 - prompting, retrieval, and grounded answers
- nanogpt - a TorchSharp-backed nanoGPT track using a training document
- nanogptv2 - the nanoGPT track with embeddings, attention, residuals, feed-forward blocks, and per-position logits
src/ukazure.llm.cli/ Console application source, lessons, nanoGPT tracks, and training data
docs/ Supporting walkthroughs and architecture notes
README.md How to run the project and what each command demonstrates
- .NET 10 SDK
- a terminal that supports interactive console input
- macOS users running nanogpt may need brew install libomp for TorchSharp's CPU backend
This project uses:
- Spectre.Console for the interactive lesson UI
- the official OpenAI .NET SDK for the optional live model call in lesson 7
- TorchSharp-cpu for the practical nanoGPT command
From the repo root:
dotnet run --project ukazure.llm.cli.csproj -- lesson1
Replace lesson1 with any lesson name from lesson1 to nanogptv2.
If you run the app without a valid lesson argument, it prints the available lessons:
dotnet run --project ukazure.llm.cli.csproj
Example output:
Usage: dotnet run lesson1|lesson2|lesson3|lesson4|lesson5|lesson6|lesson7|nanogpt|nanogptv2
Available commands:
lesson1 A tiny bigram model built from counts
lesson2 Replace counts with learnable weights
lesson3 Model sequencing with context windows
lesson4 Attention lets the model choose what to focus on
lesson5 A tiny transformer-style block
lesson6 Training, inference, and sampling
lesson7 Prompting, retrieval, and grounded answers
nanogpt nanoGPT in C# with TorchSharp
nanogptv2 nanoGPT v2: per-position logits
Lesson 1 is interactive. You choose the training sentences, then pick a starting token for generation.
Example run:
Choose the training sentences for lesson 1
[x] azure deploys to the cloud
[x] azure scales in the cloud
[x] dotnet builds in the cloud
[ ] dotnet runs in containers
Choose the starting token for generation
azure
Representative output:
Lesson 1: A tiny bigram model built from counts
Seed: azure
Current output: azure
Step 1: azure -> deploys
Current output: azure deploys
Step 2: deploys -> to
Current output: azure deploys to
Final output: azure deploys to the cloud
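The counting and generation steps in this run can be sketched roughly as follows. This is a minimal illustration, not the lesson's actual source: the helper names BuildBigramCounts and Generate are hypothetical, and ties between equally frequent followers are broken alphabetically here so the result is deterministic. The real lesson may pick differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The three sentences ticked in the example run.
string[] trainingSentences =
{
    "azure deploys to the cloud",
    "azure scales in the cloud",
    "dotnet builds in the cloud",
};

var bigramCounts = BuildBigramCounts(trainingSentences);
Console.WriteLine(Generate(bigramCounts, "azure", maxSteps: 10));
// prints "azure deploys to the cloud"

// counts[current][next] = how many times `next` directly followed `current`.
static Dictionary<string, Dictionary<string, int>> BuildBigramCounts(IEnumerable<string> sentences)
{
    var counts = new Dictionary<string, Dictionary<string, int>>();
    foreach (var sentence in sentences)
    {
        var tokens = sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        for (var i = 0; i < tokens.Length - 1; i++)
        {
            if (!counts.TryGetValue(tokens[i], out var followers))
            {
                followers = new Dictionary<string, int>();
                counts[tokens[i]] = followers;
            }
            followers[tokens[i + 1]] = followers.GetValueOrDefault(tokens[i + 1]) + 1;
        }
    }
    return counts;
}

// Greedy generation: always take the most frequent follower, stop when none exists.
static string Generate(Dictionary<string, Dictionary<string, int>> counts, string seed, int maxSteps)
{
    var output = new List<string> { seed };
    var current = seed;
    for (var step = 0; step < maxSteps; step++)
    {
        if (!counts.TryGetValue(current, out var followers)) break;
        current = followers
            .OrderByDescending(pair => pair.Value)
            .ThenBy(pair => pair.Key, StringComparer.Ordinal) // assumed tie-break
            .First().Key;
        output.Add(current);
    }
    return string.Join(" ", output);
}
```

Generation stops at "cloud" because no training sentence contains a token after it, which is exactly the limitation the later lessons address.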
Lesson 6 uses the tiny transformer-style block from lesson 5 and shows the difference between training behaviour and inference behaviour.
Representative output:
Lesson 6: Training, inference, and sampling
Top predictions after "az deployment group create":
with 94.8%
to 2.1%
deploy 1.0%
Greedy output:
az deployment group create with
Temperature 0.7, top-k 3:
az deployment group create with bicep
Temperature 1.2, top-k 5:
az deployment group create with json
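The difference between greedy output and the temperature/top-k lines can be sketched as below. This is an illustrative sketch under stated assumptions: the logits are made-up numbers, and TopKProbabilities and Sample are hypothetical helper names rather than functions from the lesson.

```csharp
using System;
using System.Linq;

// Hypothetical logits for three candidate next tokens.
string[] candidateTokens = { "with", "to", "deploy" };
double[] candidateLogits = { 4.0, 0.2, -0.5 };

// Greedy decoding always takes the highest-scoring token.
var greedy = candidateTokens[Array.IndexOf(candidateLogits, candidateLogits.Max())];

// Sampling draws from a reshaped distribution instead.
var probabilities = TopKProbabilities(candidateLogits, temperature: 0.7, topK: 2);
var sampled = candidateTokens[Sample(probabilities, new Random(42))];
Console.WriteLine($"greedy={greedy}, sampled={sampled}");

// Keep the k largest logits, divide by the temperature, and softmax over the survivors.
// Tokens outside the top k receive probability 0.
static double[] TopKProbabilities(double[] logits, double temperature, int topK)
{
    if (temperature <= 0) throw new ArgumentOutOfRangeException(nameof(temperature));
    var keep = logits
        .Select((value, index) => (value, index))
        .OrderByDescending(item => item.value)
        .Take(Math.Clamp(topK, 1, logits.Length))
        .Select(item => item.index)
        .ToHashSet();

    var max = logits.Max(); // subtracting the max keeps Math.Exp numerically stable
    var weights = logits
        .Select((value, index) => keep.Contains(index) ? Math.Exp((value - max) / temperature) : 0.0)
        .ToArray();
    var total = weights.Sum();
    return weights.Select(weight => weight / total).ToArray();
}

// Draw one index from a probability distribution.
static int Sample(double[] distribution, Random random)
{
    var threshold = random.NextDouble();
    var cumulative = 0.0;
    for (var i = 0; i < distribution.Length; i++)
    {
        cumulative += distribution[i];
        if (threshold < cumulative) return i;
    }
    return distribution.Length - 1; // guard against rounding error
}
```

A temperature below 1 sharpens the distribution towards the top token, while a temperature above 1 flattens it, which is why the higher-temperature line is more likely to wander away from the greedy choice.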
Lesson 7 moves from model internals to application architecture. You choose a developer question, the lesson retrieves relevant documents, builds a grounded prompt, and compares answers with and without retrieval.
Representative output:
Choose the developer question for lesson 7
How should I store secrets for my Azure app without hard-coding credentials?
Retrieved:
Azure Key Vault
Managed Identities
Azure App Service
Without retrieval:
You should use a secure service for secrets, avoid hard-coded credentials, and prefer platform features that reduce direct secret handling.
With retrieval:
Store secrets in Azure Key Vault instead of appsettings files or source code. Use managed identities so the app can authenticate without storing passwords or client secrets. If the app runs on Azure App Service, configure app settings to reference Key Vault secrets.
The nanogpt command is separate from the numbered lessons. It is a C# rewrite track inspired by Andrej Karpathy's nanoGPT repo, using TorchSharp rather than hand-written arrays for the practical training machinery.
It reads a condensed training document from src/ukazure.llm.cli/data/nanogpt-training.txt. The document is based on the Azure developer guide:
https://docs.azure.cn/en-us/guides/developer/azure-developer-guide
TorchSharp provides tensors, automatic gradients, cross-entropy, and AdamW. The current implementation is intentionally small: it is a character-level model trained on Azure developer text, plus an interactive document-grounded question loop.
Representative output:
nanoGPT: C# with TorchSharp
Path: data/nanogpt-training.txt
Vocabulary size: 60
Block size: 32
Embedding size: 64
Hidden size: 128
Step 1: loss = 4.1052
Step 20: loss = 3.1908
Step 40: loss = 3.0859
Step 60: loss = 2.8595
Step 80: loss = 2.5768
Prompt: az
Sample: az ...
Ask questions about data/nanogpt-training.txt.
Try: What is App Service useful for? or Why would I use Bicep or ARM templates?
Question: What is App Service useful for?
Answer: App Service is useful when a team wants a fast path to publish web projects. App Service for Linux can run custom container images for web applications. Hybrid Connections can connect an App Service application to on premises resources.
Evidence:
App Service is useful when a team wants a fast path to publish web projects. (score 3)
App Service for Linux can run custom container images for web applications. (score 2)
Hybrid Connections can connect an App Service application to on premises resources. (score 2)
Submit a blank question to leave the question loop and finish the command. The generated sample still comes from the tiny TorchSharp character model; the question loop is deliberately grounded in the training document so the demo can answer useful questions without pretending that a small character model has suddenly become a semantic assistant.
Useful demo questions:
- What is App Service useful for?
- When should I use Azure Functions?
- How can developers manage Azure resources?
- What does Azure Monitor help with?
- Why would I use Bicep or ARM templates?
The evidence score is a simple lexical overlap score from the local document retriever. It is intentionally visible and imperfect, which makes it useful for explaining why production retrieval systems often add embeddings, chunking, reranking, and a final LLM answer-generation step.
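A lexical overlap score of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not the repo's retriever: the stopword list, the tokenisation rule, and the names Tokenize and OverlapScore are all hypothetical. With these particular choices it happens to reproduce the 3, 2, 2 scores from the representative output, but the real implementation may arrive at them differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed stopword list, chosen for this illustration only.
var assumedStopwords = new HashSet<string> { "what", "is", "for", "a", "an", "the", "to", "on" };

// Evidence sentences copied from the representative output.
string[] documentSentences =
{
    "App Service is useful when a team wants a fast path to publish web projects.",
    "App Service for Linux can run custom container images for web applications.",
    "Hybrid Connections can connect an App Service application to on premises resources.",
};

var question = "What is App Service useful for?";
foreach (var sentence in documentSentences.OrderByDescending(s => OverlapScore(question, s, assumedStopwords)))
{
    Console.WriteLine($"{sentence} (score {OverlapScore(question, sentence, assumedStopwords)})");
}

// Lowercase, split into alphanumeric words, and drop stopwords.
static HashSet<string> Tokenize(string text, ISet<string> stopwords) =>
    Regex.Matches(text.ToLowerInvariant(), "[a-z0-9]+")
        .Select(match => match.Value)
        .Where(word => !stopwords.Contains(word))
        .ToHashSet();

// Score = number of distinct question words that also appear in the sentence.
static int OverlapScore(string questionText, string sentenceText, ISet<string> stopwords)
{
    var questionWords = Tokenize(questionText, stopwords);
    return Tokenize(sentenceText, stopwords).Count(word => questionWords.Contains(word));
}
```

Note that "application" does not match "app" here. Whole-word matching misses that kind of near match, which is one reason production systems add embeddings and reranking.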
For a deeper technical walkthrough of the current implementation, see docs/nanogpt-technical-walkthrough.md.
The nanogptv2 command keeps the same training document and interactive flow as nanogpt, but changes the first part of the model.
Version 1 manually expands token IDs into one-hot vectors and feeds the flattened result into a Linear layer. Version 2 sends integer token IDs into a TorchSharp Embedding layer first. The embedding table learns a dense vector for each character token during training.
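The relationship between the two approaches can be shown without TorchSharp: multiplying a one-hot vector by a weight matrix selects a single row, which is what an embedding lookup does directly. The sketch below uses plain arrays and a made-up 4-token table; OneHotTimesMatrix and EmbeddingLookup are hypothetical names, and the real v1 code additionally flattens a whole context window of one-hot vectors, which is omitted here.

```csharp
using System;
using System.Linq;

// A hypothetical vocabulary of 4 tokens, each with a 3-dimensional vector.
double[][] embeddingTable =
{
    new[] { 0.1, 0.2, 0.3 },
    new[] { 0.4, 0.5, 0.6 },
    new[] { 0.7, 0.8, 0.9 },
    new[] { 1.0, 1.1, 1.2 },
};

var viaOneHot = OneHotTimesMatrix(2, embeddingTable);
var viaLookup = EmbeddingLookup(2, embeddingTable);
Console.WriteLine(viaOneHot.SequenceEqual(viaLookup)); // prints "True"

// v1 style: expand the token ID to a one-hot vector, then multiply by the weights.
static double[] OneHotTimesMatrix(int tokenId, double[][] weights)
{
    var oneHot = new double[weights.Length];
    oneHot[tokenId] = 1.0;

    var result = new double[weights[0].Length];
    for (var row = 0; row < weights.Length; row++)
        for (var col = 0; col < result.Length; col++)
            result[col] += oneHot[row] * weights[row][col];
    return result;
}

// v2 style: use the token ID as a row index into the table.
static double[] EmbeddingLookup(int tokenId, double[][] weights) =>
    (double[])weights[tokenId].Clone();
```

The lookup gives the same vector while skipping the multiplications by zero, so an embedding layer is the cheaper and more direct way to express the same idea.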
Version 2 also adds positional embeddings. A second embedding table learns a vector for each position in the context window. The model adds token vector plus position vector before processing the sequence.
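As a rough sketch of that addition, the snippet below sums a token vector and a position vector for each slot in a two-token window. The tables hold made-up values and EmbedSequence is a hypothetical helper. In the real model both tables are learned TorchSharp parameters.

```csharp
using System;
using System.Linq;

// Hypothetical tables: 2 tokens and 2 positions, 2 dimensions each.
double[][] tokenVectors = { new[] { 1.0, 2.0 }, new[] { 3.0, 4.0 } };
double[][] positionVectors = { new[] { 0.5, 0.5 }, new[] { 0.25, -0.25 } };

// The same token ID (1) appears at position 0 and position 1.
var embedded = EmbedSequence(new[] { 1, 1 }, tokenVectors, positionVectors);
foreach (var vector in embedded)
{
    Console.WriteLine(string.Join(", ", vector));
}

// For each slot: token vector + position vector, element by element.
static double[][] EmbedSequence(int[] tokenIds, double[][] tokenTable, double[][] positionTable) =>
    tokenIds
        .Select((tokenId, position) => tokenTable[tokenId]
            .Zip(positionTable[position], (token, offset) => token + offset)
            .ToArray())
        .ToArray();
```

The two output rows differ even though the token is the same, which is how the later layers can tell where in the window a character sits.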
Version 2 then adds multi-head causal self-attention. The attention layer builds query, key, and value projections from the embedded sequence, splits them into multiple heads, runs causal attention independently in each head, concatenates the results, and projects them back to the original embedding width.
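The core of that computation can be sketched with plain arrays. To stay short, the example below is a simplification under stated assumptions: it runs a single head on one unbatched sequence and leaves out the head split, concatenation, and output projection. MatMul and CausalSelfAttention are hypothetical names, not the repo's TorchSharp code.

```csharp
using System;
using System.Linq;

// Identity projections keep the example easy to follow; real weights are learned.
double[][] identity = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 } };
double[][] sequence = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 } };

var attended = CausalSelfAttention(sequence, identity, identity, identity);
Console.WriteLine(string.Join(", ", attended[0])); // position 0 can only see itself

// Plain matrix multiplication: (rows x inner) * (inner x cols).
static double[][] MatMul(double[][] a, double[][] b) =>
    a.Select(row => Enumerable.Range(0, b[0].Length)
        .Select(col => row.Select((value, inner) => value * b[inner][col]).Sum())
        .ToArray()).ToArray();

static double[][] CausalSelfAttention(double[][] x, double[][] wq, double[][] wk, double[][] wv)
{
    var q = MatMul(x, wq);
    var k = MatMul(x, wk);
    var v = MatMul(x, wv);
    var scale = Math.Sqrt(q[0].Length);
    var output = new double[x.Length][];

    for (var t = 0; t < x.Length; t++)
    {
        // Causal mask: position t scores only positions 0..t, never later ones.
        var scores = Enumerable.Range(0, t + 1)
            .Select(s => q[t].Zip(k[s], (left, right) => left * right).Sum() / scale)
            .ToArray();

        // Softmax over the visible positions.
        var max = scores.Max();
        var weights = scores.Select(score => Math.Exp(score - max)).ToArray();
        var total = weights.Sum();

        // Weighted sum of the value vectors.
        output[t] = new double[v[0].Length];
        for (var source = 0; source <= t; source++)
            for (var d = 0; d < output[t].Length; d++)
                output[t][d] += weights[source] / total * v[source][d];
    }
    return output;
}
```

A multi-head version would run this same routine on several narrower slices of the projections and concatenate the per-head outputs before the final projection.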
The model now also uses layer normalisation before attention and before the feed-forward network. Layer normalisation keeps each token vector in a steadier numerical range, which helps the later layers train more predictably.
Version 2 also adds residual connections. Attention no longer replaces the embedded sequence; it produces an update that is added back to the original sequence. The feed-forward block follows the same pattern. In code terms, the model now follows x = x + attention(layerNorm(x)) and then x = x + feedForward(layerNorm(x)).
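The pre-norm residual pattern can be sketched on a single token vector as below. Several things here are assumptions made for illustration: LayerNorm has no learnable gain or bias (a TorchSharp LayerNorm layer would normally have both), and the attention and feedForward delegates are stand-in functions rather than the real sublayers.

```csharp
using System;
using System.Linq;

// Stand-in sublayers so the pattern can run on its own.
Func<double[], double[]> attention = v => v.Select(value => 0.5 * value).ToArray();
Func<double[], double[]> feedForward = v => v.Select(value => Math.Max(0.0, value)).ToArray();

var x = new[] { 1.0, 2.0, 3.0, 4.0 };
x = Residual(x, attention);   // x = x + attention(layerNorm(x))
x = Residual(x, feedForward); // x = x + feedForward(layerNorm(x))
Console.WriteLine(string.Join(", ", x));

// Normalise one vector to zero mean and (approximately) unit variance.
static double[] LayerNorm(double[] input, double epsilon = 1e-5)
{
    var mean = input.Average();
    var variance = input.Select(value => (value - mean) * (value - mean)).Average();
    var scale = 1.0 / Math.Sqrt(variance + epsilon);
    return input.Select(value => (value - mean) * scale).ToArray();
}

// The sublayer sees the normalised input, and its output is added to the original.
static double[] Residual(double[] input, Func<double[], double[]> sublayer)
{
    var update = sublayer(LayerNorm(input));
    return input.Zip(update, (original, delta) => original + delta).ToArray();
}
```

Because the sublayer only contributes an update, a sublayer that outputs zeros leaves the input untouched, which is part of why residual blocks can be stacked without erasing earlier information.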
The feed-forward block is now explicit. It transforms each token vector independently with embedding -> hidden -> embedding, preserving the sequence shape so the result can be added back through a residual connection. This differs from v1, where the flattened context went straight through one feed-forward-style network to produce logits.
The language-model head now produces logits for every position in the context window. Instead of flattening the whole sequence before scoring, v2 applies a final embedding -> vocabulary projection to each position, producing logits of shape batch x blockSize x vocabularySize. For now the training loop still uses only the final position's logits, so the behaviour remains comparable with v1.
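The shape change can be illustrated with a single unbatched sequence. In this sketch the hidden states and head weights are made-up integers and ProjectToVocabulary is a hypothetical helper. The real model does the same thing with a batched TorchSharp Linear layer.

```csharp
using System;
using System.Linq;

// 3 positions (blockSize) x 2 dimensions (embedding size).
double[][] hiddenStates = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 } };

// 2 dimensions (embedding size) x 4 tokens (vocabulary size).
double[][] headWeights = { new[] { 1.0, 2.0, 3.0, 4.0 }, new[] { 5.0, 6.0, 7.0, 8.0 } };

var logits = ProjectToVocabulary(hiddenStates, headWeights);
Console.WriteLine($"{logits.Length} x {logits[0].Length}"); // prints "3 x 4"

// Training on the final position only, as described above.
var finalPositionLogits = logits[^1];
Console.WriteLine(string.Join(", ", finalPositionLogits)); // prints "6, 8, 10, 12"

// Apply the same embedding -> vocabulary projection to every position.
static double[][] ProjectToVocabulary(double[][] hidden, double[][] head) =>
    hidden.Select(row => Enumerable.Range(0, head[0].Length)
        .Select(col => row.Select((value, inner) => value * head[inner][col]).Sum())
        .ToArray()).ToArray();
```

Once every position has its own row of logits, a sequence-level loss only needs a target token for each row rather than a different model.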
That gives the next architecture steps somewhere sensible to attach:
- sequence-level logits
- sequence-level loss
- stacking the block
- checkpoint save/load
Lesson 7 can optionally call a live model through the OpenAI .NET SDK.
By default, the lesson still works without any configuration. If no API key is present, it falls back to the locally composed grounded answer so the demo remains runnable.
The easiest way to enable the live call is to edit the local config file in the repo root:
{
"OpenAiApiKey": "your-api-key-here"
}
The file name is:
lesson7.config.json
This file is ignored by git, so you can keep your local key there without committing it.
If you prefer, you can still use an environment variable instead:
export OPENAI_API_KEY="your-api-key-here"
Then run lesson 7:
dotnet run --project ukazure.llm.cli.csproj -- lesson7
If no key is configured, lesson 7 will show a message like:
No API key configured. Set OPENAI_API_KEY or update lesson7.config.json.
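The lookup-then-fallback behaviour could be implemented along the following lines. This is a sketch under assumptions, not the lesson's code: ResolveApiKey is a hypothetical helper, and checking the environment variable before the config file is an assumed order of precedence that the repo may not share.

```csharp
#nullable enable
using System;
using System.IO;
using System.Text.Json;

var configPath = "lesson7.config.json";
var configJson = File.Exists(configPath) ? File.ReadAllText(configPath) : null;
var apiKey = ResolveApiKey(Environment.GetEnvironmentVariable("OPENAI_API_KEY"), configJson);

Console.WriteLine(apiKey is null
    ? "No API key configured. Falling back to the locally composed grounded answer."
    : "API key found. The live model call can run.");

// Returns the first usable key, or null when neither source provides one.
// Assumed precedence: environment variable first, then the config file.
static string? ResolveApiKey(string? environmentValue, string? configText)
{
    if (!string.IsNullOrWhiteSpace(environmentValue)) return environmentValue;
    if (string.IsNullOrWhiteSpace(configText)) return null;

    using var document = JsonDocument.Parse(configText);
    var root = document.RootElement;
    if (root.ValueKind == JsonValueKind.Object
        && root.TryGetProperty("OpenAiApiKey", out var property)
        && property.ValueKind == JsonValueKind.String)
    {
        var key = property.GetString();
        return string.IsNullOrWhiteSpace(key) ? null : key;
    }
    return null;
}
```

Keeping the resolution logic in a pure function that takes the two raw inputs makes the fallback path easy to test without touching real environment variables or files.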
- The lessons are intentionally tiny and simplified.
- The goal is clarity, not scale or performance.
- Later lessons reuse ideas from earlier ones, so they work best as a sequence.