diff --git a/content/blogs/small-models-rock.md b/content/blogs/small-models-rock.md
new file mode 100644
index 0000000..e8431f0
--- /dev/null
+++ b/content/blogs/small-models-rock.md
@@ -0,0 +1,459 @@
+---
+title: "Making Small Models Rock with Mellea"
+date: "2026-05-20"
+author: "Paul Schweigert, Nathan Fulton"
+excerpt: "Small open-weight models can handle production-shaped work when the harness decomposes the task, validates outputs, and routes each step to the right local model."
+tags: ["granite", "rag", "adapters", "small-models", "docling", "local-llm"]
+---
+
+
+
+Small open-weight models keep getting better. A 3B Granite model on your
+laptop today does things a 70B model couldn't do a year ago. There is still
+a gap between "small model can technically do this" and "small model can
+reliably do this on real data in production," and most teams cross that gap
+by giving up and paying for a frontier reasoning model instead.
+
+They shouldn't have to. The reason a small model fails on most production
+tasks isn't raw capability; it's that the small model can't hold a
+six-subtask problem in one shot. A frontier model given a thousand-token
+system prompt with six subtasks inlined will muddle through. A 3B model
+given the same prompt won't.
+
+The fix is a better harness around the model you already have. By
+"harness" we mean the software scaffolding around the model call:
+decomposition, validation, retries, tool dispatch. The part that isn't
+the forward pass. That's what Mellea is for.
+
+> **What this post does**: walks through a construction cost-estimation
+> pipeline that one-shot prompting needs a frontier reasoning model for,
+> rebuilt as a local Mellea pipeline. Granite 4 Micro 3B handles parsing,
+> validation, and pricing; GPT-OSS 20B handles chart and report generation.
+> No API keys, no egress, no per-token billing. If you're paying
+> frontier-model prices for structured extraction or matching, the same
+> pattern applies.
+
+## The Bet
+
+Mellea is built around one observation: the software harness around a model
+matters as much as the model itself. Frontier labs already know this. The
+team training OpenAI's models knows exactly what the ChatGPT and Codex
+harnesses look like, and that knowledge shapes the training data. The
+open-weight ecosystem has mostly not had that feedback loop. You upload
+weights to Hugging Face and people do whatever with them.
+
+Mellea is the harness side of that co-design. It takes tasks that currently
+demand a frontier model, breaks them into pieces a small model can do well,
+and holds those pieces together with ordinary code rather than an
+ever-growing English prompt.
+
+Three things fall out of that approach:
+
+- **Cost is predictable.** Local inference on a small model has a fixed,
+ knowable cost per run, so you can back-test millions of runs without a
+ finance conversation.
+- **Data stays local.** Documents never leave the machine. For regulated
+ domains like healthcare, finance, legal, and procurement, that's the
+ difference between "can use this" and "can't."
+- **The inference backend is yours to choose.** Mellea talks to Ollama,
+ vLLM, Hugging Face, and any OpenAI-compatible endpoint, so swapping
+ backends is a one-line change. You aren't married to one provider's
+ pricing or uptime.
+
+## The Three Patterns
+
+Mellea turns a task that needs a frontier model into one a small model can
+handle through three patterns:
+
+
+
+1. **Decompose the task.** A single prompt bundles parsing, fuzzy matching,
+ arithmetic, and generation. Split it into steps narrow enough that a
+ small model can nail each one.
+2. **Externalize control flow.** Keep the `for` loops, `if` statements, and
+ retry logic in Python, not in the prompt. Models don't need to be
+ instructed to iterate; code iterates.
+3. **Modularize model capabilities.** Reuse validated components like
+ structured output, typed generative functions, and RAG adapters
+ instead of rebuilding them every time.
+
+None of these is new. They are the ordinary software-engineering moves
+you would make for any non-trivial system. What's new is applying them to the
+part of your program that happens to be written in English.
+
+The argument in [*On the Foolishness of "Natural Language
+Programming"*](https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667.html)
+from 1978 applies directly. Dijkstra's line:
+
+> The "naturalness" with which we use our native tongues boils down to
+> the ease with which we can use them for making statements the nonsense
+> of which is not obvious.
+
+A modern model interpreting English well enough to *try* doesn't change
+that. Drawing boundaries, stating contracts, decomposing work: that's
+where the engineering has always lived. The syntax of the target
+language was the trivial part.
+
+## An Example Worth Walking Through
+
+Pick a task that's just past what single-shot prompting can do reliably,
+then build it up as a small-model pipeline.
+
+A construction management firm wants to estimate material costs for a
+project. The inputs are a construction plan (PDF, with a bill of materials
+spread across several tables) and a handful of supplier catalogs (PDF,
+DOCX, XLSX). The output is a priced HTML report with a pie chart breaking
+spend down by category.
+
+Ask a frontier model to do this in one shot and you can mostly make it
+work. A top-tier reasoning model with extended thinking gets most of the
+line items right, cites the catalog it pulled the price from, and does
+it all for about a dollar per run. Here's what one-shot prompting looks like
+across the size spectrum:
+
+| Model | Result |
+| --- | --- |
+| Small open-weight (<3B) | Doesn't understand the task. Returns a generic "cost breakdown" with no prices. |
+| Open-weight reasoning (gpt-oss-20b, one-shot) | Finds categories and subtotals. No pie chart. Numbers often wrong. |
+| Gemini Flash | Mostly reasonable. No chart. Some prices off. |
+| Frontier reasoning (GPT-5) | Gets most items right. Cites sources. No chart on first shot. On the order of a dollar per run. |
+
+That sounds cheap until you write down what the firm actually wants to
+do with it. Fifteen hundred projects a year, twenty years of history, a
+backlog of hypothetical backtests: *what if we'd switched lumber
+suppliers in 2012? What happens to margins if this supply chain
+constraint materializes?* One backtest is a Monte Carlo simulation over
+tens of thousands of projects. At frontier-API prices, that's the
+budget for a small team.
+
+Same task, as a local Mellea pipeline. Local models handle parsing,
+validation, pricing, and chart/report generation instead of routing the
+whole job through a frontier API. A few hundred lines of pipeline code,
+no API keys, no egress, no per-token billing. Here's how.
+
+> **Note:** The snippets below are the load-bearing pieces of the pipeline, trimmed
+> for reading. The [full runnable notebook](https://github.com/generative-computing/mellea-tutorials/blob/main/notebooks/atai_2026/tutorial.ipynb)
+> has the install commands, sample data, and the glue between steps.
+
+```python
+import mellea
+from mellea.backends.model_ids import IBM_GRANITE_4_MICRO_3B
+
+m = mellea.start_session(backend_name="ollama", model_id=IBM_GRANITE_4_MICRO_3B)
+```
+
+## Step 1: Parse the Construction Plan into a Bill of Materials
+
+The construction plan is a PDF with tables scattered through it. Some are
+bills of materials; others are schedules, notes, or summaries. The first
+step is to get a clean, typed `BOM` object.
+
+`RichDocument` wraps [docling](https://github.com/DS4SD/docling) and
+exposes tables as markdown, which small models handle much better than raw
+HTML or PDF binary:
+
+```python
+from mellea.stdlib.components.docs.richdocument import RichDocument
+
+plans = RichDocument.from_document_file("construction_docs/construction_plans.pdf")
+```
+
+To filter down to material-list tables, Mellea lets you declare a typed
+generative function: a Python signature with no body, plus a docstring.
+
+```python
+from typing import Literal
+
+@mellea.generative
+def is_material_list(table_markdown: str) -> Literal["yes", "no"]:
+ """Determines if the table contains a list of construction items."""
+```
+
+The `@mellea.generative` decorator turns the signature into a constrained
+model call. The return type `Literal["yes", "no"]` forces a binary answer at
+decode time. The model can't hedge with a paragraph, can't return "maybe,"
+and can't return "probably yes." That constraint is enforced by the
+sampler, not by a post-hoc prompt rule.
+
+For tables that pass the filter, reformat them into a validated `BOM`:
+
+```python
+import pydantic
+
+class BOMEntry(pydantic.BaseModel):
+ item: str
+ quantity: int | str
+ notes: str
+ category: Literal["lumber", "windows", "doors", "other"]
+
+class BOM(pydantic.BaseModel):
+ items: list[BOMEntry]
+```
+
+The schema pins the shape. But "quantity" has a construction-specific quirk:
+alongside integers, entries can read *allowance* — meaning *buy some of these,
+here's twenty bucks.* The Pydantic schema alone can't express that rule, so
+it goes into a requirement with an explicit validator:
+
+```python
+from mellea.stdlib.requirements import req, simple_validate
+
+def _quantity_is_valid(out: str) -> bool:
+ bom = BOM.model_validate_json(out)
+ return all(str(e.quantity).isdigit() or str(e.quantity).lower() == "allowance" for e in bom.items)
+
+m.instruct(
+ "Reformat this table to have four columns: item, quantity, category, and notes.",
+ grounding_context={"table": table.to_markdown()},
+ requirements=[req("Quantity must be an integer or 'allowance'", validation_fn=simple_validate(_quantity_is_valid))],
+ format=BOM,
+)
+```
+
+Two layers of enforcement: `format=BOM` constrains the decoder to emit valid
+JSON matching the schema; the `requirements` list adds business-logic checks
+on top. If validation fails, Mellea re-samples. The loop is Python; the model
+only has to do the reformatting.
+
+### Parallelize across tables
+
+Construction documents often carry many tables, and each table is
+independent of the others. The full notebook fans out with `ainstruct`:
+
+```python
+boms = await asyncio.gather(*[reformat(t) for t in material_tables])
+```
+
+One coroutine per material-list table, gather merges the resulting
+`BOM` objects. Wall-clock scales with the slowest table, not the sum.
+The loop shape stays in Python.
+
+## Step 2: Load the Product Catalogs
+
+Catalogs come in whatever format the supplier happened to send. `RichDocument`
+handles PDF, DOCX, and XLSX the same way:
+
+```python
+from mellea.stdlib.components.docs import Document
+
+def load(path: str) -> Document:
+ return Document(text=RichDocument.from_document_file(path).to_markdown())
+
+doors_doc = load("door_catalog.pdf")
+windows_doc = load("windows_catalog.docx")
+```
+
+Each becomes a plain `Document`, the input format Mellea's RAG adapters
+expect. Any item whose category isn't keyed into the catalog lookup below will
+land in the report as "unknown."
+
+## Step 3: Match BOM Entries to Prices with RAG Adapters
+
+Matching "36x80 Aurora half-moon entry door" from a bill of materials to
+the right line in a product catalog is fuzzy by nature. Rather than ask a
+general-purpose model to "find the right price and tell me if you're sure,"
+put a calibrated gate in front of extraction.
+
+The adapters run on the HuggingFace backend, so start a second session:
+
+```python
+from mellea.backends.huggingface import LocalHFBackend
+from mellea.stdlib.components.intrinsic.rag import check_answerability
+from mellea.stdlib.context import ChatContext
+
+m_hf = mellea.MelleaSession(backend=LocalHFBackend(model_id=IBM_GRANITE_4_MICRO_3B))
+```
+
+`check_answerability` is the RAG adapter we'll use as our gate: given a
+question and a document, it returns whether the document actually contains
+enough information to answer. The line that matters is the answerability
+gate (`if verdict != "answerable":`) sitting in front of extraction. The loop walks every BOM item, picks the
+right catalog by category, and asks the answerability adapter whether a
+price is extractable before it ever asks the model to produce one:
+
+```python
+CATALOGS = {"windows": windows_doc, "doors": doors_doc}
+
+def price_one(entry: BOMEntry) -> tuple[float | None, float | None]:
+ catalog = CATALOGS.get(entry.category)
+ if catalog is None:
+ return None, None # no catalog for this category → unknown
+
+ verdict = check_answerability(
+ f"What is the price of {entry.item}?",
+ documents=[catalog], context=ChatContext(), backend=m_hf.backend,
+ )
+ if verdict != "answerable":
+ return None, None # adapter isn't confident → unknown, not hallucinated
+
+ unit = extract_unit_price(entry, catalog) # m.instruct, format=...
+ total = extract_total(unit, entry.quantity) # m.instruct, format=...
+ return unit, total
+```
+
+`verdict == "answerable"` is a gate: items the adapter can't confidently
+answer don't get a hallucinated price, they get `total_price=None` and
+flow through to the report as "unknown." Failure modes are explicit. And
+the model is only ever asked to extract a number from a document where a
+calibrated adapter has already confirmed the number is there. The hard
+problem (*is this the right document at all?*) was solved by a
+specialized adapter rather than by a general prompt.
+
+What makes these adapters worth reaching for, as opposed to asking a
+general model "is this document relevant, 1-10," is that the outputs are
+**calibrated**. Pass the doors catalog and a doors question, you get a
+high-confidence `"answerable"`. Pass the lumber catalog with the same doors
+question, and answerability correctly collapses to `"unanswerable"`.
+Frontier model logits don't give you this. Nothing in a general-purpose
+model's training signal makes the raw next-token probabilities mean
+"calibrated confidence that this document answers this question." Adapters
+are trained to make them mean that.
+
+The same adapter family also supports context relevance and citation
+finding. Granite adapters ship as **[ALoRA](https://arxiv.org/abs/2504.12397)**
+adapters — *activated* low-rank adapters that switch on only for the
+adapter's own forward pass, leaving the base model's KV cache from the
+document prefill intact. Running answerability, context relevance, and
+citation finding over the same long document reuses that one prefill
+instead of paying for it three times. You get modularity without paying
+3× the compute.
+
+## Step 4: Generate the Report
+
+The final step is two instructions: make the chart, then write the report
+around it.
+
+Pricing runs on a 3B Granite model, but chart-drawing code is a different
+beast. Tool-calling and matplotlib benefit from a model with more muscle,
+so for this step switch sessions to GPT-OSS 20B running locally on Ollama.
+The same `m` name now points at the larger model for the remainder of the
+pipeline:
+
+```python
+from mellea.backends.model_ids import OPENAI_GPT_OSS_20B
+from mellea.backends.model_options import ModelOption
+from mellea.stdlib.tools import local_code_interpreter
+from mellea.backends.tools import MelleaTool
+
+m = mellea.start_session(backend_name="ollama", model_id=OPENAI_GPT_OSS_20B)
+```
+
+The chart step hands the model a tool and lets it drive:
+
+```python
+chart = m.instruct(
+ "Use the code interpreter to create a pie chart of cost breakdowns "
+ "by category. Save it to /tmp/chart.png.",
+ grounding_context=report_ctx, # {item: {total_price, category}} for each priced row
+ tool_calls=True,
+ model_options={ModelOption.TOOLS: [MelleaTool.from_callable(local_code_interpreter)]},
+)
+chart.tool_calls["local_code_interpreter"].call_func()
+```
+
+Then a plain instruct writes the HTML report around that image:
+
+```python
+report = m.instruct(
+ "Write an HTML report with a top-line cost breakdown by category and a "
+ "line-item material list with prices. At the top include /tmp/chart.png.",
+ grounding_context=report_ctx,
+)
+```
+
+Swapping sessions mid-pipeline is one line. Each step runs against the
+model that suits it: tiny and cheap where the task is narrow, bigger only
+where the task demands it.
+
+The chart is drawn by actual Python code executing on actual data, not
+by a model emitting numbers and hoping they line up with the totals
+elsewhere in the report. The grounding context is a plain dict, not a
+vector store. When the context is small and structured and
+known at call time, passing it directly is simpler and more predictable
+than dragging in retrieval.
+
+Here's what the pipeline actually produces. The chart and top-line table:
+
+
+
+And the line-item list, where every row is either a price extracted from a
+catalog or an explicit `unknown`:
+
+
+
+The grand total ($7,885.04) is the sum of *known* items only. Lumber items
+have no catalog wired up in this run, so they short-circuit to `unknown`
+before the adapter is ever called — and the lumber row in the top-line
+table is annotated *excluding unknown items* so the gap is visible at a
+glance. The same outcome applies when a catalog is present but the
+answerability adapter isn't confident: the row stays `unknown` rather
+than getting a fabricated price. That's the property worth checking —
+not that every row gets a price, but that unsupported rows fail loudly.
+
+## Small Models, Production-Ready Output
+
+Step back from the construction example. The general shape of the trade:
+
+
+
+A task that one-shot required a frontier model and a dollar per run now
+runs as a local pipeline with smaller models doing the narrow parts. Pricing
+extraction runs on Granite 4 Micro 3B via HuggingFace; chart and report
+generation run on GPT-OSS 20B via Ollama. No API keys, no egress, no
+per-token billing. Items the pipeline can't confidently price show up as
+`unknown` in the output rather than as hallucinated numbers, so the report
+is auditable instead of something a human has to re-check line by line.
+
+The construction case isn't a one-off. The same three-pattern approach
+generalizes. The Mellea team has been porting prompt-heavy agents (a DB2
+database agent and a compliance agent) to the same decompose / externalize
+/ modularize shape, and the practical effect is what you'd predict from
+the construction pipeline: the steps the small model is now asked to do
+are narrow enough that it can do them, and the steps that still need a
+larger model are isolated to where they actually matter. The harness is
+what buys you that separation.
+
+The win is not that one tiny model does everything. The win is that the
+harness lets narrow steps run on tiny local models, reserves larger local
+models for the few steps that need them, and keeps control flow, validation,
+and failure handling in code. If you're paying frontier-model prices for a
+task that decomposes cleanly, there's a good chance you're paying for
+engineering you haven't done yet.
+
+## Trade-offs
+
+Wall-clock per run is higher than a single API call. Each individual
+step is usually *faster* — smaller model, no network — but a pipeline
+with several serial steps adds up. Async execution hides most of the
+fan-out within a step, not across them. The win is in cost, privacy,
+and auditability, not in tail latency.
+
+Rate limits disappear. A frontier API can rate-limit you mid-backtest;
+local inference can't. When a pipeline fails at 3am, debugging a
+decomposed Mellea program means finding *which step* went wrong — a single
+frontier-model call is a black box. And the obvious alternative to
+"better harness + small model" is "fine-tune a small model." Harnessing
+is cheaper to iterate on and composes with fine-tuning later if you need
+it.
+
+Decomposition takes engineering effort. It's ordinary software work — a
+senior engineer can typically port a prompt pipeline in a day or two, and
+the GPU or laptop amortizes fast compared to even a week of frontier-API
+backtests. It pays back once the task is recurring, privacy matters, or
+cost compounds at scale.
+
+Not everything decomposes. Tasks that are genuinely holistic, like
+free-form creative writing or long-form reasoning chains that cannot be
+separated into clean intermediate states, still favor a big model. Know
+which kind of task you have before you invest.
+
+## Try It
+
+If you're running structured extraction, matching, classification, or
+report-generation pipelines and paying frontier-model prices for it, the
+pieces used above are all part of Mellea's standard library. The full
+construction tutorial notebook is in the Mellea tutorials repo.
+
+- [Construction tutorial notebook](https://github.com/generative-computing/mellea-tutorials/blob/main/notebooks/atai_2026/tutorial.ipynb)
+- [Mellea on GitHub](https://github.com/generative-computing/mellea)
+- [Granite RAG adapters on Hugging Face](https://huggingface.co/ibm-granite/granitelib-rag-r1.0)
diff --git a/public/images/small-models-rock/chart1.png b/public/images/small-models-rock/chart1.png
new file mode 100644
index 0000000..b55e380
Binary files /dev/null and b/public/images/small-models-rock/chart1.png differ
diff --git a/public/images/small-models-rock/chart2.png b/public/images/small-models-rock/chart2.png
new file mode 100644
index 0000000..93ba0d1
Binary files /dev/null and b/public/images/small-models-rock/chart2.png differ
diff --git a/public/images/small-models-rock/harnessed.png b/public/images/small-models-rock/harnessed.png
new file mode 100644
index 0000000..85aa287
Binary files /dev/null and b/public/images/small-models-rock/harnessed.png differ
diff --git a/public/images/small-models-rock/main.png b/public/images/small-models-rock/main.png
new file mode 100644
index 0000000..145452c
Binary files /dev/null and b/public/images/small-models-rock/main.png differ
diff --git a/public/images/small-models-rock/three-steps.png b/public/images/small-models-rock/three-steps.png
new file mode 100644
index 0000000..9b2713c
Binary files /dev/null and b/public/images/small-models-rock/three-steps.png differ