Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions _config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -37,5 +37,10 @@ callouts:
# Makes Aux links open in a new tab. Default is false
aux_links_new_tab: true

# Enable mermaid diagrams in fenced ```mermaid code blocks.
# https://just-the-docs.com/docs/ui-components/code/#mermaid-diagram-code-blocks
mermaid:
version: "10.9.0"

kramdown:
syntax_highlighter: coderay
159 changes: 159 additions & 0 deletions technical/guardrails.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
---
title: Guardrails
parent: Technical documentation
has_children: false
nav_order: 7
---

# LiteLLM guardrails

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: as the first bit of this section, we could add a brief explanation of how guardrails work and are applied, which would also clear up some confusion about e.g. the default_on parameter as discussed in os2ai/Feedback#1 (comment)

A suggested text you can adopt or modify:

LiteLLM guardrails are a way to execute code on and around input and output sent to and from
models managed by LiteLLM.

The Open WebUI instance sends requests to LiteLLM, but does not support setting guardrails on
a per-request basis. For this reason, guardrails that should execute on messages from Open
WebUI should be set as always on, by setting the `default_on` setting to `true` when declaring
the guardrail (see below).


Currently, shipped guardrails:

| Guardrail | Type | File | What it does |
|----------------------------|----------|------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `MessageTrimmingGuardrail` | Pre-call | [`message-trimming`](https://github.com/os2ai/helm-deployments/blob/develop/applications/litellm/templates/message-trimming-config.yaml) | Trims oversized message histories to fit the target model's context window, then sanitizes tool-call/tool-response pairings so the trimmed (or otherwise broken) history doesn't crash strict chat templates. |

Pre-call guardrails in [LiteLLM](https://github.com/BerriAI/litellm) proxy applies to inbound chat requests before
forwarding them to the upstream model.

## Usage

The message trimming guardrail can be configured in the litellm
values [file](https://github.com/os2ai/helm-deployments/blob/develop/applications/litellm/litellm-values.yaml#L108)
configuration file in the helm chart.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note on why we need message trimming would make sense after this paragraph, just very briefly. E.g. what happens if oversized message histories are not trimmed, and how does the guardrail avoid it? In a sentence of two.


__Note__: that the default configuration contains an option to set max tokens for a named model, which overrides the
global default max tokens value. This is useful for models that have a different context window size than the global
default.

```yaml
model_list:
- model_name: my-model
litellm_params:
model: openai/some-deployed-model
api_base: https://...
api_key: ""
max_tokens: 8192
guardrails:
# attach the guardrail to this model
- message_trimming

guardrails:
- guardrail_name: message_trimming
litellm_params:
guardrail: /app/custom_guardrails/message_overflow.MessageTrimmingGuardrail
mode: pre_call
default_on: true
default_config:
trim_ratio: 0.75
max_output_tokens: 2000
safety_buffer: 500
debug: false
default_max_context_tokens: 8192
max_context_tokens_by_model:
openai/some-deployed-model: 32768
pop_trailing_tool_messages: false

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some comments on what the different parts of this config does would make sense, if we want it here. Why are the settings what they are, how might a user change them?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a link to the configuration reference could do a lot of the heavy lifting here. I'm still in doubt as to what default_on does from reading this (and honestly litellm's docs too https://docs.litellm.ai/docs/proxy/guardrails/quick_start#default-on-guardrails)

```

## How Message Trimming works

`async_pre_call_hook` runs on every chat completion request. The flow:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the step-by-step flow, I think a sentence stating what the pros and cons of our approach is would make sense.

E.g. "Sending a too large message to the model can be fatal for the entire conversation, so we take a conservative approach in estimating a safe completion budget for the message" and then explain what a safe completion budget is, why we calculate it as is? I think that would give a lot of good context for evaluating the approach.


1. __Resolve context window__ for the target model (`_resolve_max_context_tokens`):
per-model override map → `litellm.get_max_tokens` → global default. Logs a warning if it falls through to the global
default.
2. __Compute a safe completion budget__ (`_calculate_safe_completion_tokens`) — leaves room for input + safety buffer +
a 25% headroom factor for tokens LiteLLM/the provider may add later.
3. __Update `max_tokens` / `max_completion_tokens`__ in the request so the model can't be asked for more than fits.
4. __Trim input messages__ (`litellm.trim_messages`) if `current_input_tokens > max_input_tokens`, dropping older
messages from the head until it fits.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a potentially dangerous strategy, if this means losing the initial context of a chat, which is often more significant framing than the details of interaction happening in later messages?

Out of scope for now, I guess, but we could talk about summarizing the old messages instead, maybe?

5. __Sanitize__ (`_sanitize_messages`):
- `_repair_tool_call_pairings` — strip orphan `role: tool` messages and orphan `tool_calls` entries that the trimmer
may have created.
- (Optional, opt-in via `pop_trailing_tool_messages`) pop trailing `role: tool` messages and re-run the repair, then
append `"Please continue"` if the new terminus is an assistant message.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Please continue" feels like it could skew the output, especially if the language in the context window otherwise isn't English?

6. __Recount and re-budget__ completion tokens once more, since sanitize may have grown or shrunk the message list.

### Why `_repair_tool_call_pairings` exists

LiteLLM's build in `trim_messages` has __no tool-call awareness__ — it drops messages by token count from the head and
freely produces:

- Orphan `role: tool` messages (no surviving `assistant.tool_calls` advertised them).
- Orphan `tool_calls` entries on assistant messages (no surviving `role: tool` answered them).

Both shapes are rejected by strict chat templates (Mistral, vLLM, OpenAI strict mode). The repair pass enforces the
invariant: every surviving `tool_calls[].id` has a later matching `role: tool` message, and every surviving `role: tool`
was advertised by an earlier surviving `assistant.tool_calls` entry. See `_repair_tool_call_pairings` in [`message-trimming`](https://github.com/os2ai/helm-deployments/blob/develop/applications/litellm/templates/message-trimming-config.yaml).

### Why the trailing-tool pop is opt-in

The "normal" agent-loop shape ends on a `role: tool` message:

```mermaid
flowchart LR
U[User] --> A["Assistant{tool_calls}"]
A --> T["Tool{result}"]
T --> C([model is asked to continue here])
```

Most providers (OpenAI, Anthropic, Google, Mistral via the official APIs) __accept__ this shape — that's how tool
calling works. Popping the tool message and substituting `"Please continue"` deprives the model of the result it was
supposed to reason from, so the default is __off__.
Comment on lines +89 to +102

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this section. Specifically Popping the tool message and substituting "Please continue" deprives the model of the result it was supposed to reason from, so the default is __off__. doesn't really tell me what happens in the cases where the setting is enabled vs disabled, and what exactly the default behavior is.

What is the effect of depriving the model of the result it was supposed to reason from? (Am I understanding it correctly that this refers to "the result of the tool call", and if so, could we call it that?)


Set `pop_trailing_tool_messages: true` only for upstream chat templates that explicitly reject `role: tool` messages —
notably the strict HuggingFace template that raises `"Only user and assistant roles are supported!"`. The per-model
override map lets you flip it for one model in a fleet without affecting the others.

### Why both repairs run when pop is enabled

The order is `repair → pop → repair → maybe-append-continue`:

- The first repair cleans up orphans created by `trim_messages`.
- The pop may break a previously-valid `[Assistant{tool_calls=[X]}, Tool X]` pair, leaving the assistant holding orphan
`tool_calls`.
- The second repair restores the invariant — strips the now-orphan `tool_calls`, drops content-empty assistants
entirely.
- *Then* we decide whether to append `"Please continue"`, after seeing the post-repair terminus. (Appending before would
risk leaving a stale "user-continue" line after a now-deleted assistant.)

## Configuration reference

Read from `default_config` of the guardrail entry in `litellm_config.yaml`. All keys optional.

| Key | Type | Default | Purpose |
|---------------------------------------|-------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `trim_ratio` | float | `0.75` | Forwarded to `litellm.trim_messages`. Fraction of `max_tokens` that trimming aims for, leaving headroom for additions later in the pipeline. |
| `max_output_tokens` | int | `2000` | Default completion budget when the request specifies neither `max_tokens` nor `max_completion_tokens`. |
| `safety_buffer` | int | `500` | Reserved tokens carved out of the context window before computing input/output budgets — covers system prompts, function schemas, and other tokens added downstream. |
| `debug` | bool | `false` | When `true`, the guardrail prints `[GUARDRAIL]`-prefixed traces to stdout. Show up in `task compose -- logs -f litellm`. |
| `default_max_context_tokens` | int | `8192` | Fallback context-window size when neither `max_context_tokens_by_model` nor `litellm.get_max_tokens` resolves the model. __Bump this if your fleet's smallest model is bigger than 8k.__ |
| `max_context_tokens_by_model` | dict | `{}` | Per-model overrides keyed by the upstream `model:` value LiteLLM forwards (NOT the friendly `model_name`). Wins over `litellm.get_max_tokens`. Use this for vLLM, Bedrock variants, custom deployments — anything not in [`litellm/model_prices_and_context_window.json`](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json). |
| `pop_trailing_tool_messages` | bool | `false` | Strip trailing `role: tool` messages before forwarding. __Leave `false` unless the upstream chat template rejects them__ — popping loses tool-call results the model needs to reason from. |
| `pop_trailing_tool_messages_by_model` | dict | `{}` | Per-model override of the flag above, same key shape as `max_context_tokens_by_model`. |

### Resolution order, illustrated

__Context window__ — first hit wins:

```mermaid
flowchart TD
A["max_context_tokens_by_model[model]"] -->|miss| B["litellm.get_max_tokens(model)"]
B -->|raises / 0| C[default_max_context_tokens]
A -. hit .-> H((use value))
B -. hit .-> H
C --> H
```

__Pop trailing tools__ — first hit wins:

```mermaid
flowchart TD
A["pop_trailing_tool_messages_by_model[model]"] -->|miss| B[pop_trailing_tool_messages]
A -. hit .-> H((use value))
B --> H
```

## References

- [LiteLLM custom guardrail docs](https://docs.litellm.ai/docs/proxy/guardrails/custom_guardrail)
Loading