
Commit 42366ff

fix: align structured outputs API with vLLM's offline inference pattern

Previously, structured outputs support was incorrectly implemented by extracting the OpenAI-specific `extra_body.structured_outputs` key from the RunPod API's `sampling_params` dictionary. This mixed two different API patterns:

1. The OpenAI API uses `extra_body` as a catch-all for vLLM-specific parameters that aren't part of the OpenAI spec.
2. The RunPod API uses `sampling_params` as a direct pass-through to vLLM's `SamplingParams`, where all parameters sit at the same level.

The RunPod API should be a direct 1:1 mapping to vLLM's offline inference API, not an abstraction layer.

Changes:
- Remove the OpenAI `extra_body` extraction from the RunPod API path; the OpenAI route already handles structured outputs correctly through vLLM's OpenAI serving code.
- The RunPod API now uses only `sampling_params`, like every other parameter.
- Simplify to direct extraction from `sampling_params.structured_outputs`.
- Convert the dict to `StructuredOutputsParams(**config)` directly and set it on `SamplingParams` exactly as vLLM expects.
- Update the README to emphasize the direct mapping to vLLM's API.

The new implementation lets users pass structured outputs exactly as they would in vLLM's Python API:

```json
{
  "sampling_params": {
    "max_tokens": 128,
    "structured_outputs": {
      "json": {"type": "object", "properties": {...}},
      "regex": "[A-Z]+",
      "choice": ["Positive", "Negative"]
    }
  }
}
```

This maps directly to `SamplingParams(structured_outputs=StructuredOutputsParams(...))` and stays consistent with vLLM's documented offline inference patterns. Users familiar with vLLM can use our API without learning new concepts, and when vLLM adds new structured output features they work automatically, without code changes on our side, since we just pass the same parameters through.
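For comparison, a minimal sketch of what such a request looks like in vLLM's offline inference API; the model name and prompt are illustrative placeholders, not part of this commit:

```python
# Sketch of the equivalent vLLM offline-inference call.
# Model name and prompt are placeholders, not part of this commit.
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # any vLLM-supported model

params = SamplingParams(
    max_tokens=128,
    # Constrain generation to one of the listed strings
    structured_outputs=StructuredOutputsParams(choice=["Positive", "Negative"]),
)

outputs = llm.generate(["Classify the sentiment: 'I love this product!'"], params)
print(outputs[0].outputs[0].text)  # "Positive" or "Negative"
```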
1 parent c41f852 commit 42366ff

File tree

2 files changed: +42 −41 lines

README.md

Lines changed: 30 additions & 0 deletions

````diff
@@ -244,6 +244,7 @@ Additional parameters supported by vLLM:
 | `stop_token_ids` | Optional[List[int]] | list | List of tokens that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens. |
 | `skip_special_tokens` | Optional[bool] | True | Whether to skip special tokens in the output. |
 | `spaces_between_special_tokens` | Optional[bool] | True | Whether to add spaces between special tokens in the output. Defaults to True. |
+| `structured_outputs` | Optional[dict] | None | Constrains generations to JSON schemas, regexes, grammar, etc. See [Structured Outputs](https://docs.vllm.ai/en/latest/features/structured_outputs/). |
 | `add_generation_prompt` | Optional[bool] | True | Read more [here](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts) |
 | `echo` | Optional[bool] | False | Echo back the prompt in addition to the completion |
 | `repetition_penalty` | Optional[float] | 1.0 | Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens. |
@@ -416,4 +417,33 @@ You may either use a `prompt` or a list of `messages` as input.
 }
 ```
 
+#### Structured Outputs
+
+The RunPod API mirrors vLLM's offline inference API directly. To enforce JSON schemas, regexes, grammar rules, or structural tags, provide a `structured_outputs` object directly inside `sampling_params`. The structure matches the `SamplingParams(structured_outputs=StructuredOutputsParams(...))` API from vLLM (`json`, `regex`, `choice`, `grammar`, `structural_tag`, etc.). Example enforcing a JSON schema:
+
+```json
+{
+  "input": {
+    "messages": [
+      {"role": "user", "content": "Return a JSON document with name and age"}
+    ],
+    "sampling_params": {
+      "max_tokens": 128,
+      "structured_outputs": {
+        "json": {
+          "type": "object",
+          "properties": {
+            "name": {"type": "string"},
+            "age": {"type": "integer"}
+          },
+          "required": ["name", "age"]
+        }
+      }
+    }
+  }
+}
+```
+
+For all supported structured output types and usage patterns, refer to the vLLM [Structured Outputs guide](https://docs.vllm.ai/en/v0.11.1.1/features/structured_outputs/).
+
 </details>
````
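As a usage sketch, the payload documented above can be sent to a deployed worker over RunPod's `runsync` HTTP endpoint; the endpoint ID and API key below are placeholders:

```python
# Sketch: send the documented payload to a RunPod serverless endpoint.
# ENDPOINT_ID and API_KEY are placeholders.
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

payload = {
    "input": {
        "messages": [
            {"role": "user", "content": "Return a JSON document with name and age"}
        ],
        "sampling_params": {
            "max_tokens": 128,
            "structured_outputs": {
                "json": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "age": {"type": "integer"},
                    },
                    "required": ["name", "age"],
                }
            },
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,
)
print(resp.json())
```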

src/utils.py

Lines changed: 12 additions & 41 deletions
Original file line numberDiff line numberDiff line change
```diff
@@ -4,14 +4,15 @@
 from functools import wraps
 from time import time
 from vllm.entrypoints.openai.protocol import RequestResponseMetadata
-from vllm.sampling_params import StructuredOutputsParams
 
 try:
     from vllm.utils import random_uuid
     from vllm.entrypoints.openai.protocol import ErrorResponse
     from vllm import SamplingParams
+    from vllm.sampling_params import StructuredOutputsParams
 except ImportError:
     logging.warning("Error importing vllm, skipping related imports. This is ONLY expected when baking model into docker image from a machine without GPUs")
+    StructuredOutputsParams = None
     pass
 
 logging.basicConfig(level=logging.INFO)
@@ -50,47 +51,17 @@ def __init__(self, job):
         self.max_batch_size = job.get("max_batch_size")
         self.apply_chat_template = job.get("apply_chat_template", False)
         self.use_openai_format = job.get("use_openai_format", False)
-        samp_param = job.get("sampling_params", {})
-
-        # Reject deprecated old API format (top-level guided_json parameter)
-        # worker-vllm v2.9.5+ updated to vLLM 0.11.0+, which uses
-        # OpenAI-compatible extra_body.structured_outputs format
-        if job.get("guided_json") is not None:
-            raise ValueError(
-                "The 'guided_json' parameter is deprecated in vLLM 0.11.0+. "
-                "Please use 'structured_outputs' instead. "
-                "See: https://docs.vllm.ai/en/v0.11.0/features/structured_outputs.html"
-            )
-
-        # Extract extra_body (for new structured_outputs API) from sampling_params
-        extra_body = samp_param.pop("extra_body", None)
-        if extra_body and "structured_outputs" in extra_body:
-            structured_outputs = extra_body["structured_outputs"]
-
-            # Create StructuredOutputsParams instance
-            if "json" in structured_outputs:
-                samp_param["structured_outputs"] = StructuredOutputsParams(
-                    json=structured_outputs["json"]
+        samp_param = job.get("sampling_params", {}) or {}
+
+        # Convert structured_outputs dict to StructuredOutputsParams if present
+        if "structured_outputs" in samp_param and StructuredOutputsParams is not None:
+            so_value = samp_param["structured_outputs"]
+            if isinstance(so_value, dict):
+                samp_param["structured_outputs"] = StructuredOutputsParams(**so_value)
+            elif not isinstance(so_value, StructuredOutputsParams):
+                raise TypeError(
+                    "structured_outputs must be a dict or StructuredOutputsParams instance"
                 )
-            elif "regex" in structured_outputs:
-                samp_param["structured_outputs"] = StructuredOutputsParams(
-                    regex=structured_outputs["regex"]
-                )
-            elif "choice" in structured_outputs:
-                samp_param["structured_outputs"] = StructuredOutputsParams(
-                    choice=structured_outputs["choice"]
-                )
-            elif "grammar" in structured_outputs:
-                samp_param["structured_outputs"] = StructuredOutputsParams(
-                    grammar=structured_outputs["grammar"]
-                )
-            elif "structural_tag" in structured_outputs:
-                samp_param["structured_outputs"] = StructuredOutputsParams(
-                    structural_tag=structured_outputs["structural_tag"]
-                )
-
-        # Store for potential use in OpenAI-compatible API
-        self.extra_body = extra_body
 
         if "max_tokens" not in samp_param:
             samp_param["max_tokens"] = 100
```
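A standalone sketch (not the worker's actual handler) of the pass-through behavior the new code implements: the `structured_outputs` dict from `sampling_params` expands directly into `StructuredOutputsParams` and is handed to `SamplingParams` unchanged:

```python
# Standalone sketch of the new conversion path; mirrors the diff above
# but is not the worker's actual handler code.
from vllm import SamplingParams
from vllm.sampling_params import StructuredOutputsParams

samp_param = {
    "max_tokens": 64,
    "structured_outputs": {"regex": "[A-Z][a-z]+"},  # any supported key works
}

so_value = samp_param.get("structured_outputs")
if isinstance(so_value, dict):
    # Keys map 1:1 onto StructuredOutputsParams fields (json, regex, choice, ...)
    samp_param["structured_outputs"] = StructuredOutputsParams(**so_value)

params = SamplingParams(**samp_param)
print(params.structured_outputs)
```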
