Streaming Multi-Provider Chat with Settlement Modes

Most LLM APIs lock you into a single provider. If you want to switch from OpenAI to Anthropic or Google, you rewrite your integration, change authentication, and update response parsing. OpenGradient gives you a single unified API that wraps every major provider -- swap one enum value and everything else stays the same.

But the real differentiator is settlement. Every inference call settles on-chain via the x402 payment protocol, producing a cryptographic receipt you can use for compliance, billing audits, or dispute resolution. You choose how much data goes on-chain: just a hash (privacy), a batch digest (cost savings), or full metadata (complete transparency).

This tutorial walks through the llm.chat() API, covering non-streaming and streaming responses, multi-provider switching, settlement modes, and function calling -- all in one place.

Prerequisites

pip install opengradient

Export your OpenGradient private key:

export OG_PRIVATE_KEY="0x..."

Faucet: Get free OPG tokens on Base Sepolia at https://faucet.opengradient.ai/

All x402 LLM payments currently settle on Base Sepolia using OPG tokens. If you see x402 payment errors, make sure your wallet has sufficient OPG on Base Sepolia.

Step 1: Basic Non-Streaming Chat

Start with the simplest possible call -- send a message and get a response. Before making any LLM calls, ensure sufficient OPG token allowance for the x402 payment protocol using ensure_opg_approval. This only sends a transaction when the current allowance drops below the threshold.

import asyncio
import os
import sys
import opengradient as og

private_key = os.environ.get("OG_PRIVATE_KEY")
if not private_key:
    print("Error: set the OG_PRIVATE_KEY environment variable.")
    sys.exit(1)

llm = og.LLM(private_key=private_key)

# Ensure sufficient OPG allowance for x402 payments (only sends tx when below threshold).
llm.ensure_opg_approval(min_allowance=5)

async def main():
    result = await llm.chat(
        model=og.TEE_LLM.GPT_5,
        messages=[{"role": "user", "content": "What is the x402 payment protocol?"}],
        max_tokens=200,
        temperature=0.0,
    )

    # result is a TextGenerationOutput dataclass
    print(result.chat_output.get("content", ""))  # The model's text response
    print(result.finish_reason)            # "stop", "length", or "tool_calls"
    print(result.payment_hash)             # On-chain x402 receipt

asyncio.run(main())

The chat_output dictionary follows the OpenAI message format: it has role, content, and optionally tool_calls keys. The payment_hash is your on-chain settlement proof -- every call gets one.
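
If you only need a few of these fields downstream, a small helper keeps call sites tidy. This is an illustrative sketch over the documented fields, not part of the SDK:

def summarize_result(result):
    """Pull the commonly used pieces out of a TextGenerationOutput."""
    message = result.chat_output  # OpenAI-style dict: role, content, optional tool_calls
    return {
        "content": message.get("content", ""),
        "finish_reason": result.finish_reason,   # "stop", "length", or "tool_calls"
        "payment_hash": result.payment_hash,     # on-chain x402 receipt -- keep for audits
    }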

Step 2: Switch Providers with One Line

The model parameter accepts any og.TEE_LLM enum value. Swap the model and everything else -- message format, authentication, response parsing -- stays identical.

# All LLM methods are async -- use await in an async function

# OpenAI
result_openai = await llm.chat(
    model=og.TEE_LLM.GPT_5,
    messages=[{"role": "user", "content": "Hello from OpenAI!"}],
    max_tokens=100,
)

# Anthropic
result_anthropic = await llm.chat(
    model=og.TEE_LLM.CLAUDE_SONNET_4_6,
    messages=[{"role": "user", "content": "Hello from Anthropic!"}],
    max_tokens=100,
)

# Google
result_google = await llm.chat(
    model=og.TEE_LLM.GEMINI_2_5_FLASH,
    messages=[{"role": "user", "content": "Hello from Google!"}],
    max_tokens=100,
)

# xAI
result_xai = await llm.chat(
    model=og.TEE_LLM.GROK_4,
    messages=[{"role": "user", "content": "Hello from xAI!"}],
    max_tokens=100,
)

This makes A/B testing trivial -- run the same prompt across providers and compare quality, latency, and cost without changing any infrastructure.
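
If you want latency numbers alongside the outputs, wrap each call in a timer. A minimal sketch, assuming you are inside an async function with llm initialized as in Step 1:

import time

prompt = "Summarize the x402 payment protocol in one sentence."
for name, model in [("GPT-5", og.TEE_LLM.GPT_5), ("Claude Sonnet 4.6", og.TEE_LLM.CLAUDE_SONNET_4_6)]:
    start = time.perf_counter()
    result = await llm.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
    )
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.2f}s -- {result.chat_output.get('content', '')[:80]}")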

Step 3: Enable Streaming

For chat UIs and real-time applications, pass stream=True to get tokens as they are generated. The return value changes from a TextGenerationOutput to an async generator that yields StreamChunk objects.

stream = await llm.chat(
    model=og.TEE_LLM.GPT_5,
    messages=[
        {"role": "system", "content": "You are a concise technical writer."},
        {"role": "user", "content": "Explain TEEs in one paragraph."},
    ],
    max_tokens=300,
    temperature=0.0,
    stream=True,
)

async for chunk in stream:
    # Each chunk has a choices list. The first choice's delta
    # contains the incremental content for this token.
    delta = chunk.choices[0].delta

    if delta.content:
        print(delta.content, end="", flush=True)

    # The final chunk has a finish_reason and optional usage stats.
    if chunk.is_final:
        print(f"\n\nModel: {chunk.model}")
        if chunk.usage:
            print(f"Tokens used: {chunk.usage.total_tokens}")

The StreamChunk dataclass has these fields:

Field      Type                  Description
choices    List[StreamChoice]    Incremental choices (usually one)
model      str                   Model identifier
usage      StreamUsage or None   Token counts (final chunk only)
is_final   bool                  True when the stream is ending

Each StreamChoice contains a StreamDelta with optional content, role, and tool_calls fields.
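
If you also need the complete text once streaming finishes (for logging or the next conversation turn), accumulate the deltas as they arrive. A minimal sketch, assuming a fresh stream from the same llm.chat(..., stream=True) call shown above:

parts = []
async for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        parts.append(delta.content)
        print(delta.content, end="", flush=True)

full_text = "".join(parts)  # the assembled response, ready to store or reuse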

Step 4: Settlement Modes

Every LLM call settles on-chain. The x402_settlement_mode parameter controls the privacy/cost/transparency trade-off:

Mode              On-Chain Data                          Use Case
PRIVATE           Input/output hashes only               Privacy -- prove execution without revealing content
BATCH_HASHED      Batch digest of multiple calls         Cost efficiency -- lower gas per inference (default)
INDIVIDUAL_FULL   Full model, input, output, metadata    Transparency -- complete audit trail

# Privacy-first: only hashes stored on-chain
result_private = await llm.chat(
    model=og.TEE_LLM.CLAUDE_SONNET_4_6,
    messages=[{"role": "user", "content": "Sensitive query here."}],
    max_tokens=100,
    x402_settlement_mode=og.x402SettlementMode.PRIVATE,
)
print(f"Payment hash (SETTLE): {result_private.payment_hash}")

# Cost-efficient: batched settlement (this is the default)
result_batch = await llm.chat(
    model=og.TEE_LLM.CLAUDE_SONNET_4_6,
    messages=[{"role": "user", "content": "Regular query."}],
    max_tokens=100,
    x402_settlement_mode=og.x402SettlementMode.BATCH_HASHED,
)
print(f"Payment hash (BATCH_HASHED): {result_batch.payment_hash}")

# Full transparency: everything on-chain
result_transparent = await llm.chat(
    model=og.TEE_LLM.CLAUDE_SONNET_4_6,
    messages=[{"role": "user", "content": "Auditable query."}],
    max_tokens=100,
    x402_settlement_mode=og.x402SettlementMode.INDIVIDUAL_FULL,
)
print(f"Payment hash (INDIVIDUAL_FULL): {result_transparent.payment_hash}")

All three calls return a payment_hash you can look up on-chain. The difference is how much detail the on-chain record contains. Store these hashes if you need an audit trail -- they are the on-chain receipts for each inference call.
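
One simple way to keep that audit trail is to append each receipt to a local JSONL file. The file name and record shape below are illustrative, not an SDK convention:

import json
import time

def record_receipt(payment_hash, mode_name, path="x402_receipts.jsonl"):
    """Append one settlement receipt to a local JSONL audit log."""
    entry = {"timestamp": time.time(), "mode": mode_name, "payment_hash": payment_hash}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_receipt(result_private.payment_hash, "PRIVATE")
record_receipt(result_batch.payment_hash, "BATCH_HASHED")
record_receipt(result_transparent.payment_hash, "INDIVIDUAL_FULL")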

Step 5: Function Calling

You can pass tools to llm.chat() in the standard OpenAI function-calling format. This works with any model that supports tool use.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_token_price",
            "description": "Get the current USD price of a cryptocurrency.",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {
                        "type": "string",
                        "description": "Token ticker symbol, e.g. ETH, BTC.",
                    }
                },
                "required": ["symbol"],
            },
        },
    }
]

result = await llm.chat(
    model=og.TEE_LLM.GEMINI_2_5_FLASH,
    messages=[{"role": "user", "content": "What's the current price of ETH?"}],
    max_tokens=200,
    tools=tools,
    tool_choice="auto",
)

if result.chat_output.get("tool_calls"):
    # The model decided to call a tool instead of responding with text.
    # We check for tool_calls in the message rather than relying on finish_reason,
    # since the exact finish_reason string may vary by provider.
    for tc in result.chat_output["tool_calls"]:
        func = tc["function"]
        print(f"Tool: {func['name']}, Args: {func['arguments']}")
else:
    print(result.chat_output.get("content", ""))

When the model returns tool calls, execute the requested functions locally, then send the results back in a follow-up llm.chat() call with a "tool" role message. See Tutorial 3 for a complete multi-turn tool-calling loop.
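
The follow-up call looks roughly like the sketch below. The get_token_price stub and the tool-result message layout follow the standard OpenAI function-calling flow and are illustrative -- Tutorial 3 shows the canonical loop:

import json

def get_token_price(symbol):
    # Stand-in implementation; a real version would call a price API.
    return {"symbol": symbol, "usd": 0.0}

tool_messages = []
for tc in result.chat_output["tool_calls"]:
    args = json.loads(tc["function"]["arguments"])
    output = get_token_price(**args)
    tool_messages.append({
        "role": "tool",
        "tool_call_id": tc.get("id"),
        "content": json.dumps(output),
    })

followup = await llm.chat(
    model=og.TEE_LLM.GEMINI_2_5_FLASH,
    messages=[
        {"role": "user", "content": "What's the current price of ETH?"},
        result.chat_output,  # the assistant message containing tool_calls
        *tool_messages,      # tool results, one "tool" role message per call
    ],
    max_tokens=200,
    tools=tools,
)
print(followup.chat_output.get("content", ""))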

Complete Code

"""Streaming Multi-Provider Chat -- complete working example."""

import asyncio
import os
import sys
import opengradient as og

# ── Initialize ────────────────────────────────────────────────────────────
private_key = os.environ.get("OG_PRIVATE_KEY")
if not private_key:
    print("Error: set the OG_PRIVATE_KEY environment variable.")
    sys.exit(1)

llm = og.LLM(private_key=private_key)

# Ensure sufficient OPG allowance for x402 payments (only sends tx when below threshold).
llm.ensure_opg_approval(min_allowance=5)

PROMPT = "Explain what a Trusted Execution Environment is in two sentences."


async def main():
    # ── Multi-provider comparison ─────────────────────────────────────────
    models = [
        ("GPT-5",             og.TEE_LLM.GPT_5),
        ("Claude Sonnet 4.6", og.TEE_LLM.CLAUDE_SONNET_4_6),
        ("Gemini 2.5 Flash",  og.TEE_LLM.GEMINI_2_5_FLASH),
        ("Grok 4",            og.TEE_LLM.GROK_4),
    ]

    for name, model in models:
        try:
            result = await llm.chat(
                model=model,
                messages=[{"role": "user", "content": PROMPT}],
                max_tokens=200,
                temperature=0.0,
            )
            print(f"[{name}] {result.chat_output.get('content', '')}")
            print(f"  Payment hash: {result.payment_hash}\n")
        except Exception as e:
            print(f"[{name}] Error: {e}\n")

    # ── Streaming ─────────────────────────────────────────────────────────
    print("--- Streaming from GPT-5 ---")
    stream = await llm.chat(
        model=og.TEE_LLM.GPT_5,
        messages=[{"role": "user", "content": "What is x402? Keep it under 50 words."}],
        max_tokens=100,
        stream=True,
    )

    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print("\n")

    # ── Settlement modes ──────────────────────────────────────────────────
    for mode_name, mode in [
        ("PRIVATE",          og.x402SettlementMode.PRIVATE),
        ("BATCH_HASHED",    og.x402SettlementMode.BATCH_HASHED),
        ("INDIVIDUAL_FULL", og.x402SettlementMode.INDIVIDUAL_FULL),
    ]:
        try:
            r = await llm.chat(
                model=og.TEE_LLM.CLAUDE_SONNET_4_6,
                messages=[{"role": "user", "content": "Say hello."}],
                max_tokens=50,
                x402_settlement_mode=mode,
            )
            print(f"[{mode_name}] payment_hash={r.payment_hash}")
        except Exception as e:
            print(f"[{mode_name}] Error: {e}")

    # ── Function calling ──────────────────────────────────────────────────
    tools = [{
        "type": "function",
        "function": {
            "name": "get_token_price",
            "description": "Get the current USD price of a cryptocurrency.",
            "parameters": {
                "type": "object",
                "properties": {
                    "symbol": {"type": "string", "description": "Token ticker, e.g. ETH."}
                },
                "required": ["symbol"],
            },
        },
    }]

    result = await llm.chat(
        model=og.TEE_LLM.GEMINI_2_5_FLASH,
        messages=[{"role": "user", "content": "What is the price of ETH?"}],
        max_tokens=200,
        tools=tools,
        tool_choice="auto",
    )

    if result.chat_output.get("tool_calls"):
        for tc in result.chat_output["tool_calls"]:
            func = tc["function"]
            print(f"Tool call: {func['name']}({func['arguments']})")
    else:
        print(result.chat_output.get("content", ""))


asyncio.run(main())

Next Steps

  • Build a chat UI: Use the streaming API with a web framework to build a real-time chat interface backed by verifiable inference.
  • Add tool calling: See Tutorial 3 for a full multi-turn agent loop with tool dispatch and result feeding.
  • Build an agent: See Tutorial 1 to combine LangChain with on-chain model tools for a fully verifiable AI agent.