# Eval-Server

A WebSocket-based evaluation server for LLM agents with multiple language implementations.

## Overview

This directory contains two functionally equivalent implementations of the bo-eval-server:

- **NodeJS** (`nodejs/`) - Full-featured implementation with YAML evaluations, HTTP API, CLI, and judge system
- **Python** (`python/`) - Minimal library focused on core WebSocket functionality and programmatic evaluation creation

Both implementations provide:
- 🔌 **WebSocket Server** - Real-time agent connections
- 🤖 **Bidirectional RPC** - JSON-RPC 2.0 for calling agent methods
- 📚 **Programmatic API** - Create and manage evaluations in code
- ⚡ **Concurrent Support** - Handle multiple agents simultaneously
- 📊 **Structured Logging** - Comprehensive evaluation tracking

## Quick Start

### NodeJS (Full Featured)

The NodeJS implementation includes YAML evaluation loading, an HTTP API wrapper, CLI tools, and LLM-as-a-judge functionality.

```bash
cd nodejs/
npm install
npm start
```

**Key Features:**
- YAML evaluation file loading
- HTTP API wrapper for REST integration
- Interactive CLI for management
- LLM judge system for response evaluation
- Comprehensive documentation and examples

See [`nodejs/README.md`](nodejs/README.md) for detailed usage.

### Python (Lightweight Library)

The Python implementation focuses on core WebSocket functionality with programmatic evaluation creation.

```bash
cd python/
pip install -e .
python examples/basic_server.py
```

**Key Features:**
- Minimal dependencies (websockets, loguru)
- Full async/await support
- Evaluation stack for LIFO queuing
- Type hints throughout
- Clean Pythonic API

See [`python/README.md`](python/README.md) for detailed usage.

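The "evaluation stack for LIFO queuing" mentioned above means the most recently added evaluation is dispatched first. A minimal sketch of the idea in plain Python; the `EvaluationStack` class and its method names here are illustrative, not the library's actual API:

```python
# Illustrative LIFO evaluation stack (hypothetical names, not bo_eval_server's API)
class EvaluationStack:
    def __init__(self):
        self._items = []

    def push(self, evaluation: dict) -> None:
        # Newly added evaluations go on top of the stack
        self._items.append(evaluation)

    def pop(self) -> dict:
        # The most recently pushed evaluation is dispatched first
        return self._items.pop()

stack = EvaluationStack()
stack.push({"id": "eval_001", "tool": "chat"})
stack.push({"id": "eval_002", "tool": "chat"})
print(stack.pop()["id"])  # eval_002 is dispatched before eval_001
```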
## Architecture Comparison

| Feature | NodeJS | Python |
|---------|--------|--------|
| **Core WebSocket Server** | ✅ | ✅ |
| **JSON-RPC 2.0** | ✅ | ✅ |
| **Client Management** | ✅ | ✅ |
| **Programmatic Evaluations** | ✅ | ✅ |
| **Evaluation Stack** | ✅ | ✅ |
| **Structured Logging** | ✅ (Winston) | ✅ (Loguru) |
| **YAML Evaluations** | ✅ | ❌ |
| **HTTP API Wrapper** | ✅ | ❌ |
| **CLI Interface** | ✅ | ❌ |
| **LLM Judge System** | ✅ | ❌ |
| **Type System** | TypeScript | Type hints |

## Choosing an Implementation

**Choose NodeJS if you need:**
- YAML-based evaluation definitions
- HTTP REST API endpoints
- An interactive CLI for management
- LLM-as-a-judge evaluation
- The most comprehensive feature set

**Choose Python if you need:**
- Minimal dependencies
- A purely programmatic approach
- Integration with Python ML pipelines
- Modern async/await patterns
- Lightweight deployment

## Agent Protocol

Both implementations use the same WebSocket protocol:

### 1. Connect to WebSocket
```javascript
// NodeJS
const ws = new WebSocket('ws://localhost:8080');
```

```python
# Python
import websockets
ws = await websockets.connect('ws://localhost:8080')
```

### 2. Send Registration
```json
{
  "type": "register",
  "clientId": "your-client-id",
  "secretKey": "your-secret-key",
  "capabilities": ["chat", "action"]
}
```

### 3. Send Ready Signal
```json
{
  "type": "ready"
}
```

### 4. Handle RPC Calls
Both implementations send JSON-RPC 2.0 requests with the `evaluate` method:

```json
{
  "jsonrpc": "2.0",
  "method": "evaluate",
  "params": {
    "id": "eval_001",
    "name": "Test Evaluation",
    "tool": "chat",
    "input": {"message": "Hello world"}
  },
  "id": "unique-call-id"
}
```

Agents should respond with:
```json
{
  "jsonrpc": "2.0",
  "id": "unique-call-id",
  "result": {
    "status": "completed",
    "output": {"response": "Hello! How can I help you?"}
  }
}
```

## Examples

### NodeJS Example
```javascript
import { EvalServer } from 'bo-eval-server';

const server = new EvalServer({
  authKey: 'secret',
  port: 8080
});

server.onConnect(async client => {
  const result = await client.evaluate({
    id: "test",
    name: "Hello World",
    tool: "chat",
    input: {message: "Hi there!"}
  });
  console.log(result);
});

await server.start();
```

### Python Example
```python
import asyncio
from bo_eval_server import EvalServer

async def main():
    server = EvalServer(
        auth_key='secret',
        port=8080
    )

    @server.on_connect
    async def handle_client(client):
        result = await client.evaluate({
            "id": "test",
            "name": "Hello World",
            "tool": "chat",
            "input": {"message": "Hi there!"}
        })
        print(result)

    await server.start()
    await server.wait_closed()

asyncio.run(main())
```

## Development

Each implementation has its own development setup:

**NodeJS:**
```bash
cd nodejs/
npm install
npm run dev   # Watch mode
npm test      # Run tests
npm run cli   # Interactive CLI
```

**Python:**
```bash
cd python/
pip install -e ".[dev]"
pytest        # Run tests
black .       # Format code
mypy src/     # Type checking
```

## Contributing

When contributing to either implementation:

1. Maintain API compatibility between the two versions where possible
2. Update documentation for both implementations when adding shared features
3. Follow the existing code style and patterns
4. Add appropriate tests and examples

## License

MIT License - see the individual implementation directories for details.

---

Both implementations provide robust, production-ready evaluation servers for LLM agents, with feature sets optimized for different use cases.