Commit c2970e4

Feat/evals ready (#37)
## Pull Request Overview

This PR implements the "Evals Ready" feature by refactoring evaluation management to support per-tab evaluation agents and improving error handling with retry mechanisms.

Key changes:
- Replaced global evaluation agent pattern with per-tab EvaluationAgent instances in AIChatPanel
- Implemented robust retry logic with exponential backoff for failed evaluations
- Created comprehensive Python evaluation server implementation with browsecomp benchmark support
1 parent 7432840 commit c2970e4
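
The retry behavior described in the overview lives in the DevTools frontend code and is not shown in the diffs below. As a rough sketch of the pattern, assuming hypothetical names throughout (`run_with_retry`, `evaluation` are not the PR's actual identifiers), exponential backoff for a failed evaluation could look like:

```python
import asyncio
import random


async def run_with_retry(evaluation, max_attempts=3, base_delay=1.0):
    """Run an async evaluation, retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await evaluation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Back off 1s, 2s, 4s, ... plus jitter, so concurrent per-tab
            # agents do not all retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```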

File tree: 214 files changed (+9791 / -1198 lines)


eval-server/.gitignore (2 additions, 1 deletion)

```diff
@@ -1,2 +1,3 @@
 .env
-node_modules
+node_modules
+*.log
```

eval-server/README.md (219 additions, 47 deletions)

````diff
@@ -1,67 +1,239 @@
-# bo-eval-server
+# Eval-Server
 
-A WebSocket-based evaluation server for LLM agents using LLM-as-a-judge methodology.
+A WebSocket-based evaluation server for LLM agents with multiple language implementations.
 
-## Quick Start
+## Overview
+
+This directory contains two functionally equivalent implementations of the bo-eval-server:
+
+- **NodeJS** (`nodejs/`) - Full-featured implementation with YAML evaluations, HTTP API, CLI, and judge system
+- **Python** (`python/`) - Minimal library focused on core WebSocket functionality and programmatic evaluation creation
 
-1. **Install dependencies**
-   ```bash
-   npm install
-   ```
+Both implementations provide:
+- 🔌 **WebSocket Server** - Real-time agent connections
+- 🤖 **Bidirectional RPC** - JSON-RPC 2.0 for calling agent methods
+- 📚 **Programmatic API** - Create and manage evaluations in code
+- ⚡ **Concurrent Support** - Handle multiple agents simultaneously
+- 📊 **Structured Logging** - Comprehensive evaluation tracking
+
+## Quick Start
 
-2. **Configure environment**
-   ```bash
-   cp .env.example .env
-   # Edit .env and add your OPENAI_API_KEY
-   ```
+### NodeJS (Full Featured)
 
-3. **Start the server**
-   ```bash
-   npm start
-   ```
+The NodeJS implementation includes YAML evaluation loading, HTTP API wrapper, CLI tools, and LLM-as-a-judge functionality.
 
-4. **Use interactive CLI** (alternative to step 3)
-   ```bash
-   npm run cli
-   ```
+```bash
+cd nodejs/
+npm install
+npm start
+```
 
-## Features
+**Key Features:**
+- YAML evaluation file loading
+- HTTP API wrapper for REST integration
+- Interactive CLI for management
+- LLM judge system for response evaluation
+- Comprehensive documentation and examples
 
-- 🔌 WebSocket server for real-time agent connections
-- 🤖 Bidirectional RPC calls to connected agents
-- ⚖️ LLM-as-a-judge evaluation using OpenAI GPT-4
-- 📊 Structured JSON logging of all evaluations
-- 🖥️ Interactive CLI for testing and management
-- ⚡ Support for concurrent agent evaluations
+See [`nodejs/README.md`](nodejs/README.md) for detailed usage.
 
-## OpenAI Compatible API
+### Python (Lightweight Library)
 
-The server provides an OpenAI-compatible `/v1/responses` endpoint for direct API access:
+The Python implementation focuses on core WebSocket functionality with programmatic evaluation creation.
 
 ```bash
-curl -X POST 'http://localhost:8081/v1/responses' \
-  -H 'Content-Type: application/json' \
-  -d '{
-    "input": "What is 2+2?",
-    "main_model": "gpt-4.1",
-    "mini_model": "gpt-4.1-nano",
-    "nano_model": "gpt-4.1-nano",
-    "provider": "openai"
-  }'
+cd python/
+pip install -e .
+python examples/basic_server.py
 ```
 
-**Model Precedence:**
-1. **API calls** OR **individual test YAML models** (highest priority)
-2. **config.yaml defaults** (fallback when neither API nor test specify models)
+**Key Features:**
+- Minimal dependencies (websockets, loguru)
+- Full async/await support
+- Evaluation stack for LIFO queuing
+- Type hints throughout
+- Clean Pythonic API
+
+See [`python/README.md`](python/README.md) for detailed usage.
+
+## Architecture Comparison
+
+| Feature | NodeJS | Python |
+|---------|--------|--------|
+| **Core WebSocket Server** | ✅ | ✅ |
+| **JSON-RPC 2.0** | ✅ | ✅ |
+| **Client Management** | ✅ | ✅ |
+| **Programmatic Evaluations** | ✅ | ✅ |
+| **Evaluation Stack** | ✅ | ✅ |
+| **Structured Logging** | ✅ (Winston) | ✅ (Loguru) |
+| **YAML Evaluations** | ✅ | ❌ |
+| **HTTP API Wrapper** | ✅ | ❌ |
+| **CLI Interface** | ✅ | ❌ |
+| **LLM Judge System** | ✅ | ❌ |
+| **Type System** | TypeScript | Type Hints |
+
+## Choosing an Implementation
+
+**Choose NodeJS if you need:**
+- YAML-based evaluation definitions
+- HTTP REST API endpoints
+- Interactive CLI for management
+- LLM-as-a-judge evaluation
+- Comprehensive feature set
+
+**Choose Python if you need:**
+- Minimal dependencies
+- Pure programmatic approach
+- Integration with Python ML pipelines
+- Modern async/await patterns
+- Lightweight deployment
 
 ## Agent Protocol
 
-Your agent needs to:
+Both implementations use the same WebSocket protocol:
+
+### 1. Connect to WebSocket
+```javascript
+// NodeJS
+const ws = new WebSocket('ws://localhost:8080');
+
+// Python
+import websockets
+ws = await websockets.connect('ws://localhost:8080')
+```
+
+### 2. Send Registration
+```json
+{
+  "type": "register",
+  "clientId": "your-client-id",
+  "secretKey": "your-secret-key",
+  "capabilities": ["chat", "action"]
+}
+```
+
+### 3. Send Ready Signal
+```json
+{
+  "type": "ready"
+}
+```
+
+### 4. Handle RPC Calls
+Both implementations send JSON-RPC 2.0 requests with the `evaluate` method:
+
+```json
+{
+  "jsonrpc": "2.0",
+  "method": "evaluate",
+  "params": {
+    "id": "eval_001",
+    "name": "Test Evaluation",
+    "tool": "chat",
+    "input": {"message": "Hello world"}
+  },
+  "id": "unique-call-id"
+}
+```
+
+Agents should respond with:
+```json
+{
+  "jsonrpc": "2.0",
+  "id": "unique-call-id",
+  "result": {
+    "status": "completed",
+    "output": {"response": "Hello! How can I help you?"}
+  }
+}
+```
+
+## Examples
+
+### NodeJS Example
+```javascript
+import { EvalServer } from 'bo-eval-server';
+
+const server = new EvalServer({
+  authKey: 'secret',
+  port: 8080
+});
+
+server.onConnect(async client => {
+  const result = await client.evaluate({
+    id: "test",
+    name: "Hello World",
+    tool: "chat",
+    input: {message: "Hi there!"}
+  });
+  console.log(result);
+});
+
+await server.start();
+```
+
+### Python Example
+```python
+import asyncio
+from bo_eval_server import EvalServer
+
+async def main():
+    server = EvalServer(
+        auth_key='secret',
+        port=8080
+    )
+
+    @server.on_connect
+    async def handle_client(client):
+        result = await client.evaluate({
+            "id": "test",
+            "name": "Hello World",
+            "tool": "chat",
+            "input": {"message": "Hi there!"}
+        })
+        print(result)
+
+    await server.start()
+    await server.wait_closed()
+
+asyncio.run(main())
+```
+
+## Development
+
+Each implementation has its own development setup:
+
+**NodeJS:**
+```bash
+cd nodejs/
+npm install
+npm run dev   # Watch mode
+npm test      # Run tests
+npm run cli   # Interactive CLI
+```
+
+**Python:**
+```bash
+cd python/
+pip install -e ".[dev]"
+pytest        # Run tests
+black .       # Format code
+mypy src/     # Type checking
+```
+
+## Contributing
+
+When contributing to either implementation:
+
+1. Maintain API compatibility between versions where possible
+2. Update documentation for both implementations when adding shared features
+3. Follow the existing code style and patterns
+4. Add appropriate tests and examples
+
+## License
 
-1. Connect to the WebSocket server (default: `ws://localhost:8080`)
-2. Send a `{"type": "ready"}` message when ready for evaluations
-3. Implement the `Evaluate` RPC method that accepts a string task and returns a string response
+MIT License - see individual implementation directories for details.
 
-## For more details
+---
 
-See [CLAUDE.md](./CLAUDE.md) for comprehensive documentation of the architecture and implementation.
+Both implementations provide robust, production-ready evaluation servers for LLM agents with different feature sets optimized for different use cases.
````
File renamed without changes.
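
Putting the four protocol steps from the new README together, a minimal Python agent could be sketched as follows. The message shapes are taken from the diff above; the client id, secret key, and canned response are illustrative assumptions, not part of the commit:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets


async def run_agent():
    async with websockets.connect("ws://localhost:8080") as ws:
        # Steps 2-3: register with the server, then signal readiness.
        await ws.send(json.dumps({
            "type": "register",
            "clientId": "example-client",  # hypothetical id
            "secretKey": "secret",         # must match the server's auth key
            "capabilities": ["chat"],
        }))
        await ws.send(json.dumps({"type": "ready"}))

        # Step 4: answer JSON-RPC 2.0 `evaluate` calls as they arrive.
        async for raw in ws:
            message = json.loads(raw)
            if message.get("method") != "evaluate":
                continue  # ignore acks and other server notifications
            response = {
                "jsonrpc": "2.0",
                "id": message["id"],
                "result": {
                    "status": "completed",
                    "output": {"response": "Hello! How can I help you?"},
                },
            }
            await ws.send(json.dumps(response))


asyncio.run(run_agent())
```

A production agent would additionally report failures through the standard JSON-RPC 2.0 `error` member rather than a `result`.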
