Skip to content

Conversation

@AutumnsGrove
Copy link
Owner

@AutumnsGrove AutumnsGrove commented Jan 15, 2026

Implements the Sentinel stress testing system for validating Grove’s infrastructure scalability (10 → 10,000 → 10,000,000 users). Based on the documented Sentinel Pattern with realistic traffic composition and three-phase testing.
Key additions:
∙ Database schema for test runs, metrics, checkpoints, baselines, and schedules (migration 021)
∙ Load profile generators following the Sentinel Pattern’s traffic composition (35% reading, 25% browsing, 10% auth, etc.)
∙ Three-phase testing model: Ramp-up → Peak → Steady-state
∙ Operation generators for D1, KV, and R2 that mimic real Grove usage
∙ API endpoints for triggering and monitoring tests
∙ Scheduler integration for automated weekly/daily stress tests
∙ Clearing service integration stubs (API calls, not direct DB)
Test plan
∙ Run migration 021 against test D1 database
∙ Verify Sentinel API endpoints return correct responses
∙ Run a smoke test profile (createSmokeTestProfile()) to validate operation generators
∙ Confirm no interference with existing Clearing service (separate DB)
∙ Test scheduled run creation and cron expression parsing

Add comprehensive infrastructure validation system for Grove:

- Database migration (008_sentinel.sql):
  - sentinel_runs: Test run configuration and results
  - sentinel_metrics: Per-operation timing data
  - sentinel_checkpoints: Periodic snapshots during tests
  - sentinel_baselines: Performance baselines for comparison
  - sentinel_schedules: Cron-triggered test schedules
  - clearing_status: Public status page data
  - clearing_incidents: Incident tracking

- Core library (src/lib/sentinel/):
  - types.ts: Type definitions for all Sentinel entities
  - profiles.ts: Load profile generators following Sentinel Pattern
    - Traffic composition (35% reading, 25% browsing, etc.)
    - Three-phase model (Ramp-up, Peak, Steady-state)
    - Preset profiles (spike, sustained, oscillation, ramp)
  - operations.ts: Operation generators for D1, KV, R2
  - runner.ts: Test execution engine with metrics collection
  - scheduler.ts: Cron trigger handling and maintenance mode

- API endpoints (src/routes/api/sentinel/):
  - GET/POST /api/sentinel - List and create test runs
  - GET/POST/DELETE /api/sentinel/[id] - Manage individual runs

- The Clearing status page (src/routes/clearing/):
  - Public-facing system status display
  - Recent sentinel test results
  - Incident tracking and display
  - Component status overview

Based on the Sentinel Pattern specification for realistic
traffic simulation and three-phase testing methodology.
- Rename migration from 008 to 021 (following existing sequence)
- Remove duplicate /clearing route (exists as separate service)
- Remove clearing_status/clearing_incidents tables from migration
  (those tables belong in Clearing service's own database)
- Update scheduler to use API calls to Clearing instead of direct DB
@claude
Copy link

claude bot commented Jan 15, 2026

PR Review: Sentinel Stress Testing System

Great work on implementing a comprehensive infrastructure validation system! The implementation is well-structured and follows many best practices. Here's my detailed review:

✅ Strengths

Architecture & Design

  • Well-organized module structure - Clear separation between types, profiles, operations, runner, and scheduler
  • Comprehensive type system - Excellent use of TypeScript for type safety across all components
  • Sentinel Pattern implementation - Good adherence to the three-phase testing model and realistic traffic composition
  • Flexible profile system - Support for multiple test patterns (spike, sustained, oscillation, ramp, custom)

Database Schema

  • Solid migration design - Well-indexed tables with appropriate foreign keys
  • Good data normalization - Separation of runs, metrics, checkpoints, baselines, and schedules
  • Audit trail - Proper tracking with timestamps and status fields

Code Quality

  • Excellent documentation - Comprehensive JSDoc comments throughout
  • Error handling patterns - Try-catch blocks with fallbacks
  • Modular design - Clean separation of concerns

⚠️ Issues & Concerns

1. CRITICAL: Database Query Isolation Violation

Location: packages/engine/src/lib/sentinel/operations.ts:88-127

You're violating the critical database isolation pattern from AGENT.md:274-296. Multiple independent queries share try-catch blocks:

// ❌ BAD - Current code
try {
  const result = await db.prepare(/* INSERT */).run();
  return { success: result.success, ... };
} catch (error) {
  if (error instanceof Error && error.message.includes('no such table')) {
    await db.prepare(/* CREATE TABLE */).run();
    // Retry the insert
    const result = await db.prepare(/* INSERT */).run();
    return { success: result.success, ... };
  }
  throw error;
}

This pattern is acceptable here since the CREATE depends on the INSERT failing, but throughout the file, you should ensure each independent query has isolated error handling.

2. Security: Missing Input Validation

Location: packages/engine/src/routes/api/sentinel/+server.ts:108-116

The API endpoint accepts user input but lacks validation:

// Missing validation for:
- targetOperations (could be negative or excessively large)
- durationSeconds (could cause infinite loops or timeout issues)
- concurrency (could overwhelm the system)
- spikeMultiplier (unbounded)

Recommendation:

// Add validation
if (body.targetOperations && (body.targetOperations < 1 || body.targetOperations > 1_000_000)) {
  throw error(400, 'targetOperations must be between 1 and 1,000,000');
}

if (body.durationSeconds && (body.durationSeconds < 1 || body.durationSeconds > 3600)) {
  throw error(400, 'durationSeconds must be between 1 and 3600');
}

if (body.concurrency && (body.concurrency < 1 || body.concurrency > 1000)) {
  throw error(400, 'concurrency must be between 1 and 1000');
}

3. Performance: Cloudflare Workers CPU Time Limits

Location: packages/engine/src/lib/sentinel/runner.ts:84-180

Cloudflare Workers have a 10-30 second CPU time limit. The SentinelRunner.execute() method could easily exceed this for longer tests.

Issues:

  • Tests with durationSeconds > 30 will timeout
  • Even with waitUntil(), CPU time limits apply
  • The runner doesn't account for these constraints

Recommendation:

  • Add a maximum duration check (e.g., 30 seconds)
  • For longer tests, use Durable Objects or cron triggers
  • Document the limitations clearly
if (profile.durationSeconds > 30 && !this.config.ctx) {
  throw new Error('Tests longer than 30 seconds require execution context (waitUntil)');
}

4. Bug: Typo in Function Name

Location: packages/engine/src/lib/sentinel/profiles.ts:78 and index.ts:78

export function estimateCloudflareCosat(profile: LoadProfile) {
  // Should be "Cost" not "Cosat"

This is exported in the public API and will break type safety for consumers.

5. Resource Leak: Missing Cleanup

Location: packages/engine/src/lib/sentinel/runner.ts:162-166

Cleanup only runs on success, not on failure or cancellation:

} catch (error) {
  const err = error instanceof Error ? error : new Error(String(error));
  await this.updateRunStatus(run.id, 'failed', { errorMessage: err.message });
  this.config.onError?.(err);
  throw error;
  // ❌ Cleanup is NOT called here!
} finally {
  this.isRunning = false;
}

Recommendation:

} finally {
  this.isRunning = false;
  
  // Always cleanup test data
  if (this.config.ctx) {
    this.config.ctx.waitUntil(cleanupSentinelData(db, kv, r2, tenantId));
  } else {
    await cleanupSentinelData(db, kv, r2, tenantId).catch(() => {});
  }
}

6. Potential Race Condition

Location: packages/engine/src/routes/api/sentinel/+server.ts:195-206

When startImmediately is true, the runner executes in the background but the response immediately returns with status: 'running'. However, the actual execution might not have started yet, and could fail immediately.

Recommendation:

  • Either wait for the runner to actually start before returning
  • Or return status: 'pending' and let clients poll for updates
  • Add error handling to update the run status if the background execution fails

7. Missing Test Coverage Mention

No tests are included in this PR. For a stress testing system, you should have:

  • Unit tests for profile generators
  • Integration tests for operation execution
  • Mock tests for the runner
  • API endpoint tests

8. Cost Estimation Incomplete

Location: packages/engine/src/lib/sentinel/index.ts:78

The estimateCloudflareCosat function is exported but I don't see its implementation in the files provided. This is critical for users to understand the financial impact of tests.

💡 Suggestions

1. Add Rate Limiting

The API endpoints should have rate limiting to prevent abuse:

// Check if there's already a running test for this tenant
const runningTests = await listSentinelRuns(db, tenantId, { status: 'running', limit: 1 });
if (runningTests.length > 0) {
  throw error(429, 'A test is already running for this tenant');
}

2. Add Metrics Persistence

Location: packages/engine/src/lib/sentinel/runner.ts:250-276

Metrics are collected in memory but never persisted to the sentinel_metrics table. This data is valuable for analysis but will be lost if the Worker times out.

Recommendation: Batch insert metrics periodically during execution.

3. Consider Checksum for Config Snapshot

Location: packages/engine/migrations/021_sentinel.sql:52

The config_snapshot field stores the full JSON configuration. Consider adding a checksum or hash field to quickly detect if configurations are identical without parsing JSON.

4. Add Monitoring/Observability

Consider integrating with Cloudflare AI Gateway (mentioned in AGENT.md:102) or add structured logging for:

  • Operation failures
  • Performance degradation
  • Resource usage

5. Missing Export in package.json

Ensure the engine's package.json exports the sentinel module:

{
  "exports": {
    "./sentinel": {
      "types": "./src/lib/sentinel/index.ts",
      "import": "./src/lib/sentinel/index.js"
    }
  }
}

📝 Minor Issues

  1. Inconsistent error handling - Some functions swallow errors silently, others throw
  2. Magic numbers - Several hardcoded values (50, 100, 500) should be constants
  3. Missing Grove voice - The error messages and descriptions don't match the warm Grove tone from CLAUDE.md
  4. Date handling - Mix of Date objects and Unix timestamps could be simplified

🔒 Security Assessment

📊 Overall Assessment

This is a solid foundation for a stress testing system, but needs refinement before merging:

Must Fix Before Merge:

  1. Fix the typo: estimateCloudflareCosatestimateCloudflareCoast or estimateCloudflareCost
  2. Add input validation to API endpoints
  3. Add Cloudflare Workers time limit checks
  4. Fix cleanup to run in finally block
  5. Implement or remove the cost estimation function

Should Fix Before Merge:

  1. Add rate limiting
  2. Persist metrics to database
  3. Add basic tests
  4. Document limitations clearly

Nice to Have:

  1. Grove voice in user-facing messages
  2. Monitoring/observability integration
  3. Performance optimizations

Recommendation: Request changes to address the critical issues above, then this will be ready to merge.

Great work overall! The architecture is sound and this will be a valuable addition to Grove's infrastructure. 🌲


Review generated at packages/engine/src/lib/sentinel/ - focusing on code quality, security, and Cloudflare Workers compatibility

- Add SentinelDO Durable Object following Loom pattern for long-running tests
  (solves Worker CPU timeout issues for tests >30 seconds)
- Fix typo: estimateCloudflareCosat → estimateCloudflareCost
- Add comprehensive input validation to API endpoints
- Move cleanup to finally block for reliable resource cleanup
- Export SentinelDO from index module
@claude
Copy link

claude bot commented Jan 16, 2026

Sentinel Stress Testing System - Code Review

Summary

This PR implements a comprehensive stress testing infrastructure for Grove (the Sentinel system) with realistic traffic simulation, three-phase testing methodology, and Durable Object integration. Overall, this is high-quality, well-architected code that follows Grove's established patterns. The implementation is thorough and production-ready with only minor suggestions for improvement.

✅ Strengths

Architecture & Design

  • Excellent use of Loom pattern: The SentinelDO Durable Object correctly implements batched D1 writes, hibernation-aware WebSockets, and alarm-based scheduling
  • Well-separated concerns: Clean separation between profiles, operations, runner, scheduler, and API layers
  • Realistic traffic modeling: TRAFFIC_COMPOSITION weights (35% reading, 25% browsing, etc.) align with actual user behavior
  • Comprehensive database schema: All necessary tables with proper indexes, foreign keys, and constraints

Code Quality

  • Type safety: Extensive TypeScript types with proper interfaces throughout
  • Error handling: Graceful fallbacks for missing tables, isolated try/catch blocks (following AGENT.md guidance)
  • Cleanup discipline: Proper cleanup in finally blocks and test data removal after runs
  • Documentation: Clear inline comments and JSDoc explanations

Security & Validation

  • Input validation (+server.ts:119-157): Comprehensive bounds checking prevents abuse
    • Operations: 1 to 1,000,000
    • Duration: 1 to 3,600 seconds
    • Concurrency: 1 to 500
    • Spike multiplier: 1 to 100
  • Authentication checks: Proper locals.user validation in API endpoints
  • SQL injection protection: Parameterized queries throughout

🔍 Issues & Suggestions

1. Migration Dependency Missing (Minor)

Location: 021_sentinel.sql:10
Issue: The migration references sentinel_test_data table but doesn't create it upfront. While the operations code creates it on-demand (line 113-120 in operations.ts), this creates an implicit dependency.

Suggestion: Add the table to the migration for completeness:

-- After line 233 in migration
CREATE TABLE IF NOT EXISTS sentinel_test_data (
  id TEXT PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  data TEXT,
  created_at INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_sentinel_test_data_tenant ON sentinel_test_data(tenant_id);

2. Potential Memory Issue in Runner (Medium)

Location: runner.ts:68-70
Issue: The metrics array grows unbounded during long test runs. A 1-hour test with 10 ops/sec = 36,000 metric objects in memory.

Suggestion: Implement periodic flushing similar to the DO pattern:

// After line 140 in runner.ts, add:
if (this.metrics.length >= 1000) {
  await this.flushMetrics();
}

3. WebSocket Message Handler Empty (Low)

Location: durable-object.ts:224-228
Issue: The webSocketMessage handler parses JSON but doesn't act on it. This could throw if invalid JSON is sent.

Suggestion: Add error handling:

async webSocketMessage(ws: WebSocket, message: string | ArrayBuffer): Promise<void> {
  try {
    const data = JSON.parse(message as string);
    // Handle subscription preferences, etc.
    this.log('WebSocket message', data);
  } catch (error) {
    this.log('Invalid WebSocket message', { error: String(error) });
  }
}

4. Race Condition in Alarm Scheduling (Medium)

Location: durable-object.ts:114-116
Issue: If executeBatch() takes longer than 1 second, alarms could stack up or skip intervals.

Suggestion: Calculate next alarm based on when it should occur:

// Replace line 116:
const nextInterval = Math.floor((Date.now() - this.runState.startedAt) / 1000) + 1;
await this.state.storage.setAlarm(this.runState.startedAt + (nextInterval * 1000));

5. Missing Validation for Custom Load Curves (Low)

Location: +server.ts:208-214
Issue: Custom profiles don't validate customConfig.loadCurve structure, which could cause runtime errors.

Suggestion: Add validation or document that custom profiles require direct API construction.

6. Cleanup Function Doesn't Delete Test Table (Low)

Location: operations.ts:746-822
Issue: cleanupSentinelData deletes rows but not the sentinel_test_data table itself, which accumulates over time.

Suggestion: Add table drop after successful cleanup:

// After line 764:
if (d1Deleted > 0) {
  await db.prepare('DROP TABLE IF EXISTS sentinel_test_data').run();
}

7. Typo Fixed in Latest Commit

The typo estimateCloudflareCosatestimateCloudflareCost was already fixed in commit dac137b. Great catch!

🧪 Test Plan Validation

The PR description mentions:

Run migration 021 against test D1 database

Recommendation: Add a test script or document the manual testing procedure:

# Suggested test commands
wrangler d1 execute DB --file=packages/engine/migrations/021_sentinel.sql
curl -X POST http://localhost:8787/api/sentinel -d '{"name":"Test","profileType":"smoke","startImmediately":true}'

📊 Performance Considerations

Positive

  • Batched D1 operations (100 metrics per flush) reduces write overhead
  • Weighted system selection (selectWeightedSystem) mimics realistic traffic
  • Checkpoint interval (30s) balances observability with overhead

Watch For

  • Tests with concurrency: 500 and durationSeconds: 3600 could approach Worker CPU limits
  • R2 operations in batch could hit rate limits at high concurrency
  • The DO alarm pattern is correct, but verify it handles hibernation correctly

🔒 Security Review

Passed:

  • ✅ No secrets in code
  • ✅ Authentication required for all endpoints
  • ✅ Input validation prevents resource exhaustion
  • ✅ SQL injection protection via parameterized queries
  • ✅ Test data properly namespaced (sentinel_test_ prefix)

Minor concern: The SENTINEL_PREFIX prevents pollution but test data from failed runs could accumulate if cleanup fails. Consider a background job to purge old test data.

📝 Documentation

Strengths:

  • Excellent inline documentation
  • Clear type definitions
  • Good commit messages following Conventional Commits

Suggestion: Add a usage guide in docs/sentinel-usage.md showing:

  • How to trigger tests via API
  • How to interpret results
  • Cost estimation formulas
  • Best practices for production use

🎯 Recommendation

APPROVE with minor fixes

This is excellent work. The suggested fixes are all minor and non-blocking. The code is ready to merge with the understanding that:

  1. The memory issue in runner.ts should be addressed before running very long tests
  2. The migration should include the test table definition
  3. Consider adding the usage documentation

The implementation correctly follows:

  • ✅ Loom Durable Object pattern
  • ✅ Grove's engine-first architecture
  • ✅ Cloudflare Workers best practices
  • ✅ Three-phase testing methodology from the Sentinel Pattern spec

Great job on this comprehensive feature! 🌲


Reviewed by Claude Sonnet 4.5 via Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants