Merged
4 changes: 4 additions & 0 deletions .gitignore
@@ -14,3 +14,7 @@ venv/
dist/
build/
*.egg-info/

# local debate / scratch
.debate/
outputs/*.zip
22 changes: 22 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,27 @@
# changelog

## 3.2.1 - 2026-04-28

### fixed
- `pp --version` reads `__version__` from the package instead of dist metadata, so it reports correctly from a source checkout
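
a minimal sketch of why the two lookups differ, assuming the package exposes `__version__` in its `__init__.py`. the helper names and the stand-in module are hypothetical, not the project's actual code. `importlib.metadata.version` only works when the distribution is installed, which is not the case for a bare source checkout:

```python
import types
from importlib import metadata
from typing import Optional


def version_from_dist(dist_name: str) -> Optional[str]:
    """Ask the installed distribution metadata; None when nothing is installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Typical for a source checkout that was never pip-installed.
        return None


def version_from_package(pkg: types.ModuleType) -> str:
    """Read the attribute the package itself defines; works from any checkout."""
    return getattr(pkg, "__version__", "unknown")


# Stand-in for the real package (hypothetical), so the sketch runs on its own.
fake_pkg = types.ModuleType("fake_pkg")
fake_pkg.__version__ = "3.2.1"

print(version_from_dist("surely-not-installed-dist-xyz"))  # no dist metadata
print(version_from_package(fake_pkg))                      # attribute lookup
```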

## 3.2.0 - 2026-04-28

packaging + launcher release. distributed on PyPI as `promptpressure-evals` (the `promptpressure` name is held by an unrelated red-team scanner). import name and CLI entry points unchanged.

### added
- `pp` browser launcher: starts the API on the first free port in 8000-8019, opens a browser, three dropdowns (provider / model / eval set) + Run
- launcher API surface: `/providers` (with availability detection + per-provider `remediation_hint`), `/models` (ollama dropdown + free-text fallback), `/eval-sets`, RunBus per-run event channel with reconnect + TTL reaper, XOR `launcher_request` schema on `/evaluate`
- frontend: SSE status streaming, form lock during run + Cancel button, race-safe provider switch (AbortController), a11y (`role=log`, label association), `fetchJSON` timeout + signal composition
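
one way the "first free port in 8000-8019" probe could look. this is a sketch under assumptions, not the launcher's actual implementation: `first_free_port` is a hypothetical name, and binding to `127.0.0.1` is assumed because the API is described as localhost-only.

```python
import socket
from typing import Optional


def first_free_port(start: int = 8000, end: int = 8019,
                    host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in [start, end] that can be bound, else None."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is in use (or not permitted); try the next one.
                continue
            return port
    return None


if __name__ == "__main__":
    print(first_free_port())
```

the probe socket is closed before the caller starts its server, so another process can grab the port in between. a real launcher would either retry on bind failure or hand the bound socket to the server directly.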

### changed
- vendored a 4.2KB hand-rolled `frontend/tailwind.css`, dropped the 398KB Tailwind Play CDN JIT bundle (offline, no runtime JS for styling)

### fixed
- launcher defaults tier to `full` so untagged datasets actually run (was exiting `0/N sequences selected`)
- API publishes an error frame on `SystemExit` and JSON-encodes SSE data payloads
- SSE handler distinguishes transport vs server errors; form reveal gated on successful model load
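
a sketch of the two API fixes above, with assumptions: the event names (`error`, `done`), the payload shape, and the helper names are hypothetical. SSE frames are newline-delimited, so a raw payload containing a newline would break framing; `json.dumps` escapes it. catching `SystemExit` lets the stream report the failure instead of going silent.

```python
import json
from typing import Callable, List


def sse_frame(event: str, payload: dict) -> str:
    """Build one SSE frame; JSON encoding keeps the data on a single line."""
    data = json.dumps(payload)  # newlines inside strings become \n escapes
    return f"event: {event}\ndata: {data}\n\n"


def run_and_report(run: Callable[[], None]) -> List[str]:
    """Run a job and return the frames a subscriber would receive."""
    frames = []
    try:
        run()
        frames.append(sse_frame("done", {"ok": True}))
    except SystemExit as exc:
        # SystemExit is not an Exception subclass, so it must be caught by name.
        frames.append(sse_frame("error", {
            "type": type(exc).__name__,
            "message": str(exc),
        }))
    return frames
```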

## 3.1.0 - 2026-03-29

multi-turn behavioral drift infrastructure. this is the foundation for converting promptpressure from a single-turn eval tool to a multi-turn drift detection CLI.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-3.1.0
+3.2.1
71 changes: 44 additions & 27 deletions roadmap.md
@@ -1,44 +1,61 @@
# roadmap

**direction:** the goal is research credibility - make PromptPressure a citable behavioral-eval *method*, not a wide feature surface. it stays a local CLI. this roadmap reflects that.

## shipped

### v3.0 (current)
- 220 eval prompts across 10 behavioral categories
- multi-turn sycophancy sequences (5-turn graduated pressure)
- multilingual consistency testing (EN/ES/ZH/AR/FR)
- 8 adapters: openrouter, groq, openai, ollama, lm studio, mock, claude code, opencode
- deepseek r1 adapter with reasoning token capture
- async eval runner with tqdm progress
- grading pipeline with prompt injection defense
- html/markdown report generation
- prometheus metrics on port 9090
- CI mode with JSON output and exit codes
- JWT auth on API endpoints, CORS locked to localhost
### v3.2.x (current, on PyPI as `promptpressure-evals`)
- `pp` browser launcher: provider/model/eval-set dropdowns, SSE status stream, Cancel button, offline vendored tailwind
- launcher API: `/providers` (availability + remediation hints), `/models`, `/eval-sets`, RunBus per-run channel
- published to PyPI as `promptpressure-evals` (3.2.0), `pp --version` fix (3.2.1)

### v3.1.0
- 4-tier run system (`smoke`/`quick`/`full`/`deep`) with cumulative filtering
- per-turn `response_length_ratio` metric, per-turn timeout scaling (5x cap)
- context-window token estimation + warning
- `schema.json` for the dataset entry format; tier/subcategory/difficulty/per_turn_expectations fields
- 30 refusal-sensitivity entries moved to `archive/adversarial/`
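
a sketch of what "cumulative filtering" could mean, under stated assumptions: selecting a tier also includes every lower tier, and entries with no `tier` field are treated as `full`. the second assumption is inferred from the 3.2.0 changelog note about untagged datasets and is not confirmed by the source. `select_entries` and the entry shape are hypothetical.

```python
from typing import Dict, List

TIERS = ["smoke", "quick", "full", "deep"]  # ordered from smallest to largest run


def select_entries(entries: List[Dict], tier: str,
                   default_tier: str = "full") -> List[Dict]:
    """Keep entries whose tier is at or below the requested tier."""
    cutoff = TIERS.index(tier)
    return [
        e for e in entries
        if TIERS.index(e.get("tier", default_tier)) <= cutoff
    ]


entries = [
    {"id": "a", "tier": "smoke"},
    {"id": "b", "tier": "quick"},
    {"id": "c", "tier": "deep"},
    {"id": "d"},  # untagged, assumed to count as "full"
]

print([e["id"] for e in select_entries(entries, "quick")])
print([e["id"] for e in select_entries(entries, "full")])
```

under these assumptions an all-untagged dataset selects nothing at `quick`, which would match the `0/N sequences selected` symptom fixed in 3.2.0.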

### v3.0.0
- 190 active eval prompts across 11 behavioral categories
- 8 adapters (openrouter, groq, openai, ollama, lm studio, mock, claude code, opencode)
- async eval runner, grading pipeline with prompt-injection defense, html/md reports
- prometheus metrics (port 9090), `--ci` JSON mode, JWT auth, CORS locked to localhost

### v2.x (legacy, pre-overhaul)
- async refactor, database integration, rate limiting
- plugin system, fastapi server, sse streaming
- report templates, config validation
- async refactor, db integration, plugin system, fastapi server, sse streaming

## reality check (the gap to close)

the headline pitch is multi-turn behavioral drift ("does it cave at turn 8, drift at turn 20"). right now the corpus is mostly single-turn: the smoke and deep tiers have **zero** generated sequences, and only ~3 quick sequences exist. the grading LLM (model-as-judge) has **no published evidence it's calibrated**. credibility is blocked on both. the next cycle closes them together.

## next

### v3.1
- automated grading with per-turn sycophancy scoring
- reasoning token analysis in reports (R1 alignment leakage detection)
- comparison report generator (multi-model behavioral diff)
### v3.3 - drift-corpus + judge-calibration v0.1 (THE credibility release)
the two halves depend on each other: you can't calibrate a drift-judge without drift sequences, and you can't defend drift scores without a calibrated judge. so ship a thin vertical slice of both, and **calibrate on the multi-turn sequences themselves, never on the single-turn corpus.**

### v3.2
- ollama adapter for local model benchmarking without API costs
- dataset expansion: 300+ prompts
- additional multilingual coverage (Japanese, Korean, Hindi)
- **9 multi-turn sequences**, 3 categories (sycophancy-under-pressure, persona stability, refusal/boundary relaxation), 3 per category. all 8-turn; 3 extended to 20-turn.
- each: stable id, pressure pattern, target behavior, expected drift signature, per-turn rubric.
- **gold set**: human labels on every assistant turn (refusal integrity / boundary relaxation / sycophancy / contradiction / escalation).
- **calibration on exactly that set**: judge-vs-human + judge-vs-judge agreement (Cohen's kappa), 3-run test-retest, confidence intervals.
- ships: `corpus/drift-v0.1/`, `pp run --suite drift-v0.1`, `pp calibrate --suite drift-v0.1`, `reports/drift-v0.1-method.md`.
- the claim narrows honestly: "a citable pilot method for multi-turn behavioral drift, with judge reliability reported on the same corpus."
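
for reference, Cohen's kappa for two raters is `(p_o - p_e) / (1 - p_e)`, where `p_o` is observed agreement and `p_e` is the agreement expected by chance from each rater's label frequencies. a minimal sketch follows; the `drift` / `hold` labels and the sample data are made up for illustration and are not project results.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-rater label frequencies, summed.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Illustrative per-turn labels (hypothetical): human gold vs. judge model.
human = ["drift", "hold", "hold", "drift", "hold", "hold", "drift", "hold"]
judge = ["drift", "hold", "drift", "drift", "hold", "hold", "hold", "hold"]
print(round(cohens_kappa(human, judge), 3))  # 6/8 raw agreement -> 0.467
```

here raw agreement is 0.75 but kappa is about 0.47, which is why the roadmap asks for kappa rather than plain agreement: with imbalanced labels, much of the raw agreement is expected by chance.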

> rationale: this is the defensible moat. promptfoo, Inspect, and lm-eval-harness don't publish judge-reliability stats. "PromptPressure measures itself, here's the kappa" is what makes results citable. decided via a multi-model debate (codex / deepseek / glm), 2026-06-15.

## maybe later

- pypi package publish (`pip install promptpressure`)
- web dashboard for browsing results
- model-as-judge calibration (how consistent is the grading LLM itself)
- expand the drift corpus beyond 9 sequences once the v0.1 calibration method holds up
- reasoning-token alignment-leakage analysis (novel, but premature until the judge is calibrated)
- comparison report generator with calibration confidence bands
- additional multilingual coverage (Japanese, Korean, Hindi)
- agentic eval sequences (multi-step tool use)

## explicitly deprioritized

- **"dataset expansion to 300+ prompts"** - more *uncalibrated* prompts is more noise, not more credibility. quality + calibration over raw count.
- web dashboard - stays a local CLI.

## not happening

the v2.x roadmap had plans for SaaS deployment, multi-tenant architecture, mobile apps, "The Sentient Suite", and a federated evaluation network. none of that is in scope. promptpressure is a local CLI tool for behavioral eval. it runs on your machine.
SaaS deployment, multi-tenant architecture, mobile apps, "The Sentient Suite", federated evaluation network. PromptPressure is a local CLI tool for behavioral eval. it runs on your machine.