Keep DeepSeek out of public leaderboard until benchmark requests complete reliably

For now, DeepSeek is configured only as an experimental, manually invoked model path. It is not a default benchmark model and is not listed as a pending entry on the public leaderboard.

Reason for exclusion from the public leaderboard:
- Credentials work, and a raw DeepSeek Flash toy JSON call succeeds.
- A real PolicyBench one-household, one-output `deepseek-v4-flash` smoke test completed, but it took about 109 seconds and used 8,207 reasoning tokens for a single requested output.
- `deepseek-v4-pro` did not finish the same one-household, one-output smoke test within 150 seconds.
- Earlier full multi-output PolicyBench attempts timed out or hung.
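
To put the flash measurement in context, here is a rough back-of-envelope extrapolation to a 100-household run. It is only a sketch: it assumes requests run sequentially and that cost scales linearly from the single measured request, and the `extrapolate` helper and its parameters are illustrative names, not part of the repo. A real run requests more than one output per household, so these figures are likely a lower bound.

```python
# Figures reported above for the one-household, one-output
# `deepseek-v4-flash` smoke test.
SECONDS_PER_REQUEST = 109
REASONING_TOKENS_PER_REQUEST = 8_207


def extrapolate(households: int, outputs_per_household: int = 1) -> dict:
    """Linearly scale the single measured request.

    Assumption: requests run one after another and each costs the same
    as the measured one. Neither is confirmed for real runs.
    """
    requests = households * outputs_per_household
    total_seconds = requests * SECONDS_PER_REQUEST
    return {
        "requests": requests,
        "hours": total_seconds / 3600,
        "reasoning_tokens": requests * REASONING_TOKENS_PER_REQUEST,
    }


estimate = extrapolate(households=100)
print(estimate)
```

Even with one output per household, this works out to roughly three hours of wall-clock time and about 820,000 reasoning tokens.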

Current repo policy:
- DeepSeek remains in `EXPERIMENTAL_MODELS` so it can be manually probed if the API behavior improves.
- DeepSeek is excluded from default runs, app model metadata, and public pending-model messaging.
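
The policy above could be enforced with a gate along these lines. This is a hypothetical sketch, not the repo's actual code: only the name `EXPERIMENTAL_MODELS` comes from this issue, while its contents, `DEFAULT_MODELS`, `resolve_models`, and the `allow_experimental` flag are assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Assumed contents; the issue only confirms that DeepSeek is listed here.
EXPERIMENTAL_MODELS = {"deepseek-v4-flash", "deepseek-v4-pro"}
# Hypothetical placeholder for the models used in default runs.
DEFAULT_MODELS = {"example-default-model"}


def resolve_models(
    requested: Optional[Sequence[str]] = None,
    allow_experimental: bool = False,
) -> List[str]:
    """Return the models to run, keeping experimental ones opt-in."""
    if not requested:
        # Default runs never include experimental models.
        return sorted(DEFAULT_MODELS)
    unknown = [
        m for m in requested
        if m not in DEFAULT_MODELS and m not in EXPERIMENTAL_MODELS
    ]
    if unknown:
        raise ValueError(f"unknown models: {unknown}")
    blocked = [m for m in requested if m in EXPERIMENTAL_MODELS]
    if blocked and not allow_experimental:
        raise ValueError(f"experimental models need explicit opt-in: {blocked}")
    return list(requested)
```

With a gate like this, a manual probe would pass `allow_experimental=True` explicitly, and anything that builds public metadata would read only from the default set.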

Acceptance criteria before adding it to the public leaderboard:
- Complete a representative multi-output smoke test without manual repair or process intervention.
- Produce parseable JSON with nonempty explanations.
- Have latency and token usage that make a 100-household run practical.
- Pass the same analysis/rebuild path as other public leaderboard models.
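
The "parseable JSON with nonempty explanations" criterion could be checked mechanically with something like the sketch below. The response shape is an assumption: it supposes a JSON object keyed by output name, with each entry carrying an `explanation` string. The actual PolicyBench schema may differ, and `check_response` is a hypothetical helper.

```python
import json
from typing import List


def check_response(raw: str, expected_outputs: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"unparseable JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for name in expected_outputs:
        entry = payload.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing or not an object")
            continue
        explanation = entry.get("explanation")
        # Whitespace-only explanations count as empty.
        if not isinstance(explanation, str) or not explanation.strip():
            problems.append(f"{name}: empty explanation")
    return problems
```

Running a check like this over every response in a multi-output smoke test would give a yes/no answer for that criterion without manual inspection.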
