Keep DeepSeek out of public leaderboard until benchmark requests complete reliably #4

@MaxGhenis

DeepSeek is configured only as an experimental/manual model path for now, not as a default benchmark model or public pending leaderboard entry.

Reason for exclusion from the public leaderboard:

  • Credentials work, and a raw DeepSeek Flash toy JSON call succeeds.
  • A real PolicyBench one-household, one-output deepseek-v4-flash smoke test completed, but took about 109 seconds and used 8,207 reasoning tokens for the single requested output.
  • deepseek-v4-pro did not finish the same one-household, one-output smoke within 150 seconds.
  • Earlier full multi-output PolicyBench attempts timed out or hung.
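
To put the flash numbers above in context, here is a rough back-of-the-envelope projection. It is only a sketch: it assumes calls run sequentially, that each requested output costs about what the single smoke call cost, and that the one measured call is representative. `project_run` and `outputs_per_household` are illustrative names, not part of PolicyBench.

```python
def project_run(seconds_per_call, reasoning_tokens_per_call,
                households, outputs_per_household=1):
    """Scale one measured call up to a full run (sequential, no retries)."""
    calls = households * outputs_per_household
    total_seconds = seconds_per_call * calls
    return {
        "calls": calls,
        "hours": total_seconds / 3600,
        "reasoning_tokens": reasoning_tokens_per_call * calls,
    }


# Figures from the deepseek-v4-flash smoke: ~109 s and 8,207 reasoning tokens
# for one household and one output.
estimate = project_run(109, 8207, households=100)
print(estimate)
```

Under those assumptions, 100 households at one output each is about 3 hours of wall-clock time and roughly 820,000 reasoning tokens. A real multi-output run multiplies both by the number of outputs per household, which is why the single-output result alone does not clear the bar.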

Current repo policy:

  • DeepSeek remains in EXPERIMENTAL_MODELS so it can be manually probed if the API behavior improves.
  • DeepSeek is excluded from default runs, app model metadata, and public pending-model messaging.
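
As an illustration of this policy, the gating could look something like the sketch below. Only `EXPERIMENTAL_MODELS` and the two DeepSeek model IDs come from this issue; `DEFAULT_MODELS`, its placeholder entry, `resolve_models`, and `public_models` are hypothetical names and may not match the repo's actual code.

```python
# Only EXPERIMENTAL_MODELS and the DeepSeek IDs are taken from the issue.
EXPERIMENTAL_MODELS = frozenset({"deepseek-v4-flash", "deepseek-v4-pro"})

# Hypothetical placeholder for whatever the default benchmark set is.
DEFAULT_MODELS = ("example-default-model",)


def resolve_models(requested=None, allow_experimental=False):
    """Return the models to run; experimental ones need an explicit opt-in."""
    if not requested:
        return list(DEFAULT_MODELS)
    blocked = [m for m in requested
               if m in EXPERIMENTAL_MODELS and not allow_experimental]
    if blocked:
        raise ValueError(
            f"experimental models require an explicit opt-in: {blocked}"
        )
    return list(requested)


def public_models(models):
    """Filter out anything experimental before it reaches public metadata."""
    return [m for m in models if m not in EXPERIMENTAL_MODELS]
```

With this shape, a default run never picks up DeepSeek, a manual probe has to pass `allow_experimental=True`, and anything feeding app metadata or pending-model messaging goes through `public_models` first.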

Acceptance criteria before adding it to the public leaderboard:

  • Complete a representative multi-output smoke without manual repair or process intervention.
  • Produce parseable JSON with nonempty explanations.
  • Have latency and token usage that make a 100-household run practical.
  • Pass the same analysis/rebuild path as other public leaderboard models.
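
The second criterion (parseable JSON with nonempty explanations) is the easiest to check mechanically. The sketch below assumes a response shaped as one JSON object keyed by output name, each entry holding a `value` and an `explanation`. That schema, the function name `check_smoke_response`, and the output names in the usage example are assumptions for illustration, not PolicyBench's actual format.

```python
import json


def check_smoke_response(raw_text, expected_outputs):
    """Return a list of problems; an empty list means the response passes."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"unparseable JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    for name in expected_outputs:
        entry = payload.get(name)
        if not isinstance(entry, dict):
            problems.append(f"{name}: missing or not an object")
            continue
        if "value" not in entry:
            problems.append(f"{name}: missing value")
        explanation = entry.get("explanation")
        # Whitespace-only explanations count as empty.
        if not isinstance(explanation, str) or not explanation.strip():
            problems.append(f"{name}: empty explanation")
    return problems


# Hypothetical output names, used only to exercise the check.
good = '{"output_a": {"value": 120, "explanation": "Computed from income."}}'
bad = '{"output_a": {"value": 120, "explanation": "  "}}'
print(check_smoke_response(good, ["output_a"]))
print(check_smoke_response(bad, ["output_a", "output_b"]))
```

A check like this covers only the JSON criterion. Completing without intervention, practical latency and token usage, and passing the shared analysis/rebuild path still need to be verified on an actual multi-output run.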
