Interesting insights that CodeClash reveals. 2025-11-26 #82

@chenkigba

I set out to learn more about AI, starting with SWE-bench, and that led me to this interesting project, CodeClash.

After reading the CodeClash paper thoroughly (https://arxiv.org/pdf/2511.00839.pdf), I noticed some fascinating points.

  • Edit more, think more, step more: none of these matters.
    • Both minimalists (o3) and high-activity editors (Claude 4.5 Sonnet) succeed, with overall scores of 1343 and 1389 respectively, while the "AMAZING LONG THINKING" one, gemini-2.5-pro, only gets 1125.
  • Roll good, then good. Roll shit, then shit.
    • If the initial strategy fails, the LM can barely win its way back. Also, LMs tend to edit less strategy-related code after several rounds.
    • Subjectively, LMs lack the courage to make a groundbreaking change. Is a fresh start better than struggling in the mud?
    • LMs are creative, but only in empty space; once code exists, they are more likely to be conservative. (I assume this is caused by model training. When I used Claude 4 Sonnet, it always wanted to change my code; you can see in Figure 32 that Sonnet 4's core edit lines stay steady at around 100 to 75. But now that I use Claude 4.5 Sonnet and Opus, they always think the existing code is 100% reliable. :( )
    • So starting a new chat and giving it an abstract task description in an empty codebase is a better choice than asking the LM to review its plan again and again in a single chat.
  • LMs do read, LMs do think, but LMs don't "read and think".

    • LMs can access the log file, but they don't, or they only pretend to analyze the log.
  • LMs just fire; they don't care if the earth explodes after their code runs.
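
To put the score gap from the first point in perspective, here is a minimal sketch. It assumes the overall scores can be read as Elo-style ratings on the usual 400-point logistic scale; that is my assumption, not something stated above, and the paper's exact rating fit may differ. The function name `expected_score` is just for illustration.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    Assumption: the scores behave like Elo ratings with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Overall scores quoted in the first bullet above.
scores = {"Claude 4.5 Sonnet": 1389, "o3": 1343, "gemini-2.5-pro": 1125}

pairs = [
    ("Claude 4.5 Sonnet", "o3"),
    ("Claude 4.5 Sonnet", "gemini-2.5-pro"),
    ("o3", "gemini-2.5-pro"),
]

for a, b in pairs:
    p = expected_score(scores[a], scores[b])
    print(f"{a} vs {b}: expected score {p:.2f}")
```

Under that assumption, Claude 4.5 Sonnet and o3 are fairly close (an expected score of roughly 0.57 for Sonnet), while either of them would be expected to beat gemini-2.5-pro around four times out of five.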
