Interesting insights that CodeClash reveals. 2025-11-26

I started learning more about AI with SWE-bench, and that led me to an interesting project: CodeClash.

After reading the CodeClash paper thoroughly (https://arxiv.org/pdf/2511.00839.pdf), I noticed some fascinating points.

* Edit more, think more, step more: none of these matters.
	* Both minimalists (o3) and high-activity editors (Claude 4.5 Sonnet) succeed, with overall scores of 1343 and 1389. Meanwhile the "AMAZING LONG THINKING" one, gemini-2.5-pro, only gets 1125.

<img width="717" height="296" alt="Image" src="https://github.com/user-attachments/assets/155c66e2-45a6-40cd-b2e3-3b94c674509e" />
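
To get a feel for how big these gaps are, here is a minimal sketch. It assumes the overall scores are Elo-style ratings on the conventional 400-point logistic scale, which is my assumption and should be checked against the paper's rating method. The function name `expected_win_rate` is made up for illustration.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic formula."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Scores as quoted above, in the order the models are mentioned.
# Treating them as Elo ratings is an assumption, not something stated in this post.
scores = {"o3": 1343, "Claude 4.5 Sonnet": 1389, "gemini-2.5-pro": 1125}

pairs = [
    ("Claude 4.5 Sonnet", "o3"),
    ("Claude 4.5 Sonnet", "gemini-2.5-pro"),
    ("o3", "gemini-2.5-pro"),
]

for a, b in pairs:
    # Print the expected win rate of model a against model b.
    print(f"{a} vs {b}: {expected_win_rate(scores[a], scores[b]):.2f}")
```

If the assumption holds, the o3 and Claude 4.5 Sonnet matchup is only mildly tilted toward Claude, while both would be expected to beat gemini-2.5-pro in a clear majority of matchups.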

* Roll good, then good. Roll shit, then shit.
	* If the initial strategy fails, the LM can barely win its way back. Also, LMs tend to edit less strategy-related code after several rounds.
	* From a subjective view, LMs lack the courage to make ground-breaking changes. Is a fresh start better than struggling in the mud?
	* LMs are creative, but only in empty space. Once code exists, they are more likely to be conservative. (I assume this is caused by model training. When I used Claude 4 Sonnet, it always wanted to change my code; you can see in Figure 32 that Sonnet 4's core edit lines stay steady at around 100 to 75. But now that I use Claude 4.5 Sonnet and Opus, they always think the existing code is 100% reliable. :( )
	* So starting a new chat and giving it an abstract task description in an empty codebase is a better choice than asking the LM to review its plan again and again in a single chat.


<img width="364" height="475" alt="Image" src="https://github.com/user-attachments/assets/3e747561-7e6d-43ea-af89-da4183b935e5" />

<img width="361" height="467" alt="Image" src="https://github.com/user-attachments/assets/1ee9879e-7196-4215-99f8-6a80ac3ece08" />

* LMs do read, LMs do think, but LMs don't "read and think".
	* LMs can get the log file, but they either don't, or only pretend to analyze it.

* LMs just fire; they don't care if the earth explodes after running their code.

