I started learning more about AI with SWE-bench, and that led me to this interesting project, CodeClash.
After reading the CodeClash paper thoroughly (https://arxiv.org/pdf/2511.00839.pdf), I noticed some fascinating points.
- Edit more, think more, step more: none of these matters.
- Both minimalists (o3) and high-activity editors (Claude 4.5 Sonnet) succeed, with overall scores of 1343 and 1389, while the "AMAZING LONG THINKING" one, gemini-2.5-pro, only gets 1125.
- Roll good, then good. Roll shit, then shit.
- If the initial strategy fails, the LM can barely win back. Also, LMs tend to edit less strategy-related code after several rounds.
- From a subjective view, LMs lack the courage to make ground-breaking changes. Is a fresh start better than struggling in the mud?
- LMs are creative, but only in empty space; after that they are more likely to be conservative. (I assume this is caused by model training. When I used Claude 4 Sonnet, it always wanted to change my code. You can see in Figure 32 that Sonnet 4's core edit line count stays steady at around 100 to 75. But now that I use Claude 4.5 Sonnet and Opus, they always think the existing code is 100% reliable. :( )
- So starting a new chat and giving the LM an abstract task description in an empty codebase is a better choice than asking it to review its plan again and again in a single chat.
- LM do read, LM do think, LM don't "Read and Think"
- The LM can get the log file but doesn't, or it only pretends to analyze the log.
- LMs just fire; they don't care if the earth explodes after running their code.