Work in progress
This repo extends what I started in rust-bench. I am interested in comparing how well different LLMs write code that compiles, and how that relates to the code working correctly.
I have used some problems written by hand, plus the problems from Advent of Code 2024, hoping that they were not in the training data. For the AoC problems, I haven't uploaded my inputs and outputs.
Looking at the correlations between how often solutions compile and how often they are correct in different languages, it seems plausible that writing code that compiles requires a slightly different skill from solving the problem with code. For example, writing Haskell code that compiles correlates more strongly with writing OCaml code that compiles (0.929) than with solving the problem in Haskell (0.891).
| | compiles cpp | compiles haskell | compiles ocaml | compiles python | compiles rust | compiles go | avg compiles | correct cpp | correct haskell | correct ocaml | correct python | correct rust | correct go | avg correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| compiles cpp | 1.000 | 0.825 | 0.860 | -0.345 | 0.889 | -0.440 | 0.923 | 0.774 | 0.732 | 0.770 | 0.661 | 0.818 | 0.838 | 0.810 |
| compiles haskell | 0.825 | 1.000 | 0.929 | -0.226 | 0.884 | -0.526 | 0.952 | 0.780 | 0.891 | 0.844 | 0.620 | 0.919 | 0.893 | 0.866 |
| compiles ocaml | 0.860 | 0.929 | 1.000 | -0.361 | 0.886 | -0.592 | 0.953 | 0.763 | 0.826 | 0.820 | 0.642 | 0.928 | 0.859 | 0.845 |
| compiles python | -0.345 | -0.226 | -0.361 | 1.000 | -0.417 | 0.523 | -0.241 | -0.245 | -0.086 | -0.180 | -0.291 | -0.291 | -0.163 | -0.231 |
| compiles rust | 0.889 | 0.884 | 0.886 | -0.417 | 1.000 | -0.702 | 0.950 | 0.596 | 0.659 | 0.646 | 0.489 | 0.830 | 0.743 | 0.707 |
| compiles go | -0.440 | -0.526 | -0.592 | 0.523 | -0.702 | 1.000 | -0.545 | -0.327 | -0.343 | -0.270 | -0.199 | -0.588 | -0.336 | -0.369 |
| avg compiles | 0.923 | 0.952 | 0.953 | -0.241 | 0.950 | -0.545 | 1.000 | 0.753 | 0.806 | 0.816 | 0.609 | 0.912 | 0.874 | 0.841 |
| correct cpp | 0.774 | 0.780 | 0.763 | -0.245 | 0.596 | -0.327 | 0.753 | 1.000 | 0.900 | 0.922 | 0.907 | 0.841 | 0.934 | 0.978 |
| correct haskell | 0.732 | 0.891 | 0.826 | -0.086 | 0.659 | -0.343 | 0.806 | 0.900 | 1.000 | 0.961 | 0.713 | 0.863 | 0.945 | 0.936 |
| correct ocaml | 0.770 | 0.844 | 0.820 | -0.180 | 0.646 | -0.270 | 0.816 | 0.922 | 0.961 | 1.000 | 0.813 | 0.867 | 0.953 | 0.968 |
| correct python | 0.661 | 0.620 | 0.642 | -0.291 | 0.489 | -0.199 | 0.609 | 0.907 | 0.713 | 0.813 | 1.000 | 0.669 | 0.841 | 0.891 |
| correct rust | 0.818 | 0.919 | 0.928 | -0.291 | 0.830 | -0.588 | 0.912 | 0.841 | 0.863 | 0.867 | 0.669 | 1.000 | 0.868 | 0.896 |
| correct go | 0.838 | 0.893 | 0.859 | -0.163 | 0.743 | -0.336 | 0.874 | 0.934 | 0.945 | 0.953 | 0.841 | 0.868 | 1.000 | 0.980 |
| avg correct | 0.810 | 0.866 | 0.845 | -0.231 | 0.707 | -0.369 | 0.841 | 0.978 | 0.936 | 0.968 | 0.891 | 0.896 | 0.980 | 1.000 |
Python and Go are a bit different, as most attempts in those languages compile, so their compile rates say little. It makes more sense to focus only on Rust, Haskell, C++ and OCaml.
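For reference, a table like the one above could be computed roughly as follows. This is only a minimal sketch, not necessarily the code used in this repo: it assumes Pearson correlation, and the `pearson` and `correlation_matrix` helpers and the numbers in `rates` are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when one column is constant,
        # e.g. if every attempt in a language compiles.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations.

    columns maps a column name to a list of rates, one entry per row
    (a row could be a model or a problem; that choice is an assumption).
    """
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up rates for illustration only, not results from this repo.
rates = {
    "compiles haskell": [0.2, 0.5, 0.7, 0.9],
    "compiles ocaml": [0.3, 0.5, 0.8, 0.9],
    "correct haskell": [0.1, 0.2, 0.6, 0.5],
}

matrix = correlation_matrix(rates)
for name, row in matrix.items():
    print(name, " ".join(f"{value:.3f}" for value in row.values()))
```

To restrict the comparison to a subset of languages, pass only those columns to `correlation_matrix`. The `nan` branch also hints at why languages where nearly everything compiles are awkward here: a column with almost no variation gives a correlation that is mostly noise.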
Some graphs comparing the performance of different models on Advent of Code 2024






