Add ComputeEval Dataset Support #1124
base: main
Conversation
Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>
📝 Walkthrough
This PR adds support for NVIDIA's compute-eval dataset and evaluation framework to NeMo Skills. It introduces a new compute-eval evaluator type, corresponding metrics, a dataset preparation script, a generation task for model inference, and registers these components in the appropriate registries.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🧹 Nitpick comments (4)
nemo_skills/dataset/compute-eval/prepare.py (2)
29-34: Consider narrowing the exception type or adding a brief comment. The bare Exception catch works but obscures intent. A comment explaining that any failure during whoami() indicates the user isn't logged in would improve clarity.

```diff
 # noinspection PyBroadException
 try:
     api = HfApi()
     api.whoami()
-    return None
+    return None  # User is logged in; HF will use cached credentials
-except Exception:
+except Exception:  # Any failure means user is not authenticated
     return None
```
48-54: Add validation for the expected dataset split. If the dataset doesn't contain an "eval" split (e.g., due to an invalid release), the script will raise an unclear KeyError. Consider validating that the split exists.

```diff
 dataset = load_dataset("nvidia/compute-eval", args.release, token=_get_hf_token())
+if "eval" not in dataset:
+    raise ValueError(f"Dataset does not contain 'eval' split. Available splits: {list(dataset.keys())}")
 data_dir = Path(__file__).absolute().parent
```

nemo_skills/inference/eval/compute_eval.py (2)
54-67: Consider using pass for no-op methods and prefixing unused parameters. The no-op lifecycle methods use return statements, but pass is more idiomatic for intentionally empty methods. Additionally, the data parameter in log_example_prompt is flagged as unused by static analysis; prefixing it with an underscore would signal it's intentionally unused.

Apply this diff:

```diff
-    def log_example_prompt(self, data):
-        return
+    def log_example_prompt(self, _data):
+        pass

     def setup_prompt(self):
-        return
+        pass

     def setup_llm(self):
-        return
+        pass

     def setup_litellm_cache(self):
-        return
+        pass

     def cleanup_litellm_cache(self):
-        return
+        pass
```
69-69: Prefix unused parameter with underscore. The data parameter is flagged as unused by static analysis. If it's part of the interface but not needed in this implementation, prefix it with an underscore to signal it's intentionally unused.

Apply this diff:

```diff
-    async def process_single_datapoint(self, data_point, data):
+    async def process_single_datapoint(self, data_point, _data):
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- nemo_skills/dataset/compute-eval/__init__.py (1 hunks)
- nemo_skills/dataset/compute-eval/prepare.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/compute_eval.py (1 hunks)
- nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
- nemo_skills/inference/eval/compute_eval.py (1 hunks)
- requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.
Applied to files:
requirements/main.txt
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
- nemo_skills/evaluation/evaluator/compute_eval.py (1)
  - ComputeEvalEvaluator (31-63)

nemo_skills/evaluation/metrics/map_metrics.py (1)
- nemo_skills/evaluation/metrics/code_metrics.py (1)
  - ComputeEvalMetrics (126-135)

nemo_skills/evaluation/evaluator/compute_eval.py (2)
- nemo_skills/evaluation/evaluator/base.py (1)
  - BaseEvaluator (34-91)
- nemo_skills/utils.py (1)
  - get_logger_name (39-43)

nemo_skills/evaluation/metrics/code_metrics.py (7)
- nemo_skills/evaluation/metrics/base.py (5)
  - BaseMetrics (23-434), _get_score_dict (124-143), get_incorrect_sample (200-206), update (145-189), _compute_pass_at_k (352-423)
- nemo_skills/evaluation/metrics/if_metrics.py (3)
  - _get_score_dict (24-28), get_incorrect_sample (30-33), update (35-48)
- nemo_skills/evaluation/metrics/answer_judgement_metrics.py (3)
  - _get_score_dict (35-39), get_incorrect_sample (41-47), update (121-132)
- nemo_skills/evaluation/metrics/lean4_metrics.py (3)
  - _get_score_dict (23-24), get_incorrect_sample (26-29), update (46-48)
- nemo_skills/evaluation/metrics/math_metrics.py (3)
  - _get_score_dict (67-79), get_incorrect_sample (81-88), update (90-120)
- nemo_skills/evaluation/metrics/ruler_metrics.py (3)
  - _get_score_dict (19-20), get_incorrect_sample (26-29), update (22-24)
- nemo_skills/evaluation/metrics/arena_metrics.py (2)
  - get_incorrect_sample (36-40), update (42-84)
🪛 Ruff (0.14.8)
nemo_skills/inference/eval/compute_eval.py
54-54: Unused method argument: data
(ARG002)
69-69: Unused method argument: data
(ARG002)
nemo_skills/dataset/compute-eval/prepare.py
32-32: Consider moving this statement to an else block
(TRY300)
33-33: Do not catch blind exception: Exception
(BLE001)
nemo_skills/evaluation/metrics/code_metrics.py
130-130: Unused method argument: prediction
(ARG002)
🔇 Additional comments (10)
nemo_skills/evaluation/metrics/code_metrics.py (1)
126-135: LGTM - Implementation follows established patterns. The ComputeEvalMetrics class correctly mirrors the structure of HumanEvalInfillingMetrics (lines 114-123). The unused prediction parameter in get_incorrect_sample is required by the base class interface, so the static analysis warning is a false positive.
nemo_skills/evaluation/metrics/map_metrics.py (2)
26-26: LGTM - Import correctly added.
72-72: LGTM - Registry entry properly added. The key "compute-eval" correctly matches METRICS_TYPE defined in nemo_skills/dataset/compute-eval/__init__.py.
nemo_skills/evaluation/evaluator/__init__.py (2)
29-29: LGTM - Import correctly placed.
71-71: LGTM - Evaluator correctly registered in class map only. The ComputeEvalEvaluator is correctly placed in EVALUATOR_CLASS_MAP (not EVALUATOR_MAP) since it implements eval_single for async single-point evaluation.
nemo_skills/evaluation/evaluator/compute_eval.py (1)
48-62: LGTM - Correct use of asyncio.to_thread for blocking evaluation. Running evaluate_solution in a thread pool correctly avoids blocking the event loop during CUDA compilation and test execution (a minimal sketch of this pattern follows this comment list).
requirements/main.txt (1)
17-17: LGTM - Git commit pinning ensures reproducibility. The dependency is properly pinned to a specific commit (991b47c, currently the main branch HEAD) and correctly sorted alphabetically. The short hash is valid and unambiguous for this repository. Using the full 40-character SHA would provide additional robustness against theoretical hash collisions, but the current approach is acceptable for practical purposes.
nemo_skills/dataset/compute-eval/__init__.py (1)
15-19: LGTM - Dataset constants properly defined. Constants are internally consistent: METRICS_TYPE matches the registry key in map_metrics.py (line 72), and EVAL_SPLIT matches the split accessed in prepare.py (lines 52-53). The GENERATION_MODULE path correctly points to the existing nemo_skills.inference.eval.compute_eval module.
nemo_skills/inference/eval/compute_eval.py (2)
1-34: LGTM! The imports and module-level setup are well-structured. The TypeAdapter with a discriminated union for problem validation is a clean approach.
86-102: LGTM! The module exports and main entry point are well-structured. The help message handling and logging setup follow good practices.
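For reference, here is a minimal, self-contained sketch of the asyncio.to_thread pattern mentioned in the compute_eval.py comment above. The function and argument names are illustrative only, not the PR's actual code.

```python
# Minimal sketch of offloading a blocking evaluation call with asyncio.to_thread.
# Names and payloads are illustrative, not taken from the PR.
import asyncio
import time


def evaluate_solution(problem: dict, solution: dict) -> dict:
    # Stand-in for the blocking compile-and-test step that compute-eval performs.
    time.sleep(1)
    return {"task_id": problem["task_id"], "passed": True}


async def eval_single(problem: dict, solution: dict) -> dict:
    # Run the blocking call in a worker thread so the event loop can keep
    # driving other in-flight evaluations.
    return await asyncio.to_thread(evaluate_solution, problem, solution)


async def main() -> None:
    results = await asyncio.gather(
        eval_single({"task_id": "a"}, {}),
        eval_single({"task_id": "b"}, {}),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```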
Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: dlord <dlord@nvidia.com>
gwarmstrong left a comment
Looks pretty good, with a couple high level questions/concerns
```python
# noinspection PyBroadException
try:
    api = HfApi()
```
what is the purpose of this?
This allows a user to access the gated HuggingFace data (you have to be allow-listed to have access to the ComputeEval data today) either via an environment variable (HF_TOKEN) or by logging in with the HuggingFace CLI.
Actually, after looking closer we don't need to access the API at all for this to work. I'll just clean this up.
yeah, we should just need a load_dataset call and it will handle the token if it's already in your env variables, like this one is gated: https://github.com/blahblahasdf/Skills/blob/cb7829094799f9b5967e260d30e39c2535f67bab/nemo_skills/dataset/flores200/prepare.py#L49
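A minimal sketch of that approach, assuming the same dataset id as this PR (the release/config argument used in prepare.py is omitted here, and the explicit HF_TOKEN lookup is only for illustration, since load_dataset falls back to cached CLI credentials on its own):

```python
# Sketch: datasets.load_dataset() handles gated-repo auth by itself.
# With token=None it uses HF_TOKEN or the cached `huggingface-cli login` credentials.
import os

from datasets import load_dataset

dataset = load_dataset(
    "nvidia/compute-eval",              # gated repo; requires allow-listed access
    token=os.environ.get("HF_TOKEN"),   # None -> fall back to cached credentials
)
print(dataset)
```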
```python
generate_model_completions,
system_prompt=self._system_prompt,
problem=problem,
model=self._model,
```
Looks like this is applying this function:
https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/generate_completions.py#L31-L84
Which appears to use the ComputeEval clients?
What I would recommend we do instead is to write a version of that function that uses NeMo-Skills' process_single_datapoint, so this probably looks something like:

```python
result = super().process_single_datapoint(data_point, data)
...
completion = ...
return FileSolution(
    task_id=problem.task_id,
    files=_parse_solution(completion),
    **debug_info,
)
```
this will make sure the prompt logic is consistent with typical NeMo-Skills and that the eval prompt logic is the same as normal generations.
Then the methods above that are currently overridden would just use the default implementations rather than being overridden to return nothing.
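A slightly fuller, hedged version of that sketch follows. Only FileSolution, _parse_solution, and the FileSolution keyword arguments come from this thread; the stubbed base class and the "generation" result key are assumptions for illustration, not the PR's implementation.

```python
# Illustrative only: the real base class is NeMo-Skills' generation task; it is
# stubbed here so the shape of the override is visible in isolation.
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution


class _StubGenerationTask:
    """Stand-in for the NeMo-Skills base task (builds the prompt, calls the LLM)."""

    async def process_single_datapoint(self, data_point, data):
        return {"generation": "// kernel.cu\n__global__ void kernel() {}"}


class ComputeEvalGenerationTask(_StubGenerationTask):
    async def process_single_datapoint(self, data_point, data):
        # Reuse the standard NeMo-Skills prompt construction and LLM call...
        result = await super().process_single_datapoint(data_point, data)
        # ...then hand the raw completion to compute-eval's post-processing,
        # mirroring the keyword arguments shown in the sketch above.
        return FileSolution(
            task_id=data_point["task_id"],
            files=_parse_solution(result["generation"]),
        )
```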
Thanks for the quick feedback! Your suggestion is to call process_single_datapoint and then use the ComputeEval post-processing to convert the model's response to a FileSolution? The "ComputeEval clients" are just a pass-through to the OpenAI client. I think, as is, if you were to point it at a local model (or a model hosted somewhere else) it would just work, but I agree it doesn't fit super cleanly with the NeMo-Skills abstractions.
Honestly, the generation side of the ComputeEval repo is mainly intended to be a reference implementation of how other groups/companies would set up a generator scaffold to solve the ComputeEval problems. The reference implementation's prompt and post-processing logic are pretty tightly coupled.
The logic for calling the openai client is here, right? https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/models/model_interface.py#L100-L154
Seems pretty straightforward and is pretty much a subset of the functionality our clients have. And if we do it the way it's currently being done, you lose a lot of things we support in NeMo-Skills (tool-calling, pretty much all sampling parameters not directly specified in the compute-eval function), and the supported server types would be quite limited.
Thanks for the pointers and happy New Year! I see where you're coming from on fitting this in with the NeMo-Skills generation more cleanly. What is the expectation on how to document this GenerationTask's content format? I was able to get this working locally, but there is definitely some coupling between the expected data format and the system and user prompts required in a ++prompt_config. I would guess that's normal and it just gets documented?
Yeah that's pretty typical. We don't really have it documented on a case-by-case basis--it's pretty easy to infer from the prompt config itself, and we have a generic section on understanding the interaction between data and prompt_config here: https://nvidia-nemo.github.io/Skills/basics/prompt-format/#prompt-api
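As a generic illustration of that data/prompt_config interaction (the field names and template text below are hypothetical, not taken from this PR's prompt config):

```python
# Generic illustration of how a prompt template gets filled from a data point.
# Keys and wording are made up; the real ones come from the prepared eval.jsonl
# and the ++prompt_config passed to the generation job.
user_template = (
    "Context files:\n{context_files_block}\n\n"
    "Task:\n{problem}\n"
)

data_point = {
    "context_files_block": "// common.cuh\n__global__ void kernel(float *x);",
    "problem": "Implement the kernel declared in common.cuh.",
}

print(user_template.format(**data_point))  # the string sent as the user turn
```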
… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach.
Greptile Summary
This PR integrates ComputeEval, a CUDA code generation benchmark, into the NeMo-Skills framework. The implementation adds dataset preparation from HuggingFace, a generation task for producing CUDA solutions, an evaluator that compiles and tests generated code, and metrics computation including pass@k scores. Key changes:
Note: The compute-eval dependency version was updated from
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Prepare as prepare.py
participant HF as HuggingFace
participant Gen as ComputeEvalGenerationTask
participant LLM as Language Model
participant Eval as ComputeEvalEvaluator
participant CE as compute-eval library
participant Metrics as ComputeEvalMetrics
User->>Prepare: Run dataset preparation
Prepare->>HF: load_dataset("nvidia/compute-eval", release)
HF-->>Prepare: Return dataset with problems
Prepare->>Prepare: Format context_files_block
Prepare->>Prepare: Write eval.jsonl with problem data
User->>Gen: Generate solutions
Gen->>Gen: Read eval.jsonl
loop For each problem
Gen->>LLM: Send problem prompt
LLM-->>Gen: Return generated code
Gen->>CE: _parse_solution(generation)
CE-->>Gen: Parsed file solutions
Gen->>Gen: Create FileSolution object
Gen->>Gen: Write solution to output
end
User->>Eval: Evaluate solutions
Eval->>Eval: get_nvcc_version() and parse_semver()
loop For each data_point
Eval->>Eval: Validate problem and solution
Eval->>CE: evaluate_solution(problem, solution, ctk_version)
CE->>CE: Compile and test CUDA code
CE-->>Eval: Return graded result (passed, skipped, etc.)
Eval->>Eval: Return evaluation dict
end
User->>Metrics: Compute metrics
Metrics->>Metrics: Aggregate accuracy scores
Metrics->>Metrics: Compute pass@k metrics
Metrics-->>User: Return final metrics
Additional Comments (1)
- nemo_skills/inference/eval/compute_eval.py, line 21 (link): style: Using protected member _parse_solution. Consider requesting a public API when compute-eval stabilizes.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
8 files reviewed, 1 comment
# Conflicts: # nemo_skills/evaluation/evaluator/__init__.py
8 files reviewed, 1 comment
```python
from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
style: accessing protected member _parse_solution from external library
If _parse_solution becomes unavailable in future versions of compute-eval, this will break. Consider requesting a public API from the compute-eval library.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
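One defensive option, sketched here rather than taken from the PR, is to wrap the private import so a future compute-eval release that moves or removes the helper fails with an actionable message:

```python
# Sketch only: fail loudly if the private helper ever disappears.
try:
    # noinspection PyProtectedMember
    from compute_eval.generate_completions import _parse_solution
except ImportError as exc:
    raise ImportError(
        "compute_eval.generate_completions._parse_solution is not part of compute-eval's "
        "public API and appears to have changed; pin compute-eval or update this import."
    ) from exc
```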
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
Overview
This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.
ComputeEval features:
Changes
- GenerationTask to generate CUDA solutions using ComputeEval's reference generation logic
- BaseEvaluator using ComputeEval's evaluation logic
Dependencies
Adds the compute-eval package (compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@991b47c).
Note: compute-eval is a public repository but has not been published to PyPI yet.
Summary by CodeRabbit
Release Notes