
Conversation

@blahblahasdf (Collaborator) commented Dec 17, 2025

Overview

This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.

ComputeEval features:

  1. 406 (and growing) CUDA programming challenges
  2. Functional correctness evaluation via compilation and test execution
  3. Multiple release versions (2025-1, 2025-2, 2025-3)

Changes

  1. Dataset Preparation - download ComputeEval problems from HuggingFace (gated dataset)
  2. Generation Task - implement GenerationTask to generate CUDA solutions using ComputeEval's reference generation logic
  3. Evaluator - implement BaseEvaluator using ComputeEval's evaluation logic
  4. Metrics - compute pass@k metrics
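
For reference, pass@k here presumably follows the standard unbiased estimator used for code-generation benchmarks; a minimal sketch (illustrative only, not the NeMo-Skills implementation):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k sampled
        completions is correct, given n generated and c passing."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    # Example: 8 generations per problem, 3 passed -> pass@1 = 3/8
    print(pass_at_k(n=8, c=3, k=1))  # 0.375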

Dependencies

Adds compute-eval package (compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@991b47c)
Note: compute-eval is a public repository but has not been published to PyPI yet

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for compute-eval dataset from HuggingFace with optional release specifications.
    • Introduced compute-eval evaluator for CUDA code assessment with automatic CTK version detection.
    • Added compute-eval metrics for computing evaluation scores including pass-at-k performance.
    • Integrated compute-eval generation task into the inference pipeline.
    • Registered compute-eval as a new evaluation type across the framework.


Signed-off-by: dlord <dlord@nvidia.com>
coderabbitai bot (Contributor) commented Dec 17, 2025

📝 Walkthrough

This PR adds support for NVIDIA's compute-eval dataset and evaluation framework to NeMo Skills. It introduces a new compute-eval evaluator type, corresponding metrics, a dataset preparation script, a generation task for model inference, and registers these components in the appropriate registries.

Changes

Cohort / File(s) Summary
Dataset Package Initialization
nemo_skills/dataset/compute-eval/__init__.py
Introduces package constants: EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE, GENERATION_MODULE, and GENERATION_ARGS for compute-eval configuration.
Dataset Preparation
nemo_skills/dataset/compute-eval/prepare.py
New script to download NVIDIA compute-eval dataset from HuggingFace with optional release specification, token authentication via HF_TOKEN environment variable, and output formatting to eval.jsonl.
Evaluator Implementation
nemo_skills/evaluation/evaluator/compute_eval.py, nemo_skills/evaluation/evaluator/__init__.py
Introduces ComputeEvalEvaluator class with NVCC version detection, pydantic validation for problem/solution, thread-pool execution of evaluation, and registers it in EVALUATOR_MAP and EVALUATOR_CLASS_MAP.
Metrics Implementation
nemo_skills/evaluation/metrics/code_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
Introduces ComputeEvalMetrics class implementing score dict, incorrect sample, and update with pass-at-k computation; registers in METRICS_MAP.
Generation Task
nemo_skills/inference/eval/compute_eval.py
Introduces ComputeEvalGenerationTask for model inference with server configuration validation, problem type validation, and solution generation; provides Hydra entry point run_compute_eval.
Dependencies
requirements/main.txt
Adds git-based dependency for compute-eval package from specified commit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • NVCC version detection and error handling in ComputeEvalEvaluator.__init__: verify compatibility logic and RuntimeError conditions (a hedged sketch of one possible detection approach follows this list)
  • Thread pool execution in ComputeEvalEvaluator.eval_single: confirm proper async/threading pattern and exception handling
  • Type validation with discriminated unions in both evaluator and generation task: verify pydantic TypeAdapter configuration for CudaCppProblem | CudaPythonProblem
  • Configuration validation in ComputeEvalGenerationTask.__init__: ensure model and base_url enforcement is correct
  • Git dependency specification in requirements/main.txt: verify commit hash and repository URL are correct
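
For the NVCC item above, a hedged sketch of one possible detection approach (an assumption for illustration; the PR's actual get_nvcc_version/parse_semver logic may differ):

    import re
    import subprocess

    def get_nvcc_version() -> tuple[int, int]:
        """Return (major, minor) of the installed CUDA Toolkit, or raise RuntimeError."""
        try:
            out = subprocess.run(
                ["nvcc", "--version"], capture_output=True, text=True, check=True
            ).stdout
        except (FileNotFoundError, subprocess.CalledProcessError) as e:
            raise RuntimeError("nvcc not found; a CUDA Toolkit installation is required") from e
        # nvcc prints a line like: "Cuda compilation tools, release 12.4, V12.4.131"
        match = re.search(r"release (\d+)\.(\d+)", out)
        if match is None:
            raise RuntimeError(f"Could not parse nvcc version from output:\n{out}")
        return int(match.group(1)), int(match.group(2))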

Suggested labels

run GPU tests

Suggested reviewers

  • Kipok
  • gwarmstrong

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title 'Add ComputeEval Dataset Support' accurately describes the main objective of the PR, which integrates ComputeEval into NeMo-Skills with dataset preparation, generation tasks, evaluators, and metrics.

coderabbitai bot (Contributor) left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
nemo_skills/dataset/compute-eval/prepare.py (2)

29-34: Consider narrowing the exception type or adding a brief comment.

The bare Exception catch works but obscures intent. A comment explaining that any failure during whoami() indicates the user isn't logged in would improve clarity.

     # noinspection PyBroadException
     try:
         api = HfApi()
         api.whoami()
-        return None
+        return None  # User is logged in; HF will use cached credentials
-    except Exception:
+    except Exception:  # Any failure means user is not authenticated
         return None

48-54: Add validation for the expected dataset split.

If the dataset doesn't contain an "eval" split (e.g., due to an invalid release), the script will raise an unclear KeyError. Consider validating the split exists.

     dataset = load_dataset("nvidia/compute-eval", args.release, token=_get_hf_token())
+    if "eval" not in dataset:
+        raise ValueError(f"Dataset does not contain 'eval' split. Available splits: {list(dataset.keys())}")
     data_dir = Path(__file__).absolute().parent
nemo_skills/inference/eval/compute_eval.py (2)

54-67: Consider using pass for no-op methods and prefixing unused parameters.

The no-op lifecycle methods use return statements, but pass is more idiomatic for intentionally empty methods. Additionally, the data parameter in log_example_prompt is flagged as unused by static analysis - prefixing it with an underscore would signal it's intentionally unused.

Apply this diff:

-    def log_example_prompt(self, data):
-        return
+    def log_example_prompt(self, _data):
+        pass
 
     def setup_prompt(self):
-        return
+        pass
 
     def setup_llm(self):
-        return
+        pass
 
     def setup_litellm_cache(self):
-        return
+        pass
 
     def cleanup_litellm_cache(self):
-        return
+        pass

69-69: Prefix unused parameter with underscore.

The data parameter is flagged as unused by static analysis. If it's part of the interface but not needed in this implementation, prefix it with an underscore to signal it's intentionally unused.

Apply this diff:

-    async def process_single_datapoint(self, data_point, data):
+    async def process_single_datapoint(self, data_point, _data):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e16e1e and 67e8149.

📒 Files selected for processing (8)
  • nemo_skills/dataset/compute-eval/__init__.py (1 hunks)
  • nemo_skills/dataset/compute-eval/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/compute_eval.py (1 hunks)
  • nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/inference/eval/compute_eval.py (1 hunks)
  • requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

  • requirements/main.txt
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/compute_eval.py (1)
  • ComputeEvalEvaluator (31-63)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/code_metrics.py (1)
  • ComputeEvalMetrics (126-135)
nemo_skills/evaluation/evaluator/compute_eval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluator (34-91)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/evaluation/metrics/code_metrics.py (7)
nemo_skills/evaluation/metrics/base.py (5)
  • BaseMetrics (23-434)
  • _get_score_dict (124-143)
  • get_incorrect_sample (200-206)
  • update (145-189)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/if_metrics.py (3)
  • _get_score_dict (24-28)
  • get_incorrect_sample (30-33)
  • update (35-48)
nemo_skills/evaluation/metrics/answer_judgement_metrics.py (3)
  • _get_score_dict (35-39)
  • get_incorrect_sample (41-47)
  • update (121-132)
nemo_skills/evaluation/metrics/lean4_metrics.py (3)
  • _get_score_dict (23-24)
  • get_incorrect_sample (26-29)
  • update (46-48)
nemo_skills/evaluation/metrics/math_metrics.py (3)
  • _get_score_dict (67-79)
  • get_incorrect_sample (81-88)
  • update (90-120)
nemo_skills/evaluation/metrics/ruler_metrics.py (3)
  • _get_score_dict (19-20)
  • get_incorrect_sample (26-29)
  • update (22-24)
nemo_skills/evaluation/metrics/arena_metrics.py (2)
  • get_incorrect_sample (36-40)
  • update (42-84)
🪛 Ruff (0.14.8)
nemo_skills/inference/eval/compute_eval.py

54-54: Unused method argument: data

(ARG002)


69-69: Unused method argument: data

(ARG002)

nemo_skills/dataset/compute-eval/prepare.py

32-32: Consider moving this statement to an else block

(TRY300)


33-33: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/metrics/code_metrics.py

130-130: Unused method argument: prediction

(ARG002)

🔇 Additional comments (10)
nemo_skills/evaluation/metrics/code_metrics.py (1)

126-135: LGTM - Implementation follows established patterns.

The ComputeEvalMetrics class correctly mirrors the structure of HumanEvalInfillingMetrics (lines 114-123). The unused prediction parameter in get_incorrect_sample is required by the base class interface, so the static analysis warning is a false positive.

nemo_skills/evaluation/metrics/map_metrics.py (2)

26-26: LGTM - Import correctly added.


72-72: LGTM - Registry entry properly added.

The key "compute-eval" correctly matches METRICS_TYPE defined in nemo_skills/dataset/compute-eval/__init__.py.

nemo_skills/evaluation/evaluator/__init__.py (2)

29-29: LGTM - Import correctly placed.


71-71: LGTM - Evaluator correctly registered in class map only.

The ComputeEvalEvaluator is correctly placed in EVALUATOR_CLASS_MAP (not EVALUATOR_MAP) since it implements eval_single for async single-point evaluation.

nemo_skills/evaluation/evaluator/compute_eval.py (1)

48-62: LGTM - Correct use of asyncio.to_thread for blocking evaluation.

Running evaluate_solution in a thread pool correctly avoids blocking the event loop during CUDA compilation and test execution.
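
As an illustration of the pattern (a minimal sketch, not the actual ComputeEvalEvaluator code; the evaluate_solution signature is assumed from the review context):

    import asyncio

    def evaluate_solution(problem, solution, ctk_version):
        """Hypothetical stand-in for the blocking compute-eval call, which
        compiles the generated CUDA code and runs the problem's tests."""
        ...

    async def eval_single(problem, solution, ctk_version) -> dict:
        # Compilation and test execution block for a while, so they run on the
        # default thread pool; the event loop stays free for other datapoints.
        result = await asyncio.to_thread(evaluate_solution, problem, solution, ctk_version)
        return {"compute_eval_result": result}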

requirements/main.txt (1)

17-17: LGTM - Git commit pinning ensures reproducibility.

The dependency is properly pinned to a specific commit (991b47c, currently the main branch HEAD) and correctly sorted alphabetically. The short hash is valid and unambiguous for this repository. Using the full 40-character SHA would provide additional robustness against theoretical hash collisions, but the current approach is acceptable for practical purposes.

nemo_skills/dataset/compute-eval/__init__.py (1)

15-19: LGTM - Dataset constants properly defined.

Constants are internally consistent: METRICS_TYPE matches the registry key in map_metrics.py (line 72), and EVAL_SPLIT matches the split accessed in prepare.py (lines 52-53). The GENERATION_MODULE path correctly points to the existing nemo_skills.inference.eval.compute_eval module.
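
Based on the constants named here and in the walkthrough, the package __init__.py presumably looks roughly like the following (the DATASET_GROUP and GENERATION_ARGS values are guesses for illustration):

    # nemo_skills/dataset/compute-eval/__init__.py (approximate reconstruction)
    EVAL_SPLIT = "eval"  # matches the split written by prepare.py
    DATASET_GROUP = "code"  # assumed grouping; actual value may differ
    METRICS_TYPE = "compute-eval"  # key registered in METRICS_MAP
    GENERATION_MODULE = "nemo_skills.inference.eval.compute_eval"
    GENERATION_ARGS = ""  # assumed default; actual value may differ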

nemo_skills/inference/eval/compute_eval.py (2)

1-34: LGTM!

The imports and module-level setup are well-structured. The TypeAdapter with a discriminated union for problem validation is a clean approach.
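
For readers unfamiliar with that approach, a minimal, generic sketch of a pydantic TypeAdapter over a discriminated union (the models and discriminator field below are invented for illustration; the real CudaCppProblem/CudaPythonProblem come from compute-eval):

    from typing import Annotated, Literal, Union

    from pydantic import BaseModel, Field, TypeAdapter

    class CppProblem(BaseModel):
        language: Literal["cuda_cpp"]
        task_id: str

    class PythonProblem(BaseModel):
        language: Literal["cuda_python"]
        task_id: str

    # The "language" field selects which model validates a given dict.
    ProblemAdapter = TypeAdapter(
        Annotated[Union[CppProblem, PythonProblem], Field(discriminator="language")]
    )

    problem = ProblemAdapter.validate_python({"language": "cuda_cpp", "task_id": "CUDA/0"})
    print(type(problem).__name__)  # CppProblem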


86-102: LGTM!

The module exports and main entry point are well-structured. The help message handling and logging setup follow good practices.

Signed-off-by: dlord <dlord@nvidia.com>
@gwarmstrong (Collaborator) left a comment


Looks pretty good, with a couple of high-level questions/concerns.


# noinspection PyBroadException
try:
api = HfApi()
Collaborator:

what is the purpose of this?

Collaborator (author):

This allows a user to access the gated HuggingFace data (you have to be allow-listed to access the ComputeEval data today) either via an environment variable (HF_TOKEN) or by logging in with the HuggingFace CLI.

Collaborator (author):

Actually, after looking closer we don't need to access the API at all for this to work. I'll just clean this up.

Collaborator:

yeah, we should just need a load_dataset call and it will handle the token if it's already in your env variables, like this one is gated: https://github.com/blahblahasdf/Skills/blob/cb7829094799f9b5967e260d30e39c2535f67bab/nemo_skills/dataset/flores200/prepare.py#L49
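
A minimal sketch of that simplification (illustrative; the release name and output handling are assumptions based on this PR's prepare.py description):

    import json
    import os
    from pathlib import Path

    from datasets import load_dataset

    # load_dataset resolves gated-repo auth on its own: it uses HF_TOKEN if set,
    # otherwise the token cached by `huggingface-cli login`; no whoami() needed.
    dataset = load_dataset("nvidia/compute-eval", "2025-2", token=os.environ.get("HF_TOKEN"))

    output_file = Path(__file__).absolute().parent / "eval.jsonl"
    with open(output_file, "w") as f:
        for example in dataset["eval"]:
            f.write(json.dumps(example) + "\n")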

generate_model_completions,
system_prompt=self._system_prompt,
problem=problem,
model=self._model,
Collaborator:

Looks like this is applying this function:
https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/generate_completions.py#L31-L84

Which appears to use the ComputeEval clients?
What I would recommend we do instead is write a version of that function that uses NeMo-Skills' process_single_datapoint--so this probably looks something like:

    result = super().process_single_datapoint(data_point, data)
    ...

    completion = ...
    return FileSolution(
        task_id=problem.task_id,
        files=_parse_solution(completion),
        **debug_info,
    )

This will make sure the prompt logic is consistent with typical NeMo-Skills usage and that the eval prompt logic is the same as for normal generations.

Collaborator:

Then the methods above that are currently overridden would just use the default implementations rather than being overridden to return nothing.

Collaborator (author):

Thanks for the quick feedback! Your suggestion is to call process_single_datapoint and then use the ComputeEval post-processing to convert the model's response to a FileSolution? The "ComputeEval clients" are just a pass-through to the OpenAI client. As is, I think if you pointed it at a local model (or a model hosted somewhere else) it would just work, but I agree it doesn't fit super cleanly with the NeMo-Skills abstractions.

Honestly, the generation side of the ComputeEval repo is generally intended to be a reference implementation for how other groups/companies would set up a generator scaffold to solve the ComputeEval problems. The reference implementation's prompt and post-processing logic are pretty coupled.

@gwarmstrong (Collaborator) commented Dec 19, 2025:

The logic for calling the openai client is here, right? https://github.com/NVIDIA/compute-eval/blob/main/compute_eval/models/model_interface.py#L100-L154
Seems pretty straightforward and is pretty much a subset of the functionality our clients have. And if we do it the way it's currently being done, we lose a lot of things we support in NeMo-Skills (like tool-calling and pretty much all sampling parameters not directly specified in the compute-eval function), and the supported server types would be quite limited.

Collaborator (author):

Thanks for the pointers and happy New Year! I see where you're coming from on fitting this in with the NeMo-Skills generation more cleanly. What is the expectation on how to document this GenerationTask's content format? I was able to get this working locally, but there is definitely some coupling between the expected data format and the system and user prompts required in a ++prompt_config. I would guess that's normal and just needs to be documented?

Collaborator:

Yeah that's pretty typical. We don't really have it documented on a case-by-case basis--it's pretty easy to infer from the prompt config itself, and we have a generic section on understanding the interaction between data and prompt_config here: https://nvidia-nemo.github.io/Skills/basics/prompt-format/#prompt-api
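
A rough illustration of that data/prompt coupling (invented field names except context_files_block, which appears in this PR's prepare step; the real prompt_config format is described in the linked docs):

    # A data point read from eval.jsonl must provide every field the prompt
    # template references, otherwise formatting fails.
    data_point = {
        "problem_statement": "Implement a CUDA kernel that ...",  # hypothetical field
        "context_files_block": "// header.cuh\n...",
    }
    user_template = (
        "Solve the following CUDA programming task.\n\n"
        "{problem_statement}\n\nContext files:\n{context_files_block}"
    )
    print(user_template.format(**data_point))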

… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach.
greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR integrates ComputeEval, a CUDA code generation benchmark, into the NeMo-Skills framework. The implementation adds dataset preparation from HuggingFace, a generation task for producing CUDA solutions, an evaluator that compiles and tests generated code, and metrics computation including pass@k scores.

Key changes:

  • Dataset preparation downloads problems from nvidia/compute-eval on HuggingFace with optional release version support
  • Generation task extends GenerationTask and parses LLM-generated CUDA code into structured solutions
  • Evaluator automatically detects installed CUDA Toolkit version and runs async evaluation using the compute-eval library
  • Metrics implementation follows existing code benchmark patterns (similar to HumanEvalInfillingMetrics)
  • All integration points properly registered in the framework's evaluator and metrics maps

Note: The compute-eval dependency version was updated from 991b47c to 4221fc2 in the final commit.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk, pending awareness of the protected member usage
  • The implementation follows existing patterns in the codebase consistently, integrates cleanly with the framework's architecture, and has proper error handling. Score reduced by one point due to the use of protected member _parse_solution from the compute-eval library, which was already flagged in previous review threads but represents a potential maintenance risk if the external library changes its internal API
  • Pay attention to nemo_skills/inference/eval/compute_eval.py due to its dependency on a protected member from the external library

Important Files Changed

Filename Overview
nemo_skills/dataset/compute-eval/prepare.py Implements dataset preparation script to download and format compute-eval problems from HuggingFace
nemo_skills/evaluation/evaluator/compute_eval.py Implements CUDA code evaluator with automatic CTK version detection and async execution
nemo_skills/inference/eval/compute_eval.py Implements generation task that parses CUDA solutions using compute-eval library (uses protected member)
requirements/main.txt Adds compute-eval dependency pinned to commit 4221fc2

Sequence Diagram

sequenceDiagram
    participant User
    participant Prepare as prepare.py
    participant HF as HuggingFace
    participant Gen as ComputeEvalGenerationTask
    participant LLM as Language Model
    participant Eval as ComputeEvalEvaluator
    participant CE as compute-eval library
    participant Metrics as ComputeEvalMetrics

    User->>Prepare: Run dataset preparation
    Prepare->>HF: load_dataset("nvidia/compute-eval", release)
    HF-->>Prepare: Return dataset with problems
    Prepare->>Prepare: Format context_files_block
    Prepare->>Prepare: Write eval.jsonl with problem data
    
    User->>Gen: Generate solutions
    Gen->>Gen: Read eval.jsonl
    loop For each problem
        Gen->>LLM: Send problem prompt
        LLM-->>Gen: Return generated code
        Gen->>CE: _parse_solution(generation)
        CE-->>Gen: Parsed file solutions
        Gen->>Gen: Create FileSolution object
        Gen->>Gen: Write solution to output
    end
    
    User->>Eval: Evaluate solutions
    Eval->>Eval: get_nvcc_version() and parse_semver()
    loop For each data_point
        Eval->>Eval: Validate problem and solution
        Eval->>CE: evaluate_solution(problem, solution, ctk_version)
        CE->>CE: Compile and test CUDA code
        CE-->>Eval: Return graded result (passed, skipped, etc.)
        Eval->>Eval: Return evaluation dict
    end
    
    User->>Metrics: Compute metrics
    Metrics->>Metrics: Aggregate accuracy scores
    Metrics->>Metrics: Compute pass@k metrics
    Metrics-->>User: Return final metrics

greptile-apps bot left a comment

Additional Comments (1)

  1. nemo_skills/inference/eval/compute_eval.py, line 21 (link)

    style: Using protected member _parse_solution. Consider requesting public API when compute-eval stabilizes.


8 files reviewed, 1 comment


# Conflicts:
#	nemo_skills/evaluation/evaluator/__init__.py
greptile-apps bot left a comment

8 files reviewed, 1 comment


from compute_eval.data.data_model import FileSolution

# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
greptile-apps bot:

style: accessing protected member _parse_solution from external library

If _parse_solution becomes unavailable in future versions of compute-eval, this will break. Consider requesting a public API from the compute-eval library.
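
One way to soften that risk (a hedged sketch, not something this PR does): fail loudly at import time so a future compute-eval release that drops the helper is easy to diagnose:

    # noinspection PyProtectedMember
    try:
        from compute_eval.generate_completions import _parse_solution
    except ImportError as err:
        raise ImportError(
            "compute-eval no longer exposes _parse_solution; pin the dependency to a "
            "known-good commit or switch to a public parsing API if one is available"
        ) from err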


greptile-apps bot commented Jan 7, 2026

Greptile found no issues!

