Conversation

@kaiming-cheng
Contributor

This PR adds bottleneck analysis prompt building and response parsing to the diagnose module.

Core Components

1. BottleneckResult (judger_prompt.py)

Single dataclass representing a bottleneck analysis:

  • `category`: memory, compute, or underutilized
  • `summary`: one-line description
  • `reasoning`: explanation citing metrics
  • `root_causes`: list of causes with metric evidence
  • `recommended_fixes`: actionable fixes with rationale
  • Configurable analysis parameters:
    • `num_bottlenecks`: how many bottlenecks to identify (default: 2)
    • `num_causes`: root causes per bottleneck (default: 2)
    • `num_fixes`: fixes per bottleneck (default: 1)

2. Prompt Builder (judger_prompt.py)

Constructs structured LLM prompts from:

  • Kernel source code
  • NCU profiling metrics (formatted via metric_schema)
  • Roofline analysis results (via ncu_roofline)
  • GPU hardware specifications (via gpu_spec)

Example Usage

  prompt = build_bottleneck_prompt(
      kernel_code=kernel_src,
      ncu_metrics=ncu_data,
      roofline=roofline_result,
      gpu_specs=gpu_specs,
      num_bottlenecks=2,
      num_causes=2,
      num_fixes=1,
  )

  # After the LLM call...
  results = parse_bottleneck_response(llm_response)

More end-to-end testing will come in a future PR.

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 31, 2026
@Jack-Khuu Jack-Khuu requested a review from Copilot February 4, 2026 03:26

Copilot AI left a comment


Pull request overview

This PR introduces a structured bottleneck-diagnosis pipeline around Nsight Compute metrics, including roofline analysis, GPU spec lookup, and an LLM-oriented prompt/response interface.

Changes:

  • Add an NCU SOL-based roofline analysis module (RooflineConfig, RooflineResult, RooflineAnalyzer, format_roofline_summary) and a corresponding roofline package scaffold.
  • Introduce a diagnose prompt subsystem with metric schemas, GPU specs database/accessor, and the BottleneckResult + prompt builder/response parser (build_bottleneck_prompt, parse_bottleneck_response).
  • Tidy up NCU profiling utilities (selection policy handling), update profiler package docstrings, and slightly simplify dtype handling in the Triton kernel benchmarking subprocess.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Show a summary per file

  • triton_kernel_agent/opt_worker_component/benchmarking/kernel_subprocess.py: Simplifies CLI dtype selection by inlining the string→torch.dtype mapping where dtype is constructed.
  • kernel_perf_agent/kernel_opt/roofline/ncu_roofline.py: Adds SOL-based roofline analysis (RooflineAnalyzer, RooflineResult, NCU_ROOFLINE_METRICS) and text summary formatting for kernel efficiency and bottleneck classification.
  • kernel_perf_agent/kernel_opt/roofline/__init__.py: Declares the roofline subpackage for roofline-related analysis components.
  • kernel_perf_agent/kernel_opt/profiler/ncu_profiler.py: Refactors metric selection policy handling, tightening the select parameter type and simplifying _apply_selection_policy/load_ncu_metrics control flow.
  • kernel_perf_agent/kernel_opt/profiler/__init__.py: Updates the profiler package docstring to specifically describe NCU profiling responsibilities.
  • kernel_perf_agent/kernel_opt/diagnose_prompt/metric_schema.py: Defines canonical metric schemas (NCU_METRIC_SECTIONS, GPU spec fields) used to format NCU metrics and GPU specs for prompts.
  • kernel_perf_agent/kernel_opt/diagnose_prompt/judger_prompt.py: Implements BottleneckResult, a structured bottleneck analysis prompt template, formatting helpers, and robust JSON response parsing into BottleneckResult objects.
  • kernel_perf_agent/kernel_opt/diagnose_prompt/gpu_specs_database.py: Provides a curated GPU hardware spec database (A100/H100 SKUs, RTX cards) for use in bottleneck and roofline contextualization.
  • kernel_perf_agent/kernel_opt/diagnose_prompt/gpu_specs.py: Exposes GPU_SPECS_DATABASE and get_gpu_specs() with logging and a simple CLI-style demonstration entrypoint.
  • kernel_perf_agent/kernel_opt/diagnose_prompt/__init__.py: Declares the diagnose_prompt package and its documentation string for bottleneck analysis helpers.
  • kernel_perf_agent/__init__.py: Cleans up the top-level package by removing an unused comment and keeping __all__ empty.
Comments suppressed due to low confidence (1)

kernel_perf_agent/kernel_opt/profiler/ncu_profiler.py:320

  • load_ncu_metrics now types the select argument as MetricSelectionPolicy and no longer converts string values, but existing internal callers (e.g. kernel_perf_agent/kernel_opt/profiler/kernel_profiler.py:198 passes select="last") still use strings, which now rely on the generic else fallback path in _apply_selection_policy rather than an explicit policy. To keep the API consistent and avoid subtle behavior changes for non-enum values, either (1) restore explicit string-to-enum conversion with validation, or (2) update all call sites to pass a MetricSelectionPolicy (e.g. MetricSelectionPolicy.LAST) and consider raising for unknown policies instead of silently treating them as LAST.
    select: MetricSelectionPolicy = MetricSelectionPolicy.LAST,
) -> pd.DataFrame:
    """
    Load and parse NCU metrics from CSV file.



"gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed", # Memory SOL
"sm__throughput.avg.pct_of_peak_sustained_elapsed", # Compute SOL
# Tensor core detection
"sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active",

Copilot AI Feb 4, 2026


NCU_ROOFLINE_METRICS and _is_using_tensor_cores use the metric key "sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active", but the profiler (ncu_profiler.METRICS) and NCU_METRIC_SECTIONS both use "sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed". This mismatch means tensor-core activity will always appear as 0 when analyzing metrics produced by the existing profiler; align the key here with the profiler/schema (or vice versa) so tensor-core detection works correctly.

Suggested change
"sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active",
"sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed",

Comment on lines 34 to 35
# Note: The profiler (ncu_profiler.py) collects these and more metrics.
# This list documents the minimum required for roofline decisions.

Copilot AI Feb 4, 2026


The comment stating that "The profiler (ncu_profiler.py) collects these and more metrics" is currently inaccurate: kernel_perf_agent/kernel_opt/profiler/ncu_profiler.METRICS does not include the SOL metrics "sm__throughput.avg.pct_of_peak_sustained_elapsed" or "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed". Either extend METRICS to include NCU_ROOFLINE_METRICS (so roofline analysis can run directly on profiler output) or update this docstring to clarify that additional metrics are required for SOL-based analysis.

Suggested change
# Note: The profiler (ncu_profiler.py) collects these and more metrics.
# This list documents the minimum required for roofline decisions.
# Note: These are the minimum metrics required for SOL-based roofline decisions.
# The default profiler configuration (ncu_profiler.py: METRICS) may need to be
# extended to include these NCU_ROOFLINE_METRICS for roofline analysis to run
# directly on profiler output.
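The first option (extending METRICS) might look like the following. Only the SOL metric names come from the review; the existing contents of METRICS shown here are illustrative:

```python
NCU_ROOFLINE_METRICS = [
    "gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed",  # Memory SOL
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",                  # Compute SOL
]

# Hypothetical stand-in for the profiler's default metric list:
METRICS = [
    "sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_elapsed",
]

# Append the roofline metrics, deduplicating while preserving order, so
# roofline analysis can run directly on profiler output:
METRICS = list(dict.fromkeys(METRICS + NCU_ROOFLINE_METRICS))
```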

## Output (JSON array, no markdown fence)
[
{{
"category": "memory" | "compute" | "underutilized",

Copilot AI Feb 4, 2026


In the JSON output template, the example line "category": "memory" | "compute" | "underutilized", is not valid JSON and may encourage models to emit the literal | syntax, which parse_bottleneck_response will then fail to decode. To make it easier for the LLM to produce parseable output, use a concrete example value (e.g., "memory") and move the enumeration of allowed categories into surrounding natural-language instructions instead of inside the JSON snippet.

Suggested change
"category": "memory" | "compute" | "underutilized",
"category": "memory",
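A template following this suggestion, with the category enumeration moved into the surrounding prose so the JSON span stays parseable. The example field values are assumptions, and a template fed through str.format would still need doubled braces as in the original:

```python
import json
import re

OUTPUT_TEMPLATE = """## Output (JSON array, no markdown fence)
The "category" field must be one of: "memory", "compute", "underutilized".
[
  {
    "category": "memory",
    "summary": "Kernel is DRAM-bandwidth bound",
    "reasoning": "Memory SOL is 85% while compute SOL is 20%."
  }
]
"""

# Unlike the "|"-style enumeration, the embedded array round-trips cleanly:
example = json.loads(re.search(r"\[.*\]", OUTPUT_TEMPLATE, re.DOTALL).group())
```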

Comment on lines 88 to 90
print(f"\n{'=' * 60}")
example_gpu = "NVIDIA A100"
specs = get_gpu_specs(example_gpu)

Copilot AI Feb 4, 2026


The docstring and __main__ example use "NVIDIA A100" as the GPU name, but GPU_SPECS_DATABASE only contains more specific keys like "NVIDIA A100 SXM4 40GB" and "NVIDIA A100 PCIe 80GB", so get_gpu_specs("NVIDIA A100") will always return None. Update the example (and/or relax the key-matching logic) so that the documented usage actually resolves to an entry in GPU_SPECS_DATABASE.
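Relaxing the key matching could look like the sketch below. The database keys come from the comment, but the spec values and the function body are hypothetical, not the PR's actual get_gpu_specs():

```python
# Illustrative database entries; real entries carry full hardware specs.
GPU_SPECS_DATABASE = {
    "NVIDIA A100 SXM4 40GB": {"memory_gb": 40},
    "NVIDIA A100 PCIe 80GB": {"memory_gb": 80},
}


def get_gpu_specs(name):
    """Exact match first, then fall back to the first (sorted) key that the
    query is a prefix of, so "NVIDIA A100" resolves to a concrete SKU."""
    if name in GPU_SPECS_DATABASE:
        return GPU_SPECS_DATABASE[name]
    matches = sorted(k for k in GPU_SPECS_DATABASE if k.startswith(name))
    return GPU_SPECS_DATABASE[matches[0]] if matches else None
```

Prefix fallback keeps the documented "NVIDIA A100" usage working; picking the sorted-first SKU is arbitrary, so logging which key was chosen would be sensible.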

data = json.loads(array_match.group())
if isinstance(data, list):
return _parse_bottleneck_list(data, fallback_category)
except json.JSONDecodeError:

Copilot AI Feb 4, 2026


'except' clause does nothing but pass and there is no explanatory comment.
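A self-contained sketch of the same fallthrough with the silent pass replaced by a comment and a debug log. Function and variable names are illustrative, not the actual parser:

```python
import json
import logging
import re

logger = logging.getLogger(__name__)


def extract_json_array(text):
    """Pull the first JSON array out of free-form LLM text, or return None."""
    array_match = re.search(r"\[.*\]", text, re.DOTALL)
    if array_match:
        try:
            data = json.loads(array_match.group())
            if isinstance(data, list):
                return data
        except json.JSONDecodeError:
            # The bracketed span was not valid JSON (e.g. prose that happens
            # to contain brackets); log and fall through to other strategies.
            logger.debug("array candidate failed to parse", exc_info=True)
    return None
```

The same pattern applies to the single-object branch flagged in the next comment.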

data = json.loads(obj_match.group())
if isinstance(data, dict):
return _parse_bottleneck_list([data], fallback_category)
except json.JSONDecodeError:

Copilot AI Feb 4, 2026


'except' clause does nothing but pass and there is no explanatory comment.

root_causes: list[dict[str, Any]] = field(default_factory=list)
recommended_fixes: list[dict[str, Any]] = field(default_factory=list)

def to_dict(self) -> dict[str, Any]:
Contributor


If we aren't doing any custom logic we can just drop to_dict in favor of dataclass asdict
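Concretely, dataclasses.asdict covers this as suggested. The field set below mirrors the snippet above; defaults are illustrative:

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class BottleneckResult:
    category: str = "underutilized"
    summary: str = ""
    reasoning: str = ""
    root_causes: list[dict[str, Any]] = field(default_factory=list)
    recommended_fixes: list[dict[str, Any]] = field(default_factory=list)


# With no custom serialization logic, asdict() replaces a hand-written
# to_dict() and also deep-converts nested dataclasses for free:
d = asdict(BottleneckResult(category="memory", summary="DRAM bound"))
```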

Comment on lines 167 to 168
compute_sol = ncu_metrics.get(compute_key, 0)
memory_sol = ncu_metrics.get(memory_key, 0)
Contributor


Does 0 mean that something went wrong/wasn't measured? Or is that an error itself?
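One way to make that distinction explicit, since .get(key, 0) conflates "not measured" with a genuine 0% SOL reading. The helper name and behavior are a suggestion, not the PR's code:

```python
def get_sol(ncu_metrics, key):
    """Return a SOL percentage, refusing to treat a missing metric as 0%."""
    value = ncu_metrics.get(key)
    if value is None:
        # A true 0% SOL is a valid (if suspicious) measurement; a missing key
        # means the metric was never collected, so classification based on it
        # would be meaningless and should fail loudly.
        raise KeyError(f"NCU metric {key!r} was not collected")
    return float(value)
```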
