
Updated Optimization Worker #93

Merged
kaiming-cheng merged 42 commits into main from kaiming/worker_clean
Feb 18, 2026
Conversation

@kaiming-cheng (Contributor) commented Feb 2, 2026

Summary:

  • Adds verify_with_refinement() method for simpler single-shot verification with refinement loop
  • Enables optimization loops to manage their own iteration while delegating correctness checking to the
    worker
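The division of labor described above can be sketched as follows. This is a hypothetical illustration only: `WorkerStub`, `propose_kernel`, and the `(passed, refined_source)` return shape are assumptions, not the real `OptimizationWorker` signature.

```python
# Hypothetical sketch: the caller owns the iteration loop, while the worker's
# verify_with_refinement() owns correctness checking. All names below are
# illustrative stand-ins for the actual API.

class WorkerStub:
    def verify_with_refinement(self, kernel_code, test_code):
        # Stand-in: run the tests once, attempt refinement on failure,
        # and return (passed, final kernel source).
        return True, kernel_code

def propose_kernel(current):
    # Caller-owned mutation step (placeholder).
    return current + "  # tweak"

worker = WorkerStub()
best = "kernel_v0"
for _ in range(3):
    candidate = propose_kernel(best)
    ok, refined = worker.verify_with_refinement(candidate, test_code="...")
    if ok:
        best = refined  # keep the verified (possibly refined) version

print(best)
```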

Test

    worker = OptimizationWorker(
        worker_id=0,
        workdir=workdir,
        log_dir=log_dir,
        max_rounds=args.max_rounds,
        openai_model=args.model,
        high_reasoning_effort=True,
        benchmark_warmup=25,
        benchmark_repeat=100,
        divergence_threshold=50.0,
        target_platform="cuda",
        gpu_name="NVIDIA H100 NVL 94GB",
    )

    # Run optimization
    print("\nStarting optimization...")
    success, best_kernel, metrics = worker.optimize_kernel(
        kernel_code=kernel_code,
        problem_file=problem_file,
        test_code=test_code,
    )

Result

# Round 1
2026-02-01 23:20:31,642 - opt_worker_0 - INFO - [1] 🎉 NEW BEST! 5.4265 ms (speedup: 1.01x, improvement: 0.6%)
2026-02-01 23:20:31,642 - opt_worker_0 - INFO - [1] Roofline: compute-bound, 82.8% SOL (Compute: 20.4%, Memory: 82.8%)
# Round 2
2026-02-01 23:27:42,678 - opt_worker_0 - INFO - [2] 🎉 NEW BEST! 5.4116 ms (speedup: 1.00x, improvement: 0.3%)
2026-02-01 23:27:42,678 - opt_worker_0 - INFO - [2] Roofline: compute-bound, 84.8% SOL (Compute: 20.5%, Memory: 84.8%)
# Round 3
2026-02-01 23:33:18,010 - opt_worker_0 - INFO - [3] 🎉 NEW BEST! 3.5323 ms (speedup: 1.53x, improvement: 34.7%)
2026-02-01 23:33:18,011 - opt_worker_0 - INFO - [3] Roofline: compute-bound, 91.2% SOL (Compute: 11.0%, Memory: 91.2%)
# Round 4
2026-02-01 23:40:54,465 - opt_worker_0 - INFO - [4] 🎉 NEW BEST! 3.5224 ms (speedup: 1.00x, improvement: 0.3%)
2026-02-01 23:40:54,465 - opt_worker_0 - INFO - [4] Roofline: compute-bound, 91.4% SOL (Compute: 11.0%, Memory: 91.4%)
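The per-round speedup and improvement figures follow directly from consecutive best timings. The formulas below are inferred from the printed values (rounds 2 to 3 in the log above), not taken from the worker's source:

```python
# Inferred relationship between timings and the logged percentages:
#   speedup     = previous_best_ms / new_best_ms
#   improvement = (previous_best_ms - new_best_ms) / previous_best_ms * 100
prev_best_ms, new_best_ms = 5.4116, 3.5323  # rounds 2 -> 3 from the log
speedup = prev_best_ms / new_best_ms
improvement = (prev_best_ms - new_best_ms) / prev_best_ms * 100
print(f"speedup: {speedup:.2f}x, improvement: {improvement:.1f}%")
```

This reproduces the round-3 line: speedup 1.53x, improvement 34.7%.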


Kaiming Cheng and others added 30 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
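A minimal sketch of the "essential metrics" stats shape referenced above (mean/std/min/max). The exact field names and edge-case handling in timing.py's `compute_timing_stats()` may differ; this is an assumption for illustration:

```python
import statistics

def compute_timing_stats(times_ms):
    # Sketch of the essential-metrics dict described in the commit message
    # (mean/std/min/max, with median and percentiles removed).
    return {
        "mean": statistics.fmean(times_ms),
        "std": statistics.stdev(times_ms) if len(times_ms) > 1 else 0.0,
        "min": min(times_ms),
        "max": max(times_ms),
    }

stats = compute_timing_stats([5.2, 5.4, 5.3, 5.5])
print(stats["mean"], stats["min"], stats["max"])
```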

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Feb 2, 2026
@Jack-Khuu (Contributor) left a comment


I assume worker/worker_util are the only unique changes, lmk if that's not true

Can you check that the changes to worker_util aren't duplicates of existing functions? I'm down to move them in a different PR if it's the same, but let's keep the line changes minimal for this PR

Comment thread triton_kernel_agent/worker_util.py Outdated
# ------------------------


def _call_llm(
Contributor

Don't we already have something like this in the worker?

Contributor Author

Good catch! Fixed it in a new commit.

Comment thread triton_kernel_agent/worker_util.py Outdated
# ------------------------


def _extract_code_from_response(
Contributor

Ditto?


return success, stdout, stderr, None

def verify_with_refinement(
Contributor

When is this used?

@msaroufim msaroufim self-requested a review February 14, 2026 01:47
@msaroufim left a comment

Stamping for blog, I did not review

@Jack-Khuu (Contributor) left a comment

Stamp to unblock

@kaiming-cheng merged commit 678a00a into main Feb 18, 2026
6 checks passed
3 participants