Add stepwise intermediate reward for RL #526

Shaobo-Zhou · 2025-11-26T16:11:16Z

Description

This PR introduces a shaped, step-wise reward signal for the RL-based compiler.

For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:

Exact regime (native + mapped circuits)
If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.
Approximate regime (non-native / unmapped circuits)
If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.

Checklist:

The pull request only contains commits that are focused and relevant to this change.
I have added appropriate tests that cover the new/changed functionality.
I have updated the documentation to reflect these changes.
I have added entries to the changelog for any noteworthy additions, changes, fixes, or removals.
I have added migration instructions to the upgrade guide (if needed).
The changes follow the project's style guidelines and introduce no new warnings.
The changes are fully tested and pass the CI checks.
I have reviewed my own code changes.

Update action space and feature space Update actions Update action space

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Remove example_test.py Remove example_test.py

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations

Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

Fix bugs Fix bugs Fix bugs

…e gate check

Fix windows runtime warning problem Fix windows runtime warning issue

codecov · 2025-11-27T20:14:47Z

Codecov Report

❌ Patch coverage is 91.56627% with 21 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/mqt/predictor/rl/predictorenv.py	91.0%	16 Missing ⚠️
src/mqt/predictor/rl/cost_model.py	91.9%	5 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai · 2025-12-03T22:11:17Z

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added device-aware canonical gate cost model with per-device cost tables and fidelity/success probability estimations.
- Introduced dual-path reward computation supporting exact and approximate modes with device-averages caching.
Bug Fixes
- Resolved multiprocessing compatibility issue on Windows with Python 3.13+.
Changed
- Enhanced idle-time decay calculations with improved time unit handling.
Documentation
- Updated setup examples with revised benchmark level configurations.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Walkthrough

Adds a device-aware canonical gate cost model, dual-path RL reward computation (exact vs approximate) with device-averages caching, integrates dt-scaled idle decay for single-qubit delay gates into ESP, adds a Windows/Python 3.13 NO_PARALLEL fallback, updates RL env action/layout postprocessing and tests/docs/packaging (BenchmarkLevel → INDEP).

Changes

Cohort / File(s)	Change Summary
Changelog & Docs `CHANGELOG.md`, `docs/setup.md`	Updated Unreleased entries and PR links; example changed `BenchmarkLevel.ALG` → `BenchmarkLevel.INDEP`.
Packaging `pyproject.toml`	Added `[tool.check-sdist]` sdist-only entry and a test `filterwarnings` pattern to ignore a device-cost warning.
ML predictor parallelism `src/mqt/predictor/ml/predictor.py`	Added module-level `NO_PARALLEL` constant and conditional single-thread fallbacks for compile/generate/train paths to avoid a Windows/Python 3.13 multiprocessing issue.
Reward idle-time scaling `src/mqt/predictor/reward.py`	Applied discrete time unit `dt` to idle-time exponential decay for single-qubit delay gates in `estimated_success_probability`.
Utilities / Logging `src/mqt/predictor/utils.py`	Replaced Windows `RuntimeWarning` with `logger.info` for timeout-not-supported path; removed `warnings.warn` import.
RL cost model (new module) `src/mqt/predictor/rl/cost_model.py`	New device-aware canonical cost tables (ibm_torino, ankaa_3, emerald) and APIs: `get_cost_table`, `canonical_cost`, `estimate_counts`, `approx_expected_fidelity`, `approx_estimated_success_probability`; includes fallbacks, clamping, and warnings.
RL environment & reward refactor `src/mqt/predictor/rl/predictorenv.py`	Major changes: new `calculate_reward(mode="auto"
Tests — RL compilation `tests/compilation/test_predictor_rl.py`	Tests updated to use `BenchmarkLevel.INDEP`; added cost-model imports and tests for unknown-device/gate fallback and `calculate_reward` branches; mapping checks via `CheckMap`.
Tests — ML device selection `tests/device_selection/test_predictor_ml.py`	Switched multiple test cases from `ALG` to `INDEP`, adjusting generated benchmark inputs and prediction scenarios.
Tests — Hellinger distance `tests/hellinger_distance/test_estimated_hellinger_distance.py`	Removed Windows-specific warning branch, increased timeout, removed `num_workers` arguments from calls and updated call signatures to match ML predictor changes.

Sequence Diagram(s)

sequenceDiagram
    actor Agent as RL Agent
    participant Env as PredictorEnv
    participant Calc as calculate_reward()
    participant Cost as cost_model
    participant Dev as DeviceAveragesCache
    participant Reward as reward.py

    Agent->>Env: step(action)
    Env->>Calc: calculate_reward(mode="auto")  -- pre-action
    alt auto → native & mapped (exact)
        Calc->>Dev: _ensure_device_averages_cached()
        Dev-->>Calc: device averages (p1,p2,tau1,tau2,tbar)
        Calc->>Reward: expected_fidelity / estimated_success_probability (device data)
        Reward-->>Calc: exact_reward
    else auto → approximate/fallback
        Calc->>Cost: approx_expected_fidelity / approx_estimated_success_probability
        Cost-->>Calc: approx_reward
    end
    Env->>Env: apply action (via _apply_qiskit_action / pm_property_set)
    Env->>Calc: calculate_reward(mode=exact/approx)  -- post-action
    Calc-->>Env: post_reward
    Env->>Env: delta = post_reward - pre_reward → apply shaping & penalties
    Env-->>Agent: observation, reward (delta-shaped), done, info

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Files warranting extra attention:
- src/mqt/predictor/rl/predictorenv.py — reward flow, device-averages extraction, pm_property_set propagation, step() lifecycle.
- src/mqt/predictor/rl/cost_model.py — correctness of canonical tables, cost mapping, ESP/fidelity approximations and fallbacks.
- src/mqt/predictor/ml/predictor.py and tests/hellinger_distance/* — NO_PARALLEL behavior and updated compile/generate signatures.
- tests/compilation/test_predictor_rl.py — new test branches and mapping assertions.

Poem

🐰
I nibble costs from gates so fine,
dual-path rewards now hop in line.
Idle-time ticks with dt's soft beat,
Windows hush while threads take a seat.
Circuits map, rewards sing — what a treat!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: introducing stepwise intermediate rewards for the RL-based compiler. It is specific, concise, and reflects the primary feature being added.
Description check	✅ Passed	The description provides a clear summary of the change with motivation and context. It explains both the exact and approximate regimes for reward computation. However, the checklist items are not marked as complete, which may indicate incomplete review.
Docstring Coverage	✅ Passed	Docstring coverage is 85.29% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 14

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad5ae0b and 0c42b41.

📒 Files selected for processing (13)

CHANGELOG.md (3 hunks)
docs/setup.md (1 hunks)
noxfile.py (3 hunks)
pyproject.toml (1 hunks)
src/mqt/predictor/ml/predictor.py (5 hunks)
src/mqt/predictor/reward.py (1 hunks)
src/mqt/predictor/rl/cost_model.py (1 hunks)
src/mqt/predictor/rl/parsing.py (1 hunks)
src/mqt/predictor/rl/predictorenv.py (10 hunks)
src/mqt/predictor/utils.py (1 hunks)
tests/compilation/test_predictor_rl.py (5 hunks)
tests/device_selection/test_predictor_ml.py (3 hunks)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1 hunks)

🧰 Additional context used

🧠 Learnings (7)

📚 Learning: 2025-10-14T14:37:38.047Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.

Applied to files:

CHANGELOG.md

📚 Learning: 2025-11-05T09:23:46.540Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1287
File: src/qdmi/dd/Device.cpp:492-521
Timestamp: 2025-11-05T09:23:46.540Z
Learning: In the munich-quantum-toolkit/core repository, the `qasm3::Importer::imports()` function is backwards compatible with OpenQASM 2 programs. Therefore, it can be used to parse both QASM2 and QASM3 program formats without requiring separate importers for each version.

Applied to files:

src/mqt/predictor/rl/parsing.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/parsing.py
src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-10-10T08:09:54.528Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: bindings/CMakeLists.txt:0-0
Timestamp: 2025-10-10T08:09:54.528Z
Learning: In the Munich Quantum Toolkit (MQT) Core project, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which automatically prefixes all CMake `DESTINATION` paths with `mqt/core/` during wheel installation. Therefore, CMake install destinations are relative to the `mqt/core` package namespace, not `site-packages`.

Applied to files:

pyproject.toml

📚 Learning: 2025-10-10T08:10:16.394Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: test/python/na/test_na_fomac.py:35-0
Timestamp: 2025-10-10T08:10:16.394Z
Learning: In the munich-quantum-toolkit/core repository, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which means CMake `install()` commands with `DESTINATION <path>` install files relative to `mqt/core/` in the wheel, making them accessible via `files("mqt.core").joinpath("<path>")`.

Applied to files:

pyproject.toml

📚 Learning: 2025-12-02T10:08:36.022Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: .github/workflows/cd.yml:40-48
Timestamp: 2025-12-02T10:08:36.022Z
Learning: In munich-quantum-toolkit workflows, the reusable-python-packaging-sdist.yml workflow uploads the sdist artifact with the name `cibw-sdist` (or `dev-cibw-sdist` for pull requests), which means it is covered by the `cibw-*` glob pattern when downloading artifacts for deployment.

Applied to files:

pyproject.toml

📚 Learning: 2025-11-27T16:58:08.564Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/ddsim PR: 720
File: noxfile.py:57-60
Timestamp: 2025-11-27T16:58:08.564Z
Learning: In Nox sessions, when using `session.install()` to install a tool and then `session.run(..., external=True)` to execute it, this pattern is valid because the session's virtual environment is activated and its bin directory is on PATH. The `external=True` flag allows calling executables on PATH, which includes the activated venv's bin directory, so it will find tools installed via `session.install()`.

Applied to files:

noxfile.py

🧬 Code graph analysis (3)

tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

src/mqt/predictor/ml/predictor.py (2)

compile_training_circuits (206-240)

generate_training_data (242-300)

src/mqt/predictor/reward.py (2)

tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

device (48-50)

tests/compilation/test_reward.py (1)

device (38-40)

tests/compilation/test_predictor_rl.py (2)

src/mqt/predictor/rl/cost_model.py (2)

canonical_cost (161-173)

get_cost_table (140-158)

src/mqt/predictor/rl/predictorenv.py (1)

calculate_reward (272-355)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

666-668: try-except-continue detected, consider logging the exception

(S112)

666-666: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (19)

noxfile.py (2)

21-21: LGTM!

The pathlib import is necessary for the filesystem operations in the new _cleanup function.

112-113: LGTM!

The conditional cleanup call is appropriately placed and only runs in CI environments to save disk space. The pattern is consistent with other CI checks in the file.

CHANGELOG.md (1)

14-15: LGTM! Changelog entries follow the project template.

The changelog entries, PR links, and contributor reference are correctly formatted and consistent with the project's changelog template. Based on learnings, the format matches the expected template.

Also applies to: 49-50, 68-68

pyproject.toml (1)

97-98: LGTM! Packaging configuration correctly added.

The check-sdist configuration correctly specifies that _version.py should only be present in source distributions (sdist), which is the expected behavior for version files generated by hatch-vcs during the build process.

src/mqt/predictor/utils.py (1)

54-54: LGTM! Appropriate logging level for platform limitation.

The change from warnings.warn with RuntimeWarning to logger.info is appropriate. Since running without timeout on Windows is a known platform limitation rather than an error condition, an informational log message is more suitable than a warning.

tests/device_selection/test_predictor_ml.py (1)

46-46: LGTM! Test benchmark level updated consistently.

The change from BenchmarkLevel.ALG to BenchmarkLevel.INDEP is consistent with the PR objective and changelog entry about updating the test circuit level for the RL predictor. All test cases in this file have been updated consistently.

Also applies to: 66-66, 96-96

src/mqt/predictor/ml/predictor.py (4)

64-64: LGTM! Platform-specific workaround for multiprocessing issue.

The NO_PARALLEL constant correctly identifies the Windows + Python 3.13+ combination where joblib's "loky" backend is broken. This is a well-documented issue with the missing _posixsubprocess module.

232-235: LGTM! Clear documentation and appropriate fallback.

The comment clearly explains the Windows/Python 3.13 multiprocessing limitation, and the fallback to single-threaded execution (num_jobs=1) is the appropriate workaround for this platform issue.

275-276: LGTM! Consistent application of multiprocessing workaround.

The same NO_PARALLEL conditional is correctly applied to the generate_training_data method, ensuring consistent behavior across all parallel execution paths.

415-418: LGTM! Multiprocessing workaround applied to ML training.

The NO_PARALLEL conditional is correctly applied to the GridSearchCV training, ensuring the RandomForest model training also respects the Windows/Python 3.13 limitation.

docs/setup.md (1)

112-112: LGTM! Documentation updated to reflect new benchmark level.

The documentation example has been updated from BenchmarkLevel.ALG to BenchmarkLevel.INDEP, which is consistent with the changes in test files and the changelog entry about updating the test circuit level for the RL predictor.

tests/compilation/test_predictor_rl.py (3)

46-54: LGTM!

The switch from BenchmarkLevel.ALG to BenchmarkLevel.INDEP aligns with the PR's focus on device-independent circuits for RL training. The test logic remains correct.

108-115: LGTM!

Good addition of mapping verification. The test now validates both native gates and proper coupling map adherence, which is essential for the new dual-path reward system.

156-167: LGTM!

Comprehensive test coverage for cost model fallback behavior. The identity check (is TORINO_CANONICAL_COSTS) correctly verifies the exact table is returned rather than a copy.

src/mqt/predictor/rl/cost_model.py (2)

1-35: LGTM!

Well-documented module header and type definitions. The CanonicalCostTable type alias is appropriately defined as a Mapping for read-only access.

211-237: LGTM!

The approximate fidelity calculation is correct. The defensive max(n_1q, 0) guards and final clamping to [0, 1] ensure numerical robustness.

src/mqt/predictor/rl/predictorenv.py (3)

179-188: LGTM!

Good addition of instance variables for the dual-path reward system. The device averages cache (_p1_avg, _p2_avg, etc.) and reward tracking variables (prev_reward, prev_reward_kind) are appropriately initialized.

731-740: LGTM!

The method correctly checks both native gate basis and coupling map compliance. The logic is clear and matches the determine_valid_actions_for_state pattern.

458-464: Undefined method fom_aware_compile and attribute max_iter.

The code references self.fom_aware_compile and self.max_iter, but their definitions cannot be verified without access to the codebase. If these are not defined in the class or inherited from a parent class, this will raise an AttributeError at runtime when a stochastic action is applied.

coderabbitai · 2025-12-03T22:11:20Z

noxfile.py

+def _cleanup(session: nox.Session) -> None:
+    """Remove this session's virtualenv to save disk space in CI."""
+    version = session.python
+    if version != "3.13":  # keep cache for last run


⚠️ Potential issue | 🟡 Minor

Hardcoded version check may break if test order changes.

The check if version != "3.13" assumes 3.13 is always the last test run. If PYTHON_ALL_VERSIONS is reordered or a newer version (e.g., 3.14) is added, this logic will fail to preserve the correct cache.

Consider dynamically determining the last version:

+ # Keep cache for the last Python version in the test matrix + last_version = PYTHON_ALL_VERSIONS[-1] - if version != "3.13": # keep cache for last run + if version != last_version: # keep cache for last run venv_dir = session.virtualenv.location

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if version != "3.13": # keep cache for last run

# Keep cache for the last Python version in the test matrix

last_version = PYTHON_ALL_VERSIONS[-1]

if version != last_version: # keep cache for last run

venv_dir = session.virtualenv.location

🤖 Prompt for AI Agents

In noxfile.py around line 67, the hardcoded check `if version != "3.13"` assumes 3.13 is always the last version and will break if PYTHON_ALL_VERSIONS is reordered or new versions are added; change the logic to compute the last version dynamically from PYTHON_ALL_VERSIONS (e.g., take the last element or compute the max/semver-aware last) and compare `version` against that computed value so the cache-preserve branch always targets the actual last configured Python version.

noxfile.py

coderabbitai · 2025-12-03T22:11:21Z

noxfile.py

+        gha_temp = pathlib.Path("/home/runner/work/_temp/setup-uv-cache")
+        if gha_temp.exists():
+            shutil.rmtree(gha_temp, ignore_errors=True)
+            session.log(f"Cleaned GitHub Actions uv temp cache at {gha_temp}")


🧹 Nitpick | 🔵 Trivial

Consider adding a comment for GitHub Actions-specific cleanup.

The hardcoded path /home/runner/work/_temp/setup-uv-cache is specific to GitHub Actions and won't exist in other CI environments. While ignore_errors=True makes this safe, adding a comment would improve clarity.

+ # GitHub Actions-specific uv cache cleanup gha_temp = pathlib.Path("/home/runner/work/_temp/setup-uv-cache") if gha_temp.exists(): shutil.rmtree(gha_temp, ignore_errors=True)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

gha_temp = pathlib.Path("/home/runner/work/_temp/setup-uv-cache")

if gha_temp.exists():

shutil.rmtree(gha_temp, ignore_errors=True)

session.log(f"Cleaned GitHub Actions uv temp cache at {gha_temp}")

# GitHub Actions-specific uv cache cleanup

gha_temp = pathlib.Path("/home/runner/work/_temp/setup-uv-cache")

if gha_temp.exists():

shutil.rmtree(gha_temp, ignore_errors=True)

session.log(f"Cleaned GitHub Actions uv temp cache at {gha_temp}")

🤖 Prompt for AI Agents

noxfile.py around lines 73 to 76: the cleanup block removes a hardcoded GitHub Actions temp path which is GH‑Actions specific; add a one-line comment above this block indicating that the path is GitHub Actions specific (e.g., "Clean up GitHub Actions setup-uv-cache temp dir if present") so readers understand why that exact path is used and that it's safe in other environments due to ignore_errors=True.

src/mqt/predictor/reward.py

src/mqt/predictor/rl/cost_model.py

coderabbitai · 2025-12-03T22:11:22Z

src/mqt/predictor/rl/predictorenv.py

+            self.prev_reward = new_val
+            self.prev_reward_kind = new_kind

        self.state._layout = self.layout  # noqa: SLF001


🧹 Nitpick | 🔵 Trivial

Unused noqa directive per static analysis.

The static analysis flags this noqa: SLF001 as unused. If SLF001 is not enabled in your linter configuration, the directive can be removed.

- self.state._layout = self.layout # noqa: SLF001 + self.state._layout = self.layout

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

self.state._layout = self.layout # noqa: SLF001

self.state._layout = self.layout

🧰 Tools

🪛 Ruff (0.14.7)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around line 269, the trailing " # noqa: SLF001" on the assignment self.state._layout = self.layout is flagged as an unused/unsupported noqa directive; remove the " # noqa: SLF001" suffix so the line becomes a plain assignment, or if the linter actually requires suppression, replace it with a supported directive for your project's linter configuration — otherwise delete the directive to satisfy static analysis.

src/mqt/predictor/rl/predictorenv.py

tests/compilation/test_predictor_rl.py

tests/hellinger_distance/test_estimated_hellinger_distance.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

src/mqt/predictor/rl/predictorenv.py (2)
269-269: Remove unused noqa directive.

Static analysis indicates SLF001 is not enabled, making this directive unnecessary.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
256-264: Clarify symmetric reward scaling intent.

Both positive and negative delta branches multiply by reward_scale, while no_effect_penalty is only used for zero delta. The naming suggests asymmetric behavior but implementation is symmetric. Consider renaming for clarity:
-        self.reward_scale = 1
-        self.no_effect_penalty = 0.001
+        self.delta_scale = 1
+        self.no_progress_penalty = 0.001
And update usage accordingly. This makes the symmetric scaling intent explicit.
src/mqt/predictor/rl/cost_model.py (1)
300-301: Remove redundant initialization of t_hat.

Line 300 initializes t_hat = 0.0 but it's immediately overwritten on line 301. This is dead code.
-    t_hat = 0.0
     t_hat = (n_1q * tau1_avg + n_2q * tau2_avg) / k_eff

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c42b41 and 8259b4d.

📒 Files selected for processing (5)

src/mqt/predictor/reward.py (1 hunks)
src/mqt/predictor/rl/cost_model.py (1 hunks)
src/mqt/predictor/rl/predictorenv.py (10 hunks)
tests/compilation/test_predictor_rl.py (5 hunks)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1 hunks)

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2025-11-01T15:57:31.153Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.

Applied to files:

src/mqt/predictor/rl/cost_model.py

📚 Learning: 2025-10-09T13:20:11.483Z

Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.

Applied to files:

src/mqt/predictor/rl/cost_model.py

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (4)

tests/hellinger_distance/test_estimated_hellinger_distance.py (2)

src/mqt/predictor/ml/predictor.py (2)

compile_training_circuits (206-240)

generate_training_data (242-300)

tests/device_selection/test_predictor_ml.py (2)

path_uncompiled_circuits (27-29)

path_compiled_circuits (33-35)

tests/compilation/test_predictor_rl.py (2)

src/mqt/predictor/rl/cost_model.py (2)

canonical_cost (171-183)

get_cost_table (150-168)

src/mqt/predictor/rl/predictorenv.py (1)

calculate_reward (272-356)

src/mqt/predictor/rl/predictorenv.py (5)

src/mqt/predictor/rl/actions.py (4)

CompilationOrigin (92-98)

DeviceDependentAction (136-146)

PassType (101-110)

Action (114-127)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (1)

create_feature_dict (72-94)

src/mqt/predictor/rl/parsing.py (1)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

src/mqt/predictor/reward.py (2)

tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

device (48-50)

tests/compilation/test_reward.py (1)

device (38-40)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (13)

tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

219-226: LGTM! API calls correctly updated to use keyword arguments.

The compile_training_circuits and generate_training_data calls now properly use keyword arguments, aligning with the updated function signatures in src/mqt/predictor/ml/predictor.py. The removal of num_workers is consistent with the new NO_PARALLEL constant handling parallelism internally.

src/mqt/predictor/rl/cost_model.py (4)

41-147: Well-structured canonical cost tables with clear documentation.

The per-device cost tables are well-organized with clear comments explaining the decomposition rationale. The registry pattern with DEVICE_CANONICAL_COSTS provides clean extensibility.

150-168: Safe fallback behavior with appropriate warnings.

The unknown-device fallback to Torino with both a warning and log message ensures visibility while maintaining functionality. The stacklevel=3 is appropriate for the call chain.

186-218: Robust gate cost estimation with conservative fallbacks.

The iteration pattern correctly uses the modern CircuitInstruction access. The conservative fallback for unknown gates (3×1q or 1×2q+4×1q) is reasonable for approximation purposes.

221-247: Clean implementation of approximate fidelity and ESP calculations.

Both functions properly clamp outputs to [0, 1] and handle edge cases. The ESP model's idle-time penalty with liveness modulation is a reasonable approximation approach.

Also applies to: 250-308

src/mqt/predictor/reward.py (1)

206-212: Proper handling of discrete time unit with safe fallback.

The device.dt or 1.0 fallback correctly handles backends without discrete-time calibration, preventing TypeError when device.dt is None. The dt-scaled duration is appropriately used in the exponential decay calculation.

tests/compilation/test_predictor_rl.py (3)

156-167: Good coverage for unknown-device fallback and unknown-gate defaults.

The test properly verifies that unknown devices trigger a UserWarning and fall back to TORINO_CANONICAL_COSTS, and that unknown gates return (0, 0). This aligns with the behavior defined in get_cost_table and canonical_cost.

170-219: Comprehensive test coverage for dual-path reward calculation.

The test covers all three modes (exact, approx, auto) for ESP and verifies that critical_depth always returns exact results. The addition of the auto-exact path test (lines 197-200) addresses the previous review suggestion. The monkeypatch approach for forcing the approx path is appropriate.

108-115: Good addition of mapping validation.

The CheckMap assertion ensures the compiled circuit respects the device's coupling map, complementing the existing native-gates check.

src/mqt/predictor/rl/predictorenv.py (4)

272-356: Well-designed dual-path reward calculation with proper mode handling.

The calculate_reward method cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto mode correctly uses _is_native_and_mapped to determine the appropriate path. The device_id parameter is now correctly passed to both approx_expected_fidelity and approx_estimated_success_probability.

608-730: Thorough device calibration caching with proper error handling.

The _ensure_device_averages_cached method properly aggregates 1q/2q error rates, durations, and coherence times. Key strengths:

Handles dt scaling for durations correctly

Tries flipped orientation for uni-directional couplings (lines 690-691)

Uses specific exceptions (KeyError, AttributeError) per past review feedback

Provides informative error messages when calibration data is missing

732-741: Clean helper for native/mapped circuit detection.

The _is_native_and_mapped method properly combines both checks using GatesInBasis and CheckMap passes, returning a boolean suitable for auto-mode reward selection.

190-270: Solid step-wise reward shaping implementation.

The refactored step() method implements the shaped reward signal as described in the PR objectives:

Pre-action reward evaluation establishes baseline

Post-action delta computation drives intermediate rewards

The prev_kind == "approx" and new_kind == "exact" check (line 253-254) correctly handles regime transitions

Terminal actions use exact metrics for final reward

One note: The logging statements at lines 210 and 241 use f-strings (flagged by Ruff G004), but this is a stylistic preference and doesn't affect functionality.

coderabbitai · 2025-12-03T22:42:42Z

src/mqt/predictor/rl/predictorenv.py

-            passes = action.transpile_pass(
-                self.device.operation_names,
-                CouplingMap(self.device.build_coupling_map()) if self.layout else None,
+        pm_property_set: PropertySet | None = {}


🧹 Nitpick | 🔵 Trivial

Type annotation mismatch for pm_property_set.

The variable is initialized as an empty dict ({}) but the type hint in _handle_qiskit_layout_postprocessing expects dict[str, Any] | None. While this works at runtime, consider initializing as None for consistency with the type, or use explicit typing:

- pm_property_set: PropertySet | None = {} + pm_property_set: dict[str, Any] | None = None

Then ensure downstream code handles the None case (which it already does at line 501).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

pm_property_set: PropertySet | None = {}

pm_property_set: dict[str, Any] | None = None

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around line 458, pm_property_set is annotated as PropertySet | None but initialized to an empty dict; change the initialization to None and/or add an explicit type annotation (e.g., pm_property_set: PropertySet | None = None) so the runtime value matches the declared type, and keep existing downstream None-handling (see line 501).

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

src/mqt/predictor/rl/predictorenv.py (2)
269-269: Remove unused noqa directive.

Static analysis indicates this directive is unnecessary since SLF001 is not enabled in the linter configuration.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
458-458: Type annotation mismatch persists.

The variable is annotated as PropertySet | None but initialized to {}. Since PropertySet is a Qiskit dict-like class, consider initializing to None and updating the downstream logic, or change the annotation to dict[str, Any] | None to match the runtime type.
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
This would require updating line 477/484 to handle the None initialization:
pm_property_set = dict(pm.property_set) if hasattr(pm, "property_set") else None

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b5bf6f and 14aa3ac.

📒 Files selected for processing (2)

src/mqt/predictor/rl/predictorenv.py (10 hunks)
tests/compilation/test_predictor_rl.py (5 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (6)

src/mqt/predictor/reward.py (4)

crit_depth (40-43)

esp_data_available (229-324)

estimated_success_probability (102-226)

expected_fidelity (46-74)

src/mqt/predictor/rl/actions.py (5)

CompilationOrigin (92-98)

DeviceDependentAction (136-146)

PassType (101-110)

get_actions_by_pass_type (462-467)

Action (114-127)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (2)

create_feature_dict (72-94)

get_path_training_circuits (107-109)

src/mqt/predictor/rl/parsing.py (1)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.7)

tests/compilation/test_predictor_rl.py

227-227: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04

🔇 Additional comments (9)

src/mqt/predictor/rl/predictorenv.py (5)

179-188: LGTM!

The renamed variables (reward_scale, no_effect_penalty) are clearer than the previous names. The device-average caching variables are well-structured for the approximate reward path.

272-356: LGTM!

The dual-path reward calculation is well-structured with clear mode selection logic. The device_id is now correctly passed to the approximate reward functions, and the fallback behavior for unsupported reward functions is appropriately logged.

495-526: LGTM!

The refactored method properly accepts the property set as a parameter, improving testability and decoupling from the PassManager. The early return for falsy pm_property_set handles edge cases appropriately.

732-741: LGTM!

Clean helper method that encapsulates the native-and-mapped check logic. The use of Qiskit's analysis passes (GatesInBasis, CheckMap) is appropriate.

253-254: Important regime-switch handling.

Setting delta_reward = 0.0 when transitioning from approximate to exact mode prevents numerical artifacts from comparing metrics computed with different methods. This is a sound design choice.

tests/compilation/test_predictor_rl.py (4)

35-44: LGTM!

The new imports support testing the cost model fallback behavior and the dual-path reward computation. The TYPE_CHECKING pattern for MonkeyPatch is correctly used.

105-115: LGTM!

Adding explicit validation via GatesInBasis and CheckMap strengthens the test by verifying the compiled circuit meets device constraints. The descriptive assertion messages aid debugging.

156-167: LGTM!

Good coverage of cost model edge cases: unknown device fallback and unknown gate default. The test properly verifies both the warning emission and the returned values.

170-235: LGTM!

Comprehensive test coverage for the dual-path reward computation:

ESP exact/approx/auto modes

Auto mode selecting exact for native+mapped circuits (addresses previous review feedback)

RuntimeError on broken Target API

Critical depth always returning exact

The monkeypatch usage for forcing the approximate path is appropriate.

coderabbitai · 2025-12-04T10:02:31Z

src/mqt/predictor/rl/predictorenv.py

 if sys.version_info >= (3, 11) and TYPE_CHECKING:  # pragma: no cover
-    from typing import assert_never
-else:
-    from typing_extensions import assert_never
+    pass


🧹 Nitpick | 🔵 Trivial

Empty conditional block could be clarified.

The pass statement in the sys.version_info >= (3, 11) and TYPE_CHECKING block suggests content was removed. If this block is no longer needed, consider removing it entirely rather than leaving an empty conditional.

-if sys.version_info >= (3, 11) and TYPE_CHECKING: # pragma: no cover - pass

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if sys.version_info >= (3, 11) and TYPE_CHECKING: # pragma: no cover

from typing import assert_never

else:

from typing_extensions import assert_never

pass

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around lines 17 to 18, there is an empty conditional using `if sys.version_info >= (3, 11) and TYPE_CHECKING: pass`; remove the entire conditional block if it is no longer needed, or replace the `pass` with a brief explanatory comment (e.g., "# kept for future type-only imports on Python 3.11+" ) to clarify intent; ensure imports and TYPE_CHECKING usage remain correct after the removal/change and run tests/static type checks.

coderabbitai · 2025-12-04T10:02:31Z

src/mqt/predictor/rl/predictorenv.py

+        if not p1 and not p2:
+            msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward."
+            raise RuntimeError(msg)


🧹 Nitpick | 🔵 Trivial

Consider differentiating empty 1q vs 2q calibration data.

The check if not p1 and not p2 raises a RuntimeError, but a device with only single-qubit gates (no 2q gates) is valid. The error message could be misleading if p2 is empty by design rather than missing calibration.

Consider logging a warning instead of raising if one list is populated:

if not p1 and not p2: msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward." raise RuntimeError(msg) + elif not p1: + logger.warning("No single-qubit calibration data found; _p1_avg will be 0.0") + elif not p2: + logger.warning("No two-qubit calibration data found; _p2_avg will be 0.0")

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around lines 706-708, the current check treats missing 1q or 2q calibration the same and raises a RuntimeError when both p1 and p2 are falsy; change the logic so that you only raise when both p1 and p2 are empty, but if one is populated and the other is empty emit a warning (include which calibration is missing: "missing 1q" or "missing 2q") and continue using the available calibration to compute the approximate reward; ensure the warning uses the module/logger available in this file and updates the message to be unambiguous about whether the absence is expected (e.g., device has no 2q gates) or truly missing.

coderabbitai · 2025-12-04T10:02:31Z

tests/compilation/test_predictor_rl.py

+        RuntimeError,
+        match=re.escape("Device target does not expose the required Target API for approximate reward computation."),
+    ):
+        broken_predictor.env._ensure_device_averages_cached()  # noqa: SLF001


🧹 Nitpick | 🔵 Trivial

Remove unused noqa directive.

Static analysis indicates SLF001 is not enabled, making this directive unnecessary.

- broken_predictor.env._ensure_device_averages_cached() # noqa: SLF001 + broken_predictor.env._ensure_device_averages_cached()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

broken_predictor.env._ensure_device_averages_cached() # noqa: SLF001

broken_predictor.env._ensure_device_averages_cached()

🧰 Tools

🪛 Ruff (0.14.7)

227-227: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents

In tests/compilation/test_predictor_rl.py around line 227, the call to broken_predictor.env._ensure_device_averages_cached() includes an unnecessary "# noqa: SLF001" directive; remove the trailing noqa directive from that line so the code reads simply broken_predictor.env._ensure_device_averages_cached() and commit the change.

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (2)

src/mqt/predictor/rl/predictorenv.py (2)
269-269: Unused noqa directive.

Static analysis confirms SLF001 is not enabled. Remove the directive.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
458-458: Type annotation mismatch.

The variable is annotated as PropertySet | None but initialized to an empty dict. Either initialize as None or change the annotation to dict[str, Any]:
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
Then adjust the stochastic branch return type expectation accordingly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14aa3ac and 04fde70.

📒 Files selected for processing (1)

src/mqt/predictor/rl/predictorenv.py (10 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (4)

src/mqt/predictor/reward.py (1)

crit_depth (40-43)

src/mqt/predictor/rl/helper.py (1)

get_path_training_circuits (107-109)

src/mqt/predictor/rl/parsing.py (2)

final_layout_bqskit_to_qiskit (186-226)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04

🔇 Additional comments (3)

src/mqt/predictor/rl/predictorenv.py (3)

45-75: LGTM on import organization.

The imports are well-organized, with appropriate separation of type-checking imports and runtime imports. The new imports for approx_expected_fidelity, approx_estimated_success_probability, and calc_supermarq_features align with the dual-path reward computation feature.

272-356: Well-structured dual-path reward computation.

The calculate_reward method cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto-mode logic for selecting the appropriate path based on circuit state is sound. Previous feedback about missing device_id has been addressed.

495-526: LGTM on layout postprocessing refactor.

The method now accepts pm_property_set directly as a dict, which is cleaner than passing a PassManager. The early return on line 501-502 properly handles the None/empty case.

src/mqt/predictor/rl/predictorenv.py

coderabbitai · 2025-12-04T10:25:15Z

src/mqt/predictor/rl/predictorenv.py

+    def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool:
+        check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names)
+        check_nat_gates(qc)
+        only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"]
+
+        check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+        check_mapping(qc)
+        mapped = check_mapping.property_set["is_swap_mapped"]
+
+        return bool(only_nat_gates and mapped)


🧹 Nitpick | 🔵 Trivial

Consider caching pass instances for performance.

_is_native_and_mapped creates new GatesInBasis and CheckMap instances on every call. During RL training, this method is invoked frequently via calculate_reward. Consider caching the pass instances as instance variables:

+ def __init__(self, ...): + ... + # Cached analysis passes for _is_native_and_mapped + self._gates_in_basis_check: GatesInBasis | None = None + self._check_map: CheckMap | None = None def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool: - check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names) + if self._gates_in_basis_check is None: + self._gates_in_basis_check = GatesInBasis(basis_gates=self.device.operation_names) + check_nat_gates = self._gates_in_basis_check check_nat_gates(qc) only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"] - check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map())) + if self._check_map is None: + self._check_map = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map())) + check_mapping = self._check_map check_mapping(qc) mapped = check_mapping.property_set["is_swap_mapped"] return bool(only_nat_gates and mapped)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around lines 732-741, the method _is_native_and_mapped currently instantiates GatesInBasis and CheckMap on every call which is costly during RL training; instead, create and store these pass instances as persistent attributes (e.g., self._gates_in_basis and self._check_map) during the environment initialization (or the first use), reuse them in _is_native_and_mapped, and only rebuild/update them if the device or its coupling map changes (add a simple invalidation/update path). This removes repeated allocations and speeds up frequent calls.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (5)

src/mqt/predictor/rl/predictorenv.py (5)
17-18: Empty conditional block should be removed.

This empty block with just pass serves no purpose. If it was a placeholder for future type-only imports on Python 3.11+, add a comment explaining the intent; otherwise remove the block entirely.
-if sys.version_info >= (3, 11) and TYPE_CHECKING:  # pragma: no cover
-    pass
274-274: Remove unused noqa directive.

Static analysis flags this noqa: SLF001 as unused. The directive can be removed.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
463-463: Type annotation mismatch persists.

The variable is annotated as PropertySet | None but initialized as {} (empty dict). Later it's assigned dict(pm.property_set). For type consistency, consider:
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
Then ensure the downstream code handles None appropriately (which it already does at line 506).

711-713: Consider logging a warning for partial calibration data.

The check raises RuntimeError when both p1 and p2 are empty, which is correct. However, if only one list is populated (e.g., a device with only single-qubit gates), a warning would be helpful for debugging.
         if not p1 and not p2:
             msg = "No valid 1q/2q calibration data found in Target, cannot compute approximate reward."
             raise RuntimeError(msg)
+        if not p1:
+            logger.warning("No single-qubit calibration data found; _p1_avg will be 0.0")
+        elif not p2:
+            logger.warning("No two-qubit calibration data found; _p2_avg will be 0.0")
737-746: Consider caching analysis pass instances for performance.

_is_native_and_mapped creates new GatesInBasis and CheckMap instances on every call. During RL training, this method is invoked frequently via calculate_reward. Caching these passes would improve performance.

Initialize cached passes in __init__:
# In __init__:
self._gates_in_basis_check: GatesInBasis | None = None
self._check_map: CheckMap | None = None
Then lazily initialize in _is_native_and_mapped:
 def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool:
-    check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names)
+    if self._gates_in_basis_check is None:
+        self._gates_in_basis_check = GatesInBasis(basis_gates=self.device.operation_names)
+    check_nat_gates = self._gates_in_basis_check
     check_nat_gates(qc)
     only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"]

-    check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+    if self._check_map is None:
+        self._check_map = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+    check_mapping = self._check_map
     check_mapping(qc)
     mapped = check_mapping.property_set["is_swap_mapped"]

     return bool(only_nat_gates and mapped)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04fde70 and 6a104b8.

📒 Files selected for processing (1)

src/mqt/predictor/rl/predictorenv.py (11 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (5)

src/mqt/predictor/reward.py (3)

crit_depth (40-43)

estimated_success_probability (102-226)

expected_fidelity (46-74)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (2)

create_feature_dict (72-94)

get_path_training_circuits (107-109)

src/mqt/predictor/rl/parsing.py (3)

final_layout_bqskit_to_qiskit (186-226)

final_layout_pytket_to_qiskit (167-183)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

214-214: Logging statement uses f-string

(G004)

245-245: Logging statement uses f-string

(G004)

274-274: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14

🔇 Additional comments (6)

src/mqt/predictor/rl/predictorenv.py (6)

88-89: LGTM: Penalty sign corrected and parameters made configurable.

The no_effect_penalty is now correctly initialized as a negative value (-0.001), and both reward_scale and no_effect_penalty are exposed as constructor parameters for tuning during RL experimentation. This addresses previous review feedback.

257-259: Good handling of regime transition.

The comment clearly explains why delta_reward = 0.0 when transitioning from approximate to exact metrics—the two aren't directly comparable. This is the correct approach to avoid misleading reward signals.

277-319: Well-structured dual-path reward computation.

The calculate_reward method cleanly implements the exact vs. approximate reward paths with a clear mode selection mechanism. The fallback for unsupported reward functions with a warning is good defensive coding.

500-531: LGTM: Layout postprocessing signature updated correctly.

The method signature now accepts dict[str, Any] | None which aligns with how it's called (passing dict(pm.property_set)). The early return for falsy pm_property_set is a good guard.

229-231: Good defensive normalization for high-level gates.

Decomposing unitary and clifford gates before reward computation ensures the reward functions can properly analyze the circuit. This prevents errors from gates that don't have calibration data.

130-133: Verify whether this action registration order change breaks pre-trained model compatibility.

Registering OPT actions before LAYOUT actions changes the indices in action_set. While web search suggests PassType operates as a flexible tagging system (with the RL agent dynamically determining execution order), the action indices themselves may still be significant if:

Pre-trained models encode action indices as state features

Saved experience or replay buffers reference specific action indices

Model weights learned associations with particular action index ranges

Confirm whether existing trained agents or datasets depend on action registration indices, or if this reordering is safe.

coderabbitai · 2025-12-04T11:29:56Z

src/mqt/predictor/rl/predictorenv.py

+                    dur = getattr(props, "duration", None)
+                    if dur is not None:
+                        dur_s = float(dur)
+                        t1.append(dur_s)
+
+            elif arity == 2:
+                # Collect two-qubit gate error/duration over all supported edges
+                for i, j in twoq_edges:
+                    props = _get_props(name, (i, j))
+                    if props is None:
+                        # Try flipped orientation for uni-directional couplings
+                        props = _get_props(name, (j, i))
+                    if props is None:
+                        continue
+                    err = getattr(props, "error", None)
+                    if err is not None:
+                        p2.append(float(err))
+                    dur = getattr(props, "duration", None)
+                    if dur is not None:
+                        dur_s = float(dur if dt is None else dur * dt)
+                        t2.append(dur_s)


⚠️ Potential issue | 🟠 Major

Inconsistent dt handling between 1q and 2q gate durations.

Single-qubit gate durations (line 687) correctly use dur_s = float(dur) without dt multiplication. However, two-qubit gate durations (line 704) still multiply by dt:

dur_s = float(dur if dt is None else dur * dt) # Line 704

According to Qiskit documentation, InstructionProperties.duration is already in seconds—not in dt units. This inconsistency will cause 2q durations to be incorrectly scaled when dt is available.

Apply this fix to line 704:

dur = getattr(props, "duration", None) if dur is not None: - dur_s = float(dur if dt is None else dur * dt) + dur_s = float(dur) t2.append(dur_s)

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around lines 685 to 705, the two-qubit duration computation incorrectly multiplies InstructionProperties.duration by dt (line ~704), causing inconsistent scaling versus single-qubit durations; change the two-qubit duration assignment to match the single-qubit handling by setting dur_s = float(dur) (i.e., remove the dt multiplication/conditional) so both 1q and 2q durations use the duration value as seconds.

Shaobo Zhou and others added 30 commits March 29, 2025 19:20

Update predictor(adding callbacks)

129b60f

Update

08889bd

Restore helper.py and predictor.py to match upstream

e2ff3fe

Merge remote-tracking branch 'upstream/main'

1c32d15

Implement new mapping actions

78dc1aa

Update action space and feature space Update actions Update action space

Fix: resolve pre-commit issues and add missing annotations

a3ba836

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Remove example_test.py Remove example_test.py

Fix: resolve pre-commit issues and add missing annotations

5935e6f

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations

Fix: resolve pre-commit issues and add missing annotations

f71fb29

Fix: resolve pre-commit issues and add missing annotations

3c7592b

Fix mypy errors

6db5c27

Fix mypy errors

47841c5

Fix dependencies issues

b1ac8ce

Fix dependency issues

5f8473c

Add missing zip file

7491ec0

Fix issue with Python 3.13

3346842

Merge branch 'main' into hybrid-mapping

6f7a73c

Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

Remove Python 3.13 from noxfile.py due to compatibility issue

6c67349

Skip minimums session on Windows due to CI slowness

2692b96

Fix bugs

f4874e6

Fix bugs Fix bugs Fix bugs

Fix bugs

54eec91

Use default Qiskit settings for VF2Layout and add assertion for nativ…

845f7de

…e gate check

Debug

3418936

Fix missing argument

ae870cc

Fix warning issues

861bc62

Fix window runtime warning problem

fa989b6

Fix window runtime warning problem

405bd39

Add time limit for VF2PostLayout

7b2f321

Fix windows runtime warning problem

b67d0a6

Fix windows runtime warning problem Fix windows runtime warning issue

Add new actions

bf7c9ee

Add new actions

6d2733f

Shaobo-Zhou added 3 commits November 27, 2025 15:45

Add fallback for unsupported reward function

cee79cd

Minor Fixes

756d777

Minor Fixes

d60d17e

Shaobo-Zhou force-pushed the new_RL branch from dd212d4 to d60d17e Compare November 27, 2025 16:03

Shaobo-Zhou and others added 3 commits November 27, 2025 17:03

Merge branch 'main' into new_RL

a05e37a

Minor Fixes

85d4d57

Fix predictorenv.py

a1370de

Shaobo-Zhou and others added 6 commits December 3, 2025 16:08

Update changelog and improve test coverage

e456f43

Update cost model

de7c498

Merge branch 'main' into new_RL

9a68c59

Update cost model

d0a07a2

Fix CI issue

18cf31f

Improve coverage

0c42b41

Shaobo-Zhou marked this pull request as ready for review December 3, 2025 22:07

coderabbitai bot requested changes Dec 3, 2025

View reviewed changes

Shaobo-Zhou and others added 4 commits December 3, 2025 23:18

Update src/mqt/predictor/reward.py

a451710

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

Update src/mqt/predictor/rl/cost_model.py

ac7dc28

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

Update tests/hellinger_distance/test_estimated_hellinger_distance.py

67dc402

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

Code improvements suggested by CodeRabbit

8259b4d

coderabbitai bot requested changes Dec 3, 2025

View reviewed changes

Shaobo-Zhou added 2 commits December 4, 2025 00:58

Fix warnings

4b5bf6f

Improve coverage

14aa3ac

coderabbitai bot requested changes Dec 4, 2025

View reviewed changes

Improve coverage

04fde70

coderabbitai bot requested changes Dec 4, 2025

View reviewed changes

Shaobo-Zhou and others added 2 commits December 4, 2025 12:02

Update src/mqt/predictor/rl/predictorenv.py

1df8071

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <109073755+Shaobo-Zhou@users.noreply.github.com>

Fixes

6a104b8

coderabbitai bot requested changes Dec 4, 2025

View reviewed changes

-    if version != "3.13":  # keep cache for last run
+    # Keep cache for the last Python version in the test matrix
+    last_version = PYTHON_ALL_VERSIONS[-1]
+    if version != last_version:  # keep cache for last run
+        venv_dir = session.virtualenv.location

	self.state._layout = self.layout # noqa: SLF001
	self.state._layout = self.layout

	pm_property_set: PropertySet \| None = {}
	pm_property_set: dict[str, Any] \| None = None

	broken_predictor.env._ensure_device_averages_cached() # noqa: SLF001
	broken_predictor.env._ensure_device_averages_cached()

Uh oh!

Add stepwise intermediate reward for RL #526

Are you sure you want to change the base?

Add stepwise intermediate reward for RL #526

Uh oh!

Conversation

Shaobo-Zhou commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist:

Uh oh!

codecov bot commented Nov 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 3, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Shaobo-Zhou commented Nov 26, 2025 •

edited

Loading

codecov bot commented Nov 27, 2025 •

edited

Loading

coderabbitai bot commented Dec 3, 2025 •

edited

Loading