Invalid orbital range bug #865

Open
calvinp0 wants to merge 4 commits into main from mono_atom_orca

@calvinp0
Member

@calvinp0 calvinp0 commented Apr 10, 2026

Bugs:

Guard 1 (Prevention, scheduler.py:1440): Case bug. The guard checks 'DLPNO' in level.method, but Level.__init__ normalizes the method to lowercase, so the check can never be true. It has been dead code since the day it was written and never fired once.
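
A minimal sketch of the case mismatch, assuming a simplified stand-in for Level (the real ARC class has more fields and logic; only the lowercasing behavior described above is modeled here):

```python
class Level:
    """Hypothetical stand-in for ARC's Level: lowercases the method on init."""

    def __init__(self, method: str):
        self.method = method.lower()


level = Level('DLPNO-CCSD(T)')

# The original guard compared against an uppercase literal,
# which can never appear in a lowercased string.
buggy_guard = 'DLPNO' in level.method

# Comparing against the lowercase literal matches the normalized method.
fixed_guard = 'dlpno' in level.method
```

With this stand-in, buggy_guard is False for every possible input, while fixed_guard is True for any DLPNO method.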

Guard 2 (Troubleshooting, trsh.py:1070): Structurally unreachable. This one used lowercase 'dlpno' correctly, but it was an elif after the Memory branch. The error flow made it impossible to reach:

  1. ORCA crashes with INVALID ORBITAL RANGE in err.txt
  2. determine_ess_status reads the log file, finds "ORCA finished by error termination in MDCI", scans for "Please increase MaxCore" or "parallel calculation exceeds number of pairs" — finds neither
  3. Falls through the for-else to: "MDCI error in Orca. Assuming memory allocation error." → keywords = ['MDCI', 'Memory']
  4. trsh_ess_job sees 'Memory' in keywords → enters Memory branch → increases memory → resubmits
  5. Same crash → step 2 → infinite loop
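
The fall-through in steps 2-3 can be sketched with Python's for-else. The function name, the marker tuple, and the 'specific' placeholder keyword are illustrative assumptions, not ARC's actual determine_ess_status code:

```python
# Messages the log is scanned for, as quoted in the steps above.
MARKERS = (
    'Please increase MaxCore',
    'parallel calculation exceeds number of pairs',
)


def classify_mdci_error(log_lines):
    """Return keywords for an MDCI termination (simplified, hypothetical)."""
    for line in log_lines:
        if any(marker in line for marker in MARKERS):
            keywords = ['MDCI', 'specific']  # placeholder for the real keywords
            break
    else:
        # Runs only when the loop finished without a break:
        # no known marker was found, so memory is assumed.
        keywords = ['MDCI', 'Memory']
    return keywords


# The log carries only the generic MDCI termination line;
# INVALID ORBITAL RANGE lives in err.txt and is never inspected here.
log = ['ORCA finished by error termination in MDCI']
keywords = classify_mdci_error(log)
```

Because the real cause is only in err.txt, the log-based scan always lands in the else clause for this failure.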

The DLPNO check at step 4 was behind an elif, so it could never fire when Memory was in the keywords. The two bugs compound: the first guard should have prevented the problem, and the second should have caught it but could not because of the control flow.
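
A hedged sketch of why the ordering matters. Both functions and their return strings are hypothetical simplifications of the ORCA branch in trsh_ess_job; they only illustrate the control flow, not the final code in this PR:

```python
def trsh_before(keywords, method, is_monoatomic):
    """Old ordering: the DLPNO check hides behind the Memory branch."""
    if 'Memory' in keywords:
        return 'increase memory'
    elif 'dlpno' in method and is_monoatomic:
        return 'drop dlpno'  # unreachable whenever 'Memory' is a keyword
    return 'could not troubleshoot'


def trsh_after(keywords, method, is_monoatomic):
    """New ordering: the DLPNO + monoatomic check runs first."""
    if 'dlpno' in method and is_monoatomic:
        return 'drop dlpno'
    if 'Memory' in keywords:
        return 'increase memory'
    return 'could not troubleshoot'


keywords = ['MDCI', 'Memory']
old_action = trsh_before(keywords, 'dlpno-ccsd(t)', is_monoatomic=True)
new_action = trsh_after(keywords, 'dlpno-ccsd(t)', is_monoatomic=True)
```

For polyatomic species both orderings still take the memory branch, so the reorder only changes behavior for the monoatomic case.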


Following up on why ARC kept troubleshooting ad infinitum:

  1. ORCA fails → determine_ess_status sees "ORCA finished by error termination in MDCI", doesn't find "Please increase MaxCore" or "parallel calculation exceeds number of pairs" in the log → falls through to else: keywords = ['MDCI', 'Memory']
  2. trsh_ess_job enters Orca Memory branch → 'memory' not in ess_trsh_methods → appends 'memory' → calculates new memory via estimate_orca_mem_cpu_requirement(num_heavy_atoms=0) → couldnt_trsh stays False
  3. Scheduler resubmits with new memory → same ORCA crash
  4. trsh_ess_job enters Orca Memory branch again → 'memory' already in list (not re-added) → calculates same memory estimate → couldnt_trsh stays False
  5. Repeat steps 3-4 forever
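
The steps above can be sketched as a loop whose state never changes. The memory formula and function names are made up for illustration (ARC's estimate_orca_mem_cpu_requirement is not reproduced here); the point is only that identical inputs give identical outputs on every round:

```python
def estimate_memory(num_heavy_atoms: int) -> float:
    """Hypothetical deterministic estimate: same input, same output."""
    return 2000.0 + 500.0 * num_heavy_atoms


def trsh_memory(ess_trsh_methods: list) -> tuple:
    """Simplified Memory branch: never reports that it ran out of options."""
    if 'memory' not in ess_trsh_methods:
        ess_trsh_methods.append('memory')
    memory = estimate_memory(num_heavy_atoms=0)  # a single atom has no heavy-atom count to grow
    couldnt_trsh = False
    return memory, couldnt_trsh


methods = []
history = []
for _ in range(5):  # capped only so this demo terminates
    memory, couldnt_trsh = trsh_memory(methods)
    history.append((memory, couldnt_trsh, tuple(methods)))
```

Every entry in history is identical, so nothing in the troubleshooting state ever signals the scheduler to stop.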

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a DLPNO + monoatomic edge case that could trigger ORCA “INVALID ORBITAL RANGE” failures and an infinite memory-troubleshooting loop.

Changes:

  • Normalize the DLPNO monoatomic guard in the scheduler to actually trigger (case/normalization fix) and downgrade to the canonical (non-DLPNO) method.
  • Pass monoatomic context into ESS troubleshooting.
  • Reorder ORCA troubleshooting to detect DLPNO + monoatomic/H before memory-based retries.
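
As a rough illustration of the downgrade to the canonical method, here is a sketch that strips a leading 'dlpno-' prefix (the approach the commit message for this change describes). The helper name is hypothetical and ARC's actual handling of the Level object is not shown:

```python
def canonical_method(method: str) -> str:
    """Return the non-DLPNO method name by dropping a leading 'dlpno-'."""
    method = method.lower()
    prefix = 'dlpno-'
    if method.startswith(prefix):
        return method[len(prefix):]
    return method


downgraded = canonical_method('DLPNO-CCSD(T)')
unchanged = canonical_method('ccsd(t)')
```

Methods without the prefix pass through untouched, so the helper is safe to call unconditionally.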

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| arc/scheduler.py | Detects DLPNO on monoatomic species and rewrites the level of theory; forwards monoatomic flag into troubleshooting. |
| arc/job/trsh.py | Adds is_monoatomic parameter and prioritizes DLPNO+monoatomic/H handling before memory retries for ORCA. |

@alongd
Member

alongd commented Apr 10, 2026

We must add a trsh counter (or do we already have one?), so we don't do anything infinitely

Member

@alongd alongd left a comment

Thanks!! Added some comments


elif 'orca' in software:
if 'Memory' in job_status['keywords']:
if 'dlpno' in level_of_theory.method and (is_monoatomic or is_h):
Member

This shouldn't happen; if it does, it means Scheduler is buggy. I almost think we could raise an error here (so devs know).

Member Author

Fair, changed it

@calvinp0
Member Author

> We must add a trsh counter (or do we already have one?), so we don't do anything infinitely

It depends on what we are troubleshooting. For TS guesses, for example, we eventually try everything and then declare all methods attempted.

@calvinp0
Member Author

> We must add a trsh counter (or do we already have one?), so we don't do anything infinitely

I have now added a counter, with a default of 10 in settings.py.
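
A minimal sketch of a global cap like the one described here. The constant name, the helper, and the callable interface are assumptions for illustration; they do not reflect the actual names in settings.py or the scheduler:

```python
MAX_ESS_TRSH_ATTEMPTS = 10  # hypothetical setting name; 10 is the default mentioned above


def troubleshoot_until_done(run_job, max_attempts=MAX_ESS_TRSH_ATTEMPTS):
    """Rerun a failing job, giving up after max_attempts troubleshooting rounds.

    run_job is a hypothetical callable that returns True on success.
    Returns (succeeded, number_of_troubleshooting_rounds).
    """
    attempts = 0
    while not run_job():
        if attempts >= max_attempts:
            return False, attempts  # cap reached: stop instead of looping forever
        attempts += 1  # one troubleshooting round, then resubmit
    return True, attempts


# A job that never succeeds is abandoned once the cap is reached.
result = troubleshoot_until_done(lambda: False)
```

A single global cap is blunt, but it bounds every troubleshooting path, including ones whose own guards are missing or buggy.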

@calvinp0 calvinp0 requested a review from alongd April 10, 2026 16:24
@alongd
Member

alongd commented Apr 10, 2026

I think the trsh counter could be per trsh method (so as not to try the same one too many times). Maybe we can add a trsh_counter dict to Species?
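
One way the per-method idea could look, as a sketch only. The class, the limits mapping, and the default limit are hypothetical; ARC's Species class is not modeled beyond holding a trsh_counter:

```python
from collections import Counter


class SpeciesTrsh:
    """Hypothetical holder for per-method troubleshooting counts."""

    def __init__(self, limits=None, default_limit=3):
        self.trsh_counter = Counter()   # method name -> attempts so far
        self.limits = limits or {}      # optional per-method overrides
        self.default_limit = default_limit

    def may_try(self, method: str) -> bool:
        """Record an attempt and return True, or return False if the limit is hit."""
        limit = self.limits.get(method, self.default_limit)
        if self.trsh_counter[method] >= limit:
            return False
        self.trsh_counter[method] += 1
        return True


spc = SpeciesTrsh(limits={'memory': 2})
outcomes = [spc.may_try('memory') for _ in range(3)]
```

The trade-off raised later in this thread applies: each method needs a sensible limit, and exposing all of them as user settings adds configuration surface.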

@codecov

codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.20%. Comparing base (61a711a) to head (7ff2e77).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #865      +/-   ##
==========================================
+ Coverage   60.10%   60.20%   +0.10%     
==========================================
  Files         102      102              
  Lines       31041    31052      +11     
  Branches     8082     8084       +2     
==========================================
+ Hits        18657    18696      +39     
+ Misses      10071    10033      -38     
- Partials     2313     2323      +10     
| Flag | Coverage Δ |
| --- | --- |
| functionaltests | 60.20% <ø> (+0.10%) ⬆️ |
| unittests | 60.20% <ø> (+0.10%) ⬆️ |

@calvinp0
Member Author

> I think the trsh counter could be per trsh method (not to try the same one too many times), Maybe we can add a trsh_counter dict to Species?

I guess so. The thing is, most of our trsh methods are safe: they have limits. For example, rotor scans are capped at 4 in the settings. TS guesses have a flag I implemented a year or more ago that tries all relevant methods and then sets 'all_attempted' to indicate exhaustion. Orca memory has a guard that checks whether 'memory' is already in ess_trsh_methods, and Molpro has an elif chain of guards.

The issue now is Gaussian memory: it has no "'memory' not in ess_trsh_methods" guard, so it can keep doubling until it hits 95% of node memory. I think that was for ATLAS? Then there is general ESS trsh, which has no global max attempt counter, so it relies entirely on the ess_trsh_methods list eventually matching attempted_ess_trsh_methods.

So I am not sure a per-method ESS trsh counter is really relevant here.

@calvinp0
Member Author

Ok

> I think the trsh counter could be per trsh method (not to try the same one too many times), Maybe we can add a trsh_counter dict to Species?

Investigating further, I am hesitant about a per-method trsh counter, because it means we would need to set a limit for a fair few methods and also allow the user to change each one (which could be overwhelming). Also, I misspoke about ORCA memory. It doesn't have a guard, because we sometimes have to troubleshoot the memory multiple times: ORCA can keep asking for more memory, we allocate it, and then it asks for even more.

@calvinp0 calvinp0 force-pushed the mono_atom_orca branch 4 times, most recently from 557b52d to 46b213f April 12, 2026 11:31
DLPNO methods are incompatible with monoatomic species. This change generalizes the previous hydrogen-specific check to all monoatomic species and automatically falls back to the canonical method by stripping the 'dlpno-' prefix.
Added max ess trsh counter attempts
Generalize the H-atom-specific check to all monoatomic species when using DLPNO methods in Orca, as these methods are incompatible with single-atom systems that lack electron pairs to correlate.

Added tests for trsh regarding monoatomic species
Added a counter for how many times ARC will troubleshoot an ESS job. This is set in settings.py; the default is 25 times.
Correctly import and use the logging module to set the Paramiko log level in the SSHClient class, replacing an undefined logger reference.