Skip to content

fix: 8 Round 11 audit findings (3 HIGH + 5 MED)#93

Merged
nelsonduarte merged 10 commits into
mainfrom
fix/audit-r11
Jun 17, 2026
Merged

fix: 8 Round 11 audit findings (3 HIGH + 5 MED)#93
nelsonduarte merged 10 commits into
mainfrom
fix/audit-r11

Conversation

@nelsonduarte

Copy link
Copy Markdown
Owner

Summary

Closes 8 bugs from audit Round 11, focused on dataloss prevention and UX consistency. Discovered and fixed a critical regression masking note deletions during adversarial review.

HIGH

  • 🟠 split.py atomic write — Loop wrote directly via with open(..., "wb") as f: w.write(f). If out_dir == dirname(input) and a split name collided with the input, the input was truncated before PdfWriter finished its lazy reads. Now uses _atomic_pdf_write(w, out, sources=[pdf_path]) with same-source rejection.
  • 🟠 window.py folder drop dedup — On Windows, glob("*.pdf") + glob("*.PDF") returned identical entries (case-insensitive FS), opening each PDF twice. Replaced with single os.listdir + case-insensitive filter.
  • 🟠 editor _user_pending — Existing annotations loaded into _pending with _existing=True triggered closeEvent prompts and _pending_list UI rows even when the user hadn't edited anything. New _user_pending property filters them out for those checks, with one critical exception: delete_annot edits stay even when _existing=True because they represent the user's explicit intent to remove an existing note (review caught a regression that would have silently dropped these deletions).

MEDIUM

  • 🟡 parse_pages friendly errorsValueError("invalid literal for int()") was surfaced raw via show_error as "Algo inesperado correu mal". Now catches and re-raises with translated tool.err.bad_page_input (8 languages).
  • 🟡 OCR streaming writerpdf_pages = [] ; ... ; for each page accumulate bytes ; final write was replaced with per-page writer.append(...). Reduces peak memory for long PDFs by avoiding the simultaneous-buffer pile-up (exact savings depend on pypdf's internal copy semantics).
  • 🟡 _clear_recent UI refresh — "Clear recents" updated config but viewers continued showing the old list until reload. Now iterates self._viewers calling _refresh_recents.
  • 🟡 get_recent_files lexists + writebackos.path.isfile hydrated OneDrive Files-On-Demand placeholders on every startup, freezing UI. Switched to os.path.lexists. Also persists the cleaned list (_update_config) so disappeared entries don't re-trigger the disk check on next launch. UNC SMB caveat documented in comment.
  • 🟡 _apply_forms file handlePdfReader(self._doc_path) left an open file handle until GC, blocking subsequent operations on Windows. Wrapped in with open(..., "rb") as _src so the handle is released deterministically.

i18n

1 new key x 8 languages: tool.err.bad_page_input. Parity verified by test.

Tests

tests/test_audit_r11.py — 24 tests (22 from initial implementation + 2 added during review for the critical regression). Mix of source-level and behavioral coverage.

Adversarial review notes

The review caught a critical regression in commit 229b4e3 (editor _user_pending): the filter would have silently dropped delete_annot edits for existing notes, defeating both PR-D and PR-F's annotation deletion persistence fixes. Fixed in commit c08c1dc with two new regression tests covering the bug surface. Also addressed minor cosmetic notes (unverified OCR memory claim, UNC SMB caveat).

Validation

  • 10 atomic commits
  • 24 new tests; all passing
  • Compatible with all merged PRs (A through F)
  • No APP_VERSION bump
  • i18n parity verified

nelsonduarte and others added 10 commits June 17, 2026 20:03
Replace the direct ``with open(...) as f: w.write(f)`` loop with
``self._atomic_pdf_write(w, out_path, sources=[pdf_path])`` so the
existing tempfile + os.replace path is reused and the same-source
guard rejects an output path that collides with the input PDF.
Without this, a split row whose output name matched the basename
of the source (e.g. input ``part_1.pdf`` split into ``part_1.pdf``)
would truncate the input mid-flight while pypdf still held lazy
reads against it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The drop handler globbed both ``*.pdf`` and ``*.PDF``; on Windows
and macOS HFS+ those patterns match the same files, so every PDF
was opened twice. Switch to a single ``os.listdir`` pass with an
extension filter so the walk yields each file once regardless of
filesystem case sensitivity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Loading a PDF with sticky notes mirrors each one into
``self._pending`` (tagged ``_existing=True``) so the canvas can
render their bubbles. Without filtering, every gate that watches
``_pending`` mistook those mirrored entries for real user edits:

  * ``closeEvent`` (in window.py) prompted "discard unsaved
    changes?" the moment the user opened any PDF with notes
  * ``_run`` skipped the no-pending guard and re-saved a byte-
    identical copy
  * The Forms-mode "has pending edits" warning fired on a
    freshly-opened PDF
  * The ``_pending_list`` UI listed pre-existing notes as if they
    were unsaved edits

Add a ``_user_pending`` property that filters ``_existing`` entries
(while keeping user-driven ``delete_annot`` actions, which have no
flag) and route every gate through it. Also stop pushing existing
notes into ``_pending_list`` — that widget is for THIS session's
unapplied edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Garbage input like ``abc`` or ``1-`` previously bubbled up the raw
``ValueError("invalid literal for int() with base 10: …")``, which
``show_error`` then masked behind "Something unexpected went wrong"
— hiding the actionable cause from the user.

Wrap the two ``int()`` calls in a ``try``/``except ValueError`` and
re-raise with the new translated ``tool.err.bad_page_input`` key
(localised across all 8 supported languages). The happy path,
out-of-range message, and dedupe/sort contract are unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous implementation collected every per-page Tesseract PDF
output in ``pdf_pages = []`` and then walked the list to append
each into a writer. For a 100-page job at 300 dpi this kept ~200–
500 MB of intermediate bytes resident before the final write —
enough to OOM on machines tight on RAM.

Append each page directly into the writer as soon as Tesseract
returns it, then drop ``pix``, ``img``, and the per-page
``page_bytes`` locals. The writer's internal copies retain only
what it actually needs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``_clear_recent`` wrote ``recent_files: []`` to disk but did not
notify any open viewer, so the placeholder pane kept showing the
stale entries until the next reload. Mirror the
``_load_and_track`` pattern: iterate ``self._viewers`` and invoke
``_refresh_recents`` on each (wrapped in ``contextlib.suppress``
so a deleted viewer doesn't abort the rest).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``os.path.isfile`` on a OneDrive Files-On-Demand placeholder
triggers a placeholder rehydration, blocking the UI for seconds
the first time the app opens. The recents validation runs at
startup, so the freeze is the first thing the user sees.

Switch the per-entry check to ``os.path.lexists``, which does not
follow symlinks or force a download. Broken-symlink entries are
acceptable here — they get pruned the moment the user actually
tries to open them. When entries are dropped during the scan,
write the trimmed list back through ``_update_config`` so the
next startup has less to re-check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``PdfReader(self._doc_path)`` held an internal stream pinned by
the reader object until garbage collection. On Windows that
blocked any subsequent tool from renaming the same file (file
lock still active), surfacing as cryptic "file in use" errors
when chaining Forms apply with another tool.

Wrap the source open in an explicit ``with open(...) as _src``
block and pass the stream to ``PdfReader``. All writer work
(``append``, ``update_page_form_field_values``, ``write``,
``os.replace``) happens INSIDE the with-block so pypdf's lazy
reads still see a live stream when they fire. On scope exit the
file handle is released deterministically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
22 tests covering the 8 PR-G bugs:
  * split.py atomic write + same-source rejection
  * Drop-folder dedup on case-insensitive filesystems
  * Editor _user_pending filter (closeEvent, _run, Forms warning,
    _pending_list UI)
  * parse_pages friendly error + happy-path regression guard
  * OCR streaming writer (no more pdf_pages = [] accumulation)
  * _clear_recent open-viewer refresh
  * get_recent_files lexists + writeback (behavioral, monkeypatched
    config path)
  * _apply_forms with-open + writer-inside-block

Plus an i18n parity assertion that the new ``tool.err.bad_page_input``
key exists with a ``{text}`` placeholder in all 8 supported languages.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round 11 review found that the _user_pending filter excluded all
edits with _existing=True. But delete_annot edits for existing
notes are also registered with _existing=True (PR-D fix #3 path),
so they were silently dropped from save processing and closeEvent
prompts -- defeating both PR-D and PR-F's note deletion persistence
fixes.

Now the filter excludes _existing edits EXCEPT for delete_annot,
which represents the user's explicit intent to remove. Added two
regression tests covering the bug surface.

Also addresses minor review nits: removed unverified 200-500MB memory
claim from OCR comment; clarified UNC SMB caveat for lexists in
get_recent_files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@nelsonduarte nelsonduarte added bug Something isn't working enhancement New feature or request labels Jun 17, 2026
Comment thread app/editor/tab.py
os.replace(tmp, out)
except Exception:
try: os.unlink(tmp)
except OSError: pass
Comment thread app/i18n.py
if len(valid) != len([p for p in recents if isinstance(p, str)]):
try:
_update_config(lambda c: c.__setitem__("recent_files", valid))
except Exception:
Comment thread tests/test_audit_r11.py

import json
import os
import tempfile
Comment thread tests/test_audit_r11.py
import os
import tempfile
from pathlib import Path
from unittest.mock import patch
@nelsonduarte nelsonduarte merged commit d407f2f into main Jun 17, 2026
3 checks passed
@nelsonduarte nelsonduarte deleted the fix/audit-r11 branch June 17, 2026 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants