Direct docx edits on Save instead of re-packing new doc by aaron-hirundo · Pull Request #73 · eigenpal/docx-editor

aaron-hirundo · 2026-02-25T11:39:20Z

Summary

This PR adds a direct XML save path so we do not repack/rewrite the whole DOCX on save.

Main goal: reduce unnecessary Word diffs (empty paragraphs, formatting churn, unrelated paragraph changes) when the user makes a small edit.

This addresses issue #63.

What changed (short)

Added low-level direct DOCX part editing:

raw OOXML part editor
operation engine for set-xml, relationship updates, content-type updates, etc.

Added direct save planner:

targeted patching of word/document.xml by changed paragraph IDs
note/header/footer package consistency handling
safe fallback to full save when targeted patch is unsafe

Added React wrapper:

DirectXmlDocxEditor with strict/best-effort modes
baseline synchronization across repeated saves
fallback diagnostics

Wired direct save into app/demo and exported public APIs.

How to verify (exact workflow used)

A/B check against old behavior

On main (without this PR), open a DOCX, change one small thing, save.
On this PR branch, do the same with the same input DOCX.
In Microsoft Word:
- Open Word
- Review tab -> Compare -> Compare...
- Original = input DOCX
- Revised = saved DOCX
Compare results:
- Before (old repack): lots of unrelated changes (extra paragraphs/format churn/etc.)
- With this PR (direct XML): only intended/local edits, no broad document noise.

Additional checks

no-edit save
repeated save (edit -> save -> edit -> save)
header/footer and notes edits

Example docs tested

Tracked docs in repo used for manual verification:

examples/shared/sample.docx
e2e/fixtures/with-tables.docx
e2e/fixtures/complex-styles.docx
e2e/fixtures/large-36-page.docx
e2e/fixtures/very-large-50-page.docx

Closes #63

jedrazb · 2026-02-25T14:58:18Z

@aaron-hirundo hey, thank you for opening the PR!

Can you add some details to the PR description on how we can verify the changes? Do you have an example document that you tested with during the implementation?

aaron-hirundo · 2026-02-26T12:18:39Z

@jedrazb Added some info. As for the docs I tested on, I checked and they are confidential. I am sorry.

But they are definitely huge: 350-600 pages with tons of edits, etc.

Tell me if you have any more questions on this PR!

…upstream-direct-xml

jedrazb

Good stuff @aaron-hirundo ! Left a review

jedrazb · 2026-03-03T02:37:31Z

src/docx/directXmlPlanBuilder.ts

+function extractParagraphXmlById(documentXml: string, paraId: string): string | null {
+  const escapedId = escapeRegExp(paraId);
+  const pattern = new RegExp(
+    `<w:p\\b(?=[^>]*\\b(?:w14:paraId|w:paraId)="${escapedId}")[\\s\\S]*?<\\/w:p>`


This lazy [\s\S]*? matches the first </w:p> it finds. Word generates nested <w:p> inside <mc:AlternateContent> blocks — in that case this extracts a truncated paragraph and silently corrupts the output without triggering the fallback.

Suggestion: after extraction, verify the match has balanced <w:p> / </w:p> counts. If not, return null to trigger fallback.

@aaron-hirundo Please also add a test for this case to ensure that your fix works :)
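
A minimal sketch of the balance check suggested above, in case it helps. The helper name `hasBalancedTags` is hypothetical and not part of this PR. It assumes attribute values never contain a raw `>` and that the runtime supports regex lookbehind (ES2018+). The extractor would return null when the check fails, so that the fallback to full save kicks in.

```typescript
// Hypothetical helper (not in the PR): returns true when the fragment contains
// the same number of opening and closing tags for the given element name.
function hasBalancedTags(xml: string, tagName: string): boolean {
  const escaped = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Opening tags only: <w:p> or <w:p attr="...">.
  // The lookahead rejects longer names such as <w:pPr>; the lookbehind
  // rejects self-closing tags such as <w:p/>, which need no closing tag.
  const openPattern = new RegExp(`<${escaped}(?=[\\s>])[^>]*(?<!/)>`, 'g');
  const closePattern = new RegExp(`</${escaped}\\s*>`, 'g');
  const opens = (xml.match(openPattern) ?? []).length;
  const closes = (xml.match(closePattern) ?? []).length;
  return opens === closes;
}

// A lazy match that stopped at the inner </w:p> leaves the outer <w:p> open.
const truncatedParagraph =
  '<w:p w14:paraId="0A1B2C3D"><w:r><w:p><w:r><w:t>inner</w:t></w:r></w:p>';
const completeParagraph = '<w:p><w:pPr/><w:r><w:t>text</w:t></w:r></w:p>';

console.log(hasBalancedTags(truncatedParagraph, 'w:p'));
console.log(hasBalancedTags(completeParagraph, 'w:p'));
```

A count check only detects the truncation; it does not recover the full paragraph, which is fine if the goal is simply to trigger the fallback.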

jedrazb · 2026-03-03T02:37:31Z

src/docx/directXmlPlanBuilder.ts

+  const escapedNoteId = escapeRegExp(String(noteId));
+  const tagName = kind === 'footnote' ? 'footnote' : 'endnote';
+  const pattern = new RegExp(
+    `<w:${tagName}\\b(?=[^>]*\\bw:id=["']${escapedNoteId}["'])[\\s\\S]*?<\\/w:${tagName}>`


Same nested-element risk here — <w:footnote> could contain nested structures that cause the lazy match to grab the wrong closing tag.
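
One possible alternative to the lazy regex, sketched under assumptions: scan forward from the opening tag and track nesting depth until the matching close tag is found. `sliceBalancedElement` is a hypothetical name, `start` is assumed to be the index of the element's opening `<`, and attribute values are assumed not to contain a raw `>`.

```typescript
// Hypothetical helper (not in the PR): returns the element that starts at
// `start`, ending at the close tag that balances it, or null if none is found.
function sliceBalancedElement(xml: string, start: number, tagName: string): string | null {
  const escaped = tagName.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // Matches opening, self-closing, and closing tags of exactly this name.
  const tagPattern = new RegExp(`<(/?)${escaped}(?=[\\s/>])[^>]*?(/?)>`, 'g');
  tagPattern.lastIndex = start;
  let depth = 0;
  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(xml)) !== null) {
    const end = match.index + match[0].length;
    if (match[2] === '/') {
      // Self-closing tag: it is the whole element only when nothing is open yet.
      if (depth === 0) return xml.slice(start, end);
      continue;
    }
    depth += match[1] === '/' ? -1 : 1;
    if (depth < 0) return null; // close tag seen before any opening tag
    if (depth === 0) return xml.slice(start, end);
  }
  return null; // ran out of input while the element was still open
}

const sampleXml =
  '<w:body><w:p w14:paraId="A"><w:r><w:p><w:t>x</w:t></w:p></w:r></w:p><w:p/></w:body>';
console.log(sliceBalancedElement(sampleXml, sampleXml.indexOf('<w:p '), 'w:p'));
```

The same function would work for `w:footnote` or `w:endnote` by passing a different `tagName`; returning null keeps the existing fallback path as the safety net.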

jedrazb · 2026-03-03T02:37:31Z

src/components/DirectXmlDocxEditor.tsx

+      }
+
+      const currentDocument = inner.getDocument();
+      await refreshBaselineAfterSave(currentDocument, saved);


If refreshBaselineAfterSave throws, onSave on line 307 is never called — the caller has no idea the save succeeded. Move onSave?.(saved) before the refresh, or wrap the refresh in try/catch.
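
A rough sketch of the ordering described above, combining both options (notify first, then guard the refresh). `notifyThenRefresh` and its parameters are hypothetical stand-ins for illustration; the real component wires `onSave`, `onError`, and `refreshBaselineAfterSave` differently.

```typescript
// Hypothetical helper (not in the PR): reports a successful save before
// attempting the baseline refresh, and keeps a refresh failure from hiding it.
async function notifyThenRefresh(
  saved: ArrayBuffer,
  onSave: ((buffer: ArrayBuffer) => void) | undefined,
  refreshBaseline: () => Promise<void>,
  onError?: (error: Error) => void
): Promise<void> {
  // The bytes are already saved at this point, so tell the caller first.
  onSave?.(saved);
  try {
    await refreshBaseline();
  } catch (error) {
    // A failed refresh is surfaced separately instead of being thrown.
    onError?.(error instanceof Error ? error : new Error(String(error)));
  }
}
```

With this shape the caller always hears about the save, and a stale baseline becomes a reported error rather than a silent one.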

jedrazb · 2026-03-03T02:37:31Z

src/docx/rawXmlEditor.ts

+export type DocxXmlMutator = (editor: DocxXmlEditor) => void | Promise<void>;
+
+function normalizePartPath(path: string): string {
+  return path.replace(/^\/+/, '');


normalizePartPath strips leading / but doesn't reject ../ sequences. Since this is a public API, consider rejecting paths containing .. to prevent path traversal.

Again @aaron-hirundo , please add tests to make sure that you have fixed this
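
For illustration, a minimal sketch of a stricter normalizer. The name `normalizePartPathSafe` is hypothetical, and converting backslashes to forward slashes is an extra assumption that was not part of the suggestion above. It rejects `..` only as a whole path segment, so a file name that merely contains two dots is still accepted.

```typescript
// Hypothetical variant (not in the PR): strips leading slashes like the
// original normalizePartPath, then rejects any ".." path segment.
function normalizePartPathSafe(path: string): string {
  // Assumption: treat backslashes as separators so "..\\" cannot slip through.
  const normalized = path.replace(/\\/g, '/').replace(/^\/+/, '');
  const segments = normalized.split('/');
  if (segments.some((segment) => segment === '..')) {
    throw new Error(`Invalid part path (contains ".." segment): ${path}`);
  }
  return normalized;
}

console.log(normalizePartPathSafe('/word/document.xml'));
```

Throwing, rather than silently cleaning the path, makes a bad input visible to the caller of the public API.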

benglewis

Nice work @aaron-hirundo ! :) Took me quite a while to read, but I had a few (if somewhat repetitive) comments that I hope will help improve it

benglewis · 2026-02-25T16:26:21Z

src/components/directXmlBaselineSnapshot.ts

+  hydrationError: Error | null;
+}
+
+export function shouldHydrateBaselineFromSavedBytes(document: Document | null): boolean {


It took me a bit of ChatGPT prompting to understand this:
https://chatgpt.com/share/e/699f226c-0a94-8011-b77a-df1586e98a47
@aaron-hirundo I think that you should document this function with some kind of a docstring.
Also, in general, this PR feels like it needs some .md files documenting the architecture and design of this functionality

benglewis · 2026-02-25T16:28:47Z

examples/vite/src/App.tsx

I couldn't find the relevant file to add it, but since this repository uses CLAUDE.md and we tend to use Codex, @aaron-hirundo , I think that you should symlink CLAUDE.md with an AGENTS.md file so that we can produce better PRs

benglewis · 2026-02-25T16:31:32Z

src/components/DirectXmlDocxEditor.mode.test.ts

Why did you name this DirectXmlDocxEditor.mode.test.ts? What is the .mode bit about?

benglewis · 2026-02-25T16:41:26Z

src/components/DirectXmlDocxEditor.tsx

+      baselineBufferRef.current = null;
+      editedParagraphIdsRef.current = new Set();
+
+      (async () => {


(async () => {

I see that the changes include this pattern. The rest of the code base doesn't. How does it handle async code in useEffect without that?

The difference is straightforward. The rest of the codebase uses the named function + call pattern:

  // DocxEditor.tsx: the standard pattern used everywhere else
  useEffect(() => {
    const parseDocument = async () => {
      // async work...
    };
    parseDocument(); // call it
  }, [deps]);

Your change uses the IIFE (Immediately Invoked Function Expression) pattern:

  // DirectXmlDocxEditor.tsx: your pattern
  useEffect(() => {
    (async () => {
      // async work...
    })();
  }, [deps]);

Both are functionally identical — neither makes useEffect's callback itself async (which would break React's cleanup return value). The named function style (DocxEditor.tsx:564–596) is just a bit more readable and is the convention used throughout the rest of the codebase. If you want consistency, you can refactor your IIFE to a named async function called immediately after definition.

You should probably match the conventions of the codebase

benglewis · 2026-02-25T16:53:47Z

src/components/DirectXmlDocxEditor.tsx

+      (async () => {
+        try {
+          const normalizedBuffer = await toArrayBuffer(documentBuffer);
+          if (baselineLoadTokenRef.current !== loadToken) {


I'd prefer to avoid this kind of repetitive, hard-to-follow code. Instead, you can use a custom hook:

Yes, absolutely. A custom useAsyncEffect hook can hide all the token machinery behind a cleaner API. The key insight is that React's effect cleanup function already fires exactly when you need cancellation — when deps change or the component unmounts — so you can use a simple closure boolean instead of a ref counter:

  function useAsyncEffect(
    effect: (isCancelled: () => boolean) => Promise<void>,
    deps: React.DependencyList
  ) {
    useEffect(() => {
      let cancelled = false;
      effect(() => cancelled);
      return () => {
        cancelled = true;
      };
    }, deps); // eslint-disable-line react-hooks/exhaustive-deps
  }

Then the call site becomes:

  useAsyncEffect(async (isCancelled) => {
    if (!documentBuffer) return;
    baselineDocumentRef.current = null;
    baselineBufferRef.current = null;
    editedParagraphIdsRef.current = new Set();
    try {
      const normalizedBuffer = await toArrayBuffer(documentBuffer);
      if (isCancelled()) return;
      baselineBufferRef.current = cloneArrayBuffer(normalizedBuffer);
      const parsed = await parseDocx(cloneArrayBuffer(normalizedBuffer));
      if (isCancelled()) return;
      baselineDocumentRef.current = cloneDocumentForBaseline(parsed);
      // ...
    } catch (error) {
      if (isCancelled()) return;
      onError?.(toError(error, '...'));
    }
  }, [documentBuffer, onError]);

Compared to the current code, this:

Eliminates baselineLoadTokenRef entirely — the closure boolean replaces it

Reads semantically — isCancelled() is self-documenting vs. a raw integer comparison

Cleanup is automatic — the effect teardown handles it, you can't forget to bump a counter

The tradeoff is that isCancelled() is a slightly weaker model than AbortController — it's cooperative (the async chain has to opt in by checking), whereas AbortController can propagate into fetch calls and cancel in-flight network requests. But since toArrayBuffer and parseDocx don't support signals, there's no practical difference here. The custom hook gets you most of the readability benefits.

https://cursor.com/dashboard?tab=shared-chats&shareId=directxmldocxeditor-check-analysis-l4smZNS8lyMv

benglewis · 2026-03-04T00:08:11Z