Skip to content

Direct docx edits on Save instead of re-packing new doc#73

Open
aaron-hirundo wants to merge 4 commits intoeigenpal:mainfrom
Hirundo-io:codex/upstream-direct-xml
Open

Direct docx edits on Save instead of re-packing new doc#73
aaron-hirundo wants to merge 4 commits intoeigenpal:mainfrom
Hirundo-io:codex/upstream-direct-xml

Conversation

@aaron-hirundo
Copy link

@aaron-hirundo aaron-hirundo commented Feb 25, 2026

Summary

This PR adds a direct XML save path so we do not repack/rewrite the whole DOCX on save.

Main goal: reduce unnecessary Word diffs (empty paragraphs, formatting churn, unrelated paragraph changes) when user makes a small edit.

This addresses issue #63.

What changed (short)

  1. Added low-level direct DOCX part editing:
  • raw OOXML part editor
  • operation engine for set-xml, relationship updates, content-type updates, etc.
  1. Added direct save planner:
  • targeted patching of word/document.xml by changed paragraph IDs
  • note/header/footer package consistency handling
  • safe fallback to full save when targeted patch is unsafe
  1. Added React wrapper:
  • DirectXmlDocxEditor with strict/best-effort modes
  • baseline synchronization across repeated saves
  • fallback diagnostics
  1. Wired direct save into app/demo and exported public APIs.

How to verify (exact workflow used)

A/B check against old behavior

  1. On main (without this PR), open a DOCX, change one small thing, save.
  2. On this PR branch, do the same with the same input DOCX.
  3. In Microsoft Word:
    • Open Word
    • Review tab -> Compare -> Compare...
    • Original = input DOCX
    • Revised = saved DOCX
  4. Compare results:
    • Before (old repack): lots of unrelated changes (extra paragraphs/format churn/etc.)
    • With this PR (direct XML): only intended/local edits, no broad document noise.

Additional checks

  • no-edit save
  • repeated save (edit -> save -> edit -> save)
  • header/footer and notes edits

Example docs tested

Tracked docs in repo used for manual verification:

  • examples/shared/sample.docx
  • e2e/fixtures/with-tables.docx
  • e2e/fixtures/complex-styles.docx
  • e2e/fixtures/large-36-page.docx
  • e2e/fixtures/very-large-50-page.docx

Closes #63

@vercel
Copy link

vercel bot commented Feb 25, 2026

@aaron-hirundo is attempting to deploy a commit to the EigenPal Team on Vercel.

A member of the Team first needs to authorize it.

@jedrazb
Copy link
Contributor

jedrazb commented Feb 25, 2026

@aaron-hirundo hey ,thank you for opening the PR!

Can you add some details to PR description how can we verify the changes? Do you have example document that you tested with during the implementation?

@aaron-hirundo
Copy link
Author

aaron-hirundo commented Feb 26, 2026

@jedrazb Added some info, and in terms of docs I tested on, I checked and they are confidential. I am sorry.

But they for sure are huge 350-600 pages with tons of edits and etc.

Tell me if you have any more questions on this PR!

@vercel
Copy link

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
docx-js-editor Ready Ready Preview, Comment Mar 3, 2026 2:03am

Request Review

Copy link
Contributor

@jedrazb jedrazb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff @aaron-hirundo ! Left a review

function extractParagraphXmlById(documentXml: string, paraId: string): string | null {
const escapedId = escapeRegExp(paraId);
const pattern = new RegExp(
`<w:p\\b(?=[^>]*\\b(?:w14:paraId|w:paraId)="${escapedId}")[\\s\\S]*?<\\/w:p>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lazy [\s\S]*? matches the first </w:p> it finds. Word generates nested <w:p> inside <mc:AlternateContent> blocks — in that case this extracts a truncated paragraph and silently corrupts the output without triggering the fallback.

Suggestion: after extraction, verify the match has balanced <w:p> / </w:p> counts. If not, return null to trigger fallback.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaron-hirundo Please also add a test for this case to ensure that your fix works :)

const escapedNoteId = escapeRegExp(String(noteId));
const tagName = kind === 'footnote' ? 'footnote' : 'endnote';
const pattern = new RegExp(
`<w:${tagName}\\b(?=[^>]*\\bw:id=["']${escapedNoteId}["'])[\\s\\S]*?<\\/w:${tagName}>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same nested-element risk here — <w:footnote> could contain nested structures that cause the lazy match to grab the wrong closing tag.

}

const currentDocument = inner.getDocument();
await refreshBaselineAfterSave(currentDocument, saved);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If refreshBaselineAfterSave throws, onSave on line 307 is never called — the caller has no idea the save succeeded. Move onSave?.(saved) before the refresh, or wrap the refresh in try/catch.

export type DocxXmlMutator = (editor: DocxXmlEditor) => void | Promise<void>;

function normalizePartPath(path: string): string {
return path.replace(/^\/+/, '');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normalizePartPath strips leading / but doesn't reject ../ sequences. Since this is a public API, consider rejecting paths containing .. to prevent path traversal.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again @aaron-hirundo , please add tests to make sure that you have fixed this

Copy link

@benglewis benglewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @aaron-hirundo ! :) Took me quite a while to read, but I had a few (if somewhat repetitive) comments that I hope will help improve it

hydrationError: Error | null;
}

export function shouldHydrateBaselineFromSavedBytes(document: Document | null): boolean {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a bit of ChatGPT prompting to understand this:
https://chatgpt.com/share/e/699f226c-0a94-8011-b77a-df1586e98a47
@aaron-hirundo I think that you should document this function with some kind of a docstring.
Also, in general, this PR feels like it needs some .md files documenting the architecture and design of this functionality

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find the relevant file to add it, but since this repository uses CLAUDE.md and we tend to use Codex, @aaron-hirundo , I think that you should symlink CLAUDE.md with an AGENTS.md file so that we can produce better PRs

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you name this DirectXmlDocxEditor.mode.test.ts? What is the .mode bit about?

baselineBufferRef.current = null;
editedParagraphIdsRef.current = new Set();

(async () => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

      (async () => {

I see that the changes include this pattern. The rest of the code base doesn't. How does it handle async code in useEffect without that?

The difference is straightforward. The rest of the codebase uses the named function + call pattern:

// DocxEditor.tsx — the standard pattern used everywhere else
useEffect(() => {
  const parseDocument = async () => {
    // async work...
  };

  parseDocument(); // call it
}, [deps]);

Your change uses the IIFE (Immediately Invoked Function Expression) pattern:

// DirectXmlDocxEditor.tsx — your pattern
useEffect(() => {
  (async () => {
    // async work...
  })();
}, [deps]);

Both are functionally identical — neither makes useEffect's callback itself async (which would break React's cleanup return value). The named function style (DocxEditor.tsx:564–596) is just a bit more readable and is the convention used throughout the rest of the codebase. If you want consistency, you can refactor your IIFE to a named async function called immediately after definition.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably match the conventions of the codebase

(async () => {
try {
const normalizedBuffer = await toArrayBuffer(documentBuffer);
if (baselineLoadTokenRef.current !== loadToken) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not using this kind of repetitive and hard to make sense of code. Instead, you can use a custom hook:

Yes, absolutely. A custom useAsyncEffect hook can hide all the token machinery behind a cleaner API. The key insight is that React's effect cleanup function already fires exactly when you need cancellation — when deps change or the component unmounts — so you can use a simple closure boolean instead of a ref counter:

function useAsyncEffect(
  effect: (isCancelled: () => boolean) => Promise<void>,
  deps: React.DependencyList
) {
  useEffect(() => {
    let cancelled = false;
    effect(() => cancelled);
    return () => { cancelled = true; };
  }, deps); // eslint-disable-line react-hooks/exhaustive-deps
}

Then the call site becomes:

useAsyncEffect(async (isCancelled) => {
  if (!documentBuffer) return;

  baselineDocumentRef.current = null;
  baselineBufferRef.current = null;
  editedParagraphIdsRef.current = new Set();

  try {
    const normalizedBuffer = await toArrayBuffer(documentBuffer);
    if (isCancelled()) return;

    baselineBufferRef.current = cloneArrayBuffer(normalizedBuffer);
    const parsed = await parseDocx(cloneArrayBuffer(normalizedBuffer));
    if (isCancelled()) return;

    baselineDocumentRef.current = cloneDocumentForBaseline(parsed);
    // ...
  } catch (error) {
    if (isCancelled()) return;
    onError?.(toError(error, '...'));
  }
}, [documentBuffer, onError]);

Compared to the current code, this:

  • Eliminates baselineLoadTokenRef entirely — the closure boolean replaces it
  • Reads semanticallyisCancelled() is self-documenting vs. a raw integer comparison
  • Cleanup is automatic — the effect teardown handles it, you can't forget to bump a counter

The tradeoff is that isCancelled() is a slightly weaker model than AbortController — it's cooperative (the async chain has to opt in by checking), whereas AbortController can propagate into fetch calls and cancel in-flight network requests. But since toArrayBuffer and parseDocx don't support signals, there's no practical difference here. The custom hook gets you most of the readability benefits.

https://cursor.com/dashboard?tab=shared-chats&shareId=directxmldocxeditor-check-analysis-l4smZNS8lyMv

Comment on lines +24 to +25
const RELATIONSHIPS_NAMESPACE = 'http://schemas.openxmlformats.org/package/2006/relationships';
const CONTENT_TYPES_NAMESPACE = 'http://schemas.openxmlformats.org/package/2006/content-types';

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that you can use NAMESPACES.pr and NAMESPACES.ct from src/docx/xmlParser.ts instead

editor.hasPart(operation.path),
`Operation ${i}: target part not found: ${operation.path}`
);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it definitely make sense to report below that an operation was done when it was skipped in non-strict mode here? This seems kinda unexpected to me

if (!editor.hasPart(relsPath)) {
if (strict && !operation.allowMissing) {
throw new Error(`Operation ${i}: relationships part not found: ${relsPath}`);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again I am surprised that reporting in non-strict mode kinda lies that it has done something when nothing was done

if (!editor.hasPart(CONTENT_TYPES_PART)) {
if (strict && !operation.allowMissing) {
throw new Error(`Operation ${i}: content types part not found: ${CONTENT_TYPES_PART}`);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again regarding the reporting in non-strict mode

editor.setXml(operation.path, result.output);
}

report.push({

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again regarding the reporting in non-strict mode

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Direct docx edits on Save instead of re-packing new doc

3 participants