Direct docx edits on Save instead of re-packing new doc#73
Direct docx edits on Save instead of re-packing new doc#73aaron-hirundo wants to merge 4 commits intoeigenpal:mainfrom
Conversation
|
@aaron-hirundo is attempting to deploy a commit to the EigenPal Team on Vercel. A member of the Team first needs to authorize it. |
|
@aaron-hirundo hey ,thank you for opening the PR! Can you add some details to PR description how can we verify the changes? Do you have example document that you tested with during the implementation? |
|
@jedrazb Added some info, and in terms of docs I tested on, I checked and they are confidential. I am sorry. But they for sure are huge 350-600 pages with tons of edits and etc. Tell me if you have any more questions on this PR! |
…upstream-direct-xml
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
| function extractParagraphXmlById(documentXml: string, paraId: string): string | null { | ||
| const escapedId = escapeRegExp(paraId); | ||
| const pattern = new RegExp( | ||
| `<w:p\\b(?=[^>]*\\b(?:w14:paraId|w:paraId)="${escapedId}")[\\s\\S]*?<\\/w:p>` |
There was a problem hiding this comment.
This lazy [\s\S]*? matches the first </w:p> it finds. Word generates nested <w:p> inside <mc:AlternateContent> blocks — in that case this extracts a truncated paragraph and silently corrupts the output without triggering the fallback.
Suggestion: after extraction, verify the match has balanced <w:p> / </w:p> counts. If not, return null to trigger fallback.
There was a problem hiding this comment.
@aaron-hirundo Please also add a test for this case to ensure that your fix works :)
| const escapedNoteId = escapeRegExp(String(noteId)); | ||
| const tagName = kind === 'footnote' ? 'footnote' : 'endnote'; | ||
| const pattern = new RegExp( | ||
| `<w:${tagName}\\b(?=[^>]*\\bw:id=["']${escapedNoteId}["'])[\\s\\S]*?<\\/w:${tagName}>` |
There was a problem hiding this comment.
Same nested-element risk here — <w:footnote> could contain nested structures that cause the lazy match to grab the wrong closing tag.
| } | ||
|
|
||
| const currentDocument = inner.getDocument(); | ||
| await refreshBaselineAfterSave(currentDocument, saved); |
There was a problem hiding this comment.
If refreshBaselineAfterSave throws, onSave on line 307 is never called — the caller has no idea the save succeeded. Move onSave?.(saved) before the refresh, or wrap the refresh in try/catch.
| export type DocxXmlMutator = (editor: DocxXmlEditor) => void | Promise<void>; | ||
|
|
||
| function normalizePartPath(path: string): string { | ||
| return path.replace(/^\/+/, ''); |
There was a problem hiding this comment.
normalizePartPath strips leading / but doesn't reject ../ sequences. Since this is a public API, consider rejecting paths containing .. to prevent path traversal.
There was a problem hiding this comment.
Again @aaron-hirundo , please add tests to make sure that you have fixed this
benglewis
left a comment
There was a problem hiding this comment.
Nice work @aaron-hirundo ! :) Took me quite a while to read, but I had a few (if somewhat repetitive) comments that I hope will help improve it
| hydrationError: Error | null; | ||
| } | ||
|
|
||
| export function shouldHydrateBaselineFromSavedBytes(document: Document | null): boolean { |
There was a problem hiding this comment.
It took me a bit of ChatGPT prompting to understand this:
https://chatgpt.com/share/e/699f226c-0a94-8011-b77a-df1586e98a47
@aaron-hirundo I think that you should document this function with some kind of a docstring.
Also, in general, this PR feels like it needs some .md files documenting the architecture and design of this functionality
There was a problem hiding this comment.
I couldn't find the relevant file to add it, but since this repository uses CLAUDE.md and we tend to use Codex, @aaron-hirundo , I think that you should symlink CLAUDE.md with an AGENTS.md file so that we can produce better PRs
There was a problem hiding this comment.
Why did you name this DirectXmlDocxEditor.mode.test.ts? What is the .mode bit about?
| baselineBufferRef.current = null; | ||
| editedParagraphIdsRef.current = new Set(); | ||
|
|
||
| (async () => { |
There was a problem hiding this comment.
(async () => {I see that the changes include this pattern. The rest of the code base doesn't. How does it handle async code in useEffect without that?
The difference is straightforward. The rest of the codebase uses the named function + call pattern:
// DocxEditor.tsx — the standard pattern used everywhere else useEffect(() => { const parseDocument = async () => { // async work... }; parseDocument(); // call it }, [deps]);Your change uses the IIFE (Immediately Invoked Function Expression) pattern:
// DirectXmlDocxEditor.tsx — your pattern useEffect(() => { (async () => { // async work... })(); }, [deps]);Both are functionally identical — neither makes
useEffect's callback itselfasync(which would break React's cleanup return value). The named function style (DocxEditor.tsx:564–596) is just a bit more readable and is the convention used throughout the rest of the codebase. If you want consistency, you can refactor your IIFE to a named async function called immediately after definition.
There was a problem hiding this comment.
You should probably match the conventions of the codebase
| (async () => { | ||
| try { | ||
| const normalizedBuffer = await toArrayBuffer(documentBuffer); | ||
| if (baselineLoadTokenRef.current !== loadToken) { |
There was a problem hiding this comment.
I prefer not using this kind of repetitive and hard to make sense of code. Instead, you can use a custom hook:
Yes, absolutely. A custom
useAsyncEffecthook can hide all the token machinery behind a cleaner API. The key insight is that React's effect cleanup function already fires exactly when you need cancellation — when deps change or the component unmounts — so you can use a simple closure boolean instead of a ref counter:function useAsyncEffect( effect: (isCancelled: () => boolean) => Promise<void>, deps: React.DependencyList ) { useEffect(() => { let cancelled = false; effect(() => cancelled); return () => { cancelled = true; }; }, deps); // eslint-disable-line react-hooks/exhaustive-deps }Then the call site becomes:
useAsyncEffect(async (isCancelled) => { if (!documentBuffer) return; baselineDocumentRef.current = null; baselineBufferRef.current = null; editedParagraphIdsRef.current = new Set(); try { const normalizedBuffer = await toArrayBuffer(documentBuffer); if (isCancelled()) return; baselineBufferRef.current = cloneArrayBuffer(normalizedBuffer); const parsed = await parseDocx(cloneArrayBuffer(normalizedBuffer)); if (isCancelled()) return; baselineDocumentRef.current = cloneDocumentForBaseline(parsed); // ... } catch (error) { if (isCancelled()) return; onError?.(toError(error, '...')); } }, [documentBuffer, onError]);Compared to the current code, this:
- Eliminates
baselineLoadTokenRefentirely — the closure boolean replaces it- Reads semantically —
isCancelled()is self-documenting vs. a raw integer comparison- Cleanup is automatic — the effect teardown handles it, you can't forget to bump a counter
The tradeoff is that
isCancelled()is a slightly weaker model thanAbortController— it's cooperative (the async chain has to opt in by checking), whereasAbortControllercan propagate intofetchcalls and cancel in-flight network requests. But sincetoArrayBufferandparseDocxdon't support signals, there's no practical difference here. The custom hook gets you most of the readability benefits.
| const RELATIONSHIPS_NAMESPACE = 'http://schemas.openxmlformats.org/package/2006/relationships'; | ||
| const CONTENT_TYPES_NAMESPACE = 'http://schemas.openxmlformats.org/package/2006/content-types'; |
There was a problem hiding this comment.
I think that you can use NAMESPACES.pr and NAMESPACES.ct from src/docx/xmlParser.ts instead
| editor.hasPart(operation.path), | ||
| `Operation ${i}: target part not found: ${operation.path}` | ||
| ); | ||
| } |
There was a problem hiding this comment.
Does it definitely make sense to report below that an operation was done when it was skipped in non-strict mode here? This seems kinda unexpected to me
| if (!editor.hasPart(relsPath)) { | ||
| if (strict && !operation.allowMissing) { | ||
| throw new Error(`Operation ${i}: relationships part not found: ${relsPath}`); | ||
| } |
There was a problem hiding this comment.
Again I am surprised that reporting in non-strict mode kinda lies that it has done something when nothing was done
| if (!editor.hasPart(CONTENT_TYPES_PART)) { | ||
| if (strict && !operation.allowMissing) { | ||
| throw new Error(`Operation ${i}: content types part not found: ${CONTENT_TYPES_PART}`); | ||
| } |
There was a problem hiding this comment.
Again regarding the reporting in non-strict mode
| editor.setXml(operation.path, result.output); | ||
| } | ||
|
|
||
| report.push({ |
There was a problem hiding this comment.
Again regarding the reporting in non-strict mode
Summary
This PR adds a direct XML save path so we do not repack/rewrite the whole DOCX on save.
Main goal: reduce unnecessary Word diffs (empty paragraphs, formatting churn, unrelated paragraph changes) when user makes a small edit.
This addresses issue #63.
What changed (short)
set-xml, relationship updates, content-type updates, etc.word/document.xmlby changed paragraph IDsDirectXmlDocxEditorwith strict/best-effort modesHow to verify (exact workflow used)
A/B check against old behavior
main(without this PR), open a DOCX, change one small thing, save.Reviewtab ->Compare->Compare...Additional checks
Example docs tested
Tracked docs in repo used for manual verification:
examples/shared/sample.docxe2e/fixtures/with-tables.docxe2e/fixtures/complex-styles.docxe2e/fixtures/large-36-page.docxe2e/fixtures/very-large-50-page.docxCloses #63