feat: add docx loremify sanitizer for confidential bug reports #75

Open
fahdarafat wants to merge 2 commits into eigenpal:main from fahdarafat:feat/loremify-docx-sanitizer
Conversation

@fahdarafat
Contributor

Summary

I hope this helps issue reporters share reproducible files without exposing confidential text.

What this PR changes

  • Adds a new CLI script: scripts/loremify-docx.ts
  • Adds a package script:
    • bun run docx:loremify -- <file.docx>
  • Adds contributor documentation in CONTRIBUTING.md showing how to sanitize files before attaching them to issues

Behavior

  • Replaces human-readable DOCX run text with lorem-style text
  • Preserves document structure and formatting-critical layout containers (tables, nesting, headers/footers, styles,
    relationships)
  • Preserves per-word lengths so visual flow stays close to the source document
  • Defaults to safer story-part replacement (main document + headers/footers/notes/comments)
  • Supports --all-xml for advanced full-XML replacement when needed
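To make the length-preserving behavior concrete, here is a minimal sketch. It is not the code in scripts/loremify-docx.ts: the names `fillerOfLength`, `copyCasing`, and `loremifyText` are hypothetical, and it assumes words are matched as runs of Unicode letters and digits, with the casing of each source character copied onto the filler.

```typescript
// Hypothetical sketch; the real script's helpers and lorem source may differ.
const LOREM = "loremipsumdolorsitametconsecteturadipiscingelit";

// Build a lowercase filler of exactly `length` characters,
// reading cyclically from the lorem stream starting at `offset`.
function fillerOfLength(length: number, offset: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += LOREM[(offset + i) % LOREM.length];
  }
  return out;
}

// Copy the upper/lower casing of each source character onto the filler.
function copyCasing(source: string, filler: string): string {
  const src = Array.from(source);
  return Array.from(filler)
    .map((ch, i) => {
      const s = src[i] ?? "";
      const isUpper = s !== s.toLowerCase() && s === s.toUpperCase();
      return isUpper ? ch.toUpperCase() : ch;
    })
    .join("");
}

// Replace every word with filler of the same length; punctuation and
// whitespace are left untouched so line breaks fall in similar places.
function loremifyText(text: string): string {
  let offset = 0;
  return text.replace(/[\p{L}\p{N}]+/gu, (word) => {
    const length = Array.from(word).length;
    const filler = fillerOfLength(length, offset);
    offset += length;
    return copyCasing(word, filler);
  });
}

console.log(loremifyText("Hello, World!")); // prints "Lorem, Ipsum!"
```

Because each word keeps its character count, the sanitized text wraps roughly like the original, which is what keeps layout bugs reproducible.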

Usage

bun run docx:loremify -- "path/to/file.docx"
bun run docx:loremify -- "path/to/file.docx" --out-dir sanitized-docs
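For readers wondering how these invocations map to options, a rough sketch of the argument handling follows. The `LoremifyOptions` shape and the `parseLoremifyArgs` name are assumptions made for illustration; only the `--out-dir` and `--all-xml` flags and the positional file path come from this description.

```typescript
// Hypothetical option shape; the script's real parser may differ.
interface LoremifyOptions {
  inputPath: string;
  outDir: string | undefined;
  allXml: boolean;
}

function parseLoremifyArgs(argv: string[]): LoremifyOptions {
  let inputPath: string | undefined;
  let outDir: string | undefined;
  let allXml = false;

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i] ?? "";
    if (arg === "--out-dir") {
      const value = argv[i + 1];
      if (value === undefined || value.startsWith("--")) {
        throw new Error("--out-dir requires a directory path");
      }
      outDir = value;
      i++; // skip the value that was just consumed
    } else if (arg === "--all-xml") {
      allXml = true;
    } else if (arg.startsWith("--")) {
      throw new Error(`Unknown option: ${arg}`);
    } else if (inputPath === undefined) {
      inputPath = arg; // first positional argument is the input file
    } else {
      throw new Error(`Unexpected extra argument: ${arg}`);
    }
  }

  if (inputPath === undefined) {
    throw new Error("Missing input .docx path");
  }
  return { inputPath, outDir, allXml };
}
```

In a script this would typically be called with the arguments after the script name, for example `parseLoremifyArgs(process.argv.slice(2))`.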

@vercel

vercel bot commented Feb 26, 2026

Someone is attempting to deploy a commit to the EigenPal Team on Vercel.

A member of the Team first needs to authorize it.

@jedrazb
Contributor

jedrazb commented Mar 1, 2026

Nice contribution! The code is clean and well-structured. A few suggestions to consider:

  1. Document metadata leaks: docProps/core.xml and docProps/app.xml (author, company, manager, revision history) aren't sanitized. This is probably the biggest gap since someone could sanitize text but still ship identifying metadata.

  2. Embedded images/media — files in word/media/ survive sanitization. Logos, charts, screenshots can be just as identifying as text. Worth at least documenting this limitation, or adding a --strip-media option.

  3. No tests — the core functions (loremifyXml, applyCasing, decodeXmlEntities, parseArgs) are pure and easy to unit test.

  4. Numbers become lorem text — the [\p{L}\p{N}]+ regex replaces digits too, so dates like 2026-01-15 become lore-ips-dol. Preserving numeric characters would keep layouts more realistic.
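To illustrate point 4, the sketch below contrasts the pattern quoted above with a letters-only pattern. The `maskMatches` helper and the `LETTERS_ONLY` pattern are illustrative assumptions, not code from the PR; matched characters are masked with "x" purely to show what each pattern would touch.

```typescript
// Mask every matched character with "x" so the reach of a pattern is visible.
function maskMatches(text: string, pattern: RegExp): string {
  return text.replace(pattern, (match) => "x".repeat(Array.from(match).length));
}

// Pattern quoted in the review: letters and digits are both replaced.
const LETTERS_AND_DIGITS = /[\p{L}\p{N}]+/gu;

// Assumed alternative: only letters are replaced, digits pass through.
const LETTERS_ONLY = /\p{L}+/gu;

console.log(maskMatches("Due 2026-01-15", LETTERS_AND_DIGITS)); // prints "xxx xxxx-xx-xx"
console.log(maskMatches("Due 2026-01-15", LETTERS_ONLY)); // prints "xxx 2026-01-15"
console.log(maskMatches("Q3 2026", LETTERS_ONLY)); // prints "x3 2026"
```

One trade-off to keep in mind: numbers can themselves be confidential (amounts, account numbers, dates), so preserving them makes layouts more realistic at the cost of leaving that data in the file.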

@vercel

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project        | Deployment | Actions          | Updated (UTC)
docx-js-editor | Ready      | Preview, Comment | Mar 3, 2026 4:08am


  • sanitize docProps/core.xml and docProps/app.xml fields
  • add --strip-media to remove word/media files and references
  • preserve numeric characters during lorem replacement
  • add bun unit tests for loremify helper functions
  • document sanitizer behavior in CONTRIBUTING
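As a hedged illustration of the metadata item in this commit, the sketch below blanks the text of selected elements in an XML string. It is not the commit's implementation: `blankElements` and the two field lists are assumptions, and the element names are the ones commonly found in docProps/core.xml and docProps/app.xml.

```typescript
// Assumed field lists; the commit may sanitize a different set.
const CORE_FIELDS = [
  "dc:creator",
  "cp:lastModifiedBy",
  "dc:title",
  "dc:subject",
  "dc:description",
  "cp:keywords",
];
const APP_FIELDS = ["Company", "Manager"];

// Empty the text content of each listed element while keeping the tags,
// so the part stays structurally valid.
function blankElements(xml: string, tagNames: string[]): string {
  let out = xml;
  for (const tag of tagNames) {
    // Matches <tag ...>text</tag>; element text cannot contain "<",
    // so [^<]* covers the whole value. Self-closing tags are left alone.
    const pattern = new RegExp(`(<${tag}(?:\\s[^>]*)?>)[^<]*(</${tag}>)`, "g");
    out = out.replace(pattern, "$1$2");
  }
  return out;
}

const sampleCore =
  "<cp:coreProperties><dc:creator>Jane Doe</dc:creator>" +
  "<cp:revision>7</cp:revision></cp:coreProperties>";

console.log(blankElements(sampleCore, CORE_FIELDS));
// prints "<cp:coreProperties><dc:creator></dc:creator><cp:revision>7</cp:revision></cp:coreProperties>"
```

A regex pass like this is only reasonable because these parts are small and flat; a proper XML parser would be the safer choice for anything nested.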
