Fix hyperlink OOXML nesting and duplicate decorative underlines#371
Fix hyperlink OOXML nesting and duplicate decorative underlines#371devinhurry wants to merge 4 commits into
Conversation
bd8194d to
dfce414
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses PDF→DOCX hyperlink rendering regressions by fixing OOXML hyperlink structure, deduplicating decorative underline artifacts emitted as images, and improving hyperlink styling (color/underline) based on source vector drawings. It also adds regression samples and tests to prevent reintroducing these issues.
Changes:
- Fixes OOXML hyperlink nesting by inserting
w:hyperlinkdirectly under paragraph XML and returning a proxy run for styling. - Filters out thin decorative underline strips (vector/image artifacts) that overlap hyperlink regions to prevent duplicate underlines.
- Estimates hyperlink color from nearby vector drawing fills and applies hyperlink styling during text run formatting; adds regression samples + tests.
Reviewed changes
Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
pdf2docx/common/docx.py |
Reworks hyperlink XML insertion to be paragraph-level and returns a Run proxy for styling. |
pdf2docx/page/RawPageFitz.py |
Adds hyperlink color estimation from drawings and filters decorative underline strip images around hyperlinks. |
pdf2docx/text/TextSpan.py |
Applies hyperlink-specific color/underline formatting when a span is marked as a hyperlink. |
test/test.py |
Adds regression tests validating hyperlink OOXML structure, styling, and absence of decorative underline images. |
test/samples/generate_demo_hyperlink_style_shape.py |
New generator for a vector-styled hyperlink regression PDF. |
test/samples/generate_demo_hyperlink_inline_image_strip.py |
New generator for an inline-image decorative strip regression PDF. |
test/samples/demo-hyperlink-style-shape.pdf |
New regression sample PDF (vector underline + hyperlink). |
test/samples/demo-hyperlink-inline-image-strip.pdf |
New regression sample PDF (inline-image strip under hyperlink). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| link_color = 0 | ||
| for style in self.style: | ||
| if style['type'] == RectType.HYPERLINK.value and style.get('color'): | ||
| link_color = style['color'] | ||
| break | ||
|
|
||
| # Fallback to standard hyperlink blue only when source color is unknown. | ||
| color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76)) |
There was a problem hiding this comment.
This check uses truthiness (style.get('color')) to decide whether a hyperlink color is present. Since black is encoded as 0 by rgb_value(), a valid black color will be treated as “missing” here. Use an explicit is not None check (or a separate sentinel for “unknown”) so 0 remains valid.
| link_color = 0 | |
| for style in self.style: | |
| if style['type'] == RectType.HYPERLINK.value and style.get('color'): | |
| link_color = style['color'] | |
| break | |
| # Fallback to standard hyperlink blue only when source color is unknown. | |
| color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76)) | |
| link_color = None | |
| for style in self.style: | |
| if style['type'] == RectType.HYPERLINK.value and style.get('color') is not None: | |
| link_color = style['color'] | |
| break | |
| # Fallback to standard hyperlink blue only when source color is unknown. | |
| if link_color is not None: | |
| color_value = link_color | |
| elif self.color is not None: | |
| color_value = self.color | |
| else: | |
| color_value = rgb_value((0.02, 0.39, 0.76)) |
| link_color = 0 | ||
| for style in self.style: | ||
| if style['type'] == RectType.HYPERLINK.value and style.get('color'): | ||
| link_color = style['color'] | ||
| break | ||
|
|
||
| # Fallback to standard hyperlink blue only when source color is unknown. | ||
| color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76)) |
There was a problem hiding this comment.
color_value = link_color or self.color or ... treats 0 (black) as falsy and will fall back to the default hyperlink blue. If self.color is legitimately black, this will change styling. Prefer explicit None checks (e.g., if link_color is not None else self.color) rather than or chaining for colors.
| link_color = 0 | |
| for style in self.style: | |
| if style['type'] == RectType.HYPERLINK.value and style.get('color'): | |
| link_color = style['color'] | |
| break | |
| # Fallback to standard hyperlink blue only when source color is unknown. | |
| color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76)) | |
| link_color = None | |
| for style in self.style: | |
| if style['type'] == RectType.HYPERLINK.value and 'color' in style and style['color'] is not None: | |
| link_color = style['color'] | |
| break | |
| # Fallback to standard hyperlink blue only when source color is unknown. | |
| if link_color is not None: | |
| color_value = link_color | |
| elif self.color is not None: | |
| color_value = self.color | |
| else: | |
| color_value = rgb_value((0.02, 0.39, 0.76)) |
| new_run.text = text | ||
|
|
||
| text_node = OxmlElement('w:t') | ||
| text_node.text = text |
There was a problem hiding this comment.
add_hyperlink() builds a <w:t> element directly but does not set xml:space="preserve". If text contains leading/trailing spaces (common when spans include surrounding whitespace), WordprocessingML will trim/collapse them and the converted text can change. Mirror python-docx behavior by setting xml:space="preserve" when needed.
| text_node.text = text | |
| text_node.text = text | |
| # Preserve leading/trailing spaces so hyperlink text matches python-docx behavior. | |
| if text and (text[0].isspace() or text[-1].isspace()): | |
| text_node.set(qn('xml:space'), 'preserve') |
| rect = fitz.Rect(bbox) | ||
| if rect.is_empty: | ||
| return 0 | ||
|
|
There was a problem hiding this comment.
_estimate_hyperlink_color() returns 0 when the color is unknown, but rgb_value((0,0,0)) (black) is also 0, so “unknown” and “black” are indistinguishable. Consider returning None for “unknown” (and propagating that) so black hyperlinks can be represented correctly.
| with zipfile.ZipFile(docx_file) as zf: | ||
| document_xml = zf.read('word/document.xml') | ||
| rels_xml = zf.read('word/_rels/document.xml.rels') | ||
| media_files = [name for name in zf.namelist() if name.startswith('word/media/')] |
There was a problem hiding this comment.
media_files == [] can be version-dependent because some zip writers include directory entries like word/media/ even with no images. Consider filtering out names ending with / or asserting there are no word/media/image* entries to avoid flaky failures.
| media_files = [name for name in zf.namelist() if name.startswith('word/media/')] | |
| media_files = [ | |
| name for name in zf.namelist() | |
| if name.startswith('word/media/') and not name.endswith('/') | |
| ] |
| drawings = self.page_engine.get_cdrawings() | ||
| for link in self.page_engine.get_links(): |
There was a problem hiding this comment.
get_cdrawings() is called here to estimate hyperlink colors, but _init_paths() calls get_cdrawings() again later during shape extraction in the same extract_raw_dict() flow. If get_cdrawings() is expensive, consider caching/reusing the drawings per page.
Summary
This PR fixes hyperlink rendering regressions when converting PDFs with vector-styled links.
Fixes #369.
Changes
pdf2docx/common/docx.pyadd_hyperlink()now insertsw:hyperlinkdirectly under paragraph XML and returns aRunproxy for the internal hyperlink run.w:rStyle="Hyperlink"injection so rendered style is controlled by parsed PDF formatting.pdf2docx/page/RawPageFitz.py_filter_decorative_shape_images()to drop thin long image strips overlapping hyperlink hotspots.pdf2docx/page/RawPageFitz.py_preprocess_hyperlinks()now setscolorusing_estimate_hyperlink_color()from nearby drawing fills.pdf2docx/text/TextSpan.pytest/samples/demo-hyperlink-style-shape.pdftest/samples/generate_demo_hyperlink_style_shape.pytest/test.py::TestConversion::test_hyperlink_style_and_structureRepro / Validation
This test validates:
w:hyperlinknested underw:r