Fix hyperlink OOXML nesting and duplicate decorative underlines #371

Open
devinhurry wants to merge 4 commits into ArtifexSoftware:master from devinhurry:fix-hyperlink-style-and-structure

@devinhurry

Summary

This PR fixes hyperlink rendering regressions when converting PDFs with vector-styled links.

Fixes #369.

Changes

  1. Fix OOXML hyperlink structure
  • pdf2docx/common/docx.py
  • add_hyperlink() now inserts w:hyperlink directly under the paragraph XML and returns a Run proxy for the run inside the hyperlink.
  • Removes the forced w:rStyle="Hyperlink" injection so the rendered style is controlled by the parsed PDF formatting.
  2. Prevent duplicate decorative underline images near hyperlinks
  • pdf2docx/page/RawPageFitz.py
  • Added _filter_decorative_shape_images() to drop thin, long image strips that overlap hyperlink hotspots.
  3. Estimate hyperlink color from vector drawings
  • pdf2docx/page/RawPageFitz.py
  • _preprocess_hyperlinks() now sets the color using _estimate_hyperlink_color(), which reads nearby drawing fills.
  4. Apply parsed hyperlink color/underline in text formatting
  • pdf2docx/text/TextSpan.py
  • Hyperlink color prefers the parsed hyperlink color and falls back to the span color, then to the default hyperlink blue, only when it is unavailable.
  5. Add regression sample and test
  • Added sample: test/samples/demo-hyperlink-style-shape.pdf
  • Added generator: test/samples/generate_demo_hyperlink_style_shape.py
  • Added test: test/test.py::TestConversion::test_hyperlink_style_and_structure
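
To make the structural fix in the first item concrete, here is a minimal sketch of the target shape: w:hyperlink as a direct child of w:p, with the run inside it. It uses the standard library's xml.etree.ElementTree instead of python-docx so it runs on its own. The helper name build_hyperlink_paragraph and the relationship id rId5 are illustrative assumptions, not code from this PR.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
ET.register_namespace("w", W_NS)
ET.register_namespace("r", R_NS)


def w(tag: str) -> str:
    """Return the namespace-qualified name for a WordprocessingML tag."""
    return f"{{{W_NS}}}{tag}"


def build_hyperlink_paragraph(text: str, r_id: str) -> ET.Element:
    """Build w:p > w:hyperlink > w:r > w:t (hypothetical helper)."""
    paragraph = ET.Element(w("p"))
    # The hyperlink is attached to the paragraph, never to a run.
    hyperlink = ET.SubElement(paragraph, w("hyperlink"), {f"{{{R_NS}}}id": r_id})
    run = ET.SubElement(hyperlink, w("r"))
    text_node = ET.SubElement(run, w("t"))
    text_node.text = text
    return paragraph


paragraph = build_hyperlink_paragraph("pdf2docx", "rId5")
print(ET.tostring(paragraph, encoding="unicode"))
```

The run returned from inside the hyperlink is what a caller would style, which is presumably why add_hyperlink() hands back a Run proxy.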

Repro / Validation

pytest -q test/test.py::TestConversion::test_hyperlink_style_and_structure

This test validates:

  • No w:hyperlink nested under w:r
  • Hyperlink color + underline preserved from source styling
  • No extra inline image/drawing strip emitted for decorative link underline
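
The first check can be approximated by a short scan over word/document.xml. This is a sketch of the idea rather than the test added in this PR; count_nested_hyperlinks and the two sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_nested_hyperlinks(document_xml: str) -> int:
    """Count w:hyperlink elements that are direct children of a w:r run."""
    root = ET.fromstring(document_xml)
    count = 0
    for run in root.iter(f"{{{W_NS}}}r"):
        count += len(run.findall(f"{{{W_NS}}}hyperlink"))
    return count


# Hyperlink directly under the paragraph: the structure this PR aims for.
valid_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:hyperlink><w:r><w:t>link</w:t></w:r></w:hyperlink>"
    "</w:p></w:body></w:document>"
)
# Hyperlink wrapped in a run: the invalid nesting described in the issue.
invalid_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
    "<w:r><w:hyperlink><w:r><w:t>link</w:t></w:r></w:hyperlink></w:r>"
    "</w:p></w:body></w:document>"
)

print(count_nested_hyperlinks(valid_xml), count_nested_hyperlinks(invalid_xml))
```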

@devinhurry devinhurry force-pushed the fix-hyperlink-style-and-structure branch from bd8194d to dfce414 Compare March 1, 2026 16:29

Copilot AI left a comment

Pull request overview

This PR addresses PDF→DOCX hyperlink rendering regressions by fixing OOXML hyperlink structure, deduplicating decorative underline artifacts emitted as images, and improving hyperlink styling (color/underline) based on source vector drawings. It also adds regression samples and tests to prevent reintroducing these issues.

Changes:

  • Fixes OOXML hyperlink nesting by inserting w:hyperlink directly under paragraph XML and returning a proxy run for styling.
  • Filters out thin decorative underline strips (vector/image artifacts) that overlap hyperlink regions to prevent duplicate underlines.
  • Estimates hyperlink color from nearby vector drawing fills and applies hyperlink styling during text run formatting; adds regression samples + tests.
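
As a rough illustration of the strip filtering described in the second bullet, the sketch below drops rectangles that are both thin and long and that intersect a hyperlink hotspot. The thresholds (max_height, min_aspect), the tuple-based rectangles, and the function names are assumptions; the actual heuristics in _filter_decorative_shape_images() may differ.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page points


def is_thin_strip(rect: Rect, max_height: float = 3.0, min_aspect: float = 8.0) -> bool:
    """True when the rectangle is short and much wider than it is tall."""
    width = rect[2] - rect[0]
    height = rect[3] - rect[1]
    if width <= 0 or height <= 0:
        return False
    return height <= max_height and width / height >= min_aspect


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def drop_decorative_strips(images: List[Rect], hotspots: List[Rect]) -> List[Rect]:
    """Keep every image except thin strips that overlap a hyperlink hotspot."""
    return [
        image for image in images
        if not (is_thin_strip(image) and any(intersects(image, h) for h in hotspots))
    ]


hotspot = (72.0, 100.0, 200.0, 114.0)
underline_strip = (72.0, 112.0, 200.0, 113.0)  # 128 x 1 pt, inside the hotspot
photo = (72.0, 150.0, 200.0, 250.0)            # ordinary image, should be kept
kept = drop_decorative_strips([underline_strip, photo], [hotspot])
```

Requiring both conditions keeps ordinary images that happen to sit near a link, and keeps thin rules elsewhere on the page.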

Reviewed changes

Copilot reviewed 6 out of 8 changed files in this pull request and generated 6 comments.

File | Description
--- | ---
pdf2docx/common/docx.py | Reworks hyperlink XML insertion to be paragraph-level and returns a Run proxy for styling.
pdf2docx/page/RawPageFitz.py | Adds hyperlink color estimation from drawings and filters decorative underline strip images around hyperlinks.
pdf2docx/text/TextSpan.py | Applies hyperlink-specific color/underline formatting when a span is marked as a hyperlink.
test/test.py | Adds regression tests validating hyperlink OOXML structure, styling, and absence of decorative underline images.
test/samples/generate_demo_hyperlink_style_shape.py | New generator for a vector-styled hyperlink regression PDF.
test/samples/generate_demo_hyperlink_inline_image_strip.py | New generator for an inline-image decorative strip regression PDF.
test/samples/demo-hyperlink-style-shape.pdf | New regression sample PDF (vector underline + hyperlink).
test/samples/demo-hyperlink-inline-image-strip.pdf | New regression sample PDF (inline-image strip under hyperlink).

Comment thread pdf2docx/text/TextSpan.py
Comment on lines +409 to +416
link_color = 0
for style in self.style:
    if style['type'] == RectType.HYPERLINK.value and style.get('color'):
        link_color = style['color']
        break

# Fallback to standard hyperlink blue only when source color is unknown.
color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76))

Copilot AI Mar 3, 2026

This check uses truthiness (style.get('color')) to decide whether a hyperlink color is present. Since black is encoded as 0 by rgb_value(), a valid black color will be treated as “missing” here. Use an explicit is not None check (or a separate sentinel for “unknown”) so 0 remains valid.

Suggested change
- link_color = 0
- for style in self.style:
-     if style['type'] == RectType.HYPERLINK.value and style.get('color'):
-         link_color = style['color']
-         break
- # Fallback to standard hyperlink blue only when source color is unknown.
- color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76))
+ link_color = None
+ for style in self.style:
+     if style['type'] == RectType.HYPERLINK.value and style.get('color') is not None:
+         link_color = style['color']
+         break
+ # Fallback to standard hyperlink blue only when source color is unknown.
+ if link_color is not None:
+     color_value = link_color
+ elif self.color is not None:
+     color_value = self.color
+ else:
+     color_value = rgb_value((0.02, 0.39, 0.76))
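
The pitfall is easy to reproduce in isolation. In the sketch below, pack_rgb is a hypothetical stand-in for rgb_value() that assumes colors are packed into a single 0xRRGGBB integer, and the two picker functions are illustrative names. Under that assumption, black packs to 0, so the or chain discards it while the explicit None checks keep it.

```python
from typing import Optional


def pack_rgb(r: float, g: float, b: float) -> int:
    """Hypothetical stand-in for rgb_value(): pack 0..1 floats into 0xRRGGBB."""
    return (int(r * 255) << 16) | (int(g * 255) << 8) | int(b * 255)


# Same fallback components as in the reviewed code.
DEFAULT_LINK_BLUE = pack_rgb(0.02, 0.39, 0.76)


def pick_with_or(link_color: Optional[int], span_color: Optional[int]) -> int:
    """Truthiness-based fallback: treats 0 (black) the same as 'missing'."""
    return link_color or span_color or DEFAULT_LINK_BLUE


def pick_with_none_checks(link_color: Optional[int], span_color: Optional[int]) -> int:
    """Explicit None checks: 0 (black) stays a valid color."""
    if link_color is not None:
        return link_color
    if span_color is not None:
        return span_color
    return DEFAULT_LINK_BLUE


black = pack_rgb(0.0, 0.0, 0.0)
print(pick_with_or(black, None) == DEFAULT_LINK_BLUE)  # black was discarded
print(pick_with_none_checks(black, None) == black)     # black was kept
```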

Comment thread pdf2docx/text/TextSpan.py
Comment on lines +409 to +416
link_color = 0
for style in self.style:
    if style['type'] == RectType.HYPERLINK.value and style.get('color'):
        link_color = style['color']
        break

# Fallback to standard hyperlink blue only when source color is unknown.
color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76))

Copilot AI Mar 3, 2026

color_value = link_color or self.color or ... treats 0 (black) as falsy and will fall back to the default hyperlink blue. If self.color is legitimately black, this will change styling. Prefer explicit None checks (e.g., if link_color is not None else self.color) rather than or chaining for colors.

Suggested change
- link_color = 0
- for style in self.style:
-     if style['type'] == RectType.HYPERLINK.value and style.get('color'):
-         link_color = style['color']
-         break
- # Fallback to standard hyperlink blue only when source color is unknown.
- color_value = link_color or self.color or rgb_value((0.02, 0.39, 0.76))
+ link_color = None
+ for style in self.style:
+     if style['type'] == RectType.HYPERLINK.value and 'color' in style and style['color'] is not None:
+         link_color = style['color']
+         break
+ # Fallback to standard hyperlink blue only when source color is unknown.
+ if link_color is not None:
+     color_value = link_color
+ elif self.color is not None:
+     color_value = self.color
+ else:
+     color_value = rgb_value((0.02, 0.39, 0.76))

Comment thread pdf2docx/common/docx.py
new_run.text = text

text_node = OxmlElement('w:t')
text_node.text = text

Copilot AI Mar 3, 2026

add_hyperlink() builds a <w:t> element directly but does not set xml:space="preserve". If text contains leading/trailing spaces (common when spans include surrounding whitespace), WordprocessingML will trim/collapse them and the converted text can change. Mirror python-docx behavior by setting xml:space="preserve" when needed.

Suggested change
- text_node.text = text
+ text_node.text = text
+ # Preserve leading/trailing spaces so hyperlink text matches python-docx behavior.
+ if text and (text[0].isspace() or text[-1].isspace()):
+     text_node.set(qn('xml:space'), 'preserve')
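
The same rule can be tried outside python-docx. The sketch below uses xml.etree.ElementTree, where the xml: prefix is addressed through its fixed namespace URI; make_text_node is a hypothetical helper, not part of the PR.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("w", W_NS)


def make_text_node(text: str) -> ET.Element:
    """Build a w:t node, marking it xml:space="preserve" when edges are whitespace."""
    node = ET.Element(f"{{{W_NS}}}t")
    node.text = text
    if text and (text[0].isspace() or text[-1].isspace()):
        node.set(f"{{{XML_NS}}}space", "preserve")
    return node


padded = make_text_node(" link ")
plain = make_text_node("link")
print(ET.tostring(padded, encoding="unicode"))
```

The attribute is only set when needed, so text without surrounding whitespace serializes exactly as before.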

Comment on lines +258 to +261
rect = fitz.Rect(bbox)
if rect.is_empty:
    return 0

Copilot AI Mar 3, 2026

_estimate_hyperlink_color() returns 0 when the color is unknown, but rgb_value((0,0,0)) (black) is also 0, so “unknown” and “black” are indistinguishable. Consider returning None for “unknown” (and propagating that) so black hyperlinks can be represented correctly.
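
One way to read this suggestion is an estimator whose return type is Optional[int]. The sketch below is deliberately simplified: drawings are plain (rect, fill) pairs rather than the dictionaries PyMuPDF returns, the first overlapping fill wins, and pack_rgb is a hypothetical stand-in for rgb_value(). None of these names or choices come from the PR.

```python
from typing import Iterable, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Fill = Tuple[float, float, float]         # RGB components in 0..1


def pack_rgb(fill: Fill) -> int:
    """Hypothetical stand-in for rgb_value(): pack 0..1 floats into 0xRRGGBB."""
    r, g, b = fill
    return (int(r * 255) << 16) | (int(g * 255) << 8) | int(b * 255)


def estimate_link_color(
    hotspot: Rect, drawings: Iterable[Tuple[Rect, Optional[Fill]]]
) -> Optional[int]:
    """Return the packed fill of the first drawing overlapping the hotspot, else None."""
    x0, y0, x1, y1 = hotspot
    if x1 <= x0 or y1 <= y0:
        return None  # empty hotspot: color unknown, not black
    for rect, fill in drawings:
        if fill is None:
            continue  # stroke-only drawing, no fill to read
        if rect[0] < x1 and x0 < rect[2] and rect[1] < y1 and y0 < rect[3]:
            return pack_rgb(fill)
    return None  # nothing overlapped: color unknown


hotspot = (72.0, 100.0, 200.0, 114.0)
black_underline = ((72.0, 112.0, 200.0, 113.0), (0.0, 0.0, 0.0))
print(estimate_link_color(hotspot, [black_underline]))  # a real color, even though it is 0
print(estimate_link_color(hotspot, []))                 # unknown
```

With None reserved for "unknown", callers can distinguish a black underline from a missing one, which is what the explicit None checks in TextSpan.py would rely on.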

Comment thread test/test.py
with zipfile.ZipFile(docx_file) as zf:
    document_xml = zf.read('word/document.xml')
    rels_xml = zf.read('word/_rels/document.xml.rels')
    media_files = [name for name in zf.namelist() if name.startswith('word/media/')]

Copilot AI Mar 3, 2026

media_files == [] can be version-dependent because some zip writers include directory entries like word/media/ even with no images. Consider filtering out names ending with / or asserting there are no word/media/image* entries to avoid flaky failures.

Suggested change
- media_files = [name for name in zf.namelist() if name.startswith('word/media/')]
+ media_files = [
+     name for name in zf.namelist()
+     if name.startswith('word/media/') and not name.endswith('/')
+ ]
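
The difference between the two filters shows up with an in-memory archive that contains an explicit word/media/ directory entry but no images. The archive contents and the list_media_files helper below are made up for illustration.

```python
import io
import zipfile


def list_media_files(docx_bytes: bytes) -> list:
    """List files under word/media/, ignoring bare directory entries."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return [
            name for name in zf.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        ]


# Build a tiny archive with a directory entry and no image files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", "<w:document/>")
    zf.writestr("word/media/", "")  # directory entry only
archive = buffer.getvalue()

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    naive = [name for name in zf.namelist() if name.startswith("word/media/")]

print(naive, list_media_files(archive))
```

The naive prefix filter reports the directory entry as a media file, so an assertion such as media_files == [] would fail even though no image was emitted.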

Comment on lines +162 to 163
drawings = self.page_engine.get_cdrawings()
for link in self.page_engine.get_links():

Copilot AI Mar 3, 2026

get_cdrawings() is called here to estimate hyperlink colors, but _init_paths() calls get_cdrawings() again later during shape extraction in the same extract_raw_dict() flow. If get_cdrawings() is expensive, consider caching/reusing the drawings per page.
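
If the cost turns out to matter, one option is to fetch the drawings once per page and share the result between hyperlink preprocessing and path extraction. The sketch below uses functools.cached_property (Python 3.8+); PageDrawings and FakePage are hypothetical classes for illustration, and the PR's page classes may be organized differently.

```python
from functools import cached_property


class PageDrawings:
    """Hypothetical per-page holder that calls get_cdrawings() at most once."""

    def __init__(self, page_engine):
        self._page_engine = page_engine

    @cached_property
    def drawings(self):
        # Evaluated on first access; later accesses reuse the stored list.
        return self._page_engine.get_cdrawings()


class FakePage:
    """Test double that counts how often get_cdrawings() is called."""

    def __init__(self):
        self.calls = 0

    def get_cdrawings(self):
        self.calls += 1
        return [{"rect": (0, 0, 10, 1), "fill": (0, 0, 1)}]


page = FakePage()
cache = PageDrawings(page)
first = cache.drawings   # triggers the single get_cdrawings() call
second = cache.drawings  # served from the cache
print(page.calls)
```

The cache lives on the per-page object, so it is dropped along with the page and does not need explicit invalidation.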

Development

Successfully merging this pull request may close these issues.

Hyperlink conversion emits invalid OOXML and duplicate underline artifacts
