fix(workflow): add PDF extraction fallback by luochen211 · Pull Request #37525 · langgenius/dify

luochen211 · 2026-06-16T10:02:58Z

Summary

The workflow Document Extractor can currently return a successful but empty result for some valid PDFs because Dify delegates PDF text extraction to Graphon 0.5.x, whose PDF extractor only uses PDFium. This PR installs a small Dify-side fallback during workflow node registration: keep Graphon/PDFium as the primary extractor, but when it returns only whitespace for .pdf / application/pdf, retry the same bytes with pypdf.
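To make the described behavior concrete, here is a minimal sketch of the "primary first, retry on whitespace-only output" pattern. It is not the code from this PR: the dict-based `registry`, the `PDF_KEYS` tuple, and the helper names `with_fallback` and `install_pdf_fallback` are hypothetical, since Graphon's real registry API is not shown here. The only real library calls are `pypdf.PdfReader` and `page.extract_text()`.

```python
import io
from typing import Callable, Dict

# An extractor takes raw file bytes and returns extracted text.
Extractor = Callable[[bytes], str]

# Keys the fallback applies to (assumption: the registry is keyed by
# file extension and MIME type; the real Graphon registry may differ).
PDF_KEYS = (".pdf", "application/pdf")


def extract_with_pypdf(data: bytes) -> str:
    """Fallback extractor: read the same bytes with pypdf."""
    from pypdf import PdfReader  # imported lazily so the sketch loads without pypdf

    reader = PdfReader(io.BytesIO(data))
    # extract_text() may yield an empty result for a page; normalize to "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def with_fallback(primary: Extractor, fallback: Extractor) -> Extractor:
    """Wrap `primary` so `fallback` runs only when the result is blank."""

    def extract(data: bytes) -> str:
        text = primary(data)
        if text.strip():
            # Non-empty output from the primary extractor is returned unchanged.
            return text
        return fallback(data)

    return extract


def install_pdf_fallback(registry: Dict[str, Extractor], fallback: Extractor) -> None:
    """Wrap only the PDF entries of a (hypothetical) extractor registry."""
    for key in PDF_KEYS:
        if key in registry:
            registry[key] = with_fallback(registry[key], fallback)
```

Under these assumptions, calling `install_pdf_fallback(registry, extract_with_pypdf)` once at registration time would leave non-PDF extractors untouched and keep the primary extractor's output whenever it contains any non-whitespace text.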

This keeps existing non-empty extraction behavior unchanged, adds pypdf as a direct API dependency because Dify now imports it, and includes a unit test that patches the Graphon registry to reproduce the empty-PDFium case.
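The testing approach mentioned above (patching the registry so the primary extractor returns only whitespace) could look roughly like the following. This is an illustrative sketch, not the test file from this PR: `REGISTRY` and `extract_pdf` are hypothetical stand-ins, and only `unittest.mock.Mock` and `unittest.mock.patch.dict` are real APIs.

```python
from unittest.mock import Mock, patch

# Hypothetical stand-in for the extractor registry; the real Graphon
# registry and its key names are assumptions for illustration only.
REGISTRY = {".pdf": lambda data: "real text"}


def extract_pdf(data: bytes, fallback) -> str:
    """Use the registered extractor, retrying with `fallback` on blank output."""
    text = REGISTRY[".pdf"](data)
    return text if text.strip() else fallback(data)


def test_fallback_used_when_primary_returns_whitespace():
    fallback = Mock(return_value="recovered text")
    # Temporarily replace the primary extractor to mimic the empty-PDFium case.
    with patch.dict(REGISTRY, {".pdf": lambda data: " \n\t"}):
        assert extract_pdf(b"%PDF-", fallback) == "recovered text"
    fallback.assert_called_once_with(b"%PDF-")


def test_fallback_skipped_when_primary_has_text():
    fallback = Mock(return_value="recovered text")
    # The unpatched registry returns non-empty text, so behavior is unchanged.
    assert extract_pdf(b"%PDF-", fallback) == "real text"
    fallback.assert_not_called()
```

`patch.dict` restores the original registry entry when the `with` block exits, so the two tests do not affect each other.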

Refs langgenius/graphon#189.

Screenshots

N/A, backend workflow extraction behavior.

Checklist

This change requires a documentation update, included: Dify Document
I understand that this PR may be closed in case there was no previous discussion or issues. (This does not apply to typos!)
I have added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
I have updated the documentation accordingly. (N/A, no user-facing docs change.)
I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods. Full command not run; scoped backend validation is listed below.

Validation

uv lock --project api --check
uv run --project api pytest -o addopts='' api/tests/unit_tests/core/workflow/test_document_extractor_pdf_fallback.py api/tests/unit_tests/core/workflow/test_node_factory.py (52 passed)
uv run --project api --group dev ruff format --check api/core/workflow/document_extractor_pdf_fallback.py api/core/workflow/node_factory.py api/tests/unit_tests/core/workflow/test_document_extractor_pdf_fallback.py
uv run --project api --group dev ruff check api/core/workflow/document_extractor_pdf_fallback.py api/core/workflow/node_factory.py api/tests/unit_tests/core/workflow/test_document_extractor_pdf_fallback.py
uv --directory api run pyrefly check core/workflow/document_extractor_pdf_fallback.py core/workflow/node_factory.py (0 errors; existing warnings in node_factory.py)
uv --directory api run mypy core/workflow/document_extractor_pdf_fallback.py core/workflow/node_factory.py --check-untyped-defs --disable-error-code=import-untyped

github-actions · 2026-06-16T10:05:45Z

Pyrefly Type Coverage

Metric	Base	PR	Delta
Type coverage	48.58%	48.58%	-0.00%
Strict coverage	48.09%	48.09%	-0.00%
Typed symbols	27,986	27,987	+1
Untyped symbols	29,920	29,923	+3
Modules	2892	2894	+2

fix(workflow): add PDF extraction fallback

fc79ada

luochen211 requested a review from a team June 16, 2026 10:02

luochen211 requested review from QuantumGhost and laipz8200 as code owners June 16, 2026 10:03

dosubot Bot added the size:M label (this PR changes 30-99 lines, ignoring generated files) Jun 16, 2026

luochen211 mentioned this pull request Jun 16, 2026

The 'Document Extractor' node failed to recognize the PDF file. #37488

Closed

6 tasks

luochen211 wants to merge 1 commit into langgenius:main from luochen211:fix-document-extractor-pdf-recognition
