fix(workflow): add PDF extraction fallback #37525

Open
luochen211 wants to merge 1 commit into
langgenius:main from
luochen211:fix-document-extractor-pdf-recognition

Conversation

@luochen211

Contributor

Summary

Fixes #37488.

The workflow Document Extractor can currently return a successful but empty result for some valid PDFs because Dify delegates PDF text extraction to Graphon 0.5.x, whose PDF extractor only uses PDFium. This PR installs a small Dify-side fallback during workflow node registration: keep Graphon/PDFium as the primary extractor, but when it returns only whitespace for .pdf / application/pdf, retry the same bytes with pypdf.

This keeps existing non-empty extraction behavior unchanged, adds pypdf as a direct API dependency because Dify now imports it, and includes a unit test that patches the Graphon registry to reproduce the empty-PDFium case.
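The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `with_pdf_fallback`, `extract_with_pypdf`, and the `file_extension` / `mime_type` keyword arguments are hypothetical names, and the primary extractor is modeled as a plain callable rather than the actual Graphon registry entry. The only real library calls are `pypdf.PdfReader` and `page.extract_text()`.

```python
import io
from typing import Callable, Optional

# Hypothetical signature: an extractor takes raw file bytes and returns text.
Extractor = Callable[[bytes], str]


def _is_pdf(file_extension: Optional[str], mime_type: Optional[str]) -> bool:
    """Return True for `.pdf` files or `application/pdf` content."""
    return (file_extension or "").lower() == ".pdf" or (
        (mime_type or "").lower() == "application/pdf"
    )


def extract_with_pypdf(file_content: bytes) -> str:
    """Fallback extractor: read the same bytes with pypdf."""
    # Imported lazily so the wrapper can be exercised without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(file_content))
    # extract_text() may yield an empty result for pages with no text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def with_pdf_fallback(
    primary: Extractor,
    fallback: Extractor = extract_with_pypdf,
) -> Callable[..., str]:
    """Wrap `primary` so whitespace-only PDF results are retried with `fallback`."""

    def extract(
        file_content: bytes,
        *,
        file_extension: Optional[str] = None,
        mime_type: Optional[str] = None,
    ) -> str:
        text = primary(file_content)
        # Non-empty output, or a non-PDF input, is returned unchanged.
        if text.strip() or not _is_pdf(file_extension, mime_type):
            return text
        return fallback(file_content)

    return extract
```

Passing stub callables for `primary` and `fallback` is one way to reproduce the empty-primary case in a unit test without a real PDF, which is similar in spirit to patching the registry as the PR's test does.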

Refs langgenius/graphon#189.

Screenshots

N/A, backend workflow extraction behavior.

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This does not apply to typos!)
  • I have added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I have updated the documentation accordingly. (N/A, no user-facing docs change.)
  • I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods. Full command not run; scoped backend validation is listed below.

Validation

  • uv lock --project api --check
  • uv run --project api pytest -o addopts='' api/tests/unit_tests/core/workflow/test_document_extractor_pdf_fallback.py api/tests/unit_tests/core/workflow/test_node_factory.py (52 passed)
  • uv run --project api --group dev ruff format --check api/core/workflow/document_extractor_pdf_fallback.py api/core/workflow/node_factory.py api/tests/unit_tests/core/workflow/test_document_extractor_pdf_fallback.py
  • uv run --project api --group dev ruff check api/core/workflow/document_extractor_pdf_fallback.py api/core/workflow/node_factory.py api/tests/unit_tests/core/workflow/test_document_extractor_pdf_fallback.py
  • uv --directory api run pyrefly check core/workflow/document_extractor_pdf_fallback.py core/workflow/node_factory.py (0 errors; existing warnings in node_factory.py)
  • uv --directory api run mypy core/workflow/document_extractor_pdf_fallback.py core/workflow/node_factory.py --check-untyped-defs --disable-error-code=import-untyped

@luochen211 luochen211 requested a review from a team June 16, 2026 10:02
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 16, 2026
@github-actions

Contributor

Pyrefly Type Coverage

| Metric | Base | PR | Delta |
| --- | --- | --- | --- |
| Type coverage | 48.58% | 48.58% | -0.00% |
| Strict coverage | 48.09% | 48.09% | -0.00% |
| Typed symbols | 27,986 | 27,987 | +1 |
| Untyped symbols | 29,920 | 29,923 | +3 |
| Modules | 2892 | 2894 | +2 |

Development

Successfully merging this pull request may close these issues.

The 'Document Extractor' node failed to recognize the PDF file.

1 participant