Skip to content

Replace PaddleOCR SDK with HTTP async API due to pyyaml conflict#3247

Open
Rander7 wants to merge 2 commits into
langgenius:mainfrom
Rander7:refactor/use-paddleocr-sdk
Open

Replace PaddleOCR SDK with HTTP async API due to pyyaml conflict#3247
Rander7 wants to merge 2 commits into
langgenius:mainfrom
Rander7:refactor/use-paddleocr-sdk

Conversation

@Rander7
Copy link
Copy Markdown

@Rander7 Rander7 commented Jun 4, 2026

Why This Change Is Necessary

The official PaddleOCR SDK has a pyyaml dependency conflict with dify_plugin:

  • dify_plugin requires pyyaml >= 6.0.3
  • paddleocr (via paddlex) requires pyyaml == 6.0.2

uv cannot resolve this conflict. This PR replaces SDK calls with direct HTTP requests to the async Job API, following the exact same implementation logic as the SDK (submit → poll → fetch).

Breaking Changes

None for end users - the tool interface and output format remain identical. The plugin continues to accept the same credentials and file inputs.

Internal Changes

Dependencies

  • No change: continues to use requests for HTTP calls
  • Removed: paddleocr>=3.6.0 SDK dependency

HTTP API Integration (follows SDK implementation)

  • Implemented _submit_job() - POST to /api/v2/ocr/jobs
  • Implemented _poll_job() - exponential backoff polling (3s → max 15s, 600s timeout)
  • Implemented _parse_ocr_result() / _parse_doc_parsing_result() - JSONL parsing
  • Authentication: Authorization: Bearer {token} + Client-Platform: dify
  • Result format: compatible dict structure matching SDK's output

Code Changes

  • tools/utils.py: Add HTTP async Job API functions, remove SDK imports
  • Tool files: Replace client.ocr()/parse_document() with call_paddleocr_api()
  • provider/paddleocr.py: Update credential validation to use HTTP API
  • test_manual.py: Delete SDK-specific test script

Testing

All three tools maintain their original behavior:

  • Text Recognition (PP-OCRv5)
  • Document Parsing (PP-StructureV3)
  • VL Document Parsing (PaddleOCR-VL-1.6)

Module imports and utility functions verified.

@dosubot dosubot Bot added the size:XL This PR changes 500-999 lines, ignoring generated files. label Jun 4, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the PaddleOCR integration to use the official paddleocr SDK instead of direct HTTP requests, updating the provider and tools to utilize PaddleOCRClient and its corresponding option builders. Feedback on these changes highlights several important improvements: keeping requests in the dependencies since it is still used for image downloads, checking if a string is an existing file path before performing base64 validation to avoid false positives, handling trailing slashes in the API URL extraction, and introducing a bytes_to_temp_file helper to write raw binary data directly to temporary files, avoiding redundant base64 encoding and decoding.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 8 to 11
dependencies = [
"dify_plugin>=0.9.0",
"requests>=2.34.2",
"paddleocr>=3.6.0",
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The requests library is still imported and used in tools/paddleocr/tools/utils.py inside the download_image_from_url function. Removing it from the explicit dependencies in pyproject.toml can lead to runtime errors if it is not transitively installed by other packages. Please keep requests in the dependencies list.

Suggested change
dependencies = [
"dify_plugin>=0.9.0",
"requests>=2.34.2",
"paddleocr>=3.6.0",
]
dependencies = [
"dify_plugin>=0.9.0",
"paddleocr>=3.6.0",
"requests>=2.34.2",
]

Comment on lines 91 to +100
if isinstance(file_value, str):
return file_value, explicit_file_type
# Check if it's a URL
if file_value.startswith(("http://", "https://")):
return file_value, False, explicit_file_type
# Check if it's base64 (data URL or raw)
if file_value.startswith("data:") or is_likely_base64(file_value):
temp_file = base64_to_temp_file(extract_base64(file_value))
return temp_file, True, explicit_file_type
# It's a file path
return file_value, False, explicit_file_type
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The is_likely_base64 check can produce false positives for plain text strings (e.g., alphanumeric file paths or identifiers of length >= 32 with no special characters), causing them to be incorrectly treated as base64 and written to temporary files. To prevent this, check if the string is an existing local file path using os.path.exists before performing the base64 check.

Suggested change
if isinstance(file_value, str):
return file_value, explicit_file_type
# Check if it's a URL
if file_value.startswith(("http://", "https://")):
return file_value, False, explicit_file_type
# Check if it's base64 (data URL or raw)
if file_value.startswith("data:") or is_likely_base64(file_value):
temp_file = base64_to_temp_file(extract_base64(file_value))
return temp_file, True, explicit_file_type
# It's a file path
return file_value, False, explicit_file_type
if isinstance(file_value, str):
# Check if it's a URL
if file_value.startswith(("http://", "https://")):
return file_value, False, explicit_file_type
# Check if it's an existing file path
if os.path.exists(file_value):
return file_value, False, explicit_file_type
# Check if it's base64 (data URL or raw)
if file_value.startswith("data:") or is_likely_base64(file_value):
temp_file = base64_to_temp_file(extract_base64(file_value))
return temp_file, True, explicit_file_type
# It's a file path
return file_value, False, explicit_file_type

Comment on lines +46 to +51
parsed = urlparse(api_url)
# Remove common PaddleOCR endpoints
path = parsed.path
if path in ("/ocr", "/layout-parsing", "/paddleocr"):
path = ""
return f"{parsed.scheme}://{parsed.netloc}{path}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If the user-provided api_url contains a trailing slash (e.g., https://example.com/ocr/), parsed.path will be /ocr/. This won't match the exact strings in the common endpoints check, leaving the trailing slash intact and potentially causing malformed URLs when the SDK appends endpoints. Strip trailing slashes when checking for common endpoints.

Suggested change
parsed = urlparse(api_url)
# Remove common PaddleOCR endpoints
path = parsed.path
if path in ("/ocr", "/layout-parsing", "/paddleocr"):
path = ""
return f"{parsed.scheme}://{parsed.netloc}{path}"
parsed = urlparse(api_url)
# Remove common PaddleOCR endpoints, ignoring trailing slashes
path = parsed.path
if path.rstrip("/") in ("/ocr", "/layout-parsing", "/paddleocr"):
path = ""
return f"{parsed.scheme}://{parsed.netloc}{path}"

Comment on lines 85 to +89
if isinstance(file_value, File):
encoded_file = base64.b64encode(file_value.blob).decode("utf-8")
if explicit_file_type is not None:
return encoded_file, explicit_file_type
return encoded_file, infer_file_type(file_value)
temp_file = base64_to_temp_file(encoded_file, infer_file_extension(file_value))
file_type_code = explicit_file_type if explicit_file_type is not None else infer_file_type(file_value)
return temp_file, True, file_type_code
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When file_value is a File object, its binary content is already available in file_value.blob. Encoding it to a base64 string only to immediately decode it back to bytes in base64_to_temp_file introduces unnecessary CPU and memory overhead. Instead, write the raw bytes directly to the temporary file using a helper function like bytes_to_temp_file.

Suggested change
if isinstance(file_value, File):
encoded_file = base64.b64encode(file_value.blob).decode("utf-8")
if explicit_file_type is not None:
return encoded_file, explicit_file_type
return encoded_file, infer_file_type(file_value)
temp_file = base64_to_temp_file(encoded_file, infer_file_extension(file_value))
file_type_code = explicit_file_type if explicit_file_type is not None else infer_file_type(file_value)
return temp_file, True, file_type_code
if isinstance(file_value, File):
temp_file = bytes_to_temp_file(file_value.blob, infer_file_extension(file_value))
file_type_code = explicit_file_type if explicit_file_type is not None else infer_file_type(file_value)
return temp_file, True, file_type_code

Comment on lines +162 to +174
def base64_to_temp_file(base64_str: str, suffix: str = ".png") -> str:
"""Save base64 string to a temporary file.

Args:
base64_str: Base64 encoded string
suffix: File extension suffix

Returns:
Path to the temporary file
"""
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f:
f.write(base64.b64decode(base64_str))
return f.name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Introduce a bytes_to_temp_file helper to write raw binary data directly to a temporary file, and refactor base64_to_temp_file to reuse it. This avoids redundant base64 encoding/decoding for File objects.

def bytes_to_temp_file(data: bytes, suffix: str = ".png") -> str:
    """Save bytes to a temporary file.

    Args:
        data: Binary data
        suffix: File extension suffix

    Returns:
        Path to the temporary file
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f:
        f.write(data)
        return f.name


def base64_to_temp_file(base64_str: str, suffix: str = ".png") -> str:
    """Save base64 string to a temporary file.

    Args:
        base64_str: Base64 encoded string
        suffix: File extension suffix

    Returns:
        Path to the temporary file
    """
    return bytes_to_temp_file(base64.b64decode(base64_str), suffix)

@Rander7 Rander7 temporarily deployed to tools/paddleocr June 4, 2026 05:54 — with GitHub Actions Inactive
@Rander7 Rander7 temporarily deployed to tools/paddleocr June 4, 2026 09:00 — with GitHub Actions Inactive
@Rander7 Rander7 force-pushed the refactor/use-paddleocr-sdk branch from c6856a5 to 1a3d077 Compare June 5, 2026 05:52
@Rander7 Rander7 temporarily deployed to tools/paddleocr June 5, 2026 05:53 — with GitHub Actions Inactive
@Rander7 Rander7 temporarily deployed to tools/paddleocr June 5, 2026 05:57 — with GitHub Actions Inactive
@Rander7 Rander7 force-pushed the refactor/use-paddleocr-sdk branch from fa8810d to 0f7d8c2 Compare June 5, 2026 09:14
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Jun 5, 2026
@Rander7 Rander7 temporarily deployed to tools/paddleocr June 5, 2026 09:14 — with GitHub Actions Inactive
@Rander7 Rander7 changed the title refactor: use PaddleOCR Python SDK instead of direct API calls Replace PaddleOCR SDK with HTTP async API due to pyyaml conflict Jun 5, 2026
@Rander7 Rander7 temporarily deployed to tools/paddleocr June 5, 2026 09:32 — with GitHub Actions Inactive
@Rander7 Rander7 temporarily deployed to tools/paddleocr June 5, 2026 09:46 — with GitHub Actions Inactive
The official PaddleOCR SDK has a pyyaml dependency conflict with dify_plugin:
- dify_plugin requires pyyaml >= 6.0.3
- paddleocr (via paddlex) requires pyyaml == 6.0.2

This change replaces SDK calls with direct HTTP requests to the async Job API,
following the exact same implementation logic as the SDK (submit → poll → fetch).

Key changes:
- tools/utils.py: Implement HTTP async Job API (submit → poll → fetch)
- Tool files: Replace SDK calls with call_paddleocr_api()
- provider/paddleocr.yaml: Simplify to single optional base_url
- Tests: Update to mock HTTP API instead of SDK
@Rander7 Rander7 force-pushed the refactor/use-paddleocr-sdk branch from 4d44f74 to d00aa48 Compare June 5, 2026 10:11
@Rander7 Rander7 deployed to tools/paddleocr June 5, 2026 10:13 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant