Replace PaddleOCR SDK with HTTP async API due to pyyaml conflict#3247
Replace PaddleOCR SDK with HTTP async API due to pyyaml conflict#3247Rander7 wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the PaddleOCR integration to use the official paddleocr SDK instead of direct HTTP requests, updating the provider and tools to utilize PaddleOCRClient and its corresponding option builders. Feedback on these changes highlights several important improvements: keeping requests in the dependencies since it is still used for image downloads, checking if a string is an existing file path before performing base64 validation to avoid false positives, handling trailing slashes in the API URL extraction, and introducing a bytes_to_temp_file helper to write raw binary data directly to temporary files, avoiding redundant base64 encoding and decoding.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| dependencies = [ | ||
| "dify_plugin>=0.9.0", | ||
| "requests>=2.34.2", | ||
| "paddleocr>=3.6.0", | ||
| ] |
There was a problem hiding this comment.
The requests library is still imported and used in tools/paddleocr/tools/utils.py inside the download_image_from_url function. Removing it from the explicit dependencies in pyproject.toml can lead to runtime errors if it is not transitively installed by other packages. Please keep requests in the dependencies list.
| dependencies = [ | |
| "dify_plugin>=0.9.0", | |
| "requests>=2.34.2", | |
| "paddleocr>=3.6.0", | |
| ] | |
| dependencies = [ | |
| "dify_plugin>=0.9.0", | |
| "paddleocr>=3.6.0", | |
| "requests>=2.34.2", | |
| ] |
| if isinstance(file_value, str): | ||
| return file_value, explicit_file_type | ||
| # Check if it's a URL | ||
| if file_value.startswith(("http://", "https://")): | ||
| return file_value, False, explicit_file_type | ||
| # Check if it's base64 (data URL or raw) | ||
| if file_value.startswith("data:") or is_likely_base64(file_value): | ||
| temp_file = base64_to_temp_file(extract_base64(file_value)) | ||
| return temp_file, True, explicit_file_type | ||
| # It's a file path | ||
| return file_value, False, explicit_file_type |
There was a problem hiding this comment.
The is_likely_base64 check can produce false positives for plain text strings (e.g., alphanumeric file paths or identifiers of length >= 32 with no special characters), causing them to be incorrectly treated as base64 and written to temporary files. To prevent this, check if the string is an existing local file path using os.path.exists before performing the base64 check.
| if isinstance(file_value, str): | |
| return file_value, explicit_file_type | |
| # Check if it's a URL | |
| if file_value.startswith(("http://", "https://")): | |
| return file_value, False, explicit_file_type | |
| # Check if it's base64 (data URL or raw) | |
| if file_value.startswith("data:") or is_likely_base64(file_value): | |
| temp_file = base64_to_temp_file(extract_base64(file_value)) | |
| return temp_file, True, explicit_file_type | |
| # It's a file path | |
| return file_value, False, explicit_file_type | |
| if isinstance(file_value, str): | |
| # Check if it's a URL | |
| if file_value.startswith(("http://", "https://")): | |
| return file_value, False, explicit_file_type | |
| # Check if it's an existing file path | |
| if os.path.exists(file_value): | |
| return file_value, False, explicit_file_type | |
| # Check if it's base64 (data URL or raw) | |
| if file_value.startswith("data:") or is_likely_base64(file_value): | |
| temp_file = base64_to_temp_file(extract_base64(file_value)) | |
| return temp_file, True, explicit_file_type | |
| # It's a file path | |
| return file_value, False, explicit_file_type |
| parsed = urlparse(api_url) | ||
| # Remove common PaddleOCR endpoints | ||
| path = parsed.path | ||
| if path in ("/ocr", "/layout-parsing", "/paddleocr"): | ||
| path = "" | ||
| return f"{parsed.scheme}://{parsed.netloc}{path}" |
There was a problem hiding this comment.
If the user-provided api_url contains a trailing slash (e.g., https://example.com/ocr/), parsed.path will be /ocr/. This won't match the exact strings in the common endpoints check, leaving the trailing slash intact and potentially causing malformed URLs when the SDK appends endpoints. Strip trailing slashes when checking for common endpoints.
| parsed = urlparse(api_url) | |
| # Remove common PaddleOCR endpoints | |
| path = parsed.path | |
| if path in ("/ocr", "/layout-parsing", "/paddleocr"): | |
| path = "" | |
| return f"{parsed.scheme}://{parsed.netloc}{path}" | |
| parsed = urlparse(api_url) | |
| # Remove common PaddleOCR endpoints, ignoring trailing slashes | |
| path = parsed.path | |
| if path.rstrip("/") in ("/ocr", "/layout-parsing", "/paddleocr"): | |
| path = "" | |
| return f"{parsed.scheme}://{parsed.netloc}{path}" |
| if isinstance(file_value, File): | ||
| encoded_file = base64.b64encode(file_value.blob).decode("utf-8") | ||
| if explicit_file_type is not None: | ||
| return encoded_file, explicit_file_type | ||
| return encoded_file, infer_file_type(file_value) | ||
| temp_file = base64_to_temp_file(encoded_file, infer_file_extension(file_value)) | ||
| file_type_code = explicit_file_type if explicit_file_type is not None else infer_file_type(file_value) | ||
| return temp_file, True, file_type_code |
There was a problem hiding this comment.
When file_value is a File object, its binary content is already available in file_value.blob. Encoding it to a base64 string only to immediately decode it back to bytes in base64_to_temp_file introduces unnecessary CPU and memory overhead. Instead, write the raw bytes directly to the temporary file using a helper function like bytes_to_temp_file.
| if isinstance(file_value, File): | |
| encoded_file = base64.b64encode(file_value.blob).decode("utf-8") | |
| if explicit_file_type is not None: | |
| return encoded_file, explicit_file_type | |
| return encoded_file, infer_file_type(file_value) | |
| temp_file = base64_to_temp_file(encoded_file, infer_file_extension(file_value)) | |
| file_type_code = explicit_file_type if explicit_file_type is not None else infer_file_type(file_value) | |
| return temp_file, True, file_type_code | |
| if isinstance(file_value, File): | |
| temp_file = bytes_to_temp_file(file_value.blob, infer_file_extension(file_value)) | |
| file_type_code = explicit_file_type if explicit_file_type is not None else infer_file_type(file_value) | |
| return temp_file, True, file_type_code |
| def base64_to_temp_file(base64_str: str, suffix: str = ".png") -> str: | ||
| """Save base64 string to a temporary file. | ||
|
|
||
| Args: | ||
| base64_str: Base64 encoded string | ||
| suffix: File extension suffix | ||
|
|
||
| Returns: | ||
| Path to the temporary file | ||
| """ | ||
| with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f: | ||
| f.write(base64.b64decode(base64_str)) | ||
| return f.name |
There was a problem hiding this comment.
Introduce a bytes_to_temp_file helper to write raw binary data directly to a temporary file, and refactor base64_to_temp_file to reuse it. This avoids redundant base64 encoding/decoding for File objects.
def bytes_to_temp_file(data: bytes, suffix: str = ".png") -> str:
"""Save bytes to a temporary file.
Args:
data: Binary data
suffix: File extension suffix
Returns:
Path to the temporary file
"""
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f:
f.write(data)
return f.name
def base64_to_temp_file(base64_str: str, suffix: str = ".png") -> str:
"""Save base64 string to a temporary file.
Args:
base64_str: Base64 encoded string
suffix: File extension suffix
Returns:
Path to the temporary file
"""
return bytes_to_temp_file(base64.b64decode(base64_str), suffix)c6856a5 to
1a3d077
Compare
fa8810d to
0f7d8c2
Compare
The official PaddleOCR SDK has a pyyaml dependency conflict with dify_plugin: - dify_plugin requires pyyaml >= 6.0.3 - paddleocr (via paddlex) requires pyyaml == 6.0.2 This change replaces SDK calls with direct HTTP requests to the async Job API, following the exact same implementation logic as the SDK (submit → poll → fetch). Key changes: - tools/utils.py: Implement HTTP async Job API (submit → poll → fetch) - Tool files: Replace SDK calls with call_paddleocr_api() - provider/paddleocr.yaml: Simplify to single optional base_url - Tests: Update to mock HTTP API instead of SDK
4d44f74 to
d00aa48
Compare
Why This Change Is Necessary
The official PaddleOCR SDK has a pyyaml dependency conflict with dify_plugin:
pyyaml >= 6.0.3pyyaml == 6.0.2uv cannot resolve this conflict. This PR replaces SDK calls with direct HTTP requests to the async Job API, following the exact same implementation logic as the SDK (submit → poll → fetch).
Breaking Changes
None for end users - the tool interface and output format remain identical. The plugin continues to accept the same credentials and file inputs.
Internal Changes
Dependencies
requestsfor HTTP callspaddleocr>=3.6.0SDK dependencyHTTP API Integration (follows SDK implementation)
_submit_job()- POST to/api/v2/ocr/jobs_poll_job()- exponential backoff polling (3s → max 15s, 600s timeout)_parse_ocr_result()/_parse_doc_parsing_result()- JSONL parsingAuthorization: Bearer {token}+Client-Platform: difyCode Changes
tools/utils.py: Add HTTP async Job API functions, remove SDK importsclient.ocr()/parse_document()withcall_paddleocr_api()provider/paddleocr.py: Update credential validation to use HTTP APItest_manual.py: Delete SDK-specific test scriptTesting
All three tools maintain their original behavior:
Module imports and utility functions verified.