
Fix /pdf processing flow: use parsed arxiv_id, restore sync processing, and return paper_id #1

Draft
ponyfly6 wants to merge 1 commit into main from codex/bug

@ponyfly6 (Owner)

Motivation

  • Tests were failing for two reasons: _handle_pdf_command sometimes returned None, and it read the arXiv id directly from args[1], which picks up the wrong token when flags are present.
  • The CLI /pdf flow should update the database immediately and return a paper_id, so that synchronous CLI usage behaves predictably.
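
The args[1] problem described above can be illustrated with a minimal sketch. The helper name split_flags_and_positionals, the `--force` flag, and the assumption that args[0] holds the command token are all hypothetical and chosen only for illustration; the repository's actual parser may differ.

```python
def split_flags_and_positionals(args):
    """Separate '--flag' tokens from positional tokens (hypothetical helper)."""
    flags = [a for a in args if a.startswith("--")]
    positionals = [a for a in args if not a.startswith("--")]
    return flags, positionals


# Assumed shape of the argument list: command first, then flags and the id.
# '--force' is a made-up flag used only to show the failure mode.
args = ["/pdf", "--force", "2301.00001"]

# Reading index 1 directly grabs the flag instead of the arXiv id.
naive_id = args[1]

# Taking the id from the parsed positionals is independent of flag placement.
_, positionals = split_flags_and_positionals(args)
parsed_id = positionals[1] if len(positionals) > 1 else None
```

Here naive_id ends up holding the flag, while parsed_id holds the arXiv id regardless of where the flag appears.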

Description

  • Use the previously parsed arxiv_id variable (arxiv_id_arg = arxiv_id) instead of reading args[1] directly to avoid mis-parsing flags and positional arguments.
  • Restore synchronous PDF processing as the default whenever pdf_processing_method is not GeminiAsync. The synchronous path extracts text with tools.extract_text_from_pdf_gemini, saves the blob via tools.save_text_blob, prepares self.pending_pdf_context, updates processed_timestamp and status, and returns the created paper_id.
  • Keep the async ingestion path available behind pdf_processing_method = "GeminiAsync" and ensure the function returns the paper_id in the async path as well.
  • Add explicit DB status updates for error cases (e.g., error_extraction, error_blob, error_processing, error_file_not_found) and improve logging/error handling around synchronous processing.
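
The control flow described in these bullets might look roughly like the sketch below. It is not the PR's code: the function process_pdf, the in-memory dict standing in for the database, the injected callables (stand-ins for tools.extract_text_from_pdf_gemini and tools.save_text_blob, whose real signatures are not shown here), the "queued" and "processed" status names, the mapping of each failure to a specific error status, and returning paper_id on error paths are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def process_pdf(paper_id, pdf_path, method, db, extract_text, save_blob, enqueue_async):
    """Dispatch between the async and sync paths; always return paper_id (assumed)."""
    record = db.setdefault(paper_id, {"status": "pending"})

    # Async ingestion stays available behind the explicit setting.
    if method == "GeminiAsync":
        enqueue_async(paper_id, pdf_path)
        record["status"] = "queued"  # hypothetical status name
        return paper_id

    # Synchronous path is the default for every other setting.
    try:
        text = extract_text(pdf_path)
    except FileNotFoundError:
        record["status"] = "error_file_not_found"
        return paper_id
    except Exception:
        record["status"] = "error_processing"
        return paper_id

    if not text:
        record["status"] = "error_extraction"
        return paper_id

    if not save_blob(paper_id, text):
        record["status"] = "error_blob"
        return paper_id

    # Success: stamp the record so the CLI sees the update immediately.
    record["processed_timestamp"] = datetime.now(timezone.utc).isoformat()
    record["status"] = "processed"  # hypothetical status name
    return paper_id
```

Passing the extraction and storage functions in as parameters keeps the sketch self-contained and makes each error branch easy to exercise in a test without touching a real PDF or database.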

Testing

  • Ran pytest -q after the changes: 27 passed, 1 skipped.
  • Verified that the previously failing /pdf integration tests now pass under the synchronous path.

Codex Task
