
Feature/doc transform#332

Closed
kartpop wants to merge 15 commits into main from feature/doc-transform

Conversation

@kartpop
Collaborator

@kartpop kartpop commented Aug 13, 2025

Summary

Target issue is #PLEASE_TYPE_ISSUE_NUMBER
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensure the change is covered by test cases.

Notes

Add here any other information the reviewer may need.

Summary by CodeRabbit

  • New Features

    • Upload documents with optional format conversion; transformation requests return 202 with job details and status URL.
    • Background processing for document transformations with progress tracking.
    • Transformed documents are linked to their original source.
  • API

    • Added endpoints to fetch one or multiple transformation jobs by ID.
    • Enhanced document upload to accept target_format and optional transformer.
  • Chores

    • Increased minimum Python version to 3.11.

@coderabbitai

coderabbitai Bot commented Aug 13, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds document transformation capability: new models and migrations, CRUD helpers, FastAPI routes for starting and querying jobs, a background service to execute transformations, and a transformer framework (abstract base, registry, test and Zerox-backed implementations). Upload endpoint now optionally triggers async transformation jobs. Dependency and Python version updated.
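
The transformer framework described above (abstract base, registry, test implementation) can be sketched roughly as follows. The class names and the `transform(input_path) -> str` signature mirror the files reviewed below; the exact registry contents and dispatch helper are simplified assumptions, not the PR's verbatim code.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Transformer(ABC):
    """Abstract base: converts a source file into target-format text."""

    @abstractmethod
    def transform(self, input_path: Path) -> str:
        """Return the transformed document content as a string."""


class TestTransformer(Transformer):
    """Trivial implementation used to exercise the pipeline end to end."""

    def transform(self, input_path: Path) -> str:
        return "Lorem ipsum dolor sit amet."


# Registry mapping transformer names to classes (mirrors registry.py;
# the real registry also maps "default" and "zerox" to ZeroxTransformer)
TRANSFORMERS: dict[str, type[Transformer]] = {
    "test": TestTransformer,
}


def convert_document(input_path: Path, transformer_name: str) -> str:
    """Look up the transformer by name and run the conversion."""
    transformer_cls = TRANSFORMERS[transformer_name]
    return transformer_cls().transform(input_path)
```

A Zerox-backed transformer plugs into the same registry by subclassing `Transformer` and registering under its own name.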

Changes

Cohort / File(s) Summary
Database migrations
backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py, backend/app/alembic/versions/93b86c1246b1_create_doc_transformation_job_table.py
Add document.source_document_id (self-FK); create doc_transformation_job table with status enum and FKs; adjust/downgrade related FKs.
Models
backend/app/models/document.py, backend/app/models/doc_transformation_job.py, backend/app/models/__init__.py
Add Document.source_document_id; introduce DocTransformationJob and TransformationStatus; export model via package init.
CRUD
backend/app/crud/doc_transformation_job.py
CRUD for transformation jobs: create, read, update status, list.
API routes – transformations
backend/app/api/routes/doc_transformation_job.py
New endpoints to fetch one or multiple transformation jobs by ID.
API routes – documents
backend/app/api/routes/documents.py
Upload route becomes async; accepts target_format/transformer; starts background transformation job; returns 202 with job info when applicable.
Doctransform core
backend/app/core/doctransform/transformer.py, .../zerox_transformer.py, .../test_transformer.py, .../registry.py, .../service.py
Define Transformer base; add Zerox and test transformers; registry for formats/transformers and conversion; background job orchestration with retries, storage I/O, and record updates.
Project config
backend/pyproject.toml
Require Python >=3.11; add py-zerox dependency.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant API as Documents API
  participant Service as Transform Service
  participant DB
  participant Storage

  Client->>API: POST /documents/upload (file, target_format, transformer?)
  API->>DB: Create DocTransformationJob (PENDING)
  API->>Service: Schedule execute_job(user_id, job_id, transformer, target_format)
  API-->>Client: 202 Accepted (job_id, status_check_url, source doc info)

  Service->>DB: Mark job PROCESSING
  Service->>DB: Load source Document
  Service->>Storage: Download source file
  Service->>Service: convert_document(source, transformer)
  Service->>Storage: Upload transformed file
  Service->>DB: Create transformed Document (source_document_id set)
  Service->>DB: Mark job COMPLETED (transformed_document_id)
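The job lifecycle in the diagram above (PENDING on creation, PROCESSING while the worker runs, COMPLETED or FAILED at the end) can be sketched as below. The in-memory dict stands in for the doc_transformation_job table, and the helper names are illustrative rather than the PR's exact service API.

```python
import uuid

# Hypothetical in-memory job store standing in for the DB table
JOBS: dict[str, dict] = {}


def start_job(source_document_id: str) -> str:
    """Create a PENDING job record and return its id (sketch of service.start_job)."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "PENDING", "source_document_id": source_document_id}
    return job_id


def execute_job(job_id: str) -> None:
    """Background worker: mark PROCESSING, convert, then COMPLETED (errors -> FAILED)."""
    job = JOBS[job_id]
    job["status"] = "PROCESSING"
    try:
        # ... download source, run transformer, upload result ...
        job["transformed_document_id"] = str(uuid.uuid4())
        job["status"] = "COMPLETED"
    except Exception as e:
        job["status"] = "FAILED"
        job["error_message"] = str(e)
```

In the real route, `execute_job` is scheduled via FastAPI's background-task mechanism so the upload request can return 202 immediately with the job id and a status URL.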
sequenceDiagram
  participant Client
  participant API as Transform Jobs API
  participant DB

  Client->>API: GET /documents/transformations/{job_id}
  API->>DB: Read job by id
  DB-->>API: Job record
  API-->>Client: 200 OK (job)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement

Suggested reviewers

  • AkhileshNegi
  • avirajsingh7

Poem

I thump my paws on server logs,
A hop, a skip—convert the docs!
From PDF to text I leap,
While background tasks quietly creep.
Job IDs sprout like clover green,
Transform complete—so crisp, so clean! 🥕🐇


@kartpop kartpop marked this pull request as draft August 13, 2025 01:23

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 16

🔭 Outside diff range comments (1)
backend/app/core/doctransform/zerox_transformer.py (1)

40-40: Remove trailing empty line at end of file.

The pipeline failure indicates that the end-of-file-fixer modified this file. There's an unnecessary empty line at the end.

-            ) from e
-
+            ) from e
🧹 Nitpick comments (24)
backend/app/core/doctransform/test_transformer.py (1)

9-13: Silence Ruff ARG001 by marking the parameter intentionally unused.

Ruff has ARG001 enabled; the method parameter isn’t used. Rename it to underscore-prefixed to indicate intent.

-    def transform(self, input_path: Path) -> str:
+    def transform(self, _input_path: Path) -> str:
         return (
             "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
             "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
         )
backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py (2)

10-10: Remove unused import (sqlmodel.sql.sqltypes).

This import isn’t used.

-import sqlmodel.sql.sqltypes

22-23: Optional: add an index on source_document_id to improve query performance.

If you plan to query by source_document_id (e.g., fetching derived documents), an index will help.

 def upgrade():
@@
-    op.add_column('document', sa.Column('source_document_id', sa.Uuid(), nullable=True))
-    op.create_foreign_key(
+    op.add_column('document', sa.Column('source_document_id', sa.Uuid(), nullable=True))
+    op.create_index('ix_document_source_document_id', 'document', ['source_document_id'])
+    op.create_foreign_key(
         'document_source_document_id_fkey',
         'document',
         'document',
         ['source_document_id'],
         ['id'],
         ondelete='SET NULL',
     )
@@
 def downgrade():
@@
-    op.drop_constraint('document_source_document_id_fkey', 'document', type_='foreignkey')
-    op.drop_column('document', 'source_document_id')
+    op.drop_constraint('document_source_document_id_fkey', 'document', type_='foreignkey')
+    op.drop_index('ix_document_source_document_id', table_name='document')
+    op.drop_column('document', 'source_document_id')

Also applies to: 31-34

backend/app/models/__init__.py (1)

6-6: Export DocTransformationJob in __all__ to clarify public API.

The static analysis tool correctly identifies that DocTransformationJob is imported but not explicitly used in this module. Since this is an __init__.py file that serves as a public API interface, you should either add it to an __all__ list to make the export explicit, or use it somewhere in the module.

Consider adding an __all__ list at the end of the file to explicitly define the public API:

__all__ = [
    # Auth
    "Token", "TokenPayload",
    # Collections
    "Collection", "DocumentCollection",
    # Documents
    "Document", "DocTransformationJob",
    # ... (other exports)
]
backend/app/core/doctransform/zerox_transformer.py (1)

1-5: Fix import order according to PEP 8.

The imports should be ordered according to PEP 8: standard library imports first, then third-party imports, then local imports.

-from asyncio import Runner
-import logging
-from pathlib import Path
-from .transformer import Transformer
-from pyzerox import zerox
+import logging
+from asyncio import Runner
+from pathlib import Path
+
+from pyzerox import zerox
+
+from .transformer import Transformer
backend/app/api/routes/doc_transformation_job.py (2)

3-6: Use modern type hints instead of deprecated typing.List.

The static analysis correctly identifies that typing.List is deprecated in favor of the built-in list type (available from Python 3.9+).

-from typing import List
-from fastapi import APIRouter
-from fastapi import Path as FastPath
-from fastapi import Query
+from fastapi import APIRouter, Query
+from fastapi import Path as FastPath

Then update Line 40:

-job_id_list: List[UUID] = [UUID(jid.strip()) for jid in job_ids.split(",") if jid.strip()]
+job_id_list: list[UUID] = [UUID(jid.strip()) for jid in job_ids.split(",") if jid.strip()]

41-43: Import HTTPException at module level.

HTTPException is imported inside the exception handler. This should be imported at the module level for consistency and better readability.

Add to imports at the top:

 from uuid import UUID
-from fastapi import APIRouter
+from fastapi import APIRouter, HTTPException
 from fastapi import Path as FastPath

Then remove the import from Line 42:

 except Exception:
-    from fastapi import HTTPException
     raise HTTPException(status_code=400, detail="Invalid job_ids format. Must be comma-separated UUIDs.")
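
The comma-separated id parsing this comment touches can be isolated as a small helper; this is a sketch, with the function name assumed, and the route maps the ValueError to the HTTP 400 shown above.

```python
from uuid import UUID


def parse_job_ids(job_ids: str) -> list[UUID]:
    """Parse a comma-separated UUID string, skipping empty segments.

    Raises ValueError on malformed input (the route translates this
    into an HTTPException with status 400).
    """
    return [UUID(jid.strip()) for jid in job_ids.split(",") if jid.strip()]
```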
backend/app/api/routes/documents.py (3)

3-3: Use modern type hints instead of deprecated typing imports.

Replace deprecated typing.List with built-in list type.

-from typing import List, Optional
+from typing import Optional

And update Line 32:

-    response_model=APIResponse[List[Document]],
+    response_model=APIResponse[list[Document]],

54-56: Use modern union type syntax for optional parameters.

Use X | None syntax instead of Optional[X] for Python 3.10+.

-    target_format: Optional[str] = Form(None),
-    transformer: Optional[str] = Form(None),
+    target_format: str | None = Form(None),
+    transformer: str | None = Form(None),

110-123: Consider returning job status information in a standardized format.

The response structure mixes document metadata with transformation job details. Consider using a more structured response format that clearly separates these concerns.

 # Compose response with full document metadata and job info
 response_data = {
-    "message": f"Document accepted for transformation from {source_format} to {target_format}.",
-    "original_document": APIResponse.success_response(source_document).data,
-    "transformation_job_id": str(job_id),
-    "source_format": source_format,
-    "target_format": target_format,
-    "transformer": actual_transformer,
-    "status_check_url": f"/documents/transformations/{job_id}"
+    "document": source_document,
+    "transformation": {
+        "job_id": str(job_id),
+        "status": "PENDING",
+        "source_format": source_format,
+        "target_format": target_format,
+        "transformer": actual_transformer,
+        "status_url": f"/api/v1/documents/transformations/{job_id}"
+    },
+    "message": f"Document uploaded successfully. Transformation from {source_format} to {target_format} has been queued."
 }
backend/app/alembic/versions/93b86c1246b1_create_doc_transformation_job_table.py (1)

22-33: Consider adding indexes for foreign key columns.

The foreign key columns source_document_id and transformed_document_id will likely be used in queries. Adding indexes would improve query performance.

Consider adding indexes after creating the table:

 def upgrade():
     # ### commands auto generated by Alembic - please adjust! ###
     op.create_table('doc_transformation_job',
     sa.Column('id', sa.Uuid(), nullable=False),
     sa.Column('source_document_id', sa.Uuid(), nullable=False),
     sa.Column('transformed_document_id', sa.Uuid(), nullable=True),
     sa.Column('status', sa.Enum('PENDING', 'PROCESSING', 'COMPLETED', 'FAILED', name='transformationstatus'), nullable=False),
     sa.Column('error_message', sqlmodel.sql.sqltypes.AutoString(), nullable=True),
     sa.Column('created_at', sa.DateTime(), nullable=False),
     sa.Column('updated_at', sa.DateTime(), nullable=False),
     sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ),
     sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ),
     sa.PrimaryKeyConstraint('id')
     )
+    # Add indexes for foreign keys and status for better query performance
+    op.create_index('ix_doc_transformation_job_source_document_id', 'doc_transformation_job', ['source_document_id'])
+    op.create_index('ix_doc_transformation_job_status', 'doc_transformation_job', ['status'])
     # ### end Alembic commands ###

And update downgrade accordingly:

 def downgrade():
     # ### commands auto generated by Alembic - please adjust! ###
+    op.drop_index('ix_doc_transformation_job_status', 'doc_transformation_job')
+    op.drop_index('ix_doc_transformation_job_source_document_id', 'doc_transformation_job')
     op.drop_table('doc_transformation_job')
+    op.execute("DROP TYPE IF EXISTS transformationstatus")
     # ### end Alembic commands ###
backend/app/models/doc_transformation_job.py (1)

14-23: Consider using modern Python type hints and apply formatting fixes.

The model structure is well-designed with appropriate field defaults and foreign key relationships. However, there are formatting issues and opportunities to modernize the type hints.

Apply these formatting and type annotation updates:

 class DocTransformationJob(SQLModel, table=True):
     __tablename__ = "doc_transformation_job"
 
     id: UUID = Field(default_factory=uuid4, primary_key=True)
     source_document_id: UUID = Field(foreign_key="document.id")
-    transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id")
+    transformed_document_id: UUID | None = Field(default=None, foreign_key="document.id")
     status: TransformationStatus = Field(default=TransformationStatus.PENDING)
-    error_message: Optional[str] = Field(default=None)
+    error_message: str | None = Field(default=None)
     created_at: datetime = Field(default_factory=now)
     updated_at: datetime = Field(default_factory=now)

Also ensure the file ends with a newline character to comply with the formatting standards detected by the CI pipeline.

backend/app/crud/doc_transformation_job.py (3)

1-10: Use modern Python type hints.

The imports are appropriate and the logger setup follows best practices. Consider modernizing the type hints.

Apply this diff to modernize type hints:

 import logging
 from uuid import UUID
-from typing import List, Optional
 from sqlmodel import Session, select
 from app.models.doc_transformation_job import DocTransformationJob, TransformationStatus
 from app.core.util import now
 from app.core.exception_handlers import HTTPException

29-48: Update type hints and consider simplifying the update logic.

The method correctly updates job status with optional fields. Consider modernizing type hints and a minor simplification.

Apply this diff to modernize type hints and simplify:

     def update_status(
         self,
         job_id: UUID,
         status: TransformationStatus,
         *,
-        error_message: Optional[str] = None,
-        transformed_document_id: Optional[UUID] = None,
+        error_message: str | None = None,
+        transformed_document_id: UUID | None = None,
     ) -> DocTransformationJob:
         job = self.read_one(job_id)
         job.status = status
         job.updated_at = now()
         if error_message is not None:
             job.error_message = error_message
         if transformed_document_id is not None:
             job.transformed_document_id = transformed_document_id
 
         self.session.add(job)
         self.session.commit()
         self.session.refresh(job)
         return job

50-52: Update return type annotation and fix formatting.

The pagination logic is correct. Update the type hint to use modern Python syntax.

Apply this diff to modernize the type hint:

-    def read_many(self, skip: int = 0, limit: int = 100) -> List[DocTransformationJob]:
+    def read_many(self, skip: int = 0, limit: int = 100) -> list[DocTransformationJob]:
         statement = select(DocTransformationJob).offset(skip).limit(limit)
         return self.session.exec(statement).all()

Also ensure the file ends with a newline character.

backend/app/core/doctransform/service.py (2)

23-40: Fix whitespace issues but the implementation is solid.

The function correctly creates a job and schedules the background task. Good practice extracting the user ID before passing to the background task.

Remove trailing whitespace from line 35:

     job = job_crud.create(source_document_id=source_document_id)
     logger.debug(f"Job created | job_id={job.id}")
-    
+
     # Extract the user ID before passing to background task

101-111: Consider using a dataclass or named tuple for FileUpload.

The inline class definition works but could be better organized.

Consider moving this to a module-level definition or using a dataclass:

from dataclasses import dataclass
from typing import BinaryIO

@dataclass
class FileUpload:
    filename: str
    file: BinaryIO
    content_type: str
backend/app/core/doctransform/registry.py (7)

1-6: Modernize type hints.

Update the import statements to use modern Python type hints.

Apply this diff:

 from pathlib import Path
-from typing import Type, Dict, Set, Tuple, Optional
+from typing import Optional

Then update all type annotations throughout the file to use built-in types (e.g., dict, set, tuple, type) instead of their typing module equivalents.


12-16: Update type annotations.

The transformer registry is well-structured with sensible defaults.

Update the type annotation:

 # Map transformer names to their classes
-TRANSFORMERS: Dict[str, Type[Transformer]] = {
+TRANSFORMERS: dict[str, type[Transformer]] = {
     "default": ZeroxTransformer,
     "test": TestTransformer,
     "zerox": ZeroxTransformer,
 }

19-27: Update type annotations and fix trailing whitespace.

The transformation mapping structure is logical and extensible.

Apply these fixes:

 # Define supported transformations: (source_format, target_format) -> [available_transformers]
-SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = {
+SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = {
     ("pdf", "markdown"): {
         "default": "zerox",
         "zerox": "zerox",
     },
     # Future transformations can be added here
     # ("docx", "markdown"): {"default": "pandoc", "pandoc": "pandoc"},
     # ("html", "markdown"): {"default": "pandoc", "pandoc": "pandoc"},
 }

30-39: Fix trailing whitespace.

The extension mapping is comprehensive.

Remove trailing whitespace from line 33:

     ".pdf": "pdf",
     ".docx": "docx",
-    ".doc": "doc", 
+    ".doc": "doc",
     ".html": "html",

59-64: Update type annotations and fix trailing whitespace.

The function correctly transforms the data structure.

Apply these fixes:

-def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]:
+def get_supported_transformations() -> dict[tuple[str, str], set[str]]:
     """Get all supported transformation combinations."""
     return {
-        key: set(transformers.keys()) 
+        key: set(transformers.keys())
         for key, transformers in SUPPORTED_TRANSFORMATIONS.items()
     }

70-72: Update type annotation.

Simple and effective lookup function.

-def get_available_transformers(source_format: str, target_format: str) -> Dict[str, str]:
+def get_available_transformers(source_format: str, target_format: str) -> dict[str, str]:
     """Get available transformers for a specific transformation."""
     return SUPPORTED_TRANSFORMATIONS.get((source_format, target_format), {})

74-96: Fix whitespace issues and update type annotation.

Good validation logic with helpful error messages.

Apply these fixes:

-def resolve_transformer(source_format: str, target_format: str, transformer_name: Optional[str] = None) -> str:
+def resolve_transformer(source_format: str, target_format: str, transformer_name: str | None = None) -> str:
     """
     Resolve the actual transformer to use for a transformation.
     Returns the transformer name to use.
     """
     available_transformers = get_available_transformers(source_format, target_format)
-    
+
     if not available_transformers:
         raise ValueError(
             f"Transformation from {source_format} to {target_format} is not supported"
         )
-    
+
     if transformer_name is None:
         transformer_name = "default"
-    
+
     if transformer_name not in available_transformers:
         available = ", ".join(available_transformers.keys())
         raise ValueError(
             f"Transformer '{transformer_name}' not available for {source_format} to {target_format}. "
             f"Available: {available}"
         )
-    
+
     return available_transformers[transformer_name]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db872b9 and 7d78cdb.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (14)
  • backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py (1 hunks)
  • backend/app/alembic/versions/93b86c1246b1_create_doc_transformation_job_table.py (1 hunks)
  • backend/app/api/routes/doc_transformation_job.py (1 hunks)
  • backend/app/api/routes/documents.py (2 hunks)
  • backend/app/core/doctransform/registry.py (1 hunks)
  • backend/app/core/doctransform/service.py (1 hunks)
  • backend/app/core/doctransform/test_transformer.py (1 hunks)
  • backend/app/core/doctransform/transformer.py (1 hunks)
  • backend/app/core/doctransform/zerox_transformer.py (1 hunks)
  • backend/app/crud/doc_transformation_job.py (1 hunks)
  • backend/app/models/__init__.py (1 hunks)
  • backend/app/models/doc_transformation_job.py (1 hunks)
  • backend/app/models/document.py (2 hunks)
  • backend/pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (13)
backend/app/core/doctransform/transformer.py (1)
backend/app/core/doctransform/zerox_transformer.py (1)
  • transform (15-38)
backend/app/alembic/versions/93b86c1246b1_create_doc_transformation_job_table.py (2)
backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py (2)
  • upgrade (20-26)
  • downgrade (29-35)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/models/document.py (3)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/models/document_collection.py (1)
  • DocumentCollection (9-23)
backend/app/models/user.py (1)
  • User (48-60)
backend/app/core/doctransform/test_transformer.py (2)
backend/app/core/doctransform/transformer.py (1)
  • Transformer (4-12)
backend/app/core/doctransform/zerox_transformer.py (1)
  • transform (15-38)
backend/app/core/doctransform/zerox_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (4-12)
  • transform (8-12)
backend/app/core/doctransform/test_transformer.py (1)
  • transform (9-13)
backend/app/api/routes/doc_transformation_job.py (3)
backend/app/crud/doc_transformation_job.py (1)
  • DocTransformationJobCrud (11-52)
backend/app/utils.py (2)
  • APIResponse (27-48)
  • success_response (34-37)
backend/app/api/routes/assistants.py (1)
  • update_assistant_route (74-93)
backend/app/crud/doc_transformation_job.py (1)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (14-23)
  • TransformationStatus (8-12)
backend/app/models/doc_transformation_job.py (2)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/alembic/versions/d98dd8ec85a3_edit_replace_id_integers_in_all_models_.py (1)
  • upgrade (21-73)
backend/app/models/__init__.py (1)
backend/app/models/doc_transformation_job.py (1)
  • DocTransformationJob (14-23)
backend/app/core/doctransform/service.py (6)
backend/app/crud/doc_transformation_job.py (4)
  • DocTransformationJobCrud (11-52)
  • create (15-20)
  • update_status (29-48)
  • read_one (22-27)
backend/app/crud/document.py (1)
  • DocumentCrud (14-133)
backend/app/models/document.py (1)
  • Document (11-36)
backend/app/models/doc_transformation_job.py (1)
  • TransformationStatus (8-12)
backend/app/core/cloud/storage.py (1)
  • AmazonCloudStorage (121-206)
backend/app/core/doctransform/registry.py (1)
  • convert_document (98-114)
backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py (3)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/alembic/versions/e9dd35eff62c_add_openai_conversation_table.py (1)
  • upgrade (20-70)
backend/app/alembic/versions/66abc97f3782_user_id_from_uuid_to_int.py (1)
  • upgrade (20-152)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (4-12)
  • transform (8-12)
backend/app/core/doctransform/test_transformer.py (2)
  • TestTransformer (4-13)
  • transform (9-13)
backend/app/core/doctransform/zerox_transformer.py (2)
  • ZeroxTransformer (7-38)
  • transform (15-38)
backend/app/api/routes/documents.py (5)
backend/app/crud/document.py (2)
  • DocumentCrud (14-133)
  • update (97-122)
backend/app/models/document.py (1)
  • Document (11-36)
backend/app/core/cloud/storage.py (3)
  • AmazonCloudStorage (121-206)
  • put (114-115)
  • put (126-151)
backend/app/core/doctransform/registry.py (4)
  • get_file_format (51-57)
  • is_transformation_supported (66-68)
  • get_available_transformers (70-72)
  • resolve_transformer (74-96)
backend/app/core/doctransform/service.py (1)
  • start_job (23-40)
🪛 GitHub Actions: AI Platform CI
backend/app/core/doctransform/transformer.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/alembic/versions/93b86c1246b1_create_doc_transformation_job_table.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/core/doctransform/test_transformer.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/core/doctransform/zerox_transformer.py

[error] 1-1: End-of-file-fixer modified this file.


[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/api/routes/doc_transformation_job.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/crud/doc_transformation_job.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/models/doc_transformation_job.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/core/doctransform/service.py

[error] 1-1: End-of-file-fixer modified this file.


[error] 1-1: Trailing whitespace removed (hook: trailing-whitespace).


[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py

[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/core/doctransform/registry.py

[error] 1-1: Trailing whitespace removed (hook: trailing-whitespace).


[error] 1-1: Black formatting applied. Reformatted this file.

backend/app/api/routes/documents.py

[error] 1-1: Trailing whitespace removed (hook: trailing-whitespace).


[error] 1-1: Black formatting applied. Reformatted this file.

🪛 Ruff (0.12.2)
backend/app/models/document.py

30-30: Use X | None for type annotations

Convert to X | None

(UP045)

backend/app/api/routes/doc_transformation_job.py

6-6: typing.List is deprecated, use list instead

(UP035)


21-21: Unused function argument: current_user

(ARG001)


35-35: Unused function argument: current_user

(ARG001)


40-40: Use list instead of List for type annotation

Replace with list

(UP006)

backend/app/crud/doc_transformation_job.py

3-3: typing.List is deprecated, use list instead

(UP035)


34-34: Use X | None for type annotations

Convert to X | None

(UP045)


35-35: Use X | None for type annotations

Convert to X | None

(UP045)


50-50: Use list instead of List for type annotation

Replace with list

(UP006)

backend/app/models/doc_transformation_job.py

19-19: Use X | None for type annotations

Convert to X | None

(UP045)


21-21: Use X | None for type annotations

Convert to X | None

(UP045)

backend/app/models/__init__.py

6-6: .doc_transformation_job.DocTransformationJob imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

backend/app/core/doctransform/service.py

7-7: fastapi.UploadFile imported but unused

Remove unused import: fastapi.UploadFile

(F401)


9-9: sqlmodel.select imported but unused

Remove unused import: sqlmodel.select

(F401)


15-15: app.core.util.now imported but unused

Remove unused import: app.core.util.now

(F401)


35-35: Blank line contains whitespace

Remove whitespace from blank line

(W293)


67-67: Blank line contains whitespace

Remove whitespace from blank line

(W293)


138-138: No newline at end of file

Add trailing newline

(W292)

backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py

10-10: sqlmodel.sql.sqltypes imported but unused

Remove unused import: sqlmodel.sql.sqltypes

(F401)

backend/app/core/doctransform/registry.py

2-2: typing.Type is deprecated, use type instead

(UP035)


2-2: typing.Dict is deprecated, use dict instead

(UP035)


2-2: typing.Set is deprecated, use set instead

(UP035)


2-2: typing.Tuple is deprecated, use tuple instead

(UP035)


12-12: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


12-12: Use type instead of Type for type annotation

Replace with type

(UP006)


19-19: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


19-19: Use tuple instead of Tuple for type annotation

Replace with tuple

(UP006)


19-19: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


30-30: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


33-33: Trailing whitespace

Remove trailing whitespace

(W291)


42-42: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


59-59: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


59-59: Use tuple instead of Tuple for type annotation

Replace with tuple

(UP006)


59-59: Use set instead of Set for type annotation

Replace with set

(UP006)


62-62: Trailing whitespace

Remove trailing whitespace

(W291)


70-70: Use dict instead of Dict for type annotation

Replace with dict

(UP006)


74-74: Use X | None for type annotations

Convert to X | None

(UP045)


80-80: Blank line contains whitespace

Remove whitespace from blank line

(W293)


85-85: Blank line contains whitespace

Remove whitespace from blank line

(W293)


88-88: Blank line contains whitespace

Remove whitespace from blank line

(W293)


95-95: Blank line contains whitespace

Remove whitespace from blank line

(W293)

backend/app/api/routes/documents.py

3-3: typing.List is deprecated, use list instead

(UP035)


19-19: Trailing whitespace

Remove trailing whitespace

(W291)


20-20: Trailing whitespace

Remove trailing whitespace

(W291)


55-55: Use X | None for type annotations

Convert to X | None

(UP045)


56-56: Use X | None for type annotations

Convert to X | None

(UP045)

🔇 Additional comments (11)
backend/app/core/doctransform/transformer.py (1)

4-12: Solid abstraction for the transformer interface.

The ABC is clean and sets a clear contract. No issues spotted.

backend/app/models/doc_transformation_job.py (2)

1-6: LGTM!

The imports are well-organized and all are used in the code. The use of UUID for primary keys and the `now` utility function for timestamps are good practices.


8-12: LGTM!

The enum definition follows Python best practices by inheriting from both str and enum.Enum, providing both type safety and string compatibility. The status values comprehensively cover the job lifecycle.
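The str-plus-Enum pattern praised above can be sketched as follows (the member names here are illustrative, not taken from the PR):

```python
import enum


class TransformationStatus(str, enum.Enum):
    # Illustrative lifecycle values; the PR's actual members aren't shown here.
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
```

Because each member is also a `str`, values compare equal to plain strings and serialize cleanly in JSON responses, while lookup by value (`TransformationStatus("completed")`) keeps type safety.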

backend/app/crud/doc_transformation_job.py (3)

11-14: LGTM!

Clean class initialization that properly stores the session for use in CRUD operations.


15-20: LGTM!

The create method correctly creates and persists a new job with proper transaction handling.


22-27: LGTM!

Good error handling with appropriate logging and HTTP exception raising when a job is not found.

backend/app/core/doctransform/service.py (2)

42-48: LGTM with retry strategy!

Excellent use of the retry decorator with exponential backoff for handling transient failures. The retry parameters (3 attempts, 5-10 second waits) are reasonable for document transformation operations.
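The retry-with-backoff pattern described above can be sketched with a stdlib-only decorator (the PR's actual decorator isn't shown here, so the name and signature below are assumptions):

```python
import functools
import logging
import time


def retry(attempts: int = 3, wait_min: float = 5.0, wait_max: float = 10.0):
    """Retry a function on failure with exponential backoff clamped to [wait_min, wait_max]."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    if attempt == attempts:
                        raise  # out of attempts: propagate the last error
                    # Double the wait each attempt, but stay inside the configured window.
                    delay = min(max(wait_min * 2 ** (attempt - 1), wait_min), wait_max)
                    logging.warning(
                        "attempt %d/%d failed: %s; retrying in %.1fs",
                        attempt, attempts, exc, delay,
                    )
                    time.sleep(delay)
        return wrapper
    return decorator
```

Libraries such as tenacity provide the same behavior off the shelf; the sketch only illustrates why 3 attempts with 5-10 second waits is a sane window for transformation work.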


130-138: Good error handling pattern!

Excellent error handling with proper logging and status updates. The nested try-except ensures job status is updated even if there are database errors.

Add a newline at the end of the file to comply with formatting standards.

backend/app/core/doctransform/registry.py (3)

8-10: LGTM!

Good custom exception class for transformation-specific errors.


51-57: LGTM!

Good error handling with a clear error message for unsupported file extensions.


98-114: Excellent error handling pattern!

The function properly wraps exceptions with context and provides helpful error messages. The error chaining with from e preserves the original stack trace.

def upgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.add_column('document', sa.Column('source_document_id', sa.Uuid(), nullable=True))
    op.create_foreign_key(None, 'document', 'document', ['source_document_id'], ['id'])

⚠️ Potential issue

Name the FK and fix downgrade: dropping a FK with name=None will fail.

Alembic can’t drop a constraint by passing None. Provide an explicit name when creating the FK and use the same name when dropping it. Also consider ondelete="SET NULL" for a nullable self-reference.

-    op.create_foreign_key(None, 'document', 'document', ['source_document_id'], ['id'])
+    op.create_foreign_key(
+        'document_source_document_id_fkey',
+        'document',
+        'document',
+        ['source_document_id'],
+        ['id'],
+        ondelete='SET NULL',
+    )
@@
-    op.drop_constraint(None, 'document', type_='foreignkey')
+    op.drop_constraint('document_source_document_id_fkey', 'document', type_='foreignkey')

Also applies to: 33-34

🤖 Prompt for AI Agents
In
backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py
around lines 23 and 33-34, the migration creates a foreign key with name=None
which cannot be dropped in downgrade; update the op.create_foreign_key call to
give the FK an explicit name (e.g. "fk_document_source_document_id"), add
ondelete="SET NULL" since source_document_id is nullable, and update the
downgrade to call op.drop_constraint with that explicit name and
constraint_type="foreignkey" so the constraint can be reliably removed.

Comment on lines +24 to +25
op.drop_constraint('openai_conversation_organization_id_fkey1', 'openai_conversation', type_='foreignkey')
op.drop_constraint('openai_conversation_project_id_fkey1', 'openai_conversation', type_='foreignkey')

⚠️ Potential issue

Do not drop unrelated foreign keys on openai_conversation in this migration.

These drops are unrelated and will remove referential integrity for openai_conversation. They also aren’t re-created in this upgrade. This is likely an autogenerate artifact and should be removed.

Apply this diff:

-    op.drop_constraint('openai_conversation_organization_id_fkey1', 'openai_conversation', type_='foreignkey')
-    op.drop_constraint('openai_conversation_project_id_fkey1', 'openai_conversation', type_='foreignkey')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
op.drop_constraint('openai_conversation_organization_id_fkey1', 'openai_conversation', type_='foreignkey')
op.drop_constraint('openai_conversation_project_id_fkey1', 'openai_conversation', type_='foreignkey')
🤖 Prompt for AI Agents
In
backend/app/alembic/versions/269675883ecf_add_source_document_id_to_document.py
around lines 24-25, the migration is dropping openai_conversation foreign key
constraints 'openai_conversation_organization_id_fkey1' and
'openai_conversation_project_id_fkey1' which are unrelated to adding
source_document_id; remove those op.drop_constraint lines from the upgrade (and
any matching re-create in downgrade if present) so the migration only modifies
constraints/tables related to the document/source_document_id change, preserving
referential integrity for openai_conversation.

Comment on lines +37 to +40
def downgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_table('doc_transformation_job')
    # ### end Alembic commands ###

⚠️ Potential issue

Add proper cleanup of enum type in downgrade.

The downgrade function drops the table but doesn't clean up the transformationstatus enum type that was created. This could cause issues if the migration is rolled back and then re-applied.

 def downgrade():
     # ### commands auto generated by Alembic - please adjust! ###
     op.drop_table('doc_transformation_job')
+    # Drop the enum type
+    op.execute("DROP TYPE IF EXISTS transformationstatus")
     # ### end Alembic commands ###
📝 Committable suggestion


Suggested change
def downgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_table('doc_transformation_job')
    # ### end Alembic commands ###

def downgrade():
    # ### commands auto generated by Alembic - please adjust! ###
    op.drop_table('doc_transformation_job')
    # Drop the enum type
    op.execute("DROP TYPE IF EXISTS transformationstatus")
    # ### end Alembic commands ###
🤖 Prompt for AI Agents
In
backend/app/alembic/versions/93b86c1246b1_create_doc_transformation_job_table.py
around lines 37 to 40, the downgrade drops the doc_transformation_job table but
does not remove the Postgres enum type `transformationstatus`; after
op.drop_table('doc_transformation_job') add a command to drop the enum (for
example: op.execute("DROP TYPE IF EXISTS transformationstatus")) so the enum is
cleaned up on downgrade and re-creation of the migration will not fail.

Comment on lines +19 to +26
def get_transformation_job(
    session: SessionDep,
    current_user: CurrentUser,
    job_id: UUID = FastPath(description="Transformation job ID"),
):
    crud = DocTransformationJobCrud(session)
    job = crud.read_one(job_id)
    return APIResponse.success_response(job)

⚠️ Potential issue

Remove unused current_user parameter or add authorization logic.

The current_user parameter is not used in this function. Either remove it if authorization is not needed, or add proper authorization checks to ensure users can only access their own transformation jobs.

If authorization is intended:

 def get_transformation_job(
     session: SessionDep,
     current_user: CurrentUser,
     job_id: UUID = FastPath(description="Transformation job ID"),
 ):
     crud = DocTransformationJobCrud(session)
     job = crud.read_one(job_id)
+    # TODO: Verify the user has permission to view this job
+    # This would require checking if the job's source_document belongs to current_user
     return APIResponse.success_response(job)

Or remove if not needed:

 def get_transformation_job(
     session: SessionDep,
-    current_user: CurrentUser,
     job_id: UUID = FastPath(description="Transformation job ID"),
 ):
📝 Committable suggestion


Suggested change
def get_transformation_job(
    session: SessionDep,
    current_user: CurrentUser,
    job_id: UUID = FastPath(description="Transformation job ID"),
):
    crud = DocTransformationJobCrud(session)
    job = crud.read_one(job_id)
    return APIResponse.success_response(job)

def get_transformation_job(
    session: SessionDep,
    job_id: UUID = FastPath(description="Transformation job ID"),
):
    crud = DocTransformationJobCrud(session)
    job = crud.read_one(job_id)
    return APIResponse.success_response(job)
🧰 Tools
🪛 Ruff (0.12.2)

21-21: Unused function argument: current_user

(ARG001)

Comment on lines +28 to +45
@router.get(
    "/",
    description="Get the status and details of multiple document transformation jobs by IDs.",
    response_model=APIResponse,
)
def get_multiple_transformation_jobs(
    session: SessionDep,
    current_user: CurrentUser,
    job_ids: str = Query(..., description="Comma-separated list of transformation job IDs"),
):
    crud = DocTransformationJobCrud(session)
    try:
        job_id_list: List[UUID] = [UUID(jid.strip()) for jid in job_ids.split(",") if jid.strip()]
    except Exception:
        from fastapi import HTTPException
        raise HTTPException(status_code=400, detail="Invalid job_ids format. Must be comma-separated UUIDs.")
    jobs = [crud.read_one(job_id) for job_id in job_id_list]
    return APIResponse.success_response(jobs)

⚠️ Potential issue

Add authorization check and improve error handling for batch retrieval.

Similar to the single job endpoint, the current_user parameter is unused. Additionally, the batch retrieval could fail if any individual job is not found, which would cause the entire request to fail.

 def get_multiple_transformation_jobs(
     session: SessionDep,
     current_user: CurrentUser,
     job_ids: str = Query(..., description="Comma-separated list of transformation job IDs"),
 ):
     crud = DocTransformationJobCrud(session)
     try:
         job_id_list: list[UUID] = [UUID(jid.strip()) for jid in job_ids.split(",") if jid.strip()]
-    except Exception:
-        from fastapi import HTTPException
+    except (ValueError, AttributeError) as e:
         raise HTTPException(status_code=400, detail="Invalid job_ids format. Must be comma-separated UUIDs.")
-    jobs = [crud.read_one(job_id) for job_id in job_id_list]
+    
+    jobs = []
+    for job_id in job_id_list:
+        try:
+            job = crud.read_one(job_id)
+            # TODO: Verify user has permission to view this job
+            jobs.append(job)
+        except HTTPException as e:
+            # Log the error but continue processing other jobs
+            logger.warning(f"Failed to fetch job {job_id}: {e.detail}")
+            # Optionally include an error placeholder in the response
+            
     return APIResponse.success_response(jobs)

Would you like me to implement proper authorization checks that verify document ownership?

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.12.2)

35-35: Unused function argument: current_user

(ARG001)


40-40: Use list instead of List for type annotation

Replace with list

(UP006)
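The hardened parsing suggested above can also be factored into a small helper; a minimal stdlib sketch (the helper name is illustrative, not part of the PR):

```python
from uuid import UUID


def parse_uuid_list(raw: str) -> list[UUID]:
    """Parse a comma-separated string of UUIDs, rejecting any invalid entry."""
    try:
        return [UUID(part.strip()) for part in raw.split(",") if part.strip()]
    except ValueError as exc:
        raise ValueError("Invalid job_ids format. Must be comma-separated UUIDs.") from exc
```

As a design alternative, FastAPI can parse repeated query parameters directly with `job_ids: list[UUID] = Query(...)` (i.e. `?job_ids=<id>&job_ids=<id>`), which avoids manual splitting and returns a 422 with field-level detail on bad input.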

fname_no_ext = Path(source_doc.fname).stem
target_extension = FORMAT_TO_EXTENSION.get(target_format, f".{target_format}")
transformed_doc_id = uuid4()
tmp_out = tmp_dir / f"<transformed>{fname_no_ext}{target_extension}"

⚠️ Potential issue

Security concern: Avoid angle brackets in filenames.

Using <transformed> in the filename could cause issues with file systems or tools that interpret angle brackets specially.

Replace the angle brackets with a safer prefix:

-            tmp_out = tmp_dir / f"<transformed>{fname_no_ext}{target_extension}"
+            tmp_out = tmp_dir / f"transformed_{fname_no_ext}{target_extension}"
📝 Committable suggestion


Suggested change
tmp_out = tmp_dir / f"<transformed>{fname_no_ext}{target_extension}"
tmp_out = tmp_dir / f"transformed_{fname_no_ext}{target_extension}"
🤖 Prompt for AI Agents
In backend/app/core/doctransform/service.py around line 87, the code embeds
"<transformed>" in the temporary filename which uses angle brackets that can
break filesystems or tools; change the prefix to a safe string (e.g.,
"transformed_" or "transformed-"), avoid special characters, and if appropriate
run a small sanitization on fname_no_ext to ensure the resulting filename
contains only cross-platform-safe characters before joining with
target_extension.
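The sanitization the prompt suggests can be a one-line helper; a minimal sketch (the function name is illustrative):

```python
import re


def safe_transformed_name(stem: str, extension: str) -> str:
    """Build a cross-platform-safe output filename with a plain prefix."""
    # Keep only letters, digits, dot, dash, and underscore in the stem.
    clean_stem = re.sub(r"[^A-Za-z0-9._-]", "_", stem)
    return f"transformed_{clean_stem}{extension}"
```

This replaces the angle-bracket marker with an underscore prefix and strips any other characters that shells or file systems treat specially.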

Comment on lines +15 to +28
def transform(self, input_path: Path) -> str:
    logging.info(f"ZeroxTransformer: {input_path} (model={self.model})")
    try:
        with Runner() as runner:
            result = runner.run(zerox(
                file_path=str(input_path),
                model=self.model,
            ))
        if result is None or not hasattr(result, "pages") or result.pages is None:
            raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)")
        output = '\n\n'.join(x.content for x in result.pages)
        if not output:
            raise ValueError('Empty output from zerox')
        return output

🛠️ Refactor suggestion

Add input validation and improve error context.

The function should validate that the input file exists before attempting transformation. Also, the error handling could be more specific about different failure scenarios.

 def transform(self, input_path: Path) -> str:
+    if not input_path.exists():
+        raise FileNotFoundError(f"Input file does not exist: {input_path}")
+    if not input_path.is_file():
+        raise ValueError(f"Input path is not a file: {input_path}")
+        
     logging.info(f"ZeroxTransformer: {input_path} (model={self.model})")
     try:
         with Runner() as runner:
             result = runner.run(zerox(
                 file_path=str(input_path),
                 model=self.model,
             ))
         if result is None or not hasattr(result, "pages") or result.pages is None:
             raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)")
         output = '\n\n'.join(x.content for x in result.pages)
         if not output:
             raise ValueError('Empty output from zerox')
         return output
📝 Committable suggestion


Suggested change
def transform(self, input_path: Path) -> str:
    logging.info(f"ZeroxTransformer: {input_path} (model={self.model})")
    try:
        with Runner() as runner:
            result = runner.run(zerox(
                file_path=str(input_path),
                model=self.model,
            ))
        if result is None or not hasattr(result, "pages") or result.pages is None:
            raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)")
        output = '\n\n'.join(x.content for x in result.pages)
        if not output:
            raise ValueError('Empty output from zerox')
        return output

def transform(self, input_path: Path) -> str:
    if not input_path.exists():
        raise FileNotFoundError(f"Input file does not exist: {input_path}")
    if not input_path.is_file():
        raise ValueError(f"Input path is not a file: {input_path}")
    logging.info(f"ZeroxTransformer: {input_path} (model={self.model})")
    try:
        with Runner() as runner:
            result = runner.run(zerox(
                file_path=str(input_path),
                model=self.model,
            ))
        if result is None or not hasattr(result, "pages") or result.pages is None:
            raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)")
        output = '\n\n'.join(x.content for x in result.pages)
        if not output:
            raise ValueError('Empty output from zerox')
        return output
🤖 Prompt for AI Agents
In backend/app/core/doctransform/zerox_transformer.py around lines 15 to 28, add
input validation to check that input_path.exists() and input_path.is_file() (and
optionally os.access for readability) and raise a clear FileNotFoundError or
PermissionError if the checks fail before creating the Runner; keep the existing
with Runner() block but narrow the try/except to only wrap runner.run(...) and
catch specific exceptions (e.g., RuntimeError/ValueError/Exception from zerox or
Runner), then re-raise or raise a new RuntimeError that includes contextual
details (input_path, model, and the original exception message) so callers get
actionable info about whether the failure was due to a missing file, unreadable
file, empty pages, or a conversion error.

@@ -1,5 +1,6 @@
from uuid import UUID, uuid4
from datetime import datetime
from typing import Optional

🛠️ Refactor suggestion

Use modern union types and add safe FK behavior (SET NULL) for self-reference.

  • Be consistent with Python 3.11 unions (you already use datetime | None at Line 29). This also satisfies Ruff’s UP045 hint.
  • Consider setting ondelete="SET NULL" to avoid cascaded deletions wiping derived documents if the source is removed (safer default for a nullable self-reference).
  • Optional: add an index on source_document_id for lookup performance.

Apply this diff:

-from typing import Optional
+from typing import Optional  # will be removed below if using PEP 604 unions

@@
-    source_document_id: Optional[UUID] = Field(
-        default=None,
-        foreign_key="document.id",
-        nullable=True,
-    )
+    source_document_id: UUID | None = Field(
+        default=None,
+        foreign_key="document.id",
+        nullable=True,
+        ondelete="SET NULL",
+        index=True,
+    )

If you adopt the PEP 604 union above, remove the now-unused Optional import:

-from typing import Optional

Also applies to: 30-34

🤖 Prompt for AI Agents
In backend/app/models/document.py around lines 3 and 30-34, the file imports
Optional but elsewhere uses PEP 604 unions; replace Optional usage with the
modern X | None annotation, remove the now-unused Optional import at line 3,
update the self-referential ForeignKey for source_document_id to include
ondelete="SET NULL" (and ensure the column is nullable), and add an index on
source_document_id (e.g., sa.Index or index=True on the Column) to improve
lookup performance; keep type annotations consistent and run linters to confirm
UP045 is resolved.

Comment thread backend/pyproject.toml
version = "0.1.0"
description = ""
requires-python = ">=3.10,<4.0"
requires-python = ">=3.11,<4.0"

🛠️ Refactor suggestion

Align Ruff target version with the new Python requirement (3.11).

You bumped requires-python to >=3.11, but Ruff still targets py310. This mismatch can lead to missed modernization fixes (e.g., PEP 604 unions) and inconsistent linting behavior.

Apply this diff to align Ruff with Python 3.11:

 [project]
-requires-python = ">=3.11,<4.0"
+requires-python = ">=3.11,<4.0"

 [tool.ruff]
-target-version = "py310"
+target-version = "py311"

Also applies to: 57-57

🤖 Prompt for AI Agents
In backend/pyproject.toml around lines 5 and 57, the project sets
requires-python = ">=3.11,<4.0" but Ruff is still targeting py310; update the
Ruff configuration by replacing any target-version entries (e.g., "py310") with
"py311" so the linter target matches the declared Python requirement, and ensure
all occurrences of "py310" in the file are changed to "py311".

Comment thread backend/pyproject.toml
"openai_responses",
"langfuse>=2.60.3",
"asgi-correlation-id>=4.3.4",
"py-zerox>=0.0.7,<1.0.0"

💡 Verification agent

🧩 Analysis chain

Document and provision OS dependencies for py-zerox (Poppler) and verify image compatibility.

py-zerox typically requires Poppler to be installed on the host/container. Without it, runtime will fail (as also hinted by ZeroxTransformer’s error messaging). Ensure:

  • Dockerfile(s) install Poppler (e.g., poppler-utils on Debian/Ubuntu).
  • CI runners and dev instructions are updated.

Run this script to check if your repo’s Dockerfiles already install Poppler and if a Python 3.11 base image is used:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Searching for Dockerfiles..."
fd -i --glob "Dockerfile*" || true

echo -e "\nChecking for Poppler installation mentions..."
rg -n "(poppler|poppler-utils|libpoppler|poppler-data)" -S || true

echo -e "\nChecking for Python 3.11 base images in Dockerfiles..."
rg -n "python:3\.11" -S || true

echo -e "\nChecking whether declared dependency versions exist on PyPI..."
echo "py-zerox latest:"
curl -s https://pypi.org/pypi/py-zerox/json | jq -r '.info.version'
echo "boto3 latest:"
curl -s https://pypi.org/pypi/boto3/json | jq -r '.info.version'

Length of output: 1318


Install Poppler in Dockerfile & update docs
We ran a quick check and found that:

  • backend/Dockerfile exists but doesn’t install any Poppler packages
  • it isn’t using a Python 3.11 base image (if you intend to target 3.11)
  • PyPI shows py-zerox 0.0.7, matching your >=0.0.7,<1.0.0 declaration

Action items:

  • In backend/Dockerfile, after apt-get update install Poppler tools. For example:
     FROM python:3.11-slim            # or your current base
    +RUN apt-get update \
    +    && apt-get install -y poppler-utils libpoppler-dev poppler-data \
    +    && rm -rf /var/lib/apt/lists/*
  • If you’re standardizing on Python 3.11, bump the FROM line to python:3.11-slim (or equivalent).
  • Update CI definitions and local‐dev docs (README/CONTRIBUTING.md) to call out the OS dependency on Poppler (and how to install it on macOS/Linux).
  • Optionally, add a quick smoke test in CI to verify that pdfimages (from poppler-utils) is in PATH.
🤖 Prompt for AI Agents
In backend/pyproject.toml around line 32, the dependency on py-zerox is fine but
the Dockerfile and docs must be updated: modify backend/Dockerfile to use a
Python 3.11 base (e.g. change FROM to python:3.11-slim if targeting 3.11) and,
after the apt-get update step, install Poppler (e.g. poppler-utils or
distro-equivalent) so pdf tools like pdfimages are available; update CI
definitions and README/CONTRIBUTING.md to document the OS dependency and
installation steps for macOS/Linux; and optionally add a small CI smoke test
that checks pdfimages is in PATH to prevent regressions.
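The optional smoke test suggested above can be a thin wrapper over `shutil.which`; a minimal sketch:

```python
import shutil


def has_tool(name: str) -> bool:
    """Return True if an executable with this name is on PATH."""
    return shutil.which(name) is not None
```

In CI, the check would fail the build when `has_tool("pdfimages")` is False, surfacing a missing poppler-utils install before it breaks a transformation at runtime.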

@avirajsingh7 avirajsingh7 force-pushed the feature/doc-transform branch from f0b60b8 to 2dd5009 Compare August 25, 2025 08:05
@avirajsingh7 avirajsingh7 changed the base branch from main to update/document_module August 29, 2025 10:23
Base automatically changed from update/document_module to main September 1, 2025 05:57
…d Poppler installation (#342)

* add doc transformation endpoint to router main.py

* Handle response of upload document and validate request first than start upload

* Modify docker file to install poppler

* fix alembic migration order

* use fastapi UploadFile

* use model_validate in response

* Option to include signed url in get document endpoint

* speciify response type in doc transformation route

* logs reviewed and added

* logs modified

* resolve alembic head and remove duplicate get_signed url from cloud

* update background job to use project id instead of user id

* fix DocTransformationJobCrud and routes

* unit test for doc transformation crud

* extend unit test for upload document endpoint

* fix background job to use different session

* delete temp directories after job completion

* test for background job for transformer

* configure test_service properly

* fix migrations head

* return signed url in upload response

* fix testcases

* pass filepath to storage

* Instead of string return path of output file after transformation
@avirajsingh7
Collaborator

Closing this one, created new PR #363

@AkhileshNegi AkhileshNegi deleted the feature/doc-transform branch April 2, 2026 04:02