
Kaapi: V1.0: Database optimization#817

Open
AkhileshNegi wants to merge 13 commits into main from enhancement/database-cleanups

Conversation

@AkhileshNegi (Collaborator) commented May 7, 2026

Summary

Target issue is #808
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

Release Notes

  • Performance Improvements

    • Added database indexes to optimize query performance for frequently accessed data paths.
  • Data Integrity

    • Improved soft-deletion mechanism using timestamp tracking.
    • Added unique constraint on document collections to prevent duplicate entries.
  • Schema Updates

    • Enhanced timestamp tracking across records for better audit trails.
    • Refined database constraints and relationships for data consistency.

@coderabbitai Bot commented May 7, 2026

📝 Walkthrough

This PR implements a comprehensive migration of soft-delete semantics across the backend, replacing boolean is_deleted columns with timestamp-based deleted_at tracking, delivered via three database migrations, ORM model schema updates, CRUD/query refactoring, API layer changes, and full test coverage, spanning 149 files.

Changes

Soft-Delete Schema & Logic Migration

Layer / File(s) | Summary

Database Migrations
backend/app/alembic/versions/058_v1_query_optimization.py, 059_drop_redundant_indexes.py, 060_v1_assorted_cleanups.py
Three coordinated migrations: 058 backfills deleted_at from is_deleted, drops the boolean column, and creates tenant-scoped/query-optimized indexes; 059 drops redundant indexes; 060 adds missing timestamp columns (user.inserted_at, user.updated_at, user_project.updated_at), renames created_at → inserted_at on job and llm_call, deduplicates documentcollection, and aligns type definitions.

ORM Model Schema
backend/app/models/api_key.py, assistants.py, document.py, document_collection.py, evaluation.py, fine_tuning.py, job.py, llm/request.py, model_evaluation.py, openai_conversation.py, threads.py, user.py, user_project.py, project.py, __init__.py
Remove is_deleted: bool fields from APIKey, Assistant, Document, ModelEvaluation, OpenAIConversation. Add timestamps: User.inserted_at/updated_at, UserProject.updated_at. Rename timestamp fields: Job.created_at → inserted_at, LlmCall.created_at → inserted_at. Rename model classes for consistency: Fine_Tuning → FineTuning, OpenAI_Thread → OpenAIThread (with explicit __tablename__ mappings). Update relationship type annotations and add a DocumentCollection uniqueness constraint.

CRUD & Query Layer
backend/app/crud/api_key.py, assistants.py, auth.py, document/doc_transformation_job.py, document/document.py, fine_tuning.py, llm.py, model_evaluation.py, openai_conversation.py, thread_results.py
Update read operations to filter deleted_at.is_(None) instead of is_deleted.is_(False). Soft-delete writes now set deleted_at = now() instead of toggling is_deleted (see the sketch after this table). Rename model references: OpenAI_Thread → OpenAIThread in thread_results; Fine_Tuning → FineTuning across fine_tuning CRUD. Update query ordering: LlmCall ordered by inserted_at DESC. Optimize fetch_top_model_by_doc_id to sort by numeric mcc_score at query time rather than in Python.

API Routes & Authentication
backend/app/api/deps.py, core/security.py, api/routes/llm.py, api/routes/user_project.py
JWT/API key verification checks deleted_at.is_(None) for validity. LLM endpoint response maps job.inserted_at instead of job.created_at. User-project bulk-add refactored from per-user CRUD delegation to atomic prefetch + in-memory conflict detection with transactional rollback on constraint violations; adds user creation with generated hashed_password and is_active=False for new invitees.

Seed Data & Configuration
backend/app/seed_data/seed_data.json, seed_data.py, backend/app/tests/seed_data/seed_data.json, seed_data.py
Remove is_deleted field from API key seed records; the APIKeyData model drops its is_deleted attribute and relies on an optional deleted_at timestamp for soft-delete state.

Tests
backend/app/tests/api/routes/documents/test_route_document_remove.py, test_fine_tuning.py, test_threads.py, test_user_project.py, core/test_security.py, crud/collections/collection/test_crud_collection_delete.py, crud/documents/documents/test_crud_document_delete.py, test_crud_document_read_many.py, test_crud_document_update.py, test_doc_transformation_job.py, test_api_key.py, test_assistants.py, test_fine_tuning.py, test_model_evaluation.py, test_onboarding.py, test_openai_conversation.py, seed_data/..., utils/auth.py, utils/document.py, utils/test_data.py, utils/utils.py
Soft-delete assertions updated: is_deleted is True → deleted_at is not None, is_deleted is False → deleted_at is None. Model rename updates: Fine_Tuning → FineTuning, OpenAI_Thread → OpenAIThread in queries and fixtures. New tests for bulk-add conflict detection and soft-delete filtering in model evaluation/fine-tuning. Seed data tests updated for timestamp-based deletion.
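
To make the core change concrete: a minimal sketch of the new read/write pattern, assuming the repo's APIKey model (path per the PR's file list) and a now() helper returning the current UTC time. The function names here are illustrative, not the actual CRUD method signatures.

from datetime import datetime, timezone

from sqlmodel import Session, select

from app.models.api_key import APIKey  # path assumed from the PR's file list


def now() -> datetime:
    return datetime.now(timezone.utc)


def get_active_api_keys(session: Session, project_id: int) -> list[APIKey]:
    # Read path: a row is live while deleted_at is still NULL.
    # This replaces the old filter APIKey.is_deleted.is_(False).
    statement = select(APIKey).where(
        APIKey.project_id == project_id,
        APIKey.deleted_at.is_(None),
    )
    return list(session.exec(statement))


def soft_delete_api_key(session: Session, api_key: APIKey) -> None:
    # Write path: instead of toggling a boolean, record when the row was
    # removed; the timestamp doubles as an audit-trail entry.
    api_key.deleted_at = now()
    session.add(api_key)
    session.commit()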

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • kartpop
  • Prajna1999

Poem

🐰 The schema hops with gentle grace,
From is_deleted to timestamp's embrace,
Each soft-delete now marks its end time,
As queries bloom where conditions align,
A cleaner schema, timestamp-signed and refined!

🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage | ⚠️ Warning: docstring coverage is 49.40%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4)

  • Description Check | ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check | ✅ Passed: the title 'Kaapi: V1.0: Database optimization' directly and clearly summarizes the main changes in the PR, which involve significant database schema optimization including migrations, soft-delete consolidation, and index management.
  • Linked Issues Check | ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check | ✅ Passed: check skipped because no linked issues were found for this pull request.



AkhileshNegi self-assigned this May 7, 2026
AkhileshNegi added the enhancement (New feature or request) label May 7, 2026
AkhileshNegi marked this pull request as ready for review May 7, 2026 04:25
@codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 98.48485% with 2 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Missing
  • backend/app/api/routes/user_project.py | 96.77% | 1 line ⚠️
  • backend/app/crud/fine_tuning.py | 87.50% | 1 line ⚠️


@coderabbitai Bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/models/evaluation.py (1)

357-364: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add onupdate hook to EvaluationRun.updated_at and EvaluationDataset.updated_at.

Without SQLAlchemy's onupdate, the updated_at field only gets set on INSERT. Updates at lines 208–210 in batch.py (status, batch_job_id, total_items), at lines 230–231 (error_message, status), and in processing.py (object_store_url) all modify eval_run without explicitly setting updated_at, leaving the timestamp stale.

The codebase pattern (e.g., user.py, tts_evaluation.py) uses "onupdate": now in sa_column_kwargs. Apply the same fix:

-    updated_at: datetime = SQLField(
-        default_factory=now,
-        nullable=False,
-        description="The timestamp when the evaluation run was last updated",
-        sa_column_kwargs={
-            "comment": "Timestamp when the evaluation run was last updated"
-        },
-    )
+    updated_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        description="The timestamp when the evaluation run was last updated",
+        sa_column_kwargs={
+            "comment": "Timestamp when the evaluation run was last updated",
+            "onupdate": now,
+        },
+    )

Apply the same fix to EvaluationDataset.updated_at (lines 172–178) for consistency.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/models/evaluation.py` around lines 357 - 364, The updated_at
SQLField for EvaluationRun and EvaluationDataset currently only sets a default
timestamp on INSERT; add an SQLAlchemy onupdate hook so it auto-updates on
UPDATEs by adding "onupdate": now to the sa_column_kwargs for the SQLField
definitions named EvaluationRun.updated_at and EvaluationDataset.updated_at (the
SQLField instances using default_factory=now and sa_column_kwargs={"comment":
...}); ensure the sa_column_kwargs now contains both "comment" and "onupdate":
now consistent with other models like user.py and tts_evaluation.py.
backend/app/crud/assistants.py (1)

274-278: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set updated_at when soft-deleting an assistant for consistency with other delete operations in this PR.

The Assistant.updated_at field lacks a SQL-level onupdate= default and only sets a default_factory. This means the soft-delete operation at line 275 does not update the timestamp — unlike the APIKeyCrud.delete and DocumentCrud.delete implementations elsewhere in the PR that update both deleted_at and updated_at. This breaks the "last modified" semantics for clients relying on updated_at to track when records were last changed.

♻️ Proposed fix
    existing_assistant.deleted_at = now()
+   existing_assistant.updated_at = now()
    session.add(existing_assistant)
    session.commit()
    session.refresh(existing_assistant)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/assistants.py` around lines 274 - 278, The soft-delete path
for Assistant doesn't update the Assistant.updated_at timestamp; before
committing the change in the block that sets existing_assistant.deleted_at =
now(), also set existing_assistant.updated_at = now() (mirroring
APIKeyCrud.delete and DocumentCrud.delete behavior) so the record's "last
modified" time is updated, then session.add, session.commit, session.refresh as
currently done.
backend/app/api/routes/user_project.py (1)

65-67: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the project/org pairing before creating memberships.

These checks are independent right now. If body.project_id belongs to a different organization, the code below will persist UserProject.organization_id = body.organization_id and generate invite tokens for the wrong org/project combination.

🛠️ Suggested fix
-    validate_organization(session=session, org_id=body.organization_id)
-    validate_project(session=session, project_id=body.project_id)
+    validate_organization(session=session, org_id=body.organization_id)
+    project = get_project_by_id(session=session, project_id=body.project_id)
+    if project is None or project.organization_id != body.organization_id:
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail="Project does not belong to the given organization",
+        )

Also applies to: 114-118

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/api/routes/user_project.py` around lines 65 - 67, The current
independent checks (validate_organization and validate_project) allow mismatched
org/project combos; update validation so the project is confirmed to belong to
the requested organization before creating memberships or invite tokens—either
extend validate_project or add a new function (e.g.,
validate_project_belongs_to_org) to fetch the Project by body.project_id and
assert project.organization_id == body.organization_id, and call this check in
place of or after validate_project (also apply the same fix for the block
referenced at lines 114-118); ensure the code paths that set
UserProject.organization_id and generate invite tokens only run after this
combined validation passes.
🧹 Nitpick comments (3)
backend/app/crud/thread_results.py (1)

24-24: ⚡ Quick win

Replace deprecated data.dict() with data.model_dump().

Pydantic v2 deprecated .dict() in favour of .model_dump(). Since line 24 is already being touched, this is the right moment to fix it.

🔧 Proposed fix
-        new_thread = OpenAIThread(**data.dict())
+        new_thread = OpenAIThread(**data.model_dump())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/crud/thread_results.py` at line 24, Replace the deprecated
Pydantic v2 call data.dict() with data.model_dump() where the OpenAIThread is
constructed: change the OpenAIThread(**data.dict()) call to use
OpenAIThread(**data.model_dump()) so the created_thread/ new_thread
instantiation uses the v2-compatible model dump; update any other occurrences in
the same module that still call data.dict() (search for data.dict and replace
with data.model_dump).
backend/app/alembic/versions/058_v1_query_optimization.py (1)

182-225: ⚡ Quick win

Add explicit -> None annotations to the Alembic entrypoints.

Both changed functions are missing return annotations.

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/058_v1_query_optimization.py` around lines 182 -
225, The upgrade and downgrade Alembic entrypoint functions (upgrade and
downgrade) lack return type hints; modify their definitions to include explicit
return annotations by changing their signatures to include "-> None" (i.e., def
upgrade() -> None: and def downgrade() -> None:) so both functions are fully
type-annotated per the project's guidelines.
backend/app/alembic/versions/060_v1_assorted_cleanups.py (1)

47-163: ⚡ Quick win

Add explicit -> None annotations to upgrade() and downgrade().

These changed entrypoints are missing return annotations.

As per coding guidelines, **/*.py: Always add type hints to all function parameters and return values in Python code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/060_v1_assorted_cleanups.py` around lines 47 -
163, Add explicit return type annotations -> None to the Alembic entrypoint
functions upgrade() and downgrade(): update the function definitions for upgrade
and downgrade in this diff to include a return annotation of None (i.e., def
upgrade() -> None: and def downgrade() -> None:) so they comply with the
project's typing guideline; no other behavioral changes are needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e5424384-f9f4-48de-b522-a08cf8402648

📥 Commits

Reviewing files that changed from the base of the PR and between e863cf4 and 914ff2c.

📒 Files selected for processing (56)
  • backend/app/alembic/versions/058_v1_query_optimization.py
  • backend/app/alembic/versions/059_drop_redundant_indexes.py
  • backend/app/alembic/versions/060_v1_assorted_cleanups.py
  • backend/app/api/deps.py
  • backend/app/api/routes/llm.py
  • backend/app/api/routes/user_project.py
  • backend/app/core/security.py
  • backend/app/crud/api_key.py
  • backend/app/crud/assistants.py
  • backend/app/crud/auth.py
  • backend/app/crud/document/doc_transformation_job.py
  • backend/app/crud/document/document.py
  • backend/app/crud/fine_tuning.py
  • backend/app/crud/llm.py
  • backend/app/crud/model_evaluation.py
  • backend/app/crud/openai_conversation.py
  • backend/app/crud/thread_results.py
  • backend/app/models/__init__.py
  • backend/app/models/api_key.py
  • backend/app/models/assistants.py
  • backend/app/models/document.py
  • backend/app/models/document_collection.py
  • backend/app/models/evaluation.py
  • backend/app/models/fine_tuning.py
  • backend/app/models/job.py
  • backend/app/models/llm/request.py
  • backend/app/models/model_evaluation.py
  • backend/app/models/openai_conversation.py
  • backend/app/models/project.py
  • backend/app/models/threads.py
  • backend/app/models/user.py
  • backend/app/models/user_project.py
  • backend/app/seed_data/seed_data.json
  • backend/app/seed_data/seed_data.py
  • backend/app/tests/api/routes/documents/test_route_document_remove.py
  • backend/app/tests/api/routes/test_fine_tuning.py
  • backend/app/tests/api/routes/test_threads.py
  • backend/app/tests/api/test_user_project.py
  • backend/app/tests/core/test_security.py
  • backend/app/tests/crud/collections/collection/test_crud_collection_delete.py
  • backend/app/tests/crud/documents/documents/test_crud_document_delete.py
  • backend/app/tests/crud/documents/documents/test_crud_document_read_many.py
  • backend/app/tests/crud/documents/documents/test_crud_document_update.py
  • backend/app/tests/crud/documents/test_doc_transformation_job.py
  • backend/app/tests/crud/test_api_key.py
  • backend/app/tests/crud/test_assistants.py
  • backend/app/tests/crud/test_fine_tuning.py
  • backend/app/tests/crud/test_model_evaluation.py
  • backend/app/tests/crud/test_onboarding.py
  • backend/app/tests/crud/test_openai_conversation.py
  • backend/app/tests/seed_data/seed_data.json
  • backend/app/tests/seed_data/seed_data.py
  • backend/app/tests/utils/auth.py
  • backend/app/tests/utils/document.py
  • backend/app/tests/utils/test_data.py
  • backend/app/tests/utils/utils.py
💤 Files with no reviewable changes (9)
  • backend/app/tests/seed_data/seed_data.json
  • backend/app/seed_data/seed_data.json
  • backend/app/seed_data/seed_data.py
  • backend/app/models/document.py
  • backend/app/models/api_key.py
  • backend/app/models/openai_conversation.py
  • backend/app/models/assistants.py
  • backend/app/tests/seed_data/seed_data.py
  • backend/app/tests/utils/document.py

Comment on lines +182 to +202
def upgrade():
    # Phase A (transactional): preserve audit timestamp, drop redundant column.
    for table in IS_DELETED_TABLES:
        op.execute(
            f"UPDATE {table} "
            f"SET deleted_at = NOW() "
            f"WHERE is_deleted = TRUE AND deleted_at IS NULL"
        )
        op.drop_column(table, "is_deleted")

    # Phase B (autocommit): CONCURRENTLY index creation. Each statement
    # runs in its own implicit transaction, required by the CONCURRENTLY
    # variant.
    with op.get_context().autocommit_block():
        for table, column, index in FK_INDEXES:
            op.execute(
                f'CREATE INDEX CONCURRENTLY IF NOT EXISTS "{index}" '
                f'ON "{table}" ("{column}")'
            )
        for index, body, _schema in COMPOSITE_INDEXES:
            op.execute(f'CREATE INDEX CONCURRENTLY IF NOT EXISTS "{index}" {body}')

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift


Add type hints to upgrade() and downgrade() functions, and redesign migration phases to prevent unrecoverable partial failures.

The upgrade() and downgrade() functions are missing return type annotations (-> None), violating the type hints guideline.

More critically, autocommit_block() at line 195 commits Phase A changes (UPDATE + drop_column) before Phase B index creation begins. If any CREATE INDEX CONCURRENTLY fails, the revision is not recorded as complete but the is_deleted columns are already dropped. A rerun fails on drop_column() with a missing column error, leaving the migration unrecoverable without manual intervention. Wrap Phase A operations inside conditional checks (e.g., column_exists() before drop) or consolidate transaction semantics so either all changes commit together or all roll back on any failure.
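
A minimal sketch of the guarded variant of Phase A, under the assumption that IS_DELETED_TABLES is the constant defined earlier in this revision; the column_exists helper is hypothetical and would live in the migration file:

import sqlalchemy as sa
from alembic import op


def column_exists(table: str, column: str) -> bool:
    # Inspect the live database rather than model metadata, so a rerun
    # after a partial failure sees the actual schema state.
    inspector = sa.inspect(op.get_bind())
    return any(col["name"] == column for col in inspector.get_columns(table))


def upgrade() -> None:
    # Phase A becomes idempotent: a retry skips tables whose is_deleted
    # column was already dropped by a previous partial run.
    for table in IS_DELETED_TABLES:
        if column_exists(table, "is_deleted"):
            op.execute(
                f"UPDATE {table} SET deleted_at = NOW() "
                f"WHERE is_deleted = TRUE AND deleted_at IS NULL"
            )
            op.drop_column(table, "is_deleted")
    # Phase B (CONCURRENTLY index creation) stays in the autocommit block
    # exactly as in the original revision.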

🧰 Tools
🪛 OpenGrep (1.20.0)

[ERROR] 202-202: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.

(coderabbit.sql-injection.python-fstring-execute)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/058_v1_query_optimization.py` around lines 182 -
202, Add explicit return type annotations "-> None" to the upgrade() and
downgrade() functions, and prevent unrecoverable partial commits by ensuring
Phase A (the UPDATE and drop_column on IS_DELETED_TABLES) cannot be committed
before Phase B index creation fails: either run Phase A inside the same
transactional boundary as Phase B (avoid using
op.get_context().autocommit_block() for Phase A) or keep autocommit for Phase B
only and guard each drop with a runtime existence check (use a helper like
column_exists(table, "is_deleted") before calling op.drop_column) so a retry
won't error on missing columns; locate these changes around upgrade(),
autocommit_block(), IS_DELETED_TABLES, FK_INDEXES, and COMPOSITE_INDEXES and
apply the same pattern to downgrade().

Comment on lines +89 to +98
def upgrade():
    with op.get_context().autocommit_block():
        for index_name, _body in INDEXES_TO_DROP:
            op.execute(f'DROP INDEX CONCURRENTLY IF EXISTS "{index_name}"')


def downgrade():
    with op.get_context().autocommit_block():
        for index_name, body in INDEXES_TO_DROP:
            op.execute(f'CREATE INDEX CONCURRENTLY IF NOT EXISTS "{index_name}" {body}')

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win


Add explicit return type hints to migration functions.

The upgrade() (line 89) and downgrade() (line 95) functions must declare -> None to comply with repository typing guidelines for Python files.

♻️ Suggested patch
-def upgrade():
+def upgrade() -> None:
     with op.get_context().autocommit_block():
         for index_name, _body in INDEXES_TO_DROP:
             op.execute(f'DROP INDEX CONCURRENTLY IF EXISTS "{index_name}"')
 
 
-def downgrade():
+def downgrade() -> None:
     with op.get_context().autocommit_block():
         for index_name, body in INDEXES_TO_DROP:
             op.execute(f'CREATE INDEX CONCURRENTLY IF NOT EXISTS "{index_name}" {body}')
🧰 Tools
🪛 OpenGrep (1.20.0)

[ERROR] 92-92: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.

(coderabbit.sql-injection.python-fstring-execute)


[ERROR] 98-98: SQL query built via f-string passed to execute()/executemany(). Use parameterized queries with placeholders instead.

(coderabbit.sql-injection.python-fstring-execute)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/059_drop_redundant_indexes.py` around lines 89 -
98, The migration functions lack explicit return type hints; update the function
signatures for upgrade and downgrade to declare return type None (i.e., change
upgrade() and downgrade() to upgrade() -> None and downgrade() -> None) so they
comply with the repository typing guidelines while keeping the existing bodies
(including op.get_context().autocommit_block() and the loops over
INDEXES_TO_DROP) unchanged.

Comment on lines +47 to +147
def upgrade():
    # 1. Backfill missing timestamp columns. Each column is created with
    #    a NOW() server default so existing rows are populated atomically;
    #    the default is then dropped so future inserts get their value
    #    from the model's `default_factory=now`.
    op.add_column(
        "user_project",
        sa.Column(
            "updated_at",
            sa.DateTime(),
            nullable=False,
            server_default=sa.text("NOW()"),
            comment="Timestamp when the mapping was last updated",
        ),
    )
    op.alter_column("user_project", "updated_at", server_default=None)

    op.add_column(
        "user",
        sa.Column(
            "inserted_at",
            sa.DateTime(),
            nullable=False,
            server_default=sa.text("NOW()"),
            comment="Timestamp when the user was created",
        ),
    )
    op.alter_column("user", "inserted_at", server_default=None)

    op.add_column(
        "user",
        sa.Column(
            "updated_at",
            sa.DateTime(),
            nullable=False,
            server_default=sa.text("NOW()"),
            comment="Timestamp when the user was last updated",
        ),
    )
    op.alter_column("user", "updated_at", server_default=None)

    # 2. job.project_id foreign key (with orphan backfill)
    op.execute(
        """
        UPDATE job
        SET project_id = NULL
        WHERE project_id IS NOT NULL
          AND project_id NOT IN (SELECT id FROM project)
        """
    )
    op.create_foreign_key(
        "job_project_id_fkey",
        "job",
        "project",
        ["project_id"],
        ["id"],
        ondelete="CASCADE",
    )

    # 3. Rename created_at → inserted_at on job and llm_call, then create
    #    the llm_call hot-path index using the new column name. Index
    #    creation is CONCURRENTLY and must run outside a transaction.
    op.alter_column("job", "created_at", new_column_name="inserted_at")
    op.alter_column("llm_call", "created_at", new_column_name="inserted_at")
    with op.get_context().autocommit_block():
        op.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
            '"ix_llm_call_job_inserted_at_active" '
            'ON "llm_call" ("job_id", "inserted_at" DESC) '
            'WHERE "deleted_at" IS NULL'
        )

    # 4. documentcollection unique constraint (with dedupe).
    #    Build the underlying unique index CONCURRENTLY (no AccessExclusive
    #    on the table during the scan/build), then attach it as a constraint
    #    via ADD CONSTRAINT ... USING INDEX (catalog-only, brief lock).
    op.execute(
        """
        DELETE FROM documentcollection
        WHERE id NOT IN (
            SELECT MIN(id)
            FROM documentcollection
            GROUP BY document_id, collection_id
        )
        """
    )
    with op.get_context().autocommit_block():
        op.execute(
            "CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS "
            '"uq_document_collection" '
            'ON "documentcollection" ("document_id", "collection_id")'
        )
    op.execute(
        'ALTER TABLE "documentcollection" '
        'ADD CONSTRAINT "uq_document_collection" '
        'UNIQUE USING INDEX "uq_document_collection"'
    )

    # 5. Align global.languages.id to INTEGER
    op.execute("ALTER TABLE global.languages ALTER COLUMN id SET DATA TYPE INTEGER")


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift


Add return type hints and refactor migration to prevent partial application on failure.

This migration lacks return type annotations on upgrade() and downgrade() functions (see recent migrations 047–048 using -> None).

More critically, the destructive operations (column adds, renames, FK creation, row deletion) run within transaction blocks that precede two autocommit_block() calls. If execution fails after line 111's index creation but before the DELETE completes, or if the second autocommit block at line 133 fails, the migration leaves the database partially applied with committed DDL but rolled-back constraints. Future reruns become unsafe because op.add_column() is not idempotent.

Split concurrent-index work (lines 111–117 and 133–138) into a separate migration to complete all transactional schema changes atomically first, then handle index creation outside the transaction boundary.
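
One possible shape for the split, sketched as a hypothetical follow-up revision (the revision identifiers are placeholders, and the ADD CONSTRAINT ... USING INDEX step would move here too, since it depends on the concurrently built index):

from alembic import op

revision = "061_concurrent_indexes"  # placeholder identifier
down_revision = "060_v1_assorted_cleanups"  # placeholder identifier


def upgrade() -> None:
    # Only non-transactional work lives here, so a failure cannot leave
    # the transactional DDL of the previous revision half-applied.
    with op.get_context().autocommit_block():
        op.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
            '"ix_llm_call_job_inserted_at_active" '
            'ON "llm_call" ("job_id", "inserted_at" DESC) '
            'WHERE "deleted_at" IS NULL'
        )
        op.execute(
            "CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS "
            '"uq_document_collection" '
            'ON "documentcollection" ("document_id", "collection_id")'
        )
    op.execute(
        'ALTER TABLE "documentcollection" '
        'ADD CONSTRAINT "uq_document_collection" '
        'UNIQUE USING INDEX "uq_document_collection"'
    )


def downgrade() -> None:
    # Dropping the constraint also drops the index it absorbed.
    op.execute(
        'ALTER TABLE "documentcollection" '
        'DROP CONSTRAINT IF EXISTS "uq_document_collection"'
    )
    with op.get_context().autocommit_block():
        op.execute(
            'DROP INDEX CONCURRENTLY IF EXISTS '
            '"ix_llm_call_job_inserted_at_active"'
        )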

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/alembic/versions/060_v1_assorted_cleanups.py` around lines 47 -
147, Add explicit return type hints (-> None) to both upgrade() and downgrade(),
and refactor this migration so all transactional schema changes (op.add_column,
op.alter_column, op.create_foreign_key, DELETEs, ALTER TABLE ... ADD CONSTRAINT,
ALTER COLUMN type) occur inside the single migration transaction and do not
include CONCURRENTLY/index-creation autocommit blocks; extract the two
autocommit index creations that use op.get_context().autocommit_block() (the
CREATE INDEX CONCURRENTLY for "ix_llm_call_job_inserted_at_active" and the
CREATE UNIQUE INDEX CONCURRENTLY for "uq_document_collection") into a new,
separate migration file that only runs those concurrent, non-transactional
statements, leaving this file to perform only transactional changes and making
upgrade()/downgrade() idempotent and safe to rerun.

Comment on lines +9 to +11
__table_args__ = (
    UniqueConstraint("document_id", "collection_id", name="uq_document_collection"),
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win


Add IntegrityError handling to DocumentCollectionCrud.create() to gracefully handle duplicate document-collection pairs.

The composite unique constraint is well-designed, but the upstream insertion code at backend/app/crud/document_collection.py lines 21–22 does not catch IntegrityError. When bulk_save_objects() and commit() are called, a duplicate (document_id, collection_id) pair will raise an unhandled IntegrityError, resulting in a 500 error instead of a clean validation response.

Wrap the commit in a try/except block to catch IntegrityError and either:

  • Skip duplicate pairs silently (if re-linking is idempotent), or
  • Raise a user-friendly validation error.

Additionally, the method is missing a return type hint (-> None) and violates the logging guideline (no [function_name] prefix on any errors).
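
A sketch of the suggested guard; the method shape and the session/logger attributes follow the review's description of DocumentCollectionCrud rather than verified code, so treat the names as assumptions:

import logging

from fastapi import HTTPException, status
from sqlalchemy.exc import IntegrityError

logger = logging.getLogger(__name__)


def create(self, documents: list["DocumentCollection"]) -> None:
    # Method of DocumentCollectionCrud; `self.session` is assumed to be
    # the SQLAlchemy/SQLModel session the CRUD class already holds.
    try:
        self.session.bulk_save_objects(documents)
        self.session.commit()
    except IntegrityError:
        # A duplicate (document_id, collection_id) pair hit the unique
        # constraint; roll back and surface a clean client error.
        self.session.rollback()
        logger.warning(
            "[create] Duplicate (document_id, collection_id) pair rejected"
        )
        raise HTTPException(
            status_code=status.HTTP_409_CONFLICT,
            detail="Document is already linked to this collection",
        )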

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/models/document_collection.py` around lines 9 - 11,
DocumentCollectionCrud.create() currently commits without catching SQLAlchemy
IntegrityError from the composite UniqueConstraint, causing 500s on duplicates;
wrap the bulk_save_objects()/commit() in a try/except that catches
sqlalchemy.exc.IntegrityError, calls session.rollback(), and handle duplicates
by either skipping them (idempotent re-linking) or raising a user-friendly
validation error (e.g., a ValidationError/HTTPException with a clear message)
instead of letting the exception propagate; also add a return type hint "->
None" to create() and make any error logs include the "[create]" prefix per
logging guidelines, and import IntegrityError where needed.

Comment on lines +220 to +222
resp = client.post(
    f"{USER_PROJECTS_URL}/",
    json={

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Trailing slash in POST URLs is inconsistent with the rest of the test class.

Lines 221 and 252 use f"{USER_PROJECTS_URL}/", while every other POST in TestAddProjectUsers uses the bare USER_PROJECTS_URL. With FastAPI's default redirect_slashes=True, this causes a 307 redirect before the actual handler runs — so the tests still pass, but they're not hitting the canonical route directly. If redirect_slashes is ever disabled, these tests will break with a 404/405.

🔧 Proposed fix
-        resp = client.post(
-            f"{USER_PROJECTS_URL}/",
+        resp = client.post(
+            USER_PROJECTS_URL,

(Apply to both occurrences at lines 221 and 252.)

Also applies to: 251-253

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/api/test_user_project.py` around lines 220 - 222, The POST
requests in TestAddProjectUsers are using a trailing slash
(f"{USER_PROJECTS_URL}/") which triggers a 307 redirect; update both occurrences
to use the canonical route without the slash (use USER_PROJECTS_URL or
f"{USER_PROJECTS_URL}") so the POST hits the intended handler directly (search
for f"{USER_PROJECTS_URL}/" inside TestAddProjectUsers and replace those two
occurrences).


Labels

enhancement (New feature or request), ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant