
Collections: Save file size of previously uploaded documents #769

Closed
nishika26 wants to merge 10 commits into main from bug/file_size_default

Conversation


@nishika26 nishika26 commented Apr 17, 2026

Target issue is #768

Summary

  • Documentation

    • Clarified file upload docs: single-file uploads are limited to 25 MB.
  • Bug Fixes / Behavior

    • Improved handling of missing file-size metadata when adding documents, yielding more accurate total-size reporting and more consistent batching.
    • Updated batching behavior: documents lacking file-size information are now handled differently (may surface an error during batching) to avoid silent size assumptions.

Summary by CodeRabbit

  • Documentation

    • Clarified upload docs: single-file uploads are limited to 25 MB.
  • Bug Fixes

    • Collection size totals now reflect actual cloud file sizes by backfilling missing file-size data and updating totals.
    • Batch processing now treats missing file-size values differently, which may surface errors for incomplete metadata.
  • Tests

    • Adjusted tests to match new batch-processing behavior with missing file sizes.

@nishika26 nishika26 self-assigned this Apr 17, 2026
@nishika26 nishika26 added the bug ("Something isn't working") label Apr 17, 2026

coderabbitai Bot commented Apr 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Storage initialization is moved to after a second read of document metadata; missing per-document file_size_kb values are backfilled from object storage and optionally persisted; the rounded collection total size is recomputed and persisted before the job is marked PROCESSING; batching now uses the raw file_size_kb with no zero fallback; and a 25 MB file-size limit for file-only uploads is documented.

Changes

  • Documentation: backend/app/api/docs/documents/upload.md
    Added an explicit 25 MB maximum file-size constraint for uploads when only a file is provided.
  • Collection service: backend/app/services/collections/create_collection.py
    Sets storage = None, re-reads documents, initializes the storage provider before backfilling, backfills missing document.file_size_kb from object_store_url, optionally persists the backfilled sizes, recomputes the total from non-None sizes, computes total_size_mb (rounded to 2 decimals), and sets the CollectionJob to PROCESSING with that total.
  • Collection helpers: backend/app/services/collections/helpers.py
    Batching now uses doc.file_size_kb without a 0 fallback, so None sizes influence batching (and may raise errors) and reported batch sizes.
  • Tests: backend/app/tests/services/collections/test_helpers.py
    Updated the test for None file_size_kb: renamed, and it now expects a TypeError instead of treating None/0 as zero-sized.
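The stricter batching contract can be sketched in isolation (a minimal illustration, not the repository's actual helper; the greedy-packing shape and the 30 MB cap follow the review discussion below, and plain dicts stand in for the Document model):

```python
MAX_BATCH_SIZE_KB = 30 * 1024  # 30 MB cap quoted in the review comments

def batch_documents(documents: list[dict]) -> list[list[dict]]:
    """Greedy size-based batching using the raw file_size_kb.

    A document with file_size_kb=None now raises TypeError instead of
    being silently counted as zero-sized.
    """
    batches: list[list[dict]] = []
    current: list[dict] = []
    current_kb = 0.0
    for doc in documents:
        size_kb = doc["file_size_kb"]  # no `or 0` fallback
        if current and current_kb + size_kb > MAX_BATCH_SIZE_KB:
            batches.append(current)
            current, current_kb = [], 0.0
        current.append(doc)
        current_kb += size_kb  # TypeError surfaces here when size_kb is None
    if current:
        batches.append(current)
    return batches
```

With sizes of 20 MB, 15 MB, and 5 MB, the first document fills one batch and the next two share a second; passing a None size raises immediately instead of being packed as zero.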

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Service as CollectionService
    participant DB as Database
    participant Storage as CloudStorage
    participant Helper as BatchingHelper

    Service->>DB: Read documents metadata (flat_docs)
    Service->>Service: set storage = None
    Service->>DB: Re-read documents metadata (flat_docs)
    Service->>Storage: init/get_storage()
    loop For each doc where file_size_kb is None
        Service->>Storage: get_file_size_kb(object_store_url)
        Storage-->>Service: return size_kb
        Service->>Service: update in-memory doc.file_size_kb
    end
    Service->>DB: persist backfilled Document.file_size_kb (if any)
    Service->>Service: recompute total_size_kb and total_size_mb (rounded)
    Service->>Helper: batch_documents(documents) (uses raw file_size_kb)
    Helper-->>Service: batches
    Service->>DB: update CollectionJob(total_size_mb, status=PROCESSING)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Collections: Add dynamic batching #718: Modifies the same collection batching and file-size handling code paths (helpers.py, create_collection.py) and likely overlaps on Document.file_size_kb handling.

Suggested labels

documentation

Suggested reviewers

  • Prajna1999
  • AkhileshNegi
  • vprashrex

Poem

🐰 I hopped to storage, peeked at each file,
I filled in sizes and counted each mile,
Batches now heed bytes with truth in their art,
Uploads capped at twenty-five MB from the start,
A tiny rabbit patch—steady and smart! 🥕

🚥 Pre-merge checks | ✅ 5 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly reflects the main objective: saving/backfilling file sizes of previously uploaded documents to improve metadata accuracy and size reporting.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/app/services/collections/helpers.py (1)

Line 105: the hardcoded 15 MB default is inconsistent with calculate_total_size_kb, and the `or` fallback mishandles 0-byte docs.

The new helper resolves missing file_size_kb from storage to compute an accurate total, but batch_documents silently assumes 15 MB for the same case. For any collection that reaches this path, the "total size" reported by the job and the "batch size" used for packing can diverge significantly, and the 15 MB figure is not derived from anything documented (MAX_DOC_SIZE_MB = 25, MAX_BATCH_SIZE_KB = 30 MB). Two suggestions:

  1. Have callers ensure file_size_kb is populated (e.g., by calling calculate_total_size_kb or a similar enrichment step) before batch_documents, so the default is unreachable.
  2. At minimum, distinguish "unknown" from "0-byte" and log when the default kicks in.
♻️ Minimal fix (distinguish unknown vs. 0)

```diff
-    for doc in documents:
-        doc_size_kb = doc.file_size_kb or 15 * 1024
+    for doc in documents:
+        if doc.file_size_kb is None:
+            logger.warning(
+                f"[batch_documents] file_size_kb missing, assuming default | "
+                f"{{'doc_id': '{doc.id}', 'fname': '{doc.fname}', 'default_kb': {15 * 1024}}}"
+            )
+            doc_size_kb = 15 * 1024
+        else:
+            doc_size_kb = doc.file_size_kb
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/helpers.py` at line 105, The batch_documents
logic currently uses "doc.file_size_kb or 15 * 1024" which treats 0-byte files
as unknown and hardcodes 15MB; change it to explicitly distinguish None vs 0 by
checking "if doc.file_size_kb is None:" and when None either (a) ensure callers
populate sizes by calling calculate_total_size_kb / the same enrichment step
before batch_documents, or (b) set a clear fallback constant (e.g.,
DEFAULT_UNKNOWN_DOC_SIZE_KB derived from MAX_DOC_SIZE_MB or MAX_BATCH_SIZE_KB)
and emit a warning log including the document identifier so you know the default
was used; update the code in batch_documents (and any helper that sets
doc_size_kb) to use the explicit None check and add the log call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/collections/helpers.py`:
- Around lines 15-17: The NameError occurs because the type annotation `storage: CloudStorage` in helpers.py is evaluated at import time, while CloudStorage is only imported under TYPE_CHECKING; fix by either (A) changing the annotation to a forward-reference string (e.g., "CloudStorage") wherever `storage: CloudStorage` is declared (locate the annotation in calculate_total_size_kb or related functions in helpers.py), or (B) enabling postponed evaluation by adding `from __future__ import annotations` at the top of helpers.py so all annotations are lazily evaluated; pick one approach and update all occurrences of CloudStorage annotations accordingly.
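The TYPE_CHECKING pattern referenced above, in miniature (a generic sketch, not the project's actual module; `CloudStorage` here is a dummy class):

```python
from __future__ import annotations  # option B: annotations become lazy strings

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible only to type checkers; absent at runtime.
    class CloudStorage: ...

def calculate_total_size_kb(documents: list, storage: CloudStorage) -> float:
    # Without the __future__ import (or a quoted "CloudStorage" string),
    # evaluating this annotation at runtime would raise NameError, because
    # CloudStorage is never defined outside the TYPE_CHECKING block.
    return float(len(documents))
```

Either fix works; the `__future__` import covers every annotation in the file at once, while quoting covers one annotation at a time.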

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3396bb24-d589-4951-bf5a-36b66d84a0d6

📥 Commits

Reviewing files that changed from the base of the PR and between ff744dc and 7f5d86f.

📒 Files selected for processing (3)
  • backend/app/api/docs/documents/upload.md
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/helpers.py

Comment thread backend/app/services/collections/helpers.py Outdated
Comment on lines +69 to +82
```python
def calculate_total_size_kb(documents: list[Document], storage: CloudStorage) -> float:
    """
    Sum document sizes in KB. Uses the stored file_size_kb if available.
    """
    total: float = 0
    for doc in documents:
        if doc.file_size_kb is not None:
            total += doc.file_size_kb
        else:
            logger.info(
                f"[calculate_total_size_kb] file_size_kb missing, fetching from storage | {{'doc_id': '{doc.id}', 'fname': '{doc.fname}'}}"
            )
            total += storage.get_file_size_kb(doc.object_store_url)
    return total
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Confirm error-path behavior of AmazonCloudStorage.get_file_size_kb (does it re-raise?).
ast-grep --pattern $'def get_file_size_kb(self, url: str) -> float:
    $$$'
rg -nP -C6 'def get_file_size_kb' backend/app/core/cloud
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 4102

🏁 Script executed:

```bash
#!/bin/bash
# Check line 105 and surrounding context in helpers.py
cat -n backend/app/services/collections/helpers.py | sed -n '100,110p'

# Search for other uses of file_size_kb with falsy checks
rg -n 'file_size_kb' backend/app/services/collections/ -A2 -B2
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 2228

🏁 Script executed:

```bash
#!/bin/bash
# Search for other usages of file_size_kb across the codebase
rg -n 'file_size_kb' backend/ --type py -B1 -A1 | head -60
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 5078


Line 105 treats 0 and None equivalently; align with explicit None checks elsewhere.

The falsy check doc.file_size_kb or 15 * 1024 at line 105 treats both None and legitimate 0 KB files as missing, defaulting to 15 MB. This conflicts with the explicit is not None pattern used in calculate_total_size_kb (line 75) and the test expectation that None files "are treated as 0 bytes" (test_helpers.py:126). Use if doc.file_size_kb is not None consistently, or clarify whether 0 KB is a valid size or an error state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/helpers.py` around lines 69 - 82, The code
uses a falsy fallback expression (`doc.file_size_kb or 15 * 1024`) that treats
legitimate 0 KB files the same as None; change it to an explicit None check to
match calculate_total_size_kb's pattern and the tests: replace the
`doc.file_size_kb or 15 * 1024` usage with `if doc.file_size_kb is not None` and
handle None as 0 (or otherwise the intended behavior per tests), referencing the
same `doc.file_size_kb` field and aligning with the `calculate_total_size_kb`
logic.

coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/helpers.py (1)

85-99: ⚠️ Potential issue | 🔴 Critical

TypeError when doc.file_size_kb is None; breaks the documented "None treated as 0" contract.

file_size_kb is typed float | None in the model, and the existing test test_batch_documents_with_none_file_size (test_helpers.py:125-133) passes file_size_kb=None and expects it to be treated as 0. With the or 0 fallback removed, line 87 (current_batch_size_kb + doc_size_kb) and line 99 will raise TypeError: unsupported operand type(s) for +: 'int' and 'NoneType' whenever any document reaches this function without a populated size.

While create_collection.py backfills sizes before calling batch_documents, batch_documents is a general-purpose helper with no precondition enforcing non-None sizes, and the test suite explicitly exercises the None path. Either restore the fallback or validate inputs explicitly.

🛠️ Suggested fix

```diff
-        doc_size_kb = doc.file_size_kb
+        doc_size_kb = doc.file_size_kb if doc.file_size_kb is not None else 0
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/helpers.py` around lines 85 - 99, The helper
fails when doc.file_size_kb is None; restore the "None treated as 0" behavior in
batch_documents by changing the doc_size_kb assignment to treat None as 0 (e.g.
doc_size_kb = doc.file_size_kb if doc.file_size_kb is not None else 0.0) so
subsequent arithmetic with current_batch_size_kb, would_exceed_size, and
current_batch_size_kb += doc_size_kb won't raise TypeError; keep references to
doc.file_size_kb, current_batch_size_kb, would_exceed_size, and the test
test_batch_documents_with_none_file_size in mind while making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d59b779c-d020-4827-a61e-b9c1a6583258

📥 Commits

Reviewing files that changed from the base of the PR and between 7f5d86f and d02bac8.

📒 Files selected for processing (2)
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/services/collections/create_collection.py

@nishika26 nishika26 changed the title from "Bug/file size default" to "Collections: calculate and save file size of previously uploaded documents" Apr 17, 2026
@nishika26 nishika26 changed the title from "Collections: calculate and save file size of previously uploaded documents" to "Collections: Save file size of previously uploaded documents" Apr 17, 2026
coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
backend/app/services/collections/create_collection.py (2)

Lines 169-172: storage is initialized under a session that immediately closes. Safe today, but worth a callout.

get_cloud_storage (backend/app/core/cloud/storage.py:291-308) only uses the session to look up the project, and returns an AmazonCloudStorage that manages its own boto3 lifecycle via @cached_property, so using storage after the with Session(engine) block exits is safe. Just note: if get_cloud_storage ever starts holding a reference to the passed Session (e.g. for lazy project refresh), this pattern silently breaks. A small comment here, or splitting project lookup from client construction, would harden this against future refactors.
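One way to harden this, sketched with hypothetical names (`StorageConfig`, `load_storage_config`, and the fake session in the usage note are illustrative, not the project's actual API): resolve everything the client needs while the session is open, then construct the client from plain values only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageConfig:
    bucket: str
    prefix: str

def load_storage_config(session, project_id) -> StorageConfig:
    # All database access happens here, while the session is still open.
    project = session.get_project(project_id)  # hypothetical lookup
    return StorageConfig(bucket=project["bucket"], prefix=project["prefix"])

class CloudStorageClient:
    """Holds only detached values, so it stays valid after the session closes."""
    def __init__(self, config: StorageConfig):
        self.config = config
```

Because the client never keeps a reference to the session, a later refactor of the project lookup cannot silently create a dangling-session bug.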

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 169 -
172, The code creates storage via get_cloud_storage while inside a short-lived
with Session(engine) block which currently is safe because get_cloud_storage
only uses the session to read the project and returns an AmazonCloudStorage that
manages its own boto3 lifecycle, but this is fragile if get_cloud_storage later
retains the session; update create_collection (around the with Session(engine)
block where DocumentCrud.read_each and get_cloud_storage are called) to either
1) move project lookup out so you pass only project_id into a new storage
factory that does not receive the live Session, or 2) add a clear inline comment
next to the get_cloud_storage call referencing get_cloud_storage and
AmazonCloudStorage to explain that the returned storage is safe to use outside
the session (and warn future maintainers not to hold onto the session), so
future changes won’t accidentally create a dangling session reference.

188-194: Backfill persistence does N commits; consider a single transaction.

Per backend/app/crud/document/document.py:104-124, DocumentCrud.update(doc) performs session.add(doc) + session.commit() + session.refresh(doc) per call. With many backfilled documents, this is one round-trip per doc and partial persistence on mid-loop failures. A single-commit bulk update would be cheaper and atomic. Also, you already hold the mutated detached objects from flat_docs; the extra read_one(doc_id) re-fetch is unnecessary if you merge/add them instead.

♻️ Suggested batched persistence

```diff
         with Session(engine) as session:
             if backfill:
-                document_crud = DocumentCrud(session, project_id)
-                for doc_id, size_kb in backfill:
-                    doc = document_crud.read_one(doc_id)
-                    doc.file_size_kb = size_kb
-                    document_crud.update(doc)
+                backfill_map = dict(backfill)
+                docs_to_update = session.exec(
+                    select(Document).where(Document.id.in_(backfill_map.keys()))
+                ).all()
+                for doc in docs_to_update:
+                    doc.file_size_kb = backfill_map[doc.id]
+                    doc.updated_at = now()
+                session.commit()
```

(If you prefer to keep the current DocumentCrud abstraction for the project-ownership check, consider adding a bulk_update_sizes method on the CRUD.)
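The single-commit idea can be illustrated with sqlite3 standing in for the project's SQLModel session (a sketch under that substitution; the table and column names only mirror the review text):

```python
import sqlite3

# Single-transaction bulk backfill: one commit for all rows instead of
# one commit per document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id TEXT PRIMARY KEY, file_size_kb REAL)")
conn.executemany(
    "INSERT INTO document VALUES (?, ?)",
    [("a", None), ("b", 512.0), ("c", None)],
)

backfill = [("a", 120.5), ("c", 33.25)]  # (doc_id, size_kb) pairs
with conn:  # one atomic commit covering every backfilled row
    conn.executemany(
        "UPDATE document SET file_size_kb = ? WHERE id = ?",
        [(size_kb, doc_id) for doc_id, size_kb in backfill],
    )

sizes = dict(conn.execute("SELECT id, file_size_kb FROM document"))
print(sizes)  # {'a': 120.5, 'b': 512.0, 'c': 33.25}
```

A mid-loop failure inside the `with conn:` block rolls the whole update back, avoiding the partial-persistence problem described above.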

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 188 -
194, The backfill loop calls DocumentCrud.read_one + DocumentCrud.update which
commits per-document; change to perform a single-transaction bulk update: inside
the existing with Session(engine) as session block, collect the target Document
objects (use the already-available mutated objects instead of re-fetching if
present), set doc.file_size_kb for each doc from backfill, add/merge them into
the same session (e.g. session.add/merge or session.add_all) and call
session.commit() once after the loop so all size updates are persisted
atomically; alternatively, implement a new
DocumentCrud.bulk_update_sizes(project_id, list_of_(doc_id,size_kb)) and call
that to encapsulate ownership checks and a single commit.
backend/app/tests/services/collections/test_helpers.py (1)

125-130: Test change looks correct, but relies on an implicit TypeError.

The expectation aligns with the stricter batch_documents behavior after removing the or 0 fallback. One small caveat: the TypeError is raised implicitly by arithmetic on None in helpers.batch_documents (see helpers.py:87,99), not by an explicit guard. If that helper later adds an explicit ValueError/validation (which would be preferable — the production code now backfills precisely so this state is unexpected), this test will need updating. Consider either tightening the helper to raise a typed, message‑checked error or using pytest.raises((TypeError, ValueError)) here to be forward-compatible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/collections/test_helpers.py` around lines 125 -
130, The test test_batch_documents_with_none_file_size_raises relies on an
implicit TypeError from arithmetic on None in helpers.batch_documents; make it
forward‑compatible by changing the assertion to accept either TypeError or
ValueError (use pytest.raises((TypeError, ValueError))) so future explicit
validation in helpers.batch_documents still passes without changing the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a312d795-f9ef-4847-a716-6ad6a3b3c4ac

📥 Commits

Reviewing files that changed from the base of the PR and between d02bac8 and 8b7556c.

📒 Files selected for processing (2)
  • backend/app/services/collections/create_collection.py
  • backend/app/tests/services/collections/test_helpers.py

Comment on lines +176 to +186
```python
        backfill: list[tuple[UUID, float]] = []
        for doc in flat_docs:
            if doc.file_size_kb is None:
                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
                doc.file_size_kb = size_kb
                backfill.append((doc.id, size_kb))

        total_size_kb = sum(
            doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
        )
        total_size_mb = total_size_kb / 1024
```

⚠️ Potential issue | 🟠 Major

Single storage failure aborts the entire collection job.

storage.get_file_size_kb(doc.object_store_url) is called synchronously per document with no per-document error handling. Per backend/app/core/cloud/storage.py:220-230, this wraps a head_object call that will raise on any error (missing object, IAM issue, transient S3 outage, mid-deletion, etc.). A single failure here propagates to the outer try at line 267 and marks the whole job FAILED — even for collections where most documents already have file_size_kb persisted or where one legacy doc is simply missing from S3.

Given this PR's goal is to backfill historical data, recovering gracefully from per-doc failures is likely the safer posture:

🛡️ Suggested per-doc guard

```diff
         backfill: list[tuple[UUID, float]] = []
         for doc in flat_docs:
             if doc.file_size_kb is None:
-                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
-                doc.file_size_kb = size_kb
-                backfill.append((doc.id, size_kb))
+                try:
+                    size_kb = storage.get_file_size_kb(doc.object_store_url)
+                except Exception as err:
+                    logger.warning(
+                        f"[execute_job] Failed to backfill file_size_kb | "
+                        f"{{'doc_id': '{doc.id}', 'error': '{err}'}}"
+                    )
+                    continue
+                doc.file_size_kb = size_kb
+                backfill.append((doc.id, size_kb))
```

Note: if you keep fail-fast semantics intentionally, the downstream if doc.file_size_kb is not None filter at line 184 and the TypeError-raising path in batch_documents become inconsistent with each other — pick one contract and document it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 176 -
186, The backfill loop in create_collection.py currently calls
storage.get_file_size_kb(doc.object_store_url) without per-document error
handling, so a single storage error aborts the whole job; wrap the call inside a
try/except around the storage.get_file_size_kb invocation in the loop over
flat_docs (the block that assigns size_kb, sets doc.file_size_kb and appends to
backfill) to catch storage errors, log or record the failing doc.id and error,
skip that document (leave doc.file_size_kb as None) and continue processing the
rest; keep the existing total_size_kb calculation (which already ignores None)
so a single failure doesn't raise, and optionally add a note to record
failed_backfills for later retry/metric.

Comment on lines +176 to +181
```python
        backfill: list[tuple[UUID, float]] = []
        for doc in flat_docs:
            if doc.file_size_kb is None:
                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
                doc.file_size_kb = size_kb
                backfill.append((doc.id, size_kb))
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
rg -nP -C2 '\bfile_size_kb\b' --type=py -g '!**/tests/**'
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 4257

🏁 Script executed:

```bash
#!/bin/bash
# Check the storage.py file around the get_file_size_kb function
wc -l backend/app/core/cloud/storage.py
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 112

🏁 Script executed:

```bash
#!/bin/bash
# Read lines 220-230 from storage.py to see get_file_size_kb implementation
sed -n '220,230p' backend/app/core/cloud/storage.py
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 662


Remove bare round(...) to preserve fractional KB precision.

storage.get_file_size_kb() returns a float already rounded to 2 decimals (e.g., 512.37). Wrapping it in a bare round(...) then rounds that value to the nearest whole KB, discarding the fractional part. This contradicts both the declared type backfill: list[tuple[UUID, float]] and the Document.file_size_kb column type (Float in the database schema).
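A minimal standalone check of the precision loss (plain Python, no project code involved):

```python
size_kb = 512.37                    # the kind of 2-decimal value get_file_size_kb() returns
assert round(size_kb) == 512        # bare round() collapses it to whole KB
assert round(size_kb, 2) == 512.37  # keeping the float preserves the stored precision
```

Across many documents the per-file error (up to 0.5 KB each) compounds into a skewed total_size_kb.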

🔧 Proposed fix
-        backfill: list[tuple[UUID, float]] = []
-        for doc in flat_docs:
-            if doc.file_size_kb is None:
-                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
-                doc.file_size_kb = size_kb
-                backfill.append((doc.id, size_kb))
+        backfill: list[tuple[UUID, float]] = []
+        for doc in flat_docs:
+            if doc.file_size_kb is None:
+                size_kb = storage.get_file_size_kb(doc.object_store_url)
+                doc.file_size_kb = size_kb
+                backfill.append((doc.id, size_kb))
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
        backfill: list[tuple[UUID, float]] = []
        for doc in flat_docs:
            if doc.file_size_kb is None:
                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
                doc.file_size_kb = size_kb
                backfill.append((doc.id, size_kb))
        backfill: list[tuple[UUID, float]] = []
        for doc in flat_docs:
            if doc.file_size_kb is None:
                size_kb = storage.get_file_size_kb(doc.object_store_url)
                doc.file_size_kb = size_kb
                backfill.append((doc.id, size_kb))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 176 -
181, The code is truncating fractional KB precision by wrapping
storage.get_file_size_kb(doc.object_store_url) in round(...); instead, remove
the bare round call and assign the returned float directly to doc.file_size_kb
and to the backfill tuple (which is declared as list[tuple[UUID, float]]), e.g.
call size_kb = storage.get_file_size_kb(...), set doc.file_size_kb = size_kb and
append (doc.id, size_kb) so the float precision is preserved.

@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)

200-214: ⚠️ Potential issue | 🟠 Major

Avoid publishing the pre-backfill total.

The job is first marked PROCESSING with a total computed from sum(doc.file_size_kb or 0 ...) before the backfill runs, and only updated again afterwards. In the window between the two updates, consumers can observe the old zero-fallback total, undermining the corrected total-size behavior.

🛠️ Suggested cleanup
-            total_size_kb = sum(doc.file_size_kb or 0 for doc in flat_docs)
-            total_size_mb = round(total_size_kb / 1024, 2)
             span.set_attribute("collection.documents.count", len(flat_docs))
-            span.set_attribute("collection.documents.total_size_mb", total_size_mb)
-
-            with Session(engine) as session:
-                collection_job_crud = CollectionJobCrud(session, project_id)
-                collection_job = collection_job_crud.read_one(job_uuid)
-                collection_job = collection_job_crud.update(
-                    job_uuid,
-                    CollectionJobUpdate(
-                        task_id=task_id,
-                        status=CollectionJobStatus.PROCESSING,
-                        total_size_mb=total_size_mb,
-                    ),
-                )
-
-                storage = get_cloud_storage(session=session, project_id=project_id)
-                provider = get_llm_provider(
-                    session=session,
-                    provider=creation_request.provider,
-                    project_id=project_id,
-                    organization_id=organization_id,
-                )

Then set the span attribute after the backfilled total is computed:

         total_size_mb = total_size_kb / 1024
+        span.set_attribute("collection.documents.total_size_mb", round(total_size_mb, 2))

Also applies to: 239-260

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 200 -
214, The code is publishing a pre-backfill total_size_mb (computed from
total_size_kb) and setting
span.set_attribute("collection.documents.total_size_mb", ...) before the
backfill runs; change the flow so you do not update the CollectionJob or span
with the pre-backfill total: keep marking the job PROCESSING via
CollectionJobCrud.update(job_uuid, CollectionJobUpdate(...,
status=CollectionJobStatus.PROCESSING, task_id=task_id)) but postpone setting
total_size_mb on the CollectionJob and calling
span.set_attribute("collection.documents.total_size_mb", total_size_mb) until
after you compute the backfilled total (the later total_size_kb -> total_size_mb
calculation), ensuring any CollectionJobCrud.update that writes total_size_mb
only uses the backfilled value and the span attribute is set after backfill
completion.
♻️ Duplicate comments (2)
backend/app/services/collections/create_collection.py (2)

235-237: ⚠️ Potential issue | 🟡 Minor

Preserve fractional KB precision.

get_file_size_kb() already returns a float rounded to 2 decimals, so a bare round(...) re-rounds it to the nearest whole KB before persisting, discarding the fractional part.

🔧 Proposed fix
-                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+                size_kb = storage.get_file_size_kb(doc.object_store_url)
                 doc.file_size_kb = size_kb
                 backfill.append((doc.id, size_kb))

Quick verification of Python’s bare round(float) behavior:

#!/bin/bash
python - <<'PY'
print(round(512.37))
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 235 -
237, The code currently calls round(storage.get_file_size_kb(...)) which coerces
the returned rounded float to an integer and loses fractional KB precision;
change the assignment in the create collection logic to use the value returned
by get_file_size_kb() directly (e.g., set size_kb =
storage.get_file_size_kb(doc.object_store_url)), then assign doc.file_size_kb =
size_kb and backfill.append((doc.id, size_kb)) so the fractional KB is
preserved.

232-237: ⚠️ Potential issue | 🟠 Major

Guard each file-size backfill from storage failures.

A single head_object failure from storage.get_file_size_kb(...) still aborts the whole collection job. Keep processing other documents when one legacy object cannot be inspected.

🛡️ Proposed per-document guard
         backfill: list[tuple[UUID, float]] = []
         for doc in flat_docs:
             if doc.file_size_kb is None:
-                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+                try:
+                    size_kb = storage.get_file_size_kb(doc.object_store_url)
+                except Exception as err:
+                    logger.warning(
+                        "[create_collection.execute_job] Failed to backfill file_size_kb | "
+                        "{'doc_id': '%s', 'error': '%s'}",
+                        doc.id,
+                        str(err),
+                    )
+                    continue
                 doc.file_size_kb = size_kb
                 backfill.append((doc.id, size_kb))
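A self-contained sketch of the per-document guard, using an in-memory stub (FakeStorage, the dict-based docs, and this CloudStorageError stand-in are illustrative, not the project's storage client or models):

```python
class CloudStorageError(Exception):
    """Stand-in for the project's storage exception."""


class FakeStorage:
    """Illustrative storage stub: knows sizes for some URLs, raises for the rest."""

    def __init__(self, sizes):
        self._sizes = sizes

    def get_file_size_kb(self, url):
        if url not in self._sizes:
            raise CloudStorageError(f"head_object failed for {url}")
        return self._sizes[url]


def backfill_sizes(docs, storage):
    """Fill missing file_size_kb values; skip docs whose size lookup fails."""
    backfill = []
    for doc in docs:
        if doc["file_size_kb"] is None:
            try:
                size_kb = storage.get_file_size_kb(doc["url"])
            except CloudStorageError:
                continue  # leave file_size_kb as None; the job keeps going
            doc["file_size_kb"] = size_kb
            backfill.append((doc["id"], size_kb))
    return backfill


storage = FakeStorage({"s3://a": 512.37})
docs = [
    {"id": 1, "url": "s3://a", "file_size_kb": None},        # backfilled
    {"id": 2, "url": "s3://missing", "file_size_kb": None},  # lookup fails, skipped
    {"id": 3, "url": "s3://c", "file_size_kb": 100.0},       # already known, untouched
]
updated = backfill_sizes(docs, storage)
```

One failing lookup no longer aborts the loop; the doc simply stays at None and the already-filtered sum downstream ignores it.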
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 232 -
237, The backfill loop over flat_docs currently calls
storage.get_file_size_kb(doc.object_store_url) without guarding against storage
failures, which lets a single head_object error abort the whole collection job;
update the loop in create_collection.py to wrap each call to
storage.get_file_size_kb inside a try/except that catches storage-related
exceptions, logs a warning including doc.id and doc.object_store_url (so
failures are observable), skips setting doc.file_size_kb and does not append to
backfill for that document, and continues processing the remaining docs; ensure
you reference the same variables (flat_docs, doc.file_size_kb, backfill) and
preserve behavior for successful size lookups.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 225-268: The code block starting after the try around the Session
usage is mis-indented and the get_llm_provider(...) call is missing its closing
parenthesis before the subsequent with tracer.start_as_current_span(...); fix by
moving the entire block (the with Session(engine) as session: block that
initializes DocumentCrud, computes backfill, updates documents and
collection_job) to be inside the try scope and add the closing ')' to the
get_llm_provider call so it becomes get_llm_provider(session=..., provider=...,
project_id=..., organization_id=...) immediately before the with
tracer.start_as_current_span("collections.create.provider") line; ensure
indentation of the following lines (provider variable use and the tracer
context) matches the surrounding try block.

---

Outside diff comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 200-214: The code is publishing a pre-backfill total_size_mb
(computed from total_size_kb) and setting
span.set_attribute("collection.documents.total_size_mb", ...) before the
backfill runs; change the flow so you do not update the CollectionJob or span
with the pre-backfill total: keep marking the job PROCESSING via
CollectionJobCrud.update(job_uuid, CollectionJobUpdate(...,
status=CollectionJobStatus.PROCESSING, task_id=task_id)) but postpone setting
total_size_mb on the CollectionJob and calling
span.set_attribute("collection.documents.total_size_mb", total_size_mb) until
after you compute the backfilled total (the later total_size_kb -> total_size_mb
calculation), ensuring any CollectionJobCrud.update that writes total_size_mb
only uses the backfilled value and the span attribute is set after backfill
completion.

---

Duplicate comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 235-237: The code currently calls
round(storage.get_file_size_kb(...)) which coerces the returned rounded float to
an integer and loses fractional KB precision; change the assignment in the
create collection logic to use the value returned by get_file_size_kb() directly
(e.g., set size_kb = storage.get_file_size_kb(doc.object_store_url)), then
assign doc.file_size_kb = size_kb and backfill.append((doc.id, size_kb)) so the
fractional KB is preserved.
- Around line 232-237: The backfill loop over flat_docs currently calls
storage.get_file_size_kb(doc.object_store_url) without guarding against storage
failures, which lets a single head_object error abort the whole collection job;
update the loop in create_collection.py to wrap each call to
storage.get_file_size_kb inside a try/except that catches storage-related
exceptions, logs a warning including doc.id and doc.object_store_url (so
failures are observable), skips setting doc.file_size_kb and does not append to
backfill for that document, and continues processing the remaining docs; ensure
you reference the same variables (flat_docs, doc.file_size_kb, backfill) and
preserve behavior for successful size lookups.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7947ebe7-05ab-49be-a857-067135ce4e98

📥 Commits

Reviewing files that changed from the base of the PR and between 8b7556c and 9cae069.

📒 Files selected for processing (1)
  • backend/app/services/collections/create_collection.py

Comment on lines +225 to 268
        with Session(engine) as session:
            document_crud = DocumentCrud(session, project_id)
            flat_docs = document_crud.read_each(creation_request.documents)
            storage = get_cloud_storage(session=session, project_id=project_id)

        file_exts = {doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname}

        backfill: list[tuple[UUID, float]] = []
        for doc in flat_docs:
            if doc.file_size_kb is None:
                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
                doc.file_size_kb = size_kb
                backfill.append((doc.id, size_kb))

        total_size_kb = sum(
            doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
        )
        total_size_mb = total_size_kb / 1024

        with Session(engine) as session:
            if backfill:
                document_crud = DocumentCrud(session, project_id)
                for doc_id, size_kb in backfill:
                    doc = document_crud.read_one(doc_id)
                    doc.file_size_kb = size_kb
                    document_crud.update(doc)

            collection_job_crud = CollectionJobCrud(session, project_id)
            collection_job = collection_job_crud.read_one(job_uuid)
            collection_job = collection_job_crud.update(
                job_uuid,
                CollectionJobUpdate(
                    task_id=task_id,
                    status=CollectionJobStatus.PROCESSING,
                    total_size_mb=round(total_size_mb, 2),
                ),
            )

            provider = get_llm_provider(
                session=session,
                provider=creation_request.provider,
                project_id=project_id,
                organization_id=organization_id,
        with tracer.start_as_current_span("collections.create.provider"):

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import ast
from pathlib import Path

path = Path("backend/app/services/collections/create_collection.py")
try:
    ast.parse(path.read_text(), filename=str(path))
except SyntaxError as err:
    print(f"{path}:{err.lineno}:{err.offset}: {err.msg}")
    raise SystemExit(1)

print("syntax ok")
PY

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 172


🏁 Script executed:

cat -n backend/app/services/collections/create_collection.py | sed -n '175,275p'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 5065


Fix indentation and close the get_llm_provider() call to resolve syntax errors.

The try block starting at line 184 is incomplete—line 225 onwards should be indented within it, and the get_llm_provider() call at line 263 needs a closing parenthesis before the with tracer.start_as_current_span() statement at line 268.

Proposed structural fix
-        with Session(engine) as session:
-            document_crud = DocumentCrud(session, project_id)
-            flat_docs = document_crud.read_each(creation_request.documents)
-            storage = get_cloud_storage(session=session, project_id=project_id)
+            with Session(engine) as session:
+                document_crud = DocumentCrud(session, project_id)
+                flat_docs = document_crud.read_each(creation_request.documents)
+                storage = get_cloud_storage(session=session, project_id=project_id)

-        file_exts = {doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname}
+            file_exts = {doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname}

-        backfill: list[tuple[UUID, float]] = []
-        for doc in flat_docs:
-            if doc.file_size_kb is None:
-                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
-                doc.file_size_kb = size_kb
-                backfill.append((doc.id, size_kb))
+            backfill: list[tuple[UUID, float]] = []
+            for doc in flat_docs:
+                if doc.file_size_kb is None:
+                    size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+                    doc.file_size_kb = size_kb
+                    backfill.append((doc.id, size_kb))

-        total_size_kb = sum(
-            doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
-        )
-        total_size_mb = total_size_kb / 1024
+            total_size_kb = sum(
+                doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
+            )
+            total_size_mb = total_size_kb / 1024

-        with Session(engine) as session:
-            if backfill:
-                document_crud = DocumentCrud(session, project_id)
-                for doc_id, size_kb in backfill:
-                    doc = document_crud.read_one(doc_id)
-                    doc.file_size_kb = size_kb
-                    document_crud.update(doc)
+            with Session(engine) as session:
+                if backfill:
+                    document_crud = DocumentCrud(session, project_id)
+                    for doc_id, size_kb in backfill:
+                        doc = document_crud.read_one(doc_id)
+                        doc.file_size_kb = size_kb
+                        document_crud.update(doc)

-            collection_job_crud = CollectionJobCrud(session, project_id)
-            collection_job = collection_job_crud.read_one(job_uuid)
-            collection_job = collection_job_crud.update(
-                job_uuid,
-                CollectionJobUpdate(
-                    task_id=task_id,
-                    status=CollectionJobStatus.PROCESSING,
-                    total_size_mb=round(total_size_mb, 2),
-                ),
-            )
+                collection_job_crud = CollectionJobCrud(session, project_id)
+                collection_job = collection_job_crud.read_one(job_uuid)
+                collection_job = collection_job_crud.update(
+                    job_uuid,
+                    CollectionJobUpdate(
+                        task_id=task_id,
+                        status=CollectionJobStatus.PROCESSING,
+                        total_size_mb=round(total_size_mb, 2),
+                    ),
+                )

-            provider = get_llm_provider(
-                session=session,
-                provider=creation_request.provider,
-                project_id=project_id,
-                organization_id=organization_id,
+                provider = get_llm_provider(
+                    session=session,
+                    provider=creation_request.provider,
+                    project_id=project_id,
+                    organization_id=organization_id,
+                )
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 225-225: Expected except or finally after try block

(invalid-syntax)


[warning] 267-268: Expected ), found newline

(invalid-syntax)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 225 -
268, The code block starting after the try around the Session usage is
mis-indented and the get_llm_provider(...) call is missing its closing
parenthesis before the subsequent with tracer.start_as_current_span(...); fix by
moving the entire block (the with Session(engine) as session: block that
initializes DocumentCrud, computes backfill, updates documents and
collection_job) to be inside the try scope and add the closing ')' to the
get_llm_provider call so it becomes get_llm_provider(session=..., provider=...,
project_id=..., organization_id=...) immediately before the with
tracer.start_as_current_span("collections.create.provider") line; ensure
indentation of the following lines (provider variable use and the tracer
context) matches the surrounding try block.

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)

193-223: ⚠️ Potential issue | 🟠 Major

Old pre-backfill block wasn't removed — reintroduces the or 0 bug this PR aims to fix and double-updates the job.

Lines 193–223 are still running the original flow (read docs → compute total_size_kb with the doc.file_size_kb or 0 fallback → update the job to PROCESSING with that total → init storage/provider) before the new backfill block at 225–267 executes. This means:

  • Line 200 still silently assumes 0 KB for documents missing file_size_kb, which directly contradicts this PR's stated goal of producing "more accurate total-size reporting" and being consistent with helpers.py now using doc.file_size_kb without the or 0 fallback.
  • The job is moved to PROCESSING twice (lines 208–215 and again at 254–261), and on the first transition total_size_mb is the pre-backfill (incorrect) value. Any consumer observing the job in between the two updates will see the wrong total.
  • flat_docs, storage, and provider are each fetched twice (lines 193–195 vs 225–228, line 217 vs 228, line 218 vs 263), doubling DB reads and an S3/project lookup (get_cloud_storage re-queries the project per backend/app/core/cloud/storage.py:291-308).

Based on the AI summary ("Move storage initialization after a second read of document metadata… recompute and persist rounded collection total size before marking job PROCESSING"), the new block at 225–267 was meant to replace this section, not coexist with it. Delete 193–223 (keep only the span.set_attribute calls in a single place) so the PROCESSING transition happens exactly once, with the backfilled total.
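A plain-Python sketch of why the pre-backfill total undercounts while the post-backfill total does not (the dict-based docs and cloud_sizes mapping are hypothetical stand-ins for the project's models and storage):

```python
cloud_sizes = {"a": 512.37}  # hypothetical sizes the storage backend would report, in KB
docs = [
    {"id": "a", "file_size_kb": None},   # legacy upload, size never recorded
    {"id": "b", "file_size_kb": 204.8},
]

# Pre-backfill: the `or 0` fallback silently counts the legacy doc as 0 KB.
pre_total_kb = sum(d["file_size_kb"] or 0 for d in docs)

# Backfill missing sizes from storage, then total exactly once.
for d in docs:
    if d["file_size_kb"] is None:
        d["file_size_kb"] = cloud_sizes[d["id"]]
post_total_kb = sum(d["file_size_kb"] for d in docs if d["file_size_kb"] is not None)
```

Publishing only the post-backfill total keeps consumers from ever seeing the transient zero-fallback value.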

🛠 Suggested removal
             job_uuid = UUID(job_id)

-            with Session(engine) as session:
-                document_crud = DocumentCrud(session, project_id)
-                flat_docs = document_crud.read_each(creation_request.documents)
-
-            file_exts = {
-                doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname
-            }
-            total_size_kb = sum(doc.file_size_kb or 0 for doc in flat_docs)
-            total_size_mb = round(total_size_kb / 1024, 2)
-            span.set_attribute("collection.documents.count", len(flat_docs))
-            span.set_attribute("collection.documents.total_size_mb", total_size_mb)
-
-            with Session(engine) as session:
-                collection_job_crud = CollectionJobCrud(session, project_id)
-                collection_job = collection_job_crud.read_one(job_uuid)
-                collection_job = collection_job_crud.update(
-                    job_uuid,
-                    CollectionJobUpdate(
-                        task_id=task_id,
-                        status=CollectionJobStatus.PROCESSING,
-                        total_size_mb=total_size_mb,
-                    ),
-                )
-
-                storage = get_cloud_storage(session=session, project_id=project_id)
-                provider = get_llm_provider(
-                    session=session,
-                    provider=creation_request.provider,
-                    project_id=project_id,
-                    organization_id=organization_id,
-                )
-
             with Session(engine) as session:
                 document_crud = DocumentCrud(session, project_id)
                 flat_docs = document_crud.read_each(creation_request.documents)
                 storage = get_cloud_storage(session=session, project_id=project_id)

And move the two span.set_attribute("collection.documents.*", …) calls to right after the recomputed total_size_mb in the new block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 193 -
223, Remove the old pre-backfill flow that reads documents and updates the job
(the block that calls Session/DocumentCrud.read_each, computes total_size_kb
using "doc.file_size_kb or 0", then calls CollectionJobCrud.update to set
CollectionJobStatus.PROCESSING and initializes
get_cloud_storage/get_llm_provider); leave only the new backfill block (the
later read/recompute/update) so the job transitions to PROCESSING exactly once
with the backfilled total_size_mb. Specifically delete the duplicated
reads/updates (flat_docs/read_each, the pre-backfill total_size_kb computation,
and the first CollectionJobCrud.update/PROCESSING call) and move the two
span.set_attribute("collection.documents.count", …) and
span.set_attribute("collection.documents.total_size_mb", …) calls to directly
after the recomputed total_size_mb in the backfill block so the span reflects
the corrected total.
♻️ Duplicate comments (1)
backend/app/services/collections/create_collection.py (1)

225-268: ⚠️ Potential issue | 🔴 Critical

Previously-raised blockers on this new block are still unresolved.

Re-confirming that the issues flagged on earlier commits are still present in the current diff and must be fixed before this can merge / even parse:

  1. Syntax / indentation (blocker): Ruff 0.15.11 still reports Expected except or finally after try block at 225 and Expected ) at 267–268. The new with Session(engine) as session: at 225 sits at the same indentation as try: (so 225–267 is outside the try), and the get_llm_provider(...) call at 263–267 is missing its closing ) — line 268's with tracer.start_as_current_span(...) is being parsed as a continuation of the open paren. This file does not currently compile.
  2. Precision loss at line 235: storage.get_file_size_kb() already returns a float rounded to 2 decimals (per backend/app/core/cloud/storage.py:220-238). Wrapping it in bare round(...) truncates to an integer, which is inconsistent with both the declared backfill: list[tuple[UUID, float]] type and the Float column.
  3. Single-doc S3 failure fails the whole job (lines 232–237): storage.get_file_size_kb can raise CloudStorageError on any ClientError (missing object, IAM, transient outage) per backend/app/core/cloud/storage.py:220-238. Because this loop has no per-doc guard, one legacy/missing object aborts the whole collection job — counterproductive for a historical backfill. A try/except around the call (log and continue, leaving file_size_kb=None) is consistent with the already-filtered sum at 239–241.

Once the block at 193–223 is removed (see separate comment), please also resolve all three above. The log added in (3) should follow the repo convention f"[execute_job] … {mask_string(...)}" as per coding guidelines: "Prefix all log messages with the function name in square brackets."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 225 -
268, The new Session/try block and the get_llm_provider call are syntactically
broken and the per-doc S3 backfill logic is unsafe and losing precision: move or
indent the "with Session(engine) as session:" so it stays inside the surrounding
try block (or close the try before opening the new with), and close the open
parenthesis on get_llm_provider (add the missing ")" before "with
tracer.start_as_current_span") so the parser doesn't treat the tracer line as
part of the call; in the backfill loop (flat_docs) stop calling bare
round(storage.get_file_size_kb(...)) — preserve the float returned by
storage.get_file_size_kb to match backfill: list[tuple[UUID, float]] and the DB
Float column; and wrap the storage.get_file_size_kb call in a try/except
catching CloudStorageError, log the failure with the repo convention prefix
f"[execute_job] ..." and mask_string(...) then continue without setting
file_size_kb so the sum/filter behavior remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 193-223: Remove the old pre-backfill flow that reads documents and
updates the job (the block that calls Session/DocumentCrud.read_each, computes
total_size_kb using "doc.file_size_kb or 0", then calls CollectionJobCrud.update
to set CollectionJobStatus.PROCESSING and initializes
get_cloud_storage/get_llm_provider); leave only the new backfill block (the
later read/recompute/update) so the job transitions to PROCESSING exactly once
with the backfilled total_size_mb. Specifically delete the duplicated
reads/updates (flat_docs/read_each, the pre-backfill total_size_kb computation,
and the first CollectionJobCrud.update/PROCESSING call) and move the two
span.set_attribute("collection.documents.count", …) and
span.set_attribute("collection.documents.total_size_mb", …) calls to directly
after the recomputed total_size_mb in the backfill block so the span reflects
the corrected total.

---

Duplicate comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 225-268: The new Session/try block and the get_llm_provider call
are syntactically broken and the per-doc S3 backfill logic is unsafe and losing
precision: move or indent the "with Session(engine) as session:" so it stays
inside the surrounding try block (or close the try before opening the new with),
and close the open parenthesis on get_llm_provider (add the missing ")" before
"with tracer.start_as_current_span") so the parser doesn't treat the tracer line
as part of the call; in the backfill loop (flat_docs) stop calling bare
round(storage.get_file_size_kb(...)) — preserve the float returned by
storage.get_file_size_kb to match backfill: list[tuple[UUID, float]] and the DB
Float column; and wrap the storage.get_file_size_kb call in a try/except
catching CloudStorageError, log the failure with the repo convention prefix
f"[execute_job] ..." and mask_string(...) then continue without setting
file_size_kb so the sum/filter behavior remains unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24103d97-ea73-402f-875a-c9786b609758

📥 Commits

Reviewing files that changed from the base of the PR and between 9cae069 and 166cb9a.

📒 Files selected for processing (1)
  • backend/app/services/collections/create_collection.py

@nishika26 nishika26 closed this Apr 24, 2026
@nishika26 nishika26 deleted the bug/file_size_default branch April 24, 2026 07:10

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant