Collections: Save file size of previously uploaded documents#769
Conversation

Note: Reviews paused — this branch is under active development. To avoid overwhelming you with review comments during an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Move storage initialization after a second read of document metadata, and backfill missing per-document `file_size_kb` values.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Service as CollectionService
    participant DB as Database
    participant Storage as CloudStorage
    participant Helper as BatchingHelper
    Service->>DB: Read documents metadata (flat_docs)
    Service->>Service: set storage = None
    Service->>DB: Re-read documents metadata (flat_docs)
    Service->>Storage: init/get_storage()
    loop For each doc where file_size_kb is None
        Service->>Storage: get_file_size_kb(object_store_url)
        Storage-->>Service: return size_kb
        Service->>Service: update in-memory doc.file_size_kb
    end
    Service->>DB: persist backfilled Document.file_size_kb (if any)
    Service->>Service: recompute total_size_kb and total_size_mb (rounded)
    Service->>Helper: batch_documents(documents) (uses raw file_size_kb)
    Helper-->>Service: batches
    Service->>DB: update CollectionJob(total_size_mb, status=PROCESSING)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/app/services/collections/helpers.py (1)
105-105: Hardcoded 15 MB default is inconsistent with `calculate_total_size_kb`, and `or` mishandles 0-byte docs.

The new helper resolves missing `file_size_kb` from storage to compute an accurate total, but `batch_documents` silently assumes 15 MB for the same case. For any collection that reaches this path, the "total size" reported by the job and the "batch size" used for packing can diverge significantly, and the 15 MB figure is not derived from anything documented (`MAX_DOC_SIZE_MB = 25`, `MAX_BATCH_SIZE_KB` = 30 MB). Two suggestions:

- Have callers ensure `file_size_kb` is populated (e.g., by calling `calculate_total_size_kb` or a similar enrichment step) before `batch_documents`, so the default is unreachable.
- At minimum, distinguish "unknown" from "0-byte" and log when the default kicks in.
♻️ Minimal fix (distinguish unknown vs. 0)

```diff
-    for doc in documents:
-        doc_size_kb = doc.file_size_kb or 15 * 1024
+    for doc in documents:
+        if doc.file_size_kb is None:
+            logger.warning(
+                f"[batch_documents] file_size_kb missing, assuming default | "
+                f"{{'doc_id': '{doc.id}', 'fname': '{doc.fname}', 'default_kb': {15 * 1024}}}"
+            )
+            doc_size_kb = 15 * 1024
+        else:
+            doc_size_kb = doc.file_size_kb
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/helpers.py` at line 105, The batch_documents logic currently uses "doc.file_size_kb or 15 * 1024" which treats 0-byte files as unknown and hardcodes 15MB; change it to explicitly distinguish None vs 0 by checking "if doc.file_size_kb is None:" and when None either (a) ensure callers populate sizes by calling calculate_total_size_kb / the same enrichment step before batch_documents, or (b) set a clear fallback constant (e.g., DEFAULT_UNKNOWN_DOC_SIZE_KB derived from MAX_DOC_SIZE_MB or MAX_BATCH_SIZE_KB) and emit a warning log including the document identifier so you know the default was used; update the code in batch_documents (and any helper that sets doc_size_kb) to use the explicit None check and add the log call.
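The falsy-fallback pitfall this comment describes is easy to see in isolation. A minimal sketch — the `15 * 1024` default is the figure from the review; the variable names are illustrative:

```python
doc_size_kb = 0.0  # a legitimate zero-byte document

# Falsy fallback: 0.0 is falsy, so a real 0 KB size is silently replaced.
falsy = doc_size_kb or 15 * 1024

# Explicit None check: only a truly unknown size gets the default.
explicit = doc_size_kb if doc_size_kb is not None else 15 * 1024

print(falsy)     # 15360
print(explicit)  # 0.0
```

The explicit check keeps `0.0` intact while still substituting the default for `None`, which is exactly the distinction the review asks for.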
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/services/collections/helpers.py`:
- Around line 69-82: The code uses a falsy fallback expression
(`doc.file_size_kb or 15 * 1024`) that treats legitimate 0 KB files the same as
None; change it to an explicit None check to match calculate_total_size_kb's
pattern and the tests: replace the `doc.file_size_kb or 15 * 1024` usage with
`if doc.file_size_kb is not None` and handle None as 0 (or otherwise the
intended behavior per tests), referencing the same `doc.file_size_kb` field and
aligning with the `calculate_total_size_kb` logic.
- Around line 15-17: The NameError occurs because the type annotation storage:
CloudStorage in helpers.py is evaluated at import time but CloudStorage is only
imported under TYPE_CHECKING; fix by either (A) changing the annotation to a
forward-reference string (e.g., "CloudStorage") wherever storage: CloudStorage
is declared (locate the annotation in calculate_total_size_kb or related
functions in helpers.py) or (B) enable postponed evaluation by adding from
__future__ import annotations at the top of helpers.py so all annotations are
lazily evaluated; pick one approach and update all occurrences of CloudStorage
annotations accordingly.
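Option (B) from the prompt — postponed evaluation of annotations — can be sketched as follows. The module name under `TYPE_CHECKING` is hypothetical; the point is only that the annotation is never evaluated at runtime:

```python
from __future__ import annotations  # all annotations become lazily evaluated strings

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible to static type checkers only; never imported at runtime,
    # so this module does not need to exist when the code actually runs.
    from some_cloud_module import CloudStorage  # hypothetical import

def total_size_kb(sizes: list[float], storage: CloudStorage) -> float:
    # Without the __future__ import, evaluating the CloudStorage annotation
    # at import time would raise NameError here.
    return sum(sizes)

print(total_size_kb([1.5, 2.5], storage=None))  # 4.0
```

The same effect can be had per-annotation with a string forward reference (`storage: "CloudStorage"`), but the `__future__` import fixes every annotation in the file at once.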
---
Nitpick comments:
In `@backend/app/services/collections/helpers.py`:
- Line 105: The batch_documents logic currently uses "doc.file_size_kb or 15 *
1024" which treats 0-byte files as unknown and hardcodes 15MB; change it to
explicitly distinguish None vs 0 by checking "if doc.file_size_kb is None:" and
when None either (a) ensure callers populate sizes by calling
calculate_total_size_kb / the same enrichment step before batch_documents, or
(b) set a clear fallback constant (e.g., DEFAULT_UNKNOWN_DOC_SIZE_KB derived
from MAX_DOC_SIZE_MB or MAX_BATCH_SIZE_KB) and emit a warning log including the
document identifier so you know the default was used; update the code in
batch_documents (and any helper that sets doc_size_kb) to use the explicit None
check and add the log call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3396bb24-d589-4951-bf5a-36b66d84a0d6
📒 Files selected for processing (3)
- backend/app/api/docs/documents/upload.md
- backend/app/services/collections/create_collection.py
- backend/app/services/collections/helpers.py
```python
def calculate_total_size_kb(documents: list[Document], storage: CloudStorage) -> float:
    """
    Sum document sizes in KB. Uses the stored file_size_kb if available.
    """
    total: float = 0
    for doc in documents:
        if doc.file_size_kb is not None:
            total += doc.file_size_kb
        else:
            logger.info(
                f"[calculate_total_size_kb] file_size_kb missing, fetching from storage | {{'doc_id': '{doc.id}', 'fname': '{doc.fname}'}}"
            )
            total += storage.get_file_size_kb(doc.object_store_url)
    return total
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Confirm error-path behavior of AmazonCloudStorage.get_file_size_kb (does it re-raise?).
ast-grep --pattern $'def get_file_size_kb(self, url: str) -> float:
$$$'
rg -nP -C6 'def get_file_size_kb' backend/app/core/cloud
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 4102
🏁 Script executed:

```shell
#!/bin/bash
# Check line 105 and surrounding context in helpers.py
cat -n backend/app/services/collections/helpers.py | sed -n '100,110p'
# Search for other uses of file_size_kb with falsy checks
rg -n 'file_size_kb' backend/app/services/collections/ -A2 -B2
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 2228
🏁 Script executed:

```shell
#!/bin/bash
# Search for other usages of file_size_kb across the codebase
rg -n 'file_size_kb' backend/ --type py -B1 -A1 | head -60
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 5078
Line 105 treats 0 and None equivalently; align with explicit None checks elsewhere.

The falsy check `doc.file_size_kb or 15 * 1024` at line 105 treats both `None` and legitimate 0 KB files as missing, defaulting to 15 MB. This conflicts with the explicit `is not None` pattern used in `calculate_total_size_kb` (line 75) and the test expectation that `None` files "are treated as 0 bytes" (test_helpers.py:126). Use `if doc.file_size_kb is not None` consistently, or clarify whether 0 KB is a valid size or an error state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/collections/helpers.py` around lines 69 - 82, The code
uses a falsy fallback expression (`doc.file_size_kb or 15 * 1024`) that treats
legitimate 0 KB files the same as None; change it to an explicit None check to
match calculate_total_size_kb's pattern and the tests: replace the
`doc.file_size_kb or 15 * 1024` usage with `if doc.file_size_kb is not None` and
handle None as 0 (or otherwise the intended behavior per tests), referencing the
same `doc.file_size_kb` field and aligning with the `calculate_total_size_kb`
logic.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/collections/helpers.py (1)
85-99: ⚠️ Potential issue | 🔴 Critical

`TypeError` when `doc.file_size_kb` is `None`; breaks the documented "None treated as 0" contract.

`file_size_kb` is typed `float | None` in the model, and the existing test `test_batch_documents_with_none_file_size` (test_helpers.py:125-133) passes `file_size_kb=None` and expects it to be treated as 0. With the `or 0` fallback removed, line 87 (`current_batch_size_kb + doc_size_kb`) and line 99 will raise `TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'` whenever any document reaches this function without a populated size.

While `create_collection.py` backfills sizes before calling `batch_documents`, `batch_documents` is a general-purpose helper with no precondition enforcing non-None sizes, and the test suite explicitly exercises the None path. Either restore the fallback or validate inputs explicitly.

🛠️ Suggested fix

```diff
-    doc_size_kb = doc.file_size_kb
+    doc_size_kb = doc.file_size_kb if doc.file_size_kb is not None else 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/helpers.py` around lines 85 - 99, The helper fails when doc.file_size_kb is None; restore the "None treated as 0" behavior in batch_documents by changing the doc_size_kb assignment to treat None as 0 (e.g. doc_size_kb = doc.file_size_kb if doc.file_size_kb is not None else 0.0) so subsequent arithmetic with current_batch_size_kb, would_exceed_size, and current_batch_size_kb += doc_size_kb won't raise TypeError; keep references to doc.file_size_kb, current_batch_size_kb, would_exceed_size, and the test test_batch_documents_with_none_file_size in mind while making the change.
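The failure mode is reproducible without any of the surrounding service code. A toy version of the accumulation inside `batch_documents` (function and variable names are illustrative):

```python
def accumulate(sizes_kb):
    # Mirrors `current_batch_size_kb += doc_size_kb` with no None guard.
    total = 0
    for size in sizes_kb:
        total += size  # raises TypeError when size is None
    return total

print(accumulate([100.0, 250.5]))  # 350.5

try:
    accumulate([100.0, None])
except TypeError as err:
    # unsupported operand type(s) for +=: 'float' and 'NoneType'
    print(f"TypeError: {err}")
```

Any `None` element aborts the whole accumulation, which is why the helper needs either a restored fallback or an explicit input check.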
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@backend/app/services/collections/helpers.py`:
- Around line 85-99: The helper fails when doc.file_size_kb is None; restore the
"None treated as 0" behavior in batch_documents by changing the doc_size_kb
assignment to treat None as 0 (e.g. doc_size_kb = doc.file_size_kb if
doc.file_size_kb is not None else 0.0) so subsequent arithmetic with
current_batch_size_kb, would_exceed_size, and current_batch_size_kb +=
doc_size_kb won't raise TypeError; keep references to doc.file_size_kb,
current_batch_size_kb, would_exceed_size, and the test
test_batch_documents_with_none_file_size in mind while making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d59b779c-d020-4827-a61e-b9c1a6583258
📒 Files selected for processing (2)
- backend/app/services/collections/create_collection.py
- backend/app/services/collections/helpers.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/services/collections/create_collection.py
Actionable comments posted: 2
🧹 Nitpick comments (3)
backend/app/services/collections/create_collection.py (2)
169-172: Storage initialized under a session that immediately closes — OK, but worth a callout.

`get_cloud_storage` (backend/app/core/cloud/storage.py:291-308) only uses the session to look up the project, and returns an `AmazonCloudStorage` that manages its own boto3 lifecycle via `@cached_property`, so using `storage` after the `with Session(engine)` block exits is safe. Just note: if `get_cloud_storage` ever starts holding a reference to the passed `Session` (e.g. for lazy project refresh), this pattern silently breaks. A small comment here, or splitting project lookup from client construction, would harden this against future refactors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 169 - 172, The code creates storage via get_cloud_storage while inside a short-lived with Session(engine) block which currently is safe because get_cloud_storage only uses the session to read the project and returns an AmazonCloudStorage that manages its own boto3 lifecycle, but this is fragile if get_cloud_storage later retains the session; update create_collection (around the with Session(engine) block where DocumentCrud.read_each and get_cloud_storage are called) to either 1) move project lookup out so you pass only project_id into a new storage factory that does not receive the live Session, or 2) add a clear inline comment next to the get_cloud_storage call referencing get_cloud_storage and AmazonCloudStorage to explain that the returned storage is safe to use outside the session (and warn future maintainers not to hold onto the session), so future changes won’t accidentally create a dangling session reference.
188-194: Backfill persistence does N commits; consider a single transaction.

Per backend/app/crud/document/document.py:104-124, `DocumentCrud.update(doc)` performs `session.add(doc)` + `session.commit()` + `session.refresh(doc)` per call. With many backfilled documents, this is one round-trip per doc and partial persistence on mid-loop failures. A single-commit bulk update would be cheaper and atomic. Also, you already hold the mutated detached objects from `flat_docs`; the extra `read_one(doc_id)` re-fetch is unnecessary if you merge/add them instead.

♻️ Suggested batched persistence

```diff
     with Session(engine) as session:
         if backfill:
-            document_crud = DocumentCrud(session, project_id)
-            for doc_id, size_kb in backfill:
-                doc = document_crud.read_one(doc_id)
-                doc.file_size_kb = size_kb
-                document_crud.update(doc)
+            backfill_map = dict(backfill)
+            docs_to_update = session.exec(
+                select(Document).where(Document.id.in_(backfill_map.keys()))
+            ).all()
+            for doc in docs_to_update:
+                doc.file_size_kb = backfill_map[doc.id]
+                doc.updated_at = now()
+            session.commit()
```

(If you prefer to keep the current `DocumentCrud` abstraction for the project-ownership check, consider adding a `bulk_update_sizes` method on the CRUD.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 188 - 194, The backfill loop calls DocumentCrud.read_one + DocumentCrud.update which commits per-document; change to perform a single-transaction bulk update: inside the existing with Session(engine) as session block, collect the target Document objects (use the already-available mutated objects instead of re-fetching if present), set doc.file_size_kb for each doc from backfill, add/merge them into the same session (e.g. session.add/merge or session.add_all) and call session.commit() once after the loop so all size updates are persisted atomically; alternatively, implement a new DocumentCrud.bulk_update_sizes(project_id, list_of_(doc_id,size_kb)) and call that to encapsulate ownership checks and a single commit.

backend/app/tests/services/collections/test_helpers.py (1)
125-130: Test change looks correct, but relies on an implicit `TypeError`.

The expectation aligns with the stricter `batch_documents` behavior after removing the `or 0` fallback. One small caveat: the `TypeError` is raised implicitly by arithmetic on `None` in `helpers.batch_documents` (see helpers.py:87, 99), not by an explicit guard. If that helper later adds an explicit `ValueError`/validation (which would be preferable — the production code now backfills precisely so this state is unexpected), this test will need updating. Consider either tightening the helper to raise a typed, message-checked error or using `pytest.raises((TypeError, ValueError))` here to be forward-compatible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/services/collections/test_helpers.py` around lines 125 - 130, The test test_batch_documents_with_none_file_size_raises relies on an implicit TypeError from arithmetic on None in helpers.batch_documents; make it forward‑compatible by changing the assertion to accept either TypeError or ValueError (use pytest.raises((TypeError, ValueError))) so future explicit validation in helpers.batch_documents still passes without changing the test.
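The forward-compatible assertion suggested above — accepting either exception type — looks like this. `batch_documents` is replaced here with a minimal stand-in so the snippet is self-contained; `pytest.raises` does accept a tuple of expected exception types:

```python
import pytest

def batch_documents(documents):
    # Minimal stand-in: sums sizes the way the real helper accumulates them.
    total = 0
    for size_kb in documents:
        total += size_kb  # TypeError today; a future guard might raise ValueError
    return total

def test_batch_documents_with_none_file_size_raises():
    # Passes whether the helper keeps the implicit TypeError
    # or later adds an explicit ValueError-raising validation step.
    with pytest.raises((TypeError, ValueError)):
        batch_documents([100.0, None])

test_batch_documents_with_none_file_size_raises()
print("ok")
```

The tuple form means the test does not need to change if the helper's contract is tightened later.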
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 176-186: The backfill loop in create_collection.py currently calls
storage.get_file_size_kb(doc.object_store_url) without per-document error
handling, so a single storage error aborts the whole job; wrap the call inside a
try/except around the storage.get_file_size_kb invocation in the loop over
flat_docs (the block that assigns size_kb, sets doc.file_size_kb and appends to
backfill) to catch storage errors, log or record the failing doc.id and error,
skip that document (leave doc.file_size_kb as None) and continue processing the
rest; keep the existing total_size_kb calculation (which already ignores None)
so a single failure doesn't raise, and optionally add a note to record
failed_backfills for later retry/metric.
- Around line 176-181: The code is truncating fractional KB precision by
wrapping storage.get_file_size_kb(doc.object_store_url) in round(...); instead,
remove the bare round call and assign the returned float directly to
doc.file_size_kb and to the backfill tuple (which is declared as
list[tuple[UUID, float]]), e.g. call size_kb = storage.get_file_size_kb(...),
set doc.file_size_kb = size_kb and append (doc.id, size_kb) so the float
precision is preserved.
---
Nitpick comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 169-172: The code creates storage via get_cloud_storage while
inside a short-lived with Session(engine) block which currently is safe because
get_cloud_storage only uses the session to read the project and returns an
AmazonCloudStorage that manages its own boto3 lifecycle, but this is fragile if
get_cloud_storage later retains the session; update create_collection (around
the with Session(engine) block where DocumentCrud.read_each and
get_cloud_storage are called) to either 1) move project lookup out so you pass
only project_id into a new storage factory that does not receive the live
Session, or 2) add a clear inline comment next to the get_cloud_storage call
referencing get_cloud_storage and AmazonCloudStorage to explain that the
returned storage is safe to use outside the session (and warn future maintainers
not to hold onto the session), so future changes won’t accidentally create a
dangling session reference.
- Around line 188-194: The backfill loop calls DocumentCrud.read_one +
DocumentCrud.update which commits per-document; change to perform a
single-transaction bulk update: inside the existing with Session(engine) as
session block, collect the target Document objects (use the already-available
mutated objects instead of re-fetching if present), set doc.file_size_kb for
each doc from backfill, add/merge them into the same session (e.g.
session.add/merge or session.add_all) and call session.commit() once after the
loop so all size updates are persisted atomically; alternatively, implement a
new DocumentCrud.bulk_update_sizes(project_id, list_of_(doc_id,size_kb)) and
call that to encapsulate ownership checks and a single commit.
In `@backend/app/tests/services/collections/test_helpers.py`:
- Around line 125-130: The test test_batch_documents_with_none_file_size_raises
relies on an implicit TypeError from arithmetic on None in
helpers.batch_documents; make it forward‑compatible by changing the assertion to
accept either TypeError or ValueError (use pytest.raises((TypeError,
ValueError))) so future explicit validation in helpers.batch_documents still
passes without changing the test.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a312d795-f9ef-4847-a716-6ad6a3b3c4ac
📒 Files selected for processing (2)
- backend/app/services/collections/create_collection.py
- backend/app/tests/services/collections/test_helpers.py
```python
backfill: list[tuple[UUID, float]] = []
for doc in flat_docs:
    if doc.file_size_kb is None:
        size_kb = round(storage.get_file_size_kb(doc.object_store_url))
        doc.file_size_kb = size_kb
        backfill.append((doc.id, size_kb))

total_size_kb = sum(
    doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
)
total_size_mb = total_size_kb / 1024
```
Single storage failure aborts the entire collection job.

`storage.get_file_size_kb(doc.object_store_url)` is called synchronously per document with no per-document error handling. Per backend/app/core/cloud/storage.py:220-230, this wraps a `head_object` call that will raise on any error (missing object, IAM issue, transient S3 outage, mid-deletion, etc.). A single failure here propagates to the outer try at line 267 and marks the whole job FAILED — even for collections where most documents already have `file_size_kb` persisted or where one legacy doc is simply missing from S3.

Given this PR's goal is to backfill historical data, recovering gracefully from per-doc failures is likely the safer posture:
🛡️ Suggested per-doc guard

```diff
 backfill: list[tuple[UUID, float]] = []
 for doc in flat_docs:
     if doc.file_size_kb is None:
-        size_kb = round(storage.get_file_size_kb(doc.object_store_url))
-        doc.file_size_kb = size_kb
-        backfill.append((doc.id, size_kb))
+        try:
+            size_kb = storage.get_file_size_kb(doc.object_store_url)
+        except Exception as err:
+            logger.warning(
+                f"[execute_job] Failed to backfill file_size_kb | "
+                f"{{'doc_id': '{doc.id}', 'error': '{err}'}}"
+            )
+            continue
+        doc.file_size_kb = size_kb
+        backfill.append((doc.id, size_kb))
```

Note: if you keep fail-fast semantics intentionally, the downstream `if doc.file_size_kb is not None` filter at line 184 and the `TypeError`-raising path in `batch_documents` become inconsistent with each other — pick one contract and document it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/collections/create_collection.py` around lines 176 -
186, The backfill loop in create_collection.py currently calls
storage.get_file_size_kb(doc.object_store_url) without per-document error
handling, so a single storage error aborts the whole job; wrap the call inside a
try/except around the storage.get_file_size_kb invocation in the loop over
flat_docs (the block that assigns size_kb, sets doc.file_size_kb and appends to
backfill) to catch storage errors, log or record the failing doc.id and error,
skip that document (leave doc.file_size_kb as None) and continue processing the
rest; keep the existing total_size_kb calculation (which already ignores None)
so a single failure doesn't raise, and optionally add a note to record
failed_backfills for later retry/metric.
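The recovery posture described in the prompt — skip the failing document, record it, keep going — can be sketched independently of the storage layer. `fetch_size_kb` and the record tuples below are hypothetical stand-ins, not the project's actual API:

```python
def backfill_sizes(docs, fetch_size_kb):
    """Best-effort enrichment: a single bad object must not abort the batch.

    docs: iterable of (doc_id, url, size_kb_or_None) tuples.
    Returns (backfilled, failed) where backfilled is [(doc_id, size_kb), ...].
    """
    backfilled, failed = [], []
    for doc_id, url, size_kb in docs:
        if size_kb is not None:
            continue  # already populated, nothing to do
        try:
            fetched = fetch_size_kb(url)
        except Exception:
            failed.append(doc_id)  # record for later retry/metrics
            continue
        backfilled.append((doc_id, fetched))
    return backfilled, failed

def fake_fetch(url):
    # Simulates storage: one object is missing from S3.
    if url == "s3://bucket/missing":
        raise IOError("404 Not Found")
    return 128.5

docs = [
    (1, "s3://bucket/a", None),
    (2, "s3://bucket/b", 512.0),
    (3, "s3://bucket/missing", None),
]
print(backfill_sizes(docs, fake_fetch))  # ([(1, 128.5)], [3])
```

Returning the failed IDs instead of raising keeps the job alive while still making the failures observable.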
```python
backfill: list[tuple[UUID, float]] = []
for doc in flat_docs:
    if doc.file_size_kb is None:
        size_kb = round(storage.get_file_size_kb(doc.object_store_url))
        doc.file_size_kb = size_kb
        backfill.append((doc.id, size_kb))
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
rg -nP -C2 '\bfile_size_kb\b' --type=py -g '!**/tests/**'
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 4257
🏁 Script executed:

```shell
#!/bin/bash
# Check the storage.py file around the get_file_size_kb function
wc -l backend/app/core/cloud/storage.py
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 112
🏁 Script executed:

```shell
#!/bin/bash
# Read lines 220-230 from storage.py to see get_file_size_kb implementation
sed -n '220,230p' backend/app/core/cloud/storage.py
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 662
Remove bare `round(...)` to preserve fractional KB precision.

`storage.get_file_size_kb()` returns a float already rounded to 2 decimals (e.g., 512.37). Wrapping it with bare `round(...)` truncates to an integer, causing precision loss. This contradicts the declared type `backfill: list[tuple[UUID, float]]` and the `Document.file_size_kb` column type (`Float` in the database schema).
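The precision loss is a plain consequence of Python's built-in `round`: with no `ndigits` argument it returns an `int`, discarding the fractional KB that the storage layer already rounded to two decimals:

```python
size_kb = 512.37  # value as returned by get_file_size_kb (already 2-decimal)

print(round(size_kb))     # 512 — int, fractional KB lost
print(round(size_kb, 2))  # 512.37 — float precision preserved
print(size_kb)            # 512.37 — no extra rounding needed at all
```

Since the value arrives pre-rounded, simply dropping the outer `round(...)` is enough.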
🔧 Proposed fix

```diff
 backfill: list[tuple[UUID, float]] = []
 for doc in flat_docs:
     if doc.file_size_kb is None:
-        size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+        size_kb = storage.get_file_size_kb(doc.object_store_url)
         doc.file_size_kb = size_kb
         backfill.append((doc.id, size_kb))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
backfill: list[tuple[UUID, float]] = []
for doc in flat_docs:
    if doc.file_size_kb is None:
        size_kb = storage.get_file_size_kb(doc.object_store_url)
        doc.file_size_kb = size_kb
        backfill.append((doc.id, size_kb))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/collections/create_collection.py` around lines 176 -
181, The code is truncating fractional KB precision by wrapping
storage.get_file_size_kb(doc.object_store_url) in round(...); instead, remove
the bare round call and assign the returned float directly to doc.file_size_kb
and to the backfill tuple (which is declared as list[tuple[UUID, float]]), e.g.
call size_kb = storage.get_file_size_kb(...), set doc.file_size_kb = size_kb and
append (doc.id, size_kb) so the float precision is preserved.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)
200-214: ⚠️ Potential issue | 🟠 Major

Avoid publishing the pre-backfill total.

The job is still marked PROCESSING once with `sum(doc.file_size_kb or 0 ...)` before the backfill runs, then updated again later. That can briefly expose the old zero-fallback total and undermines the corrected total-size behavior.

🛠️ Suggested cleanup

```diff
-        total_size_kb = sum(doc.file_size_kb or 0 for doc in flat_docs)
-        total_size_mb = round(total_size_kb / 1024, 2)
         span.set_attribute("collection.documents.count", len(flat_docs))
-        span.set_attribute("collection.documents.total_size_mb", total_size_mb)
-
-        with Session(engine) as session:
-            collection_job_crud = CollectionJobCrud(session, project_id)
-            collection_job = collection_job_crud.read_one(job_uuid)
-            collection_job = collection_job_crud.update(
-                job_uuid,
-                CollectionJobUpdate(
-                    task_id=task_id,
-                    status=CollectionJobStatus.PROCESSING,
-                    total_size_mb=total_size_mb,
-                ),
-            )
-
-            storage = get_cloud_storage(session=session, project_id=project_id)
-            provider = get_llm_provider(
-                session=session,
-                provider=creation_request.provider,
-                project_id=project_id,
-                organization_id=organization_id,
-            )
```

Then set the span attribute after the backfilled total is computed:

```diff
     total_size_mb = total_size_kb / 1024
+    span.set_attribute("collection.documents.total_size_mb", round(total_size_mb, 2))
```

Also applies to: 239-260
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 200 - 214, The code is publishing a pre-backfill total_size_mb (computed from total_size_kb) and setting span.set_attribute("collection.documents.total_size_mb", ...) before the backfill runs; change the flow so you do not update the CollectionJob or span with the pre-backfill total: keep marking the job PROCESSING via CollectionJobCrud.update(job_uuid, CollectionJobUpdate(..., status=CollectionJobStatus.PROCESSING, task_id=task_id)) but postpone setting total_size_mb on the CollectionJob and calling span.set_attribute("collection.documents.total_size_mb", total_size_mb) until after you compute the backfilled total (the later total_size_kb -> total_size_mb calculation), ensuring any CollectionJobCrud.update that writes total_size_mb only uses the backfilled value and the span attribute is set after backfill completion.
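The intended ordering — compute and publish the total only from backfilled sizes — reduces to a small calculation. The sizes below are made-up illustrative values; `None` marks a document whose size could not be resolved:

```python
# Per-document sizes after the backfill pass; None entries are excluded
# from the total rather than being counted as 0.
file_sizes_kb = [120.5, None, 512.37]

total_size_kb = sum(s for s in file_sizes_kb if s is not None)
total_size_mb = round(total_size_kb / 1024, 2)

print(total_size_mb)  # 0.62
```

Publishing `total_size_mb` only after this point avoids briefly exposing a zero-fallback total to the job record.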
♻️ Duplicate comments (2)
backend/app/services/collections/create_collection.py (2)
235-237: ⚠️ Potential issue | 🟡 Minor

Preserve fractional KB precision.

`get_file_size_kb()` already returns a rounded float, so bare `round(...)` truncates values to whole KB before persisting them.

🔧 Proposed fix

```diff
-            size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+            size_kb = storage.get_file_size_kb(doc.object_store_url)
             doc.file_size_kb = size_kb
             backfill.append((doc.id, size_kb))
```

Quick verification of Python's bare `round(float)` behavior:

```shell
#!/bin/bash
python - <<'PY'
print(round(512.37))
PY
```
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 235 - 237, The code currently calls round(storage.get_file_size_kb(...)) which coerces the returned rounded float to an integer and loses fractional KB precision; change the assignment in the create collection logic to use the value returned by get_file_size_kb() directly (e.g., set size_kb = storage.get_file_size_kb(doc.object_store_url)), then assign doc.file_size_kb = size_kb and backfill.append((doc.id, size_kb)) so the fractional KB is preserved.
232-237: ⚠️ Potential issue | 🟠 Major

Guard each file-size backfill from storage failures.

A single `head_object` failure from `storage.get_file_size_kb(...)` still aborts the whole collection job. Keep processing other documents when one legacy object cannot be inspected.

🛡️ Proposed per-document guard

```diff
 backfill: list[tuple[UUID, float]] = []
 for doc in flat_docs:
     if doc.file_size_kb is None:
-        size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+        try:
+            size_kb = storage.get_file_size_kb(doc.object_store_url)
+        except Exception as err:
+            logger.warning(
+                "[create_collection.execute_job] Failed to backfill file_size_kb | "
+                "{'doc_id': '%s', 'error': '%s'}",
+                doc.id,
+                str(err),
+            )
+            continue
         doc.file_size_kb = size_kb
         backfill.append((doc.id, size_kb))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 232 - 237, The backfill loop over flat_docs currently calls storage.get_file_size_kb(doc.object_store_url) without guarding against storage failures, which lets a single head_object error abort the whole collection job; update the loop in create_collection.py to wrap each call to storage.get_file_size_kb inside a try/except that catches storage-related exceptions, logs a warning including doc.id and doc.object_store_url (so failures are observable), skips setting doc.file_size_kb and does not append to backfill for that document, and continues processing the remaining docs; ensure you reference the same variables (flat_docs, doc.file_size_kb, backfill) and preserve behavior for successful size lookups.
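The guarded loop can be sketched in isolation. This is a self-contained sketch, not the project's code: `StubStorage`, the dict-shaped documents, and the bare `Exception` catch are all stand-ins (the real code would catch the project's storage error type, e.g. `CloudStorageError`):

```python
import logging

logger = logging.getLogger(__name__)


class StubStorage:
    """Stand-in for the cloud storage client; fails for one known-bad URL."""

    def get_file_size_kb(self, url: str) -> float:
        if url == "s3://bucket/missing.pdf":
            raise RuntimeError("head_object failed: 404")
        return 512.37


def backfill_sizes(docs: list[dict], storage: StubStorage) -> list[tuple[str, float]]:
    """Backfill file_size_kb per document, skipping (not aborting on) failures."""
    backfill = []
    for doc in docs:
        if doc["file_size_kb"] is None:
            try:
                size_kb = storage.get_file_size_kb(doc["object_store_url"])
            except Exception as err:
                logger.warning(
                    "[backfill_sizes] Failed to backfill file_size_kb | "
                    "doc_id=%s error=%s",
                    doc["id"],
                    err,
                )
                continue  # leave file_size_kb=None; keep processing the rest
            doc["file_size_kb"] = size_kb
            backfill.append((doc["id"], size_kb))
    return backfill
```

With three documents where one URL is bad, the other missing-size document is still backfilled, the bad one stays `None`, and a document that already has a size is left untouched, matching the already-filtered sum downstream.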
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 225-268: The code block starting after the try around the Session
usage is mis-indented and the get_llm_provider(...) call is missing its closing
parenthesis before the subsequent with tracer.start_as_current_span(...); fix by
moving the entire block (the with Session(engine) as session: block that
initializes DocumentCrud, computes backfill, updates documents and
collection_job) to be inside the try scope and add the closing ')' to the
get_llm_provider call so it becomes get_llm_provider(session=..., provider=...,
project_id=..., organization_id=...) immediately before the with
tracer.start_as_current_span("collections.create.provider") line; ensure
indentation of the following lines (provider variable use and the tracer
context) matches the surrounding try block.
---
Outside diff comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 200-214: The code is publishing a pre-backfill total_size_mb
(computed from total_size_kb) and setting
span.set_attribute("collection.documents.total_size_mb", ...) before the
backfill runs; change the flow so you do not update the CollectionJob or span
with the pre-backfill total: keep marking the job PROCESSING via
CollectionJobCrud.update(job_uuid, CollectionJobUpdate(...,
status=CollectionJobStatus.PROCESSING, task_id=task_id)) but postpone setting
total_size_mb on the CollectionJob and calling
span.set_attribute("collection.documents.total_size_mb", total_size_mb) until
after you compute the backfilled total (the later total_size_kb -> total_size_mb
calculation), ensuring any CollectionJobCrud.update that writes total_size_mb
only uses the backfilled value and the span attribute is set after backfill
completion.
---
Duplicate comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 235-237: The code currently calls
round(storage.get_file_size_kb(...)) which coerces the returned rounded float to
an integer and loses fractional KB precision; change the assignment in the
create collection logic to use the value returned by get_file_size_kb() directly
(e.g., set size_kb = storage.get_file_size_kb(doc.object_store_url)), then
assign doc.file_size_kb = size_kb and backfill.append((doc.id, size_kb)) so the
fractional KB is preserved.
- Around line 232-237: The backfill loop over flat_docs currently calls
storage.get_file_size_kb(doc.object_store_url) without guarding against storage
failures, which lets a single head_object error abort the whole collection job;
update the loop in create_collection.py to wrap each call to
storage.get_file_size_kb inside a try/except that catches storage-related
exceptions, logs a warning including doc.id and doc.object_store_url (so
failures are observable), skips setting doc.file_size_kb and does not append to
backfill for that document, and continues processing the remaining docs; ensure
you reference the same variables (flat_docs, doc.file_size_kb, backfill) and
preserve behavior for successful size lookups.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7947ebe7-05ab-49be-a857-067135ce4e98
📒 Files selected for processing (1)
backend/app/services/collections/create_collection.py
```python
with Session(engine) as session:
    document_crud = DocumentCrud(session, project_id)
    flat_docs = document_crud.read_each(creation_request.documents)
    storage = get_cloud_storage(session=session, project_id=project_id)

    file_exts = {doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname}

    backfill: list[tuple[UUID, float]] = []
    for doc in flat_docs:
        if doc.file_size_kb is None:
            size_kb = round(storage.get_file_size_kb(doc.object_store_url))
            doc.file_size_kb = size_kb
            backfill.append((doc.id, size_kb))

    total_size_kb = sum(
        doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
    )
    total_size_mb = total_size_kb / 1024

with Session(engine) as session:
    if backfill:
        document_crud = DocumentCrud(session, project_id)
        for doc_id, size_kb in backfill:
            doc = document_crud.read_one(doc_id)
            doc.file_size_kb = size_kb
            document_crud.update(doc)

    collection_job_crud = CollectionJobCrud(session, project_id)
    collection_job = collection_job_crud.read_one(job_uuid)
    collection_job = collection_job_crud.update(
        job_uuid,
        CollectionJobUpdate(
            task_id=task_id,
            status=CollectionJobStatus.PROCESSING,
            total_size_mb=round(total_size_mb, 2),
        ),
    )

    provider = get_llm_provider(
        session=session,
        provider=creation_request.provider,
        project_id=project_id,
        organization_id=organization_id,
    with tracer.start_as_current_span("collections.create.provider"):
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
python - <<'PY'
import ast
from pathlib import Path

path = Path("backend/app/services/collections/create_collection.py")
try:
    ast.parse(path.read_text(), filename=str(path))
except SyntaxError as err:
    print(f"{path}:{err.lineno}:{err.offset}: {err.msg}")
    raise SystemExit(1)
print("syntax ok")
PY
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 172

🏁 Script executed:

```bash
cat -n backend/app/services/collections/create_collection.py | sed -n '175,275p'
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 5065
**Fix indentation and close the `get_llm_provider()` call to resolve syntax errors.**

The `try` block starting at line 184 is incomplete: line 225 onwards should be indented within it, and the `get_llm_provider()` call at line 263 needs a closing parenthesis before the `with tracer.start_as_current_span()` statement at line 268.
Proposed structural fix
```diff
-    with Session(engine) as session:
-        document_crud = DocumentCrud(session, project_id)
-        flat_docs = document_crud.read_each(creation_request.documents)
-        storage = get_cloud_storage(session=session, project_id=project_id)
+        with Session(engine) as session:
+            document_crud = DocumentCrud(session, project_id)
+            flat_docs = document_crud.read_each(creation_request.documents)
+            storage = get_cloud_storage(session=session, project_id=project_id)

-        file_exts = {doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname}
+            file_exts = {doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname}

-        backfill: list[tuple[UUID, float]] = []
-        for doc in flat_docs:
-            if doc.file_size_kb is None:
-                size_kb = round(storage.get_file_size_kb(doc.object_store_url))
-                doc.file_size_kb = size_kb
-                backfill.append((doc.id, size_kb))
+            backfill: list[tuple[UUID, float]] = []
+            for doc in flat_docs:
+                if doc.file_size_kb is None:
+                    size_kb = round(storage.get_file_size_kb(doc.object_store_url))
+                    doc.file_size_kb = size_kb
+                    backfill.append((doc.id, size_kb))

-        total_size_kb = sum(
-            doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
-        )
-        total_size_mb = total_size_kb / 1024
+            total_size_kb = sum(
+                doc.file_size_kb for doc in flat_docs if doc.file_size_kb is not None
+            )
+            total_size_mb = total_size_kb / 1024

-    with Session(engine) as session:
-        if backfill:
-            document_crud = DocumentCrud(session, project_id)
-            for doc_id, size_kb in backfill:
-                doc = document_crud.read_one(doc_id)
-                doc.file_size_kb = size_kb
-                document_crud.update(doc)
+        with Session(engine) as session:
+            if backfill:
+                document_crud = DocumentCrud(session, project_id)
+                for doc_id, size_kb in backfill:
+                    doc = document_crud.read_one(doc_id)
+                    doc.file_size_kb = size_kb
+                    document_crud.update(doc)

-        collection_job_crud = CollectionJobCrud(session, project_id)
-        collection_job = collection_job_crud.read_one(job_uuid)
-        collection_job = collection_job_crud.update(
-            job_uuid,
-            CollectionJobUpdate(
-                task_id=task_id,
-                status=CollectionJobStatus.PROCESSING,
-                total_size_mb=round(total_size_mb, 2),
-            ),
-        )
+            collection_job_crud = CollectionJobCrud(session, project_id)
+            collection_job = collection_job_crud.read_one(job_uuid)
+            collection_job = collection_job_crud.update(
+                job_uuid,
+                CollectionJobUpdate(
+                    task_id=task_id,
+                    status=CollectionJobStatus.PROCESSING,
+                    total_size_mb=round(total_size_mb, 2),
+                ),
+            )

-        provider = get_llm_provider(
-            session=session,
-            provider=creation_request.provider,
-            project_id=project_id,
-            organization_id=organization_id,
+            provider = get_llm_provider(
+                session=session,
+                provider=creation_request.provider,
+                project_id=project_id,
+                organization_id=organization_id,
+            )
```

🧰 Tools
🪛 Ruff (0.15.10)
[warning] 225-225: Expected except or finally after try block
(invalid-syntax)
[warning] 267-268: Expected ), found newline
(invalid-syntax)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/services/collections/create_collection.py` around lines 225 -
268, The code block starting after the try around the Session usage is
mis-indented and the get_llm_provider(...) call is missing its closing
parenthesis before the subsequent with tracer.start_as_current_span(...); fix by
moving the entire block (the with Session(engine) as session: block that
initializes DocumentCrud, computes backfill, updates documents and
collection_job) to be inside the try scope and add the closing ')' to the
get_llm_provider call so it becomes get_llm_provider(session=..., provider=...,
project_id=..., organization_id=...) immediately before the with
tracer.start_as_current_span("collections.create.provider") line; ensure
indentation of the following lines (provider variable use and the tracer
context) matches the surrounding try block.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)
**193-223:** ⚠️ Potential issue | 🟠 Major — Old pre-backfill block wasn't removed; it reintroduces the `or 0` bug this PR aims to fix and double-updates the job.

Lines 193–223 are still running the original flow (read docs → compute `total_size_kb` with the `doc.file_size_kb or 0` fallback → update the job to `PROCESSING` with that total → init storage/provider) before the new backfill block at 225–267 executes. This means:

- Line 200 still silently assumes `0 KB` for documents missing `file_size_kb`, which directly contradicts this PR's stated goal of producing "more accurate total-size reporting" and being consistent with `helpers.py` now using `doc.file_size_kb` without the `or 0` fallback.
- The job is moved to `PROCESSING` twice (lines 208–215 and again at 254–261), and on the first transition `total_size_mb` is the pre-backfill (incorrect) value. Any consumer observing the job in between the two updates will see the wrong total.
- `flat_docs`, `storage`, and `provider` are each fetched twice (lines 193–195 vs 225–228, line 217 vs 228, line 218 vs 263), doubling DB reads and an S3/project lookup (`get_cloud_storage` re-queries the project per `backend/app/core/cloud/storage.py:291-308`).

Based on the AI summary ("Move storage initialization after a second read of document metadata… recompute and persist rounded collection total size before marking job PROCESSING"), the new block at 225–267 was meant to replace this section, not coexist with it. Delete 193–223 (keep only the `span.set_attribute` calls in a single place) so the `PROCESSING` transition happens exactly once, with the backfilled total.

🛠 Suggested removal

```diff
     job_uuid = UUID(job_id)

-    with Session(engine) as session:
-        document_crud = DocumentCrud(session, project_id)
-        flat_docs = document_crud.read_each(creation_request.documents)
-
-        file_exts = {
-            doc.fname.split(".")[-1] for doc in flat_docs if "." in doc.fname
-        }
-        total_size_kb = sum(doc.file_size_kb or 0 for doc in flat_docs)
-        total_size_mb = round(total_size_kb / 1024, 2)
-        span.set_attribute("collection.documents.count", len(flat_docs))
-        span.set_attribute("collection.documents.total_size_mb", total_size_mb)
-
-    with Session(engine) as session:
-        collection_job_crud = CollectionJobCrud(session, project_id)
-        collection_job = collection_job_crud.read_one(job_uuid)
-        collection_job = collection_job_crud.update(
-            job_uuid,
-            CollectionJobUpdate(
-                task_id=task_id,
-                status=CollectionJobStatus.PROCESSING,
-                total_size_mb=total_size_mb,
-            ),
-        )
-
-        storage = get_cloud_storage(session=session, project_id=project_id)
-        provider = get_llm_provider(
-            session=session,
-            provider=creation_request.provider,
-            project_id=project_id,
-            organization_id=organization_id,
-        )
-
     with Session(engine) as session:
         document_crud = DocumentCrud(session, project_id)
         flat_docs = document_crud.read_each(creation_request.documents)
         storage = get_cloud_storage(session=session, project_id=project_id)
```

And move the two `span.set_attribute("collection.documents.*", …)` calls to right after the recomputed `total_size_mb` in the new block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 193 - 223, Remove the old pre-backfill flow that reads documents and updates the job (the block that calls Session/DocumentCrud.read_each, computes total_size_kb using "doc.file_size_kb or 0", then calls CollectionJobCrud.update to set CollectionJobStatus.PROCESSING and initializes get_cloud_storage/get_llm_provider); leave only the new backfill block (the later read/recompute/update) so the job transitions to PROCESSING exactly once with the backfilled total_size_mb. Specifically delete the duplicated reads/updates (flat_docs/read_each, the pre-backfill total_size_kb computation, and the first CollectionJobCrud.update/PROCESSING call) and move the two span.set_attribute("collection.documents.count", …) and span.set_attribute("collection.documents.total_size_mb", …) calls to directly after the recomputed total_size_mb in the backfill block so the span reflects the corrected total.
♻️ Duplicate comments (1)
backend/app/services/collections/create_collection.py (1)
**225-268:** ⚠️ Potential issue | 🔴 Critical — Previously-raised blockers on this new block are still unresolved.

Re-confirming that the issues flagged on earlier commits are still present in the current diff and must be fixed before this can merge, or even parse:

- Syntax / indentation (blocker): Ruff 0.15.11 still reports `Expected except or finally after try block` at 225 and `Expected )` at 267–268. The new `with Session(engine) as session:` at 225 sits at the same indentation as `try:` (so 225–267 is outside the `try`), and the `get_llm_provider(...)` call at 263–267 is missing its closing `)`; line 268's `with tracer.start_as_current_span(...)` is being parsed as a continuation of the open paren. This file does not currently compile.
- Precision loss at line 235: `storage.get_file_size_kb()` already returns a `float` rounded to 2 decimals (per `backend/app/core/cloud/storage.py:220-238`). Wrapping it in bare `round(...)` truncates to an integer, which is inconsistent with both the declared `backfill: list[tuple[UUID, float]]` type and the `Float` column.
- Single-doc S3 failure fails the whole job (lines 232–237): `storage.get_file_size_kb` can raise `CloudStorageError` on any `ClientError` (missing object, IAM, transient outage) per `backend/app/core/cloud/storage.py:220-238`. Because this loop has no per-doc guard, one legacy/missing object aborts the whole collection job, which is counterproductive for a historical backfill. A `try/except` around the call (log and `continue`, leaving `file_size_kb=None`) is consistent with the already-filtered sum at 239–241.

Once the block at 193–223 is removed (see separate comment), please also resolve all three above. The log added in (3) should follow the repo convention `f"[execute_job] … {mask_string(...)}"` as per the coding guidelines: "Prefix all log messages with the function name in square brackets."
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/collections/create_collection.py` around lines 225 - 268, The new Session/try block and the get_llm_provider call are syntactically broken and the per-doc S3 backfill logic is unsafe and losing precision: move or indent the "with Session(engine) as session:" so it stays inside the surrounding try block (or close the try before opening the new with), and close the open parenthesis on get_llm_provider (add the missing ")" before "with tracer.start_as_current_span") so the parser doesn't treat the tracer line as part of the call; in the backfill loop (flat_docs) stop calling bare round(storage.get_file_size_kb(...)) — preserve the float returned by storage.get_file_size_kb to match backfill: list[tuple[UUID, float]] and the DB Float column; and wrap the storage.get_file_size_kb call in a try/except catching CloudStorageError, log the failure with the repo convention prefix f"[execute_job] ..." and mask_string(...) then continue without setting file_size_kb so the sum/filter behavior remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 193-223: Remove the old pre-backfill flow that reads documents and
updates the job (the block that calls Session/DocumentCrud.read_each, computes
total_size_kb using "doc.file_size_kb or 0", then calls CollectionJobCrud.update
to set CollectionJobStatus.PROCESSING and initializes
get_cloud_storage/get_llm_provider); leave only the new backfill block (the
later read/recompute/update) so the job transitions to PROCESSING exactly once
with the backfilled total_size_mb. Specifically delete the duplicated
reads/updates (flat_docs/read_each, the pre-backfill total_size_kb computation,
and the first CollectionJobCrud.update/PROCESSING call) and move the two
span.set_attribute("collection.documents.count", …) and
span.set_attribute("collection.documents.total_size_mb", …) calls to directly
after the recomputed total_size_mb in the backfill block so the span reflects
the corrected total.
---
Duplicate comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 225-268: The new Session/try block and the get_llm_provider call
are syntactically broken and the per-doc S3 backfill logic is unsafe and losing
precision: move or indent the "with Session(engine) as session:" so it stays
inside the surrounding try block (or close the try before opening the new with),
and close the open parenthesis on get_llm_provider (add the missing ")" before
"with tracer.start_as_current_span") so the parser doesn't treat the tracer line
as part of the call; in the backfill loop (flat_docs) stop calling bare
round(storage.get_file_size_kb(...)) — preserve the float returned by
storage.get_file_size_kb to match backfill: list[tuple[UUID, float]] and the DB
Float column; and wrap the storage.get_file_size_kb call in a try/except
catching CloudStorageError, log the failure with the repo convention prefix
f"[execute_job] ..." and mask_string(...) then continue without setting
file_size_kb so the sum/filter behavior remains unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 24103d97-ea73-402f-875a-c9786b609758
📒 Files selected for processing (1)
backend/app/services/collections/create_collection.py
Target issue is #768
Summary
Documentation
Bug Fixes / Behavior
Summary by CodeRabbit
Documentation
Bug Fixes
Tests