
Conversation

@d33bs (Member) commented Nov 1, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added example demonstrating data access and image visualization via LanceDB
    • Enhanced image processing with improved error handling, retry logic, and timeout control
  • Improvements

    • Added detailed logging and progress tracking for data processing workflows
    • Improved file validation and idempotent download checks
  • Dependencies

    • Updated Python version support (>3.11, <3.13)
    • Upgraded PyArrow and LanceDB to latest compatible versions

✏️ Tip: You can customize this high-level summary in your review settings.

@review-notebook-app commented

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

@d33bs marked this pull request as ready for review January 15, 2026 23:23
@d33bs (Member, Author) commented Jan 15, 2026

@coderabbitai full review

@coderabbitai (bot) commented Jan 15, 2026

✅ Actions performed

Full review triggered.

@coderabbitai (bot) commented Jan 15, 2026

📝 Walkthrough

The changes introduce LanceDB-based data packaging with image processing utilities, including FTP retrieval enhancements, Docker optimization, context frame computation, and OME-Arrow schema integration. Project dependencies are updated, and schema files undergo minor formatting adjustments.

Changes

  • Configuration & Ignores (.gitignore, pyproject.toml): Added ignore patterns for text, CSV, and Parquet files. Updated Python constraint to >3.11,<3.13; upgraded pyarrow (>=12) and lancedb (>=0.6.7); added serpula_rasa, imageio, jupyterlab, jupytext; introduced the DOCKER_DEFAULT_PLATFORM environment variable and jupytext tool configuration.
  • Data Packaging Examples & Utilities (5.data_packaging/example.py, 5.data_packaging/gather_images.py): New example script demonstrating LanceDB integration and image rendering (see the sketch below). Enhanced gather_images.py with FTP retry logic, Docker image pre-check optimization, frame context computation, OME-Arrow data assembly, Parquet export, and extensive logging for IC-based image processing.
  • Schema Files (5.data_packaging/schema/*): Minor formatting adjustments across 13 schema files: whitespace normalization, field label truncation in mitocheck_metadata.idr0013-screenA-annotation.csv.gz, and an index value update. No semantic changes to data structures.
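As a rough illustration only (not the PR's actual example.py; the database path, table name, and column names are assumptions), the LanceDB access and image-rendering pattern might look like:

# Hedged sketch: open a local LanceDB table and render one stored image.
# Path, table name, and column names below are illustrative, not from the PR.
import io

import lancedb
import matplotlib.pyplot as plt
from PIL import Image

db = lancedb.connect("./5.data_packaging/packaged/lancedb")  # hypothetical path
table = db.open_table("mitocheck_images")  # hypothetical table name

# read a few rows into pandas for inspection
df = table.to_pandas().head(3)
print(df.columns.tolist())

# assume one column holds encoded image bytes (e.g., TIFF or PNG)
image = Image.open(io.BytesIO(df.iloc[0]["image_bytes"]))  # hypothetical column
plt.imshow(image, cmap="gray")
plt.axis("off")
plt.show()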

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant ScriptRunner as Script Runner
    participant FTP as FTP Server
    participant ImageProc as Image Processor
    participant OMEArrow as OME-Arrow Assembler
    participant Parquet as Parquet Exporter

    User->>ScriptRunner: Trigger read_image_as_binary(image_path)
    ScriptRunner->>FTP: retrieve_ftp_file with retries & timeout
    FTP-->>ScriptRunner: Local file or retry/fallback
    
    ScriptRunner->>ImageProc: get_ic_context_frames(target_frame, movie_len)
    ImageProc-->>ScriptRunner: List of context frame indices
    
    ScriptRunner->>ImageProc: get_frame_tiff_from_idr_ch5 for each frame
    ImageProc-->>ScriptRunner: TIFF file path
    
    ScriptRunner->>ImageProc: pybasic_IC_target_frame_to_tiff(frames_as_arrays, target_frame)
    ImageProc->>ImageProc: normalize_frame_label, scale to uint8, save
    ImageProc-->>ScriptRunner: Processed target frame path
    
    ScriptRunner->>OMEArrow: make_ome_arrow_row for frame metadata
    OMEArrow-->>ScriptRunner: OME-Arrow structure
    
    ScriptRunner->>Parquet: Coerce to OME_STRUCT_TYPE & parquet.write_table
    Parquet-->>ScriptRunner: Exported Parquet file with IC frame data
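The get_ic_context_frames step above selects neighboring frames around a target frame for illumination correction. The PR's exact window size and edge handling are not shown here; the sketch below is one plausible approach, with the window size and symmetric clamping as assumptions:

def get_context_frames(target_frame: int, movie_len: int, window: int = 2) -> list[int]:
    """Illustrative only: return up to `window` frames on each side of `target_frame`,
    clamped to [0, movie_len - 1]. The real get_ic_context_frames may differ."""
    if movie_len <= 0:
        raise ValueError("movie_len must be positive")
    if not 0 <= target_frame < movie_len:
        raise ValueError("target_frame out of range")
    start = max(0, target_frame - window)
    stop = min(movie_len - 1, target_frame + window)
    return list(range(start, stop + 1))

# get_context_frames(0, 93)  -> [0, 1, 2]
# get_context_frames(45, 93) -> [43, 44, 45, 46, 47]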

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A hop, skip, and arrow we go,
Frames and contexts in graceful flow,
Lance holds the images, bright and clean,
The finest data pipeline we've seen! ✨📦

🚥 Pre-merge checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Extract all images and add to lancedb" directly reflects the main changes, which involve adding new image extraction logic and LanceDB integration throughout the codebase.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 88.89%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (bot) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
5.data_packaging/gather_images.py (2)

49-116: Avoid leaving partial downloads behind on failure.
If all retries fail, a partially downloaded file (>0 bytes) can remain and later be treated as valid due to the early size check. Clean up any file on failure before raising.

🛠️ Suggested fix
-    # clean up a zero-byte stub if created
-    try:
-        if download_filepath.exists() and download_filepath.stat().st_size == 0:
-            download_filepath.unlink()
-    except Exception:
-        pass
+    # clean up any partial download if all attempts failed
+    try:
+        if download_filepath.exists():
+            download_filepath.unlink()
+    except Exception:
+        pass
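As a standalone sketch of the retry-then-clean-up pattern this fix targets (the host, login, backoff, and signature here are assumptions; the real retrieve_ftp_file in gather_images.py differs):

import ftplib
import pathlib
import time

def retrieve_ftp_file(host: str, remote_path: str, download_filepath: pathlib.Path,
                      retries: int = 3, timeout: float = 60.0) -> pathlib.Path:
    # idempotency: reuse a previously completed download
    if download_filepath.exists() and download_filepath.stat().st_size > 0:
        return download_filepath

    last_error: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            with ftplib.FTP(host, timeout=timeout) as ftp:
                ftp.login()  # anonymous login; real code may pass credentials
                with download_filepath.open("wb") as handle:
                    ftp.retrbinary(f"RETR {remote_path}", handle.write)
            return download_filepath
        except Exception as error:  # broad catch kept only for the retry illustration
            last_error = error
            time.sleep(2 * attempt)

    # remove any partial download so a later size check cannot mistake it for valid
    if download_filepath.exists():
        download_filepath.unlink()
    raise RuntimeError(f"FTP retrieval failed after {retries} attempts") from last_error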

538-557: Handle missing time_lapse datasets before frame normalization.
find_frame_len can return None, which will cause downstream errors in normalization/context-frame selection.

🛠️ Suggested fix
     movie_length = find_frame_len(ch5_file=local_ch5_file)
+    if movie_length is None or movie_length <= 0:
+        print(f"[WARN] No time_lapse dataset found in {local_ch5_file}; skipping.")
+        continue
🤖 Fix all issues with AI agents
In @.gitignore:
- Around line 26-28: Remove the broad global ignore entries (*.txt, *.csv,
*.parquet) from .gitignore and replace them with directory-scoped ignore rules
so only generated/artifact files are excluded (for example restrict to build/,
tmp/, artifacts/, or the specific data output directories used by your tooling).
Locate the three entries in .gitignore and change them to scoped patterns that
reference the exact folders where those generated files appear, ensuring
important repo files like README.txt or requirements.txt remain tracked.

In `@5.data_packaging/gather_images.py`:
- Around line 472-483: The normalization block can divide by zero when max_val
== 0; in the code handling brightfield_images_corrected (the variables
brightfield_images_corrected, max_val and corrected_movie), detect if max_val is
0 (or very close to 0) before dividing and handle it: if zero, log a warning and
produce a zeroed uint8 frame (or set max_val = 1 to skip scaling), otherwise
perform the existing divide-and-scale; update the print message to reflect the
zero-case to avoid propagating NaNs/infs into corrected_movie.
🧹 Nitpick comments (3)
pyproject.toml (1)

11-24: Please confirm version bounds and pin git deps.
python = ">3.11,<3.13" excludes 3.11.0; confirm that’s intentional. Also, serpula_rasa from a moving branch can break reproducibility—consider pinning to a tag or commit.

5.data_packaging/gather_images.py (2)

387-405: Prefer a specific exception and align docs.
The function raises a generic Exception while the docstring implies None may be returned. Consider a ValueError (or update the docstring to match behavior).

♻️ Suggested tweak
-        # something is seriously wrong; skip
-        raise Exception("Detected a frame less than 0")
+        # something is seriously wrong; skip
+        raise ValueError("frame_label must be >= 0")

628-645: Remove duplicate branches when building new_row.
Both branches are identical; a single assignment will reduce noise.

♻️ Suggested simplification
-            if str(frame_number) == str(base_row["Frames"]):
-                new_row = {
-                    **base_row,
-                    "Frames": str(frame_number),
-                    "DNA_dotted_notation": str(frame_tiff),
-                    "Frame_type": "IC_FRAME" if "_IC" not in frame_number else "IC_TARGET_FRAME",
-                    "ome-arrow_original": ome_struct["ome-arrow_original"],
-                }
-            else:
-                new_row = {
-                    **base_row,
-                    "Frames": str(frame_number),
-                    "DNA_dotted_notation": str(frame_tiff),
-                    "Frame_type": "IC_FRAME" if "_IC" not in frame_number else "IC_TARGET_FRAME",
-                    "ome-arrow_original": ome_struct["ome-arrow_original"],
-                }
+            new_row = {
+                **base_row,
+                "Frames": str(frame_number),
+                "DNA_dotted_notation": str(frame_tiff),
+                "Frame_type": "IC_FRAME" if "_IC" not in frame_number else "IC_TARGET_FRAME",
+                "ome-arrow_original": ome_struct["ome-arrow_original"],
+            }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b793d38 and 823766b.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (19)
  • .gitignore
  • 5.data_packaging/example.ipynb
  • 5.data_packaging/example.py
  • 5.data_packaging/gather_images.py
  • 5.data_packaging/schema/0.locate_data.locations.negative_control_locations.tsv.arrow.schema.txt
  • 5.data_packaging/schema/0.locate_data.locations.positive_control_locations.tsv.arrow.schema.txt
  • 5.data_packaging/schema/0.locate_data.locations.training_locations.tsv.arrow.schema.txt
  • 5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates-w-colnames.tsv.arrow.schema.txt
  • 5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates.tsv.arrow.schema.txt
  • 5.data_packaging/schema/2.format_training_data.results.training_data__ic.csv.gz.arrow.schema.txt
  • 5.data_packaging/schema/2.format_training_data.results.training_data__no_ic.csv.gz.arrow.schema.txt
  • 5.data_packaging/schema/3.normalize_data.normalized_data.training_data__ic.csv.gz.arrow.schema.txt
  • 5.data_packaging/schema/3.normalize_data.normalized_data.training_data__no_ic.csv.gz.arrow.schema.txt
  • 5.data_packaging/schema/4.analyze_data.results.compiled_2D_umap_embeddings.csv.arrow.schema.txt
  • 5.data_packaging/schema/4.analyze_data.results.single_cell_class_counts.csv.arrow.schema.txt
  • 5.data_packaging/schema/mitocheck_metadata.features.samples-w-colnames.txt.arrow.schema.txt
  • 5.data_packaging/schema/mitocheck_metadata.features.samples.txt.arrow.schema.txt
  • 5.data_packaging/schema/mitocheck_metadata.idr0013-screenA-annotation.csv.gz.arrow.schema.txt
  • pyproject.toml
🧰 Additional context used
🪛 Ruff (0.14.11)
5.data_packaging/gather_images.py

80-80: Loop control variable attempt not used within loop body

Rename unused attempt to _attempt

(B007)


82-82: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.

(S321)


89-89: Do not catch blind exception: Exception

(BLE001)


103-103: Do not catch blind exception: Exception

(BLE001)


110-111: try-except-pass detected, consider logging the exception

(S110)


110-110: Do not catch blind exception: Exception

(BLE001)


113-116: Avoid specifying long messages outside the exception class

(TRY003)


203-203: subprocess call: check for execution of untrusted input

(S603)


204-204: Starting a process with a partial executable path

(S607)


213-213: subprocess call: check for execution of untrusted input

(S603)


214-223: Starting a process with a partial executable path

(S607)


325-325: Avoid specifying long messages outside the exception class

(TRY003)


363-363: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)


366-366: Avoid specifying long messages outside the exception class

(TRY003)


371-373: Avoid specifying long messages outside the exception class

(TRY003)


401-401: Create your own exception

(TRY002)


401-401: Avoid specifying long messages outside the exception class

(TRY003)


436-436: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (27)
5.data_packaging/schema/4.analyze_data.results.single_cell_class_counts.csv.arrow.schema.txt (1)

4-4: No functional change detected.

This line was removed and re-added with identical content, resulting in no schema modification. The AI summary indicates similar no-op edits appear across multiple schema files—likely a cosmetic adjustment or formatting artifact.

5.data_packaging/schema/mitocheck_metadata.features.samples-w-colnames.txt.arrow.schema.txt (1)

2-2: LGTM: Field name correction.

Removing the leading dash from the field name is the correct fix. Field names should not start with special characters like dashes in schema definitions.

5.data_packaging/schema/mitocheck_metadata.features.samples.txt.arrow.schema.txt (1)

2-2: Verification of the sample key change is inconclusive. The sample identifier PLLT0010_27--ex2005_05_13--sp2005_03_23--tt17--c5___P00173_01___T00082___X0397___Y0618 is consistently documented in the utility functions (utils/training_data_utils.py and utils/locate_utils.py), but the previous value of the P00173_01 segment is not visible, so it is not possible to confirm that this change is correct or to assess its impact on data linkage without additional context about what changed.

5.data_packaging/schema/mitocheck_metadata.idr0013-screenA-annotation.csv.gz.arrow.schema.txt (2)

56-56: Schema text file truncations are display artifacts and do not affect data access.

The ellipsis notation in the schema file (e.g., "... 21 chars omitted") is a display artifact from PyArrow's .schema.to_string() method used during schema generation. The actual CSV file contains complete field names. Since these schema text files are generated for visibility purposes and are not used for programmatic data access (which relies directly on PyArrow/Pandas/lancedb), the truncations have no practical impact. Code accessing the data through standard libraries will work with full field names as expected.


145-145: This metadata is auto-generated—no intentional change was made.

The schema file is automatically generated by infer_schema.py when pandas/pyarrow converts the CSV to Arrow format. The number 23714 is not a row count (the file contains 196,608 data rows). Pandas automatically includes index metadata during this conversion, so changes to the schema reflect regeneration of the source data rather than manual modifications. No verification of intentionality is needed.

Likely an incorrect or invalid review comment.

5.data_packaging/schema/0.locate_data.locations.training_locations.tsv.arrow.schema.txt (1)

10-10: No functional change detected.

This line change appears to be formatting-only with no impact on the schema structure or field types.

5.data_packaging/schema/0.locate_data.locations.negative_control_locations.tsv.arrow.schema.txt (1)

10-10: No functional change detected.

This line change appears to be formatting-only with no impact on the schema structure or field types.

5.data_packaging/schema/0.locate_data.locations.positive_control_locations.tsv.arrow.schema.txt (1)

10-10: No functional change detected.

This line change appears to be formatting-only with no impact on the schema structure or field types.

5.data_packaging/schema/4.analyze_data.results.compiled_2D_umap_embeddings.csv.arrow.schema.txt (1)

7-7: No functional change detected.

This line change appears to be formatting-only with no impact on the schema structure or field types.

5.data_packaging/schema/2.format_training_data.results.training_data__ic.csv.gz.arrow.schema.txt (1)

1451-1451: No functional change detected.

This line was removed and re-added with identical content, resulting in no schema modification. The change appears to be a formatting artifact, likely from schema regeneration.

5.data_packaging/schema/2.format_training_data.results.training_data__no_ic.csv.gz.arrow.schema.txt (1)

1451-1451: No functional change detected.

This line was removed and re-added with identical content, resulting in no schema modification. Consistent with similar no-op changes observed across other schema files in this PR.

5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates-w-colnames.tsv.arrow.schema.txt (1)

2-2: Formatting adjustment only.

The leading whitespace was removed from this line. No semantic change to the schema definition.

5.data_packaging/schema/3.normalize_data.normalized_data.training_data__ic.csv.gz.arrow.schema.txt (1)

1451-1451: No functional change detected.

This line was removed and re-added with identical content, resulting in no schema modification. Part of a consistent pattern of formatting changes across multiple schema files.

5.data_packaging/schema/3.normalize_data.normalized_data.training_data__no_ic.csv.gz.arrow.schema.txt (1)

1451-1451: No functional change detected.

This line was removed and re-added with identical content, resulting in no schema modification. Consistent with the pattern observed across all schema files in this PR.

5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates.tsv.arrow.schema.txt (1)

1-2: No functional change detected.

5.data_packaging/example.py (1)

19-42: Clear LanceDB usage example.

pyproject.toml (2)

28-33: LGTM on the Docker platform pin for packaging.


44-45: Jupytext format config looks good.

5.data_packaging/gather_images.py (9)

10-15: Imports align with the new image/OME workflow.


32-41: Helper additions look good.


133-163: Union-by-name change is a solid schema-safety improvement.


202-226: Image cache check before build looks good.
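For readers skimming without the diff, the "image cache check" pattern amounts to asking Docker whether the image already exists before building it. A minimal sketch of that idea (the image tag and build context are assumptions, not the PR's values):

import subprocess

def ensure_docker_image(tag: str, context_dir: str = ".") -> None:
    """Build a Docker image only if `tag` is not already present locally (illustrative)."""
    inspect = subprocess.run(
        ["docker", "image", "inspect", tag],
        capture_output=True,
        check=False,
    )
    if inspect.returncode == 0:
        print(f"image {tag} already present; skipping build")
        return
    subprocess.run(["docker", "build", "-t", tag, context_dir], check=True)

# ensure_docker_image("mitocheck-cellh5:latest")  # hypothetical tag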


323-338: Negative-frame guard and z=0 selection are good additions.


347-385: Context-frame selection logic looks sound.


531-537: Reusing existing downloads is fine.


579-596: IC target output wiring looks good.


647-666: OME struct casting and per-frame parquet write look good.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +26 to +28
*.txt
*.csv
*.parquet

⚠️ Potential issue | 🟠 Major

Global ignore patterns are too broad and will exclude important files.

Adding *.txt, *.csv, and *.parquet as global patterns will ignore these file types across the entire repository, including:

  • Documentation files (README.txt, CHANGELOG.txt, LICENSE.txt)
  • Requirements or configuration files (requirements.txt)
  • Schema files (ironically, this PR includes .arrow.schema.txt files)
  • Important data or metadata files

This could prevent critical files from being tracked and cause confusion for contributors.

✏️ Suggested fix: Scope patterns to specific directories

Replace the global patterns with directory-scoped patterns:

-
-*.txt
-*.csv
-*.parquet
+
+# data packaging artifacts
+5.data_packaging/**/*.txt
+5.data_packaging/**/*.csv
+5.data_packaging/**/*.parquet
+# exclude schema files from ignore
+!5.data_packaging/schema/**/*.txt

Or scope them to the specific subdirectories where these artifacts are generated:

-
-*.txt
-*.csv
-*.parquet
+
+# data packaging artifacts (generated data only)
+5.data_packaging/packaged/**/*.txt
+5.data_packaging/packaged/**/*.csv
+5.data_packaging/packaged/**/*.parquet
+5.data_packaging/images/**/*.txt
+5.data_packaging/images/**/*.csv
+5.data_packaging/images/**/*.parquet
🤖 Prompt for AI Agents
In @.gitignore around lines 26 - 28, Remove the broad global ignore entries
(*.txt, *.csv, *.parquet) from .gitignore and replace them with directory-scoped
ignore rules so only generated/artifact files are excluded (for example restrict
to build/, tmp/, artifacts/, or the specific data output directories used by
your tooling). Locate the three entries in .gitignore and change them to scoped
patterns that reference the exact folders where those generated files appear,
ensuring important repo files like README.txt or requirements.txt remain
tracked.

Comment on lines +472 to +483

        # Ensure no negative values
        negatives = np.sum(brightfield_images_corrected < 0)
        if negatives > 0:
            print(f"       Found {negatives} negative pixels — setting them to zero.")
            brightfield_images_corrected[brightfield_images_corrected < 0] = 0
        # Normalize and scale to uint8
        max_val = np.max(brightfield_images_corrected)
        print(f"       Maximum pixel value before normalization: {max_val}")
        brightfield_images_corrected = brightfield_images_corrected / max_val
        brightfield_images_corrected *= 255
        corrected_movie = brightfield_images_corrected.astype(np.uint8)

⚠️ Potential issue | 🟠 Major

Guard against all-zero frames before normalization.
max_val can be 0, which would cause a divide-by-zero and propagate NaNs/inf.

🛠️ Suggested fix
         max_val = np.max(brightfield_images_corrected)
         print(f"       Maximum pixel value before normalization: {max_val}")
-        brightfield_images_corrected = brightfield_images_corrected / max_val
+        if max_val == 0:
+            brightfield_images_corrected = np.zeros_like(
+                brightfield_images_corrected, dtype=np.float32
+            )
+        else:
+            brightfield_images_corrected = brightfield_images_corrected / max_val
         brightfield_images_corrected *= 255
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change (resulting code):

        # Ensure no negative values
        negatives = np.sum(brightfield_images_corrected < 0)
        if negatives > 0:
            print(f"       Found {negatives} negative pixels — setting them to zero.")
            brightfield_images_corrected[brightfield_images_corrected < 0] = 0
        # Normalize and scale to uint8
        max_val = np.max(brightfield_images_corrected)
        print(f"       Maximum pixel value before normalization: {max_val}")
        if max_val == 0:
            brightfield_images_corrected = np.zeros_like(
                brightfield_images_corrected, dtype=np.float32
            )
        else:
            brightfield_images_corrected = brightfield_images_corrected / max_val
        brightfield_images_corrected *= 255
        corrected_movie = brightfield_images_corrected.astype(np.uint8)
🤖 Prompt for AI Agents
In `@5.data_packaging/gather_images.py` around lines 472 - 483, The normalization
block can divide by zero when max_val == 0; in the code handling
brightfield_images_corrected (the variables brightfield_images_corrected,
max_val and corrected_movie), detect if max_val is 0 (or very close to 0) before
dividing and handle it: if zero, log a warning and produce a zeroed uint8 frame
(or set max_val = 1 to skip scaling), otherwise perform the existing
divide-and-scale; update the print message to reflect the zero-case to avoid
propagating NaNs/infs into corrected_movie.
