Extract all images and add to lancedb #51
base: main
Conversation
@coderabbitai full review
✅ Actions performed: Full review triggered.
📝 Walkthrough

The changes introduce LanceDB-based data packaging with image processing utilities, including FTP retrieval enhancements, Docker optimization, context frame computation, and OME-Arrow schema integration. Project dependencies are updated, and schema files undergo minor formatting adjustments.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant ScriptRunner as Script Runner
    participant FTP as FTP Server
    participant ImageProc as Image Processor
    participant OMEArrow as OME-Arrow Assembler
    participant Parquet as Parquet Exporter
    User->>ScriptRunner: Trigger read_image_as_binary(image_path)
    ScriptRunner->>FTP: retrieve_ftp_file with retries & timeout
    FTP-->>ScriptRunner: Local file or retry/fallback
    ScriptRunner->>ImageProc: get_ic_context_frames(target_frame, movie_len)
    ImageProc-->>ScriptRunner: List of context frame indices
    ScriptRunner->>ImageProc: get_frame_tiff_from_idr_ch5 for each frame
    ImageProc-->>ScriptRunner: TIFF file path
    ScriptRunner->>ImageProc: pybasic_IC_target_frame_to_tiff(frames_as_arrays, target_frame)
    ImageProc->>ImageProc: normalize_frame_label, scale to uint8, save
    ImageProc-->>ScriptRunner: Processed target frame path
    ScriptRunner->>OMEArrow: make_ome_arrow_row for frame metadata
    OMEArrow-->>ScriptRunner: OME-Arrow structure
    ScriptRunner->>Parquet: Coerce to OME_STRUCT_TYPE & parquet.write_table
    Parquet-->>ScriptRunner: Exported Parquet file with IC frame data
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
5.data_packaging/gather_images.py (2)
49-116: Avoid leaving partial downloads behind on failure.
If all retries fail, a partially downloaded file (>0 bytes) can remain and later be treated as valid due to the early size check. Clean up any file on failure before raising.

🛠️ Suggested fix
```diff
-    # clean up a zero-byte stub if created
-    try:
-        if download_filepath.exists() and download_filepath.stat().st_size == 0:
-            download_filepath.unlink()
-    except Exception:
-        pass
+    # clean up any partial download if all attempts failed
+    try:
+        if download_filepath.exists():
+            download_filepath.unlink()
+    except Exception:
+        pass
```
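For context, the whole retry-and-cleanup pattern can be seen in one place. Below is a minimal self-contained sketch using `ftplib`; the host, anonymous login, and backoff policy are assumptions for illustration, not the actual signature of the helper in `gather_images.py`:

```python
# Illustrative retry-with-cleanup download; the real helper differs.
import time
from ftplib import FTP, all_errors
from pathlib import Path

def retrieve_ftp_file(host: str, remote_path: str, download_filepath: Path,
                      retries: int = 3, timeout: float = 60.0) -> Path:
    for attempt in range(1, retries + 1):
        try:
            with FTP(host, timeout=timeout) as ftp:
                ftp.login()  # anonymous login, assumed
                with download_filepath.open("wb") as handle:
                    ftp.retrbinary(f"RETR {remote_path}", handle.write)
            return download_filepath
        except all_errors as error:
            print(f"[WARN] attempt {attempt}/{retries} failed: {error}")
            time.sleep(2 ** attempt)  # simple exponential backoff
    # remove any partial download so the early size check never trusts it
    if download_filepath.exists():
        download_filepath.unlink()
    raise RuntimeError(f"FTP download failed after {retries} attempts: {remote_path}")
```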
538-557: Handle missing `time_lapse` datasets before frame normalization.
`find_frame_len` can return `None`, which will cause downstream errors in normalization/context-frame selection.

🛠️ Suggested fix
```diff
     movie_length = find_frame_len(ch5_file=local_ch5_file)
+    if movie_length is None or movie_length <= 0:
+        print(f"[WARN] No time_lapse dataset found in {local_ch5_file}; skipping.")
+        continue
```
🤖 Fix all issues with AI agents
In @.gitignore:
- Around line 26-28: Remove the broad global ignore entries (*.txt, *.csv,
*.parquet) from .gitignore and replace them with directory-scoped ignore rules
so only generated/artifact files are excluded (for example restrict to build/,
tmp/, artifacts/, or the specific data output directories used by your tooling).
Locate the three entries in .gitignore and change them to scoped patterns that
reference the exact folders where those generated files appear, ensuring
important repo files like README.txt or requirements.txt remain tracked.
In `@5.data_packaging/gather_images.py`:
- Around line 472-483: The normalization block can divide by zero when max_val
== 0; in the code handling brightfield_images_corrected (the variables
brightfield_images_corrected, max_val and corrected_movie), detect if max_val is
0 (or very close to 0) before dividing and handle it: if zero, log a warning and
produce a zeroed uint8 frame (or set max_val = 1 to skip scaling), otherwise
perform the existing divide-and-scale; update the print message to reflect the
zero-case to avoid propagating NaNs/infs into corrected_movie.
🧹 Nitpick comments (3)
pyproject.toml (1)
11-24: Please confirm version bounds and pin git deps.
`python = ">3.11,<3.13"` excludes 3.11.0; confirm that's intentional. Also, `serpula_rasa` from a moving branch can break reproducibility; consider pinning to a tag or commit.

5.data_packaging/gather_images.py (2)
387-405: Prefer a specific exception and align docs.
The function raises a generic `Exception` while the docstring implies `None` may be returned. Consider a `ValueError` (or update the docstring to match behavior).

♻️ Suggested tweak
```diff
-        # something is seriously wrong; skip
-        raise Exception("Detected a frame less than 0")
+        # something is seriously wrong; skip
+        raise ValueError("frame_label must be >= 0")
```
628-645: Remove duplicate branches when building `new_row`.
Both branches are identical; a single assignment will reduce noise.

♻️ Suggested simplification
```diff
-    if str(frame_number) == str(base_row["Frames"]):
-        new_row = {
-            **base_row,
-            "Frames": str(frame_number),
-            "DNA_dotted_notation": str(frame_tiff),
-            "Frame_type": "IC_FRAME" if "_IC" not in frame_number else "IC_TARGET_FRAME",
-            "ome-arrow_original": ome_struct["ome-arrow_original"],
-        }
-    else:
-        new_row = {
-            **base_row,
-            "Frames": str(frame_number),
-            "DNA_dotted_notation": str(frame_tiff),
-            "Frame_type": "IC_FRAME" if "_IC" not in frame_number else "IC_TARGET_FRAME",
-            "ome-arrow_original": ome_struct["ome-arrow_original"],
-        }
+    new_row = {
+        **base_row,
+        "Frames": str(frame_number),
+        "DNA_dotted_notation": str(frame_tiff),
+        "Frame_type": "IC_FRAME" if "_IC" not in frame_number else "IC_TARGET_FRAME",
+        "ome-arrow_original": ome_struct["ome-arrow_original"],
+    }
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`poetry.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (19)
- .gitignore
- 5.data_packaging/example.ipynb
- 5.data_packaging/example.py
- 5.data_packaging/gather_images.py
- 5.data_packaging/schema/0.locate_data.locations.negative_control_locations.tsv.arrow.schema.txt
- 5.data_packaging/schema/0.locate_data.locations.positive_control_locations.tsv.arrow.schema.txt
- 5.data_packaging/schema/0.locate_data.locations.training_locations.tsv.arrow.schema.txt
- 5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates-w-colnames.tsv.arrow.schema.txt
- 5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates.tsv.arrow.schema.txt
- 5.data_packaging/schema/2.format_training_data.results.training_data__ic.csv.gz.arrow.schema.txt
- 5.data_packaging/schema/2.format_training_data.results.training_data__no_ic.csv.gz.arrow.schema.txt
- 5.data_packaging/schema/3.normalize_data.normalized_data.training_data__ic.csv.gz.arrow.schema.txt
- 5.data_packaging/schema/3.normalize_data.normalized_data.training_data__no_ic.csv.gz.arrow.schema.txt
- 5.data_packaging/schema/4.analyze_data.results.compiled_2D_umap_embeddings.csv.arrow.schema.txt
- 5.data_packaging/schema/4.analyze_data.results.single_cell_class_counts.csv.arrow.schema.txt
- 5.data_packaging/schema/mitocheck_metadata.features.samples-w-colnames.txt.arrow.schema.txt
- 5.data_packaging/schema/mitocheck_metadata.features.samples.txt.arrow.schema.txt
- 5.data_packaging/schema/mitocheck_metadata.idr0013-screenA-annotation.csv.gz.arrow.schema.txt
- pyproject.toml
🧰 Additional context used
🪛 Ruff (0.14.11)
5.data_packaging/gather_images.py
80-80: Loop control variable attempt not used within loop body
Rename unused attempt to _attempt
(B007)
82-82: FTP-related functions are being called. FTP is considered insecure. Use SSH/SFTP/SCP or some other encrypted protocol.
(S321)
89-89: Do not catch blind exception: Exception
(BLE001)
103-103: Do not catch blind exception: Exception
(BLE001)
110-111: try-except-pass detected, consider logging the exception
(S110)
110-110: Do not catch blind exception: Exception
(BLE001)
113-116: Avoid specifying long messages outside the exception class
(TRY003)
203-203: subprocess call: check for execution of untrusted input
(S603)
204-204: Starting a process with a partial executable path
(S607)
213-213: subprocess call: check for execution of untrusted input
(S603)
214-223: Starting a process with a partial executable path
(S607)
325-325: Avoid specifying long messages outside the exception class
(TRY003)
363-363: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
366-366: Avoid specifying long messages outside the exception class
(TRY003)
371-373: Avoid specifying long messages outside the exception class
(TRY003)
401-401: Create your own exception
(TRY002)
401-401: Avoid specifying long messages outside the exception class
(TRY003)
436-436: f-string without any placeholders
Remove extraneous f prefix
(F541)
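Several of these findings (BLE001, TRY002, TRY003) point at the same raise-site pattern. One way to satisfy all three is a small dedicated exception class; a sketch with a hypothetical name, not code from the PR:

```python
# A dedicated exception keeps the long message inside the class (TRY003)
# and avoids raising/catching bare Exception (TRY002/BLE001).
class NegativeFrameError(ValueError):
    """Raised when a frame label is below zero."""

    def __init__(self, frame_label: int) -> None:
        super().__init__(f"frame label must be >= 0, got {frame_label}")

def check_frame(frame_label: int) -> None:
    if frame_label < 0:
        raise NegativeFrameError(frame_label)
```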
🔇 Additional comments (27)
5.data_packaging/schema/4.analyze_data.results.single_cell_class_counts.csv.arrow.schema.txt (1)
4-4: No functional change detected. This line was removed and re-added with identical content, resulting in no schema modification. The AI summary indicates similar no-op edits appear across multiple schema files, likely a cosmetic adjustment or formatting artifact.
5.data_packaging/schema/mitocheck_metadata.features.samples-w-colnames.txt.arrow.schema.txt (1)
2-2: LGTM: Field name correction. Removing the leading dash from the field name is the correct fix. Field names should not start with special characters like dashes in schema definitions.
5.data_packaging/schema/mitocheck_metadata.features.samples.txt.arrow.schema.txt (1)
2-2: Verification of the sample key change is inconclusive. While the sample identifier `PLLT0010_27--ex2005_05_13--sp2005_03_23--tt17--c5___P00173_01___T00082___X0397___Y0618` appears consistently documented in utility functions (`utils/training_data_utils.py` and `utils/locate_utils.py`), the previous version of the `P00173_01` segment is not visible, making it impossible to confirm whether this change is correct or to assess its impact on data linkage without additional context about what changed.

5.data_packaging/schema/mitocheck_metadata.idr0013-screenA-annotation.csv.gz.arrow.schema.txt (2)
56-56: Schema text file truncations are display artifacts and do not affect data access. The ellipsis notation in the schema file (e.g., "... 21 chars omitted") is a display artifact from PyArrow's `.schema.to_string()` method used during schema generation. The actual CSV file contains complete field names. Since these schema text files are generated for visibility purposes and are not used for programmatic data access (which relies directly on PyArrow/Pandas/lancedb), the truncations have no practical impact. Code accessing the data through standard libraries will work with full field names as expected.
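If fully expanded schema text is ever preferred, the truncation can likely be disabled at generation time. A small sketch, assuming the files really do come from `Schema.to_string()` and using made-up metadata:

```python
# pyarrow truncates long metadata values in the default string rendering;
# truncate_metadata=False writes them out in full.
import pyarrow as pa

schema = pa.schema(
    [pa.field("x", pa.int64())],
    metadata={"pandas": '{"index_columns": ["__index_level_0__"], "...": "..."}'},
)
print(schema.to_string())                         # may show "... N chars omitted"
print(schema.to_string(truncate_metadata=False))  # full metadata values
```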
145-145: This metadata is auto-generated; no intentional change was made. The schema file is automatically generated by `infer_schema.py` when pandas/pyarrow converts the CSV to Arrow format. The number 23714 is not a row count (the file contains 196,608 data rows). Pandas automatically includes index metadata during this conversion, so changes to the schema reflect regeneration of the source data rather than manual modifications. No verification of intentionality is needed.

Likely an incorrect or invalid review comment.
5.data_packaging/schema/0.locate_data.locations.training_locations.tsv.arrow.schema.txt (1)
10-10: No functional change detected. This line change appears to be formatting-only with no impact on the schema structure or field types.
5.data_packaging/schema/0.locate_data.locations.negative_control_locations.tsv.arrow.schema.txt (1)
10-10: No functional change detected. This line change appears to be formatting-only with no impact on the schema structure or field types.
5.data_packaging/schema/0.locate_data.locations.positive_control_locations.tsv.arrow.schema.txt (1)
10-10: No functional change detected. This line change appears to be formatting-only with no impact on the schema structure or field types.
5.data_packaging/schema/4.analyze_data.results.compiled_2D_umap_embeddings.csv.arrow.schema.txt (1)
7-7: No functional change detected. This line change appears to be formatting-only with no impact on the schema structure or field types.
5.data_packaging/schema/2.format_training_data.results.training_data__ic.csv.gz.arrow.schema.txt (1)
1451-1451: No functional change detected. This line was removed and re-added with identical content, resulting in no schema modification. The change appears to be a formatting artifact, likely from schema regeneration.
5.data_packaging/schema/2.format_training_data.results.training_data__no_ic.csv.gz.arrow.schema.txt (1)
1451-1451: No functional change detected. This line was removed and re-added with identical content, resulting in no schema modification. Consistent with similar no-op changes observed across other schema files in this PR.
5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates-w-colnames.tsv.arrow.schema.txt (1)
2-2: Formatting adjustment only. The leading whitespace was removed from this line. No semantic change to the schema definition.
5.data_packaging/schema/3.normalize_data.normalized_data.training_data__ic.csv.gz.arrow.schema.txt (1)
1451-1451: No functional change detected. This line was removed and re-added with identical content, resulting in no schema modification. Part of a consistent pattern of formatting changes across multiple schema files.
5.data_packaging/schema/3.normalize_data.normalized_data.training_data__no_ic.csv.gz.arrow.schema.txt (1)
1451-1451: No functional change detected. This line was removed and re-added with identical content, resulting in no schema modification. Consistent with the pattern observed across all schema files in this PR.
5.data_packaging/schema/1.idr_streams.stream_files.idr0013-screenA-plates.tsv.arrow.schema.txt (1)
1-2: No functional change detected.

5.data_packaging/example.py (1)
19-42: Clear LanceDB usage example.
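For readers without the notebook open, the shape of such an example is roughly this. A minimal sketch against the lancedb Python client, with the table and column names invented for illustration:

```python
# Minimal LanceDB round-trip; "frames" and its columns are illustrative only.
import lancedb

db = lancedb.connect("./lancedb")  # directory-backed database
table = db.create_table(
    "frames",
    data=[
        {"id": 1, "vector": [0.1, 0.2], "Frame_type": "IC_FRAME"},
        {"id": 2, "vector": [0.3, 0.4], "Frame_type": "IC_TARGET_FRAME"},
    ],
)
# nearest-neighbour search on the vector column
print(table.search([0.1, 0.2]).limit(1).to_pandas())
```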
pyproject.toml (2)

28-33: LGTM on the Docker platform pin for packaging.
44-45: Jupytext format config looks good.5.data_packaging/gather_images.py (9)
5.data_packaging/gather_images.py (9)

10-15: Imports align with the new image/OME workflow.
32-41: Helper additions look good.
133-163: Union-by-name change is a solid schema-safety improvement.
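As a reminder of what union-by-name buys here, a small sketch; it assumes pyarrow 14 or newer, where `promote_options` replaced the older `promote=True` keyword:

```python
# Columns are matched by name across tables; fields missing from one table
# are null-filled instead of raising a schema mismatch.
import pyarrow as pa

t1 = pa.table({"a": [1, 2]})
t2 = pa.table({"a": [3], "b": ["x"]})
combined = pa.concat_tables([t1, t2], promote_options="default")
print(combined.column("b"))  # [null, null, "x"]
```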
202-226: Image cache check before build looks good.
323-338: Negative-frame guard and z=0 selection are good additions.
347-385: Context-frame selection logic looks sound.
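A guess at the shape of that selector, for readers without the diff open: a symmetric window clamped to the movie bounds. The real `get_ic_context_frames` may choose differently.

```python
def get_ic_context_frames(target_frame: int, movie_len: int, window: int = 2) -> list[int]:
    """Pick frames around target_frame, clamped to [0, movie_len)."""
    if not 0 <= target_frame < movie_len:
        raise ValueError("target_frame must lie within the movie")
    start = max(0, target_frame - window)
    stop = min(movie_len, target_frame + window + 1)
    return list(range(start, stop))

print(get_ic_context_frames(0, 10))  # [0, 1, 2]
print(get_ic_context_frames(5, 10))  # [3, 4, 5, 6, 7]
```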
531-537: Reusing existing downloads is fine.
579-596: IC target output wiring looks good.
647-666: OME struct casting and per-frame parquet write look good.
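The pattern under review, reduced to its essentials: cast a struct column to a declared type, then write a per-frame parquet file. In this sketch the fields of `OME_STRUCT_TYPE` are invented stand-ins for the real definition in the PR:

```python
# Cast a struct column to a declared type, then write one parquet file
# per frame; field names inside the struct are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

OME_STRUCT_TYPE = pa.struct([("width", pa.int32()), ("height", pa.int32())])

table = pa.table(
    {"Frames": ["42"], "ome-arrow_original": [{"width": 1344, "height": 1024}]}
).cast(pa.schema([("Frames", pa.string()), ("ome-arrow_original", OME_STRUCT_TYPE)]))
pq.write_table(table, "frame_42.parquet")
```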
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
```
*.txt
*.csv
*.parquet
```
Global ignore patterns are too broad and will exclude important files.
Adding *.txt, *.csv, and *.parquet as global patterns will ignore these file types across the entire repository, including:
- Documentation files (README.txt, CHANGELOG.txt, LICENSE.txt)
- Requirements or configuration files (requirements.txt)
- Schema files (ironically, this PR includes .arrow.schema.txt files)
- Important data or metadata files
This could prevent critical files from being tracked and cause confusion for contributors.
✏️ Suggested fix: Scope patterns to specific directories
Replace the global patterns with directory-scoped patterns:
```diff
-
-*.txt
-*.csv
-*.parquet
+
+# data packaging artifacts
+5.data_packaging/**/*.txt
+5.data_packaging/**/*.csv
+5.data_packaging/**/*.parquet
+# exclude schema files from ignore
+!5.data_packaging/schema/**/*.txt
```

Or scope them to the specific subdirectories where these artifacts are generated:

```diff
-
-*.txt
-*.csv
-*.parquet
+
+# data packaging artifacts (generated data only)
+5.data_packaging/packaged/**/*.txt
+5.data_packaging/packaged/**/*.csv
+5.data_packaging/packaged/**/*.parquet
+5.data_packaging/images/**/*.txt
+5.data_packaging/images/**/*.csv
+5.data_packaging/images/**/*.parquet
```
```python
# Ensure no negative values
negatives = np.sum(brightfield_images_corrected < 0)
if negatives > 0:
    print(f" Found {negatives} negative pixels — setting them to zero.")
    brightfield_images_corrected[brightfield_images_corrected < 0] = 0

# Normalize and scale to uint8
max_val = np.max(brightfield_images_corrected)
print(f" Maximum pixel value before normalization: {max_val}")
brightfield_images_corrected = brightfield_images_corrected / max_val
brightfield_images_corrected *= 255
corrected_movie = brightfield_images_corrected.astype(np.uint8)
```
Guard against all-zero frames before normalization.
max_val can be 0, which would cause a divide-by-zero and propagate NaNs/inf.
🛠️ Suggested fix

```diff
     max_val = np.max(brightfield_images_corrected)
     print(f" Maximum pixel value before normalization: {max_val}")
-    brightfield_images_corrected = brightfield_images_corrected / max_val
+    if max_val == 0:
+        brightfield_images_corrected = np.zeros_like(
+            brightfield_images_corrected, dtype=np.float32
+        )
+    else:
+        brightfield_images_corrected = brightfield_images_corrected / max_val
     brightfield_images_corrected *= 255
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Ensure no negative values
negatives = np.sum(brightfield_images_corrected < 0)
if negatives > 0:
    print(f" Found {negatives} negative pixels — setting them to zero.")
    brightfield_images_corrected[brightfield_images_corrected < 0] = 0

# Normalize and scale to uint8
max_val = np.max(brightfield_images_corrected)
print(f" Maximum pixel value before normalization: {max_val}")
if max_val == 0:
    brightfield_images_corrected = np.zeros_like(
        brightfield_images_corrected, dtype=np.float32
    )
else:
    brightfield_images_corrected = brightfield_images_corrected / max_val
brightfield_images_corrected *= 255
corrected_movie = brightfield_images_corrected.astype(np.uint8)
```
Summary by CodeRabbit
Release Notes
New Features
Improvements
Dependencies
✏️ Tip: You can customize this high-level summary in your review settings.