pertzlab/LatentLens


🔎 LatentLens: A Microscope for Latent Spaces

LatentLens is an interactive workbench designed to bridge the gap between high-dimensional embeddings and biological discovery. By projecting complex feature spaces into a human-navigable 2D interface, it allows researchers to see the world "through their model's eyes."

Instead of treating neural networks as black boxes, LatentLens provides the tools to validate, annotate, and train models in a single, fluid Human-in-the-Loop workflow.

(Screenshot: the LatentLens interface)

🧠 Architecture Overview

LatentLens is built as an interactive local web application:

  • Dash – UI framework for the browser-based interface
  • Plotly – high-performance visualization (UMAP scatter, densities)
  • UMAP (umap-learn) – dimensionality reduction
  • scikit-learn – on-the-fly classification (Logistic Regression)

The application runs as a local server and is accessed through your browser.


✨ Key Features

1. Model-Eye Perspective

  • Visual Ground-Truth: Click any point in the UMAP to instantly render the original image crop alongside its nearest neighbors (`n_nearest_neighbors`). Verify whether a cluster represents a true phenotype or a technical artifact.
  • Track Trajectories: For temporal data, visualize the journey of a track_id through the latent space to identify state transitions and phenotypic shifts.
  • Density & Class Mapping: Switch between global scatter plots and class-specific densities to reveal hidden over-representations.
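The neighbor lookup behind the image gallery can be sketched with scikit-learn (the array names, the value of `k`, and the clicked index here are illustrative stand-ins, not LatentLens internals):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
umap_coords = rng.normal(size=(1000, 2))  # stand-in for the 2D UMAP projection

# Fit once, then query on every click. Ask for k+1 neighbors because
# the clicked point is always its own nearest neighbor (distance 0).
k = 8
nn = NearestNeighbors(n_neighbors=k + 1).fit(umap_coords)

clicked_idx = 42
_, idx = nn.kneighbors(umap_coords[clicked_idx:clicked_idx + 1])
neighbor_ids = idx[0][1:]  # drop the clicked point itself
```

The returned indices can then be used to look up the corresponding image crops for the gallery.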

2. Active Discovery & Labeling

  • Lasso Annotation: Mass-select clusters directly in the feature distribution to assign ground-truth labels.
  • Interactive LogReg: Train a Logistic Regression classifier on the fly. Results can be projected back onto the UI for inspection.
  • Uncertainty Coloring: Color the UMAP by Margin Sampling Uncertainty. This highlights the model's "decision boundaries," showing where the classifier is struggling and where your labels are needed most.
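Margin-sampling uncertainty as described above can be sketched as follows (synthetic data; variable names are illustrative, not the tool's internals):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))  # stand-in for high-dimensional embeddings
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)  # two phenotype labels

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Margin sampling: 1 - (p_top1 - p_top2). Values near 1 mean the classifier
# is torn between its two best guesses, i.e. the point sits near a decision
# boundary and is a good candidate for manual labeling.
proba = np.sort(clf.predict_proba(X), axis=1)
uncertainty = 1.0 - (proba[:, -1] - proba[:, -2])
most_informative = np.argsort(uncertainty)[::-1][:20]  # label these next
```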

3. High-Performance Engineering

  • Scalability: Optimized for smooth interactivity with 100k+ points, supporting deep investigation of datasets up to 500k points (possibly more).
  • Dynamic Re-projection: Update UMAP parameters while preserving all manual annotations, allowing you to find the most "resolvable" view of your data without losing progress.

🟢 How to use LatentLens (3-Step Workflow)

  1. Explore: Move around the UMAP. Click on points to see if the images in the gallery look like what you expect.
  2. Label: Use the Lasso Tool (top right of the plot) to circle a group of points. Type a name for this phenotype when prompted.
  3. Train & Improve: Once you have 2+ labels, click Train Classifier. Switch the view to Uncertainty to see exactly where the model needs more of your help!

🛠 Data Architecture

LatentLens is built for speed and memory efficiency, using line-by-line parsing for large-scale embedding datasets.

Column Specifications

To ensure full functionality, your input data should contain the following:

| Column | File Type | Status | Description |
| --- | --- | --- | --- |
| `embedding` | JSONL | Required | High-dimensional vectors used for UMAP and classification |
| `track_id` | JSONL/CSV | Required | Groups timepoints belonging to the same object (e.g., cell), preventing data leakage and enabling trajectory visualization |
| `path` | JSONL | Required | Absolute path to the `.tif` file for image rendering |
| `filename` | JSONL/CSV | Required | The join key used to merge features with metadata |
| `umap_1` / `umap_2` | JSONL | Optional | Pre-computed coordinates; if missing, UMAP will run on launch |
| `class` / `t_start` | CSV | Optional | Automatically mapped to `phenotype` and `t` internally |

Feature Files (.jsonl)

The tool ingests features where each line is a self-contained record.

{
  "track_id": 1, 
  "t": 25, 
  "embedding": [1.39, -0.80, ...], 
  "path": "/path/to/image.tif", 
  "filename": "Exp01_Site01"
}

⚠️ All embeddings must have the same dimensionality across the dataset.
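A minimal line-by-line loader matching the record format above might look like this (a sketch, not the tool's actual parser; `load_features` is a hypothetical name):

```python
import json
import numpy as np

REQUIRED = {"track_id", "embedding", "path", "filename"}

def load_features(path):
    """Stream a .jsonl feature file line by line, validating each record."""
    records, dim = [], None
    with open(path) as fh:
        for line_no, line in enumerate(fh, start=1):
            rec = json.loads(line)
            missing = REQUIRED - rec.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing columns {missing}")
            # Enforce consistent embedding dimensionality across the dataset.
            if dim is None:
                dim = len(rec["embedding"])
            elif len(rec["embedding"]) != dim:
                raise ValueError(f"line {line_no}: embedding dim "
                                 f"{len(rec['embedding'])} != {dim}")
            records.append(rec)
    embeddings = np.array([r["embedding"] for r in records])
    return records, embeddings
```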

Metadata Files (.csv)

Easily merge existing experimental metadata. The tool automatically maps common keys like t_start and class to the internal t and phenotype columns.

| file_path | filename | track_id | t_start | class |
| --- | --- | --- | --- | --- |
| path/to/img.tif | Exp01_Site01 | 1 | 0 | STATE_A |
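The automatic key mapping can be sketched with pandas (an illustrative sketch based on the description above, not the tool's exact merge logic):

```python
import pandas as pd

# Hypothetical rename map reflecting the documented behavior.
RENAME = {"t_start": "t", "class": "phenotype"}

features = pd.DataFrame({"filename": ["Exp01_Site01"], "track_id": [1], "t": [25]})
metadata = pd.DataFrame({
    "file_path": ["path/to/img.tif"],
    "filename": ["Exp01_Site01"],
    "track_id": [1],
    "t_start": [0],
    "class": ["STATE_A"],
})

# Merge on the documented join key, keeping feature columns on a name clash.
merged = features.merge(metadata.rename(columns=RENAME),
                        on="filename", how="left",
                        suffixes=("", "_meta"))
```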

📁 Example Data

LatentLens includes synthetic example datasets in data/. Each dataset comes in two variants, with 5 and 12 frames per track:

  • Mini datasets (fast testing, ~25k)
    • features_5f_mini.jsonl
    • features_12f_mini.jsonl
  • Larger datasets (~250k)
    • sq_rect_tracks_5f_features.jsonl
    • sq_rect_tracks_12f_features.jsonl
  • Metadata files (work with full and subsampled features)
    • metadata_f5_synth.csv
    • metadata_f12_synth.csv

Quick test:

python run_app.py --features data/features_12f_mini.jsonl --metadata data/metadata_f12_synth.csv

📦 Installation & Setup

1. Environment Setup (Conda)

conda env create -f environment.yml
conda activate umap_app_env

If you prefer manual installation, ensure your environment includes these core dependencies:

  • numpy < 2.4
  • plotly >= 5, < 6
  • dash >= 2.14
  • scikit-learn
  • umap-learn
  • tifffile
  • anywidget

2. Quick Start

The easiest way to launch the explorer is via the run_app.py script.

Basic Launch:

python run_app.py --features data/features.jsonl --metadata data/metadata.csv

Launch for Remote Access (HPC): If running on a remote cluster, the app binds to 0.0.0.0 by default. Note the compute node's hostname (e.g. gpu-node-01) and use SSH tunneling from your local machine to view the interface:

ssh -L 8050:gpu-node-01:8050 user@hpc-address

Then navigate to http://localhost:8050 in your browser.

If you are connected to the network via VPN, you can simply start the server and open http://<node-hostname>:8050/ in your browser.

If working remotely, do not forget to start an interactive session first:

salloc --gres=gpu:1 --mem=64G --time=02:00:00

⚙️ UMAP Defaults

By default, LatentLens uses:

  • n_neighbors = 50
  • min_dist = 0.1
  • n_components = 2
  • random_state = 42

These parameters can be adjusted dynamically in the UI.

💾 Saving Your Progress

To ensure your discovery and labeling work is preserved, use the Export Tab in the interface:

  • Export Data: You can manually save the entire working dataframe as a .csv. This includes all original metadata plus newly assigned lasso labels, model predictions, and uncertainty scores. You will be prompted to provide a save path.
  • Export Classifier: The trained Logistic Regression model can be exported as a .pkl file. This allows you to apply your custom phenotype classifier to other datasets later.
  • Manual Trigger: Note that saving is not automatic. Always use the Export tab before closing your session or shutting down the HPC job.

📤 Outputs

LatentLens produces the following artifacts:

  • Annotated dataset (.csv)

    • Original metadata
    • User-defined labels (lasso annotations)
    • Model predictions
    • Uncertainty scores
  • Trained classifier (.pkl)

    • scikit-learn LogisticRegression model
    • Can be reused to label new datasets
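Reusing the exported .pkl on a new dataset can be sketched with pickle (the classifier and file name here are stand-ins; the Export tab produces the actual file):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for a classifier trained and exported from the Export tab.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.integers(0, 2, size=200)
clf = LogisticRegression(max_iter=1000).fit(X, y)
with open("phenotype_clf.pkl", "wb") as fh:
    pickle.dump(clf, fh)

# Later, on a new dataset whose embeddings have the same dimensionality:
with open("phenotype_clf.pkl", "rb") as fh:
    clf_reloaded = pickle.load(fh)
new_labels = clf_reloaded.predict(rng.normal(size=(10, 8)))
```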

⚠️ Common Pitfalls

If the app isn't behaving as expected, check these common issues:

  • "Image Not Found" in Gallery:
    • Ensure the path in your JSONL is an absolute path (e.g., /home/user/data/img.tif).
    • Windows users: Use forward slashes / or escaped backslashes \\ even if running locally.
  • App Won't Load in Browser:
    • If using a VPN, ensure you are using the specific node hostname (e.g., gpu-node-01) and not just localhost.
    • Double-check that the port (default 8050) isn't being blocked by a firewall.
  • UMAP Calculation is Slow:
    • For datasets >100k points, the initial calculation can take a few minutes. Check the terminal for progress updates.
  • Low Memory (RAM) Crashes:
    • Loading 500k high-dimensional embeddings requires significant RAM. If the app crashes on launch, try requesting more memory in your salloc command (e.g., --mem=128G).
  • Broken Layout:
    • If the UI looks scrambled, ensure you are using a modern browser (Chrome, Firefox, or Edge) and that the assets/ folder was correctly included in your installation.

🚀 Why LatentLens?

  • Validate Black Boxes: Don't just trust a metric; see the images the model is grouping together.
  • Model-Native Neighbors: The image gallery shows the $k$-nearest neighbors based on the UMAP projection, ensuring the visual context matches the spatial distribution you see on screen.
  • Accelerated Annotation: Use the model’s own uncertainty (Margin Sampling) to decide which points to label next, creating a virtuous cycle of model improvement.
  • Beyond 2D: While the view is 2D, the insights are high-dimensional. Identify new phenotypes by finding clusters that the model has separated before you have even named them.
  • No Data Leakage: The internal classifier utilizes a track-level split, ensuring that timepoints from the same object don't appear in both training and validation sets.
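A track-level split as described can be sketched with scikit-learn's GroupShuffleSplit (illustrative, not necessarily LatentLens's exact implementation; the synthetic tracks are stand-ins):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))          # stand-in embeddings
y = rng.integers(0, 2, size=300)        # stand-in labels
track_id = np.repeat(np.arange(30), 10)  # 30 tracks, 10 timepoints each

# Split by track so no track contributes to both train and validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(splitter.split(X, y, groups=track_id))

# No track appears on both sides of the split.
assert set(track_id[train_idx]).isdisjoint(track_id[val_idx])
```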
