LatentLens is an interactive workbench designed to bridge the gap between high-dimensional embeddings and biological discovery. By projecting complex feature spaces into a human-navigable 2D interface, it allows researchers to see the world "through their model's eyes."
Instead of treating neural networks as black boxes, LatentLens provides the tools to validate, annotate, and train models in a single, fluid Human-in-the-Loop workflow.
LatentLens is built as an interactive local web application:
- Dash – UI framework for the browser-based interface
- Plotly – high-performance visualization (UMAP scatter, densities)
- UMAP (umap-learn) – dimensionality reduction
- scikit-learn – on-the-fly classification (Logistic Regression)
The application runs as a local server and is accessed through your browser.
- Visual Ground-Truth: Click any point in the UMAP to instantly render the original image crop alongside its `n_nearest_neighbors` nearest neighbors. Verify whether a cluster represents a true phenotype or a technical artifact.
- Track Trajectories: For temporal data, visualize the journey of a `track_id` through the latent space to identify state transitions and phenotypic shifts.
- Density & Class Mapping: Switch between global scatter plots and class-specific densities to reveal hidden over-representations.
- Lasso Annotation: Mass-select clusters directly in the feature distribution to assign ground-truth labels.
- Interactive LogReg: Train a Logistic Regression classifier on the fly. Results can be projected back onto the UI for inspection.
- Uncertainty Coloring: Color the UMAP by Margin Sampling Uncertainty. This highlights the model's "decision boundaries," showing where the classifier is struggling and where your labels are needed most.
- Scalability: Optimized for smooth interactivity with 100k+ points, supporting deep investigation of datasets up to 500k points (possibly more).
- Dynamic Re-projection: Update UMAP parameters while preserving all manual annotations, allowing you to find the most "resolvable" view of your data without losing progress.
- Explore: Move around the UMAP. Click on points to see if the images in the gallery look like what you expect.
- Label: Use the Lasso Tool (top right of the plot) to circle a group of points. Type a name for this phenotype when prompted.
- Train & Improve: Once you have 2+ labels, click Train Classifier. Switch the view to Uncertainty to see exactly where the model needs more of your help!
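Under the hood, the train-and-inspect loop amounts to fitting a classifier on the labeled subset and scoring margin-sampling uncertainty for every point. A sketch of that idea (the function and variable names are illustrative, not LatentLens's internal API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_and_score(embeddings, labels):
    """Fit a logistic regression on the labeled points and return
    margin-sampling uncertainty for every point: 1 minus the gap
    between the two highest class probabilities. Values near 1 mean
    the classifier is nearly undecided between its top two classes.
    `labels` holds None for unlabeled points."""
    labeled = np.array([l is not None for l in labels])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings[labeled], np.asarray(labels, dtype=object)[labeled])
    proba = clf.predict_proba(embeddings)
    top2 = np.sort(proba, axis=1)[:, -2:]     # two largest probs per row
    uncertainty = 1.0 - (top2[:, 1] - top2[:, 0])
    return clf, uncertainty
```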
LatentLens is built for speed and memory efficiency, utilizing line-by-line parsing for large-scale embedding datasets.
To ensure full functionality, your input data should contain the following:
| Column | File Type | Status | Description |
|---|---|---|---|
| `embedding` | JSONL | Required | High-dimensional vectors used for UMAP and classification |
| `track_id` | JSONL/CSV | Required | Groups timepoints belonging to the same object (e.g., cell), preventing data leakage and enabling trajectory visualization |
| `path` | JSONL | Required | Absolute path to the `.tif` file for image rendering |
| `filename` | JSONL/CSV | Required | The join key used to merge features with metadata |
| `umap_1` / `umap_2` | JSONL | Optional | Pre-computed coordinates; if missing, UMAP runs on launch |
| `class` / `t_start` | CSV | Optional | Automatically mapped to the internal `phenotype` and `t` columns |
The tool ingests features where each line is a self-contained record:

```
{
  "track_id": 1,
  "t": 25,
  "embedding": [1.39, -0.80, ...],
  "path": "/path/to/image.tif",
  "filename": "Exp01_Site01"
}
```

Existing experimental metadata can be merged in easily. The tool automatically maps common keys like `t_start` and `class` to the internal `t` and `phenotype` columns.
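Line-by-line parsing of such records needs nothing beyond the standard library; a minimal sketch (the helper name is illustrative):

```python
import json

def read_features(path):
    """Stream a JSONL features file one record at a time, so the
    full file never has to sit in memory as raw text."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:                      # skip blank lines
                yield json.loads(line)

# records = list(read_features("data/features.jsonl"))
```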
| file_path | filename | track_id | t_start | class |
|---|---|---|---|---|
| path/to/img.tif | Exp01_Site01 | 1 | 0 | STATE_A |
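The key mapping and join described above boil down to a column rename followed by a merge on `filename`; a sketch with pandas (the helper name is illustrative, and this is not LatentLens's actual merge code):

```python
import pandas as pd

# Common metadata keys mapped onto the internal column names.
KEY_MAP = {"t_start": "t", "class": "phenotype"}

def merge_metadata(features: pd.DataFrame, metadata: pd.DataFrame) -> pd.DataFrame:
    """Join metadata onto features via the 'filename' key, renaming
    common columns (t_start -> t, class -> phenotype) on the way."""
    meta = metadata.rename(columns=KEY_MAP)
    return features.merge(meta, on="filename", how="left")
```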
LatentLens includes synthetic example datasets in `data/`. Available files (once with 5, once with 12 frames):
- Mini datasets (fast testing, ~25k points): `features_5f_mini.jsonl`, `features_12f_mini.jsonl`
- Larger datasets (~250k points): `sq_rect_tracks_5f_features.jsonl`, `sq_rect_tracks_12f_features.jsonl`
- Metadata files (work with both full and subsampled features): `metadata_f5_synth.csv`, `metadata_f12_synth.csv`

Quick test:

```
python run_app.py --features data/features_12f_mini.jsonl --metadata data/metadata_f12_synth.csv
```

To set up the environment:

```
conda env create -f environment.yml
conda activate umap_app_env
```

If you prefer manual installation, ensure your environment includes these core dependencies:
- numpy < 2.4
- plotly >= 5, < 6
- dash >= 2.14
- scikit-learn
- umap-learn
- tifffile
- anywidget
The easiest way to launch the explorer is via the `run_app.py` script.

Basic Launch:

```
python run_app.py --features data/features.jsonl --metadata data/metadata.csv
```

Launch for Remote Access (HPC): If running on a remote cluster, the app binds to `0.0.0.0` by default. Note the node name (e.g. `gpu-node-01`) and use SSH tunneling from your local machine to view the interface:

```
ssh -L 8050:gpu-node-01:8050 user@hpc-address
```

Then navigate to http://localhost:8050 in your browser.

If you are connected to the network via VPN, you can simply start the server and open http://<node-name>:8050/ in your browser.

Also, do not forget to start an interactive session if working remotely:

```
salloc --gres=gpu:1 --mem=64G --time=02:00:00
```

By default, LatentLens uses:
- `n_neighbors = 50`
- `min_dist = 0.1`
- `n_components = 2`
- `random_state = 42`
These parameters can be adjusted dynamically in the UI.
To ensure your discovery and labeling work is preserved, use the Export Tab in the interface:
- Export Data: You can manually save the entire working dataframe as a `.csv`. This includes all original metadata plus newly assigned lasso labels, model predictions, and uncertainty scores. You will be prompted to provide a save path.
- Export Classifier: The trained Logistic Regression model can be exported as a `.pkl` file. This allows you to apply your custom phenotype classifier to other datasets later.
- Manual Trigger: Note that saving is not automatic. Always use the Export tab before closing your session or shutting down the HPC job.
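Reusing an exported classifier later is standard scikit-learn pickling; a round-trip sketch with illustrative stand-in data (the file path and arrays are placeholders, not LatentLens's actual export format):

```python
import pickle
import tempfile
import numpy as np
from sklearn.linear_model import LogisticRegression

# Persist a trained classifier as a .pkl, then reload it to label new
# embeddings. The tiny arrays here stand in for real feature vectors.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.1], [1.0, 1.0]])
y = ["STATE_A", "STATE_A", "STATE_B", "STATE_B"]
clf = LogisticRegression().fit(X, y)

with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as fh:
    pickle.dump(clf, fh)
    model_path = fh.name

with open(model_path, "rb") as fh:
    reloaded = pickle.load(fh)

new_embeddings = np.array([[0.05, 0.1], [0.95, 1.05]])
predictions = reloaded.predict(new_embeddings)
```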
LatentLens produces the following artifacts:
- Annotated dataset (`.csv`)
  - Original metadata
  - User-defined labels (lasso annotations)
  - Model predictions
  - Uncertainty scores
- Trained classifier (`.pkl`)
  - scikit-learn `LogisticRegression` model
  - Can be reused to label new datasets
If the app isn't behaving as expected, check these common issues:
- "Image Not Found" in Gallery:
  - Ensure the `path` in your JSONL is an absolute path (e.g., `/home/user/data/img.tif`).
  - Windows users: Use forward slashes `/` or escaped backslashes `\\` even if running locally.
- App Won't Load in Browser:
  - If using a VPN, ensure you are using the specific node hostname (e.g., `gpu-node-01`) and not just `localhost`.
  - Double-check that the port (default `8050`) isn't being blocked by a firewall.
- UMAP Calculation is Slow:
  - For datasets >100k points, the initial calculation can take a few minutes. Check the terminal for progress updates.
- Low Memory (RAM) Crashes:
  - Loading 500k high-dimensional embeddings requires significant RAM. If the app crashes on launch, try requesting more memory in your `salloc` command (e.g., `--mem=128G`).
- Broken Layout:
  - If the UI looks scrambled, ensure you are using a modern browser (Chrome, Firefox, or Edge) and that the `assets/` folder was correctly included in your installation.
- Validate Black Boxes: Don't just trust a metric; see the images the model is grouping together.
- Model-Native Neighbors: The image gallery shows the $k$-nearest neighbors based on the UMAP projection, ensuring the visual context matches the spatial distribution you see on screen.
- Accelerated Annotation: Use the model's own uncertainty (Margin Sampling) to decide which points to label next, creating a virtuous cycle of model improvement.
- Beyond 2D: While the view is 2D, the insights are high-dimensional. Identify new phenotypes by finding clusters that the model has separated before you have even named them.
- No Data Leakage: The internal classifier uses a track-level split, ensuring that timepoints from the same object don't appear in both training and validation sets.
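A track-level split of this kind can be sketched with scikit-learn's `GroupShuffleSplit`, using `track_id` as the grouping key (the data here is synthetic and only illustrates the mechanism):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Split so that all timepoints of a track land on the same side,
# preventing leakage between training and validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))                 # illustrative embeddings
track_ids = np.repeat(np.arange(20), 5)       # 20 tracks x 5 timepoints

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(splitter.split(X, groups=track_ids))

# No track appears on both sides of the split.
assert set(track_ids[train_idx]).isdisjoint(track_ids[val_idx])
```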
