Conversation
Pull request overview
This PR adds a new Optuna-driven CLIP training pipeline and introduces a new ultrasound downstream benchmark (MLP-on-embeddings) to evaluate trained models.
Changes:
- Added `train_new_pipeline.py` to train CLIP dual-encoder models and optimize hyperparameters with Optuna using downstream benchmark scores.
- Added an ultrasound anatomical region benchmark pipeline (`ultrasound_new_benchmark.py`) built on cached CLIP embeddings and an MLP classifier (`mlp_eval.py`).
- Introduced a shared `Benchmark` abstract base class to unify benchmark evaluation interfaces.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| `src/multimeditron/experts/train_new_pipeline.py` | New training + Optuna objective pipeline intended to train CLIP and evaluate via benchmarks. |
| `src/multimeditron/experts/evaluation_pipeline/ultrasound_new_benchmark.py` | New ultrasound benchmark that builds embeddings from JSONL datasets and evaluates with an MLP head. |
| `src/multimeditron/experts/evaluation_pipeline/mlp_eval.py` | New MLP evaluation utility used by the ultrasound benchmark (k-fold hyperparameter sweep + final test eval). |
| `src/multimeditron/experts/evaluation_pipeline/Benchmark.py` | New abstract `Benchmark` base class for evaluation modules. |
```python
kwargs["dataset_tags"].append(dataset.dataset_name)

trainer.create_model_card(**kwargs)
#returns the training value
```
The try/except sets train_result = None on failure, but the function later unconditionally returns train_result.metrics["train_loss"], which will raise an AttributeError and mask the original training error. Either re-raise after logging or return a sentinel value when training fails.
```suggestion
#returns the training value
if train_result is None or getattr(train_result, "metrics", None) is None:
    raise RuntimeError(
        "Training failed earlier; 'train_result' is not available. "
        "Check the training logs for the original error."
    )
```
```python
def training(model_args, data_args, training_args, dataset, n_freeze, last_checkpoint):
    # 5. Load pretrained model, tokenizer, and image processor
    if model_args.vision_model_name and model_args.text_model_name:
        # Dual encoder path
        logger.info(f"Loading dual encoder with vision model {model_args.vision_model_name} "
                    f"and text model {model_args.text_model_name}")

        model = VisionTextDualEncoderModel.from_vision_text_pretrained(
            model_args.vision_model_name,
            model_args.text_model_name,
            cache_dir=model_args.cache_dir,
            token=model_args.token,
        ).to(dtype=torch.bfloat16)
```
n_freeze is passed into training(...) and optimized in Optuna (freezed_layers), but it is never used to actually freeze any model layers. This makes the Optuna search misleading; either implement freezing (vision + text encoder layers) or remove the parameter from the objective/search space.
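For reference, a minimal sketch of what the freezing could look like. The attribute paths (`vision_model.encoder.layers` for CLIP-style encoders, `text_model.encoder.layer` for BERT-style ones) are assumptions and must be adapted to the actual architectures:

```python
# Hypothetical sketch: freeze the first n_freeze transformer blocks of each encoder.
# The .layers / .layer attribute names are assumptions and depend on the
# concrete vision/text encoder classes.
def freeze_first_layers(model, n_freeze):
    vision_layers = model.vision_model.encoder.layers  # CLIP-style encoders
    text_layers = model.text_model.encoder.layer       # BERT-style encoders
    for layers in (vision_layers, text_layers):
        for layer in list(layers)[:n_freeze]:
            for param in layer.parameters():
                param.requires_grad = False
```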
```python
dataset = dataset[list(dataset.keys())[0]]
dataset = DatasetConfig(**dataset)
if dataset.dataset_name is not None:
    if not hasattr(kwargs, "dataset_tags"):
```
kwargs is a dict, so hasattr(kwargs, "dataset_tags") is always false and kwargs["dataset_tags"] gets reset on every iteration, leaving only the last dataset tag. Use a dict key check (e.g., "dataset_tags" not in kwargs) so tags accumulate correctly.
```suggestion
    if "dataset_tags" not in kwargs:
```
```python
    torch.save(self.data, save_path + "/data_emb_test_"+ model_name + ".pt")
    torch.save(self.labels, save_path + "/data_lab_test_" + model_name + ".pt")
else:
    self.data = torch.load(save_path + "/data_embl_test_"+ model_name +".pt")
```
Cached test embeddings are saved as data_emb_test_... but loaded from data_embl_test_... (typo), so load=True will fail with FileNotFoundError. Make the save/load filenames consistent.
```suggestion
    self.data = torch.load(save_path + "/data_emb_test_"+ model_name +".pt")
```
```python
def evaluate_pipeline(model, model_name):
    device = "cuda"
    model = model.to(device)
    print("beginnig of the evaluation")
```
evaluate_pipeline forces device = "cuda" without checking availability; this will crash on CPU-only machines. Use torch.device("cuda" if torch.cuda.is_available() else "cpu") (and move tensors/model accordingly).
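A minimal sketch of that fix, reusing the names from the snippet above:

```python
# Fall back to CPU when CUDA is unavailable instead of assuming a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```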
```python
from sklearn.utils.class_weight import compute_class_weight
from Benchmark import Benchmark
import numpy as np
import os
```
Import of 'os' is not used.
```suggestion
```
```python
from transformers import (
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderConfig
)
```
Import of 'VisionTextDualEncoderConfig' is not used.
Import of 'AutoTokenizer' is not used.
Import of 'AutoImageProcessor' is not used.
Import of 'VisionTextDualEncoderProcessor' is not used.
```suggestion
from transformers import VisionTextDualEncoderModel
```
```python
    AutoTokenizer,
    VisionTextDualEncoderConfig
)
import optuna
```
Import of 'optuna' is not used.
```suggestion
```
```python
    VisionTextDualEncoderConfig
)
import optuna
from transformers import CLIPModel, CLIPProcessor
```
Import of 'CLIPProcessor' is not used.
Import of 'CLIPModel' is not used.
```suggestion
```
```python
from multiprocessing import Pool
from optuna.storages import JournalStorage
from optuna.storages.journal import JournalFileBackend
import os
```
This import of module os is redundant, as it was previously imported on line 24.
```suggestion
```
… and the mlp_eval script
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 17 comments.
```python
labelsWEIGHT = np.array(train_dataset.labels)
labelsWEIGHT = np.unique(labelsWEIGHT.astype(int))

class_weights = compute_class_weight(class_weight='balanced', classes=labelsWEIGHT, y=np.array(train_dataset.labels))
```
compute_class_weight is called with y=np.array(train_dataset.labels) where train_dataset.labels is a torch Tensor; np.array(tensor) typically produces an object array of scalar tensors and can break sklearn. Convert explicitly via train_dataset.labels.cpu().numpy() (and similarly for classes).
```suggestion
labels_np = train_dataset.labels.cpu().numpy().astype(int)
labelsWEIGHT = np.unique(labels_np)

class_weights = compute_class_weight(class_weight='balanced', classes=labelsWEIGHT, y=labels_np)
```
```python
if not hasattr(kwargs, "dataset_tags"):
    kwargs["dataset_tags"] = []
```
kwargs is a dict, so hasattr(kwargs, "dataset_tags") is always false and kwargs["dataset_tags"] gets reinitialized on every loop iteration; only the last dataset name will remain. Use a key check (e.g., "dataset_tags" not in kwargs) or kwargs.setdefault("dataset_tags", []).
```suggestion
kwargs.setdefault("dataset_tags", [])
```
```python
def __init__(self, model, model_name, load, path):
    path = "/mloscratch/users/deschryv/clipFineTune/ultrasound_evaluation/"
    if not load:
        BUSI = (load_dataset(path+"classifier-breast-radiopedia-final_train.jsonl", BREAST, "", model))
        CAMUS = (load_dataset(path+"classifier-heart-radiopedia-final_train.jsonl", OTHER, "", model))
        COVIDUS = load_dataset(path+"classifier-lungs-radiopedia-final_train.jsonl", OTHER, "", model)
```
BodyPartsDataset.__init__ overwrites the provided path argument with a hard-coded absolute path. This makes the benchmark non-portable and ignores the caller’s configuration. Use the passed-in path/save_path arguments (or expose these paths as CLI args/env vars) instead of embedding user-specific filesystem paths in the library code.
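One possible shape of the fix, a sketch assuming an environment-variable override (the variable name `ULTRASOUND_EVAL_PATH` is hypothetical):

```python
import os

def __init__(self, model, model_name, load, path):
    # Honor the caller-supplied path; optionally allow an env-var override.
    # ULTRASOUND_EVAL_PATH is a hypothetical variable name.
    path = os.environ.get("ULTRASOUND_EVAL_PATH", path)
```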
```python
@dataclass
class DatasetConfig:
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    data_dir: Optional[str] = field(default=None, metadata={"help": "The data directory containing input files."})
    image_column: Optional[str] = field(
        default="modalities",
        metadata={"help": "The name of the column in the datasets containing the full image file paths."},
    )
    caption_column: Optional[str] = field(
        default="text",
        metadata={"help": "The name of the column in the datasets containing the image captions."},
    )
    weight: Optional[float] = field(
        default=1.0, metadata={"help": "The weight to assign to this dataset during training."}
    )
```
This pipeline’s DatasetConfig uses dataset_name, but the existing training script (train_clip.py) and config tooling use dataset_path (see train_clip.py’s DatasetConfig). As written, YAML configs generated for the existing pipeline won’t parse here (DatasetConfig(**config) will reject dataset_path). Consider keeping the same field name for compatibility or supporting both keys.
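A sketch of the both-keys option, assuming configs arrive as plain dicts before `DatasetConfig(**...)` is called:

```python
# Accept the legacy "dataset_path" key from train_clip.py configs as an
# alias for "dataset_name" before constructing DatasetConfig.
if "dataset_path" in config and "dataset_name" not in config:
    config["dataset_name"] = config.pop("dataset_path")
dataset = DatasetConfig(**config)
```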
```python
TYROID = 3
```
Spelling: TYROID should be THYROID (and similarly in comments/variable names) to avoid confusion and improve readability.
| fig.write_html("parallel_coordinate.html") | ||
| fig = optuna.visualization.plot_param_importances(study) | ||
| fig.write_html("plot_param_importance.html") | ||
| print("studes ploted") No newline at end of file |
Typo in user-facing output: studes ploted → studies plotted (or similar).
| print("studes ploted") | |
| print("studies plotted") |
```python
def training(model_args, data_args, training_args, dataset, n_freeze, last_checkpoint):
    # 5. Load pretrained model, tokenizer, and image processor
    if model_args.vision_model_name and model_args.text_model_name:
        # Dual encoder path
        logger.info(f"Loading dual encoder with vision model {model_args.vision_model_name} "
                    f"and text model {model_args.text_model_name}")

        model = VisionTextDualEncoderModel.from_vision_text_pretrained(
            model_args.vision_model_name,
            model_args.text_model_name,
            cache_dir=model_args.cache_dir,
            token=model_args.token,
        ).to(dtype=torch.bfloat16)

        tokenizer = AutoTokenizer.from_pretrained(
```
n_freeze/freezed_layers is passed into training() but never used to actually freeze any layers, so Optuna will waste trials on a no-op hyperparameter. Either implement the freezing logic (e.g., freeze first N layers of the vision/text encoders) or remove the parameter from the objective.
```python
data_loader = DataLoader(dataset=train_dataset, batch_size=512)
test_dataset = BodyPartsDatasetTEST(model, model_name, False, "/mloscratch/users/deschryv/clipFineTune/embeddings")
print("test dataset loaded")
test_loader = DataLoader(dataset=test_dataset, batch_size=512)
```
Variable test_loader is not used.
```suggestion
```
```python
from abc import ABC, abstractmethod
from transformers import VisionTextDualEncoderConfig, VisionTextDualEncoderModel
```
Import of 'VisionTextDualEncoderConfig' is not used.
Import of 'VisionTextDualEncoderModel' is not used.
```suggestion
```
```python
import sys
from dataclasses import dataclass, field
from typing import List, Optional
from plotly.io import show
```
Import of 'show' is not used.
```suggestion
```
…tion to load the data into the xray benchmark, and added scripts to adapt the lion_roberta model to hugging face transformers.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 29 comments.
```python
from Benchmark import Benchmark
import os
from tqdm import tqdm
from transformers import VisionTextDualEncoderModel
```
Duplicate import: VisionTextDualEncoderModel is imported twice, once on line 9 and again on line 11.
```suggestion
```
```python
def evaluate(self, clip_path):
    l = clip_path.split('/')
    l.reverse()
    clip_name = l[0]
    self.clip_name=clip_name
    eval_pipeline(clip_path, clip_name, self.sample_number, self.csv_path, self.device)
```
Missing return statement: The XRay_benchmark.evaluate method doesn't return any value, but according to the Benchmark abstract class contract (which requires returning a float), it should return the evaluation result from eval_pipeline.
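A minimal sketch of the fix, assuming `eval_pipeline` already computes and returns the score (if it only prints it today, it would need to return the value as well); the sketch also guards against trailing slashes in the path:

```python
def evaluate(self, clip_path):
    # Take the last path component as the model name.
    clip_name = clip_path.rstrip('/').split('/')[-1]
    self.clip_name = clip_name
    # Propagate the score so the Benchmark contract (returning a float) holds.
    return eval_pipeline(clip_path, clip_name, self.sample_number, self.csv_path, self.device)
```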
```python
class BodyPartsDatasetTEST(Dataset):

    def __init__(self, model, model_name, load, save_path):
        path = "/mloscratch/users/deschryv/clipFineTune/ultrasound_evaluation/"
```
Hard-coded path: The path is hard-coded to a specific user directory '/mloscratch/users/deschryv/clipFineTune/ultrasound_evaluation/'. This will fail when run by other users or on different systems. Consider making this configurable via a parameter or environment variable.
```python
train_dataset = BodyPartsDataset(model, model_name, False, "/mloscratch/users/deschryv/clipFineTune/embeddings")
print("traning dataset loaded")
data_loader = DataLoader(dataset=train_dataset, batch_size=512)
test_dataset = BodyPartsDatasetTEST(model, model_name, False, "/mloscratch/users/deschryv/clipFineTune/embeddings")
```
Hard-coded path: The path is hard-coded to a specific user directory '/mloscratch/users/deschryv/clipFineTune/embeddings'. This will fail when run by other users or on different systems. Consider making this configurable via a parameter or environment variable.
```suggestion
save_path = os.getenv("ULTRASOUND_EMBEDDINGS_PATH", "/mloscratch/users/deschryv/clipFineTune/embeddings")
train_dataset = BodyPartsDataset(model, model_name, False, save_path)
print("traning dataset loaded")
data_loader = DataLoader(dataset=train_dataset, batch_size=512)
test_dataset = BodyPartsDatasetTEST(model, model_name, False, save_path)
```
```python
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils.versions import require_version
from lion.modeling_clip import OpenCLIPVisionTextDualEncoderModel, VisionTextDualEncoderConfig
```
Name collision: VisionTextDualEncoderConfig is imported twice - once from transformers on line 50 and again from lion.modeling_clip on line 58. The second import will shadow the first one, which could cause confusion or unexpected behavior if the two configs are different.
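One way to keep both configs visible, a sketch using an import alias (the alias name is arbitrary):

```python
# Alias the lion config so it no longer shadows the transformers one.
from transformers import VisionTextDualEncoderConfig
from lion.modeling_clip import (
    OpenCLIPVisionTextDualEncoderModel,
    VisionTextDualEncoderConfig as LionVisionTextDualEncoderConfig,
)
```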
```python
df = pd.read_csv(input_path)

df_randomized = df.sample(frac=1, random_state=seed).reset_index(drop=True)
```
Missing import: pd (pandas) is used on lines 22 and 24 but is never imported.
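The fix is a one-line import at the top of the module:

```python
import pandas as pd  # needed for pd.read_csv and DataFrame.sample below
```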
```python
encoder = model.text_model.encoder
for i in range(n_freeze):
    for param in encoder.layer[i].parameters():
        param.requires_grad = False
```
Potential error: The code assumes that the text encoder has a .layer attribute (line 392), but this is specific to certain architectures like BERT. For models like RoBERTa or XLM-RoBERTa (which is used on line 356), the attribute is .encoder.layer. This could cause an AttributeError for incompatible architectures. The code should check the model architecture or use a more robust way to access encoder layers.
```suggestion
# Obtain the text encoder module; fall back to the text_model itself if needed
encoder = getattr(model.text_model, "encoder", model.text_model)
# Freeze the first n_freeze layers of the text encoder in a model-agnostic way
text_layers = None
for attr_name in ("layer", "layers"):
    if hasattr(encoder, attr_name):
        text_layers = getattr(encoder, attr_name)
        break
if text_layers is None:
    logger.warning(
        "Could not identify encoder layers on text model; "
        "skipping freezing of text encoder layers."
    )
else:
    for i, layer in enumerate(text_layers):
        if i >= n_freeze:
            break
        for param in layer.parameters():
            param.requires_grad = False
```
| torch.save(self.data, "data_embl_" + model_name + ".pt") | ||
| torch.save(self.labels, "data_lab_" + model_name + ".pt") | ||
| else: | ||
| self.data = torch.load(path + "/data_embl_"+ model_name +".pt") | ||
| self.labels = torch.load(path + "/data_embl_"+ model_name + ".pt") |
Path inconsistency: When saving embeddings, the code saves to the current working directory (lines 58-59) without any path prefix, but when loading, it tries to load from path + "/data_embl_..." (line 61), which includes a directory prefix. This mismatch will cause file not found errors when trying to load cached embeddings. The save and load paths should be consistent.
```suggestion
    torch.save(self.data, os.path.join(path, "data_embl_" + model_name + ".pt"))
    torch.save(self.labels, os.path.join(path, "data_lab_" + model_name + ".pt"))
else:
    self.data = torch.load(os.path.join(path, "data_embl_" + model_name + ".pt"))
    self.labels = torch.load(os.path.join(path, "data_lab_" + model_name + ".pt"))
```
```python
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils.versions import require_version
from lion.modeling_clip import OpenCLIPVisionTextDualEncoderModel, VisionTextDualEncoderConfig
```
Fragile import for OpenCLIPVisionTextDualEncoderModel: the class (used on line 355) is imported from the bare lion.modeling_clip path on line 58, which only resolves when the script is run from that directory. Use the fully qualified package path so the import works from the intended execution context.
```suggestion
from multimeditron.experts.lion.modeling_clip import OpenCLIPVisionTextDualEncoderModel
```
```python
training_args.fp16 = False
training_args.bf16 = False
```
These two lines modify training_args.fp16 and training_args.bf16 to False after the Trainer has already been initialized. These changes will have no effect on the already-created Trainer object. If the intent is to disable mixed precision training, these assignments should occur before line 503 where the Trainer is created.
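A sketch of the reordering, with placeholder constructor arguments:

```python
# Disable mixed precision *before* Trainer.__init__ snapshots training_args.
training_args.fp16 = False
training_args.bf16 = False

trainer = Trainer(
    model=model,
    args=training_args,           # now reflects the fp16/bf16 overrides
    train_dataset=train_dataset,  # placeholder: use the script's actual datasets
)
```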
addition of the new CLIP training pipeline and ultrasound benchmark