Welcome to the Prompt-Guided Image Segmentation (PSEG) repository! PSEG is an end-to-end PyTorch implementation of a multimodal segmentation framework. By combining state-of-the-art vision and language foundation models, it moves beyond fixed-vocabulary segmentation: given an image and a free-text natural language prompt, it produces a precise pixel-level mask for the described object, handling open-vocabulary instances and resolving semantic ambiguities.
At the core of PSEG is a unified architecture that pairs large, frozen pre-trained backbones with a lightweight, trainable segmentation head. Rather than fine-tuning the large models, it fuses dense patch-level semantic features from DINOv2 with global text embeddings from CLIP, and feeds both into a SAM-initialized (Segment Anything Model) mask decoder. This keeps the trainable parameter count very low while preserving fine-grained spatial understanding and robust cross-modal grounding, yielding strong segmentation performance with reduced overhead and faster convergence.
```mermaid
graph TD
A[Image Input] --> B
C[Text Prompt] --> D
subgraph "Visual Backbone (Frozen)"
B[DINOv2 facebook/dinov2-base<br>Dim: 768, Resolution: 16x16]
B -- "Multi-scale features<br>(Layers: 3, 6, 9, 11)" --> E[FPN Neck]
E -- "Fused Visual Features<br>Dim: 256" --> G
end
subgraph "Text Backbone (Frozen)"
D[CLIP openai/clip-vit-base-patch16<br>Max Length: 77]
D -- "L2 Norm / CLS Token<br>Dim: 512" --> F[Text Projection Linear]
F -- "Text Embedding<br>Dim: 256" --> G
end
subgraph "Mask Decoder (Trainable)"
H[Mask Token] --> G
I[IoU Token] --> G
G[Two-Way Transformer Blocks<br>Init: SAM-Base, Depth: 3, Heads: 8]
G -- "Mask Tokens &<br>Image Features" --> J[Dynamic Projection &<br>Progressive Upsampler]
G -- "IoU Token" --> K[IoU Prediction Head]
end
J --> L[Segmentation Mask]
K --> M[Predicted IoU Score]
```
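To make the diagram concrete, here is a minimal, self-contained sketch of how the fused visual features, the projected text embedding, and the learnable mask/IoU tokens could meet in a SAM-style two-way attention block. All module and variable names here are hypothetical stand-ins, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical shapes matching the diagram: 16x16 patch grid, dim 256.
B, H, W, D = 2, 16, 16, 256

class TwoWayBlockSketch(nn.Module):
    """Minimal stand-in for one SAM-style two-way attention block:
    tokens attend to image features, then image features attend back."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, image):
        tokens = tokens + self.t2i(tokens, image, image)[0]
        image = image + self.i2t(image, tokens, tokens)[0]
        return tokens, image

# Fused FPN features (B, H*W, 256), one projected text embedding (B, 1, 256),
# and learnable mask & IoU tokens, as in the diagram.
image_feats = torch.randn(B, H * W, D)
text_emb = torch.randn(B, 1, D)
mask_tok = torch.zeros(B, 1, D)
iou_tok = torch.zeros(B, 1, D)

tokens = torch.cat([iou_tok, mask_tok, text_emb], dim=1)  # (B, 3, 256)
block = TwoWayBlockSketch()
tokens, image_feats = block(tokens, image_feats)
```

After the two-way blocks, the mask token is projected against the upsampled image features to produce the mask, and the IoU token goes to the IoU prediction head.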
| Model Component | Base Model | State | Extraction / Details | Output Dim |
|---|---|---|---|---|
| Visual Encoding | facebook/dinov2-base | Frozen | Patch size 14 (16×16 grid). Multi-scale hidden states from layers [3, 6, 9, 11]; FPN neck fusion via convolutions, GroupNorm, and GELU. | 256 |
| Text Encoding | openai/clip-vit-base-patch16 | Frozen | CLS-token projection from the last hidden state, followed by L2 normalization. | 256 (projected) |
| Mask Decoder | facebook/sam-vit-base | Trainable | 3 TwoWayBlock transformer layers (dim=256, heads=8). Consumes FPN features, projected CLIP embeddings, and learnable mask & IoU tokens. | Mask + IoU |
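As a sketch of the visual path in the table above, multi-scale layer selection and FPN-style fusion might look like the following. The layer indices, the 1×1-conv neck, and all tensor shapes are simplified stand-ins for illustration, not the repository's actual modules:

```python
import torch
import torch.nn as nn

# Stand-in for DINOv2 hidden states: embeddings + 12 transformer layers,
# 16x16 = 256 patch tokens plus a CLS token, dim 768 (dinov2-base).
hidden_states = [torch.randn(1, 257, 768) for _ in range(13)]

# Multi-scale selection as in the table: layers [3, 6, 9, 11]
# (exact indexing is an assumption about how the repo counts layers).
scales = [hidden_states[i][:, 1:, :] for i in (3, 6, 9, 11)]  # drop CLS

# A minimal FPN-neck stand-in: one 1x1 conv per scale down to dim 256,
# summed into a single fused feature map (the real neck also uses
# GroupNorm and GELU).
neck = nn.ModuleList(nn.Conv2d(768, 256, 1) for _ in scales)
fused = sum(
    conv(s.transpose(1, 2).reshape(1, 768, 16, 16))
    for conv, s in zip(neck, scales)
)
```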
Parameter Breakdown
| Feature | Parameters Count | Status |
|---|---|---|
| DINOv2 + CLIP | ~150M | Frozen |
| SAM Decoder | ~9.3M | Trainable |
| Total Pipeline | ~159.3M | Frozen + Trainable |
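The counts in the table can be reproduced with a generic helper that splits parameters by `requires_grad`; the toy backbone/decoder below is purely illustrative:

```python
import torch.nn as nn

def param_breakdown(model: nn.Module):
    """Count frozen vs. trainable parameters, as in the table above."""
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return frozen, trainable

# Toy example: freeze a "backbone" and leave a "decoder" trainable.
backbone = nn.Linear(768, 256)
for p in backbone.parameters():
    p.requires_grad = False
decoder = nn.Linear(256, 256)
model = nn.Sequential(backbone, decoder)

frozen, trainable = param_breakdown(model)
print(f"frozen={frozen}, trainable={trainable}")
```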
The model is trained on the moondream/refcoco-m dataset, featuring images and textual reference prompts.
- Training Samples: 15,138
- Validation Samples: 1,656
- Base Instances: 2,080
- Total Augmented Copies: 4,160
To get started with this project locally, clone the repository and configure your environment:

1. Clone the repository:

```bash
git clone https://github.com/halim-cv/PromptSeg-Lightweight.git
cd PromptSeg-Lightweight
```

2. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

3. Install dependencies. All required libraries (PyTorch, HuggingFace Transformers, SciPy, Albumentations, pycocotools, and more) install via the project's `setup.py`:

```bash
pip install -e .
```

Or install directly from `requirements.txt`:

```bash
pip install -r requirements.txt
```

Full dependency list: `torch`, `torchvision`, `transformers`, `datasets`, `tqdm`, `numpy`, `matplotlib`, `opencv-python`, `Pillow`, `scipy`, `albumentations`, `pycocotools`
You can trigger training with the provided command-line scripts, which manage the dataloaders automatically and save `prompt_seg_best.pt` and `prompt_seg_final.pt`.

1. Download the dataset

The trainer requires the moondream/refcoco-m dataset to be prepared locally. Run the following command to download and format the dataset into its folder:

```bash
python data/download_dataset.py
```

2. Start training

With the dataset ready, launch training directly:

```bash
python train.py --epochs 20 --batch_size 32
```

Training uses AdamW with a cosine-annealing learning-rate schedule and linear warmup:
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (betas=(0.9, 0.999), weight_decay=1e-2) |
| FPN LR | lr × 0.5 = 1e-4 |
| Decoder LR | lr = 2e-4 |
| LR Schedule | Cosine annealing with 2-epoch linear warmup |
| Batch Size | 32 |
| Epochs | 20 |
| Steps per Epoch | 473 |
| Grad Clip | Max norm 1.0 |
| Mixed Precision | AMP (torch.cuda.amp) |
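The optimizer and schedule in the table can be sketched as follows. The module names (`fpn`, `decoder`) and the exact warmup-cosine formula are assumptions about the setup, not code from `train.py`:

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters from the table above.
base_lr, epochs, steps_per_epoch, warmup_epochs = 2e-4, 20, 473, 2

# Stand-in modules; in the real pipeline these would be the FPN neck
# and the SAM-based decoder.
fpn = torch.nn.Linear(8, 8)
decoder = torch.nn.Linear(8, 8)

opt = AdamW(
    [
        {"params": fpn.parameters(), "lr": base_lr * 0.5},  # FPN LR: 1e-4
        {"params": decoder.parameters(), "lr": base_lr},    # Decoder LR: 2e-4
    ],
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)

total_steps = epochs * steps_per_epoch
warmup_steps = warmup_epochs * steps_per_epoch

def warmup_cosine(step):
    """LR multiplier: linear warmup for 2 epochs, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = LambdaLR(opt, lr_lambda=warmup_cosine)
```

In the training loop, `sched.step()` runs once per optimizer step, and gradients are clipped to max norm 1.0 under `torch.cuda.amp` autocast.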
Running inference on an arbitrary image with a text prompt requires tokenization and preprocessing that match the model's input dimensions; the provided `inference.py` script handles this:

```bash
# Assuming you have a saved checkpoint named `prompt_seg_best.pt`
python inference.py \
    --image path/to/your/image.jpg \
    --prompt "the prompt you want to predict" \
    --checkpoint prompt_seg_best.pt \
    --output output_mask.png
```

The script automatically formats the output, showing the original image, the predicted boolean threshold mask with its estimated internal IoU confidence, and a final overlay.
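The mask and overlay rendering described above might be post-processed roughly like this. This is a hypothetical sketch; `inference.py`'s actual threshold and color choices may differ:

```python
import numpy as np

def to_overlay(logits, image, alpha=0.5, color=(255, 0, 0)):
    """Sigmoid + 0.5 threshold on mask logits, then alpha-blend a color
    over the masked pixels of an HxWx3 uint8 image."""
    mask = 1.0 / (1.0 + np.exp(-logits)) > 0.5          # boolean mask
    overlay = image.copy().astype(np.float32)
    overlay[mask] = (1 - alpha) * overlay[mask] + alpha * np.array(color)
    return mask, overlay.astype(np.uint8)

# Tiny 2x2 example: strongly positive logits on the left column only.
logits = np.array([[5.0, -5.0], [5.0, -5.0]])
image = np.zeros((2, 2, 3), dtype=np.uint8)
mask, overlay = to_overlay(logits, image)
```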
The results show that the SAM-based mask decoder adapts well to DINOv2 visual features aligned with CLIP text embeddings.
Validation Results Pipeline
- Best Validation IoU: 0.4167
- Mean IoU: 0.4166 (Median: 0.3899)
- IoU @ 0.5: 36.71%
- Mean Precision: 0.4603
- Mean Recall: 0.6622
Analysis of the full 20-epoch training log reveals key learning dynamics:
| Epoch | Train Loss | Val Loss | Val IoU | IoU@0.5 |
|---|---|---|---|---|
| 1 | 0.9286 | 0.8479 | 0.2928 | 0.00% |
| 5 | 0.6594 | 0.7225 | 0.3650 | 3.85% |
| 7 | 0.6141 | 0.7019 | 0.4063 | 19.23% |
| 10 | 0.5746 | 0.6981 | 0.4128 | 19.23% |
| 13 | 0.5434 | 0.7041 | 0.4167 | 25.00% |
| 14 | 0.5319 | 0.7624 | 0.4044 | 26.92% |
| 20 | 0.4944 | 0.8595 | 0.3936 | 17.31% |
- Peak IoU (0.4167): Reached at Epoch 13 — the optimal balance point where the trainable SAM decoder most effectively aligns frozen DINOv2 spatial features with frozen CLIP text embeddings.
- Peak IoU@0.5 (26.92%): Reached at Epoch 14 — discrete boolean masks maintain strong geometric overlap even as continuous logit confidence begins to overfit.
- Validation Divergence: Validation loss hits its minimum at Epoch 10 (`0.6981`), then rises to `0.8595` by Epoch 20. This classic sign of decoder overfitting underscores the need for early stopping around Epoch 13–14 in production runs.
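The early-stopping recommendation can be sketched as a best-checkpoint tracker. The patience value and the IoU sequence below are illustrative (they loosely follow the table's shape) and not taken verbatim from the log:

```python
def run_with_early_stopping(val_ious, patience=3):
    """Track the best validation IoU and stop after `patience`
    consecutive epochs without improvement."""
    best_iou, best_epoch, bad_epochs = -1.0, -1, 0
    for epoch, iou in enumerate(val_ious, start=1):
        if iou > best_iou:
            best_iou, best_epoch, bad_epochs = iou, epoch, 0
            # torch.save(model.state_dict(), "prompt_seg_best.pt")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation has diverged; keep the best checkpoint
    return best_epoch, best_iou

# Illustrative IoU curve: rises, peaks, then declines as val loss diverges.
best_epoch, best_iou = run_with_early_stopping(
    [0.29, 0.36, 0.41, 0.4167, 0.40, 0.39, 0.38]
)
print(best_epoch, best_iou)
```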
Metric distributions across varying inputs:

- Figure: IoU Distribution
- Figure: Prompt Length vs IoU
Below are examples of successful prompt-guided predictions from our best cases.
Architecture Constraints Interpretation (Wins):
These successes highlight the synergy between CLIP's semantic embeddings and the SAM-based mask decoder's boundary refinement. Because DINOv2 provides dense, semantically rich patch-level features that the FPN neck aggregates smoothly, the TwoWayBlock transformer layers can map concepts across modalities. When a text prompt is unambiguous, the global text embedding reliably activates the correct spatial regions, and the dynamic projection of the mask token then outlines the object cleanly, exploiting SAM's inherent boundary awareness to snap to accurate edges even from low-resolution feature maps.
Examples where the model's spatial constraints or semantic ambiguity result in suboptimal adherence to the instance ground truth.
Architecture Constraints Interpretation (Losses): These erroneous predictions can frequently be traced back to structural limitations in the current architecture design:
- Resolution Bottleneck: The standard DINOv2 workflow processes images down to a highly compressed 16×16 abstract feature grid. Extremely small objects, fine details, or thin structures are often irrecoverably pooled or lost during interpolation and FPN smoothing before they ever reach the mask decoder.
- Global Text Pooling: Extracting only the `[CLS]` token from CLIP (`last_hidden_state[:, 0, :]`) collapses the entire positional phrasing of the prompt into a single global vector. This wipes out fine-grained syntactic relationships (such as spatial locators like "to the left of"), so the model struggles to disambiguate identical objects by their spatial relations alone.
- Trainable Capacity Gap: Because both foundation backbones are kept strictly frozen, the entire cross-modal mapping burden falls on the relatively small ~9.3M-parameter decoder. Resolving complex occlusions or contradictory features forces the `Progressive Upsampler` to "guess" from coarse token maps, frequently producing "blobby", unrefined segmentations.
While our lightweight PromptSeg architecture demonstrates impressive parameter efficiency and strong baseline performance, the freezing of both visual and textual foundational models inherently introduces some strict limitations.
- Resolution Constraints (DINOv2 `base`): The `facebook/dinov2-base` model processes images into coarse 16×16 patch embeddings. While excellent for global context and general semantic representation, this limits the model's ability to delineate detailed boundaries or identify very small objects.
- Global Prompt Collapse (CLIP `base`): Relying purely on the global `[CLS]` token from `openai/clip-vit-base-patch16` collapses the spatial and relational nuances of the text prompt, so the decoder struggles with relative positional prompts (e.g., "the cup to the left of the laptop" vs. "the laptop").
- Cross-Modal Mapping Capacity: Relying entirely on a ~9.3M-parameter SAM decoder to map rich DINOv2 visual features to abstract CLIP textual semantics creates a representational bottleneck.
To improve the model's accuracy, boundary precision, and multimodal reasoning, the following architectural and training directions are recommended:
- Upscaling the Backbones:
  - Upgrading from `dinov2-base` to `dinov2-large` or `dinov2-giant` would substantially enrich the semantic density of the visual embeddings.
  - Similarly, upgrading the text encoder (or integrating modern LLM text embeddings such as LLaMA) would improve nuanced relational prompt comprehension.
- Multi-Dataset Pretraining:
  - The model is currently trained exclusively on `refcoco-m`. Pretraining the decoder across a large, diverse combination of datasets (such as MS COCO, LVIS, and SA-1B) would substantially improve its zero-shot generalization and out-of-distribution robustness before task-specific fine-tuning.
- Unfreezing Textual/Visual Layers (PEFT/LoRA):
  - Fully freezing the backbones forces the decoder to do all the heavy lifting. Applying LoRA (Low-Rank Adaptation) to the deeper layers of both DINOv2 and CLIP would let the foundation models align with each other cross-modally, greatly reducing the decoder's mapping burden without a large increase in trainable parameters.
- Token-Level Text Fusion:
  - Instead of passing only the static `[CLS]` token, projecting and injecting the entire sequence of text tokens into the `TwoWayBlock` multi-head attention components would let the model build dense word-to-patch spatial interactions.
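The token-level fusion idea can be sketched as follows: project CLIP's full `last_hidden_state` and let image patches attend over every word token. All dimensions and module choices below are illustrative assumptions, not the current pipeline:

```python
import torch
import torch.nn as nn

# Sketch of token-level text fusion (proposed future work):
# project the full text token sequence instead of only the CLS token,
# so every word can interact with every image patch.
B, T, clip_dim, dim = 2, 77, 512, 256

text_tokens = torch.randn(B, T, clip_dim)   # CLIP last_hidden_state
proj = nn.Linear(clip_dim, dim)
text_seq = proj(text_tokens)                # (B, 77, 256)

image_feats = torch.randn(B, 16 * 16, dim)  # fused FPN features

# Word-to-patch cross-attention: each patch queries all text tokens.
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
fused, weights = attn(image_feats, text_seq, text_seq)
```

The attention weights have shape (batch, patches, words), exposing exactly the dense word-to-patch interactions that the single-vector `[CLS]` pathway discards.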
This project is licensed under the MIT License.
MIT License
Copyright (c) 2026 halim-cv
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.