Welcome to the Prompt-Guided Image Segmentation (PSEG) repository! PSEG is an end-to-end PyTorch implementation of a multimodal segmentation framework. By combining state-of-the-art vision and language foundation models, it moves beyond fixed-vocabulary segmentation: given an image and a free-text natural language prompt, it produces a precise pixel-level mask for the described object, handling open-vocabulary instances and resolving semantic ambiguities.
At the core of PSEG is a unified architecture that pairs large, frozen pre-trained backbones with a lightweight, trainable segmentation head. Rather than fine-tuning the large models, it fuses dense patch-level semantic features from DINOv2 with global text embeddings from CLIP, and feeds both into a SAM-initialized (Segment Anything Model) mask decoder. This keeps the trainable parameter count very low while preserving fine-grained spatial understanding and robust cross-modal grounding, yielding strong segmentation performance with reduced overhead and faster convergence.
```mermaid
graph TD
A[Image Input] --> B
C[Text Prompt] --> D
subgraph "Visual Backbone (Frozen)"
B[DINOv2 facebook/dinov2-base<br>Dim: 768, Resolution: 16x16]
B -- "Multi-scale features<br>(Layers: 3, 6, 9, 11)" --> E[FPN Neck]
E -- "Fused Visual Features<br>Dim: 256" --> G
end
subgraph "Text Backbone (Frozen)"
D[CLIP openai/clip-vit-base-patch16<br>Max Length: 77]
D -- "L2 Norm / CLS Token<br>Dim: 512" --> F[Text Projection Linear]
F -- "Text Embedding<br>Dim: 256" --> G
end
subgraph "Mask Decoder (Trainable)"
H[Mask Token] --> G
I[IoU Token] --> G
G[Two-Way Transformer Blocks<br>Init: SAM-Base, Depth: 3, Heads: 8]
G -- "Mask Tokens &<br>Image Features" --> J[Dynamic Projection &<br>Progressive Upsampler]
G -- "IoU Token" --> K[IoU Prediction Head]
end
J --> L[Segmentation Mask]
K --> M[Predicted IoU Score]
```
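To make the diagram concrete, here is a minimal, self-contained sketch of how the fused visual features, the projected text embedding, and the learnable mask/IoU tokens could meet in a SAM-style two-way attention block. All module and variable names here are hypothetical stand-ins, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

# Hypothetical shapes matching the diagram: 16x16 patch grid, dim 256.
B, H, W, D = 2, 16, 16, 256

class TwoWayBlockSketch(nn.Module):
    """Minimal stand-in for one SAM-style two-way attention block:
    tokens attend to image features, then image features attend back."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, image):
        tokens = tokens + self.t2i(tokens, image, image)[0]
        image = image + self.i2t(image, tokens, tokens)[0]
        return tokens, image

# Fused FPN features (B, H*W, 256), one projected text embedding (B, 1, 256),
# and learnable mask & IoU tokens, as in the diagram.
image_feats = torch.randn(B, H * W, D)
text_emb = torch.randn(B, 1, D)
mask_tok = torch.zeros(B, 1, D)
iou_tok = torch.zeros(B, 1, D)

tokens = torch.cat([iou_tok, mask_tok, text_emb], dim=1)  # (B, 3, 256)
block = TwoWayBlockSketch()
tokens, image_feats = block(tokens, image_feats)
```

After the two-way blocks, the mask token is projected against the upsampled image features to produce the mask, and the IoU token goes to the IoU prediction head.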
| Model Component | Base Model | State | Extraction / Details | Output Dim |
|---|---|---|---|---|
| Visual Encoding | facebook/dinov2-base | Frozen | Patch size 14 (16×16 grid). Multi-scale hidden states from layers [3, 6, 9, 11]; FPN neck fusion via convolutions, GroupNorm, and GELU. | 256 |
| Text Encoding | openai/clip-vit-base-patch16 | Frozen | CLS-token projection from the last hidden state, followed by L2 normalization. | 256 (projected) |
| Mask Decoder | facebook/sam-vit-base | Trainable | 3 TwoWayBlock transformer layers (dim=256, heads=8). Consumes FPN features, projected CLIP embeddings, and learnable mask & IoU tokens. | Mask + IoU |
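As a sketch of the visual path in the table above, multi-scale layer selection and FPN-style fusion might look like the following. The layer indices, the 1×1-conv neck, and all tensor shapes are simplified stand-ins for illustration, not the repository's actual modules:

```python
import torch
import torch.nn as nn

# Stand-in for DINOv2 hidden states: embeddings + 12 transformer layers,
# 16x16 = 256 patch tokens plus a CLS token, dim 768 (dinov2-base).
hidden_states = [torch.randn(1, 257, 768) for _ in range(13)]

# Multi-scale selection as in the table: layers [3, 6, 9, 11]
# (exact indexing is an assumption about how the repo counts layers).
scales = [hidden_states[i][:, 1:, :] for i in (3, 6, 9, 11)]  # drop CLS

# A minimal FPN-neck stand-in: one 1x1 conv per scale down to dim 256,
# summed into a single fused feature map (the real neck also uses
# GroupNorm and GELU).
neck = nn.ModuleList(nn.Conv2d(768, 256, 1) for _ in scales)
fused = sum(
    conv(s.transpose(1, 2).reshape(1, 768, 16, 16))
    for conv, s in zip(neck, scales)
)
```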
Parameter Breakdown
| Feature | Parameters Count | Status |
|---|---|---|
| DINOv2 + CLIP | ~150M | Frozen |
| SAM Decoder | ~9.3M | Trainable |
| Total Pipeline | ~159.3M | Frozen + Trainable |
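The counts in the table can be reproduced with a generic helper that splits parameters by `requires_grad`; the toy backbone/decoder below is purely illustrative:

```python
import torch.nn as nn

def param_breakdown(model: nn.Module):
    """Count frozen vs. trainable parameters, as in the table above."""
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return frozen, trainable

# Toy example: freeze a "backbone" and leave a "decoder" trainable.
backbone = nn.Linear(768, 256)
for p in backbone.parameters():
    p.requires_grad = False
decoder = nn.Linear(256, 256)
model = nn.Sequential(backbone, decoder)

frozen, trainable = param_breakdown(model)
print(f"frozen={frozen}, trainable={trainable}")
```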
The model is trained on the moondream/refcoco-m dataset, featuring images and textual reference prompts.
- Training Samples: 15,138
- Validation Samples: 1,656
- Base Instances: 2,080
- Total Augmented Copies: 4,160
To get started with this project locally, clone the repository and configure your environment:

1. Clone the repository:

```bash
git clone https://github.com/halim-cv/PromptSeg-Lightweight.git
cd PromptSeg-Lightweight
```

2. Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```

3. Install dependencies. All required libraries (PyTorch, HuggingFace Transformers, SciPy, Albumentations, pycocotools, and more) install via the project's `setup.py`:

```bash
pip install -e .
```

Or install directly from `requirements.txt`:

```bash
pip install -r requirements.txt
```

Full dependency list: `torch`, `torchvision`, `transformers`, `datasets`, `tqdm`, `numpy`, `matplotlib`, `opencv-python`, `Pillow`, `scipy`, `albumentations`, `pycocotools`
You can trigger training with the provided command-line scripts, which manage the dataloaders automatically and save `prompt_seg_best.pt` and `prompt_seg_final.pt`.

1. Download the dataset

The trainer requires the moondream/refcoco-m dataset to be prepared locally. Run the following command to download and format the dataset into its folder:

```bash
python data/download_dataset.py
```

2. Start training

With the dataset ready, launch training directly:

```bash
python train.py --epochs 20 --batch_size 32
```

Training uses AdamW with a cosine-annealing learning-rate schedule and linear warmup:
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (betas=(0.9, 0.999), weight_decay=1e-2) |
| FPN LR | lr × 0.5 = 1e-4 |
| Decoder LR | lr = 2e-4 |
| LR Schedule | Cosine annealing with 2-epoch linear warmup |
| Batch Size | 32 |
| Epochs | 20 |
| Steps per Epoch | 473 |
| Grad Clip | Max norm 1.0 |
| Mixed Precision | AMP (torch.cuda.amp) |
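The optimizer and schedule in the table can be sketched as follows. The module names (`fpn`, `decoder`) and the exact warmup-cosine formula are assumptions about the setup, not code from `train.py`:

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters from the table above.
base_lr, epochs, steps_per_epoch, warmup_epochs = 2e-4, 20, 473, 2

# Stand-in modules; in the real pipeline these would be the FPN neck
# and the SAM-based decoder.
fpn = torch.nn.Linear(8, 8)
decoder = torch.nn.Linear(8, 8)

opt = AdamW(
    [
        {"params": fpn.parameters(), "lr": base_lr * 0.5},  # FPN LR: 1e-4
        {"params": decoder.parameters(), "lr": base_lr},    # Decoder LR: 2e-4
    ],
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)

total_steps = epochs * steps_per_epoch
warmup_steps = warmup_epochs * steps_per_epoch

def warmup_cosine(step):
    """LR multiplier: linear warmup for 2 epochs, then cosine decay to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = LambdaLR(opt, lr_lambda=warmup_cosine)
```

In the training loop, `sched.step()` runs once per optimizer step, and gradients are clipped to max norm 1.0 under `torch.cuda.amp` autocast.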
Running inference on an arbitrary image with a text prompt requires tokenization and preprocessing that match the model's input dimensions; the provided `inference.py` script handles this:

```bash
# Assuming you have a saved checkpoint named `prompt_seg_best.pt`
python inference.py \
    --image path/to/your/image.jpg \
    --prompt "the prompt you want to predict" \
    --checkpoint prompt_seg_best.pt \
    --output output_mask.png
```

The script automatically formats the output, showing the original image, the predicted boolean threshold mask with its estimated internal IoU confidence, and a final overlay.
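The mask and overlay rendering described above might be post-processed roughly like this. This is a hypothetical sketch; `inference.py`'s actual threshold and color choices may differ:

```python
import numpy as np

def to_overlay(logits, image, alpha=0.5, color=(255, 0, 0)):
    """Sigmoid + 0.5 threshold on mask logits, then alpha-blend a color
    over the masked pixels of an HxWx3 uint8 image."""
    mask = 1.0 / (1.0 + np.exp(-logits)) > 0.5          # boolean mask
    overlay = image.copy().astype(np.float32)
    overlay[mask] = (1 - alpha) * overlay[mask] + alpha * np.array(color)
    return mask, overlay.astype(np.uint8)

# Tiny 2x2 example: strongly positive logits on the left column only.
logits = np.array([[5.0, -5.0], [5.0, -5.0]])
image = np.zeros((2, 2, 3), dtype=np.uint8)
mask, overlay = to_overlay(logits, image)
```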
The results show that the SAM-based mask decoder adapts well to DINOv2 visual features aligned with CLIP text embeddings.
Validation Results Pipeline
- Best Validation IoU: 0.4167
- Mean IoU: 0.4166 (Median: 0.3899)
- IoU @ 0.5: 36.71%
- Mean Precision: 0.4603
- Mean Recall: 0.6622
Analysis of the full 20-epoch training log reveals key learning dynamics:
| Epoch | Train Loss | Val Loss | Val IoU | IoU@0.5 |
|---|---|---|---|---|
| 1 | 0.9286 | 0.8479 | 0.2928 | 0.00% |
| 5 | 0.6594 | 0.7225 | 0.3650 | 3.85% |
| 7 | 0.6141 | 0.7019 | 0.4063 | 19.23% |
| 10 | 0.5746 | 0.6981 | 0.4128 | 19.23% |
| 13 | 0.5434 | 0.7041 | 0.4167 | 25.00% |
| 14 | 0.5319 | 0.7624 | 0.4044 | 26.92% |
| 20 | 0.4944 | 0.8595 | 0.3936 | 17.31% |
- Peak IoU (0.4167): Reached at Epoch 13 — the optimal balance point where the trainable SAM decoder most effectively aligns frozen DINOv2 spatial features with frozen CLIP text embeddings.
- Peak IoU@0.5 (26.92%): Reached at Epoch 14 — discrete boolean masks maintain strong geometric overlap even as continuous logit confidence begins to overfit.
- Validation Divergence: Validation loss hits its minimum at Epoch 10 (`0.6981`), then rises to `0.8595` by Epoch 20. This classic sign of decoder overfitting underscores the need for early stopping around Epoch 13–14 in production runs.
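The early-stopping recommendation can be sketched as a best-checkpoint tracker. The patience value and the IoU sequence below are illustrative (they loosely follow the table's shape) and not taken verbatim from the log:

```python
def run_with_early_stopping(val_ious, patience=3):
    """Track the best validation IoU and stop after `patience`
    consecutive epochs without improvement."""
    best_iou, best_epoch, bad_epochs = -1.0, -1, 0
    for epoch, iou in enumerate(val_ious, start=1):
        if iou > best_iou:
            best_iou, best_epoch, bad_epochs = iou, epoch, 0
            # torch.save(model.state_dict(), "prompt_seg_best.pt")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation has diverged; keep the best checkpoint
    return best_epoch, best_iou

# Illustrative IoU curve: rises, peaks, then declines as val loss diverges.
best_epoch, best_iou = run_with_early_stopping(
    [0.29, 0.36, 0.41, 0.4167, 0.40, 0.39, 0.38]
)
print(best_epoch, best_iou)
```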
Metric distributions across varying inputs:

- Figure: IoU Distribution
- Figure: Prompt Length vs IoU
Below are examples of successful prompt-guided predictions from our best cases.
Architecture Constraints Interpretation (Wins):
These successes highlight the synergy between CLIP's semantic embeddings and the SAM-based mask decoder's boundary refinement. Because DINOv2 provides dense, semantically rich patch-level features that the FPN neck aggregates smoothly, the TwoWayBlock transformer layers can map concepts across modalities. When a text prompt is unambiguous, the global text embedding reliably activates the correct spatial regions, and the dynamic projection of the mask token then outlines the object cleanly, exploiting SAM's inherent boundary awareness to snap to accurate edges even from low-resolution feature maps.
Examples where the model's spatial constraints or semantic ambiguity result in suboptimal adherence to the instance ground truth.
Architecture Constraints Interpretation (Losses): These erroneous predictions can frequently be traced back to structural limitations in the current architecture design:
- Resolution Bottleneck: The standard DINOv2 workflow processes images down to a highly compressed 16×16 abstract feature grid. Extremely small objects, fine details, or thin structures are often irrecoverably pooled or lost during interpolation and FPN smoothing before they ever reach the mask decoder.
- Global Text Pooling: Extracting only the `[CLS]` token from CLIP (`last_hidden_state[:, 0, :]`) collapses the entire positional phrasing of the prompt into a single global vector. This wipes out fine-grained syntactic relationships (such as spatial locators like "to the left of"), so the model struggles to disambiguate identical objects by their spatial relations alone.
- Trainable Capacity Gap: Because both foundation backbones are kept strictly frozen, the entire cross-modal mapping burden falls on the relatively small ~9.3M-parameter decoder. Resolving complex occlusions or contradictory features forces the `Progressive Upsampler` to "guess" from coarse token maps, frequently producing "blobby", unrefined segmentations.
While our lightweight PromptSeg architecture demonstrates impressive parameter efficiency and strong baseline performance, the freezing of both visual and textual foundational models inherently introduces some strict limitations.
- Resolution Constraints (DINOv2 `base`): The `facebook/dinov2-base` model processes images into coarse 16×16 patch embeddings. While excellent for global context and general semantic representation, this limits the model's ability to delineate detailed boundaries or identify very small objects.
- Global Prompt Collapse (CLIP `base`): Relying purely on the global `[CLS]` token from `openai/clip-vit-base-patch16` collapses the spatial and relational nuances of the text prompt, so the decoder struggles with relative positional prompts (e.g., "the cup to the left of the laptop" vs. "the laptop").
- Cross-Modal Mapping Capacity: Relying entirely on a ~9.3M-parameter SAM decoder to map rich DINOv2 visual features to abstract CLIP textual semantics creates a representational bottleneck.
To improve the model's accuracy, boundary precision, and multimodal reasoning, the following architectural and training directions are recommended:
- Upscaling the Backbones:
  - Upgrading from `dinov2-base` to `dinov2-large` or `dinov2-giant` would substantially enrich the semantic density of the visual embeddings.
  - Similarly, upgrading the text encoder (or integrating modern LLM text embeddings such as LLaMA) would improve nuanced relational prompt comprehension.
- Multi-Dataset Pretraining:
  - The model is currently trained exclusively on `refcoco-m`. Pretraining the decoder across a large, diverse combination of datasets (such as MS COCO, LVIS, and SA-1B) would substantially improve its zero-shot generalization and out-of-distribution robustness before task-specific fine-tuning.
- Unfreezing Textual/Visual Layers (PEFT/LoRA):
  - Fully freezing the backbones forces the decoder to do all the heavy lifting. Applying LoRA (Low-Rank Adaptation) to the deeper layers of both DINOv2 and CLIP would let the foundation models align with each other cross-modally, greatly reducing the decoder's mapping burden without a large increase in trainable parameters.
- Token-Level Text Fusion:
  - Instead of passing only the static `[CLS]` token, projecting and injecting the entire sequence of text tokens into the `TwoWayBlock` multi-head attention components would let the model build dense word-to-patch spatial interactions.
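The token-level fusion idea can be sketched as follows: project CLIP's full `last_hidden_state` and let image patches attend over every word token. All dimensions and module choices below are illustrative assumptions, not the current pipeline:

```python
import torch
import torch.nn as nn

# Sketch of token-level text fusion (proposed future work):
# project the full text token sequence instead of only the CLS token,
# so every word can interact with every image patch.
B, T, clip_dim, dim = 2, 77, 512, 256

text_tokens = torch.randn(B, T, clip_dim)   # CLIP last_hidden_state
proj = nn.Linear(clip_dim, dim)
text_seq = proj(text_tokens)                # (B, 77, 256)

image_feats = torch.randn(B, 16 * 16, dim)  # fused FPN features

# Word-to-patch cross-attention: each patch queries all text tokens.
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
fused, weights = attn(image_feats, text_seq, text_seq)
```

The attention weights have shape (batch, patches, words), exposing exactly the dense word-to-patch interactions that the single-vector `[CLS]` pathway discards.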
This project is licensed under the MIT License.
MIT License
Copyright (c) 2026 halim-cv
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.