One repository, three pillars — graph-centric learning, multimodal quality evaluation, and graph-aware creative generation (text & image).
This README consolidates the project guides into a single document with complete, runnable usage instructions.
- Overview
- Key Features
- Repository Layout
- Installation
- Data & Embedding Workflow
- Graph-Centric Tasks
- Multimodal Quality Evaluation (QE)
- Creative Generation
- Configuration (YAML) Examples
- Metrics
- Reproducibility & Logging
- Troubleshooting
- License
- Citation
UniMAG provides end-to-end pipelines for Multimodal Attributed Graphs (MAGs):
- Lightweight GNN backbones for graph-centric learning (node/edge/graph-level).
- Multimodal quality evaluation: matching, retrieval, and fine-grained alignment with graph-aware enhancement.
- Creative generation: graph → text, graph → image, and text + graph context → controllable image generation.
Outputs
- Modality-level: per-modality embeddings per node (e.g., image/text).
- Entity-level: fused node representation derived from its modalities and optionally its neighborhood.
- Standardized embedding conversion (factory + registry) and a unified loader for downstream tasks.
- Clean task separation with dedicated subpackages and YAML-driven experiments.
- Reproducible training/evaluation with seed control, consistent logging, and structured results.
- Built-in utilities for PPR sampling, subgraph construction, and common metrics.
```text
UniMAG/
├─ configs/
│  ├─ nc_gcn.yaml          # node classification (GCN)
│  ├─ lp_gat.yaml          # link prediction (GAT)
│  ├─ qe_matching.yaml     # modality matching
│  ├─ qe_retrieval.yaml    # modality retrieval
│  ├─ qe_alignment.yaml    # phrase-region alignment
│  ├─ g2text.yaml          # graph → text
│  ├─ g2i.yaml             # graph → image
│  └─ GT2Image.yaml        # text + graph context → image
├─ src/
│  ├─ data/
│  │  ├─ embedding_converter/   # raw → npy feature pipeline (factory + registry)
│  │  └─ embedding_manager.py   # unified feature/graph loader for downstream tasks
│  ├─ graph_centric/
│  │  ├─ train_nc.py
│  │  ├─ train_lp.py
│  │  └─ eval.py
│  ├─ multimodal_centric/
│  │  ├─ qe/
│  │  │  ├─ matching.py
│  │  │  ├─ retrieval.py
│  │  │  └─ alignment.py   # spaCy + GroundingDINO + RoI features
│  │  ├─ g2text/
│  │  │  ├─ decoder.py     # multimodal decoder → soft prompts
│  │  │  └─ infer.py
│  │  ├─ g2i/
│  │  │  ├─ unet.py
│  │  │  └─ pipeline.py
│  │  └─ gt2image/
│  │     ├─ train.py  test.py  infer_pipeline.py
│  │     ├─ GraphQFormer.py  GraphAdapter.py
│  │     └─ dataset.py     # PPR sampling, subgraph loading
│  └─ utils/
│     ├─ graph_samplers.py # PPR, neighbor search, subgraph assembly
│     └─ metrics.py        # Accuracy/F1/MRR/H@K/NDCG/mAP/Alignment
├─ data/
│  └─ <dataset_name>/      # images/, texts/, graphs/, splits/, labels/
├─ outputs/
│  └─ ckpts/  logs/  results/
└─ README.md
```
- Python: 3.10+ (recommended)
- Core: PyTorch, DGL or PyG (choose one)
- Multimodal/Generation: `transformers`, `accelerate`, `diffusers`, `torchvision`
- QE Alignment: `spacy` (with `en_core_web_sm`), `GroundingDINO`
- Utils: `numpy`, `scipy`, `pandas`, `tqdm`, `pyyaml`
```bash
conda create -n unimag python=3.10 -y
conda activate unimag
# core
pip install torch torchvision
pip install dgl # or: pip install torch-geometric
# multimodal + generation
pip install transformers accelerate diffusers
# QE alignment extras
pip install spacy
python -m spacy download en_core_web_sm
# utilities
pip install numpy scipy pandas tqdm pyyaml
```

Primary: https://huggingface.co/datasets/enjun-collab/MMAG
Additional MAG baselines:
Common dataset components
- `*-images.tar` / `*.tar.gz`: raw images (file name == node id)
- `*-raw-text.jsonl` / `*.csv`: `{id, text/title/description}`
- `node_mapping.pt`: raw IDs → graph indices
- `Graph.pt` / `nc_edges-nodeid.pt` / `*.pt`: graph structure
- `split.pt`, `labels*.pt`, `lp-edge-split.pt`: splits and labels
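Before conversion, it can help to inspect what a given release actually ships. The snippet below is a generic sketch (the `data/books-nc` root and the assumption that every `*.pt` file loads with plain `torch.load` are mine, not guaranteed by every release):

```python
# Minimal sketch: inspect whichever .pt components ship with a dataset release.
from pathlib import Path

import torch

root = Path("data/books-nc")  # hypothetical local dataset root

for path in sorted(root.glob("*.pt")):
    obj = torch.load(path, map_location="cpu")
    if isinstance(obj, dict):
        print(f"{path.name}: dict with keys {list(obj.keys())}")
    elif torch.is_tensor(obj):
        print(f"{path.name}: tensor of shape {tuple(obj.shape)}")
    else:
        print(f"{path.name}: {type(obj).__name__}")
```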
Convert raw MAG data to `.npy` feature matrices with a standard naming scheme and consistent dtype (float32).
- Factory + Registry: plug-and-play encoders (e.g., CLIP, SigLIP, BLIP; other stable variants); see the sketch below.
- Modalities: `text`, `image`, and optionally `multimodal` fused features.
- Naming: `{dataset_name}_{modality}_{encoder_name}_{dimension}d.npy`
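To illustrate the factory + registry idea (class names, signatures, and the registry structure here are illustrative, not the repository's actual API): encoders register themselves under a string key, and the factory resolves `--encoder clip-vit-b32` to a registered class.

```python
# Simplified sketch of an encoder factory + registry (names are illustrative).
from typing import Callable, Dict

ENCODER_REGISTRY: Dict[str, Callable[..., object]] = {}

def register_encoder(name: str):
    """Decorator that adds an encoder class to the registry under `name`."""
    def wrap(cls):
        ENCODER_REGISTRY[name] = cls
        return cls
    return wrap

@register_encoder("clip-vit-b32")
class ClipB32Encoder:
    modalities = ("text", "image")
    dim = 512  # CLIP ViT-B/32 embedding size

    def encode_text(self, texts):
        ...  # returns a [N, dim] float32 array

    def encode_image(self, images):
        ...  # returns a [N, dim] float32 array

def build_encoder(name: str):
    """Factory entry point: resolve a registered encoder by name."""
    if name not in ENCODER_REGISTRY:
        raise KeyError(f"Unknown encoder '{name}'. Available: {sorted(ENCODER_REGISTRY)}")
    return ENCODER_REGISTRY[name]()

encoder = build_encoder("clip-vit-b32")
```

Extracted features are then written as `{dataset_name}_{modality}_{encoder_name}_{dimension}d.npy` in float32.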
Examples
```bash
# Convert MAGB text CSV to unified JSONL (if needed)
python -m src.data.embedding_converter.utils.convert_magb_text_to_mmgraph \
--in data/MAGB/books-nc.csv \
--out data/mm-graph/books-nc-raw-text.jsonl
# Extract features with a registered encoder
python -m src.data.embedding_converter.run \
--dataset books-nc \
--modality text image \
--encoder clip-vit-b32 \
--outdir data/books-nc/features
```

Use `embedding_manager.py` in downstream code. It abstracts file paths and naming:
```python
from src.data.embedding_manager import load_node_features, load_graph_splits
x_text = load_node_features(dataset="books-nc", modality="text", encoder="clip-vit-b32") # [N, d_t]
x_image = load_node_features(dataset="books-nc", modality="image", encoder="clip-vit-b32") # [N, d_i]
splits = load_graph_splits(dataset="books-nc")  # dict: train/val/test indices
```

Lightweight GNNs for node/edge/graph objectives.
Backbones (from the consolidated guide)
`MLP`, `GCN`, `GAT`, `GraphSAGE`, `RevGAT`, `MMGCN`, `MGAT`
Losses
- Node classification: Cross-Entropy
- Link prediction: Binary Cross-Entropy (with negative sampling)
- Self-supervised: InfoNCE (optional)
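A minimal sketch of the link-prediction objective above, assuming a dot-product edge scorer over node embeddings `z` and uniform negative sampling (the repository's scorer and sampler may differ):

```python
import torch
import torch.nn.functional as F

def lp_bce_loss(z: torch.Tensor, pos_edges: torch.Tensor, num_neg: int = 1) -> torch.Tensor:
    """z: [N, d] node embeddings; pos_edges: [2, E] observed edges."""
    src, dst = pos_edges
    pos_logit = (z[src] * z[dst]).sum(dim=-1)                  # dot-product edge scorer
    # Uniformly sample negative destinations (rare collisions with true edges are ignored).
    neg_dst = torch.randint(0, z.size(0), (src.size(0) * num_neg,), device=z.device)
    neg_src = src.repeat_interleave(num_neg)
    neg_logit = (z[neg_src] * z[neg_dst]).sum(dim=-1)
    logits = torch.cat([pos_logit, neg_logit])
    labels = torch.cat([torch.ones_like(pos_logit), torch.zeros_like(neg_logit)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```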
```bash
python -m src.graph_centric.train_nc --config configs/nc_gcn.yaml
python -m src.graph_centric.eval --config configs/nc_gcn.yaml --ckpt outputs/ckpts/nc_gcn.pt
```
```bash
python -m src.graph_centric.train_lp --config configs/lp_gat.yaml
python -m src.graph_centric.eval --config configs/lp_gat.yaml --ckpt outputs/ckpts/lp_gat.pt
```
- Graph classification and community detection are supported via the same backbone design.
- Provide `graph_splits` and use appropriate pooling (mean/sum/max) prior to the classification layers.
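For graph-level objectives, a simple readout can look like the sketch below: mean-pool node embeddings per graph, given a `batch` vector assigning each node to its graph, before the classifier. This is a generic sketch, not the repository's exact pooling code.

```python
import torch

def mean_pool(node_emb: torch.Tensor, batch: torch.Tensor, num_graphs: int) -> torch.Tensor:
    """node_emb: [N, d]; batch: [N] graph id per node -> [num_graphs, d] graph embeddings."""
    out = torch.zeros(num_graphs, node_emb.size(1), device=node_emb.device)
    out.index_add_(0, batch, node_emb)                              # sum per graph
    counts = torch.bincount(batch, minlength=num_graphs).clamp(min=1).unsqueeze(1)
    return out / counts                                             # mean readout
```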
Metrics: Accuracy, F1-Macro, MRR, Hits@{1,10}.
Evaluate cross-modal embedding quality with or without graph context.
- Traditional: cosine / CLIP-like score between arbitrary image/text embeddings.
- Graph-aware: fetch node + neighbors via the embedding manager; aggregate with a small GNN; compute cosine on enhanced embeddings.
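A minimal sketch of the graph-aware matching score, assuming per-node image/text embedding matrices and a neighbor list from the embedding manager; a plain mean stands in for the small GNN aggregator:

```python
import torch
import torch.nn.functional as F

def graph_aware_match_score(node_id: int,
                            img_emb: torch.Tensor,   # [N, d] image embeddings
                            txt_emb: torch.Tensor,   # [N, d] text embeddings
                            neighbors: list[int]) -> float:
    """Cosine similarity between neighborhood-enhanced image and text embeddings."""
    idx = torch.tensor([node_id] + list(neighbors))
    img_enh = img_emb[idx].mean(dim=0)               # stand-in for a learned GNN aggregation
    txt_enh = txt_emb[idx].mean(dim=0)
    return F.cosine_similarity(img_enh, txt_enh, dim=0).item()
```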
```bash
python -m src.multimodal_centric.qe.matching --config configs/qe_matching.yaml
```
- Traditional: similarity matrix `query @ candidates.T` → rank.
- Graph-aware: enhance the query via 1-hop aggregation; rank against all other nodes.
```bash
python -m src.multimodal_centric.qe.retrieval --config configs/qe_retrieval.yaml
```
Training Tips (from the benchmark report)
- InfoNCE: pulls same-node cross-modal pairs together; pushes different nodes apart.
- Symmetric InfoNCE (retrieval): optimize text→image and image→text jointly; average the two losses for a bidirectionally aligned space.
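A compact sketch of symmetric InfoNCE over a batch of paired image/text embeddings (temperature and normalization choices here are illustrative defaults):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(img: torch.Tensor, txt: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """img, txt: [B, d] paired embeddings of the same nodes (row i matches row i)."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                     # [B, B] similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)               # average for a bidirectionally aligned space
```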
- Offline: build `(image, [(phrase, box), ...])` tuples using spaCy (noun-phrase extraction) + Grounding DINO (region proposals).
- Online: extract region features (e.g., RoIAlign on the feature map) + phrase embeddings → similarity per `(phrase, box)`.
- Graph-aware alignment uses GNN-enhanced feature maps.
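The online scoring step can be sketched as below: given RoI features for the proposed boxes and embeddings for the extracted phrases, score every `(phrase, box)` pair by cosine similarity and summarize with mean/max (projection heads and the actual extractors are omitted):

```python
import torch
import torch.nn.functional as F

def phrase_region_scores(phrase_emb: torch.Tensor,   # [P, d] one embedding per phrase
                         region_feat: torch.Tensor   # [R, d] RoI feature per proposed box
                         ) -> dict:
    """Cosine similarity for every (phrase, box) pair, plus mean/max summaries."""
    sim = F.normalize(phrase_emb, dim=-1) @ F.normalize(region_feat, dim=-1).t()  # [P, R]
    return {
        "pairwise": sim,                           # per (phrase, box) similarity
        "mean": sim.mean().item(),                 # mean alignment score
        "max_per_phrase": sim.max(dim=1).values,   # best-matching region per phrase
    }
```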
```bash
python -m src.multimodal_centric.qe.alignment --config configs/qe_alignment.yaml
```
Metrics
- Matching: score distribution / threshold-F1
- Retrieval: `Recall@K`, `mAP`, `NDCG`
- Alignment: mean/max phrase-region similarity, coverage
Use a light multimodal decoder to map MAG features to virtual tokens (soft prompts) that steer a frozen LLM.
Inputs
- Node entity-level embedding (concatenate modalities; e.g., 768×3 = 2304)
- Optional context embeddings (neighbors) and structure encoding (e.g., LPE)
Multimodal Decoder
- MLP + LayerNorm + Tanh to project concatenated features to the LLM hidden size
- Learnable positional encodings
- Outputs virtual tokens that are prepended to the LLM context
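A minimal sketch of such a decoder, assuming 2304-d concatenated entity-level features (768×3) projected into `num_virtual_tokens` soft prompts at the LLM hidden size; the real module may differ in details:

```python
import torch
import torch.nn as nn

class SoftPromptDecoder(nn.Module):
    """Map concatenated MAG features to a fixed number of virtual tokens (soft prompts)."""

    def __init__(self, in_dim: int = 2304, hidden_size: int = 4096, num_virtual_tokens: int = 16):
        super().__init__()
        self.num_virtual_tokens = num_virtual_tokens
        self.hidden_size = hidden_size
        self.linear = nn.Linear(in_dim, num_virtual_tokens * hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        # Learnable positional encoding, one vector per virtual token.
        self.pos = nn.Parameter(torch.zeros(num_virtual_tokens, hidden_size))
        self.dropout = nn.Dropout(0.1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: [B, in_dim] -> soft prompts [B, num_virtual_tokens, hidden_size]."""
        vtoks = self.linear(feats).view(-1, self.num_virtual_tokens, self.hidden_size)
        vtoks = torch.tanh(self.norm(vtoks))        # MLP + LayerNorm + Tanh projection
        return self.dropout(vtoks + self.pos)       # these are prepended to the LLM context
```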
LLM
- Local frozen model (e.g., Qwen2.5-VL or similar). No gradient on LLM.
Training
- Labels: use provided references; if missing, weak labels can be generated from raw graph-text.
- Loss: Cross-Entropy (decoder-only training)
- Optimizer: AdamW
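A hedged sketch of one decoder-only training step with a frozen Hugging Face causal LM: virtual tokens from the decoder sketch above are prepended at the embedding level via `inputs_embeds`, and the prompt positions are masked out of the cross-entropy with `-100` labels. The checkpoint name is a placeholder (a text-only Qwen2.5 variant stands in for brevity).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-7B-Instruct"      # placeholder; any local decoder-only LLM works similarly
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
llm = AutoModelForCausalLM.from_pretrained(name)
llm.requires_grad_(False)              # LLM stays frozen; only the decoder is trained

decoder = SoftPromptDecoder(hidden_size=llm.config.hidden_size)  # sketch defined above
opt = torch.optim.AdamW(decoder.parameters(), lr=1e-4, weight_decay=0.01)

def training_step(node_feats: torch.Tensor, references: list[str]) -> torch.Tensor:
    """One CE step: soft prompts prepended at the embedding level, loss on reference text only."""
    vtoks = decoder(node_feats)                                    # [B, T, H]
    enc = tok(references, return_tensors="pt", padding=True)
    tok_emb = llm.get_input_embeddings()(enc.input_ids)            # [B, L, H]
    inputs_embeds = torch.cat([vtoks.to(tok_emb.dtype), tok_emb], dim=1)
    attn = torch.cat([torch.ones(vtoks.shape[:2], dtype=torch.long), enc.attention_mask], dim=1)
    labels = torch.cat([torch.full(vtoks.shape[:2], -100, dtype=torch.long),
                        enc.input_ids.masked_fill(enc.attention_mask == 0, -100)], dim=1)
    loss = llm(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.detach()
```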
Inference
```python
with torch.no_grad():
    vtoks = decoder(node_and_context_emb)  # [num_virtual_tokens, hidden_size]
    text = frozen_llm.generate_with_soft_prompt(vtoks, prompt_template)
```
CLI
```bash
python -m src.multimodal_centric.g2text.train --config configs/g2text.yaml
python -m src.multimodal_centric.g2text.infer --config configs/g2text.yaml --ckpt outputs/ckpts/g2text.pt
```
Evaluation
- BLEU/ROUGE/BERTScore + human preference (coherence, faithfulness to graph evidence).
Feed precomputed embeddings into a diffusion U-Net to synthesize images.
Pipeline
- Map embeddings to initial conditioning in latent diffusion space.
- Inject conditioning via cross-attention or conditional concatenation during denoising.
- Decode latents to pixels (e.g., VAE decoder).
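One training step under the noise-reconstruction objective stated just below can be sketched as follows; the linear noise schedule and the `unet(noisy, t, cond)` interface are stand-ins for the repository's scheduler and U-Net:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, latents: torch.Tensor, cond: torch.Tensor,
                            num_steps: int = 1000) -> torch.Tensor:
    """One denoising step: predict the injected noise from (x_t, t, c)."""
    b = latents.size(0)
    t = torch.randint(0, num_steps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    # Simple linear alpha-bar schedule as a stand-in for the actual scheduler.
    alpha_bar = (1.0 - (t.float() + 1) / num_steps).view(b, 1, 1, 1)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    pred = unet(noisy, t, cond)                      # epsilon_theta(x_t, t, c)
    return F.mse_loss(pred, noise)                   # noise-reconstruction MSE
```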
Training Objective (noise reconstruction MSE)

$$
\mathbb{E}_{t,\mathbf{x},\epsilon}\left[\lVert \epsilon - \epsilon_\theta(\mathbf{x}_t, t, c)\rVert^2\right]
$$

CLI
```bash
python -m src.multimodal_centric.g2i.train --config configs/g2i.yaml
python -m src.multimodal_centric.g2i.infer --config configs/g2i.yaml --ckpt outputs/ckpts/g2i.pt
```
Evaluation
- FID/KID, CLIP score; human study on semantic faithfulness.
Generate or reconstruct the target node image using a text description and graph context (neighbor images).
Three Stages
- Informative Neighbor Sampling
- PPR for structural importance
- Re-rank neighbors by semantic similarity between neighbor images and the target text
- Graph Encoding
- Encode selected neighbors with Graph-QFormer (self-attention among neighbors)
- Optional GraphAdapter for feature alignment
- Controllable Generation
- Diffusion-based synthesis conditioned on graph prompt and target text
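Stage 1 can be sketched as follows: pre-select the top-PPR neighbors for structural importance, then re-rank them by cosine similarity between neighbor image embeddings and the target text embedding (PPR scores are assumed to come from `src/utils/graph_samplers.py` or any equivalent sampler):

```python
import torch
import torch.nn.functional as F

def select_informative_neighbors(ppr_scores: torch.Tensor,    # [N] PPR score w.r.t. the target node
                                 neigh_img_emb: torch.Tensor, # [N, d] neighbor image embeddings
                                 target_txt_emb: torch.Tensor,# [d] target text embedding
                                 k_struct: int = 64,
                                 k_final: int = 16) -> torch.Tensor:
    """Top-k PPR pre-selection, then semantic re-ranking against the target text."""
    struct_idx = ppr_scores.topk(min(k_struct, ppr_scores.numel())).indices
    sims = F.cosine_similarity(neigh_img_emb[struct_idx], target_txt_emb.unsqueeze(0), dim=-1)
    order = sims.argsort(descending=True)[:k_final]
    return struct_idx[order]                                   # ids of the selected neighbors
```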
Training Loss (supervised denoising)
Let the latent $z \sim \mathrm{Enc}(x)$ of a real image be noised to $z_t$ with noise $\varepsilon \sim \mathcal{N}(0, I)$; the model is trained to recover the noise:

$$
\mathcal{L} = \mathbb{E}_{z,\varepsilon,t}\left[\lVert \varepsilon - \varepsilon_\theta(z_t, t, h(c_T, c_G)) \rVert^2\right],
$$

where $h(c_T, c_G)$ fuses the text and graph conditions.
Structure & CLI
```text
src/multimodal_centric/gt2image/
  train.py  test.py  infer_pipeline.py
  GraphQFormer.py  GraphAdapter.py
  dataset.py
```
```bash
python -m src.multimodal_centric.gt2image.train --config configs/GT2Image.yaml
python -m src.multimodal_centric.gt2image.test --config configs/GT2Image.yaml --ckpt outputs/ckpts/gt2image.pt
python -m src.multimodal_centric.gt2image.infer_pipeline --config configs/GT2Image.yaml --ckpt outputs/ckpts/gt2image.pt
```
Evaluation
- FID/KID, CLIP-text consistency, retrieval consistency vs. neighbor context.
```yaml
# configs/nc_gcn.yaml
task: node_classification    # link_prediction, matching, retrieval, alignment, g2text, g2image, gt2image
dataset:
  name: books-nc
  root: ./data/books-nc
  modalities: [image, text]
features:
  text_encoder: clip-vit-b32
  image_encoder: clip-vit-b32
model:
  backbone: gcn
  hidden_dim: 256
  num_layers: 2
  num_heads: 4               # if applicable (e.g., GAT)
train:
  batch_size: 256
  epochs: 100
  lr: 3.0e-4
  seed: 42
eval:
  metrics: [accuracy, f1_macro]
  topk: [1, 5, 10]
device: cuda
log_dir: ./outputs/logs
save_dir: ./outputs/ckpts
```

```yaml
# configs/qe_retrieval.yaml
task: retrieval
dataset:
  name: books-nc
  root: ./data/books-nc
features:
  text_encoder: clip-vit-b32
  image_encoder: clip-vit-b32
qe:
  use_graph_context: true    # enable MAG-specific enhancement
  loss: symmetric_infonce    # two-way alignment
train:
  batch_size: 512
  epochs: 20
  lr: 2.0e-4
eval:
  metrics: [recall@1, recall@5, recall@10, map, ndcg]
device: cuda
```

```yaml
# configs/g2text.yaml
task: g2text
dataset:
  name: Movies
  root: ./data/Movies
features:
  text_encoder: clip-vit-b32
  image_encoder: clip-vit-b32
model:
  decoder:
    num_virtual_tokens: 16
    hidden_size: 4096        # match LLM hidden dim
    dropout: 0.1
  llm:
    name_or_path: Qwen2.5-VL-7B
    freeze: true
train:
  batch_size: 8
  epochs: 5
  lr: 1.0e-4
  weight_decay: 0.01
eval:
  metrics: [bleu, rouge, bertscore]
device: cuda
```

```yaml
# configs/GT2Image.yaml
task: gt2image
dataset:
  name: Reddit-M
  root: ./data/Reddit-M
features:
  image_encoder: clip-vit-b32
  text_encoder: clip-vit-b32
sampling:
  ppr_alpha: 0.15
  topk_neighbors: 16
graph_encoding:
  qformer_layers: 4
  adapter: true
gen:
  diffusion_steps: 50
  guidance_scale: 7.5
train:
  batch_size: 2
  epochs: 10
  lr: 1.0e-4
eval:
  metrics: [fid, kid, clip_score]
device: cuda
```

| Family | Tasks | Metrics |
|---|---|---|
| Graph-centric | NC / LP / Graph cls / Community | Accuracy, F1-Macro, MRR, Hits@K |
| QE — Matching | Image↔Text score | cosine/CLIP-like, threshold-F1 |
| QE — Retrieval | Text→Image / Image→Text | Recall@K, mAP, NDCG |
| QE — Alignment | Phrase↔Region (fine-grained) | mean/max similarity, coverage |
| Generation | G2Text / G2Image / GT2Image | BLEU/ROUGE/BERTScore, FID/KID, CLIP score |
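For reference, the ranking metrics can be computed from the 1-based ranks of the positive items as in the sketch below; this mirrors the standard definitions, not necessarily the exact code in `src/utils/metrics.py`:

```python
import torch

def mrr(ranks: torch.Tensor) -> float:
    """Mean reciprocal rank; `ranks` holds the 1-based rank of each positive item."""
    return (1.0 / ranks.float()).mean().item()

def hits_at_k(ranks: torch.Tensor, k: int) -> float:
    """Fraction of queries whose positive item lands in the top-k."""
    return (ranks <= k).float().mean().item()

# Example: ranks of the true items for five queries.
ranks = torch.tensor([1, 3, 2, 10, 1])
print(mrr(ranks), hits_at_k(ranks, 1), hits_at_k(ranks, 10))
```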
- Pin random seeds in YAML and data loaders (see the helper sketch after this list).
- Save `{config, ckpt, metrics.json}` under `outputs/` per run.
- For retrieval, report mean ± std across repeated runs.
- For alignment, record exact spaCy/GroundingDINO versions.
- Keep encoder versions fixed for stability (start with CLIP/SigLIP/BLIP families).
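A small helper along these lines pins the seeds mentioned above; it is a generic sketch, not the repository's utility:

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Pin Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Optional: trade speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```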
- Slow convergence / weak performance
  - Graph tasks: check normalization, depth (over-smoothing), and negative sampling.
  - QE: ensure symmetric InfoNCE and sufficient negatives; validate neighbor enhancement.
  - Generation: tune diffusion steps/lr; adjust the number and dimension of virtual tokens (G2Text).
- Out-of-memory (OOM)
  - Reduce batch size and diffusion steps; cap the neighbor count in Graph-QFormer; use subgraph sampling.
- Unstable encoders
  - Prefer robust encoders first; some very large multimodal encoders may be unstable on limited VRAM.
Released under a permissive open-source license.
If you find this repository useful, please consider citing our papers:
```bibtex
@article{du2025graphmaster,
  title={GraphMaster: Automated Graph Synthesis via LLM Agents in Data-Limited Environments},
  author={Du, Enjun and Li, Xunkai and Jin, Tian and Zhang, Zhihan and Li, Rong-Hua and Wang, Guoren},
  journal={arXiv preprint arXiv:2504.00711},
  year={2025},
  url={https://arxiv.org/abs/2504.00711}
}

@article{du2025mokgr,
  title={Mixture of Length and Pruning Experts for Knowledge Graphs Reasoning},
  author={Du, Enjun and Liu, Siyi and Zhang, Yongqi},
  journal={arXiv preprint arXiv:2507.20498},
  year={2025},
  url={https://arxiv.org/abs/2507.20498}
}
```