199 changes: 199 additions & 0 deletions contrib/models/LongCat-Image-Edit/README.md
@@ -0,0 +1,199 @@
# Contrib Model: LongCat-Image-Edit

NeuronX adaptation of [meituan-longcat/LongCat-Image-Edit](https://huggingface.co/meituan-longcat/LongCat-Image-Edit) for AWS Trainium2 inference.

## Model Information

- **HuggingFace ID:** `meituan-longcat/LongCat-Image-Edit`
- **Model Type:** FLUX-style diffusion model for image editing
- **Architecture:** Multi-component (Vision Encoder + Language Model + FLUX Transformer + VAE)
- **License:** See the HuggingFace model card

## Architecture Details

LongCat-Image-Edit is a FLUX-style image editing model with the following components:

| Component | Model | Neuron Parallelism |
|-----------|-------|-------------------|
| Vision Encoder | Qwen2.5-VL ViT (32 blocks) | TP=4, float32 |
| Language Model | Qwen2.5-VL LM (28 layers) | TP=4, world_size=8 |
| Transformer (CP) | LongCatImageTransformer2DModel (10 dual + 20 single stream) | TP=4, CP=2, world_size=8 |
| Transformer (CFG) | LongCatImageTransformer2DModel (10 dual + 20 single stream) | TP=4, DP=2, world_size=8, batch=2 |
| VAE | 2D AutoencoderKL | Single device (1024x1024, no tiling) |

Key parameters:
- **Attention Heads:** 24, head_dim=128, inner_dim=3072
- **Text Hidden Size:** 3584 (Qwen2.5-VL)
- **In Channels:** 64 (packed latents)
- **Dual-stream blocks:** 10 (separate text/image norms+FFN, joint attention)
- **Single-stream blocks:** 20 (concatenated text+image, parallel MLP+attention)
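The `In Channels: 64` figure follows from FLUX-style latent packing: assuming the standard FLUX VAE (16 latent channels, 8x spatial downsampling — consistent with the "2D AutoencoderKL" above, though the exact channel count is an assumption here), each 2x2 patch of latent pixels is folded into one transformer token. A minimal sketch of the arithmetic:

```python
def packed_channels(latent_channels: int, patch: int = 2) -> int:
    """FLUX-style packing folds a (C, H, W) latent into
    ((H//patch) * (W//patch), C * patch * patch) tokens."""
    return latent_channels * patch * patch

def packed_tokens(height: int, width: int, vae_scale: int = 8, patch: int = 2) -> int:
    """Number of image tokens the transformer sees for a given output size."""
    latent_h, latent_w = height // vae_scale, width // vae_scale
    return (latent_h // patch) * (latent_w // patch)

print(packed_channels(16))        # 64, matching in_channels above
print(packed_tokens(1024, 1024))  # 4096 image tokens per 1024x1024 image
```

The 4096-token count per image is also why sequence-dimension parallelism (CP, below) is worth considering at this resolution.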

## Performance

| Machine | Config | Total Time | Per Step | Quality |
|---------|--------|------------|----------|---------|
| **Trn2** (trn2.48xlarge) | All Neuron, **CFG Parallel** | **18.17s** | 0.36s | Good |
| **Trn2** (trn2.48xlarge) | All Neuron, Context Parallel | 22.39s | 0.45s | Good |
| **H100** (single GPU, bf16) | Full GPU | 23.61s | 0.47s | Reference |

Test: 1024x1024 output, guidance_scale=4.5, 50 steps.
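For context, `guidance_scale` enters through the standard classifier-free guidance combination (the usual diffusers formula, shown here as a plain-Python sketch over per-element noise predictions):

```python
def cfg_combine(noise_neg, noise_pos, guidance_scale):
    """Classifier-free guidance: extrapolate from the negative
    (unconditional) prediction toward the positive one."""
    return [n + guidance_scale * (p - n) for n, p in zip(noise_neg, noise_pos)]

# guidance_scale = 1.0 reduces to the positive prediction alone,
# which is why the negative pass can be skipped entirely in that case.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 4.5))  # [4.5, 10.0]
```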

## CFG Parallel vs Context Parallel

Both modes use TP=4, world_size=8 on the same hardware:

| Aspect | Context Parallel (CP) | CFG Parallel |
|--------|----------------------|--------------|
| Scatter dimension | dim=1 (sequence) | dim=0 (batch) |
| Calls per step | 2 (neg + pos sequential) | 1 (neg + pos batched) |
| K/V All-Gather | Yes (every attention layer) | No |
| Compile batch_size | 1 | 2 |
| Best for | guidance_scale = 1 (no CFG) | guidance_scale > 1 (~19% faster in the benchmark above) |
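The scatter-dimension difference can be made concrete with a shape-only sketch (sizes below are illustrative, using the sequence and hidden sizes stated earlier in this README):

```python
def scatter_shape(shape, dim, world_size):
    """Shape of each rank's shard after scattering `shape` along `dim`."""
    out = list(shape)
    assert out[dim] % world_size == 0, "dimension must divide evenly"
    out[dim] //= world_size
    return tuple(out)

seq, hidden = 4096, 3072  # image tokens at 1024x1024, inner_dim
# Context Parallel: one CFG branch at a time, sequence split over 2 ranks.
# Each rank holds half the sequence, hence the K/V All-Gather per attention layer.
print(scatter_shape((1, seq, hidden), dim=1, world_size=2))  # (1, 2048, 3072)
# CFG Parallel: neg+pos batched (batch=2), batch split over the same 2 ranks.
# Each rank keeps the full sequence, so no K/V All-Gather is needed.
print(scatter_shape((2, seq, hidden), dim=0, world_size=2))  # (1, 4096, 3072)
```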

## Prerequisites

- **Instance**: trn2.48xlarge (64 NeuronCores, 1.5TB device memory)
- **Virtual env**: `/opt/aws_neuronx_venv_pytorch_2_9_nxd_inference`
- PyTorch 2.9, neuronx-cc 2.22, neuronx-distributed 0.16
- **NVMe**: Mount RAID at `/opt/dlami/nvme/` (run `src/setup_nvme.sh`)

## Usage

### 1. Setup

```bash
# Mount NVMe RAID
sudo bash src/setup_nvme.sh

# Activate virtual environment
source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Download Model

```bash
python src/cache_hf_model.py
```

### 3. Compile All Components

```bash
# Compile with CFG Parallel (default, recommended, fastest)
bash src/compile.sh

# Compile with Context Parallel
bash src/compile.sh cp

# Custom dimensions:
# bash src/compile.sh [cfg|cp] <height> <width> <image_size> <max_seq_len>
# bash src/compile.sh cfg 1024 1024 448 1024
```

Compilation takes ~60-90 minutes total. Compiled models are saved to `/opt/dlami/nvme/compiled_models_longcat/`.
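A quick way to sanity-check the output directory is to look for the component subdirectories that `compile.sh` reports in its CFG-mode summary (swap in `transformer` for `transformer_cfg` when compiling with `cp`):

```python
from pathlib import Path

EXPECTED = ["vae_encoder", "vae_decoder", "transformer_cfg",
            "vision_encoder", "language_model"]

def missing_components(base_dir, expected=EXPECTED):
    """Return the expected compiled-model subdirectories that are absent."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_dir()]

# An empty list means every component compiled and landed on disk:
# missing_components("/opt/dlami/nvme/compiled_models_longcat")
```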

### 4. Run Inference

```bash
# CFG Parallel (default, recommended, fastest)
NEURON_RT_NUM_CORES=8 PYTHONPATH=src:$PYTHONPATH python src/run_longcat_image_edit.py \
--image assets/test.png \
--prompt "change the cat to a dog" \
--seed 43 \
--output output.png

# Context Parallel
NEURON_RT_NUM_CORES=8 PYTHONPATH=src:$PYTHONPATH python src/run_longcat_image_edit.py \
--image assets/test.png \
--prompt "change the cat to a dog" \
--seed 43 \
--use_cp \
--output output.png
```

### CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--image` | (required) | Input image path |
| `--prompt` | (required) | Edit instruction |
| `--output` | `output_edited.png` | Output image path |
| `--height` | 1024 | Output height |
| `--width` | 1024 | Output width |
| `--num_inference_steps` | 50 | Denoising steps |
| `--guidance_scale` | 4.5 | Guidance scale |
| `--seed` | 42 | Random seed |
| `--use_cfg_parallel` | true | Use CFG Parallel transformer (default, fastest) |
| `--use_cp` | false | Use Context Parallel instead of CFG |
| `--cpu_vision_encoder` | false | Use CPU vision encoder for better accuracy |
| `--warmup` | false | Run warmup inference first |
| `--compiled_models_dir` | `/opt/dlami/nvme/compiled_models_longcat` | Path to compiled models |

## Compatibility Matrix

| Instance/Version | 2.22+ (PyTorch 2.9) | 2.21 and earlier |
|------------------|---------------------|------------------|
| Trn2 (trn2.48xlarge) | Tested | Not tested |
| Trn1 | Not tested | Not tested |
| Inf2 | Not supported | Not supported |

## Testing

Run integration test (requires Trn2 instance with compiled models):

```bash
# Full test (compile + inference + validate output)
PYTHONPATH=src:$PYTHONPATH pytest test/integration/test_model.py --capture=tee-sys -v

# Or run manually:
cd contrib/models/LongCat-Image-Edit
PYTHONPATH=src:$PYTHONPATH python test/integration/test_model.py
```

## Key Implementation Notes

1. **M-RoPE position IDs**: Must use the original model's `get_rope_index()` method to obtain correct 3D position IDs; a custom reimplementation produces incorrect results.
2. **VL processor resolution**: Must match between the compiled model and inference. CPU vision-encoder mode uses the processor's default resolution.
3. **Text sequence length**: `text_seq_len=1024` required (770-838 tokens typical for image editing prompts).
4. **VAE**: Compiled for full 1024x1024 output to avoid tile seam artifacts.
5. **Vision Encoder**: Uses native `F.scaled_dot_product_attention` (no monkey-patching) for accuracy.
6. **NKI Flash Attention**: Used for FLUX transformer attention (both dual-stream and single-stream blocks).
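Note 1 refers to Qwen2.5-VL's 3-axis (temporal, height, width) M-RoPE. As a rough illustration of why a naive 1-D reimplementation goes wrong — this is a simplified sketch of the idea, not the actual `get_rope_index()` logic — text tokens share one index across all three axes, while image patches keep a constant temporal index and vary along the height/width axes:

```python
def mrope_positions(text_len, grid_h, grid_w, offset=0):
    """Sketch of 3-axis M-RoPE position IDs for an image grid followed by text.
    Image patches: fixed temporal index, per-patch height/width indices.
    Text tokens: identical index on all three axes, continuing after the
    largest index the image used."""
    t_axis, h_axis, w_axis = [], [], []
    for h in range(grid_h):
        for w in range(grid_w):
            t_axis.append(offset)
            h_axis.append(offset + h)
            w_axis.append(offset + w)
    start = max(t_axis + h_axis + w_axis) + 1
    for i in range(text_len):
        p = start + i
        t_axis.append(p)
        h_axis.append(p)
        w_axis.append(p)
    return t_axis, h_axis, w_axis

# 2x2 image grid followed by 3 text tokens
print(mrope_positions(3, 2, 2))
```

A flat 1-D position counter collapses the height/width structure of the image patches, which is the failure mode the implementation note warns about.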

## File Structure

```
LongCat-Image-Edit/
README.md
requirements.txt
assets/
test.png # Test input image
src/
run_longcat_image_edit.py # Main Neuron inference script
neuron_commons.py # NeuronTextEncoderWrapper, NKI attention
neuron_parallel_utils.py # FLUX-specific TP sharding
neuron_rope.py # 3-axis RoPE pre-computation
compile_transformer.py # FLUX transformer (TP=4, CP=2)
compile_transformer_cfg.py # FLUX transformer (TP=4, DP=2, CFG Parallel)
compile_vae.py # 2D AutoencoderKL (1024x1024)
compile_vision_encoder.py # Qwen2.5-VL ViT (TP=4)
compile_language_model.py # Qwen2.5-VL LM (TP=4)
cache_hf_model.py # Download model + install diffusers
compile.sh # Master compilation script
setup_nvme.sh # NVMe RAID setup
test/
integration/
test_model.py # Integration test
unit/
```

## Example Checkpoints

* [meituan-longcat/LongCat-Image-Edit](https://huggingface.co/meituan-longcat/LongCat-Image-Edit)

## Maintainer

Henan Wan (whn09)

**Last Updated:** 2026-04-13
12 changes: 12 additions & 0 deletions contrib/models/LongCat-Image-Edit/requirements.txt
@@ -0,0 +1,12 @@
# LongCat-Image-Edit Neuron Adaptation
# Install in /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference

# Diffusers from source (LongCat classes are in latest diffusers)
git+https://github.com/huggingface/diffusers

# Required packages
accelerate
sentencepiece
qwen-vl-utils
Pillow
safetensors
32 changes: 32 additions & 0 deletions contrib/models/LongCat-Image-Edit/src/cache_hf_model.py
@@ -0,0 +1,32 @@
import subprocess
import sys
import torch

CACHE_DIR = "/opt/dlami/nvme/longcat_hf_cache"
MODEL_ID = "meituan-longcat/LongCat-Image-Edit"

if __name__ == "__main__":
# Install diffusers from source (LongCat classes are in latest diffusers)
print("Installing diffusers from source (required for LongCat classes)...")
subprocess.check_call([
sys.executable, "-m", "pip", "install",
"git+https://github.com/huggingface/diffusers",
"--quiet",
])
subprocess.check_call([
sys.executable, "-m", "pip", "install",
"accelerate", "sentencepiece", "qwen-vl-utils", "Pillow",
"--quiet",
])

print(f"\nDownloading {MODEL_ID} to {CACHE_DIR}...")
from diffusers import LongCatImageEditPipeline
pipe = LongCatImageEditPipeline.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
cache_dir=CACHE_DIR,
)
print("Model downloaded successfully!")
print(f" Transformer type: {type(pipe.transformer).__name__}")
print(f" Text encoder type: {type(pipe.text_encoder).__name__}")
print(f" VAE type: {type(pipe.vae).__name__}")
152 changes: 152 additions & 0 deletions contrib/models/LongCat-Image-Edit/src/compile.sh
@@ -0,0 +1,152 @@
#!/bin/bash

# Compile LongCat-Image-Edit for Neuron (trn2.48xlarge)
#
# Components:
# 1. VAE: 2D AutoencoderKL (standard FLUX VAE)
# 2. Transformer: FLUX-style with TP=4, CP=2 (10 dual + 20 single stream blocks)
# 3. Vision Encoder: Qwen2.5-VL ViT with TP=4 (same as Qwen reference)
# 4. Language Model: Qwen2.5-VL LM with TP=4 (same as Qwen reference)
#
# Usage:
# ./compile.sh # Compile CFG (CFG Parallel, recommended, fastest)
# ./compile.sh cp # Compile CP (Context Parallel)
# ./compile.sh cfg 1024 1024 448 512 # Custom dimensions with CFG
# ./compile.sh cp 1024 1024 448 512 # Custom dimensions with CP

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
export PYTHONPATH="${SCRIPT_DIR}:$PYTHONPATH"
COMPILED_MODELS_DIR="/opt/dlami/nvme/compiled_models_longcat"
COMPILER_WORKDIR="/opt/dlami/nvme/compiler_workdir_longcat"

# VAE compiled for full output size (no tiling needed, avoids seam artifacts)
VAE_TILE_SIZE=1024

# Check if first argument is mode selector
MODE="cfg"
if [[ "$1" == "cp" || "$1" == "cfg" ]]; then
MODE="$1"
shift
fi

# Parse arguments
HEIGHT=${1:-1024}
WIDTH=${2:-1024}
IMAGE_SIZE=${3:-448}
MAX_SEQ_LEN=${4:-1024}
BATCH_SIZE=${5:-1}

echo "============================================"
echo "LongCat-Image-Edit Compilation for Neuron"
echo "============================================"
echo "Output Size: ${HEIGHT}x${WIDTH}"
echo "VAE Tile Size: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE}"
echo "Vision Encoder Image Size: ${IMAGE_SIZE}"
echo "Max Sequence Length: ${MAX_SEQ_LEN}"
echo "Batch Size: ${BATCH_SIZE}"
echo "Mode: ${MODE}"
if [[ "$MODE" == "cfg" ]]; then
echo "Transformer: FLUX-style, TP=4, DP=2 (CFG Parallel, world_size=8)"
else
echo "Transformer: FLUX-style, TP=4, CP=2 (Context Parallel, world_size=8)"
fi
echo ""

# Step 1: Download model and install dependencies
echo "[Step 1/5] Downloading model and installing dependencies..."
pip install -r "${SCRIPT_DIR}/../requirements.txt" --quiet
python ${SCRIPT_DIR}/cache_hf_model.py
echo "Model downloaded successfully!"
echo ""

# Step 2: Compile VAE (single device, ~5 min)
echo "[Step 2/5] Compiling VAE (2D AutoencoderKL)..."
echo " Tile size: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE}"
python ${SCRIPT_DIR}/compile_vae.py \
--height ${VAE_TILE_SIZE} \
--width ${VAE_TILE_SIZE} \
--batch_size ${BATCH_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "VAE compiled!"
echo ""

# Step 3: Compile Transformer (TP=4, world_size=8)
if [[ "$MODE" == "cfg" ]]; then
echo "[Step 3/5] Compiling FLUX Transformer (CFG Parallel, TP=4, DP=2)..."
neuron_parallel_compile python ${SCRIPT_DIR}/compile_transformer_cfg.py \
--height ${HEIGHT} \
--width ${WIDTH} \
--tp_degree 4 \
--world_size 8 \
--max_sequence_length ${MAX_SEQ_LEN} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "CFG Transformer compiled!"
else
echo "[Step 3/5] Compiling FLUX Transformer (Context Parallel, TP=4, CP=2)..."
neuron_parallel_compile python ${SCRIPT_DIR}/compile_transformer.py \
--height ${HEIGHT} \
--width ${WIDTH} \
--tp_degree 4 \
--world_size 8 \
--max_sequence_length ${MAX_SEQ_LEN} \
--batch_size ${BATCH_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "CP Transformer compiled!"
fi
echo ""

# Step 4: Compile Vision Encoder (TP=4, ~10 min)
echo "[Step 4/5] Compiling Vision Encoder (TP=4, float32)..."
python ${SCRIPT_DIR}/compile_vision_encoder.py \
--image_size ${IMAGE_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "Vision Encoder compiled!"
echo ""

# Step 5: Compile Language Model (TP=4, ~15 min)
echo "[Step 5/5] Compiling Language Model (TP=4)..."
neuron_parallel_compile python ${SCRIPT_DIR}/compile_language_model.py \
--max_sequence_length ${MAX_SEQ_LEN} \
--batch_size ${BATCH_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "Language Model compiled!"
echo ""

echo "============================================"
echo "Compilation Complete!"
echo "============================================"
echo ""
echo "Compiled models saved to: ${COMPILED_MODELS_DIR}/"
echo " - vae_encoder/ (tile: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE})"
echo " - vae_decoder/ (tile: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE})"
if [[ "$MODE" == "cfg" ]]; then
echo " - transformer_cfg/ (TP=4, DP=2, CFG Parallel, output: ${HEIGHT}x${WIDTH}, batch=2)"
else
echo " - transformer/ (TP=4, CP=2, output: ${HEIGHT}x${WIDTH})"
fi
echo " - vision_encoder/ (TP=4, float32)"
echo " - language_model/ (TP=4)"
echo ""
echo "To run inference:"
if [[ "$MODE" == "cfg" ]]; then
echo " # CFG Parallel (recommended when guidance_scale > 1):"
echo " NEURON_RT_NUM_CORES=8 python run_longcat_image_edit.py \\"
echo " --image input.jpg \\"
echo " --prompt \"your edit instruction\" \\"
echo " --use_cfg_parallel --warmup"
echo ""
echo " Note: CFG Parallel batches negative+positive prompts, halving transformer calls per step"
else
echo " # Context Parallel:"
echo " NEURON_RT_NUM_CORES=8 python run_longcat_image_edit.py \\"
echo " --image input.jpg \\"
echo " --prompt \"your edit instruction\" \\"
echo " --warmup"
fi