199 changes: 199 additions & 0 deletions contrib/models/LongCat-Image-Edit/README.md
@@ -0,0 +1,199 @@
# Contrib Model: LongCat-Image-Edit

NeuronX adaptation of [meituan-longcat/LongCat-Image-Edit](https://huggingface.co/meituan-longcat/LongCat-Image-Edit) for AWS Trainium2 inference.

## Model Information

- **HuggingFace ID:** `meituan-longcat/LongCat-Image-Edit`
- **Model Type:** FLUX-style diffusion model for image editing
- **Architecture:** Multi-component (Vision Encoder + Language Model + FLUX Transformer + VAE)
- **License:** See the HuggingFace model card

## Architecture Details

LongCat-Image-Edit is a FLUX-style image editing model with the following components:

| Component | Model | Neuron Parallelism |
|-----------|-------|-------------------|
| Vision Encoder | Qwen2.5-VL ViT (32 blocks) | TP=4, float32 |
| Language Model | Qwen2.5-VL LM (28 layers) | TP=4, world_size=8 |
| Transformer (CP) | LongCatImageTransformer2DModel (10 dual + 20 single stream) | TP=4, CP=2, world_size=8 |
| Transformer (CFG) | LongCatImageTransformer2DModel (10 dual + 20 single stream) | TP=4, DP=2, world_size=8, batch=2 |
| VAE | 2D AutoencoderKL | Single device (1024x1024, no tiling) |

Key parameters:
- **Attention Heads:** 24, head_dim=128, inner_dim=3072
- **Text Hidden Size:** 3584 (Qwen2.5-VL)
- **In Channels:** 64 (packed latents)
- **Dual-stream blocks:** 10 (separate text/image norms+FFN, joint attention)
- **Single-stream blocks:** 20 (concatenated text+image, parallel MLP+attention)
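The `In Channels: 64` figure follows from FLUX-style latent packing: assuming the standard FLUX VAE (16 latent channels, 8x spatial downsampling — consistent with the "2D AutoencoderKL" above, though the exact channel count is an assumption here), each 2x2 patch of latent pixels is folded into one transformer token. A minimal sketch of the arithmetic:

```python
def packed_channels(latent_channels: int, patch: int = 2) -> int:
    """FLUX-style packing folds a (C, H, W) latent into
    ((H//patch) * (W//patch), C * patch * patch) tokens."""
    return latent_channels * patch * patch

def packed_tokens(height: int, width: int, vae_scale: int = 8, patch: int = 2) -> int:
    """Number of image tokens the transformer sees for a given output size."""
    latent_h, latent_w = height // vae_scale, width // vae_scale
    return (latent_h // patch) * (latent_w // patch)

print(packed_channels(16))        # 64, matching in_channels above
print(packed_tokens(1024, 1024))  # 4096 image tokens per 1024x1024 image
```

The 4096-token count per image is also why sequence-dimension parallelism (CP, below) is worth considering at this resolution.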

## Performance

| Machine | Config | Total Time | Per Step | Quality |
|---------|--------|------------|----------|---------|
| **Trn2** (trn2.48xlarge) | All Neuron, **CFG Parallel** | **18.17s** | 0.36s | Good |
| **Trn2** (trn2.48xlarge) | All Neuron, Context Parallel | 22.39s | 0.45s | Good |
| **H100** (single GPU, bf16) | Full GPU | 23.61s | 0.47s | Reference |

Test: 1024x1024 output, guidance_scale=4.5, 50 steps.
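For context, `guidance_scale` enters through the standard classifier-free guidance combination (the usual diffusers formula, shown here as a plain-Python sketch over per-element noise predictions):

```python
def cfg_combine(noise_neg, noise_pos, guidance_scale):
    """Classifier-free guidance: extrapolate from the negative
    (unconditional) prediction toward the positive one."""
    return [n + guidance_scale * (p - n) for n, p in zip(noise_neg, noise_pos)]

# guidance_scale = 1.0 reduces to the positive prediction alone,
# which is why the negative pass can be skipped entirely in that case.
print(cfg_combine([0.0, 1.0], [1.0, 3.0], 4.5))  # [4.5, 10.0]
```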

## CFG Parallel vs Context Parallel

Both modes use TP=4, world_size=8 on the same hardware:

| Aspect | Context Parallel (CP) | CFG Parallel |
|--------|----------------------|--------------|
| Scatter dimension | dim=1 (sequence) | dim=0 (batch) |
| Calls per step | 2 (neg + pos sequential) | 1 (neg + pos batched) |
| K/V All-Gather | Yes (every attention layer) | No |
| Compile batch_size | 1 | 2 |
| Best for | guidance_scale = 1 (no CFG) | guidance_scale > 1 (~19% faster in the benchmark above) |
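The scatter-dimension difference can be made concrete with a shape-only sketch (sizes below are illustrative, using the sequence and hidden sizes stated earlier in this README):

```python
def scatter_shape(shape, dim, world_size):
    """Shape of each rank's shard after scattering `shape` along `dim`."""
    out = list(shape)
    assert out[dim] % world_size == 0, "dimension must divide evenly"
    out[dim] //= world_size
    return tuple(out)

seq, hidden = 4096, 3072  # image tokens at 1024x1024, inner_dim
# Context Parallel: one CFG branch at a time, sequence split over 2 ranks.
# Each rank holds half the sequence, hence the K/V All-Gather per attention layer.
print(scatter_shape((1, seq, hidden), dim=1, world_size=2))  # (1, 2048, 3072)
# CFG Parallel: neg+pos batched (batch=2), batch split over the same 2 ranks.
# Each rank keeps the full sequence, so no K/V All-Gather is needed.
print(scatter_shape((2, seq, hidden), dim=0, world_size=2))  # (1, 4096, 3072)
```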

## Prerequisites

- **Instance**: trn2.48xlarge (64 NeuronCores, 1.5TB device memory)
- **Virtual env**: `/opt/aws_neuronx_venv_pytorch_2_9_nxd_inference`
- PyTorch 2.9, neuronx-cc 2.22, neuronx-distributed 0.16
- **NVMe**: Mount RAID at `/opt/dlami/nvme/` (run `src/setup_nvme.sh`)

## Usage

### 1. Setup

```bash
# Mount NVMe RAID
sudo bash src/setup_nvme.sh

# Activate virtual environment
source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Download Model

```bash
python src/cache_hf_model.py
```

### 3. Compile All Components

```bash
# Compile with CFG Parallel (default, recommended, fastest)
bash src/compile.sh

# Compile with Context Parallel
bash src/compile.sh cp

# Custom dimensions:
# bash src/compile.sh [cfg|cp] <height> <width> <image_size> <max_seq_len>
# bash src/compile.sh cfg 1024 1024 448 1024
```

Compilation takes ~60-90 minutes total. Compiled models are saved to `/opt/dlami/nvme/compiled_models_longcat/`.
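A quick way to sanity-check the output directory is to look for the component subdirectories that `compile.sh` reports in its CFG-mode summary (swap in `transformer` for `transformer_cfg` when compiling with `cp`):

```python
from pathlib import Path

EXPECTED = ["vae_encoder", "vae_decoder", "transformer_cfg",
            "vision_encoder", "language_model"]

def missing_components(base_dir, expected=EXPECTED):
    """Return the expected compiled-model subdirectories that are absent."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_dir()]

# An empty list means every component compiled and landed on disk:
# missing_components("/opt/dlami/nvme/compiled_models_longcat")
```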

### 4. Run Inference

```bash
# CFG Parallel (default, recommended, fastest)
NEURON_RT_NUM_CORES=8 PYTHONPATH=src:$PYTHONPATH python src/run_longcat_image_edit.py \
--image assets/test.png \
--prompt "change the cat to a dog" \
--seed 43 \
--output output.png

# Context Parallel
NEURON_RT_NUM_CORES=8 PYTHONPATH=src:$PYTHONPATH python src/run_longcat_image_edit.py \
--image assets/test.png \
--prompt "change the cat to a dog" \
--seed 43 \
--use_cp \
--output output.png
```

### CLI Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--image` | (required) | Input image path |
| `--prompt` | (required) | Edit instruction |
| `--output` | `output_edited.png` | Output image path |
| `--height` | 1024 | Output height |
| `--width` | 1024 | Output width |
| `--num_inference_steps` | 50 | Denoising steps |
| `--guidance_scale` | 4.5 | Guidance scale |
| `--seed` | 42 | Random seed |
| `--use_cfg_parallel` | true | Use CFG Parallel transformer (default, fastest) |
| `--use_cp` | false | Use Context Parallel instead of CFG |
| `--cpu_vision_encoder` | false | Use CPU vision encoder for better accuracy |
| `--warmup` | false | Run warmup inference first |
| `--compiled_models_dir` | `/opt/dlami/nvme/compiled_models_longcat` | Path to compiled models |

## Compatibility Matrix

| Instance/Version | 2.22+ (PyTorch 2.9) | 2.21 and earlier |
|------------------|---------------------|------------------|
| Trn2 (trn2.48xlarge) | Tested | Not tested |
| Trn1 | Not tested | Not tested |
| Inf2 | Not supported | Not supported |

## Testing

Run integration test (requires Trn2 instance with compiled models):

```bash
# Full test (compile + inference + validate output)
PYTHONPATH=src:$PYTHONPATH pytest test/integration/test_model.py --capture=tee-sys -v

# Or run manually:
cd contrib/models/LongCat-Image-Edit
PYTHONPATH=src:$PYTHONPATH python test/integration/test_model.py
```

## Key Implementation Notes

1. **M-RoPE position IDs**: Must use the original model's `get_rope_index()` method to obtain correct 3D position IDs; a custom reimplementation produces incorrect results.
2. **VL processor resolution**: Must match between the compiled model and inference. CPU vision-encoder mode uses the processor's default resolution.
3. **Text sequence length**: `text_seq_len=1024` required (770-838 tokens typical for image editing prompts).
4. **VAE**: Compiled for full 1024x1024 output to avoid tile seam artifacts.
5. **Vision Encoder**: Uses native `F.scaled_dot_product_attention` (no monkey-patching) for accuracy.
6. **NKI Flash Attention**: Used for FLUX transformer attention (both dual-stream and single-stream blocks).
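Note 1 refers to Qwen2.5-VL's 3-axis (temporal, height, width) M-RoPE. As a rough illustration of why a naive 1-D reimplementation goes wrong — this is a simplified sketch of the idea, not the actual `get_rope_index()` logic — text tokens share one index across all three axes, while image patches keep a constant temporal index and vary along the height/width axes:

```python
def mrope_positions(text_len, grid_h, grid_w, offset=0):
    """Sketch of 3-axis M-RoPE position IDs for an image grid followed by text.
    Image patches: fixed temporal index, per-patch height/width indices.
    Text tokens: identical index on all three axes, continuing after the
    largest index the image used."""
    t_axis, h_axis, w_axis = [], [], []
    for h in range(grid_h):
        for w in range(grid_w):
            t_axis.append(offset)
            h_axis.append(offset + h)
            w_axis.append(offset + w)
    start = max(t_axis + h_axis + w_axis) + 1
    for i in range(text_len):
        p = start + i
        t_axis.append(p)
        h_axis.append(p)
        w_axis.append(p)
    return t_axis, h_axis, w_axis

# 2x2 image grid followed by 3 text tokens
print(mrope_positions(3, 2, 2))
```

A flat 1-D position counter collapses the height/width structure of the image patches, which is the failure mode the implementation note warns about.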

## File Structure

```
LongCat-Image-Edit/
README.md
requirements.txt
assets/
test.png # Test input image
src/
run_longcat_image_edit.py # Main Neuron inference script
neuron_commons.py # NeuronTextEncoderWrapper, NKI attention
neuron_parallel_utils.py # FLUX-specific TP sharding
neuron_rope.py # 3-axis RoPE pre-computation
compile_transformer.py # FLUX transformer (TP=4, CP=2)
compile_transformer_cfg.py # FLUX transformer (TP=4, DP=2, CFG Parallel)
compile_vae.py # 2D AutoencoderKL (1024x1024)
compile_vision_encoder.py # Qwen2.5-VL ViT (TP=4)
compile_language_model.py # Qwen2.5-VL LM (TP=4)
cache_hf_model.py # Download model + install diffusers
compile.sh # Master compilation script
setup_nvme.sh # NVMe RAID setup
test/
integration/
test_model.py # Integration test
unit/
```

## Example Checkpoints

* [meituan-longcat/LongCat-Image-Edit](https://huggingface.co/meituan-longcat/LongCat-Image-Edit)

## Maintainer

Henan Wan (whn09)

**Last Updated:** 2026-04-13
12 changes: 12 additions & 0 deletions contrib/models/LongCat-Image-Edit/requirements.txt
@@ -0,0 +1,12 @@
# LongCat-Image-Edit Neuron Adaptation
# Install in /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference

# Diffusers from source (LongCat classes are in latest diffusers)
git+https://github.com/huggingface/diffusers

# Required packages
accelerate
sentencepiece
qwen-vl-utils
Pillow
safetensors
32 changes: 32 additions & 0 deletions contrib/models/LongCat-Image-Edit/src/cache_hf_model.py
@@ -0,0 +1,32 @@
import subprocess
import sys
import torch

CACHE_DIR = "/opt/dlami/nvme/longcat_hf_cache"
MODEL_ID = "meituan-longcat/LongCat-Image-Edit"

if __name__ == "__main__":
# Install diffusers from source (LongCat classes are in latest diffusers)
print("Installing diffusers from source (required for LongCat classes)...")
subprocess.check_call([
sys.executable, "-m", "pip", "install",
"git+https://github.com/huggingface/diffusers",
"--quiet",
])
subprocess.check_call([
sys.executable, "-m", "pip", "install",
"accelerate", "sentencepiece", "qwen-vl-utils", "Pillow",
"--quiet",
])

print(f"\nDownloading {MODEL_ID} to {CACHE_DIR}...")
from diffusers import LongCatImageEditPipeline
pipe = LongCatImageEditPipeline.from_pretrained(
MODEL_ID,
torch_dtype=torch.bfloat16,
cache_dir=CACHE_DIR,
)
print("Model downloaded successfully!")
print(f" Transformer type: {type(pipe.transformer).__name__}")
print(f" Text encoder type: {type(pipe.text_encoder).__name__}")
print(f" VAE type: {type(pipe.vae).__name__}")
152 changes: 152 additions & 0 deletions contrib/models/LongCat-Image-Edit/src/compile.sh
@@ -0,0 +1,152 @@
#!/bin/bash

# Compile LongCat-Image-Edit for Neuron (trn2.48xlarge)
#
# Components:
# 1. VAE: 2D AutoencoderKL (standard FLUX VAE)
# 2. Transformer: FLUX-style with TP=4, CP=2 (10 dual + 20 single stream blocks)
# 3. Vision Encoder: Qwen2.5-VL ViT with TP=4 (same as Qwen reference)
# 4. Language Model: Qwen2.5-VL LM with TP=4 (same as Qwen reference)
#
# Usage:
# ./compile.sh # Compile CFG (CFG Parallel, recommended, fastest)
# ./compile.sh cp # Compile CP (Context Parallel)
# ./compile.sh cfg 1024 1024 448 512 # Custom dimensions with CFG
# ./compile.sh cp 1024 1024 448 512 # Custom dimensions with CP

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
export PYTHONPATH="${SCRIPT_DIR}:$PYTHONPATH"
COMPILED_MODELS_DIR="/opt/dlami/nvme/compiled_models_longcat"
COMPILER_WORKDIR="/opt/dlami/nvme/compiler_workdir_longcat"

# VAE compiled for full output size (no tiling needed, avoids seam artifacts)
VAE_TILE_SIZE=1024

# Check if first argument is mode selector
MODE="cfg"
if [[ "$1" == "cp" || "$1" == "cfg" ]]; then
MODE="$1"
shift
fi

# Parse arguments
HEIGHT=${1:-1024}
WIDTH=${2:-1024}
IMAGE_SIZE=${3:-448}
MAX_SEQ_LEN=${4:-1024}
BATCH_SIZE=${5:-1}

echo "============================================"
echo "LongCat-Image-Edit Compilation for Neuron"
echo "============================================"
echo "Output Size: ${HEIGHT}x${WIDTH}"
echo "VAE Tile Size: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE}"
echo "Vision Encoder Image Size: ${IMAGE_SIZE}"
echo "Max Sequence Length: ${MAX_SEQ_LEN}"
echo "Batch Size: ${BATCH_SIZE}"
echo "Mode: ${MODE}"
if [[ "$MODE" == "cfg" ]]; then
echo "Transformer: FLUX-style, TP=4, DP=2 (CFG Parallel, world_size=8)"
else
echo "Transformer: FLUX-style, TP=4, CP=2 (Context Parallel, world_size=8)"
fi
echo ""

# Step 1: Download model and install dependencies
echo "[Step 1/5] Downloading model and installing dependencies..."
pip install -r "${SCRIPT_DIR}/../requirements.txt" --quiet
python ${SCRIPT_DIR}/cache_hf_model.py
echo "Model downloaded successfully!"
echo ""

# Step 2: Compile VAE (single device, ~5 min)
echo "[Step 2/5] Compiling VAE (2D AutoencoderKL)..."
echo " Tile size: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE}"
python ${SCRIPT_DIR}/compile_vae.py \
--height ${VAE_TILE_SIZE} \
--width ${VAE_TILE_SIZE} \
--batch_size ${BATCH_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "VAE compiled!"
echo ""

# Step 3: Compile Transformer (TP=4, world_size=8)
if [[ "$MODE" == "cfg" ]]; then
echo "[Step 3/5] Compiling FLUX Transformer (CFG Parallel, TP=4, DP=2)..."
neuron_parallel_compile python ${SCRIPT_DIR}/compile_transformer_cfg.py \
--height ${HEIGHT} \
--width ${WIDTH} \
--tp_degree 4 \
--world_size 8 \
--max_sequence_length ${MAX_SEQ_LEN} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "CFG Transformer compiled!"
else
echo "[Step 3/5] Compiling FLUX Transformer (Context Parallel, TP=4, CP=2)..."
neuron_parallel_compile python ${SCRIPT_DIR}/compile_transformer.py \
--height ${HEIGHT} \
--width ${WIDTH} \
--tp_degree 4 \
--world_size 8 \
--max_sequence_length ${MAX_SEQ_LEN} \
--batch_size ${BATCH_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "CP Transformer compiled!"
fi
echo ""

# Step 4: Compile Vision Encoder (TP=4, ~10 min)
echo "[Step 4/5] Compiling Vision Encoder (TP=4, float32)..."
python ${SCRIPT_DIR}/compile_vision_encoder.py \
--image_size ${IMAGE_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "Vision Encoder compiled!"
echo ""

# Step 5: Compile Language Model (TP=4, ~15 min)
echo "[Step 5/5] Compiling Language Model (TP=4)..."
neuron_parallel_compile python ${SCRIPT_DIR}/compile_language_model.py \
--max_sequence_length ${MAX_SEQ_LEN} \
--batch_size ${BATCH_SIZE} \
--compiled_models_dir ${COMPILED_MODELS_DIR} \
--compiler_workdir ${COMPILER_WORKDIR}
echo "Language Model compiled!"
echo ""

echo "============================================"
echo "Compilation Complete!"
echo "============================================"
echo ""
echo "Compiled models saved to: ${COMPILED_MODELS_DIR}/"
echo " - vae_encoder/ (tile: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE})"
echo " - vae_decoder/ (tile: ${VAE_TILE_SIZE}x${VAE_TILE_SIZE})"
if [[ "$MODE" == "cfg" ]]; then
echo " - transformer_cfg/ (TP=4, DP=2, CFG Parallel, output: ${HEIGHT}x${WIDTH}, batch=2)"
else
echo " - transformer/ (TP=4, CP=2, output: ${HEIGHT}x${WIDTH})"
fi
echo " - vision_encoder/ (TP=4, float32)"
echo " - language_model/ (TP=4)"
echo ""
echo "To run inference:"
if [[ "$MODE" == "cfg" ]]; then
echo " # CFG Parallel (recommended when guidance_scale > 1):"
echo " NEURON_RT_NUM_CORES=8 python run_longcat_image_edit.py \\"
echo " --image input.jpg \\"
echo " --prompt \"your edit instruction\" \\"
echo " --use_cfg_parallel --warmup"
echo ""
echo " Note: CFG Parallel batches negative+positive prompts, halving transformer calls per step"
else
echo " # Context Parallel:"
echo " NEURON_RT_NUM_CORES=8 python run_longcat_image_edit.py \\"
echo " --image input.jpg \\"
echo " --prompt \"your edit instruction\" \\"
echo " --warmup"
fi