(Draft) [Roadmap] DeepSpeed Roadmap Q2 2026 #7861

@tohtana

This is a draft roadmap for DeepSpeed Q2 2026. Feedback is welcome: please leave comments on this issue or join the #2026q2-roadmap channel on the DeepSpeed Slack.

New features and enhancements

AutoEP support

AutoEP enables Expert Parallelism (EP) for major Mixture-of-Experts (MoE) models out of the box, eliminating the need for users to write model-specific parallelization code. By automatically distributing expert layers across devices, AutoEP allows users to scale MoE training with minimal configuration changes.

A prototype implementation has been validated on 8xH100, achieving ~5x throughput improvement over ZeRO-3 baselines. In Q2, we will build on this work to bring AutoEP to production readiness.

  • Convergence validation: Verify that training convergence matches non-EP baselines across the latest MoE model architectures
  • Model coverage: Add support for additional MoE architectures (e.g., Qwen-MoE)
  • ZeRO-3 support: Extend AutoEP to work with ZeRO Stage 3
  • AutoTP integration: Combine AutoEP with AutoTP for hybrid expert/tensor parallelism
  • Benchmarking: Publish throughput, memory, and scaling efficiency numbers across model sizes and GPU counts
  • Universal Checkpoint support: Enable saving and resuming from Universal Checkpoints with AutoEP
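To make the idea of distributing expert layers across devices concrete, here is a minimal sketch of one possible placement scheme: contiguous blocks of experts per expert-parallel rank. The function names (`assign_experts`, `owner_of`) and the contiguous layout are assumptions made for illustration. They are not AutoEP's API and do not describe how AutoEP actually places experts.

```python
def assign_experts(num_experts: int, ep_size: int) -> dict[int, list[int]]:
    """Map each expert-parallel rank to a contiguous block of expert indices.

    Hypothetical helper for illustration only; not part of DeepSpeed.
    """
    if num_experts % ep_size != 0:
        raise ValueError("this sketch assumes num_experts is divisible by ep_size")
    per_rank = num_experts // ep_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(ep_size)
    }


def owner_of(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Return the rank that holds a given expert under the layout above."""
    return expert_id // (num_experts // ep_size)


# 64 experts spread over 8 devices: each rank stores 8 experts instead of 64.
placement = assign_experts(num_experts=64, ep_size=8)
print(placement[0])            # experts held by rank 0
print(owner_of(63, 64, 8))     # rank that tokens routed to expert 63 must reach
```

Under such a layout, each device stores only `num_experts / ep_size` experts, and tokens routed to an expert on another rank have to be exchanged between devices before and after the expert computation.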

AutoTP extension

AutoTP was significantly revamped in Q1 (PR #7806), introducing a flexible, configuration-driven API for custom layer partitioning patterns. In Q2, we will extend this foundation to support a broader range of models and scales.

  • HuggingFace tp_plan support: Leverage the base_model_tp_plan metadata provided by HuggingFace Transformers models to automatically derive partitioning configurations, enabling out-of-the-box TP for any model that ships with a tp_plan
  • Combination with AutoEP: Support parallel folding for hybrid expert/tensor parallelism
  • Universal Checkpoint support: Enable saving and resuming from Universal Checkpoints with AutoTP

AutoSP Integration

AutoSP (ICLR 2026) is a compiler-based approach that automatically applies sequence parallelism via DeepSpeed Ulysses, removing the need for manual partitioning of sequence dimensions.

  • Initial integration: The initial PR (Merging AutoSP into DeepSpeed #7860) is ready
  • Model coverage: Improve coverage for major model families (e.g., Qwen, Llama)
  • Multimodal model support: Multimodal models involve significantly longer sequence lengths, making sequence parallelism critical for training efficiency (blog post). However, existing frameworks such as Megatron-LM do not support sequence parallelism for ViT encoders, and manually implementing it requires substantial engineering effort. AutoSP aims to automate this, enabling DeepSpeed Ulysses-based sequence parallelism for multimodal architectures out of the box.
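For readers unfamiliar with Ulysses-style sequence parallelism, the following sketch shows only the shape bookkeeping: outside attention each rank holds a slice of the sequence with all heads, and an all-to-all exchange gives each rank the full sequence for a subset of heads during attention. The function name `ulysses_shapes` and the example sizes are assumptions for illustration. The sketch performs no communication and is not the AutoSP compiler pass.

```python
def ulysses_shapes(seq_len: int, num_heads: int, head_dim: int, sp_size: int):
    """Per-rank activation shapes before and during attention.

    Hypothetical helper: models the layout change only, with no real
    all-to-all communication.
    """
    if seq_len % sp_size or num_heads % sp_size:
        raise ValueError("this sketch assumes seq_len and num_heads are divisible by sp_size")
    # Outside attention: sequence is sharded, every rank sees all heads.
    sequence_sharded = (seq_len // sp_size, num_heads, head_dim)
    # Inside attention (after all-to-all): full sequence, heads are sharded.
    head_sharded = (seq_len, num_heads // sp_size, head_dim)
    return sequence_sharded, head_sharded


before, during = ulysses_shapes(seq_len=32768, num_heads=32, head_dim=128, sp_size=8)
print(before)   # per-rank shape outside attention
print(during)   # per-rank shape inside attention
```

The per-rank element count is the same in both layouts, so the exchange redistributes activations without increasing per-device memory. The divisibility requirement on the number of heads is one reason model coverage needs per-architecture attention.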

Compiler Integration Enhancement (Optional)

  • "DTensor mode" for fewer graph breaks and more stable graph tracing
  • DeepCompile enhancement
    • Support multi-stage optimization passes for PyTorch v2.9+
    • Compiler pass enhancement
      • AutoTP support
      • AutoEP support
    • AMD support

New Accelerator Support (Q2)

  • Planning (Scope, target accelerators)

RL-training-specific optimization for DeepSpeed-Inference

  • System design, prototyping, and benchmarking

Stability (Q2)

  • Performance regression tests
  • Enable nightly full test runs
    • CUDA
    • AMD
    • Intel XPU
    • Intel Gaudi
    • NPU
