Cyber-Blacat/MiMo-VL-API-inference

🇨🇳 中文说明 (Chinese) | 🇺🇸 English

MiMo-VL-7B-RL Quick API Script Guide

The MiMoVLM-api_server.py script in this project provides rapid deployment and invocation of MiMo-VL-7B-RL, Xiaomi's open-source multi-modal Vision-Language Model (VLM). It supports image captioning and other multi-modal reasoning tasks.

Model Overview

MiMo-VL-7B-RL is a high-performance vision-language model released by the Xiaomi Large Model Team. It offers strong image understanding, reasoning, and multi-modal dialogue capabilities. The model combines a native-resolution ViT encoder, an MLP projector, and the MiMo-7B language model. It is optimized through multi-stage pre-training and mixed reinforcement learning, achieving state-of-the-art results on several public benchmarks.

Main Features

  • Supports image captioning via image URL or local file upload
  • Customizable prompt (instruction) support
  • Standard RESTful API interface for easy integration
  • Automatic management of temporary files, suitable for high concurrency
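This guide does not show how the script manages temporary files. As a rough illustration of one common pattern for the last feature above, the sketch below writes an image to a uniquely named temporary file and always deletes it afterwards. The helper name `temporary_image` and the default suffix are hypothetical and are not taken from MiMoVLM-api_server.py.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_image(image_bytes, suffix=".jpg"):
    """Write bytes to a uniquely named temp file and delete it on exit.

    Hypothetical helper for illustration only; the real script may differ.
    """
    # mkstemp creates a new file with a unique name, so concurrent
    # requests never share or overwrite each other's files.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(image_bytes)
        yield path
    finally:
        # Clean up even if the code using the file raised an exception.
        if os.path.exists(path):
            os.remove(path)
```

Inside a `with temporary_image(data) as path:` block, `path` can be handed to whatever loads the image. The file is removed when the block exits, whether or not inference succeeded.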

Quick Start

  1. Prepare Model Weights

    • Download the model weights from the ModelScope download page and extract them to the directory specified by MODEL_PATH in MiMoVLM-api_server.py (default: /hy-tmp/data/MiMo-VL-7B-RL).
  2. Install Dependencies

    pip install -r requirements.txt
  3. Start the API Service

    python MiMoVLM-api_server.py

    The service will listen on http://0.0.0.0:8000 after startup.
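Loading a 7B model can take a while, so a client started right after the server may find the port still closed. The following is a minimal sketch of a readiness check that polls the TCP port until it accepts connections. The function name `wait_for_port` and the default timeout and interval are assumptions for illustration. An open port only shows that the server process is listening, not that a request will succeed.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout=60.0, interval=1.0):
    """Return True once a TCP connection succeeds, False if `timeout` elapses.

    Hypothetical helper; port 8000 matches the default mentioned in this guide.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port(timeout=300)` could be called in a client script before the first request is sent.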

API Endpoints

1. Image URL Captioning

  • Endpoint: POST /describe_url/
  • Request Body:
    {
      "image_url": "URL of the image",
      "prompt_text": "(Optional) Custom prompt"
    }
  • Response Example:
    {
      "description": "Image caption",
      "prompt_used": "Prompt actually used",
      "error": null
    }
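A minimal client sketch for this endpoint, using only the Python standard library. It assumes the server runs locally on port 8000 and accepts the JSON body shown above. The function names, the `BASE_URL` constant, and the timeout value are hypothetical choices for illustration.

```python
import json
import urllib.request

# Assumption: the server from this guide is reachable locally on port 8000.
BASE_URL = "http://127.0.0.1:8000"


def build_describe_url_request(image_url, prompt_text=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /describe_url/."""
    payload = {"image_url": image_url}
    if prompt_text is not None:
        payload["prompt_text"] = prompt_text  # optional field per this guide
    return urllib.request.Request(
        url=f"{base_url}/describe_url/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_description(response):
    """Return the caption, or raise if the server reported an error."""
    if response.get("error") is not None:
        raise RuntimeError(f"Captioning failed: {response['error']}")
    return response["description"]


def describe_url(image_url, prompt_text=None, timeout=300):
    """Send the request and return the caption text."""
    req = build_describe_url_request(image_url, prompt_text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return extract_description(json.loads(resp.read().decode("utf-8")))
```

With the server running, `describe_url("https://example.com/cat.jpg", "Describe this image.")` would return the `description` field of the response. The long timeout is a guess, chosen to allow for slow inference.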

2. Image File Upload Captioning

  • Endpoint: POST /describe_upload/
  • Request Body:
    • image: The uploaded image file (form-data)
    • prompt_text: (Optional) Custom prompt
  • Response: Same JSON structure as for /describe_url/ (description, prompt_used, error)
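A minimal sketch of an upload request built with the Python standard library. It assumes the server runs locally on port 8000 and reads both `image` and `prompt_text` as multipart form fields, as the list above suggests. The function name `build_upload_request` and the `BASE_URL` constant are hypothetical.

```python
import mimetypes
import urllib.request
import uuid

# Assumption: the server from this guide is reachable locally on port 8000.
BASE_URL = "http://127.0.0.1:8000"


def build_upload_request(filename, image_bytes, prompt_text=None, base_url=BASE_URL):
    """Build (but do not send) a multipart/form-data POST for /describe_upload/."""
    boundary = uuid.uuid4().hex  # random marker separating the form parts
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    if prompt_text is not None:
        # Optional text field; sending it as a form field is an assumption.
        parts.append(
            (f"--{boundary}\r\n"
             'Content-Disposition: form-data; name="prompt_text"\r\n\r\n'
             f"{prompt_text}\r\n").encode("utf-8")
        )
    # File field named "image", matching the field name in this guide.
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="image"; filename="{filename}"\r\n'
         f"Content-Type: {content_type}\r\n\r\n").encode("utf-8")
        + image_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary
    return urllib.request.Request(
        url=f"{base_url}/describe_upload/",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

The returned request can be sent with `urllib.request.urlopen(req)`, and the JSON body of the reply parsed with `json.loads`. Building the multipart body by hand avoids a third-party dependency; an HTTP client library that handles file uploads would be shorter.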

Dependencies

See requirements.txt for details.

Acknowledgement

This project is based on the open-source MiMo-VL project by the Xiaomi Large Model Team. Special thanks!

For more technical details, please refer to the MiMo-VL Technical Report.

About

MiMo-VL-API-inference is an open-source project for rapid deployment and inference of Xiaomi's MiMo-VL-7B-RL multi-modal vision-language model via a RESTful API. The inference service is based on FastAPI, supports both image URL and local file input, and allows customizable prompts.
