More Multimodal RL support

Thanks for your wonderful work on the RL framework ROLL.

Do you have plans to support more multimodal RL pipelines, such as audio generation for omni models and image/video generation models? On the inference side, vllm-omni and sglang diffusion are also working toward serving these multimodal models.