Support me on Ko-fi

MiMo v2 TTS

Generate high-quality speech from text using the latest MiMo v2 TTS API. Support styles, voice selection, and emotions.

Text to Speech Generator

This online tool is powered by the latest MiMo v2 TTS (Text-to-Speech) model released by Xiaomi, capable of automatically converting input text into highly natural and fluent speech. You can generate vivid, expressive voice content by configuring speech styles and inserting fine-grained audio tags.

โš ๏ธ Disclaimer: In order to bring this tool to you quickly, it was built fast and might have edge-case bugs. If you experience issues or have feature requests, please feel free to raise them!

๐Ÿ”— Quick Links

๐ŸŒŸ Configuration Guide

1. API Key Application & Security

Before using this tool, you must provide a valid MIMO API Key.

  • How to apply: Visit the Xiaomi MiMo Console to register and generate your unique Key.
  • ๐Ÿ”’ Privacy Guarantee: All API calls from this website are made directly from your browser to the official servers. We will NEVER record, collect, or upload your API Key. If you are still concerned, you can delete or revoke the key in the console after using the tool.

2. Voice Selection (Built-in Voices)

You can choose an official pre-set voice from the dropdown:

  • mimo_default: MiMo-Default
  • default_zh: MiMo-Chinese Female Voice
  • default_en: MiMo-English Female Voice (Note: Voice cloning is currently not supported by the API)

3. Overall Speech Style Control

Input your desired emotion or dialect into the "Style" input box. The tool will automatically prepend it as <style>Your Style</style> to the target content. You can even combine styles separated by spaces!

Supported styles include but are not limited to:

  • Speech Rate: Speed up / Slow down
  • Emotions: Happy / Sad / Angry
  • Roles: Sun Wukong / Lin Daiyu
  • Style Change: Whisper / Clamped voice / Taiwanese accent / Singing
  • Dialects: Northeastern dialect / Sichuan dialect / Henan dialect / Cantonese

Examples:

  • <style>Happy</style>Tomorrow is Friday, so happy!
  • <style>Whisper</style>Oh my goodness, it's so cold today! You know that wind, it's howling like a knife!
  • (Note: To achieve the best singing style, you must add ONLY <style>ๅ”ฑๆญŒ</style> at the very beginning of the target text).

4. Fine-grained Audio Tags

Through inline Audio Tags, you can exercise fine-grained control to precisely adjust tone, emotion, and expression styleโ€”whether it's a whisper, a hearty laugh, or inserting breaths, pauses, and coughs. Insert them directly into the target text. Examples:

  • Achoo! Ahem. Iโ€”I really [cough] think I am coming down with a terrible [cough] terrible cold.
  • [heavy breathing] Just... give me... a second.
  • It's just so stupid! (sobbing) he just ate the whole thing in one bite!

5. Roles: User Context vs Assistant Text

  • Assistant Text (Required): The target text for speech synthesis MUST be placed in an assistant role message. This field is the actual speech audio that will be generated.
  • User Context (Optional): Provides a background conversational context for the TTS engine. It helps the TTS model adapt a suitable tone in response to the user's input.