MiMo v2 TTS
Generate high-quality speech from text using the latest MiMo v2 TTS API. Support styles, voice selection, and emotions.
Text to Speech Generator
This online tool is powered by the latest MiMo v2 TTS (Text-to-Speech) model released by Xiaomi, capable of automatically converting input text into highly natural and fluent speech. You can generate vivid, expressive voice content by configuring speech styles and inserting fine-grained audio tags.
โ ๏ธ Disclaimer: In order to bring this tool to you quickly, it was built fast and might have edge-case bugs. If you experience issues or have feature requests, please feel free to raise them!
๐ Quick Links
- ๐ Apply for MIMO API Key (Console)
- ๐ Official Speech Synthesis API Docs
- ๐ฐ Billing: Currently free for a limited time.
๐ Configuration Guide
1. API Key Application & Security
Before using this tool, you must provide a valid MIMO API Key.
- How to apply: Visit the Xiaomi MiMo Console to register and generate your unique Key.
- ๐ Privacy Guarantee: All API calls from this website are made directly from your browser to the official servers. We will NEVER record, collect, or upload your API Key. If you are still concerned, you can delete or revoke the key in the console after using the tool.
2. Voice Selection (Built-in Voices)
You can choose an official pre-set voice from the dropdown:
mimo_default: MiMo-Defaultdefault_zh: MiMo-Chinese Female Voicedefault_en: MiMo-English Female Voice (Note: Voice cloning is currently not supported by the API)
3. Overall Speech Style Control
Input your desired emotion or dialect into the "Style" input box. The tool will automatically prepend it as <style>Your Style</style> to the target content. You can even combine styles separated by spaces!
Supported styles include but are not limited to:
- Speech Rate: Speed up / Slow down
- Emotions: Happy / Sad / Angry
- Roles: Sun Wukong / Lin Daiyu
- Style Change: Whisper / Clamped voice / Taiwanese accent / Singing
- Dialects: Northeastern dialect / Sichuan dialect / Henan dialect / Cantonese
Examples:
<style>Happy</style>Tomorrow is Friday, so happy!<style>Whisper</style>Oh my goodness, it's so cold today! You know that wind, it's howling like a knife!- (Note: To achieve the best singing style, you must add ONLY
<style>ๅฑๆญ</style>at the very beginning of the target text).
4. Fine-grained Audio Tags
Through inline Audio Tags, you can exercise fine-grained control to precisely adjust tone, emotion, and expression styleโwhether it's a whisper, a hearty laugh, or inserting breaths, pauses, and coughs. Insert them directly into the target text. Examples:
Achoo! Ahem. IโI really [cough] think I am coming down with a terrible [cough] terrible cold.[heavy breathing] Just... give me... a second.It's just so stupid! (sobbing) he just ate the whole thing in one bite!
5. Roles: User Context vs Assistant Text
- Assistant Text (Required): The target text for speech synthesis MUST be placed in an
assistantrole message. This field is the actual speech audio that will be generated. - User Context (Optional): Provides a background conversational context for the TTS engine. It helps the TTS model adapt a suitable tone in response to the user's input.