VibeVoice: Microsoft's open-source text-to-speech project

🎙️ VibeVoice: A Frontier Long Conversational Text-to-Speech Model

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have disabled this repo until we are confident that out-of-scope use is no longer possible.

VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.

The model can synthesize speech up to 90 minutes long with up to 4 distinct speakers, surpassing the typical 1-2 speaker limits of many prior models.
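To make the effect of the 7.5 Hz frame rate concrete, the following minimal sketch computes how many tokenizer frames a recording of a given length occupies. The function name `frames_for_duration` and the 50 Hz comparison rate are assumptions chosen for illustration; they are not taken from the VibeVoice code or paper.

```python
# Illustrative arithmetic only; not part of the VibeVoice codebase.
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the text above


def frames_for_duration(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Return the number of tokenizer frames covering `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# 90 minutes is the maximum length stated above.
max_frames = frames_for_duration(90)

# Hypothetical 50 Hz tokenizer, used here only as a point of comparison.
comparison_frames = frames_for_duration(90, frame_rate_hz=50.0)

print(max_frames, comparison_frames)
```

Under these assumptions, 90 minutes of audio corresponds to 40,500 frames per tokenizer stream at 7.5 Hz, versus 270,000 frames at the hypothetical 50 Hz rate, which is why the low frame rate matters for long sequences.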

Figures: MOS Preference Results; VibeVoice Overview

🎵 Demo Examples

Video Demo

We produced this video with Wan2.2. We sincerely appreciate the Wan-Video team for their great work.

English

https://github.com/user-attachments/assets/0967027c-141e-4909-bec8-091558b1b784

Chinese

https://github.com/user-attachments/assets/322280b7-3093-4c67-86e3-10be4746c88f

Cross-Lingual

https://github.com/user-attachments/assets/838d8ad9-a201-4dde-bb45-8cd3f59ce722

Spontaneous Singing

https://github.com/user-attachments/assets/6f27a8a5-0c60-4f57-87f3-7dea2e11c730

Long Conversation with 4 people

https://github.com/user-attachments/assets/a357c4b6-9768-495c-a576-1618f6275727

For more examples, see the Project Page.

Risks and limitations

While efforts have been made to optimize the model through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5B in this release).

Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

English and Chinese only: Transcripts in languages other than English or Chinese may result in unexpected audio outputs.

Non-Speech Audio: The model focuses solely on speech synthesis and does not handle background noise, music, or other sound effects.

Overlapping Speech: The current model does not explicitly model or generate overlapping speech segments in conversations.
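Given the English-and-Chinese-only limitation listed above, a transcript can be screened before synthesis for letters from other writing systems. The sketch below is a rough heuristic written for illustration, not part of VibeVoice: the helper name `unsupported_letters` is hypothetical, and it checks scripts rather than languages, so text in other Latin-script languages (French, for example) will still pass.

```python
import unicodedata


def unsupported_letters(text: str) -> set[str]:
    """Return letters whose Unicode name is neither LATIN nor CJK.

    Heuristic script check only; it does not detect the actual language.
    """
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if not (name.startswith("LATIN") or name.startswith("CJK")):
            found.add(ch)
    return found


# An empty set means no letters outside the Latin and CJK blocks were found.
print(unsupported_letters("Hello, 你好!"))
print(unsupported_letters("Привет"))
```

A non-empty result suggests the transcript contains text the model may render unpredictably and is worth reviewing by hand.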

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.
