논문 링크: https://arxiv.org/pdf/2306.07691
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Columbia Univ.
NeurIPS 2023
Diffusion-based model, end-to-end, combines a variety of techniques
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.
Key Contributions:
Style Diffusion: Models style as a latent random variable and samples it with a diffusion model, enabling efficient and realistic speech synthesis without any reference audio.
→ Enables faster synthesis than other diffusion-based TTS models while retaining the ability to generate diverse speech
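The sampling idea above can be sketched as follows. This is a toy DDPM-style ancestral sampling loop over a low-dimensional style vector, conditioned on a text embedding; the `denoiser` here is a hypothetical placeholder (a fixed linear map), not the paper's actual trained network, and the schedule and dimensions are illustrative assumptions.

```python
import numpy as np

STYLE_DIM = 8  # illustrative; the real style vector is much larger

def denoiser(s_t, text_emb, t):
    """Placeholder for the learned denoising network: predicts the noise
    component of s_t given the text condition and timestep t.
    (A fixed random linear map stands in for the trained model.)"""
    W = np.random.default_rng(0).standard_normal((STYLE_DIM, STYLE_DIM)) * 0.01
    return W @ s_t + 0.1 * text_emb

def sample_style(text_emb, steps=10, seed=42):
    """Simplified ancestral sampling: start from pure noise and iteratively
    denoise, so no reference speech is needed at inference time."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    s = rng.standard_normal(STYLE_DIM)  # pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(s, text_emb, t)
        # posterior mean of the reverse diffusion step
        s = (s - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final step
            s = s + np.sqrt(betas[t]) * rng.standard_normal(STYLE_DIM)
    return s

text_emb = np.ones(STYLE_DIM)
style = sample_style(text_emb)
print(style.shape)  # (8,)
```

Because the diffusion runs in a compact style latent rather than over the waveform or mel-spectrogram, only a few denoising steps are needed, which is where the speed advantage over waveform-level diffusion TTS comes from.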
Adversarial Training with Large SLMs: Uses large pre-trained speech language models (SLMs) such as WavLM as discriminators in adversarial training.
→ Not only improves the naturalness of the speech but also effectively transfers knowledge from large pre-trained models into the TTS system
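A minimal sketch of the idea: frozen SLM features act as the discriminator's input, and only a small head on top of them is trained adversarially. Here `slm_features` is a stand-in for frozen WavLM hidden states (the real WavLM model is not loaded), and the least-squares GAN objective is one common choice of adversarial loss, not necessarily the paper's exact formulation.

```python
import numpy as np

FEAT_DIM = 16  # illustrative feature dimension

def slm_features(waveform):
    """Placeholder for frozen WavLM feature extraction: a fixed random
    projection stands in for the pre-trained SLM's hidden states."""
    W = np.random.default_rng(1).standard_normal((FEAT_DIM, waveform.size))
    return np.tanh(W @ waveform)

def disc_head(feat, w):
    """Tiny trainable discriminator head on top of the SLM features."""
    return float(w @ feat)

def lsgan_losses(real_wav, fake_wav, w):
    """Least-squares GAN objectives: the discriminator pushes real toward 1
    and fake toward 0; the generator pushes fake toward 1."""
    d_real = disc_head(slm_features(real_wav), w)
    d_fake = disc_head(slm_features(fake_wav), w)
    d_loss = (d_real - 1.0) ** 2 + d_fake ** 2  # discriminator loss
    g_loss = (d_fake - 1.0) ** 2                # generator loss
    return d_loss, g_loss

rng = np.random.default_rng(0)
real = rng.standard_normal(32)   # toy "recorded" waveform
fake = rng.standard_normal(32)   # toy "synthesized" waveform
w = rng.standard_normal(FEAT_DIM) * 0.1
d_loss, g_loss = lsgan_losses(real, fake, w)
print(d_loss >= 0 and g_loss >= 0)  # True
```

Keeping the SLM frozen is the key design point: the generator is trained against features that already encode rich speech knowledge, which is how the pre-trained model's information flows into the TTS system.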
Differentiable Duration Modeling: Introduces a novel approach that models the durations of speech units in a differentiable manner.
→ Facilitates end-to-end training without pre-computed segmentation or alignment
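The core trick can be sketched as a soft upsampler: instead of repeating each phoneme a hard integer number of times (which blocks gradients through the duration predictor), each output frame attends softly to phonemes based on continuous predicted durations. The Gaussian kernel below is an illustrative choice of soft alignment, not the paper's exact formulation.

```python
import numpy as np

def soft_upsample(phoneme_feats, durations, sigma=0.5):
    """Differentiably expand phoneme features to frame rate.
    phoneme_feats: (N, D) phoneme-level features
    durations: (N,) continuous duration predictions (in frames)"""
    # Center of each phoneme on the frame axis, from cumulative durations
    centers = np.cumsum(durations) - durations / 2.0
    T = int(round(durations.sum()))  # total number of output frames
    frames = np.arange(T) + 0.5
    # Soft alignment: frame t attends to phoneme i with a Gaussian weight,
    # so gradients flow back into `durations` (unlike hard repetition)
    logits = -((frames[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)  # (T, N), rows sum to 1
    return attn @ phoneme_feats              # (T, D) frame-level features

phon = np.eye(3)                  # 3 phonemes with one-hot features
dur = np.array([2.0, 1.0, 3.0])   # continuous durations, 6 frames total
out = soft_upsample(phon, dur)
print(out.shape)  # (6, 3)
```

Because every operation here is differentiable in `durations`, the duration predictor can be trained jointly with the rest of the model, removing the need for external forced alignment or pre-segmentation.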
Difference from the original StyleTTS framework: the two-stage pipeline is replaced with end-to-end training