• We also replace the phoneme representations

    • $h_{text}$ from $T$, referred to as the acoustic text encoder,
    • with $h_{bert}$ from another text encoder $B$, referred to as the prosodic text encoder
      • $B$ is based on PL-BERT, which is pre-trained on extensive text corpora
    • this replacement has been shown to improve the naturalness of StyleTTS
  • input

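    • phoneme sequence $t$, shape (B, T): batch size × phoneme sequence length (the shape letters are unrelated to the encoders $B$ and $T$)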
  • output

    • phoneme representation $h_{bert}$, shape (B, T, 768)
  • model architecture

    For Korean, we currently use the pretrained acoustic text encoder as-is.
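
The input/output contract of the prosodic text encoder can be illustrated with a minimal shape-only sketch. This is not PL-BERT: `toy_prosodic_encoder`, `vocab_size`, and the random embedding table are hypothetical stand-ins that only reproduce the mapping from phoneme IDs of shape (B, T) to vectors of shape (B, T, 768). The real encoder applies transformer layers on top of the embeddings.

```python
import random

HIDDEN_DIM = 768  # width of h_bert as stated in the notes


def toy_prosodic_encoder(tokens, vocab_size=100, hidden_dim=HIDDEN_DIM, seed=0):
    """Hypothetical stand-in for encoder B.

    Maps phoneme IDs of shape (B, T) to vectors of shape (B, T, hidden_dim)
    using only an embedding lookup. vocab_size is an assumed value.
    """
    rng = random.Random(seed)
    # Random embedding table: one hidden_dim vector per phoneme ID.
    table = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
             for _ in range(vocab_size)]
    return [[table[tok] for tok in seq] for seq in tokens]


def shape(x):
    """Return the shape of a rectangular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Two phoneme sequences of length 4, i.e. shape (B=2, T=4).
t = [[5, 12, 7, 0], [3, 3, 9, 1]]
h_bert = toy_prosodic_encoder(t)
print(shape(t), "->", shape(h_bert))  # (2, 4) -> (2, 4, 768)
```

Because the output keeps the same (B, T) leading dimensions as $h_{text}$, downstream modules that consume per-phoneme representations only need to accept a 768-dimensional feature size when $h_{bert}$ is swapped in.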