prosodic text encoder (pre-trained PL-BERT)

We also replace the phoneme representations
- $h_{text}$ from $T$, now referred to as the acoustic text encoder,
- with $h_{bert}$ from another text encoder $B$, denoted as the prosodic text encoder
  - $B$ is based on PL-BERT and pre-trained on extensive corpora
- this approach has been shown to enhance the naturalness of StyleTTS
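input
- phoneme $t$ (B, T), where B is the batch size and T is the phoneme sequence length (not the encoders $B$ and $T$ above)
output
- phoneme representation $h_{bert}$ (B, T, 768)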
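
The input/output shapes above can be illustrated with a minimal sketch. `ToyProsodicTextEncoder`, the vocabulary size of 100, and the batch and length values are hypothetical, chosen only for illustration. The class is a plain embedding lookup standing in for the encoder, not the actual PL-BERT model or its API; only the hidden size of 768 comes from the notes.

```python
import numpy as np

HIDDEN_DIM = 768  # hidden size of h_bert, taken from the output shape in the notes


class ToyProsodicTextEncoder:
    """Hypothetical stand-in for the prosodic text encoder B (NOT the real PL-BERT)."""

    def __init__(self, n_phonemes: int, hidden_dim: int = HIDDEN_DIM, seed: int = 0):
        rng = np.random.default_rng(seed)
        # One embedding row per phoneme ID. A real BERT-style encoder would
        # additionally run Transformer layers over these vectors.
        self.embedding = rng.standard_normal((n_phonemes, hidden_dim)).astype(np.float32)

    def __call__(self, phonemes: np.ndarray) -> np.ndarray:
        if phonemes.ndim != 2:
            raise ValueError(f"expected shape (batch, length), got {phonemes.shape}")
        # Integer-array indexing maps (batch, length) -> (batch, length, hidden_dim).
        return self.embedding[phonemes]


# Illustrative values only (assumptions, not from the notes).
batch_size, length, n_phonemes = 2, 5, 100
t = np.random.default_rng(1).integers(0, n_phonemes, size=(batch_size, length))

encoder = ToyProsodicTextEncoder(n_phonemes)
h_bert = encoder(t)
print(h_bert.shape)  # (2, 5, 768)
```

The sketch only checks the shape contract: each phoneme ID in a (B, T) batch maps to one 768-dimensional vector, so downstream modules that previously consumed $h_{text}$ would receive a (B, T, 768) tensor instead.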
model architecture

For Korean, we are currently just using the pretrained acoustic text encoder as is.