The text encoder transforms the phoneme sequence into a hidden representation.
It is a 3-layer CNN followed by a bidirectional LSTM, the same design as the encoder in Tacotron 2.

Input: a sequence of phoneme IDs.
Output: a hidden representation per phoneme (512-dim: the two 256-dim LSTM directions concatenated).

Model architecture:
TextEncoder(
  (embedding): Embedding(178, 512)
  (cnn): ModuleList(
    (0-2): 3 x Sequential(
      (0): ParametrizedConv1d(
        512, 512, kernel_size=(5,), stride=(1,), padding=(2,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (1): LayerNorm()
      (2): LeakyReLU(negative_slope=0.2)
      (3): Dropout(p=0.2, inplace=False)
    )
  )
  (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
)
Why not use a Transformer text encoder, as in VITS?