The text encoder transforms the phoneme sequence into a hidden representation.
It is a 3-layer CNN followed by a bidirectional LSTM, the same design as the encoder in Tacotron 2.

Input: a sequence of phoneme IDs.
Output: a hidden representation per phoneme (512-dim: the two 256-dim LSTM directions concatenated).

Model architecture:
TextEncoder(
  (embedding): Embedding(178, 512)
  (cnn): ModuleList(
    (0-2): 3 x Sequential(
      (0): ParametrizedConv1d(
        512, 512, kernel_size=(5,), stride=(1,), padding=(2,)
        (parametrizations): ModuleDict(
          (weight): ParametrizationList(
            (0): _WeightNorm()
          )
        )
      )
      (1): LayerNorm()
      (2): LeakyReLU(negative_slope=0.2)
      (3): Dropout(p=0.2, inplace=False)
    )
  )
  (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
)
Why not use a Transformer text encoder, as in VITS?