• StyleTTS Text aligner → https://github.com/yl4579/StyleTTS/blob/main/Utils/ASR/models.py

    • Generates an alignment between mel frames and phonemes
    • Trained alongside the decoder during the reconstruction phase
    • Initially pre-trained on an automatic speech recognition (ASR) task
    • An aligner with this setup (pre-trained on large ASR corpora, then fine-tuned for TTS) is called a Transferable Monotonic Aligner (TMA)
    • The seq2seq branch (asr_s2s) uses location-sensitive attention, so its architecture is very similar to the Tacotron 2 decoder
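How the alignment comes out of such a seq2seq aligner can be sketched as follows: during teacher-forced decoding, the model attends over the (downsampled) mel frames once per phoneme step, and stacking the per-step attention distributions yields the soft alignment. This is a minimal numpy sketch with random stand-in energies, not the StyleTTS API; shapes follow the notes below.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, F_half = 2, 5, 10  # batch, phoneme steps, downsampled mel frames

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in per-step attention energies; in the real model these come from
# location-sensitive attention over the encoder memory at each decode step.
energies = rng.normal(size=(B, T, F_half))
alignment = softmax(energies, axis=-1)  # soft alignment a_algn, (B, T, F//2)

# each row is a distribution over mel frames for one phoneme
assert np.allclose(alignment.sum(-1), 1.0)
```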
  • input

    • mel spectrogram $x$ (B, 80, F)
    • phoneme $t$ (B, T)
  • output

    • speech-phoneme alignment $a_{algn}$ (B, T, F//2); the frame axis is halved by the stride-2 init_cnn
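The F//2 frame axis follows from init_cnn (Conv1d with kernel 7, stride 2, padding 3). A quick check with the standard Conv1d output-length formula:

```python
def conv1d_out_len(length, kernel=7, stride=2, padding=3, dilation=1):
    # L_out = floor((L_in + 2p - d*(k-1) - 1) / s) + 1, as in torch.nn.Conv1d
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv1d_out_len(80))   # 40 == 80 // 2
print(conv1d_out_len(101))  # 51; "F//2" holds exactly for even F, odd F rounds up
```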
  • model architecture

    ASRCNN(
      (to_mfcc): MFCC()
      (init_cnn): ConvNorm(
        (conv): Conv1d(40, 256, kernel_size=(7,), stride=(2,), padding=(3,))
      )
      (cnns): Sequential(
        (0-5): 6 x Sequential(
          (0): ConvBlock(
            (blocks): ModuleList(
              (0-2): 3 x Sequential(
                (0): ConvNorm(
                  (conv): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
                )
                (1): ReLU()
                (2): GroupNorm(8, 256, eps=1e-05, affine=True)
                (3): Dropout(p=0.2, inplace=False)
                (4): ConvNorm(
                  (conv): Conv1d(256, 256, kernel_size=(3,), stride=(1,), padding=(1,))
                )
                (5): ReLU()
                (6): Dropout(p=0.2, inplace=False)
              )
            )
          )
          (1): GroupNorm(1, 256, eps=1e-05, affine=True)
        )
      )
      (projection): ConvNorm(
        (conv): Conv1d(256, 128, kernel_size=(1,), stride=(1,))
      )
      (ctc_linear): Sequential(
        (0): LinearNorm(
          (linear_layer): Linear(in_features=128, out_features=256, bias=True)
        )
        (1): ReLU()
        (2): LinearNorm(
          (linear_layer): Linear(in_features=256, out_features=178, bias=True)
        )
      )
      (asr_s2s): ASRS2S(
        (embedding): Embedding(178, 512)
        (project_to_n_symbols): Linear(in_features=128, out_features=178, bias=True)
        (attention_layer): Attention(
          (query_layer): LinearNorm(
            (linear_layer): Linear(in_features=128, out_features=128, bias=False)
          )
          (memory_layer): LinearNorm(
            (linear_layer): Linear(in_features=128, out_features=128, bias=False)
          )
          (v): LinearNorm(
            (linear_layer): Linear(in_features=128, out_features=1, bias=False)
          )
          (location_layer): LocationLayer(
            (location_conv): ConvNorm(
              (conv): Conv1d(2, 32, kernel_size=(63,), stride=(1,), padding=(31,), bias=False)
            )
            (location_dense): LinearNorm(
              (linear_layer): Linear(in_features=32, out_features=128, bias=False)
            )
          )
        )
        (decoder_rnn): LSTMCell(640, 128)
        (project_to_hidden): Sequential(
          (0): LinearNorm(
            (linear_layer): Linear(in_features=256, out_features=128, bias=True)
          )
          (1): Tanh()
        )
      )
    )
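The Attention module printed above is the Tacotron 2-style location-sensitive attention: energies combine a projected query, projected memory, and location features computed by convolving the previous and cumulative attention weights. A minimal numpy sketch of one attention step, mirroring the printed dimensions (query/memory 128-d, location path 2 → 32 → 128); the weight values are random stand-ins and only the wiring follows the module.

```python
import numpy as np

rng = np.random.default_rng(0)
F_half, d = 12, 128

W_q = rng.normal(scale=0.1, size=(128, d))        # query_layer
W_m = rng.normal(scale=0.1, size=(128, d))        # memory_layer
W_f = rng.normal(scale=0.1, size=(32, d))         # location_dense
conv_w = rng.normal(scale=0.1, size=(32, 2, 63))  # location_conv
v = rng.normal(scale=0.1, size=(d,))              # v

def conv1d(x, w, pad):
    # x: (C_in, L), w: (C_out, C_in, K) -> (C_out, L), 'same' padding
    C_out, C_in, K = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.empty((C_out, x.shape[1]))
    for i in range(x.shape[1]):
        out[:, i] = np.tensordot(w, xp[:, i:i + K], axes=([1, 2], [0, 1]))
    return out

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

memory = rng.normal(size=(F_half, 128))            # encoder outputs
query = rng.normal(size=(128,))                    # decoder RNN state
attn_cat = rng.dirichlet(np.ones(F_half), size=2)  # prev + cumulative attention, (2, F_half)

loc_feat = conv1d(attn_cat, conv_w, pad=31).T      # (F_half, 32)
energies = np.tanh(query @ W_q + memory @ W_m + loc_feat @ W_f) @ v
weights = softmax(energies)                        # one row of the alignment, (F_half,)
```

Feeding the cumulative attention back in through location_conv is what pushes the alignment to move forward monotonically over the mel frames during training.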