Target Conf.
- Interspeech 2024 (deadline: March 2, 2024)
- ACM MM 2024 (deadline: April 12, 2024)
TODOs
- [x] Reproducing Grad-TTS using LJSpeech
- [x] Check official implementation of Grad-TTS
- [x] Download LJSpeech dataset
- [x] Download pretrained models of Grad-TTS and HiFi-GAN
- [x] Train and check result
- [ ] Reproducing Grad-TTS using LibriTTS (Multispeaker)
- [ ] Download LibriTTS dataset
- [ ] Train Grad-TTS using BEAT dataset
- [x] Data comparison: LJSpeech and BEAT dataset
- [x] BEAT dataset preprocessing (split sentences)
- [x] Handle different sampling rates (Grad-TTS 22.05 kHz, BEAT 16 kHz)
- [x] Check output sound quality
- [ ] Finetuning multispeaker setting
- [ ] Implement face module
- [ ] Use pretrained TTS model
- Since the BEAT dataset is multispeaker with only a short recording time per speaker (1–4 hours), making good use of a pretrained model seems important
- [ ] Train together
- [ ] Train from scratch using BEAT only
- [ ] Train TTS using other datasets and fine-tune together using BEAT
- [ ] Research on the diffusion methods
- [ ] DDPM vs. Score-based Diffusion Model vs. Consistency Model
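The sampling-rate mismatch noted above (Grad-TTS/LJSpeech at 22.05 kHz vs. BEAT at 16 kHz) can be bridged by resampling the BEAT audio before feature extraction. A minimal sketch using scipy's polyphase resampler (scipy is an assumption here; the actual preprocessing code used is not shown in these notes):

```python
import numpy as np
from scipy.signal import resample_poly

def resample_16k_to_22k(wav: np.ndarray) -> np.ndarray:
    """Resample a 16 kHz waveform to 22.05 kHz.

    22050 / 16000 reduces to 441 / 320, so polyphase
    resampling with up=441, down=320 matches the target
    rate exactly (no drift from a fractional ratio).
    """
    return resample_poly(wav, up=441, down=320)

# Example: 1 second of 16 kHz audio -> 22050 samples
wav_16k = np.random.randn(16000).astype(np.float32)
wav_22k = resample_16k_to_22k(wav_16k)
```

An alternative is `librosa.resample(wav, orig_sr=16000, target_sr=22050)`; either way, resampling should happen before mel-spectrogram extraction so the mel parameters match the Grad-TTS/HiFi-GAN configuration.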
Update Notes
- (1. 26.) Data Comparison
- (2. 1.) Grad-TTS reproducing
- (2. 6.) BEAT dataset preprocessing (split sentences)
- (2. 7.) Fine-tuning Grad-TTS using BEAT dataset (only single speaker, 9_miranda)
- (working on) Implement face module
- !!! How does the face module handle the mismatch between the ground-truth durations and the duration predictor's output? Did Grad-TTS use teacher forcing?
- (working on) Research on the diffusion models
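One anchor for the DDPM vs. score-based comparison: in DDPM notation, the noise-prediction network and the score function differ only by a rescaling, which is why DDPM-style models and score-based models (like the one Grad-TTS builds on) are largely two views of the same process. A sketch of the standard identity (from the score-SDE literature, not from these notes):

```latex
% With the forward corruption x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,
% the score of the marginal p_t satisfies
\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}},
% so a trained noise predictor \epsilon_\theta gives a score estimate
s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}.
```

Consistency models sit on top of either view: they distill the multi-step sampler into a one- or few-step map, so the comparison is mainly about sampling cost rather than the training objective.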
Ref