A TensorFlow Implementation of Expressive Tacotron
This project aims at implementing the paper Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron to verify its concept. Most of the baseline code is based on my previous Tacotron implementation.
Requirements
- NumPy >= 1.11.1
- TensorFlow >= 1.3
- librosa
- tqdm
- matplotlib
- scipy
Data
Because the paper used internal data, I train the model on the LJ Speech Dataset.
The LJ Speech Dataset has recently become a widely used benchmark for TTS because it is publicly available. It contains 24 hours of reasonable-quality samples.
Training
- STEP 0. Download LJ Speech Dataset or prepare your own data.
- STEP 1. Adjust hyperparameters in `hyperparams.py`. (If you want to do preprocessing, set `prepro` to True.)
- STEP 2. Run `python train.py`. (If you set `prepro` to True, run `python prepro.py` first; a sketch of what that step typically computes follows this list.)
- STEP 3. Run `python eval.py` regularly during training.
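For reference, here is a rough sketch of what a Tacotron-style preprocessing step typically computes per wav file: a mel spectrogram and a linear magnitude spectrogram via librosa. The function name and the parameter values (`sr`, `n_fft`, `hop_length`, `win_length`, `n_mels`) are illustrative assumptions, not values read from this repo's `hyperparams.py`.

```python
# Rough sketch of a Tacotron-style preprocessing step (illustrative values,
# not taken from this repo's hyperparams.py).
import numpy as np
import librosa

def load_spectrograms(wav_path, sr=22050, n_fft=2048, hop_length=256,
                      win_length=1024, n_mels=80):
    y, _ = librosa.load(wav_path, sr=sr)
    # Linear magnitude spectrogram: [1 + n_fft/2, time]
    linear = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length,
                                 win_length=win_length))
    # Mel spectrogram: project the linear magnitudes onto a mel filter bank.
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = np.dot(mel_basis, linear)
    # Transpose to [time, bins], the layout most Tacotron code expects.
    return mel.T.astype(np.float32), linear.T.astype(np.float32)
```

Presumably `python prepro.py` caches arrays like these to disk so that `train.py` does not have to recompute them on the fly.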
Sample Synthesis
I generate speech samples based on the same script as the one used for the original web demo. You can check it in `test_sents.txt`.
- Run `python synthesize.py` and check the files in `samples` (see the vocoding sketch below).
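For context, the Tacotron baseline this repo builds on presumably reconstructs waveforms from the predicted magnitude spectrograms with Griffin-Lim rather than a neural vocoder (the paper's own samples used WaveNet; see Notes). A minimal Griffin-Lim sketch, with illustrative parameters:

```python
# Minimal Griffin-Lim sketch: iterate between time and frequency domains to
# estimate phase for a magnitude spectrogram. Parameters are illustrative,
# not taken from this repo's hyperparams.py.
import numpy as np
import librosa

def griffin_lim(mag, n_fft=2048, hop_length=256, win_length=1024, n_iter=50):
    """mag: [1 + n_fft/2, time] magnitude spectrogram. Returns a waveform."""
    # Start from a random phase.
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    spec = mag.astype(np.complex64) * phase
    for _ in range(n_iter):
        wav = librosa.istft(spec, hop_length=hop_length, win_length=win_length)
        rebuilt = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length,
                               win_length=win_length)
        # Keep the known magnitude, update only the phase estimate.
        spec = mag * np.exp(1j * np.angle(rebuilt))
    return librosa.istft(spec, hop_length=hop_length, win_length=win_length)
```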
Samples
16 sample sentences from the first chapter of the original web demo are collected for sample synthesis. Two audio clips per sentence are used for prosody embedding: a reference voice and a base voice. In most cases the two differ in gender or region. The samples are organized as follows:
- 1a: the first reference audio
- 1b: sample embedded with 1a's prosody
- 1c: the second reference audio (base)
- 1d: sample embedded with 1c's prosody
Check out the samples at each training step.
- 130k steps
- 300k steps (soon)
- 500k steps (soon)
- 1m steps (later)
Analysis
- Listening to the results at 130k steps, it's not clear whether the model has learned the prosody.
- It's clear that different reference audios produce different samples.
- Some samples are worth noting. For example, listen to the four audios of no. 15. The stress on "right" was obviously transferred.
- Check out no. 9, whose reference audios are sung. They are fun.
Notes
- Because this repo focuses on verifying the paper's concept, I did not follow some of its details.
- The paper used phoneme inputs, whereas I stuck to graphemes.
- The paper used GMM attention, whereas I kept Bahdanau attention.
- The original audio samples were obtained with a WaveNet vocoder.
- I'm still not sure whether what the paper calls a prosody embedding can really be isolated from the speaker.
- For prosody embedding, the authors employed conv2d. Why not conv1d? (A sketch of such a reference encoder follows this list.)
- When the reference audio's text or sentence structure is totally different from the inference script, what happens?
- If I have time, I'd like to implement their 2nd paper: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
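As a point of reference for the conv2d note above, here is a minimal sketch of a reference encoder in the spirit of the paper, written against the TF 1.x layers API this repo targets. The layer sizes follow my reading of the paper (six 3x3 conv2d layers with stride 2, then a GRU and a tanh projection); the function name, shapes, and sizes are illustrative and not taken from this repo's code.

```python
# Minimal sketch of a conv2d reference encoder (illustrative, not this repo's code).
import tensorflow as tf

def reference_encoder(ref_mels, is_training, embed_size=128):
    """ref_mels: [batch, time, n_mels] mel spectrogram of the reference audio
    (n_mels must be statically known). Returns a [batch, embed_size] prosody
    embedding."""
    x = tf.expand_dims(ref_mels, -1)  # -> [batch, time, n_mels, 1]
    for channels in (32, 32, 64, 64, 128, 128):
        # 3x3 conv with stride 2 on both time and frequency axes, then BN + ReLU.
        x = tf.layers.conv2d(x, channels, kernel_size=3, strides=2, padding="same")
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.relu(x)
    # Collapse the (now small) frequency and channel axes, keep time dynamic.
    n_freq, n_ch = x.get_shape().as_list()[2:]
    dyn = tf.shape(x)
    x = tf.reshape(x, [dyn[0], dyn[1], n_freq * n_ch])
    # Summarize the downsampled time axis with a GRU and take its final state.
    cell = tf.contrib.rnn.GRUCell(embed_size)
    _, final_state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)
    # Project to the prosody embedding with a tanh nonlinearity.
    return tf.layers.dense(final_state, embed_size, activation=tf.tanh)
```

As I understand the paper, the resulting embedding is then broadcast over the text encoder timesteps and combined with the encoder states before attention.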
April 2018, Kyubyong Park