Source: https://github.com/keonlee9420/PortaSpeech
# PortaSpeech - PyTorch Implementation
PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech.
## Model Size

| Module | Normal | Small | Normal (paper) | Small (paper) |
| --- | --- | --- | --- | --- |
| Total | 24M | 7.6M | 21.8M | 6.7M |
| LinguisticEncoder | 3.7M | 1.4M | - | - |
| VariationalGenerator | 11M | 2.8M | - | - |
| FlowPostNet | 9.3M | 3.4M | - | - |
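The per-module numbers above can be reproduced by summing parameter counts in PyTorch. A minimal sketch (the toy module below is only a stand-in; call the function on the actual PortaSpeech model or one of its submodules instead):

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> float:
    """Trainable parameter count of a module, in millions."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad) / 1e6

# Demo on a toy module; for the table above, call this on the full model
# and on each submodule (linguistic encoder, variational generator, postnet).
toy = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 80))
print(f"{count_parameters(toy):.2f}M parameters")
```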
## Quickstart

`DATASET` refers to the name of a dataset, such as `LJSpeech`, in the following sections.
### Dependencies

You can install the Python dependencies with

```
pip3 install -r requirements.txt
```

A `Dockerfile` is also provided for Docker users.
### Inference

You have to download the pretrained models and put them in `output/ckpt/DATASET/`.

For a single-speaker TTS, run

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET
```

The generated utterances will be put in `output/result/`.
### Batch Inference

Batch inference is also supported; try

```
python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET
```

to synthesize all utterances in `preprocessed_data/DATASET/val.txt`.
### Controllability

The speaking rate of the synthesized utterances can be controlled by specifying the desired duration ratio. For example, you can increase the speaking rate by 20% by setting the ratio to 0.8:

```
python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8
```

Please note that this controllability originates from FastSpeech2 and is not a primary concern of PortaSpeech.
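Mechanically, this kind of duration control simply rescales the predicted per-phoneme durations before the phoneme sequence is expanded to frame level. A rough sketch of the idea (not the repository's exact code):

```python
import torch

def control_durations(pred_durations: torch.Tensor, ratio: float) -> torch.Tensor:
    """Scale predicted per-phoneme durations (in frames) by `ratio`.

    ratio < 1.0 means fewer frames per phoneme, i.e. faster speech;
    --duration_control 0.8 gives roughly 20% fewer frames per phoneme.
    """
    return torch.clamp(torch.round(pred_durations * ratio), min=1).long()

durations = torch.tensor([5.0, 12.0, 8.0])  # predicted frames per phoneme
print(control_durations(durations, 0.8))    # tensor([ 4, 10,  6])
```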
## Training

### Datasets

The supported datasets are:

- LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
### Preprocessing

First, run

```
python3 prepare_align.py --dataset DATASET
```

for some preparations.
For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here; you have to unzip the files into `preprocessed_data/DATASET/TextGrid/`. Alternatively, you can run the aligner yourself.
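Each aligned utterance is stored as a `.TextGrid` file whose `phones` tier holds the phoneme boundaries. As an illustration of what these files contain, here is a sketch that converts boundaries into frame-level durations using the `tgt` library; the library choice, hop parameters, and file path are assumptions for illustration, not necessarily what `preprocess.py` does:

```python
import tgt  # pip install tgt

def textgrid_to_durations(path, sampling_rate=22050, hop_length=256):
    """Read an MFA TextGrid and return (phonemes, frames per phoneme)."""
    textgrid = tgt.io.read_textgrid(path)
    tier = textgrid.get_tier_by_name("phones")
    phones, durations = [], []
    for interval in tier.intervals:
        phones.append(interval.text)
        start_frame = int(round(interval.start_time * sampling_rate / hop_length))
        end_frame = int(round(interval.end_time * sampling_rate / hop_length))
        durations.append(end_frame - start_frame)
    return phones, durations

# Illustrative path; adjust to wherever you unzipped the TextGrid files.
phones, durations = textgrid_to_durations(
    "preprocessed_data/LJSpeech/TextGrid/LJ001-0001.TextGrid"
)
print(list(zip(phones, durations))[:5])
```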
After that, run the preprocessing script:

```
python3 preprocess.py --dataset DATASET
```
### Training

Train your model with

```
python3 train.py --dataset DATASET
```
Useful options:

- To use Automatic Mixed Precision, append `--use_amp` to the above command.
- The trainer assumes single-node multi-GPU training. To use specific GPUs, prepend `CUDA_VISIBLE_DEVICES=<GPU_IDs>` to the above command.
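For reference, `--use_amp` turns on PyTorch's automatic mixed precision, which generally follows the standard `autocast` + `GradScaler` pattern. A generic sketch of that pattern (not the repository's exact trainer code):

```python
import torch

use_amp = True  # conceptually, what --use_amp toggles
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, batch, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = model(batch)        # forward pass runs in fp16 where safe
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adapt the loss scale for the next step
    return loss.item()
```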
## TensorBoard

Use

```
tensorboard --logdir output/log
```

to serve TensorBoard on your localhost.
## Notes

- For the vocoder, both HiFi-GAN and MelGAN are supported.
- The convergence of the word-to-phoneme alignment in LinguisticEncoder is sped up by dividing long words into subwords and sorting the dataset by mel-spectrogram frame length (a sketch of the sorting follows this list).
- ReLU activation and LayerNorm are removed from VariationalGenerator to avoid mashed output.
- The implementation will be extended to a multi-speaker TTS.
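For the sorting trick in the second note, a minimal sketch (the field names are hypothetical, not the repository's actual dataset schema):

```python
import numpy as np

def sort_by_mel_length(samples):
    """Order samples by mel-spectrogram frame count so that batches
    group utterances of similar length (less padding per batch)."""
    return sorted(samples, key=lambda s: s["mel"].shape[0])

# Toy example: three "utterances" with different frame counts.
samples = [{"id": i, "mel": np.zeros((n, 80))} for i, n in enumerate([620, 180, 410])]
print([s["id"] for s in sort_by_mel_length(samples)])  # [1, 2, 0], shortest first
```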
## Citation

Please cite this repository via the "Cite this repository" button in the About section (top right of the repository's main page).