

# VoiceFilter

Unofficial PyTorch implementation of Google AI's VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.
## Result

- Training took about 20 hours on an AWS p3.2xlarge instance (NVIDIA V100).

### Audio Sample

- Listen to audio samples at: http://swpark.me/voicefilter/

### Metric

| Median SDR         | Paper | Ours |
| ------------------ | ----- | ---- |
| before VoiceFilter | 2.5   | 1.9  |
| after VoiceFilter  | 12.6  | 10.2 |

- SDR converged at about 10, which is slightly lower than the paper's.
## Dependencies

- Python and packages

  This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:

      pip install -r requirements.txt

- Miscellaneous

  ffmpeg-normalize is used for resampling and normalizing wav files. See the README.md of ffmpeg-normalize for installation instructions.
## Prepare Dataset

- Download LibriSpeech dataset

  To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/.
  `train-clean-100.tar.gz` (6.3G) contains speech from 252 speakers, and `train-clean-360.tar.gz` (23G) contains 922 speakers. You may use either, but the more speakers the dataset has, the better VoiceFilter will perform.
- Resample & Normalize wav files

  First, unzip the `tar.gz` file to the desired folder:

      tar -xvzf train-clean-360.tar.gz

  Next, copy `utils/normalize-resample.sh` to the root directory of the unzipped data folder. Then:

      vim normalize-resample.sh # set "N" as your CPU core number.
      chmod a+x normalize-resample.sh
      ./normalize-resample.sh # this may take long
- Edit `config.yaml`

      cd config
      cp default.yaml config.yaml
      vim config.yaml
- Preprocess wav files

  In order to boost training speed, perform the STFT for each file before training by:

      python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

  This will create 100,000 (train) + 1,000 (test) samples, taking about 160G of disk space. A rough sketch of what this preprocessing does is shown after this list.
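To make the preprocessing step more concrete, here is a minimal sketch of the kind of work `generator.py` performs for each file: compute a magnitude spectrogram once and cache it as a tensor so training does not repeat the STFT. The STFT parameters and file names below are illustrative assumptions, not the repo's exact values; `config.yaml` is authoritative.

```python
# Minimal sketch (not the repo's generator.py): compute and cache a magnitude
# spectrogram so the STFT is not recomputed on every training step.
# The n_fft / hop_length / win_length values are assumptions; use config.yaml.
import librosa
import numpy as np
import torch

def wav_to_mag(path, sr=16000, n_fft=1200, hop_length=160, win_length=400):
    """Load a wav, resample to `sr`, and return its magnitude spectrogram."""
    wav, _ = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(wav, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
    return np.abs(stft)

# Hypothetical file names for one mixed/target training pair.
sample = {
    'mixed': torch.from_numpy(wav_to_mag('mixed-000000.wav')),
    'target': torch.from_numpy(wav_to_mag('target-000000.wav')),
}
torch.save(sample, 'sample-000000.pt')
```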
## Train VoiceFilter
- Get pretrained model for speaker recognition system

  VoiceFilter utilizes a speaker recognition system (d-vector embeddings). Here, we provide a pretrained model for obtaining d-vector embeddings.

  This model was trained with the VoxCeleb2 dataset, where utterances are randomly fit to a time length of [70, 90] frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%.

  The model can be downloaded at this GDrive link. A sketch of how a d-vector embedder of this kind is typically used appears after these steps.
- Run

  After specifying `train_dir` and `test_dir` in `config.yaml`, run:

      python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

  This will create `chkpt/name` and `logs/name` at the base directory (`-b` option, `.` by default).
- View tensorboardX

      tensorboard --logdir ./logs
- Resuming from checkpoint

      python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name
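As mentioned above, here is a hedged sketch of how a pretrained d-vector embedder of this kind is typically applied to a reference utterance. The class, layer sizes, and checkpoint name are illustrative stand-ins in the GE2E speaker-verification style, not the repo's exact interface; consult the model code and `trainer.py` for the real one.

```python
# Illustrative stand-in for the d-vector embedder (NOT the repo's exact model):
# an LSTM over mel frames with a linear projection, as in GE2E-style speaker
# verification. Layer sizes and the checkpoint name are assumptions.
import torch
import torch.nn as nn

class DVectorEmbedder(nn.Module):
    def __init__(self, num_mels=40, hidden=768, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(num_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel):                    # mel: [T, num_mels]
        out, _ = self.lstm(mel.unsqueeze(0))   # [1, T, hidden]
        emb = self.proj(out[:, -1])            # last-frame summary -> [1, emb_dim]
        return emb.squeeze(0) / emb.norm()     # length-normalized d-vector

embedder = DVectorEmbedder().eval()
# embedder.load_state_dict(torch.load('embedder.pt', map_location='cpu'))  # hypothetical checkpoint

with torch.no_grad():
    ref_mel = torch.randn(80, 40)              # stand-in for the reference utterance's mel frames
    dvec = embedder(ref_mel)                   # speaker embedding used to condition VoiceFilter
```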
## Evaluate

    python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]
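For orientation, the following is a conceptual sketch of the end-to-end flow behind `inference.py`: take the STFT of the mixed audio, embed the reference utterance, predict a soft mask conditioned on that embedding, apply the mask to the mixed magnitude, and invert with the mixed phase. The model call is stubbed out, and all file names and STFT settings are assumptions rather than the repo's actual API.

```python
# Conceptual sketch of the separation flow (file names, STFT settings, and the
# mask model are stand-ins, not the repo's actual API).
import librosa
import numpy as np
import soundfile as sf

SR, N_FFT, HOP, WIN = 16000, 1200, 160, 400     # assumed settings; see config.yaml

def predict_mask(mag, dvec):
    """Stub for the trained VoiceFilter network: it would return a soft mask in
    [0, 1] with the same shape as `mag`, conditioned on the speaker d-vector.
    All-ones here just so the sketch runs end to end."""
    return np.ones_like(mag)

mixed, _ = librosa.load('mixed.wav', sr=SR)      # hypothetical input path
stft = librosa.stft(mixed, n_fft=N_FFT, hop_length=HOP, win_length=WIN)
mag, phase = np.abs(stft), np.angle(stft)

dvec = np.zeros(256)                             # placeholder for the reference speaker's d-vector
est_mag = mag * predict_mask(mag, dvec)          # mask the mixed magnitude

est = librosa.istft(est_mag * np.exp(1j * phase), hop_length=HOP, win_length=WIN)
sf.write('result.wav', est, SR)                  # hypothetical output path
```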
## Possible improvements

These are some of my personal opinions for improvement. If you have other ideas, don't hesitate to open an issue.

- Masks performed poorly on high-frequency channels.
  - Training the embedder system with a linear-scale spectrogram instead of mel might improve this.
- Replace zero-padding with partial convolution.
- Try power-law compressed reconstruction error as the loss function, instead of MSE (see the sketch below).
  - Tried `power=0.3`, but failed.
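As referenced in the list above, here is a minimal sketch of what a power-law compressed reconstruction loss could look like: magnitudes are compressed with an exponent before taking the MSE, which raises the relative weight of low-energy (often high-frequency) bins compared with plain MSE. The function name and the epsilon are assumptions; `power=0.3` is the value tried above.

```python
# Minimal sketch of a power-law compressed reconstruction loss (assumed form,
# not the repo's implementation). power=0.3 is the value mentioned above.
import torch

def power_law_mse(est_mag, target_mag, power=0.3, eps=1e-8):
    """MSE between power-law-compressed magnitude spectrograms."""
    return torch.mean(((est_mag + eps) ** power - (target_mag + eps) ** power) ** 2)

# Usage: loss = power_law_mse(predicted_mag, target_mag)
```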
## Author

Seungwon Park at MINDsLab ([email protected], [email protected])
## License

Apache License 2.0

This repository contains code adapted or copied from the following:

- `utils/adabound.py` from https://github.com/Luolc/AdaBound (Apache License 2.0)
- `utils/audio.py` from https://github.com/keithito/tacotron (MIT License)
- `utils/hparams.py` from https://github.com/HarryVolek/PyTorch_Speaker_Verification (no license specified)
- `utils/normalize-resample.sh` from https://unix.stackexchange.com/a/216475