VoiceFilter

Unofficial PyTorch implementation of Google AI's VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Result

Training took about 20 hours on an AWS p3.2xlarge instance (NVIDIA V100).

Audio Sample

Listen to audio samples on the project webpage: http://swpark.me/voicefilter/

Metric

| Median SDR | Paper | Ours |
| --- | --- | --- |
| before VoiceFilter | 2.5 | 1.9 |
| after VoiceFilter | 12.6 | 10.2 |

The median SDR converged at about 10, which is slightly lower than the paper's 12.6.
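
For readers unfamiliar with the metric, the following is a minimal sketch of a simplified signal-to-distortion ratio (SDR) in dB and its median over several utterances. It assumes time-aligned waveforms of equal length; the function names `sdr_db` and `median_sdr` are hypothetical, and the evaluation code in this repository may rely on a different SDR implementation (for example a BSS Eval toolkit), so the numbers are not guaranteed to match the table.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-8):
    """Simplified SDR in dB: reference energy over residual energy.

    Assumption: both signals are time-aligned and have the same length.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    residual = reference - estimate
    ratio = (np.sum(reference ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return float(10.0 * np.log10(ratio))


def median_sdr(pairs):
    """Median SDR over an iterable of (reference, estimate) pairs."""
    return float(np.median([sdr_db(ref, est) for ref, est in pairs]))
```

"Before VoiceFilter" would correspond to passing the mixed audio as `estimate`, and "after VoiceFilter" to passing the separated output.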

Dependencies

Python and packages

This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed with:
```
pip install -r requirements.txt
```
Miscellaneous

ffmpeg-normalize is used for resampling and normalizing wav files. See the README.md of ffmpeg-normalize for installation instructions.

Prepare Dataset

Download LibriSpeech dataset

To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/. train-clean-100.tar.gz (6.3G) contains speech of 252 speakers, and train-clean-360.tar.gz (23G) contains 922 speakers. You may use either, but the more speakers the dataset has, the better VoiceFilter will perform.

Resample & Normalize wav files

First, extract the tar.gz file into the desired folder:

tar -xvzf train-clean-360.tar.gz

Next, copy utils/normalize-resample.sh to the root directory of the extracted data folder. Then:

vim normalize-resample.sh # set "N" to the number of CPU cores
chmod a+x normalize-resample.sh
./normalize-resample.sh # this may take a long time

Edit config.yaml

cd config
cp default.yaml config.yaml
vim config.yaml

Preprocess wav files

To speed up training, compute the STFT of each file before training with:
```
python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]
```
This will create 100,000 training samples and 1,000 test samples (about 160 GB in total).
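
To illustrate what "performing the STFT before training" means, here is a minimal NumPy sketch that turns a waveform into a magnitude spectrogram. It is not the code in generator.py; the function name `stft_magnitude` is hypothetical, and the parameter values (`n_fft=1200`, `hop_length=160`, `win_length=400`) are assumptions for illustration, so check config.yaml for the values actually used.

```python
import numpy as np


def stft_magnitude(wav, n_fft=1200, hop_length=160, win_length=400):
    """Return a (frames, n_fft // 2 + 1) magnitude spectrogram.

    Assumption: parameter defaults are illustrative, not read from config.yaml.
    """
    wav = np.asarray(wav, dtype=np.float64)
    if len(wav) < win_length:
        # Pad very short inputs so that at least one frame exists.
        wav = np.pad(wav, (0, win_length - len(wav)))
    window = np.hanning(win_length)
    n_frames = 1 + (len(wav) - win_length) // hop_length
    frames = np.stack([
        wav[i * hop_length:i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # rfft zero-pads each windowed frame from win_length up to n_fft.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```

Computing these arrays once and storing them on disk trades storage for training speed, which is why the preprocessed dataset is so large.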

Train VoiceFilter

Get pretrained model for speaker recognition system

VoiceFilter relies on a speaker recognition system (d-vector embeddings). Here, we provide a pretrained model for obtaining d-vector embeddings.

This model was trained with the VoxCeleb2 dataset, where utterances are randomly fit to a length in the range [70, 90] frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%.
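
The "window 80 / hop 40" setting can be read as sliding an 80-frame window over the utterance in steps of 40 frames. The sketch below shows that segmentation and, as an assumption about how the per-window embeddings are combined, averages them into one utterance-level d-vector. `sliding_windows`, `utterance_dvector`, and `embed_fn` are hypothetical names; `embed_fn` stands in for the pretrained embedder, whose real interface may differ.

```python
import numpy as np


def sliding_windows(mel, window=80, hop=40):
    """Split a (frames, n_mels) array into overlapping (window, n_mels) chunks."""
    if mel.shape[0] < window:
        raise ValueError("utterance is shorter than one window")
    starts = range(0, mel.shape[0] - window + 1, hop)
    return np.stack([mel[s:s + window] for s in starts])


def utterance_dvector(mel, embed_fn, window=80, hop=40):
    """Average per-window embeddings (assumed aggregation, not confirmed)."""
    chunks = sliding_windows(mel, window, hop)
    vectors = np.stack([embed_fn(chunk) for chunk in chunks])
    return vectors.mean(axis=0)
```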

The model can be downloaded at this GDrive link.

Run

After specifying train_dir and test_dir in config.yaml, run:
```
python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]
```
This will create chkpt/name and logs/name in the base directory (set with the -b option, . by default).

View tensorboardX
```
tensorboard --logdir ./logs
```

Resuming from checkpoint

python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name

Evaluate

python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]

Possible improvements

These are some of my personal ideas for improvement. If you have other ideas, don't hesitate to open an issue.

- Masks performed poorly on high-frequency channels.
  - Training the embedder system with linear-scale spectrograms instead of mel might improve this.
- Replace zero-padding with partial convolution.
- Try power-law compressed reconstruction error as the loss function, instead of MSE.
  - Tried power=0.3, but it failed.
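
As a rough illustration of the last idea, the sketch below compares plain MSE with an MSE computed on power-law compressed magnitudes. It is written in NumPy for clarity rather than as a drop-in training loss, the function names are hypothetical, and it makes no claim about why the power=0.3 experiment failed.

```python
import numpy as np


def mse_loss(estimated_mag, target_mag):
    """Mean squared error between two magnitude spectrograms."""
    estimated_mag = np.asarray(estimated_mag, dtype=np.float64)
    target_mag = np.asarray(target_mag, dtype=np.float64)
    return float(np.mean((estimated_mag - target_mag) ** 2))


def power_law_loss(estimated_mag, target_mag, power=0.3):
    """MSE after compressing each magnitude as |X| ** power.

    Compression shrinks large values more than small ones, so quiet
    time-frequency bins carry relatively more weight than under plain MSE.
    """
    # Clip at zero because magnitudes are assumed to be non-negative.
    est = np.power(np.maximum(estimated_mag, 0.0), power)
    tgt = np.power(np.maximum(target_mag, 0.0), power)
    return float(np.mean((est - tgt) ** 2))
```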

Author

Seungwon Park at MINDsLab ([email protected], [email protected])

License

Apache License 2.0

This repository contains code adapted or copied from the following:

- utils/adabound.py from https://github.com/Luolc/AdaBound (Apache License 2.0)
- utils/audio.py from https://github.com/keithito/tacotron (MIT License)
- utils/hparams.py from https://github.com/HarryVolek/PyTorch_Speaker_Verification (No License specified)
- utils/normalize-resample.sh from https://unix.stackexchange.com/a/216475
