GitHub - mindslab-ai/voicefilter: Unofficial PyTorch implementation of Google AI...

 4 years ago
source link: https://github.com/mindslab-ai/voicefilter



Unofficial PyTorch implementation of Google AI's: VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.


  • Training took about 20 hours on an AWS p3.2xlarge (NVIDIA V100).

Audio Sample


Median SDR            Paper   Ours
before VoiceFilter    2.5     1.9
after VoiceFilter     12.6    10.2

  • Median SDR converged at about 10 dB, which is slightly lower than the paper's 12.6 dB.
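
For readers unfamiliar with the metric, SDR (signal-to-distortion ratio) is expressed in dB. The sketch below shows a simplified definition: the energy of the clean target divided by the energy of the residual error. It is an illustration only, and the names `simple_sdr` and `median_sdr` are assumptions. The results above may have been computed with a full BSS-eval style SDR, which is more permissive about distortions, so values from this sketch are not directly comparable.

```python
import numpy as np


def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero and log of zero
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))


def median_sdr(pairs):
    """Median of simple_sdr over (reference, estimate) pairs."""
    return float(np.median([simple_sdr(ref, est) for ref, est in pairs]))
```

A median over the test utterances is used rather than a mean so that a few badly separated samples do not dominate the reported number.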


  1. Python and packages

    This code was tested on Python 3.6 with PyTorch 1.0.1. Other packages can be installed by:

    pip install -r requirements.txt
  2. Miscellaneous

    ffmpeg-normalize is used for resampling and normalizing wav files. See README.md of ffmpeg-normalize for installation.

Prepare Dataset

  1. Download LibriSpeech dataset

    To replicate the VoiceFilter paper, get the LibriSpeech dataset at http://www.openslr.org/12/. train-clean-100.tar.gz (6.3G) contains speech from 252 speakers, and train-clean-360.tar.gz (23G) contains speech from 922 speakers. You may use either, but the more speakers the dataset has, the better VoiceFilter will perform.

  2. Resample & Normalize wav files

    First, extract the tar.gz file to the desired folder:

    tar -xvzf train-clean-360.tar.gz

    Next, copy utils/normalize-resample.sh to the root directory of the extracted data folder. Then:

    vim normalize-resample.sh # set "N" to the number of CPU cores.
    chmod a+x normalize-resample.sh
    ./normalize-resample.sh # this may take a long time
  3. Edit config.yaml

    cd config
    cp default.yaml config.yaml
    vim config.yaml
  4. Preprocess wav files

    To speed up training, compute the STFT of each file beforehand:

    python generator.py -c [config yaml] -d [data directory] -o [output directory] -p [processes to run]

    This will create 100,000 training and 1,000 test samples (about 160G).
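
To illustrate what the preprocessing step precomputes, here is a minimal NumPy sketch of a magnitude STFT. It is not the repository's generator.py: the function name, frame length, hop size, and Hann window below are assumed values for illustration, and the real parameters come from the config file.

```python
import numpy as np


def magnitude_stft(wav, n_fft=512, hop=128):
    """Return (magnitude, phase) arrays of shape (frames, n_fft // 2 + 1)."""
    wav = np.asarray(wav, dtype=np.float64)
    if len(wav) < n_fft:
        # pad short clips so that at least one full frame exists
        wav = np.pad(wav, (0, n_fft - len(wav)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    # slice overlapping frames and apply the window to each
    frames = np.stack(
        [wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)
```

Caching these arrays on disk trades storage for training speed, which is why the generated data is large.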

Train VoiceFilter

  1. Get pretrained model for speaker recognition system

    VoiceFilter relies on a speaker recognition system (d-vector embeddings). We provide a pretrained model for obtaining d-vector embeddings.

    This model was trained on the VoxCeleb2 dataset, with utterances randomly cropped to a length of [70, 90] frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%.

    The model can be downloaded at this GDrive link.

  2. Run

    After specifying train_dir and test_dir in config.yaml, run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]

    This will create chkpt/name and logs/name in the base directory (-b option, . by default).

  3. View tensorboardX

    tensorboard --logdir ./logs

  4. Resuming from checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m [name]
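
When resuming, it can be convenient to pick the most recent checkpoint automatically. The helper below is a hypothetical convenience and not part of the repository; it assumes only the chkpt/name/chkpt_{step}.pt naming shown above.

```python
import re
from pathlib import Path


def latest_checkpoint(chkpt_dir):
    """Return the chkpt_{step}.pt path with the largest step, or None."""
    pattern = re.compile(r"chkpt_(\d+)\.pt$")
    best_step, best_path = -1, None
    for path in Path(chkpt_dir).glob("chkpt_*.pt"):
        match = pattern.match(path.name)
        # skip files whose suffix is not a plain integer step
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

The returned path can then be passed to --checkpoint_path. Steps are compared as integers, so chkpt_20000.pt correctly ranks above chkpt_3000.pt.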


python inference.py -c [config yaml] -e [path of embedder pt file] --checkpoint_path [path of chkpt pt file] -m [path of mixed wav file] -r [path of reference wav file] -o [output directory]
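
Conceptually, spectrogram masking multiplies the magnitude spectrogram of the mixture by a predicted soft mask and reuses the phase of the mixture to rebuild a complex spectrogram. The NumPy sketch below illustrates only that step. The function and argument names are hypothetical, the clipping to [0, 1] is an assumption about the mask range, and this is not the repository's inference.py.

```python
import numpy as np


def apply_mask(mixed_mag, mixed_phase, mask):
    """Apply a soft mask to the mixture magnitude and reattach the mixture phase."""
    # assumption: a valid soft mask lies in [0, 1]
    mask = np.clip(mask, 0.0, 1.0)
    est_mag = mask * mixed_mag
    # combine estimated magnitude with the unmodified mixture phase
    return est_mag * np.exp(1j * mixed_phase)
```

The resulting complex spectrogram would then be passed through an inverse STFT to obtain the separated waveform.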

Possible improvements

These are some of my personal ideas for improvement. If you have other ideas, don't hesitate to open an issue.

  • Masks performed poorly on high-frequency channels.
    • Training embedder system with linear-scale spectrogram instead of mel might improve this.
  • Replace zero-padding with partial convolution.
  • Try power-law compressed reconstruction error as loss function, instead of MSE.
    • Tried power=0.3, but failed.
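
For the last idea, a power-law compressed reconstruction error compares magnitudes after raising them to a power below 1, which gives quiet time-frequency bins relatively more weight than plain MSE does. The following is a minimal NumPy sketch assuming magnitude-only compression; the function name is hypothetical, and this is not necessarily the variant that was tried in this repository.

```python
import numpy as np


def power_law_mse(target_mag, est_mag, power=0.3):
    """Mean squared error between power-law compressed magnitudes."""
    # magnitudes are non-negative; clamp to avoid NaN from fractional powers
    target_c = np.power(np.maximum(target_mag, 0.0), power)
    est_c = np.power(np.maximum(est_mag, 0.0), power)
    return float(np.mean((target_c - est_c) ** 2))
```

With power=1.0 this reduces to ordinary MSE on magnitudes, which makes it easy to compare the two losses in the same training setup.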


Seungwon Park at MINDsLab ([email protected], [email protected])


Apache License 2.0

This repository contains code adapted or copied from the following:
