
Russian Open Speech To Text (STT/ASR) Dataset

Arguably the largest public Russian STT dataset to date:

  • ~4.6m utterances;
  • ~4000 hours;
  • 431 GB;
  • Additional 1,500 hours ... and more ... to be released soon!;
  • And then maybe even more hours to be released!;

Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.


Dataset composition

| Dataset | Utterances | Hours | GB | Avg. duration, s / chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| public_youtube1500 (coming soon) | | 1,500 | | | | | |
| audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
| Total | 4,657,291 | 3,961 | 431 | | | | |

(*) Automatic alignment

This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.

Update 2019-05-07: help needed!

If you want to support the project, you can:

  • Help us with hosting (create a mirror) / provide a reliable node for torrent;
  • Help us with writing some helper functions;
  • Donate (each coffee pays for several full downloads) / use our DO referral link to help;

We are currently converting the dataset to MP3. If you would like to help, please reach out via the contacts below.

Downloads

Links

Meta data file.

| Dataset | GB | GB, compressed | Audio | Source | Manifest |
|---|---|---|---|---|---|
| audiobook_2 | 166 | 131.7 | part1, part2, part3, part4, part5, part6, part7 | Sources from the Internet + alignment | link |
| asr_public_phone_calls_2 | 66 | 51.7 | part1, part2, part3 | Sources from the Internet + ASR | link |
| asr_public_stories_2 | 9 | 7.5 | part1 | Sources from the Internet + alignment | link |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | part1, part2, part3, part4 | TTS | link |
| public_youtube700 | 75.0 | 67.0 | part1, part2, part3, part4 | YouTube videos | link |
| asr_public_phone_calls_1 | 22.7 | 19.0 | part1 | Sources from the Internet + ASR | link |
| asr_public_stories_1 | 4.1 | 3.8 | part1 | Public stories | link |
| public_series_1 | 1.9 | 1.7 | part1 | Public series | link |
| ru_RU | 1.9 | 1.4 | part1 | Caito.de dataset | link |
| voxforge_ru | 1.9 | 1.5 | part1 | Voxforge dataset | link |
| russian_single | 0.9 | 0.7 | part1 | Russian single speaker dataset | link |
| public_lecture_1 | 0.7 | 0.6 | part1 | Sources from the Internet | link |
| Total | 190 | 163 | | | |

Download instructions

  1. Download each dataset separately:

Via wget

wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

For multi-threaded downloads, use aria2 with the -x flag, e.g.

aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file

If necessary, merge chunks like this:

cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz

  2. Download the meta data and manifests for each dataset;
  3. Merge files (where applicable), unpack and enjoy! A Python alternative to cat for merging the chunks is sketched below.
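If cat is not available (e.g. on Windows), here is a minimal Python sketch for merging the downloaded chunks; the archive name below is just an example and should be replaced with the dataset you downloaded.

from pathlib import Path
import shutil

# example: merge ru_open_stt_v01.tar.gz_aa, _ab, ... back into one archive
chunks = sorted(Path('.').glob('ru_open_stt_v01.tar.gz_*'))

with open('ru_open_stt_v01.tar.gz', 'wb') as out:
    for chunk in chunks:
        with open(chunk, 'rb') as part:
            # stream each part into the combined archive
            shutil.copyfileobj(part, out)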

Check md5sum

md5sum /path/to/downloaded/file

| type | md5sum | file |
|---|---|---|
| manifest | b0ce7564ba90b121aeb13aada73a6e30 | asr_public_phone_calls_1.csv |
| manifest | 6867d14dfdec1f9e9b8ca2f1de9ceda6 | asr_public_phone_calls_2.csv |
| manifest | 0bdd77e15172e654d9a1999a86e92c7f | asr_public_stories_1.csv |
| manifest | f388013039d94dc36970547944db51c7 | asr_public_stories_2.csv |
| manifest | 3b67e27c1429593cccbf7c516c4b582d | private_buriy_audiobooks_2.csv |
| manifest | 04027c20eb3aff05f6067957ecff856b | public_lecture_1.csv |
| manifest | 89da3f1b6afcd4d4936662ceabf3033e | public_series_1.csv |
| manifest | a81dfb018c88d0ecd5194ab3d8ff6c95 | public_youtube700.csv |
| manifest | c858f020729c34ba0ab525bbb8950d0c | ru_RU.csv |
| manifest | 0275525914825dec663fd53390fdc9a0 | russian_single.csv |
| manifest | 52f406f4e30fcc8c634f992befd91beb | tts_russian_addresses_rhvoice_4voices.csv |
| audio | a5496898ee78654bf398ec6df71540d7 | asr_public_phone_calls_1.tar.gz |
| audio | e4df5ef50787384648b59f5a87edc0c6 | asr_public_phone_calls_2.tar.gz |
| audio | 97594127a922df8a7bcc2eecd2470805 | asr_public_phone_calls_2.tar.gz_aa |
| audio | f9b6475f0f2898b16d9e6e0e648fb531 | asr_public_phone_calls_2.tar.gz_ab |
| audio | b19977c889cda639f621195251e6bb6f | asr_public_phone_calls_2.tar.gz_ac |
| audio | 657a31b544b10295f909ef4b2ca5c156 | asr_public_stories_1.tar.gz |
| audio | 7533581bb26975212817bcacb25546d0 | asr_public_stories_2.tar.gz |
| audio | 3955616cd89761bf2d54d0e992f7eae5 | audiobooks_2.tar.gz_aa |
| audio | 81b6ec147c0c43bdd56002c41e0288b8 | audiobooks_2.tar.gz_ab |
| audio | 15d4cf99171c2db3f375619f4bd2b6d9 | audiobooks_2.tar.gz_ac |
| audio | 50635b0f4bdf44fae96e5a65f4738e19 | audiobooks_2.tar.gz_ad |
| audio | f1103be39ffc2da4a98d8f6ddeb50aa0 | audiobooks_2.tar.gz_ae |
| audio | 8b45d2bd8b1fa1d906e36b9fabd9fe4c | audiobooks_2.tar.gz_af |
| audio | 5104df44933b612b3c1bfc06f6376654 | audiobooks_2.tar.gz_ag |
| audio | e6b9e5f46811d33ea34ce50f6067a762 | public_lecture_1.tar.gz |
| audio | 86ebf7e30986b8ee8df11f85b35588a0 | public_series_1.tar.gz |
| audio | dc260dd8151b4fce6cde6d80af13146d | public_youtube700.tar.gz_aa |
| audio | 04706ef0f98841ec8d2f20a83aca3cf1 | public_youtube700.tar.gz_ab |
| audio | e11d5b118bf71425e4915e61277a06a9 | public_youtube700.tar.gz_ac |
| audio | d9a93157263eb9d8078c0e0b88c271de | public_youtube700.tar.gz_ad |
| audio | 1bbba5eb2f4911c9ed20ec69cbd292cb | ru_ru.tar.gz |
| audio | 6f79a9c514ad48a5763e3142919fc765 | russian_single.tar.gz |
| audio | c926df1068218eb9cc8103c94003fcc6 | tts_russian_addresses_rhvoice_4voices.tar |
| audio | 31d515e0bdfc467c3fe63088b817c15c | tts_russian_addresses_rhvoice_4voices.tar.gz_aa |
| audio | 4ca15694a8d8a638bbdc5e90832eadb4 | tts_russian_addresses_rhvoice_4voices.tar.gz_ab |
| audio | 447559a38cd8bf61c5de64e602f06da3 | tts_russian_addresses_rhvoice_4voices.tar.gz_ac |
| audio | 9131347a97c2e794d7c6d5a265083e83 | tts_russian_addresses_rhvoice_4voices.tar.gz_ad |
| audio | 91e2115b17b1ad08649f428d2caa643b | voxforge_ru.tar.gz |
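If you prefer to verify a download programmatically rather than eyeballing md5sum output, a minimal Python sketch follows; the example hash is copied from the table above, and the file is assumed to be in the current directory.

import hashlib

def file_md5(path, block_size=16 * 1024 * 1024):
    # stream the file in blocks so large archives do not have to fit in memory
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            md5.update(block)
    return md5.hexdigest()

# expected value taken from the md5sum table above
expected = 'a5496898ee78654bf398ec6df71540d7'
assert file_md5('asr_public_phone_calls_1.tar.gz') == expected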

Annotation methodology

The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.

Audio normalization

All files are normalized for easier / faster runtime augmentations and processing, as follows (a sketch is given after the list):

  • Converted to mono, if necessary;
  • Converted to 16 kHz sampling rate, if necessary;
  • Stored as 16-bit integers;
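A minimal sketch of this normalization, assuming librosa is used for resampling (the repository itself does not prescribe a particular tool, and the input file name is just an example):

import librosa
import numpy as np
from scipy.io import wavfile

# resample to 16 kHz and downmix to mono; librosa returns float32 in [-1, 1]
sound, sample_rate = librosa.load('path/to/source_audio.wav', sr=16000, mono=True)

# store as 16-bit integers
wav = (sound * 32767).astype(np.int16)
wavfile.write('normalized.wav', sample_rate, wav)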

On disk DB methodology

Each audio file is hashed; the hash is used to build a folder hierarchy for faster filesystem operations.

import hashlib
from pathlib import Path

target_format = 'wav'
# `wav` is the int16 waveform (e.g. a numpy array); hash its raw bytes
wavb = wav.tobytes()

f_hash = hashlib.sha1(wavb).hexdigest()

# 1 + 2 hex characters become nested folders, the next 12 become the file name
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
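Putting normalization and hashing together, here is a hedged end-to-end sketch that stores a normalized int16 waveform under its hashed path; store_wav and the root folder name are illustrative, not part of the repository's API.

import hashlib
from pathlib import Path
from scipy.io import wavfile

def store_wav(wav, root_folder, sample_rate=16000):
    # hash the raw bytes and derive the nested storage path
    f_hash = hashlib.sha1(wav.tobytes()).hexdigest()
    store_path = Path(root_folder, f_hash[0], f_hash[1:3], f_hash[3:15] + '.wav')
    store_path.parent.mkdir(parents=True, exist_ok=True)
    wavfile.write(str(store_path), sample_rate, wav)
    return store_path

# example usage with the normalized array from the previous section
# store_wav(wav, 'ru_open_stt_db')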

Helper functions

Use helper functions from here for easier work with manifest files.

Read manifests

See example
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')

Merge, check and save manifests

See example
from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)
train_manifests = [
 'path/to/manifest1.csv',
 'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                        MIN_DURATION=0.1,
                                        MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest,
              'my_manifest.csv')

Contacts

Please contact us here or just create a GitHub issue!

Authors, in alphabetical order:

  • Anna Slizhikova;
  • Alexander Veysov;
  • Dmitry Voronin;
  • Yuri Baburov;

FAQ

0. Why not MP3?

We planned an MP3 version (around 64 kbps) from the start, but we were probably too quick to publish the dataset and it grew out of control. Despite having ample free DO credits, we incurred some charges for data transfer. We are making / will soon make an MP3 version and will replace the links with the new ones.

1. Issues with reading files

Maybe try this approach:

See example
from scipy.io import wavfile
import numpy as np

sample_rate, sound = wavfile.read(path)

# scale to float32 in [-1, 1]
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max

2. Why share such a dataset?

We are not altruists; life is just not a zero-sum game.

Consider the progress in computer vision that was made possible by:

  • Public datasets;
  • Public pre-trained models;
  • Open source frameworks;
  • Open research;

STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English. Ultimately this leaves the general community worse off.

3. Known issues with the dataset to be fixed

  • Blank files in the YouTube dataset. Just filter them out using the meta-data (see the sketch after this list); a proper fix will come in a future release;
  • Some files have very low amplitude values / crash with torchaudio;
  • It looks like scipy does not always write meta-data when saving wavs (or you should save an (N, 1)-shaped array) - this can be fixed as shown above;
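A hedged sketch of such filtering with the repository's read_manifest helper; the manifest path is hypothetical and the 'duration' column name is an assumption that may differ in your copy of the meta-data.

from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/public_youtube700_manifest.csv')

# drop rows whose audio is essentially empty
min_duration_s = 0.1
manifest_df = manifest_df[manifest_df['duration'] > min_duration_s]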

License

Dual license: cc-by-nc, with commercial usage available after agreement with the dataset authors. The VoxForge part is an exception; its license is GNU GPL 3.0.

