

Russian Open Speech To Text (STT/ASR) Dataset
Arguably the largest public Russian STT dataset to date:
- ~4.6m utterances;
- ~4000 hours;
- 431 GB;
- Additional 1,500 hours ... and more ... to be released soon!;
- And then maybe even more hours to be released!;
Prove us wrong! Open issues, collaborate, submit a PR, contribute, share your datasets! Let's make STT in Russian (and more) as open and available as CV models.
Table of contents
- Dataset composition
- Downloads
- Annotation methodology
- Audio normalization
- On disk DB methodology
- Helper functions
- Contacts
- FAQ
- License
Dataset composition
| Dataset | Utterances | Hours | GB | Av s / chars | Comment | Annotation | Quality/noise |
|---|---|---|---|---|---|---|---|
| public_youtube1500 (coming soon) | | 1,500 | | | | | |
| audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp |
| public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | Youtube videos | Subtitles | 95% / ~crisp |
| tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp |
| asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy |
| asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy |
| asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp |
| asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp |
| public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | Youtube videos | Subtitles | 95% / ~crisp |
| ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp |
| voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp |
| russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp |
| public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp |
| **Total** | 4,657,291 | 3,961 | 431 | | | | |
(*) Automatic alignment
This alignment was performed using Yuri's alignment tool. Contact him if you need alignment for your own dataset.
Update 2019-05-07: help needed!
If you want to support the project, you can:
- Help us with hosting (create a mirror) / provide a reliable node for torrent;
- Help us with writing some helper functions;
- Donate (each coffee pays for several full downloads) / use our DO referral link to help;
We are currently converting the dataset to MP3. Please reach out via the contacts below if you would like to help.
Downloads
Links
Meta data file.
| Dataset | GB | GB, compressed | Audio | Source | Manifest |
|---|---|---|---|---|---|
| audiobook_2 | 166 | 131.7 | part1, part2, part3, part4, part5, part6, part7 | Sources from the Internet + alignment | link |
| asr_public_phone_calls_2 | 66 | 51.7 | part1, part2, part3 | Sources from the Internet + ASR | link |
| asr_public_stories_2 | 9 | 7.5 | part1 | Sources from the Internet + alignment | link |
| tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | part1, part2, part3, part4 | TTS | link |
| public_youtube700 | 75.0 | 67.0 | part1, part2, part3, part4 | YouTube videos | link |
| asr_public_phone_calls_1 | 22.7 | 19.0 | part1 | Sources from the Internet + ASR | link |
| asr_public_stories_1 | 4.1 | 3.8 | part1 | Public stories | link |
| public_series_1 | 1.9 | 1.7 | part1 | Public series | link |
| ru_RU | 1.9 | 1.4 | part1 | Caito.de dataset | link |
| voxforge_ru | 1.9 | 1.5 | part1 | Voxforge dataset | link |
| russian_single | 0.9 | 0.7 | part1 | Russian single speaker dataset | link |
| public_lecture_1 | 0.7 | 0.6 | part1 | Sources from the Internet | link |
| **Total** | 431 | 353.6 | | | |
Download instructions
- Download each dataset separately:

  Via wget:

  ```
  wget https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
  ```

  For multi-threaded downloads, use aria2 with the `-x` flag, e.g.:

  ```
  aria2c -c -x5 https://ru-open-stt.ams3.digitaloceanspaces.com/some_file
  ```

  If necessary, merge chunks like this:

  ```
  cat ru_open_stt_v01.tar.gz_* > ru_open_stt_v01.tar.gz
  ```

- Download the meta data and manifests for each dataset;
- Merge files (where applicable), unpack and enjoy!

Check the md5sum of each downloaded file:

```
md5sum /path/to/downloaded/file
```
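If you prefer to verify checksums from Python, a small equivalent of the CLI call could look like this (a hypothetical helper, not part of the repo):

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    # stream the file in 1 MB chunks so large archives fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5_of('/path/to/downloaded/file'))
```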
Annotation methodology
The dataset is compiled using open domain sources. Some audio types are annotated automatically and verified statistically / using heuristics.
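The README does not spell out the exact heuristics. As a hedged illustration, one common statistical sanity check is to flag utterances whose speaking rate falls outside a plausible range; the function and thresholds below are made-up assumptions, not the authors' pipeline:

```python
def plausible_rate(text: str, duration_s: float,
                   min_cps: float = 4.0, max_cps: float = 25.0) -> bool:
    # A characters-per-second value outside [min_cps, max_cps] suggests a
    # misaligned or mis-transcribed utterance; thresholds are assumptions.
    if duration_s <= 0:
        return False
    cps = len(text) / duration_s
    return min_cps <= cps <= max_cps
```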
Audio normalization
All files are normalized for easier / faster runtime augmentations and processing as follows:
- Converted to mono, if necessary;
- Converted to 16 kHz sampling rate, if necessary;
- Stored as 16-bit integers;
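A minimal sketch of such a normalization step, assuming librosa and scipy are available (not necessarily the authors' exact pipeline):

```python
import numpy as np
import librosa
from scipy.io import wavfile

def normalize_audio(in_path: str, out_path: str, target_sr: int = 16000):
    # librosa loads as mono float32 in [-1, 1] and resamples in one call
    sound, _ = librosa.load(in_path, sr=target_sr, mono=True)
    # clip and store as 16-bit signed integers
    pcm = (np.clip(sound, -1.0, 1.0) * 32767).astype(np.int16)
    wavfile.write(out_path, target_sr, pcm)
```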
On disk DB methodology
Each audio file is hashed, and its hash is used to create a folder hierarchy for more optimal filesystem operation.
```python
import hashlib
from pathlib import Path

target_format = 'wav'
# wav is assumed to be a numpy int16 array holding the audio samples
wavb = wav.tobytes()
# hash the raw bytes and spread files across a shallow folder tree
f_hash = hashlib.sha1(wavb).hexdigest()
store_path = Path(root_folder,
                  f_hash[0],
                  f_hash[1:3],
                  f_hash[3:15] + '.' + target_format)
```
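For example, a file whose SHA-1 digest starts with 3a7f would be stored under root_folder/3/a7/; this two-level fan-out keeps any single directory from accumulating an unmanageable number of files.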
Helper functions
Use helper functions from here for easier work with manifest files.
Read manifests
```python
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
```
Merge, check and save manifests
```python
from utils.open_stt_utils import (plain_merge_manifests,
                                  check_files,
                                  save_manifest)

train_manifests = [
    'path/to/manifest1.csv',
    'path/to/manifest2.csv',
]
train_manifest = plain_merge_manifests(train_manifests,
                                       MIN_DURATION=0.1,
                                       MAX_DURATION=100)
check_files(train_manifest)
save_manifest(train_manifest, 'my_manifest.csv')
```
Contacts
Please contact us here or just create a GitHub issue!
Authors (in alphabetical order):
- Anna Slizhikova;
- Alexander Veysov;
- Dmitry Voronin;
- Yuri Baburov;
FAQ
0. Why not MP3?
We were planning to make an MP3 version (around 64 kb/s), but we were probably too quick to publish the dataset - it grew out of control. Despite having ample free DO credits, we incurred some charges for data transfer. We are making / will soon make an MP3 version and will replace the links with new ones.
1. Issues with reading files
Maybe try this approach:
```python
import numpy as np
from scipy.io import wavfile

sample_rate, sound = wavfile.read(path)  # path to a dataset wav file
abs_max = np.abs(sound).max()
sound = sound.astype('float32')
if abs_max > 0:
    sound *= 1 / abs_max
```
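Here `wavfile.read` returns raw 16-bit integers, so dividing by the absolute maximum rescales the signal into [-1, 1] as float32.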
2. Why share such a dataset?
We are not altruists; life just is not a zero-sum game.
Consider the progress in computer vision that was made possible by:
- Public datasets;
- Public pre-trained models;
- Open source frameworks;
- Open research;
STT does not enjoy the same attention from the ML community because it is data hungry and public datasets are lacking, especially for languages other than English. Ultimately this leads to a worse-off situation for the community in general.
3. Known issues with the dataset to be fixed
- Blank files in the Youtube dataset. Just filter them out using the meta-data (see the sketch after this list). Will be fixed in the future;
- Some files have low values / crash with torchaudio;
- It looks like scipy does not always write meta-data when saving wavs (or you should save an (N, 1)-shaped file) - this can be worked around as shown above;
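As a hedged illustration of the meta-data filter mentioned in the first item (the 'duration' column name is an assumption about the manifest schema, which may differ):

```python
from utils.open_stt_utils import read_manifest

manifest_df = read_manifest('path/to/manifest.csv')
# drop (near-)blank utterances; 'duration' is an assumed column name
manifest_df = manifest_df[manifest_df['duration'] > 0.1]
```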
License
Dual license: cc-by-nc, with commercial usage available after agreement with the dataset authors. The exception is VoxForge, whose license is GNU GPL 3.0.