

word2word
Easy-to-use word-to-word translations for 3,564 language pairs.
Key Features
- A large collection of freely & publicly available word-to-word translations for 3,564 language pairs across 62 unique languages.
- Easy-to-use Python interface.
- Constructed using an efficient approach that is quantitatively examined by proficient bilingual human labelers.
Usage
First, install the package using pip:
pip install word2word
Alternatively:
git clone https://github.com/Kyubyong/word2word.git
cd word2word
python setup.py install
Then, in Python, download the model and retrieve top-k word translations of any given word to the desired language:
from word2word import Word2word

en2fr = Word2word("en", "fr")
print(en2fr("apple"))
# out: ['pomme', 'pommes', 'pommier', 'tartes', 'fleurs']
print(en2fr("worked", n_best=2))
# out: ['travaillé', 'travaillait']

en2zh = Word2word("en", "zh_cn")
print(en2zh("teacher"))
# out: ['老师', '教师', '学生', '导师', '墨盒']
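Each translation direction is its own dictionary, so translating back into the source language uses a separate Word2word object. A brief usage sketch (the comments note assumptions; the printed output is illustrative, not guaranteed):

from word2word import Word2word

# Assumption: the constructor downloads the fr-en dictionary if it is not already cached.
fr2en = Word2word("fr", "en")
print(fr2en("pomme", n_best=3))
# illustrative output: ['apple', ...]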
Supported Languages
We provide top-k word-to-word translations across all available pairs from OpenSubtitles2018. This amounts to a total of 3,564 language pairs across 62 unique languages.
The full list is provided here.
Methodology
Our approach computes the top-k word-to-word translations based on the co-occurrence statistics between cross-lingual word pairs in a parallel corpus. We additionally introduce a correction term that controls for any confounding effect coming from other source words within the same sentence. The resulting method is an efficient and scalable approach that allows us to construct large bilingual dictionaries from any given parallel corpus.
For more details, see the Methods section of our paper draft.
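To make the co-occurrence idea concrete, the sketch below shows only the plain co-occurrence baseline: for each source word, target words are ranked by their conditional co-occurrence probability across aligned sentence pairs. It is not the exact word2word pipeline and omits the correction term described above; all function and variable names are illustrative.

from collections import Counter, defaultdict

def cooccurrence_lexicon(parallel_corpus, k=5):
    """Illustrative baseline: rank target words by p(target | source)
    estimated from sentence-level co-occurrence counts in a parallel corpus.
    (The actual word2word method adds a correction term for confounding
    source words, as described above.)"""
    cooc = defaultdict(Counter)  # cooc[s][t] = number of sentence pairs containing both s and t
    src_count = Counter()        # number of sentence pairs containing s
    for src_sent, tgt_sent in parallel_corpus:
        for s in set(src_sent):
            src_count[s] += 1
            for t in set(tgt_sent):
                cooc[s][t] += 1
    # Rank candidates for each source word by conditional co-occurrence probability.
    return {
        s: [t for t, _ in sorted(counts.items(),
                                 key=lambda kv: kv[1] / src_count[s],
                                 reverse=True)[:k]]
        for s, counts in cooc.items()
    }

# Toy example with two aligned sentence pairs:
corpus = [
    (["i", "ate", "an", "apple"], ["j'ai", "mangé", "une", "pomme"]),
    (["the", "apple", "is", "red"], ["la", "pomme", "est", "rouge"]),
]
print(cooccurrence_lexicon(corpus)["apple"][0])  # 'pomme'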
Comparisons with Existing Software
A popular publicly available dataset of word-to-word translations is facebookresearch/MUSE, which includes 110 bilingual dictionaries built from Facebook's internal translation tool.
In comparison to MUSE, word2word does not rely on translation software and covers a much larger set of language pairs (3,564). word2word also provides the top-k word-to-word translations for up to 100k words (compared to 5-10k words in MUSE) and can be applied to any language pair for which a parallel corpus is available.
In terms of quality, while a direct comparison between the two methods is difficult, we did notice that MUSE's bilingual dictionaries involving non-European languages may not be as useful. For English-Vietnamese, for example, we found that 80% of the 1,500 word pairs in the validation set paired the same word with itself (e.g. crimson-crimson, Suzuki-Suzuki, Randall-Randall).
For more details, see Appendix in our paper draft.
References
If you use our software for research, please cite:
@misc{word2word2019,
author = {Park, Kyubyong and Kim, Dongwoo and Choe, Yo Joong},
title = {word2word},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Kyubyong/word2word}}
}
(We may later update this BibTeX entry with a reference to our paper.)
All of our word-to-word translations were constructed from the publicly available OpenSubtitles2018 dataset:
@article{opensubtitles2016,
title={OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles},
author={Lison, Pierre and Tiedemann, J{\"o}rg},
year={2016},
publisher={European Language Resources Association}
}
Authors
Kyubyong Park, Dongwoo Kim, and YJ Choe