
No Language Left Behind

source link: https://ai.facebook.com/research/no-language-left-behind/


Driving inclusion through the power of AI translation

About No Language Left Behind

No Language Left Behind (NLLB) is a first-of-its-kind AI breakthrough project that open-sources models capable of delivering evaluated, high-quality translations directly between 200 languages, including low-resource languages like Asturian, Luganda and Urdu. It aims to give people the opportunity to access and share web content in their native language, and to communicate with anyone, anywhere, regardless of their language preferences.


AI RESEARCH FOR REAL-WORLD APPLICATION

Applying AI Techniques to Facebook and Instagram for translation of low-resource languages

We’re committed to bringing people together. That’s why we’re using modeling techniques and learnings from our NLLB research to improve translations of low-resource languages on Facebook and Instagram. As we apply these techniques and learnings to our production translation systems, people will be able to make more authentic, more meaningful connections in their preferred or native languages. In the future, we hope to extend our learnings from NLLB to more Meta apps.

REAL-WORLD APPLICATION

Building for an inclusive metaverse

A translated metaverse: bringing people together on a global scale

As we build for the metaverse, integrating real-time AR/VR text translation in hundreds of languages is a priority. Our aim is to set a new standard of inclusion: one where, someday, everyone can access virtual-world content, devices and experiences, and communicate with anyone in the metaverse in any language. Over time, this can bring people together on a global scale.


REAL-WORLD APPLICATION

Translating Wikipedia for everyone

Helping volunteer editors make information available in more languages

The technology behind the NLLB-200 model, now available through the Wikimedia Foundation’s Content Translation tool, is supporting Wikipedia editors as they translate information into their native and preferred languages. Editors are using it to more efficiently translate and edit articles written in under-represented languages, such as Luganda and Icelandic, making more knowledge available in more languages for Wikipedia readers around the world. The open-source NLLB-200 model will also help researchers and interested Wikipedia editor communities build on our work.


Experience the Tech

Stories Told Through Translation: books from around the world translated into hundreds of languages

Experience the power of AI translation with Stories Told Through Translation, our demo that uses the latest AI advancements from the No Language Left Behind project. The demo translates books from their languages of origin, such as Indonesian, Somali and Burmese, into more languages for readers, with hundreds available in the coming months. Through this initiative, NLLB-200 will be the first AI model able to translate literature at this scale.

The Tech

Machine translation explained

How does the open-source NLLB model directly translate 200 languages?

STAGE 1

Automatic dataset construction

Training data is collected containing sentences in the input language and the desired output language.

STAGE 2

Training

The translation model is trained on the constructed dataset.

STAGE 3

Evaluation

The model's translations are compared against human translations to confirm they meet quality standards.


The Innovations

The science behind the breakthrough

Most of today’s machine translation (MT) models work for mid- to high-resource languages—leaving most low-resource languages behind. Meta AI researchers are addressing this issue with three significant AI innovations.

Automatic dataset construction for low-resource languages

The context

MT is a supervised learning task, which means the model needs examples to learn from: sentences paired with their translations. Example translations are often drawn from open-source data collections, but these are scarce for low-resource languages. Our solution is to automatically construct translation pairs by matching sentences across different collections of monolingual documents.
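To make this concrete, here is a minimal sketch of embedding-based bitext mining with a margin criterion, in the spirit of LASER-style mining: sentences from two monolingual collections are paired when their embeddings are much closer to each other than to their other neighbors. The `mine_pairs` helper, the random embeddings, and the `k` and `threshold` values are illustrative assumptions, not the project's actual pipeline.

```python
import numpy as np

def mine_pairs(src_sents, tgt_sents, src_emb, tgt_emb, k=4, threshold=1.06):
    """Pair each source sentence with its best target candidate, keeping
    pairs whose margin score clears a threshold. src_emb and tgt_emb are
    L2-normalized sentence embeddings from a multilingual encoder."""
    sim = src_emb @ tgt_emb.T  # cosine similarity matrix, shape (m, n)

    # Average similarity of each sentence to its k nearest neighbors,
    # used to discount "hub" sentences that look close to everything.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # (m,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # (n,)

    margin = sim / ((knn_src[:, None] + knn_tgt[None, :]) / 2)
    best = margin.argmax(axis=1)  # best target index per source sentence
    return [(src_sents[i], tgt_sents[j])
            for i, j in enumerate(best) if margin[i, j] >= threshold]

# Toy usage with random "embeddings"; a real pipeline would encode the
# sentences with a multilingual encoder such as LASER.
rng = np.random.default_rng(0)

def l2norm(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

src = [f"source sentence {i}" for i in range(20)]
tgt = [f"target sentence {j}" for j in range(30)]
pairs = mine_pairs(src, tgt,
                   l2norm(rng.normal(size=(20, 8))),
                   l2norm(rng.normal(size=(30, 8))))
print(len(pairs), "candidate pairs kept")
```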

The challenge

The LASER models used for this dataset creation process primarily support mid- to high-resource languages, making it impossible to produce accurate translation pairs for low-resource languages.

The innovation

We solved this by investing in a teacher-student training procedure, making it possible to 1) extend LASER’s language coverage to 200 languages, and 2) produce massive amounts of data, even for low-resource languages.
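The teacher-student idea can be sketched as follows: a frozen teacher encoder embeds sentences in a language it already supports, and a trainable student encoder for the new language is optimized so its embeddings of parallel sentences land in the same place. The toy mean-pooling encoder and MSE objective below are stand-ins for the real architecture and training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanPoolEncoder(nn.Module):
    """Toy sentence encoder: embed tokens, mean-pool, L2-normalize."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                 # (batch, seq_len)
        pooled = self.emb(token_ids).mean(dim=1)  # (batch, dim)
        return F.normalize(pooled, dim=-1)

teacher = MeanPoolEncoder()  # stands in for a pretrained multilingual encoder
student = MeanPoolEncoder()  # encoder being taught the new language
teacher.requires_grad_(False)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# One step on a toy batch of parallel (new-language, known-language)
# sentence pairs; real training would iterate over mined bitext.
src_ids = torch.randint(0, 1000, (8, 16))  # sentences in the new language
tgt_ids = torch.randint(0, 1000, (8, 16))  # their known-language translations
with torch.no_grad():
    target_emb = teacher(tgt_ids)           # teacher fixes the target space
student_emb = student(src_ids)              # student maps into that space
loss = F.mse_loss(student_emb, target_emb)  # pull the two spaces together
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```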

Modeling 200 languages

The context

Multilingual MT systems outperform bilingual ones because they enable "transfer" from language pairs with plenty of training data to languages with fewer training resources.

The challenge

Jointly training hundreds of language pairs has drawbacks: the same model must represent a growing number of languages with a fixed number of parameters. When dataset sizes are imbalanced, this can cause the model to overfit.

The innovation

We developed a Sparse Mixture-of-Experts model with shared and specialized capacity, so low-resource languages without much data are automatically routed to the shared capacity. Combined with better regularization, this avoids overfitting. We also used self-supervised learning and large-scale data augmentation through multiple types of back-translation.
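A sparsely gated layer of this kind can be sketched as below: every token passes through a shared feed-forward block, and a router additionally sends each token to one specialized expert. The top-1 routing, layer sizes, and the way shared and expert outputs are summed are simplifying assumptions for illustration, not the NLLB-200 architecture.

```python
import torch
import torch.nn as nn

class SharedExpertFFN(nn.Module):
    """Mixture-of-Experts feed-forward layer: shared capacity used by
    every token plus specialized experts that tokens are routed to."""
    def __init__(self, dim=512, hidden=1024, n_experts=4):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.shared = ffn()
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)

    def forward(self, x):                        # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)   # routing probabilities
        top_gate, top_idx = gates.max(dim=-1)    # top-1 expert per token
        expert_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e                  # tokens routed to expert e
            if mask.any():
                expert_out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        # Shared capacity for every token, plus sparse expert capacity.
        return self.shared(x) + expert_out

layer = SharedExpertFFN()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)  # torch.Size([10, 512])
```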

Evaluating translation quality

The context

To know if a translation produced by our model meets our quality standards, we must evaluate it.

The challenge

Machine translation models are typically evaluated by comparing machine-translated sentences with human translations. However, reliable human translation data does not exist for many languages, making accurate evaluation impossible.

The innovation

We doubled the coverage of FLORES, a human-translated evaluation benchmark, so it now covers 200 languages. With automatic metrics and support for human evaluation, we can extensively quantify the quality of our translations.
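Reference-based scoring of this kind can be reproduced with standard tooling. The sketch below compares system output against human reference translations using BLEU and chrF from the sacrebleu library; the sentences are toy examples, and the exact metric variants and settings used by the project may differ.

```python
import sacrebleu

# System output and one set of human reference translations (toy data).
hypotheses = ["The cat sits on the mat.", "It is raining today."]
references = [["The cat is sitting on the mat.", "It rains today."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```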

Learn more about the science behind NLLB by reading our whitepaper and blog, and by downloading the model to help us take this project further.
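As one way to try a released checkpoint, the sketch below loads an NLLB-200 model through the Hugging Face transformers library and translates an English sentence into French. The checkpoint name and the FLORES-200 language codes ("eng_Latn", "fra_Latn") are assumptions about the public release rather than details taken from this page.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # assumed released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("No language left behind.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the first generated token to the target-language code.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```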

The Journey

Research milestones

Meta AI has been advancing machine translation technology while overcoming numerous industry challenges along the way, from the unavailability of data for low-resource languages to translation quality and accuracy. Our journey continues as we drive inclusion through the power of AI translation.


LASER (Language-agnostic sentence representations)

The first successful exploration of massively multilingual sentence representations shared publicly with the NLP community. The encoder creates embeddings to automatically pair up sentences sharing the same meaning in 50 languages.

Data Encoders

WMT-19

FB AI models outperformed all other models at WMT 2019, using large-scale sampled back-translation, noisy channel modeling and data cleaning techniques to help build a strong system.

Model

Flores V1

A benchmarking dataset for MT between English and low-resource languages, introducing a fair and rigorous evaluation process, starting with two languages.

Evaluation Dataset

WikiMatrix

The largest extraction of parallel sentences across multiple languages: bitext extraction of 135 million Wikipedia sentences in 1,620 language pairs for building better translation models.

Data Construction

M2M-100

The first single multilingual machine translation model to directly translate between any pair of 100 languages without relying on English data. Trained on 2,200 language directions, 10x more than previous multilingual models.

Model

CCMatrix

The largest dataset of high-quality, web-based bitexts for building better translation models that work with more languages, especially low-resource languages: 4.5 billion parallel sentences in 576 language pairs.

Data Construction

LASER 2

Creates embeddings to automatically pair up sentences sharing the same meaning in 100 languages.

Data Encoders

WMT-21

For the first time, a single multilingual model outperformed the best specially trained bilingual models across 10 out of 14 language pairs to win WMT 2021, providing the best translations for both low- and high-resource languages.

Model

FLORES-101

FLORES-101 is a first-of-its-kind, many-to-many evaluation dataset covering 101 languages, enabling researchers to rapidly test and improve upon multilingual translation models like M2M-100.

Evaluation Dataset

NLLB-200

The NLLB-200 model delivers evaluated, high-quality translation directly between 200 languages.

Model

FLORES-200

Expansion of the FLORES evaluation dataset, now covering 200 languages.

Evaluation Dataset

NLLB-Data-200

Constructed and released training data for 200 languages.

Data Construction

LASER 3

Creates embeddings to automatically pair up sentences sharing the same meaning in 200 languages.

Data Encoders

Learn More

Let's take No Language Left Behind further, together.

There’s more to learn about NLLB, and even more to accomplish with it. Read our whitepaper and blog for details, and download the model to help us take this project further. While we’ve reached 200 languages, we’ve only just begun. Join us, and build with us, as we continue on this important journey of translation and inclusion.

