On 11 July 2019, Google’s AI team published research titled Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. A team of 13 researchers worked on “building a universal neural machine translation (NMT) system capable of translating between any language pair.” The model is “a single massively multilingual NMT model handling 103 languages trained on over 25 billion examples.”
A single machine translation (MT) model that can translate between arbitrary language pairs has been at the core of the universal MT agenda. “In order to reach this goal,” say the Google AI researchers, “the underlying machinery, the learner, must model a massively multi-way input-output mapping task under strong constraints: a huge number of languages, different scripting systems, heavy data imbalance across languages and domains, and a practical limit on model capacity.”
Why Massive and Wild?
The concept of Massive NMT is not new. This latest research builds on several earlier papers on the topic. However, the researchers note that their system is the largest multilingual NMT system to date in terms of both the amount of training data and the number of languages considered at the same time. The “in the Wild” reference in the title derives from the type of training data used, which the paper classifies as realistic: an “in-house corpus generated by crawling and extracting parallel sentences from the web.”
The paper defines a truly multilingual translation model by the following characterizations: “maximum throughput in terms of the number of languages considered within a single model, maximum inductive (positive) transfer towards low-resource languages, minimum interference (negative transfer) for high-resource languages, robust multilingual NMT models that perform well in realistic, open-domain settings.”
To ensure maximum positive transfer towards low-resource languages, the researchers analyzed the model’s impact on both low-resource and high-resource languages. They devised a dedicated data sampling strategy to better control the effects of this transfer. As a result of this “positive transfer,” as the paper defines it, NMT performance improved on languages with little available data.
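One widely used sampling strategy of this kind is temperature-based sampling, in which each language's share of the training data is raised to the power 1/T and renormalized. The sketch below illustrates the idea only; the function name, the language codes, and the corpus sizes are hypothetical and are not taken from the paper.

```python
def sampling_probs(sizes, temperature):
    """Return per-language sampling probabilities.

    sizes: dict mapping a language code to its number of training examples.
    temperature: T >= 1. T = 1 keeps the original data proportions;
    larger T moves the distribution towards uniform, so low-resource
    languages are sampled more often than their raw share.
    """
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Hypothetical corpus sizes, chosen only to show heavy imbalance.
sizes = {"fr": 1_000_000, "tr": 100_000, "yo": 1_000}

for t in (1, 5, 100):
    probs = sampling_probs(sizes, t)
    print(t, {lang: round(p, 3) for lang, p in probs.items()})
```

Raising the temperature trades some exposure for high-resource languages against more exposure for low-resource ones, which is one way to balance positive transfer against interference.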
What Do We Learn From This Research?
Building a robust multilingual NMT model remains a problem that requires a multi-task solution. In this pursuit, the findings in the paper mark a significant milestone in NMT history. While highlighting the importance of interdisciplinary work in reaching a promising solution to the problem of multilingual universal NMT, the researchers note that “we still have a long way to go towards truly universal machine translation,” and suggest that multilingual NMT is “a plausible general testbed for other machine learning practitioners and theoreticians.”
To understand the underlying motives of the Massive NMT efforts and better estimate how they will echo in the commercial space, we talked with Orhan Firat, Senior Research Scientist at Google AI - Translate, who is also one of the contributors to the Massive NMT research.
TAUS: What does this study mean for the translation and localization industry? Do you expect the findings of this study to yield immediate commercial actions within the industry?
Orhan: The main takeaway is that we are on a trajectory of scaling things up dramatically, both in terms of the number of languages considered within a single model and the number of parameters (weights or connections) in our neural networks. Here the first point implies better transfer towards low-resource languages or more convenient and useful seed models for localization. On the other hand, from the second point, a conjecture that can be made is that the bigger models will likely have better quality compared to their smaller counterparts, which is an ongoing trend in deep learning research. For the immediate actions, I expect the industry to keep paying attention to developments in infrastructure and making use of high-quality data, but of course, my comments are from the research perspective of things.
TAUS: We observe that low-resource languages are gaining more importance in the MT space both from an academic and commercial point of view. With the Massive NMT model, higher performance was achieved for low-resource languages. How will this affect the future of translation?
Orhan: This definitely is a very accurate observation. Now that the quality of the high-resource languages is getting very close to the human level, the MT community is shifting its attention towards the tail of the distribution of languages, mid to low-resource. As we expressed in our position paper, in order to improve the translation quality of a language pair, we inject inductive biases. In our case of massively multilingual MT, our inductive bias is that “the learning signal from one language should benefit the quality of other languages,” and we show the effectiveness and limits of this direction. There are many other inductive biases, such as data augmentation with backward/forward translation or bottom-up alignment for unsupervised machine translation. I expect that the MT community will come up with even more creative and effective inductive biases for low-resource or zero-resource languages while pushing the limits of multilingual MT.
TAUS: In the paper, you touch upon “Multi-domain and multi-task learning across a very large number of domains/tasks, with wide data imbalance”. How will transfer learning be achieved in the case where there is zero data available for a language pair (for instance Amharic-Kurdish)?
Orhan: Zero-resource or zero-shot translation is not new to MT research, and the most straightforward approaches rely on pivoting, or bridging, via a common language, mostly English due to its data abundance. In parallel, over the last two years, unsupervised MT, which relies only on monolingual data, has made tremendous progress. We expect that a massively multilingual model that can translate between several languages will have the closest representation to a so-called interlingua and hence, combined with the techniques developed for unsupervised MT, has the potential to excel in language pairs where there is zero available data. But of course, we should never overlook the amount of computing required to train such massively multilingual models to start with, and the tricks of the trade that make unsupervised MT applicable.
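For readers unfamiliar with pivoting, the idea can be sketched in a few lines: translate from the source language into the pivot language, then from the pivot into the target. This is a toy illustration only. The dictionary lookups stand in for real MT systems, and every name and entry below is a hypothetical placeholder rather than real Amharic, English, or Kurdish data.

```python
from typing import Callable

Translator = Callable[[str], str]


def make_pivot_translator(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Compose two translators so that text goes source -> pivot -> target."""
    def translate(text: str) -> str:
        return pivot_to_tgt(src_to_pivot(text))
    return translate


# Placeholder lookup tables standing in for trained models (hypothetical).
AM_TO_EN = {"am_greeting": "en_greeting"}
EN_TO_KU = {"en_greeting": "ku_greeting"}


def am_to_en(text: str) -> str:
    return AM_TO_EN[text]


def en_to_ku(text: str) -> str:
    return EN_TO_KU[text]


am_to_ku = make_pivot_translator(am_to_en, en_to_ku)
print(am_to_ku("am_greeting"))
```

Because the output of the first system is the input of the second, errors compound across the two steps, which is one reason a single multilingual model that translates directly is attractive.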
TAUS: What do you think about the increasing demand for language data in the industry? Do you foresee that new data resourcing models will emerge?
Orhan: For the data resourcing models, from a research perspective, I suspect converting compute into data will be the new paradigm. Rather than solely relying on hard-to-obtain, labor-intensive and expensive professionally translated data, our neural nets might favor the data generated by their peers, the other neural networks, to achieve better quality on their task of translation. This links back to the first question, focusing more on computational power and scaling things up dramatically.
To hear more about the Massive NMT research and listen to Orhan’s presentation of the research in detail, join the MT User Group #16.