The last significant breakthrough in the technology of statistical machine translation (SMT) was in 2005. That year, David Chiang published his famous paper on hierarchical translation models, which made it possible to significantly improve the quality of statistical MT between distant languages. Today we stand on the verge of an even more exciting moment in MT history: deep learning (DL) is taking MT towards much higher accuracy and is finally bringing human-like semantics to the translation process.
In general terms, DL is a family of machine learning algorithms that use multilayer artificial neural networks to efficiently learn representations of high-level features from noisy observations. Artificial neural networks, inspired by what is known about the biological brain, have already led to breakthroughs in several data-centered fields, including speech recognition, computer vision, user behavior prediction and nonlinear classification.
The idea of using a computational model of natural neurons to learn from data is not new. In fact, modern DL is a reincarnation of the traditional neural networks of the early 1990s, which had only a small number of layers. The key difference between old-style neural networks and DL is the iterative multilayer architecture, which allows DL to learn and represent features in a more complete way (not all features can be defined by experts) and a more accurate way (through multiple levels of representation). Another key factor in the success of DL is the ability of multilayer neural networks to learn an optimal set of features describing objects automatically, instead of relying on the hand-engineered features traditionally used in old neural networks. Training DL networks requires extensive computational resources (their unavailability was the main reason why DL did not capture the headlines earlier) and abundant training data, but it makes it possible to discover dependencies that were previously undiscoverable.
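The notion of "multiple levels of representation" can be illustrated with a minimal sketch. The two-layer network below is purely hypothetical: its weights are set by hand for readability, whereas a real DL system would learn them from data, and the names `step`, `layer` and `forward` are illustrative rather than taken from any library. The hidden layer computes two intermediate features (OR and AND of the inputs), and the output layer combines them into XOR, a function that no single layer of this kind can represent.

```python
def step(z):
    """Threshold activation: fire (1) when the weighted input is positive."""
    return 1 if z > 0 else 0


def layer(inputs, weights, biases):
    """Apply one fully connected layer of threshold units."""
    return [
        step(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-set weights (an assumption for illustration; normally learned).
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]  # unit 0 acts as OR, unit 1 as AND
HIDDEN_B = [-0.5, -1.5]
OUTPUT_W = [[1.0, -1.0]]             # OR and not AND, i.e. XOR
OUTPUT_B = [-0.5]


def forward(x):
    """Return the intermediate features and the final output for input x."""
    hidden = layer(x, HIDDEN_W, HIDDEN_B)
    output = layer(hidden, OUTPUT_W, OUTPUT_B)
    return hidden, output[0]


if __name__ == "__main__":
    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        hidden, out = forward(x)
        print(x, "hidden features:", hidden, "output:", out)
```

The point of the sketch is only that the output is computed from features built by an earlier layer, not from the raw inputs; in DL those intermediate features are discovered during training rather than specified by an expert.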
Many modern natural language processing applications rely heavily on machine learning methods. This is why expectations for DL in natural language processing (NLP) in general, and for DL MT in particular, were extremely high. In practice, however, learning higher levels of abstraction (semantics) derived from lower levels (lexical features) via intermediate steps (morphology, syntax, etc.) turned out to be a difficult task, and the usability of DL for NLP remained a big question for some years. The main reasons for this are:
- DL MT systems do not rely on pre-defined feature functions but learn these from data. This implies a more complex network architecture and, since the expected improvement is typically observed only if the configuration of the DL system is close to optimal, makes the architect’s experience a key element of success. The scarcity of qualified DL system designers used to be an obstacle to the successful commercialization of DL MT technology.
- The lack of established guidelines and best practices adds uncertainty to the design process.
- A word can be an extremely informative feature per se, and learning from such informative units (especially in the case of morphologically rich languages) is not an easy task.
- There is no consensus in the research community on the most beneficial way to use DL for MT: directly training DL models for MT or improving existing MT systems with DL elements.
However, nowadays the status quo is changing markedly:
- High-performing DL MT has entered the world of commercial translation automation. Google went public with an elegant idea of using recurrent neural networks with long short-term memory for MT. Facebook and Microsoft (with Bing) are also implementing, if they have not done so already, similar DL MT systems. MT providers such as Systran are launching DL MT systems to boost the quality of their MT services.
- DL models bring a new perspective to MT and open the way for new applications of indirect DL MT integration. One example is Word2vec, which captures semantic links between words and groups of words and makes it possible to enhance SMT systems with synonymy information.
- In addition, MT is becoming a testing ground for researchers who want to evaluate various kinds of DL algorithms in settings that involve processing symbolic variables.
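To make the Word2vec point above more concrete, here is a minimal sketch of how word vectors could be used to propose near-synonyms. The vectors in `TOY_VECTORS` are invented for illustration and are not real Word2vec output; a trained model would supply vectors with hundreds of dimensions. The helper names `cosine` and `nearest` are likewise hypothetical.

```python
import math

# Toy 3-dimensional embeddings (made-up values, not trained vectors).
TOY_VECTORS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "vehicle": [0.7, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to that of `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


if __name__ == "__main__":
    print(nearest("car", TOY_VECTORS))
```

In an SMT setting, candidates found this way might, for example, be used to back off to a known translation when the source word itself is missing from the phrase table. That integration step is an assumption of this sketch and is not described in the text above.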
DL is currently on the verge of breaking the quality barrier and making MT smarter. On the other hand, DL MT is still just an extended version of the statistical approach, which Chomsky, the most-quoted scientist alive today, criticized by saying that “statistical models have been proven incapable of learning language”. In other words, while a significant improvement in quality can be expected because DL helps to learn a distributed semantic representation of human language, DL cannot immediately generalize accurately from the discrete word space on the basis of a finite training dataset.