Alon Lavie is an experienced member of the international translation automation community, with a productive career in research at Carnegie Mellon University, plus a number of MT appointments in enterprises to his credit, including his startup Safaba and more recently Amazon. Today he is VP Language Technology at Unbabel. We asked him for his views on our key questions about critical pathways ahead for the translation industry in technology, data/algorithms, quality evaluation, and innovation.
What were the specific challenges that attracted you to working with Unbabel today?
After leading an MT startup and then managing an MT group at Amazon for almost four years, what I liked about Unbabel is that they came at the problem of translation not from a pure MT stance but from a broader view of how technology can help solve the full enterprise customer problem, starting by looking at how state-of-the-art technology could help support translator communities.
Today, if you want human quality translation as your end product you’ll need to leverage technology and data to optimize the combination of human skills and automation. Unbabel started by addressing the management of translation communities and then quickly expanded to solving the translation confidence and quality estimation problem at a very detailed level. They have now moved into the core MT space, but what I particularly believe in is their focus on the enterprise problem as a whole, and building end-to-end translation pipelines into enterprises.
Remember that enterprises don’t have a strictly “MT” or “translation” problem; they have a multilingual content delivery and communication problem. And using NMT, it is now possible to develop effective solutions to these major enterprise use cases. To do this you can’t just build MT in isolation; everything has to come together in a very effective combination. That was what particularly attracted me to Unbabel when I moved on from Amazon. I deliberately contacted a very small number of companies working on unique approaches in this space: Unbabel came up at the top of the list!
Which is more important in solving this multilingual communication problem – data or algorithms?
This is a complex question and it of course doesn’t have a simple answer. I think the key is to look at how we can best develop concrete solutions to specific use cases in the enterprise space. In the case of algorithms, what the research community has typically focused on are the fundamentals of the neural architecture, the training algorithms and the runtime decoding. Those are of course important, but what really matters is what we are trying to translate. Software strings? Or websites for an international company? Or FAQs or email communications? Is it content to be used either inside or outside the company, or is it customer support or chat between agents and customers? These are all important use cases today. These impact not only the data that we need, but also the underlying architecture and algorithms. To build solutions, we obviously need the right combination of algorithms and data.
One critical issue is the question of general purpose vs. adaptation/custom models of translation. In the past we analyzed content in terms of domains (financial, travel, automotive, IT etc.) but this is a very fuzzy categorization. If you look at the issue in terms of enterprise use cases, however, these are understood as multilingual communication problems that require accurate translation quality. With such use-cases in mind, the generic vs. customized MT question is an over-simplification of the real problem, as we never translate sentences in isolation - we basically translate a document or a webpage or a chat interaction between a person and customer. This means we need MT that can actually translate things in context. So it needs to have an awareness of the contextual history of what has been translated before (source doc or other) and then be able to leverage this information to correctly translate what comes next. This is true regardless of whether the translation is an offline process or an online process. Addressing the translation in context problem is both a data and an algorithmic process. That entire document is data and we need data similar to that to build a good model; but it also needs an architecture to leverage that contextual data and pass it along to the model doing the translation at any point in time, and training and decoding algorithms that are suitable for dealing with context. That changes everything.
There are various ways to build technology for MT models that are context sensitive. Data is important but it will require organizing the data in a different way. It’s not just about having larger quantities of generic data for a language pair at sentence level. Nor is it about having specific domain data or access to that specific data from a particular customer. Once again, what you need is to have data that effectively represents the use case. This exists to a degree in translation memories but even standard TMs don’t provide access to complete stored documents. So the real question is not generic vs. customer-specific data, but how you organize data so that it provides access to information at the level of complete documents, conversations, or software documentation. That is typically not how data is sourced today.
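As a rough illustration of organizing data at the document level rather than as isolated sentence pairs, here is a minimal Python sketch. The `Segment` and `Document` classes, the `context_examples` helper and the `window` parameter are hypothetical names chosen for this example; they do not describe Unbabel's data format or any particular toolkit.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Segment:
    source: str
    target: str


@dataclass
class Document:
    # Hypothetical container: segments are stored in their original order,
    # so the surrounding context is never lost.
    doc_id: str
    segments: List[Segment] = field(default_factory=list)


def context_examples(
    doc: Document, window: int = 2
) -> Iterator[Tuple[List[Segment], str, str]]:
    """Yield (context, source, target) for every segment in the document.

    `context` holds up to `window` preceding segments, in document order.
    """
    for i, seg in enumerate(doc.segments):
        context = doc.segments[max(0, i - window):i]
        yield context, seg.source, seg.target
```

A sentence-level translation memory would typically keep only the `source` and `target` fields; the point of the sketch is that the ordering inside `Document` is what makes context available at training and decoding time.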
How do you apply quality estimation in a pipeline constructed around use cases?
This is a critical question, and I see three aspects to this issue. First, there is human quality which plays a key role in enterprise content, and fairly manual human evaluation mechanisms are still used by customers to check whether translation solutions are addressing their demands. The most prevalent metric here is MQM (Multidimensional Quality Metrics), which is useful because it allows areas of specific error to be pinpointed, rather than simply giving an overall score for a translation. Annotating errors for the purpose of calculating MQM scores is still a very manual and time-consuming process. We are making progress on detecting errors automatically, but so far this is still largely a human process.
The second aspect is applying these notions of translation quality estimation to MT, which means you would like to be able to annotate the MT output and give it an MQM quality score as the ground truth. This is a difficult problem even though advances are being made to automate MQM generally. This will be very important for enterprises. If we are going to construct a two-step pipeline basically prioritizing MT and have human editing as secondary (this is different to the more integrated approach taken by Lilt and others), it will be critical to reach a point where every MT output that has been generated comes with a quality estimation, not just at an abstract level but at the level of MQM. Then you will know whether you can send the automatic translation out without human intervention, or send it to the 24/7 translator/editor community, or look at an alternative mechanism for communicating with your customer.
In other words, this means we need to move to a world in which every MT project, interaction or transaction will generate a quality estimation or (ideally) a self-assessment of its own quality. So MT needs to become more self-aware and know instantly how good a translation output is likely to be and how risky any errors might be. This goal is becoming increasingly attainable. And if we get to a world in which MT is integrated automatically with quality assessment so we know how good it is before any human touches it, this opens the door to even more exciting things in terms of end-to-end translation.
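The routing decision in the two-step pipeline described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0-100 score scale, the two threshold values and the names `MTOutput`, `estimated_mqm` and `route` are all hypothetical, and a real system would tune the thresholds per use case.

```python
from dataclasses import dataclass

# Hypothetical thresholds on an assumed 0-100 scale (higher is better).
SEND_THRESHOLD = 90.0
EDIT_THRESHOLD = 60.0


@dataclass
class MTOutput:
    source: str
    translation: str
    estimated_mqm: float  # predicted MQM-style score from a QE model (assumed)


def route(output: MTOutput) -> str:
    """Pick one of the three paths described in the interview."""
    if output.estimated_mqm >= SEND_THRESHOLD:
        return "send"                 # deliver without human intervention
    if output.estimated_mqm >= EDIT_THRESHOLD:
        return "human_edit"           # pass to the translator/editor community
    return "alternative_channel"      # find another way to reach the customer
```

The design choice worth noting is that the decision depends only on the source, the MT output and the estimated score, so it can run before any human sees the translation.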
Third: we need to be perfectly clear about the challenges and problems of using reference-based automated metrics such as BLEU for benchmarks and evaluating current state-of-the-art neural translation models, and I think it is time for a major overhaul on this front too.
In the past we used an example of a correct translation - the human reference translation - as a benchmark, and this worked reasonably well in the age of SMT. But there are two fundamental problems with using metrics such as BLEU in today’s NMT systems. One is that the reference translation, even if fully correct in isolation (and often it’s not and we don’t know its provenance!) gives a very limited signal in a neural MT world about how a model is performing. What worries me most is that neural models today typically generate more fluent grammatical output in the target language than SMT models did in the past, but variation in neural models comes from somewhat different word choice or grammatical structure or terminology. In BLEU, if there is no exact lexical match then you get no credit for it, so a translation receives a low score for a paraphrase. Two, neural models do often make significant semantic mistakes by missing the right meaning of a word. But BLEU cannot distinguish between these two cases of mismatches against a reference, due to either a paraphrase or due to semantic errors, so it is of very limited use as an evaluation tool.
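To make the point about exact lexical matching concrete, here is a small Python sketch of clipped n-gram precision, the overlap statistic at the core of BLEU. It is deliberately simplified: full BLEU combines several n-gram orders and adds a brevity penalty, both omitted here. The function names and the example sentences are invented for illustration.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis: str, reference: str, n: int) -> float:
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    hyp = Counter(ngrams(hypothesis.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "the meeting was postponed until monday"
paraphrase = "the meeting was delayed until monday"        # acceptable rewording
semantic_error = "the meeting was cancelled until monday"  # wrong meaning

for n in (1, 2):
    # Both hypotheses differ from the reference by one token in the same
    # position, so their overlap with the reference is identical.
    print(n, clipped_precision(paraphrase, reference, n),
          clipped_precision(semantic_error, reference, n))
```

Because the metric only counts surface matches, the harmless paraphrase and the meaning-changing error are penalized by exactly the same amount at each n-gram order.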
So overall, we are overdue for change. On the path forward we should be able to assess the quality of models without the scaffold of a reference translation. Quality estimation models based on source and the generated MT should be enough, without calling on a human reference.
Where else will the focus of innovation be in the immediate future?
I expect that there will be significant, accelerating disruption in the underlying structure and technology ecosystem of the translation industry. For example, the entire concept of managing Translation Memories (TMs) as a separate step in translation might disappear fairly soon, as MT and memories fuse together in new ways that make the distinction superfluous. Translation management systems themselves will likely not disappear but newer players will begin to circumvent Translation Management as a separate cloud-ware service, and largely like Unbabel, they will find ways to build infrastructure and get rid of the entire issue of managing human translation projects in favor of integrated end-to-end processes that are much more creative and flexible. As a result, I suspect it will be very difficult for large LSPs to make that transformation, and this should lead to exciting new opportunities for disruption for companies such as Unbabel, Lilt and others in this space. I’m really looking forward to seeing all of this happen in the next few years!