The long-expected technical revolution is here. Automatic translation is no longer just a freebie on the internet. It is now entering the ‘real’ economy of the translation sector and it changes everything.
1. A Short History of the Translation Industry
Over four decades, the translation sector has gone through a series of adaptations, each triggered by changes in the business and technological environment (see image below).
However impressive the journey has been so far, nothing compares to what is still to come: the Singularity. In this new phase, technology essentially takes over completely. The human translator is no longer needed in the process. Google and Microsoft alluded to this future state when they claimed that their MT engines translated as well as professional human translators. That claim, however, has sparked heated debate in both academic and professional circles about what this so-called human parity really means.
2. The Rise of Zero-Cost Translation
The global translation industry now finds itself in a 'mixed economy': on one side the traditional, vertical, cascaded supply chain, and on the other the new, flat free-machines model. The speed with which the machines improve when fed with data of the right quality and volume makes translation a near-zero marginal cost business (in the spirit of Jeremy Rifkin). This means that once the right infrastructure is in place, producing a new translation costs next to nothing and capacity becomes virtually unlimited.
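The near-zero marginal cost idea can be illustrated with a small sketch. The function name and all figures below are hypothetical and chosen only for illustration; they are not industry data. The point is simply that a fixed infrastructure cost spread over a growing volume makes the average cost per word approach the (tiny) marginal cost.

```python
def average_cost_per_word(fixed_cost: float,
                          marginal_cost_per_word: float,
                          words: int) -> float:
    """Average cost per word when a fixed cost is spread over `words`."""
    if words <= 0:
        raise ValueError("words must be positive")
    return fixed_cost / words + marginal_cost_per_word


# Hypothetical figures, for illustration only.
FIXED_INFRASTRUCTURE_COST = 100_000.0   # one-off cost of the MT setup
MARGINAL_COST_PER_WORD = 0.00001        # compute cost of one more word

if __name__ == "__main__":
    for volume in (10**6, 10**8, 10**10):
        cost = average_cost_per_word(
            FIXED_INFRASTRUCTURE_COST, MARGINAL_COST_PER_WORD, volume
        )
        print(f"{volume:>14,} words: {cost:.6f} per word")
```

A labor-based model behaves differently: every additional word carries roughly the same word rate, so the average cost does not fall as volume grows.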
As long as the translation industry remains locked into a vertical, labor-based cost model, how realistic is it to think that we can simply add more capacity and skills to our existing economic model and generate that global business impact?
For operators in the translation industry to follow the trend and transition to the new free-machines model, they need to consider fundamental changes in the economics of their business: be prepared to break down existing structures, adopt new behaviors of sharing and collaboration, reduce the need for human tasks and work activities, and advance the technology. Under the new economic model, the concepts of linguistic quality, translation memories, and word rates will lose their meaning. Instead, we will talk about global business impact, data and models, and value-based pricing.
3. Kill Your Business Before it Kills You
In 2019, Google alone translated 300 trillion words, compared to an estimated 200 billion words translated by the professional translation industry. Adding other big players like Microsoft Bing Translator, Yandex MT, Alibaba, Tencent, Amazon and Apple, the total output from MT engines is probably already tens of thousands of times bigger than the overall production capacity of all professional translators on our planet.
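As a quick check on the scale involved, the two figures quoted above can be compared directly. The constant names are ours; the numbers are the ones given in the text. This covers Google alone. The larger multiple for all engines combined is the article's own estimate and is not derived here, since the output of the other engines is not quantified.

```python
# Figures as quoted in the text (words translated in 2019).
GOOGLE_WORDS_2019 = 300 * 10**12      # 300 trillion
PROFESSIONAL_WORDS_2019 = 200 * 10**9  # 200 billion (estimate)

ratio = GOOGLE_WORDS_2019 // PROFESSIONAL_WORDS_2019
print(f"Google alone: {ratio:,}x the professional industry's output")
# prints: Google alone: 1,500x the professional industry's output
```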
Until only two or three years ago, after the new Neural MT success stories started to settle in, human professional translation and MT existed in two parallel worlds. Even inside Google and Microsoft, the product localization divisions didn’t make use of their company’s own MT engines. But that has changed now. MT is integrated into almost every translation tool and workflow.
The question for operators in the translation industry, therefore, is whether the two processes continue to co-exist, or whether MT will completely wash away the old business. The increasing pressure that LSPs are feeling already pushes them to offer various data or transcreation services or to start building their own MT systems and services. Gartner, in their recent research report, reckons that by 2025 enterprises will see 75% of the work of translators shift from creating translations to reviewing and editing machine translation output.
The takeaway for language service providers is that ignoring AI and MT is not an option. To grow the business they need to get out of their localization niche and use data and technology to scale up and expand into new services.
4. Buying the Best MT Engine
The question is asked many times: which MT engine is the very best for language A or in domain B? As MT developers use the same frameworks or models, like Marian, BERT or OpenNMT, which are shared under open-source licenses on GitHub, the answer to all these questions is that the “best” MT engine out-of-the-box does not exist. MT is not static: the models are constantly being improved and the output of the machine is dependent on the data that is used to train and customize the models. It’s a constant process of tuning and measuring the results.
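To make "tuning and measuring" a little more concrete, here is a minimal sketch of one possible measurement: the word-level edit distance between an engine's output and the human post-edited version, divided by the length of the post-edited text. This is a simplified relative of metrics such as TER (it ignores word shifts) and is not presented as the method any particular vendor uses. The function names are hypothetical.

```python
def word_edit_distance(hyp: list[str], ref: list[str]) -> int:
    """Levenshtein distance over words (insert, delete, substitute)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from hyp
                            curr[j - 1] + 1,      # insert a word into hyp
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_rate(mt_output: str, post_edited: str) -> float:
    """Edits needed per word of the post-edited text (lower is better)."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        return 0.0 if not hyp else float(len(hyp))
    return word_edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    post_edited = "the contract ends on the first of May"
    for name, output in [
        ("engine A", "the contract ends on the first of May"),
        ("engine B", "the contract finishes at first of May"),
    ]:
        print(name, round(edit_rate(output, post_edited), 3))
```

Tracking a score like this on the same test set after each round of customization gives a rough indication of whether the engine is moving in the right direction for a given language pair and domain.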
For LSPs, it is much more important to have an easy way to customize the MT engines with their own high-quality language data. Some of the disruptive innovators in the translation industry that have implemented a real-time or dynamic adaptive MT process show how easy that is with “predictive translation”, which means that the engine learns almost immediately from the edits made by humans. This real-time, adaptive MT is only available in a closed software-as-a-service offering, which is understandable because of the immediate data-feedback loop required to optimize the speed and success of the learning process.
Language companies that need more flexibility and control over the technology should build and customize their own end-to-end solution. Their main challenge is to set up a diligent pipeline for data preparation, training and measurement. Their language operations then become a data-driven solution.
5. No Data, no Future
In the past thirty years, the translation industry has accumulated a massive amount of text in source and target languages, stored in databases known as translation memories. However, these do not always make the best training data. Translation memories are often poorly maintained over time, may be too specific and too repetitive, or may contain names and attributes that confuse the MT engines.
To optimize the quality of machine translation output, the language data needs to be of the highest possible quality. Data cleaning and corpus preparation should include steps like deduplication, tokenization, anonymization, alignment checks, named entity tagging and more. To ensure that the language data used for customization of MT engines is on topic, more advanced techniques can be used to select and cluster data to match the domain.
Even if you decide to outsource most of the data-related activities, new skills, new talent and a new organizational structure are required in your business to get ahead in the new AI-enabled translation space.
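A few of the cleaning steps listed above can be sketched in a handful of lines. This is a minimal illustration under simplifying assumptions, not a production pipeline: anonymization is reduced to masking e-mail addresses, the alignment check is a crude word-count ratio, and all function names and the `max_ratio` threshold are hypothetical.

```python
import re
import unicodedata

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize(text: str) -> str:
    """Unicode-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def anonymize(text: str) -> str:
    """Mask e-mail addresses (a stand-in for fuller anonymization)."""
    return EMAIL_RE.sub("<EMAIL>", text)


def plausible_alignment(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Reject empty sides and pairs whose word counts differ too much."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


def clean_corpus(pairs):
    """Normalize, anonymize, filter and deduplicate (source, target) pairs."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = anonymize(normalize(src)), anonymize(normalize(tgt))
        if not plausible_alignment(src, tgt):
            continue
        key = (src.casefold(), tgt.casefold())  # case-insensitive dedup
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned


if __name__ == "__main__":
    tm = [
        ("Hello  world", "Bonjour le monde"),
        ("hello world", "bonjour le monde"),
        ("Contact bob@example.com", "Contactez bob@example.com"),
        ("", "vide"),
    ]
    for pair in clean_corpus(tm):
        print(pair)
```

Real pipelines typically add language identification, tokenization suited to each language, and named entity tagging on top of steps like these.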
6. Who Owns My Language Data?
Convinced as they may be that the future lies in taking control over the data, many owners of agencies, as well as translation buyers, still hesitate to move forward because they are in doubt about their legal rights to the data. There is a strong feeling across the translation industry that translations are copyright-protected and can never be used to train systems. If uncertainty over ownership of data is a factor that slows down innovation, it is time to get more clarity on this matter.
In the Who Owns My Language Data White Paper, Baker McKenzie and TAUS address important questions regarding the privacy and copyright of language datasets, individual segments, GDPR and international rulings and more. The white paper functions as a blueprint for the global translation industry. One important point to highlight is that copyrights are intended to apply to complete works or parts of the work more so than to individual segments. Since MT developers normally use datasets that consist of randomly collected segments for the training of their engines, the chances of a copyright clash are minimal.
Copyright on language data is complex and involves multiple stakeholders and many exceptions. Customers expect vendors to use the best tools and resources available, and today that means using MT and data to customize the engines. To our knowledge, there is no precedent of a lawsuit over the use of translation memories for training MT engines, and the risk of being penalized is negligible. But in case of doubt, you can always consult your stakeholders about the use of the data.
7. Breaking the Data Monopolies
If there is no future in translation without access to data, it is in the interest of all language service providers, their customers and translators in the world to break the monopolies on language data. Right now, a handful of big tech companies and a few dozen large language service providers have taken control of the most precious resource of the new AI-driven translation economy. A more circular, sharing and cooperative economic model would fit better into our modern way of working.
One solution is to unbundle the offering tied up in the AI-driven translation solution and to recognize the value of all the different contributors:
- Hosting the powerful, scalable infrastructure that can support the ever-growing AI systems is a task that can only be managed by the largest companies.
- Customizing the models for specific domains and languages is a specialized service that may best be left to service companies that have expertise in these fields and are capable of adding value through their offering.
- And since the best quality training data is vital for everyone, why not let translators and linguistic reviewers who produce this data take full responsibility and earn money from their data every time an engine is trained with their data?
The process of creative destruction is now in full swing and may lead to a redesign of our entire ecosystem. The first marketplaces to enable this new dispensation are now out there: SYSTRAN launched a marketplace that allows service providers to train and trade translation models, while TAUS launched a data marketplace that allows stakeholders in the translation industry to monetize their language data. These first steps should lead to healthy debate across the industry as we feel the shock waves of an industry reconfiguration driven by radical digitization, human reskilling, and exponential data intelligence.
The long version of this article has been published in the July/August 2021 issue of the Multilingual Magazine. This shorter version has been composed by Anne-Maj van der Meer.