Report on the 13th TAUS QE Summit that took place on April 11, 2018 at Microsoft (Dublin, Ireland).
Technologies sometimes evolve at a faster pace than we as humans are able to keep up with. Powered by technology, intelligent platforms, automated workflows and efficient distribution processes, translations easily become invisible, and so does the workforce behind them. If we are to rely on technology to help us do our work, and to continuously measure and manage its increasing impact on our future, we need to be able to dynamically adjust our requirements, make decisions that are based on data and not opinions and, finally, put our trust in the power of distributed, invisible workforces. The same applies to the way we evaluate translation quality.
On April 11 in Dublin, Microsoft and TAUS welcomed more than fifty professionals from all parts of the industry who were eager to talk quality. This year we zoomed in on people, organizations and the platforms and workflows they are operating on. Key topics revolved around the impact of the gig economy trend on quality, the user experience perspective on quality, changing jobs and education of translators and the application of the Modern Translation Pipeline on different businesses.
After the warm welcome by Vincent Gadani (Microsoft) and Jaap van der Meer (TAUS), we jumped straight into the introductory sessions, followed by four panel discussions. The day was rounded off with a breakout and a plenary session. Engaged attendees, a focused program and a lot of interaction led to clearly defined ‘homework’ for everyone in the industry, and particularly for TAUS. TAUS promised to take the actions forward, work on them through TAUS User Groups, and come back with deliverables at the TAUS QE Summit in October.
Introductory sessions: Decentralizing quality evaluation and DQF Roadmap
The idea of redefining the “D” in DQF as “Decentralized” emerged while Jaap van der Meer was answering the same old question, “How do you define quality?”, for one of the enthusiastic new users of DQF. Clarifying that quality needs to be defined and driven by data, not by individuals and subjective opinions, led to a revelation: there is a similarity between blockchain technology and DQF, with its potential to decentralize and automate the transactional and administrative activities surrounding quality evaluation while ensuring security and trust. Jaap elaborated further on the importance of having measurements in place to support the decentralization: we need to move away from statements like “We know quality when we see it” and introduce metrics where they are not employed yet. We should be able to decentralize work and distribute it to teams around the world, and trust the people and the data to tell us how they perform.
What data points do we need to track to ensure we measure quality in the most effective way? Dace Dzeguze presented the results of the first deep analysis of the DQF data, performed in February 2018. The objectives were to extract business intelligence from DQF data, verify whether the current DQF data points provide useful material for analyzing the efficiency of translation resources, and discover whether cognitive translation effort can be measured using DQF data. The focus was solely on the translation phase of the localization process: productivity, time spent, and edit distance. Three hypotheses were successfully tested, and the results led to the realization that the relationship between these data points is often not linear. TAUS is likely to do a quality comparison for the three assumptions in the future and look further into the correlation between productivity and quality.
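To make two of these data points concrete, here is a minimal Python sketch of how edit distance and productivity could be computed for a single segment. It is an illustration only: the report does not say how DQF calculates either value, so the character-level Levenshtein distance, the normalization by the longer string, and the function names are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(suggestion: str, final: str) -> float:
    """Edit distance divided by the longer length: 0.0 = unchanged, 1.0 = fully rewritten.

    The normalization is an assumption made for this sketch.
    """
    longest = max(len(suggestion), len(final))
    return levenshtein(suggestion, final) / longest if longest else 0.0


def words_per_hour(word_count: int, seconds_spent: float) -> float:
    """Productivity expressed as words processed per hour."""
    if seconds_spent <= 0:
        raise ValueError("seconds_spent must be positive")
    return word_count * 3600 / seconds_spent
```

Plotting one measure against the other per segment would be one simple way to see whether the relationship is linear.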
Meanwhile, the community-driven DQF roadmap for the first half of 2018 will focus on the release of trend reports, a more flexible vendor management feature, improved usability of the SDL Trados Studio plugin and a facelift of the Quality Dashboard.
First panel: Quality in the gig economy
Decentralization lies at the heart of the gig economy - an economic model where small independent tasks (gigs) are completed by individual external contractors rather than by permanent, full-time employees. The gig economy powers crowdsourcing. Enabled by technology, crowdsourcing helps reach scale.
How do large platforms that use crowdsourcing in localization measure and manage quality? Where do they differ from corporate globalization management systems when it comes to quality review and evaluation? How do they recruit? What are their quality challenges? In this panel discussion led by Vincent Gadani (Microsoft), the representatives of three large platforms that all use crowd resources shared their experiences of adopting crowdsourcing in localization: Alba Guix (Pactera), Alessandro Cattelan (Translated) and Julie Belião (Unbabel).
Alba Guix presented the gig model that Pactera employs. Supported by a mature program framework and platforms, this model is focused on results. It involves not only translators working under the standard methodology but also so-called ‘gig’ translators - either junior translators or professionals specialized in a language-related discipline. Even though they receive only light training and do not have full readiness or product awareness, a user survey run by Pactera showed that a majority of their users (53%) prefer gig translators to official translators (47%), as their specific expertise often helps them hit the mark with translations.
When it comes to recruitment, Pactera applies a well-rounded, 5-stage process: using social media to attract talent, applying smart filters for job-profile matching, multiphase testing, training and performance-based rating. The review and quality evaluation process is two-tiered: light QA to ensure fitness for purpose for simple content, and full QA in a controlled gig environment for complex content.
This model and continuous platform innovations allow them to put technology fully in the service of quality and to enhance community translations with AI-driven QA that sets the level of QA based on the performance score and historical data analysis.
Technology is also at the core of the crowdsourcing setup at Translated, as explained by Alessandro Cattelan. Working with hundreds of thousands of native-language translators in more than a hundred languages, they need to be able to identify the best available resources at any given moment and the amount of work those resources can accept. The recruitment process at Translated relies on an a priori evaluation of translators’ experience, based on questionnaires and profile information, and on a classifier trained on thousands of human evaluations to exclude the translators that do not match their quality requirements. Availability is communicated in real time, using an internally developed software client. The performance of translators is ranked dynamically for any specific job, using the T-Rank system developed in-house. T-Rank has been trained on over 1.2 million translated jobs and on feedback from the customer or reviser at various touchpoints in the job. All data is then fed back into the system and turned into rules.
The Unbabel Community Pipeline is another example of a successful combination of crowd and AI working together to scale professional quality evaluation. Their review process is complex and built on a three-layer community collaboration network.
Customer requests go through MT, and if the quality is deemed high, the request is delivered straight to the customer. If it is unsatisfactory, however, the job goes to a Community Editor. Community Editors are reevaluated every 100 tasks and ranked repeatedly by the Community Translators Evaluators. On top of that, the Community Language Analysts assess the quality of delivered translations with quality annotations. The combination of automatic smart checks and human evaluation, with 55 error types available for error annotation, yields an MQM score.
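The two mechanisms described above, quality-gated routing and an annotation-based MQM score, can be sketched in a few lines of Python. This is a simplified illustration and not Unbabel's implementation: the severity weights (minor = 1, major = 5, critical = 10), the per-100-words formula, the 0.8 threshold and all names are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed severity weights; the report does not state the ones Unbabel uses.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


@dataclass
class Annotation:
    error_type: str  # one of the available error types, e.g. "accuracy/mistranslation"
    severity: str    # "minor", "major" or "critical"


def mqm_score(annotations: list[Annotation], word_count: int) -> float:
    """Return 100 minus the weighted error penalty per 100 words, floored at 0."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    penalty = sum(SEVERITY_WEIGHTS[a.severity] for a in annotations)
    return max(0.0, 100.0 - 100.0 * penalty / word_count)


def route(estimated_quality: float, threshold: float = 0.8) -> str:
    """Deliver MT output directly if its estimated quality is high enough,
    otherwise send the job to a community editor (hypothetical threshold)."""
    return "deliver" if estimated_quality >= threshold else "community_editor"


if __name__ == "__main__":
    annotations = [
        Annotation("accuracy/mistranslation", "major"),
        Annotation("fluency/spelling", "minor"),
    ]
    print(mqm_score(annotations, word_count=200))  # one major + one minor error in 200 words
    print(route(0.65))
```

In a real pipeline the quality estimate would come from an automatic quality estimation model and the annotations from human analysts; here both are plain inputs.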
From all three presentations it was clear that crowdsourcing and the “gig” model need to be supported by heavy automation and intelligent processes, due to the large volumes being processed, the number of people involved and the minimal human interaction. Quality evaluation in the gig economy is largely a combination of automated checks and human evaluation, and the review step consists of multiple quality checkpoints, meaning that it takes up a big part of the whole localization process. It was pointed out that a degree in translation is still relevant in the recruitment process, but not necessarily for ‘gig’ translators. Alessandro assured us that even when machine learning is applied in the selection process, the machine is more often than not able to detect the correlation between high qualifications and top-quality translators.
Second panel: UX vs. Linguistic quality
The always-on economy provides new context for quality in which the user always has the final say. Evaluating quality becomes a balancing act between the traditional language quality focus and the user experience testing.
Kirill Soloviev (ContentQuo) opened this panel with a question on how to define user experience and the role that linguistic quality plays in it. Panelists Damien Fernandez (Booking.com), Grainne Maycock (Amplexor) and Alberto Ferreira (Travel Republic) shared their viewpoints.
For Amplexor, UX is the end-to-end interaction with a service or a product. Products need to be able to do what we as users expect and need them to do: provide a personalized experience and not leave us hanging or disappointed. Grainne admitted that she rarely looks at the language when assessing the user-friendliness of a product, as she should not have to - language should be invisible.
Booking.com also applies a holistic approach to UX, looking at all the elements that the user interacts with: relevance of content, product features, photos, design, language - everything that plays a role in establishing and growing brand trust. Damien emphasized the additional level of complexity in their localization efforts: they need to satisfy two different target audiences, customers and partners.
Alberto from Travel Republic expanded further on the definition of UX, to what he called the journey map - including all touch points presented to the user as well as the thought process that occurs in the mind of a user while interacting with a product or a service. He pointed out that localization was long kept peripheral in UX, but it goes hand in hand with product design.
In order to ensure the understanding of the full scope of UX design, Alberto clarified the difference between UX research and UX design - while UX research is an applied form of market research that provides the data to improve the design, UX design refers to applying that data to the actual design of the product or website.
What are the challenges and opportunities of evaluating UX, and how do you measure it? Amplexor stresses the importance of objectivity and of moving away from the “I don’t like it” approach. What matters is end-user satisfaction and a blocker-free experience. What and how to measure? Collect the data that is meaningful to what you are trying to achieve as a business and always ask the question: why are you measuring this?
Booking.com used to measure localization quality with the same metrics as UX - increase in bookings, decrease in cancellations, bounce rates - but they are looking to switch to a combination of UX and linguistic metrics. The challenge is finding the correlation between the two, and finding data to link to an overarching quality score.
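As a rough illustration of what “finding the correlation” could mean in practice, the sketch below computes a Pearson correlation coefficient between a linguistic quality score and a UX metric collected per market. It is not a description of Booking.com's method; the function name and the idea of pairing one score with one metric per market are assumptions made for the example.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)
```

A coefficient close to 1 or -1 suggests a strong linear link between the two measures, and a value near 0 suggests none. A weak coefficient does not rule out a non-linear relationship, and a strong one does not show causation.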
As the panelists gave examples of actions that had positively or negatively impacted user experience, it became clear that having a conversation between the involved parties, and striking a balance between what the vendor is telling you and what the internal clients ask for, is key. A framework should be put in place to support identifying issues.
When asked to give tips on how to ensure the best UX, the panelists almost unanimously agreed on the following few points: define what you are measuring and why, back up your decisions with data, improve the tools, build cross-functional teams, and ask the customer.
In conclusion, there is no absolute metric for assessing quality and user experience, just as there is no way to create a universal user experience. A/B testing is limited and there is an inherent cultural bias in user satisfaction surveys. To fully understand the results of tests and the satisfaction rating, one needs to establish a baseline and continuously iterate while tracking and comparing the results over time. Lastly, localization needs to reflect the common perception of what we are selling (the product) and not be content production at large.
Third panel: Education & translators’ future
Microsoft recently announced that they’ve achieved parity with human translation for EN-ZH. Neural technology has an impact on translation that can’t be ignored. We asked Anna Kennedy (Chillistore Technologies), Sharon O’Brien (Dublin City University), Annette Schiller (Dublin City University), Greg Hellmann (Simultrans) and Marcin Kotwicki (European Commission) to reflect on that and share their perspectives on the future of translators’ education and profession. Ana Guerberof (Dublin City University/ADAPT Centre) led this panel discussion with focused questions.
To the question of what the future looks like, Marcin answered that it changes very little. Before, we had Translation Memory that accounted for 75% of the pre-translated source content; now the remaining 25% can be filled by MT. The confidence issue persists - MT is still on probation. While SMT is almost entirely gone, NMT is not mature enough yet. So far, NMT looks good when it comes to fluency, but there are errors at the accuracy level, resulting in harsh criticism. The term post-editing implies that it is “lighter” and “easier” than revision. However, MT requires more attention, not less. Although it opens up communication in many new language pairs, it also drives demand for high-quality translations.
Sharon from Dublin City University mentioned that the PE specialization does not belong to Translation Studies. Traditional PE is a part of revision, and therefore separate from translation. She also pointed out that TM and MT are talked about separately, while in fact they have merged.
All panelists agreed that there is another market for translation in the near future. Greg explained that machine translation is intensifying an old trend - language is becoming more and more generic. There is a new market for transcreation, creative work and competing with advertising agencies. The responsibility of the market is to push for this and to push for charging more for translation work.
What skills do we need to teach? Sharon explained that creativity is built in and encouraged as part of all study programmes. Where more can be done is specialization in the target markets, cultures and advertising jobs. The ultimate question is if we should train new generations to be content creators.
What can the industry do to assist translators in getting training? Annette said that supporting translation associations is a good start. Graduates need to acquire experience and the industry is not easy on them; there is an apparent gap. You can’t post-edit if you don’t translate. There has never been so much work in the industry, and the main challenge is to keep up with the speed of development in technology. Collaboration between the industry and the universities therefore only makes sense.
Anna emphasized the need to change the business model to eliminate the invisibility of translations and translators. It is hard to train and deliver feedback to a person X. We should give the translator a voice, transform them into language evangelists and language owners.
What needs to be added to education today so that these students are marketable? IT proficiency, creative writing and content creation as a possible study direction. Aside from that, the industry needs to cater for the shift towards more lucrative markets.
Fourth panel: Quality & Modern Translation Pipeline
Last year, TAUS published the Nunc Est Tempus ebook with the blueprint of the Modern Translation Pipeline (MTP): data-driven, autonomous, self-learning and invisible. In this panel Jaap van der Meer tried to find out from the panelists Riccardo Superbo (KantanMT), John Tinsley (Iconic Translation Machines), Kerstin Berns (Berns Language Consulting), Elaine O’Curran (Welocalize) and Wayne Bourland (Dell) how far they think we have come in implementing the Modern Translation Pipeline in the translation business.
MTP distinguishes three quality levels or processes: FAUT (Fully Automatic Useful Translation), good enough and high quality. Data from all localization processes is constantly fed back into the machine learning and machine translation algorithms. Standardization is a precondition for its implementation, as it is for an industry that wants and needs to scale and grow.
Kerstin from BLC explained why their clients in the manufacturing sector are still far from adopting the MTP. They find the MTP hard to understand, monitor and control. There is also a data security concern related to cloud applications. In Kerstin’s opinion, that should be talked about and addressed openly.
Welocalize manages quality using the DQF-MQM quality metric. They differentiate between five different quality levels. Elaine explained that the main challenge that they are facing when trying to adopt standardized workflows or map their processes to a standard is the fact that their workflows and the pipeline are dictated by client requirements. In addition to that, proprietary tools that they’ve built make it hard to integrate.
Wayne shared a vision of the MTP application for Dell. There is certainly work to be done, but what should drive the future are vendor-agnostic technologies that support agile methodologies, robust APIs that allow cross-CMS communication and centralized asset management, and business intelligence. In addition, he shared Dell’s experience of moving away from various translation quality metrics and mapping them to DQF. The process was facilitated by TAUS, and the result was greater flexibility during the rating process, better statistical understanding of the error categories and more detailed reporting on quality.
KantanMT has its own quality assessment process, based on a simplified DQF-MQM metric with customizable KPIs. As Riccardo Superbo explained, this process helps make the human evaluation of MT faster for their users and speeds up the improvement of the performance quality of their MT engines.
Iconic illustrated the evolution of machine translation in the modern translation pipeline. John surprised us with the results of an analysis of their internal use cases, which showed that the use of raw MT is increasing amongst buyers, while PEMT usage by LSPs is going down. Possible explanations are the increasing success of NMT and vendor lock-in - buyers are often dependent when LSPs build the engines for them.
The Plenary session and conclusions
By the end of the day, we had managed to take the pulse of the industry when it comes to quality: the different approaches to it and concerns around it, the available solutions and the overall direction in which we are moving: dynamic, data-driven and decentralized. Following a standard, namely DQF-MQM, becomes crucial for measuring translation quality, for industry benchmarking and eventually for bringing global intelligent content delivery forward.
During the breakout session, the participants were split into four groups to continue their discussion on the topics of the day and come up with action items for themselves and for TAUS. Below is the result:
Action items for QE Summit attendees:
- We seem to be all trying to achieve the same thing by applying different approaches. How can we use a standard translation process combined with additional processes that can help us improve?
- How can we train translators in UX?
- Companies should get involved in sharing their content to support training of translators
Homework for TAUS:
- Ensure consistent implementation of DQF. Provide guidance on how to make sure that all users have understood the guidelines and are complying with them.
- Research on how to avoid the fear of using the cloud and get consensus from the buyers.
- Create guidelines on how to write a transcreation brief.
- Create a content repository for teaching that involves TAUS, LSPs and universities, and that graduates and MA programs can use to work with real translated content.