The demand for post-editing of machine translation (PEMT) is growing, according to the 2018 report from Slator. But before post-editing becomes an inherent part of every production workflow, the industry should agree on the most effective methods to evaluate the quality of post-edited machine translation output.
Clara Ginovart, a PhD student in Translation and Language Sciences at UPF, researches post-editing training while working as a CAT Tool and MT Consultant at Datawords. As part of her PhD, she carried out a small-scale study using TAUS DQF. The goal of the study was to establish a method for evaluating the quality of machine translation output, so that LSPs can determine when an MT engine is ready for production use.
Initial results were presented at the AsLing (International Association for Advancement in Language Technology) conference, held on 15 and 16 November 2018 in London, UK.
Case Study Scope and Challenge
The study consisted of post-editing the output of three MT engines (KantanMT statistical and neural, and SDL statistical) in two language pairs: French into Spanish and French into Italian. It was conducted in a real-life production environment at Datawords, with 20 freelance post-editors and roughly 6,000 words. Several evaluation methods were applied to establish the post-editing effort each engine required: task-based evaluation with time spent (post-editing being the task), automatic machine translation quality metrics (BLEU and edit distance), and human evaluation through ranking.
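Edit distance, one of the automatic metrics listed above, measures how many single-character insertions, deletions, and substitutions separate the raw MT output from its post-edited version: the smaller the distance, the less effort the post-editor had to invest. The study does not publish its scoring code, but the standard Levenshtein computation can be sketched as follows (function names are illustrative, not from the study):

```python
def edit_distance(hypothesis: str, reference: str) -> int:
    """Levenshtein distance between raw MT output and its post-edited version."""
    m, n = len(hypothesis), len(reference)
    # prev[j] holds the distance between hypothesis[:i-1] and reference[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_edit_distance(hypothesis: str, reference: str) -> float:
    """Distance scaled to [0, 1]; lower suggests less post-editing effort."""
    longest = max(len(hypothesis), len(reference), 1)
    return edit_distance(hypothesis, reference) / longest
```

Normalizing by the longer string makes segments of different lengths comparable, which is why edit distance is often reported as a percentage in production dashboards.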
The main caveat of evaluating machine translation quality by measuring post-editing effort is that the linguistic quality of the final post-edited output is not taken into account. It is not enough to conclude that one post-editor was faster than another, or that the TER score was lower for one engine than for another, without an adequate quality evaluation of the final translation. This is where DQF came into the picture.
Completing the Quality Evaluation with DQF-MQM
The DQF-MQM error typology is a standard framework for defining translation quality metrics. It provides a comprehensive catalog of quality error types, with standardized names and definitions, and a mechanism for applying them to generate quality scores. In this case study, it was used to measure the types and severities of errors appearing in the post-edited translation and, finally, to check the result against the pre-defined quality threshold.
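The scoring mechanism behind such a typology is, at its core, simple arithmetic: each annotated error carries a severity weight, the weights are summed, and the sum is normalized by the evaluated word count. The sketch below uses the commonly cited MQM severity weights (minor 1, major 10, critical 100); these weights, the names, and the default threshold are assumptions for illustration, since actual DQF-MQM deployments can be configured differently:

```python
from dataclasses import dataclass

# Severity weights follow the common MQM convention; a given DQF-MQM
# deployment may configure different values (assumed here).
SEVERITY_WEIGHTS = {"neutral": 0.0, "minor": 1.0, "major": 10.0, "critical": 100.0}

@dataclass
class Annotation:
    error_type: str   # e.g. "Accuracy/Mistranslation", "Fluency/Grammar"
    severity: str     # one of the keys of SEVERITY_WEIGHTS

def penalty_score(annotations: list, word_count: int) -> float:
    """Weighted error penalty per evaluated word, as a percentage."""
    total = sum(SEVERITY_WEIGHTS[a.severity] for a in annotations)
    return 100.0 * total / word_count

def passes_threshold(annotations: list, word_count: int,
                     threshold: float = 2.5) -> bool:
    """True if the weighted penalty stays within the acceptance threshold."""
    return penalty_score(annotations, word_count) <= threshold
```

For example, one major and one minor error over a 500-word sample yield a penalty of 2.2%, which would pass a 2.5% threshold; the same errors over 100 words would not.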
Twenty project managers at Datawords, all with a background in translation, performed error annotation with DQF-MQM on the 6,000 post-edited words. All of them used the DQF plugin in SDL Trados Studio for the first time. The briefing took only an hour, and they found the tool incredibly helpful for tracking the quality of translation projects, as the error typology is integrated into the Trados Studio editor and the reports are available for download in Excel. One of the reviewers in the project, Senior Italian Country Manager Rosaria Sorrentino, explained: “DQF is an interesting tool to evaluate different aspects of a post-editing project. The user interface is intuitive and the options provided for the evaluation are very accurate and complete.”
The results of the automatic MT evaluation tests validated some of the hypotheses: there is almost always a linear correlation between edit distance, speed, TER, and BLEU score (see the full report in the TC40 Proceedings). However, these metrics did not measure the extent to which the final post-edited content met the quality expectations of the project initiator. That is the insight DQF was able to offer, based on the preset acceptance threshold (here 2.5%). According to the numbers for both languages shown in the table below, 20% of the post-edited texts did not “pass” the threshold.
Automatic MT quality metrics tell only one side of the quality story, which is not always useful in a production environment. Adding DQF-MQM elements, specific errors and their severities, gives meaning to the automatic metrics and validates their results. It also opens the door to a wider range of possible correlations: between productivity, edit distance, and correction and error density.
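The correlations mentioned above can be quantified with the Pearson coefficient, which measures how linearly two series move together (1 is a perfect positive correlation, -1 a perfect negative one). A minimal sketch, assuming two equal-length lists of per-segment measurements, e.g. edit distances and error-density scores:

```python
import math

def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

With per-segment edit distances in one list and DQF-MQM penalty scores in another, a coefficient near zero would suggest that heavy post-editing does not by itself predict low final quality, which is exactly the gap the human annotation step is meant to close.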
Join the next DQF webinar, where we discuss the importance of standard measures and an industry-focused way to identify and categorize translation errors with the DQF-MQM error typology.