In Part I, we defined the pivot language approach, briefly discussed its major drawbacks, covered the factors that influence the choice of pivot language, and explored two areas where pivoting can be deployed: relay interpretation (oral) and human translation (written), including translation from audio recordings with or without a script. In Part II of this blog article, we discuss further areas where pivot languages can be deployed, namely building and enhancing bilingual lexicons, translation memories, machine translation systems and machine transliteration systems.
Increasing the Size and Improving the Quality of Bilingual Lexicons
There have been a number of efforts and approaches to increase the size of bilingual lexicons via a pivot language. For example, in the paper “Building Bilingual Lexicons Using Lexical Translation Probabilities via Pivot Languages” (2008), T.Tsunakawa et al. address the main challenges of ambiguity and term mismatch, and in “Improving Calculation of Contextual Similarity for Constructing a Bilingual Dictionary via a Third Language” (2013), T.Tsunakawa et al. propose a context similarity criterion and word-association measures that improve the pivot-generated lexicon by removing spurious term pairs from it. In both works, experiments were carried out on the construction of Japanese-to-Chinese lexicons through English.
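To make the ambiguity problem concrete, here is a minimal Python sketch of joining two lexicons through a pivot. It is not the method of either paper cited above: the toy entries, the function names `pivot_candidates` and `filter_by_support`, and the "keep a pair only if at least two pivot words support it" heuristic are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy lexicons (made-up entries for illustration only).
JA_EN = {"銀行": {"bank", "financial institution"}}
EN_ZH = {
    "bank": {"银行", "河岸"},  # ambiguous pivot: financial bank vs. river bank
    "financial institution": {"银行", "金融机构"},
}


def pivot_candidates(src_to_pivot, pivot_to_tgt):
    """Join two lexicons through shared pivot words.

    Returns {source_word: {target_word: set of pivot words linking them}}.
    """
    candidates = defaultdict(lambda: defaultdict(set))
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, ()):
                candidates[src][tgt].add(pivot)
    return {src: dict(tgts) for src, tgts in candidates.items()}


def filter_by_support(candidates, min_pivots=2):
    """Keep only pairs linked by at least `min_pivots` distinct pivot words."""
    return {
        src: sorted(t for t, pivots in tgts.items() if len(pivots) >= min_pivots)
        for src, tgts in candidates.items()
    }


candidates = pivot_candidates(JA_EN, EN_ZH)
lexicon = filter_by_support(candidates)
print(lexicon)  # {'銀行': ['银行']}
```

The plain join proposes the spurious pair 銀行 / 河岸 (river bank) because the English word "bank" is ambiguous. Requiring support from a second pivot word filters it out, at the cost of also dropping the plausible but singly supported 金融机构.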
Building or Enhancing Translation Memories (TM)
The pivot language approach can also be used to build or enhance TMs for low-resourced or non-existent language pairs. Because translation happens in two steps and errors can carry over from one step to the next, such translation links can be given a lower confidence rate, so that they are selected only as a last resort or are cross-checked and post-edited. In TAUS Data Cloud, bilingual segments are generated through a pivot language only when all the features of the source and target segments, namely domain, content type, data owner and data provider, agree. Setting such constraints on pivoting can reduce the possibility of errors in the generated translations.
Besides consolidating TMs in a single database, allowing translation units to be annotated with metadata and facilitating the preparation of quality data for statistical machine translation, existing translation memory management systems in the language industry also ease the creation of bilingual corpora via a pivot language when data is lacking for specific language pairs.
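The metadata constraint can be pictured with a short Python sketch. This is not how TAUS Data Cloud is implemented: the `TranslationUnit` class, the `pivot_units` function, the exact-match join on the pivot segment and the confidence value of 0.5 are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationUnit:
    source: str
    target: str
    domain: str
    content_type: str
    owner: str
    provider: str

    def metadata(self):
        """The four features that must agree before pivoting is allowed."""
        return (self.domain, self.content_type, self.owner, self.provider)


def pivot_units(src_pivot_tm, pivot_tgt_tm, confidence=0.5):
    """Link units that share the same pivot segment AND the same metadata.

    Linked pairs get a reduced confidence so they can be used as a last
    resort or routed to post-editing.
    """
    index = {}
    for unit in pivot_tgt_tm:
        index.setdefault((unit.source, unit.metadata()), []).append(unit)

    linked = []
    for left in src_pivot_tm:
        for right in index.get((left.target, left.metadata()), []):
            linked.append((left.source, right.target, confidence))
    return linked


# Toy German-English and English-French memories (made-up data).
DE_EN = [TranslationUnit("Datei speichern", "Save file",
                         "software", "ui", "acme", "acme")]
EN_FR = [
    TranslationUnit("Save file", "Enregistrer le fichier",
                    "software", "ui", "acme", "acme"),
    # Same pivot text but different metadata: must not be linked.
    TranslationUnit("Save file", "Sauvegarder le dossier",
                    "legal", "contract", "other", "other"),
]

linked = pivot_units(DE_EN, EN_FR)
print(linked)  # [('Datei speichern', 'Enregistrer le fichier', 0.5)]
```

Only the French unit whose domain, content type, owner and provider all match is linked. The second one shares the English text but comes from a different context, so it is skipped.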
Building and Improving Machine Translation (MT) Systems
In a large-scale study, “462 Machine Translation Systems for Europe” (2009), Koehn et al. build these systems and compare their performance against pivot translation and a number of system combination methods (multi-pivot and multi-source). They show that, on the multilingual Acquis corpus, pivoting through English, rather than through French or other languages, often works as well as direct translation. The Acquis languages come from seven language families: Indo-European (namely Germanic, Romance, Slavic, Baltic and Greek) and non-Indo-European (namely Finno-Ugric and Semitic).
Automatic translation systems mainly cover around 100 of the 7,000 languages in the world. In the effort to create MT language resources for language pairs with scarce resources, three basic techniques are used to employ a pivot language in MT: (1) triangulation, which pairs phrases between source and pivot and between pivot and target; (2) transfer, which translates the whole source sentence into the pivot language and then into the target language; and (3) synthesis, which creates synthetic training corpora, an alternative approach to paraphrasing.
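Triangulation is commonly described as summing over pivot phrases: the probability of a target phrase given a source phrase is estimated as the sum, over all shared pivot phrases, of p(pivot | source) × p(target | pivot). The Python sketch below illustrates this under simplifying assumptions: phrase tables are plain nested dictionaries, the probabilities are made-up toy values, and `triangulate` is a hypothetical name, not the API of any MT toolkit.

```python
from collections import defaultdict

# Toy phrase tables: {phrase: {translation: probability}} (made-up values).
ES_EN = {"casa": {"house": 0.7, "home": 0.3}}
EN_CA = {
    "house": {"casa": 0.9, "habitatge": 0.1},
    "home": {"casa": 0.6, "llar": 0.4},
}


def triangulate(src_to_pivot, pivot_to_tgt):
    """Build a source-target table by marginalising over pivot phrases.

    p(tgt | src) = sum over pivots of p(pivot | src) * p(tgt | pivot)
    """
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the second table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                table[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in table.items()}


table = triangulate(ES_EN, EN_CA)
for tgt, prob in sorted(table["casa"].items(), key=lambda kv: -kv[1]):
    print(f"casa -> {tgt}: {prob:.2f}")
```

Transfer, by contrast, would run two complete translation systems one after the other on whole sentences, so an error made in the first step cannot be recovered in the second.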
In “Selective Combination of Pivot and Direct Statistical Machine Translation Models” (2013), A.El Kholy et al., a joint effort of Columbia University, Science Applications International Corporation and eBay Inc., propose a selective approach to combining pivot and direct statistical machine translation (SMT) models to improve translation quality, with Persian-Arabic SMT as a case study.
The pivot approach has even been tested in SMT for cases where the corresponding source phrase and target phrase connect to different pivot phrases. In “Improving Pivot-Based Statistical Machine Translation Using Random Walk” (2013), Xiaoning Zhu et al., a joint effort of Harbin Institute of Technology and Baidu Inc., use Markov random walks to connect possible translation phrases between the source and target languages, in an effort to recover bilingual resources that would otherwise be lost.
In “Pivot-based Triangulation for Low-Resource Languages” (2014), R.Dholakia et al. conduct a comprehensive study on the use of the triangulation technique for four very low-resource languages: Mawukakan, Maninkakan, Haitian Kreyol and Malagasy. Furthermore, as the data for the low-resource language pair and for the pivot language pairs typically come from very different domains, they use insights from domain adaptation to tune the weighted mixture of direct and pivot-based phrase pairs to improve translation quality.
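One simple way to picture a weighted mixture of direct and pivot-based phrase pairs is linear interpolation, sketched below in Python. The paper's actual mixture and tuning procedure may differ. The `interpolate` function, the weight of 0.75 and the toy probabilities are assumptions for illustration only.

```python
def interpolate(direct, pivot, weight_direct=0.5):
    """Linearly mix two phrase tables of the form {src: {tgt: probability}}.

    p_mixed = weight_direct * p_direct + (1 - weight_direct) * p_pivot
    A phrase pair missing from one table counts as probability 0 there.
    """
    if not 0.0 <= weight_direct <= 1.0:
        raise ValueError("weight_direct must be between 0 and 1")

    mixed = {}
    for src in direct.keys() | pivot.keys():
        d = direct.get(src, {})
        p = pivot.get(src, {})
        mixed[src] = {
            tgt: weight_direct * d.get(tgt, 0.0)
            + (1.0 - weight_direct) * p.get(tgt, 0.0)
            for tgt in d.keys() | p.keys()
        }
    return mixed


# Toy tables (made-up values): a small direct model and a noisier
# pivot-based model that proposes an extra candidate.
DIRECT = {"dlo": {"water": 1.0}}
PIVOT = {"dlo": {"water": 0.6, "liquid": 0.4}}

mixed = interpolate(DIRECT, PIVOT, weight_direct=0.75)
for tgt, prob in sorted(mixed["dlo"].items(), key=lambda kv: -kv[1]):
    print(f"dlo -> {tgt}: {prob:.2f}")
```

In practice the weight would be tuned on held-out data. A higher weight trusts the small in-domain direct model more, while a lower one leans on the broader coverage of the pivot-based table.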
In “Improving Arabic-Chinese Statistical Machine Translation using English as Pivot Language”, N.Habash et al. explore different ways of pivoting through English to translate Arabic to Chinese, two languages with large global presence. The three languages (source, pivot and target) they use are very different and from completely unrelated families.
Their results show that using English as a pivot language for translating Arabic to Chinese actually outperforms direct translation, because English is a sort of middle ground between Arabic and Chinese in terms of linguistic features and, in particular, word order.
The pivot approach is particularly interesting for multilingual countries like Spain, a country with four official languages, namely (Castilian) Spanish, Catalan, Basque and Galician. Parallel corpora among Catalan, Basque and Galician are difficult to find, but it is relatively easy to collect them between Spanish and any of the three.
In the paper “Pivot strategies as an alternative for statistical machine translation tasks involving Iberian languages”, C.Henriquez et al., a joint effort of the Polytechnic University of Catalunya and the Institute for Infocomm Research (Singapore), develop machine translation systems for pairs such as English to Catalan, using Spanish as the pivot. Such systems help these minority languages by giving them a global presence and promoting their use in content collaboration.
As online machine translation services become more popular, a pivot translation service often produces translations between non-English languages by cascading different translation services via English. As a result, the meaning of words often drifts due to the inconsistency, asymmetry and intransitivity of word selections among translation services.
In the paper “Context-Based Approach for Pivot Translation Services” (2009), R.Tanaka et al., a joint effort of C&C Innovation Research Laboratories, NEC Corporation, National Institute of Information and Communications Technology (NICT) and Kyoto University, propose context-based coordination to keep word meanings consistent across pivot translation services. The evaluation results showed that translation quality was improved by 40% for all sentences, indicating that consistent pivot translation services can be offered through context-based coordination built on existing services.
Constructing Bridge Transliteration Systems
The pivot language approach is also used in building transliteration systems (i.e. systems that automatically convert text from one script or alphabet to another) when direct parallel resources are unavailable or insufficient. In the paper “Everybody loves a rich cousin: An empirical study of transliteration through bridge languages” (2010), a joint effort of the Indian Institute of Technology in Bombay and Microsoft Research India in Bangalore, M.Khapra et al. explore the use of pivot languages for machine transliteration systems, which are mostly data-driven and require large parallel corpora of names.
The experiments were conducted on different sets of languages: English, Indic, Slavic and Semitic. The study shows that the performance of the pivot approach is on a par with direct transliteration systems in practical applications. Further, it shows that in many cases the bridge language can be suitably selected (in terms of rich orthography and phoneme-grapheme mapping) to ensure optimal machine transliteration accuracy.
The use of pivot languages in different application areas has recently become more popular as it provides practical solutions to overcome language resource limitations. Different methods using language relations, context similarity, word associations, feature constraints such as domain or content type, and so on, alleviate the drawbacks of the pivot language approach such as error transfer, ambiguity and mismatches and improve the quality of the generated (via pivot) language resources and translation services.