TAUS provided language data for Pangeanic to train its machine translation (MT) models and improve its adaptive MT output for the COVID-19 pandemic and healthcare domain.

Finding high-quality data for MT training has always been a challenge on the path to generating high-performing MT output. The challenge increases when the language pairs are rare or when training data in a lesser-known domain is needed.

Machine translation systems perform well at automated translation when fed the right training data. However, finding the high-quality datasets needed to build specialized MT tools for COVID-19, a new topic in the healthcare domain, was a challenge.

TAUS provided Pangeanic with a total of 1.8 million words of MT training data across five language pairs: English to Spanish, German, Polish, Russian, and Chinese.

Using the data provided by TAUS, Pangeanic built COVID-19 domain-specific neural machine translation (NMT) models for the five language pairs on Pangeanic ECO, a user-friendly customer portal where users can adapt models using three levels of training settings.

The highest BLEU score improvement was recorded for the English > Russian language pair at 50%, followed by English > Chinese at 26%, English > German at 20%, English > Spanish at 9%, and English > Polish at 8%.
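The summary does not say whether these percentages are relative gains over a baseline model or absolute BLEU-point differences. Assuming they are relative gains, a minimal sketch of the arithmetic looks like the following. The function name and the scores are hypothetical and are not taken from the case study.

```python
def relative_improvement(baseline_bleu: float, adapted_bleu: float) -> float:
    """Return the relative BLEU gain of the adapted model over the baseline, in percent."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return (adapted_bleu - baseline_bleu) / baseline_bleu * 100.0


# Hypothetical scores for illustration only (not reported in the case study).
baseline, adapted = 30.0, 45.0
gain = relative_improvement(baseline, adapted)
print(f"{gain:.0f}% relative improvement")
```

Under this reading, the same percentage can correspond to very different absolute gains depending on how low the baseline score was, which is worth keeping in mind when comparing language pairs.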
