Amsterdam, January 14, 2019 - TAUS launches Matching Data: a new technique for selecting language data for the training and tuning of machine translation (MT) engines. This new approach is a perfect fit for the new generation of Neural MT, which is much more sensitive to the quality of the training data. Matching Data empowers MT developers as well as Language Service Providers to efficiently compile customized corpora for building their own domain-specific translation solutions based on an example data set.
“Finding language data for MT training has always been a big challenge,” says Jaap van der Meer, director of TAUS. “Selecting data for a particular domain is almost impossible. In 2010 already we started scoping a scenario in which an example data set, a simple domain-specific translation memory, would assist our users to compile a completely personalized corpus out of the repository of many billions of segments in the TAUS Data Cloud. The technology to do this was not there yet, but now it is, thanks to the DatAptor project.”
The DatAptor project was a research project undertaken by the Institute for Logic, Language and Computation of the University of Amsterdam, led by Professor Khalil Sima’an and funded by the Dutch STW. Partners in the project were Intel, the Directorate-General for Translation of the European Commission, and TAUS. From 2013 to 2016, a team of researchers explored different approaches to make data selection from vast amounts of data seamless and more effective.
“Our dream was to make the world wide web itself the source of all data selections,” says Professor Khalil Sima’an, “but we decided to start more modest and make the very large TAUS Data repository our hunting field first. In DatAptor we learned that every domain is a mixture of many subdomains. The combinatorics of subdomains in a very large repository harbors a wealth of new, untapped selections. Therefore, if the user provides a Query corpus representing their domain of interest, the Matching Data method is likely to find a suitable selection in the repository.”
The Matching Data method inverts the typical search approach by indexing all sentences in the mixed-domain search corpora as searchable entities. As a result, Matching Data returns high-fidelity data with a matching score assigned to each individual segment. Users can decide to download compact, medium or large selections, depending on their needs.
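The idea of per-segment matching scores and size-based selections can be illustrated with a minimal sketch. This is not the TAUS or DatAptor algorithm, which this release does not describe. It assumes a simple token-overlap score, and the function names and the thresholds for "compact", "medium" and "large" are hypothetical.

```python
import re

# Hypothetical score cut-offs; the real service's thresholds are not published here.
SELECTION_THRESHOLDS = {"compact": 0.8, "medium": 0.5, "large": 0.2}


def tokenize(text):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def score_segments(query_corpus, repository):
    """Score each repository segment by the share of its tokens
    that also occur in the query corpus (an assumed stand-in metric).
    Returns (segment, score) pairs sorted from best to worst match."""
    query_vocab = set()
    for sentence in query_corpus:
        query_vocab.update(tokenize(sentence))

    scored = []
    for segment in repository:
        tokens = tokenize(segment)
        if not tokens:
            scored.append((segment, 0.0))
            continue
        overlap = sum(1 for token in tokens if token in query_vocab)
        scored.append((segment, overlap / len(tokens)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def select(scored, size):
    """Keep the segments whose score reaches the threshold for the given size."""
    min_score = SELECTION_THRESHOLDS[size]
    return [segment for segment, score in scored if score >= min_score]


if __name__ == "__main__":
    query_corpus = [
        "The patient received a dose of the drug.",
        "Take the drug twice daily.",
    ]
    repository = [
        "The patient should take the drug daily.",
        "The court dismissed the appeal.",
        "Stock prices fell sharply.",
    ]
    scored = score_segments(query_corpus, repository)
    for size in SELECTION_THRESHOLDS:
        print(size, select(scored, size))
```

A lower threshold yields a larger but noisier selection, which mirrors the trade-off between compact, medium and large downloads.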
Oracle International Product Solutions has worked with the new TAUS Matching Data service to develop a colloquial corpus for general online conversations and chats between English and Chinese, Korean, Japanese, Spanish and Brazilian Portuguese. Oracle language specialists undertook an in-depth linguistic review and gave an average quality score of 84% on the segments retrieved through Matching Data.
“Matching Data is designed to serve as an industry community service,” says Jaap van der Meer. “Anyone can initiate a new domain corpus by providing a Query Corpus. The resulting domain corpora are available in the TAUS Matching Data Library for everyone who is interested in improving their global content solutions. This release of Matching Data is the first step on our ambitious road towards an open data marketplace.”
For more information, please go to:
TAUS, the language data network, is an independent and neutral industry organization. We develop communities through a program of events and online user groups and by sharing knowledge, metrics, and data that help all stakeholders in the translation industry develop a better service. We provide data services to buyers and providers of language and translation services.
The shared knowledge and data help TAUS members decide on effective localization strategies. The metrics support more efficient processes and the normalization of quality evaluation. The data lead to improved translation automation.
TAUS develops APIs that give members access to services like DQF, the DQF Dashboard, and the TAUS Data Market through their own translation platforms and tools. TAUS metrics and data are already built into most of the major translation technologies.