TAUS has recently staged two successful webinars on datafication (Datafication in Europe Webinar Recording and Datafication in China Webinar Recording), one focused on data collection projects in Europe, the other on business developments around language data in China. Taken together, they revealed a striking contrast between the central role of public administration concerns in Europe and the rise of a private translation data marketplace in China.
Facts on the ground derived from these two webinars offer an opportunity to look a little more closely at three topics central to the longer-term data transformation process: language needs, market potential, and the role of platforms.
In Europe, the scarcity of digital data resources for low-population languages disadvantages communities who wish to shift towards more automated translation, and in the longer term join the Digital Single Market. In other words, most languages in the EU outside of the well-resourced English/French/German/Italian/Spanish cluster cannot yet meet the volume criterion required for effective automation.
That is largely why there is yet another EC (European Commission) call for proposals in November 2017 for projects dedicated to boosting what we might call L-data discovery: identifying, processing and collecting Member State language resources for those languages that are still under-represented in databases for relevant public administration domains.
In China, on the other hand, the emergence of companies such as UTH and TMXMall, which act as marketplaces for buying and selling tech-ready translation data, is opening up a more accessible source of different kinds of language data, where there is little local concern about the inclusiveness of low-population languages. Nevertheless, TMXMall claims it has resources for 222 languages. It is also likely that China has far more experience than Europe of using crowd-sourcing to build translation resources.
In Europe, ongoing EC efforts to collect and/or create language data resources for public agencies have been driven by the ELRC (European Language Resource Coordination) project. But in one salient way, ELRC seems to have put the cart before the horse.
It is not clear whether the EC has commissioned a market study to identify the real (long-term) translation needs for the many public agencies in its cross-hairs. Do these potential customers need translation for incoming understanding or for outgoing publishing? What if it turned out that Europe’s agencies in fact need to understand/publish considerable amounts of content in non-European languages such as Russian, Arabic, Chinese, Japanese and Korean? By maintaining the focus on collecting/creating “local” EU language resources, strategic international language data could well be ignored.
China’s data marketplaces appear to be nowhere near so parochial in scope. While it might have been the case that translation into Chinese was a key requirement in the recent past, it is very likely that Chinese will in future become a major source language for technical knowledge from medicine to energy, commerce, and consumer culture of all kinds. This will herald an age of localization from Chinese to many languages that have so far not featured on the strategic language map (think One Belt One Road). If this turns out to be true, then companies such as TMXMall and UTH International will be well positioned to deliver the speed and range required by a 21st century data-to-translation service.
Europe’s new ELRC project, the existing ELDA language resource organization, and a scattering of sites such as OPUS and META provide the bulk of Europe’s publicly available language data, including some (though domain-limited) parallel corpora for translation. ELDA is also a member of the ELRC network, as is META, so presumably EC data will be available through all these channels (all collected here).
However, apart from the TAUS Data exchange, no one in Europe at least has yet attempted to set up a more comprehensive commercial language data exchange with a focus on translation needs in a neural age – i.e. data that is in-domain and vocabulary-rich, able to overcome the holistic short-cuts found in current machine-learning translation.
In China, on the other hand, UTH International has not only been constructing a fully-fledged platform for the exchange of translation data, but also developed a machine translation service called Sesame. TMXMall meanwhile offers language data to three kinds of clients - translation companies using the company’s TM marketplace to run pre-translations to reduce translation costs; MT companies buying bulk data for their engines in domains such as medical and news, and lastly academics using translation memories for research purposes.
This first commercial foray into language data sharing could spell the future for large-scale language data management more generally. The growing need for the right data for automated translation could well drive a general scaling up of resource collection. Cloud-based platforms could eventually provide an array of services for identifying, selecting, cleaning, tagging, and parallelizing data, and even training domain-specific MT engines for LSPs or buyer organizations. Remember that large cloud web services may be only too ready to gobble up this activity.
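To make the "cleaning" step above concrete, here is a minimal sketch of the kind of filtering a data platform might run over parallel segment pairs before offering them for MT training. The function name, thresholds, and sample data are illustrative assumptions, not any vendor's actual pipeline:

```python
# Illustrative sketch: a basic cleaning pass over parallel segment pairs,
# dropping empty, duplicate, and badly length-mismatched entries. Thresholds
# here are assumptions for demonstration, not production values.

def clean_parallel(pairs, max_ratio=3.0):
    """Return only the usable (source, target) segment pairs."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side empty: unusable for training
        if (src, tgt) in seen:
            continue  # exact duplicate: inflates volume, adds no signal
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # extreme length mismatch suggests misalignment
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

raw = [
    ("Hello world", "Bonjour le monde"),
    ("Hello world", "Bonjour le monde"),                      # duplicate
    ("Yes", "Une très longue phrase qui ne correspond pas"),  # misaligned
    ("", "Texte orphelin"),                                   # empty source
]
print(clean_parallel(raw))  # keeps only the first pair
```

Real pipelines would add language identification, encoding repair, and deduplication across documents, but even this toy filter shows why curated, platform-cleaned data commands a price over raw crawled text.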
In the longer term, a language data mart could handle different combinations of data needed for machine learning in the cognitive industries (in the form of speech, text, image, sound, video, IoT content, etc.) so that rich, multilingual scenarios could be devised for new types of needs, applications and markets. But it would have to start with an agreement that widespread sharing of data can make sense to the industry – a sentiment often expressed but rarely translated into practice!
For more information about data, you can download the TAUS Data Market Whitepaper.