
Data Initiatives Near and Far

In this blog post, we tackle the question of translation data: could “good” data become as accessible as the open ML software packages now used to build new businesses?

With two TAUS Datafication webinars ahead in October (one covering Europe, the other China), followed by the TAUS Data Summit in early November, we think it’s worth quickly reviewing the state of play for data sharing opportunities in these two regions.

TAUS believes there could be a new opportunity for raising the level of debate and innovation around translation/language data in our industry. The rapid rise of machine learning (ML), the near-global research effort into the most viable software models for better data-driven ML, and the constant search for relevant data sets to improve content automation are converging to produce a new dynamic across tech. The question is, could “good” data become as easily accessible as the open ML software packages that now leverage data to build new businesses and/or production and quality pipelines?

The Summit will offer a high-level exploration of the viability of a data market for the translation industry. The rest of this blog summarizes the most striking data developments on the ground.

While Europe is currently bogged down in reviewing tougher new legislation on data privacy and copyright in general (more below), China has seen several large-scale attempts to generate and/or share translation data as an industry resource. This makes China an almost unique case of marketing language data for the industry.

China opening up

Baidu, for example, paid human translators a modest sum for participating in a hard-working “gaming” session in 2015 to produce millions of English-Chinese string pairs from translations of various types of business content. The ultimate aim was to provide lots of training data for MT projects.

In fact, crowdsourcing translations of small strings of language pair data from a very broad population of translators, using a Mechanical Turk approach, is probably “uberizing” some translation work in Asia. This practice could in time spread to India and other countries, assuring a steady flow of human-translated data sets to prime the MT pump.

In another move, Shanghai-based UTH International has built a database of several billion strings in multiple language pairs, which it sells to other companies needing translation memories (TMs). The database holds 8.7 billion parallel segments of high-quality human TM data covering over 222 languages. The company recently raised a new round of funding that values it at over $90M.

Tmxmall, another Chinese tech company, uses its Cloud-Based Smart Aligner to build large-scale TMs as part of its Tmxmall TM marketplace, which should promote much more translation memory sharing. 

China is clearly committed to an intensive AI agenda, which means ensuring that data is available at almost any cost, and above all, not applying legislation that makes data sharing more difficult than it needs to be.

And with the 2020 Olympics looming on Japan’s national calendar, there is strong East Asian interest in acquiring relevant language data to drive the various translation services, mobile as well as commercial, that will be launching or operating at volume in the run-up to and during the Games.

Europe shutting down

Contrast this more laissez-aller approach with the legal scene for data in the European Union. The legislative focus in Brussels is rightly on making personal data more secure for citizens, and protecting the content creators of all kinds who might find their work devalued under a softer copyright regime. The question is, though, do the proposed solutions help improve a data-driven business culture, and more specifically the promise of sharing language data for the translation industry?

First, a quick look at European data privacy. The fundamental right to keep personal information private has recently led to the creation of the European General Data Protection Regulation (GDPR), which takes effect in May 2018. Basically, this will give citizens more control over how their personal data is used by online companies, and will require reasoned explanations of how that data is exploited.

The overall effect of this will be to make data-driven innovation less easy in the EU, as any firm anywhere in the world that handles the personal data of people in the EU must meet this regulation and ask individuals for consent to use their data. On the other hand, it will also establish a plausible new standard for privacy in a digital economy, which many people feel is needed.

No doubt workarounds will be found so that start-ups trying to build new data-fueled solutions will avoid falling foul of the law, either by anonymizing their data or by getting ML algorithms to explain how any data-sourced decision is reached.

Second, copyright. The EU is currently working on a reform of copyright for existing data or “works.” The key proposal, which has raised hackles in the startup community and among many librarians, is that the new law will put a time limit of three years on the use of copyrighted data by a given AI startup. When the three years are up, these companies will have to find their own sources of data, or pay for copyrighted versions.

The general feeling is that the law is designed to protect legacy data owners – e.g. traditional publishers or music recording companies – rather than open up the market to more data sharing and stimulate wider data use in innovation projects. At the same time, it rightly aims to protect citizens from the unacceptable usage of their data by very large internet platforms.

So, once again, the proposal is designed to provide some protection for smaller players and maintain a more level playing field for innovation, while making it more difficult for businesses in general to share data in a practical way as they go about innovating. Awkward.

Ironically, the EU is currently trying to find out how much data sharing actually goes on among small and mid-sized companies, in an effort to improve its open data drive, and its support for what is sometimes called linked data.

Presumably the plan is to encourage businesses to share more data – a cheaper and quicker route than procuring or building their own. Whether this could include parallel corpora of translation content is not yet clear.

New translation service at the EC

The EC also currently supports a project within the large-scale Connecting Europe Facility (CEF) program that aims to share more publicly-available data (mainly from European Commission collections) in the translation domain.

This European Language Resource Coordination (ELRC) effort is a consortium tasked with encouraging national administrations in all European countries to transform relevant content into translation data sets. These in turn will be used to source data for the MT@EC project, which provides centralized automated translation for EC departments around the EU.

As of November 15, 2017, this MT service will merge into a new eTranslation facility, designed to offer a more fully-structured “service desk” for translation to EU administrations.  

As it happens, ELRC will be holding its 3rd conference in November 2017, so we should gain some idea of how much progress has been made in this exercise and, possibly, whether the data collected can be shared more generally within the translation industry.



Andrew Joscelyne

Long-time European language technology journalist, consultant, analyst and adviser.