Machine Translation (MT) has been around for decades, but only in recent years has it taken center stage: it is represented at every language-related conference, and the attention is heightened by claims of human parity.
While there are new engines and research articles coming out every month, getting the right data for MT training still seems to be a struggle. If you don’t have your own data, the main blockers are licensing, client confidentiality agreements and intellectual property rights. If you do, your data might not be in the right format, or it might be “noisy”, that is, corrupt in one way or another. All the more reason to start looking for data elsewhere.
But where should you look for it, and what is the right format to be after? A standard format used in both statistical and neural machine translation is the parallel text format. A parallel corpus is a structured data set consisting of source sentences and their corresponding target translations, aligned line by line.
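To make the line-by-line alignment concrete, here is a minimal Python sketch that pairs up source and target lines. The function name read_parallel and the in-memory sample sentences are illustrative assumptions; in practice the two inputs would typically be two open text files with one segment per line.

```python
import io
from itertools import zip_longest


def read_parallel(src_lines, tgt_lines):
    """Pair source and target lines by position (line N with line N).

    Both arguments can be any iterable of lines, e.g. open text files.
    Raises ValueError if one side has more lines than the other, since
    a length mismatch means the corpus is no longer aligned.
    """
    pairs = []
    for line_no, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"corpus sides differ in length at line {line_no}")
        pairs.append((src.rstrip("\n"), tgt.rstrip("\n")))
    return pairs


# Hypothetical English-Dutch sample standing in for two aligned files.
src_file = io.StringIO("Hello.\nHow are you?\n")
tgt_file = io.StringIO("Hallo.\nHoe gaat het?\n")

corpus = read_parallel(src_file, tgt_file)
for source, target in corpus:
    print(f"{source}\t{target}")
```

Failing loudly on a length mismatch is a deliberate choice here: silently truncating to the shorter side would hide exactly the kind of misalignment that makes data unusable for training.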
TAUS has been covering the recent developments in the MT world, from the MT trends for 2019 to the "Data or engines?" debate, and we thought it was time to address the data sourcing topic as well. What we’ve gathered here is a quick guide to getting good-quality parallel data, whether generic or domain-specific.
There are quite a few parallel data repositories out there that have gathered data either from public and governmental resources or by crawling the web. We list the top ones here:
CLARIN, the Common Language Resources and Technology Infrastructure, contains 84 parallel corpora, the majority of which are available for download from national repositories as well as through concordancers such as Korp, Corpuscle, and KonText. The corpora mostly cover European language pairs but also include non-European languages such as Hindi, Tamil, and Vietnamese.
OPUS, a collection of translated texts crawled from the web. It consists of free online data that has been aligned. The collection is based on open source products, such as OpenSubtitles, TED Talks and Wikipedia, and it involves automatic pre-processing.
ParaCrawl, another collection of crawled data, available in 24 European languages. The ParaCrawl data is aligned, cleaned and anonymized, and the crawling pipeline, Bitextor, is also open source. The plan is to expand this selection in the coming releases to some of the long-tail languages, such as Icelandic, Norwegian (Bokmål, Nynorsk), Basque, Catalan/Valencian and Galician.
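Crawled data in particular tends to need cleaning before it is used for training. The sketch below shows a few common sanity checks on a sentence pair; it is an illustration only, not the actual pipeline of ParaCrawl, Bitextor or OPUS, and the function name is_clean and the max_ratio threshold of 3.0 are assumptions chosen for the example.

```python
def is_clean(source, target, max_ratio=3.0):
    """Return True if a sentence pair passes a few basic noise checks.

    The checks are simple heuristics (an assumption for this sketch):
    - neither side may be empty,
    - source and target must differ (identical text is usually untranslated),
    - the word counts of the two sides must not differ by more than max_ratio.
    """
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return False
    if src == tgt:
        return False
    src_len, tgt_len = len(src.split()), len(tgt.split())
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio


# Hypothetical English-Dutch pairs, three of them deliberately noisy.
pairs = [
    ("The cat sleeps.", "De kat slaapt."),
    ("", "Lege bron."),
    ("Copyright 2019", "Copyright 2019"),
    ("Yes.", "Dit is een veel te lange vertaling voor een woord."),
]

cleaned = [pair for pair in pairs if is_clean(*pair)]
print(cleaned)
```

Word-count ratios are a crude signal and the right threshold depends on the language pair, so a value like 3.0 should be treated as a starting point to tune rather than a recommendation.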
Industry-shared Data Repository
If you’ve already explored the public resources or are not so sure about the crawled data, the next best choice is an industry-shared language data repository. TAUS Data Cloud is one such repository, founded by 40 TAUS members who contributed translation memories and other parallel corpora. Since its start in 2008, the repository has gained more contributors and grew every year until it reached 73 billion words in 2016. Today, it is being maintained and enhanced with automatic checks and regular clean-ups, and it contains more than 35 billion words in over a thousand language pairs. Organized by language pair, industry domain and content type, it is a go-to place for getting large quantities of data from multiple owners.
Generic data is enough to get you started, but you should not expect exceptional results on specialized content. If you’d like your engine to produce great in-domain translations, you’ll need to feed it domain-specific data. Most MT technology providers offer customization, and you can even try it yourself with Google’s Cloud AutoML.
Whichever option you choose, you will need domain-specific parallel corpora. That is why TAUS developed Matching Data, a technology for extracting domain-specific data based on a user-provided sample. You no longer have to rely on meta tags or categories when searching for data; just assemble a sample that we can use in the search. We can apply it directly to the TAUS Data Cloud, to ParaCrawl crawled data, or to your own Translation Memory, in a dedicated, private environment. The result is a parallel corpus with sentences matching your domain, scored by relevance and ready to be used for training your MT engine. Check out our Library for any ready-made data sets, or get in touch for more information.
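To give a feel for what scoring sentences against a sample can mean, here is a deliberately simple Python sketch. It is not how Matching Data works, which this article does not describe; it swaps in a plain vocabulary-overlap score as a stand-in, and the names tokenize and score_by_sample as well as the sample sentences are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def score_by_sample(sample_sentences, candidates):
    """Rank (source, target) pairs by overlap with a domain sample.

    The score is the share of a candidate's source tokens that also
    occur in the sample vocabulary, from 0.0 (none) to 1.0 (all).
    This is a toy relevance measure chosen for illustration.
    """
    sample_vocab = set()
    for sentence in sample_sentences:
        sample_vocab |= tokenize(sentence)

    scored = []
    for source, target in candidates:
        tokens = tokenize(source)
        score = len(tokens & sample_vocab) / len(tokens) if tokens else 0.0
        scored.append((score, source, target))

    # Highest relevance first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Hypothetical medical-domain sample and two candidate sentence pairs.
sample = [
    "The patient received a dose of the vaccine.",
    "Side effects of the drug include nausea.",
]
candidates = [
    ("Click the button to save your file.", "Klik op de knop om uw bestand op te slaan."),
    ("The drug reduced nausea in the patient.", "Het medicijn verminderde misselijkheid bij de patiënt."),
]

ranked = score_by_sample(sample, candidates)
for score, source, target in ranked:
    print(f"{score:.2f}\t{source}\t{target}")
```

A raw overlap score like this rewards common function words as much as domain terms, so a realistic approach would need to weight vocabulary by how distinctive it is for the domain.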