The total volume of data created worldwide is expected to reach 149 zettabytes by 2024. Capitalizing on data has therefore become as important as capitalizing on human, financial, or any other capital. Data as capital has gained even more importance now that data-trained systems are starting to dominate every imaginable aspect of the world we live in.
Against the backdrop of this surge in demand, marketplaces are emerging that specialize in all forms of data, from language data to consumer or geospatial data. Businesses that are smaller in size and reach, or that operate in a niche category, have to work harder to gain visibility and put their offering on the map. That is where marketplaces such as TAUS Data Marketplace come into play.
We talked with Larry Cady, Solutions Consultant and partner with Chilin. Chilin (HK) Ltd is a privately held technology company based in Hong Kong. Founded in 2005, Chilin is a spin-off of the City University Enterprises Ltd. of Hong Kong. The company draws on more than 20 years of research initiated at the university's Language Information Sciences Research Centre. It is a successful provider of rigorously curated Chinese-language and bilingual data for many kinds of research organizations and enterprises. “The challenge for language data providers is reaching potential buyers or other players in the industry to make them aware of the useful data that the provider can offer. Chilin has joined the TAUS Data Marketplace in order to solve this problem,” says Larry.
Chilin has made two datasets available on the TAUS Data Marketplace so far. The first dataset contains 12,947 segments, 475,509 en-US words, and 401,629 zh-CN words. It is based on the CPC Patent Classification category A61K in the pharmaceuticals domain. The second dataset contains 10,377 segments, 379,898 en-US words, and 327,637 zh-CN words. It is based on the CPC Patent Classification category C12N, which contains many biotechnology filings. “By adding these Chilin parallel sentence datasets to the base Google model, the BLEU score was increased by almost 5 points – within the Pharmaceutical-Biotechnology domain,” explains Larry.
According to Forrester, firms that exploit next-generation data marketplaces will gain a digital edge. TAUS Data Marketplace is here to address the supply and demand challenges for language data, making its marketing outreach available to all language data sellers, regardless of their size or the language pair, domain, or content type they specialize in.
High-quality language data for Chinese is scarce, so the Chinese output generated by MT engines is still far from perfect. Chilin’s English-Chinese datasets have been shown to improve MT output quality in previously performed training tests. For those looking to improve their MT engine performance in this language pair, these datasets from a specialist data provider can be game-changing.
By making these high-performing datasets available through the TAUS Data Marketplace, Chilin hopes to reach a wider audience of potential buyers and to sell more data with little extra effort. The company also plans to publish additional Chinese-English datasets in the near future. These will be extracted from its collection of over 30 million sentence pairs and classified into various other domains.