We share some of the main challenges that we’ve faced while setting up the very first marketplace for language data for AI - the TAUS Data Marketplace.

Online marketplaces are a popular business model in the digital era, and some of them rank among the biggest and most valuable tech companies today (think of eBay, Airbnb, Amazon, etc.). They connect sellers and buyers of certain types of goods and services and facilitate processes like search, transactions, ratings and more.

Some of the key traits of a successful online marketplace are ease of use, fair pricing and security, but there are many more design and business decisions to make when building one. Here we share five of the main challenges that we’ve faced while setting up the very first marketplace for language data for AI - the TAUS Data Marketplace.

Challenge #1: The Right Data Structure

For buyers to find the language data they need, the data uploaded by sellers has to be structured and stored for fast and effective retrieval.

Let's take text data as an example: sellers upload documents in specific language pairs, add metadata such as content type and domain, and specify the price per word. This metadata drives the buyers' search: they can explore individual documents from different sellers, search through all available data for a language pair and domain combination, or upload a sample to find similar segments. To run this advanced search, the Data Marketplace needs to store data as separate segments, with segment-level metadata. That way, it can also easily identify whether segments in an uploaded document already exist in the Data Marketplace, register them, and show only the originals (first uploaded) among the results.
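To make the idea of segment-level storage more concrete, here is a minimal Python sketch. It is an illustration only, not the actual Data Marketplace implementation: the `Segment` fields, the `SegmentStore` class and the hash-based duplicate check are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """One translation unit with segment-level metadata (hypothetical schema)."""
    source: str
    target: str
    lang_pair: str      # e.g. "en-de"
    domain: str
    content_type: str
    seller: str
    document: str


def segment_key(seg: Segment) -> str:
    """Hash of the normalized text plus language pair, used to spot duplicates."""
    normalized = "\t".join(
        " ".join(text.lower().split()) for text in (seg.source, seg.target)
    )
    return hashlib.sha256(f"{seg.lang_pair}\t{normalized}".encode("utf-8")).hexdigest()


@dataclass
class SegmentStore:
    originals: dict = field(default_factory=dict)   # key -> first uploaded Segment
    duplicates: dict = field(default_factory=dict)  # key -> list of later uploads

    def add(self, seg: Segment) -> bool:
        """Register a segment; return True if it is the first upload (the original)."""
        key = segment_key(seg)
        if key in self.originals:
            self.duplicates.setdefault(key, []).append(seg)
            return False
        self.originals[key] = seg
        return True

    def search(self, lang_pair: str, domain: str) -> list:
        """Return only the originals that match a language pair and domain."""
        return [
            seg for seg in self.originals.values()
            if seg.lang_pair == lang_pair and seg.domain == domain
        ]


store = SegmentStore()
first = Segment("Hello world", "Hallo Welt", "en-de", "general", "website", "seller_a", "doc1")
later = Segment("hello  world", "Hallo Welt", "en-de", "general", "website", "seller_b", "doc9")
store.add(first)   # registered as the original
store.add(later)   # registered as a duplicate, hidden from search results
```

Keying on normalized text rather than on the document keeps the duplicate check independent of who uploaded the segment or in which file it arrived.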

Challenge #2: Data Quality

One of the biggest concerns when acquiring data is its quality. And rightfully so, given that there are no quality standards for this kind of data, and the quality expectations differ greatly. In the Data Marketplace, we define high-quality data as “ready-to-use for engine training”, requiring minimal to no effort for fine-tuning.

How do we ensure that? By establishing a strong, automated validation process. Every dataset that is uploaded to the marketplace goes through a series of checks - from document format and language verification to identifying grammar errors and untranslated terms. If the segments flagged with these errors are classified as low-quality, they are blocked from being uploaded to the marketplace. Still, the best possible quality feedback comes from the actual users of the data. That’s why the Data Marketplace will also have a feedback loop for quality rating after purchase.
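A validation process like this can be thought of as a chain of small checks that each flag a segment pair. The sketch below is a simplified illustration with three hypothetical checks (empty text, untranslated text, suspicious length ratio); the check names, the threshold and the rule that any flag blocks a segment are assumptions for the example. It leaves out language verification and grammar checking, which would need dedicated models.

```python
from typing import Optional


def check_empty(source: str, target: str) -> Optional[str]:
    """Flag pairs where either side has no text."""
    if not source.strip() or not target.strip():
        return "empty"
    return None


def check_untranslated(source: str, target: str) -> Optional[str]:
    """Flag pairs where the target is just a copy of the source."""
    if source.strip().lower() == target.strip().lower():
        return "untranslated"
    return None


def check_length_ratio(source: str, target: str, max_ratio: float = 3.0) -> Optional[str]:
    """Flag pairs whose lengths differ by more than max_ratio (assumed threshold)."""
    a, b = len(source.strip()), len(target.strip())
    if min(a, b) == 0:
        return None  # already covered by check_empty
    if max(a, b) / min(a, b) > max_ratio:
        return "length_ratio"
    return None


CHECKS = [check_empty, check_untranslated, check_length_ratio]


def validate(pairs):
    """Split (source, target) pairs into accepted and blocked, keeping the flags."""
    accepted, blocked = [], []
    for source, target in pairs:
        flags = []
        for check in CHECKS:
            flag = check(source, target)
            if flag is not None:
                flags.append(flag)
        if flags:
            blocked.append((source, target, flags))
        else:
            accepted.append((source, target))
    return accepted, blocked
```

Keeping the flags alongside each blocked pair makes it possible to tell the seller why a segment was rejected.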

Challenge #3: Plugging in NLP

As the main currency of the Data Marketplace is language data, enhancing the data and platform capabilities is best done using Natural Language Processing (NLP). However, NLP features require complex language models and continuous fine-tuning.

The marketplace therefore focuses on cleaning, clustering and anonymization. How precise the algorithms are in filtering out bad segments, selecting domain-specific data and identifying PII is something that we need to keep a close eye on, test regularly, and adjust according to changing needs for each of the languages available. Low-resource languages pose an additional challenge, as there are rarely any language models available for them.
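As a rough illustration of the anonymization step, the sketch below masks two easy PII patterns with regular expressions. This is a deliberately naive stand-in and not the approach the marketplace uses: production PII detection generally relies on language-specific models, and the patterns and placeholder tokens here are assumptions for the example.

```python
import re

# Simplified patterns; real e-mail addresses and phone numbers vary far more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


masked = anonymize("Mail jane.doe@example.com or call +31 20 123 4567.")
```

Rule-based patterns like these miss names and addresses entirely, which is one reason precision and recall need regular testing per language.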

Challenge #4: Price Definition

Probably the most sensitive topic and the hardest nut to crack is the pricing. While there are benchmarks for translations in terms of price per word, there is no reference point for data for AI. One would expect that the right price will work itself out within the marketplace, but you need to have something to start with - and that something needs to give users a sense of transparency and fairness.

We’ve addressed this by creating a Price Index Table that shows sellers and buyers where a dataset's price point lies compared to other similar datasets in the Data Marketplace. The price per word shown is based on the overall marketplace availability of data for the specific language pair, domain and content type combination. As new datasets get published to the data pool, the price points are dynamically adjusted.
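One simple way to picture such an index is to group listings by language pair, domain and content type and report where a given price sits within that group. The sketch below does exactly that; it is a hypothetical simplification, and the `price_index` and `price_position` functions, the tuple layout and the sample prices are invented for the example rather than taken from the actual Price Index Table.

```python
from statistics import median


def price_index(listings, lang_pair, domain, content_type):
    """Summarize per-word prices of comparable listings.

    Each listing is a (lang_pair, domain, content_type, price_per_word) tuple.
    Returns None when no comparable listing exists.
    """
    prices = sorted(
        price for (lp, dom, ct, price) in listings
        if (lp, dom, ct) == (lang_pair, domain, content_type)
    )
    if not prices:
        return None
    return {
        "count": len(prices),
        "low": prices[0],
        "median": median(prices),
        "high": prices[-1],
    }


def price_position(price, index):
    """Return where a price lies between low (0.0) and high (1.0), clamped."""
    spread = index["high"] - index["low"]
    if spread == 0:
        return 0.5  # all comparable listings share one price
    return min(1.0, max(0.0, (price - index["low"]) / spread))


# Made-up sample data for illustration only.
listings = [
    ("en-de", "legal", "contracts", 0.02),
    ("en-de", "legal", "contracts", 0.04),
    ("en-de", "legal", "contracts", 0.06),
    ("en-fr", "legal", "contracts", 0.10),
]
index = price_index(listings, "en-de", "legal", "contracts")
position = price_position(0.04, index)
```

Because the index is recomputed from the current listings, it shifts automatically as new datasets are published, which mirrors the dynamic adjustment described above.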

Challenge #5: Data Provenance

Knowing where the data originates from and if and how it changes over time is one of the main prerequisites for trusting the platform’s data and being able to fairly distribute monetary rewards to the sellers. But this is not an easy thing to achieve. 

Because the Data Marketplace allows partial data buying and can create new datasets on the fly through data clustering (Matching Data technology), we need to keep a detailed record of every transaction between seller and buyer, and be able to tell how many times a single segment got downloaded as part of a bigger dataset. If the seller changes their data at some point by adding or updating segments, a new version of the dataset needs to be created.
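The record-keeping this requires can be sketched as a small ledger that stores each purchase at segment level and derives per-segment download counts and per-seller totals from it. The `Ledger` class, its fields and the sample identifiers below are assumptions for illustration; dataset versioning and actual payouts are left out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Ledger:
    owner: dict   # segment id -> seller
    words: dict   # segment id -> word count
    transactions: list = field(default_factory=list)

    def record(self, buyer, segment_ids):
        """Store one purchase as the exact set of segments it contained."""
        self.transactions.append((buyer, tuple(segment_ids)))

    def download_counts(self):
        """Count how often each segment was bought, across all datasets."""
        return Counter(sid for _, ids in self.transactions for sid in ids)

    def words_sold_by_seller(self):
        """Total words sold per seller, the basis for splitting revenue."""
        totals = Counter()
        for sid, downloads in self.download_counts().items():
            totals[self.owner[sid]] += downloads * self.words[sid]
        return totals


ledger = Ledger(
    owner={"s1": "alice", "s2": "alice", "s3": "bob"},
    words={"s1": 5, "s2": 7, "s3": 4},
)
ledger.record("buyer1", ["s1", "s3"])        # a clustered dataset mixing two sellers
ledger.record("buyer2", ["s1", "s2", "s3"])
```

Recording purchases per segment rather than per dataset is what makes it possible to reward each seller fairly when a dataset is assembled on the fly from several sources.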

 


If you’d like to find out more about the Data Marketplace, check out the How it Works page. For updates, follow Data Marketplace on Twitter, LinkedIn and Facebook.