Embeddings have radically transformed the field of natural language processing (NLP) in recent years by making it possible to encode pieces of text as fixed-size vectors. One of the most recent breakthroughs born out of this way of representing textual data is a collection of methods for creating sentence embeddings, also known as sentence vectors. These embeddings make it possible to represent longer pieces of text numerically as vectors that computer algorithms, such as machine learning (ML) models, can handle directly. In this article, we discuss the key ideas behind this technique, list some of its possible applications, and provide an overview of state-of-the-art sentence embedding approaches commonly used in NLP research and the language industry.
A principal question in NLP is how to represent textual data in a format that computers can understand and work with easily. The solution is to convert language to numerical data - traditional methods like TF-IDF and one-hot encoding have been applied in the field for several decades. However, these methods have a major limitation, namely that they fail to capture the fine-grained semantic information present in human languages. For example, the popular bag-of-words approach only takes into account whether or not vocabulary items are present in a sentence or document, ignoring wider context and the semantic relatedness between words.
For many NLP tasks, however, it is crucial to have access to semantic knowledge that goes beyond simple count-based representations. This is where embeddings come to the rescue. Embeddings are fixed-length, multi-dimensional vectors that make it possible to extract and manipulate the meaning of the segments they represent, for example by comparing how similar two sentences are to each other semantically.
Background: Word Embeddings
To understand how sentence embeddings work, one must first become familiar with the concept that inspired them: word embeddings. In contrast to sparse binary vectors (e.g. one-hot encodings), which are produced simply by mapping each token to an index in the vocabulary, word embeddings are dense vectors learned by ML algorithms in an unsupervised manner.
The idea behind learning word embeddings is grounded in the theory of distributional semantics, according to which similar words appear in similar contexts. Thus, by looking at the contexts in which a word appears frequently in a large body of text, it is possible to find similar words that occur in nearly the same contexts. This allows neural network architectures to extract the semantics of each word in a given corpus. Word vectors, the end product of this process, encode semantic relations, such as the fact that the relationship between Paris and France is the same as the one between Berlin and Germany, and much more.
Common approaches for computing word vectors include:
- Word2Vec: the first approach to use neural networks efficiently to learn embeddings from large data sets
- fastText: an efficient character-based model that can process out-of-vocabulary words
- ELMo: deep, contextualized word representations that can handle polysemy (words with multiple meanings in different contexts)
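The Paris-to-France analogy above can be illustrated with a toy example. The vectors below are tiny, hand-crafted stand-ins purely for illustration (real word embeddings are learned from large corpora and typically have hundreds of dimensions), but they show the vector arithmetic involved:

```python
import numpy as np

# Tiny hand-crafted vectors purely for illustration; a trained model
# (e.g. Word2Vec) would supply real, high-dimensional embeddings.
vectors = {
    "paris":   np.array([0.9, 0.1, 0.8]),
    "france":  np.array([0.9, 0.1, 0.1]),
    "berlin":  np.array([0.1, 0.9, 0.8]),
    "germany": np.array([0.1, 0.9, 0.1]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def most_similar(query):
    """Return the vocabulary word whose vector is closest to `query`."""
    return max(vectors, key=lambda w: cosine(vectors[w], query))

# Paris - France + Germany lands closest to Berlin
analogy = vectors["paris"] - vectors["france"] + vectors["germany"]
print(most_similar(analogy))  # berlin
```

With learned embeddings the result is of course approximate rather than exact, but the same nearest-neighbor arithmetic applies.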
The Rise of Sentence Embeddings
Sentence embeddings can be understood as the extension of the key ideas behind word embeddings. Being representations of longer chunks of text as numerical vectors, they open up the realm of possibilities in NLP research. The same core characteristics apply to them as to word embeddings - for example, they capture a range of semantic relationships between sentences, such as similarity, contradiction, and entailment.
A simple and straightforward baseline method for creating sentence vectors is to use a word embedding model to encode all the words of a given sentence and take the average of all the resulting vectors. While this provides a strong baseline, it falls short of capturing information related to word order and other aspects of overall sentence semantics.
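The averaging baseline can be sketched in a few lines. The word vectors below are made-up placeholders; in practice a trained model such as fastText would supply them:

```python
import numpy as np

# Made-up 4-dimensional word vectors standing in for a trained model
word_vectors = {
    "embeddings": np.array([0.2, 0.7, 0.1, 0.4]),
    "are":        np.array([0.0, 0.1, 0.0, 0.1]),
    "useful":     np.array([0.5, 0.3, 0.9, 0.2]),
}

def sentence_vector(sentence):
    """Encode a sentence as the mean of its word vectors."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

# Word order is lost: both sentences map to exactly the same vector
v1 = sentence_vector("embeddings are useful")
v2 = sentence_vector("useful are embeddings")
print(np.allclose(v1, v2))  # True
```

The last two lines demonstrate the limitation mentioned above: since averaging is order-invariant, any permutation of the same words yields an identical sentence vector.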
Sentence embeddings can also be learned directly by ML algorithms, in both supervised and unsupervised ways. Often, these algorithms are trained on a number of different objectives at once, in a process known as multi-task learning. By solving some NLP task on a labeled dataset (supervised learning), these models produce universal sentence embeddings that can be further optimized for a variety of downstream applications. Such methods provide semantically richer representations and have been shown to be highly effective in applications where semantic knowledge is required. Furthermore, by using multilingual training data, it is possible to create language-agnostic sentence embeddings that can handle text in different languages.
Sentence embeddings can be applied in nearly all NLP tasks and can dramatically improve performance when compared to count-based vectorization methods. For instance, they can be used to compute the degree of semantic relatedness between two sentences, expressed as the cosine similarity between their vectors:
[Sentence 1] Sentence embeddings are a great way to represent textual data numerically.
[Sentence 2] Sentence vectors are very useful for encoding language data as numbers.
[Cosine similarity] between the above: 0.775
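Cosine similarity is simply the normalized dot product of the two vectors. A minimal implementation, using invented 3-dimensional stand-ins for the real sentence vectors a model such as SentenceBERT would produce:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented stand-ins for real sentence vectors
v1 = np.array([0.3, 0.8, 0.5])
v2 = np.array([0.4, 0.7, 0.6])
print(round(cosine_similarity(v1, v2), 3))  # 0.985
```

The score is independent of vector length, which makes it a convenient measure of semantic closeness regardless of how the embeddings are scaled.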
Based on this simple calculation, sentence embeddings can be adapted for tasks such as semantic search, text clustering, intent detection, and paraphrase detection, as well as for the development of virtual assistants and smart-reply algorithms. Moreover, cross-lingual sentence embedding models can be used for parallel text mining and translation pair detection. For example, TAUS Data Marketplace uses a data cleaning algorithm that leverages sentence vectors to compute the semantic similarity between parallel segments in different languages and thereby estimate translation quality.
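A bare-bones semantic search, for instance, amounts to ranking a corpus of sentence vectors by their cosine similarity to a query vector. All sentences and vectors below are invented for illustration; an embedding model would produce the real vectors:

```python
import numpy as np

# Invented sentence vectors standing in for the output of an embedding model
corpus = {
    "How do I reset my password?":      np.array([0.9, 0.1, 0.2]),
    "Our office opens at 9 am.":        np.array([0.1, 0.9, 0.1]),
    "Steps to recover a lost account.": np.array([0.8, 0.2, 0.3]),
}

def search(query_vec):
    """Rank corpus sentences by cosine similarity to the query vector."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return sorted(corpus, key=lambda s: cosine(corpus[s], query_vec), reverse=True)

# Invented embedding of a user query about account recovery
query_vec = np.array([0.75, 0.25, 0.3])
print(search(query_vec)[0])  # Steps to recover a lost account.
```

Note that, unlike keyword search, nothing here depends on word overlap between the query and the results: the match is made entirely in vector space.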
State-of-the-art Sentence Embedding Methods
There exist a variety of sentence embedding techniques for obtaining vector representations of sentences that can be used in downstream NLP applications. The current state of the art includes:
- SkipThought: an adaptation of the ideas behind Word2Vec to the sentence level, producing embeddings by learning to predict the sentences surrounding an encoded sentence
- SentenceBERT: based on the popular BERT model, this framework combines the power of transformer architectures with siamese (twin) neural networks to create high-quality sentence representations
- InferSent: produces sentence embeddings by training neural networks to identify semantic relationships between sentences in a supervised manner
- Universal Sentence Encoder (USE): a pair of models that leverage multi-task learning to encode sentences into highly generic sentence vectors, easily adaptable to a wide range of NLP tasks
Sentence embeddings were originally conceived in a monolingual context, or in other words, they were only capable of encoding sentences in a single language. Recently, however, multilingual models have been published that can create shared, cross-lingual vector spaces in which semantically equivalent or similar sentences from different languages appear close to each other.
- The LASER model computes universal, language-agnostic sentence vectors, built on a byte-pair encoding vocabulary shared across languages, that can be used in a variety of NLP tasks
- LaBSE generates language-agnostic BERT sentence embeddings that are capable of generalizing to languages not seen during training by combining the powers of masked and cross-lingual language modeling
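The idea of a shared vector space can be sketched as follows: given a source sentence, translation pair detection reduces to finding its nearest neighbor among candidate sentences in the other language. All sentences and vectors below are invented; a model such as LASER or LaBSE would supply real cross-lingual embeddings:

```python
import numpy as np

# Invented vectors in a shared cross-lingual space
english_vec = np.array([0.7, 0.2, 0.6])   # "The weather is nice today."
german_candidates = {
    "Das Wetter ist heute schön.": np.array([0.68, 0.22, 0.58]),
    "Ich trinke gern Kaffee.":     np.array([0.1, 0.9, 0.2]),
}

def best_translation(src_vec, candidates):
    """Pick the candidate whose vector is closest in cosine similarity."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(candidates, key=lambda s: cosine(candidates[s], src_vec))

print(best_translation(english_vec, german_candidates))  # Das Wetter ist heute schön.
```

Because semantically equivalent sentences from different languages sit close together in the shared space, the correct translation surfaces as the nearest neighbor without any explicit alignment step.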
We have seen that sentence embeddings are an effective and versatile method of converting raw textual data into numerical vector representations for use in a wide range of natural language processing applications. Not only are they useful for encoding sentences in a single language, but they can also be applied to solve cross-lingual tasks, such as translation pair detection and quality estimation. The state-of-the-art approaches discussed in this article are easily accessible and can be plugged into existing models to improve their performance, which makes them an essential part of the NLP professional’s toolkit and an exciting addition to the offering of language data service providers.