Review: Deep Learning for Sentence Semantic Similarity

Ahmed Sabir
31 min read · Apr 20, 2022

A very brief introduction to previous and current trends in sentence semantic similarity with deep learning. Please refer to this GitHub for the related official repository for each paper.

Table of Contents
Convolutional Neural Network
Long Short-Term Memory
Attention Mechanism
BERT unsupervised

General word embedding models (e.g., word2vec (Mikolov et al. 2013b), GloVe (Pennington et al. 2014), and fastText (Bojanowski et al. 2017)) enable learning word-level semantic similarity. However, language is usually used as text fragments (phrases, sentences) whose similarity also needs to be measured. This blog post describes learning semantic similarity at the sentence level via multi-layer neural networks. The general structure of a multi-layer neural network provides a powerful framework for building many NLP applications, such as Question Answering (Severyn and Moschitti 2015), Machine Translation (Bahdanau et al. 2014), and Image Captioning (Vinyals et al. 2015).

Convolutional Neural Network

A convolutional neural network (CNN) is a multilayer hierarchical neural network. Three principal factors distinguish CNNs from simple feed-forward neural networks: 1) local receptive fields, 2) weight sharing, and 3) pooling or sub-sampling.

The deep structure of CNNs allows them to refine feature representations and abstract semantic meaning gradually. CNNs have been successful in many problems, such as text recognition (Wang et al. 2012), object recognition (He et al. 2016), and text classification (Conneau et al. 2016). An early CNN-based method (LeCun et al. 1998) applies a CNN as a sliding-window classifier to generate a text saliency map. The authors use the output score of one CNN for character detection and another CNN for recognizing the detected characters. They use a multi-scale sliding window approach and consider windows in different rows independently.

1-D Convolution. A one-dimensional CNN for NLP tasks applies a filter window (kernel) over the words in a sentence to extract n-gram features at different positions. Let x_i ∈ R^d be the word vector for the i-th word in the sentence, and let x ∈ R^(s×d) be the input sentence matrix, where s is the length of the sentence. Let k be the length of each kernel, and let the vector c ∈ R^(kd) be the kernel for a convolution operation. For each word position j in the sentence, there is a window vector w_j made of k consecutive word vectors: w_j = [x_j, x_(j+1), …, x_(j+k−1)]. The one-dimensional feature m_j for each window vector w_j is computed by applying a non-linear function f to the dot product of the window w_j and the kernel c, plus a bias b:

m_j = f(w_j · c + b)

The non-linear activation function f can take any form, but is most often a hyperbolic tangent or a Rectified Linear Unit (ReLU) (Nair and Hinton 2010). Figure 1 below shows two feature maps m with 3-gram and 4-gram kernels.

Figure 1. Illustration of 1-D convolutional layer mapping the sentence to their feature representations (feature map).
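
The feature-map computation described above (a non-linearity applied to the dot product of each window with the kernel, plus a bias) can be sketched in plain Python. This is a minimal illustration, not the implementation of any cited paper: the function name `conv1d_feature_map`, the toy sentence, and the kernel values are all made up.

```python
import math

def conv1d_feature_map(sentence, kernel, bias=0.0, activation=math.tanh):
    """Slide a kernel covering k word vectors over a sentence.

    sentence: list of s word vectors, each a list of d floats.
    kernel:   flat list of k*d floats (the vector c in the text).
    Returns the feature map [m_1, ..., m_(s-k+1)].
    """
    d = len(sentence[0])
    k = len(kernel) // d
    feature_map = []
    for j in range(len(sentence) - k + 1):
        # Window w_j: concatenation of x_j, ..., x_(j+k-1).
        window = [value for word in sentence[j:j + k] for value in word]
        pre_activation = sum(w * c for w, c in zip(window, kernel)) + bias
        feature_map.append(activation(pre_activation))
    return feature_map

def relu(x):
    return max(0.0, x)

# Toy sentence: 4 words with 2-dimensional embeddings (made-up values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
bigram_kernel = [1.0, -1.0, 0.5, 0.5]  # k = 2, d = 2
feature_map = conv1d_feature_map(sentence, bigram_kernel, activation=relu)
```

A sentence of s words and a kernel of length k yield s − k + 1 features, so the 2-gram kernel above produces three values for the four-word toy sentence.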

Channels. In vision, images can have several channels (e.g., RGB channels). Each image is represented as a combination of pixels with Red, Green and Blue colors intensity at a particular point. In computer vision, applying a 2-D convolution to an image with different sets of filters and then combining them into a single vector means combining a different view of the image. Each matrix or view is referred to as a Channel. However, in the case of text, multiple channels may translate into a different representation of the same input text, such as different word vectors for a word. Multi-channel embedding can be either static or non-static (i.e., trainable embedding).

Learning Semantic Similarity with CNN

In this section, we describe different works that learn semantic similarity via convolution-based architectures. A convolutional neural network is designed to identify local features in a large structure and combine them into a fixed-size vector representation of that structure, capturing the aspects that are most informative for the prediction task. For text, the 1-D convolution described above captures n-grams from a sequence. Kim (2014) applies a simple convolution with a pooling layer to sentiment classification. Kalchbrenner et al. (2014) present the Dynamic Convolutional Neural Network (DCNN) for movie review sentiment prediction. Dynamic convolution is able to capture both short- and long-range relations in the sentence; it induces a feature graph over the sentence that relates words at varying distances. Conneau et al. (2016) propose a very deep CNN (VDCNN) with 29 convolutional layers, inspired by the VGG architecture from computer vision. The VDCNN works at the character level and uses small convolutions followed by pooling operations.

For the semantic similarity task, He et al. (2015) use several types of convolution and pooling to extract features from the token stream. The model consists of two components: a sentence model and a similarity measurement layer that compares the sentence representations using multiple similarity metrics. Yin et al. (2016) introduce an attention-based CNN for modeling sentence pairs in tasks such as answer selection, paraphrase identification, and textual entailment. They propose three models: 1) attention that affects the convolution, 2) attention that affects the pooling layer, and 3) a combination of both.

Next, we discuss in more detail two different approaches to learning semantic similarity with convolution-based architectures. The first is a model proposed by Severyn and Moschitti (2015) for information retrieval tasks, in particular for re-ranking query-candidate answer pairs using a sentence similarity model. The second is an approach that learns not only similarity but also dissimilarity (Wang et al. 2016b); it takes advantage of the strong association between similarity and dissimilarity to learn better similarity relations.

Learning to Rank Short Text Pairs — ConvNets

ConvNets is a CNN-based model for learning the similarity between text pairs (Severyn and Moschitti 2015). It maps a pair of input sentences to vector representations and computes their similarity score. The two sentences are represented as vectors x_q (the query) and x_d (the document). The similarity function (Bordes et al. 2014) between them is computed as follows:

sim(x_q, x_d) = x_q^T M x_d

where M ∈ R^(d×d) is a similarity matrix. The similarity function can be seen as transforming the candidate document, x′_d = M x_d, so that it is closest to the query x_q. The similarity matrix M is learned during training. On top of the convolutional layers there are two additional layers: 1) a hidden layer and 2) a softmax layer. The hidden layer is computed as:

σ(w_h · x + b)

where σ is the non-linearity, here a Rectified Linear Unit (ReLU) (Nair and Hinton 2010) defined as max(0, x), w_h is the weight vector, and b is the bias. The ReLU ensures that all feature-map values are non-negative. The output of the convolutional and pooling layers is a dense vector x that is connected to the softmax layer. The softmax layer computes a probability distribution over the labels:

p(y = j | x) = exp(x^T θ_j) / Σ_k exp(x^T θ_k)

where x is the final abstract representation of the input obtained from the preceding layers (i.e., convolution and pooling), and θ_k is the weight vector of the k-th class. In summary, the outputs of the sentence model for the query, x_q, and the document, x_d, are distributional representations. The model then learns the similarity matrix M according to the similarity function above, which produces a similarity score x_sim that captures different aspects of the similarity between the input query x_q and document x_d.

Figure 2. Learning to re-rank short text pairs with convolutional networks (Severyn and Moschitti 2015). The network learns to optimally represent text pairs and a similarity function in a supervised manner. Figure reproduced from Severyn and Moschitti (2015).

The final joint layer concatenates all intermediate vectors, x_q and x_d, together with the similarity score x_sim, into a single vector:

x_joint = [x_q^T; x_sim; x_d^T]

The vector is then fed into a fully connected layer that models interactions between its components. Finally, the output is computed with a softmax layer (as in the softmax equation above).
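
The pipeline described in this section (bilinear similarity, joint vector, softmax over labels) can be sketched as follows. This is a simplified illustration under stated assumptions: the hidden layer is omitted for brevity, and the vectors, the matrix `M`, and the class weights `theta` are hypothetical values rather than learned parameters.

```python
import math

def bilinear_similarity(x_q, M, x_d):
    """Compute x_q^T M x_d for list-based vectors and a list-of-rows matrix."""
    transformed_doc = [sum(m * v for m, v in zip(row, x_d)) for row in M]  # M x_d
    return sum(q * t for q, t in zip(x_q, transformed_doc))

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical sentence-model outputs and similarity matrix.
x_q = [1.0, 0.0]
x_d = [0.0, 2.0]
M = [[0.0, 1.0], [1.0, 0.0]]

x_sim = bilinear_similarity(x_q, M, x_d)

# Joint vector: query, similarity score, and document side by side.
joint = x_q + [x_sim] + x_d

# Hypothetical class weight vectors (one per label: relevant / not relevant).
theta = [[0.1] * len(joint), [-0.1] * len(joint)]
scores = [sum(t * x for t, x in zip(theta_k, joint)) for theta_k in theta]
probs = softmax(scores)
```

Placing x_sim next to x_q and x_d lets the final layers use both the learned similarity score and the raw sentence representations.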

Sentence Similarity Learning by Lexical De/Composition

Most sentence similarity approaches focus on similarity and ignore the dissimilarity between the two input words or sentences. Wang et al. (2016b) present a CNN-based model that takes into account both similar and dissimilar parts by decomposing and composing lexical semantics over the sentences. In particular, the model first computes semantic matching between the two input sentences word by word (i.e., matching the words of sentence A against B and of sentence B against A). Then, each word vector is decomposed into a similar and a dissimilar component based on its matching vector. Next, a CNN-based model is trained to capture features from the similar and dissimilar components. Finally, the similarity is estimated over the composed vectors.

The model takes a pair of sentences S and T and computes the semantic similarity score sim(S, T). The model uses pre-trained word embeddings to represent each word with a distributed vector, since in word embedding space words that appear in similar contexts tend to have similar meanings. The sentence matrices are built as S = [s_1, …, s_i, …, s_m] and T = [t_1, …, t_j, …, t_n], where s_i and t_j are the d-dimensional vectors of the corresponding words, and m and n are the lengths of S and T, respectively.

| ID | Sentence (♠ similarity, ♦ dissimilarity)                  | Gloss                |
|----+-----------------------------------------------------------+----------------------|
| E1 | The research is [irrelevant]♠ to sockeye.                 | sockeye: red salmon  |
| E2 | The study is [not related]♠ to salmon.                    |                      |
| E3 | The research is relevant to salmon♠.                      |                      |
| E4 | The study is relevant to sockeye♠, instead of coho♠.      | coho: silver salmon  |
| E5 | The study is relevant to sockeye♠, rather than flounder♦. | flounder: flatfish   |
Table 1. Examples from sentence similarity learning by lexical decomposition and composition (Wang et al. 2016b). The ♠ and ♦ reflect the similarity and dissimilarity, respectively.

To learn the similarity between the two sentences, they check, word by word, how well each word is covered by the other sentence. As shown in Table 1, the word irrelevant in E1 matches the phrase not related in E2. Concretely, they treat each word as a primitive semantic unit and compute a semantic matching vector ŝ_i for each word s_i by composing part or all of the word vectors in the other sentence T. In this way, s_i can be matched to a single word or to a phrase in T, and vice versa for each word t_j against S. The semantic matching function is computed as follows:

ŝ_i = f_match(s_i, T),  t̂_j = f_match(t_j, S)

where ŝ_i and t̂_j are the semantic matching vectors, and f_match is based on cosine similarity. After the semantic matching phase, ŝ_i (or t̂_j) is considered to be the semantic coverage of the word s_i (or t_j) by the other sentence. For instance, as in Table 1, the word salmon in E2 matches sockeye in E1, which is a red salmon. The model decomposes each word s_i (or t_j), based on its matching vector ŝ_i (or t̂_j), into two components: a similar part s_i^+ (or t_j^+) and a dissimilar part s_i^− (or t_j^−), e.g., red salmon. The decomposition function is defined as:

[s_i^+; s_i^−] = f_decomp(s_i, ŝ_i),  [t_j^+; t_j^−] = f_decomp(t_j, t̂_j)

This yields two component matrices for each sentence: 1) the similar matrix S^+ = [s_1^+, …, s_m^+] or T^+ = [t_1^+, …, t_n^+], and 2) the dissimilar matrix S^− = [s_1^−, …, s_m^−] or T^− = [t_1^−, …, t_n^−]. The goal is to use both, since similar and dissimilar parts are strongly related. For instance, as shown in Table 1, it is very difficult to tell which of E4 and E5 is more similar to E3 from the similar parts alone. However, after considering both the similar and dissimilar parts, the model can identify that E3 and E5 are similar. The model composes the similar and dissimilar component matrices into a feature vector as follows:

S⃗ = f_comp(S^+, S^−),  T⃗ = f_comp(T^+, T^−)

Finally, the two feature vectors S⃗ and T⃗ are concatenated and used to predict the final semantic similarity score:

sim(S, T) = f_sim(S⃗, T⃗)
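
The matching and decomposition steps described above can be illustrated with a short sketch. It makes two assumptions that the text does not fix: matching picks the single most similar word in the other sentence (a "max" choice), and decomposition is an orthogonal projection onto the matching vector. The helper names and toy vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def semantic_match(word, other_sentence):
    """Return the vector in other_sentence most similar to word (assumed 'max' matching)."""
    return max(other_sentence, key=lambda candidate: cosine(word, candidate))

def orthogonal_decompose(word, match):
    """Split word into a part parallel to match (similar) and the remainder (dissimilar)."""
    scale = dot(word, match) / dot(match, match)
    similar = [scale * m for m in match]
    dissimilar = [w - s for w, s in zip(word, similar)]
    return similar, dissimilar

# Toy 2-dimensional word vectors for sentences S and T (made-up values).
S = [[3.0, 1.0], [0.0, 2.0]]
T = [[2.0, 0.0], [0.0, 1.0]]

s_hat = semantic_match(S[0], T)                   # matching vector for s_1
s_plus, s_minus = orthogonal_decompose(S[0], s_hat)
```

With this choice the two components always sum back to the original word vector, so no information is lost before the CNN composes them.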

Long Short-Term Memory

Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) is able to capture historical information from sequential input. The architecture can handle sequential information because, at each time step, the current input x_t is processed together with the previous output, the hidden state h_(t−1).

The advantage of LSTMs over standard Recurrent Neural Networks (RNNs) lies in three gates that control the output of each time step as a function of the previous hidden state h_(t−1) and the current input x_t: 1) the forget gate f_t, 2) the input gate i_t, and 3) the output gate o_t. These gates control (update or reject) the memory cell c_t and the hidden state h_t. The LSTM transition function is defined as:

i_t = σ(W_i x_t + U_i h_(t−1) + b_i)
f_t = σ(W_f x_t + U_f h_(t−1) + b_f)
q_t = tanh(W_q x_t + U_q h_(t−1) + b_q)
o_t = σ(W_o x_t + U_o h_(t−1) + b_o)
c_t = f_t ⊙ c_(t−1) + i_t ⊙ q_t
h_t = o_t ⊙ tanh(c_t)

where W, U, and b are the learned weight matrices and biases of each gate, σ is the logistic sigmoid function with output in [0, 1], tanh is the hyperbolic tangent function with output in [−1, 1], and ⊙ denotes element-wise multiplication. f_t controls how much information from the old memory cell c_(t−1) is rejected, i_t and q_t control how much new information is stored in the current memory cell c_t, and o_t controls how much of the memory cell c_t is exposed as output.
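
A single LSTM transition with the gates described above can be sketched in plain Python. This is an illustrative sketch only: the parameter layout (a dict mapping each gate name to a hypothetical `(W, U, b)` triple) and the all-zero toy weights are assumptions, chosen so the result is easy to check by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bias
        for W_row, U_row, bias in zip(W, U, b)
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM transition. params maps each of 'i', 'f', 'q', 'o' to (W, U, b)."""
    pre = {name: affine(W, x_t, U, h_prev, b) for name, (W, U, b) in params.items()}
    i_t = [sigmoid(v) for v in pre["i"]]     # input gate
    f_t = [sigmoid(v) for v in pre["f"]]     # forget gate
    q_t = [math.tanh(v) for v in pre["q"]]   # candidate memory
    o_t = [sigmoid(v) for v in pre["o"]]     # output gate
    # Keep part of the old cell, add part of the candidate.
    c_t = [f * c + i * q for f, c, i, q in zip(f_t, c_prev, i_t, q_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy setup: 2-dimensional input, 1-dimensional hidden state, all-zero weights.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "q": zero, "o": zero}
h_t, c_t = lstm_step([1.0, -1.0], [0.0], [2.0], params)
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the new cell is simply half of the old cell, which makes the gating effect easy to see.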

Learning Semantic Similarity with LSTM

LSTMs (Hochreiter and Schmidhuber 1997) have been used successfully in many NLP tasks, such as text classification (Liu et al. 2016), language modeling (Peters et al. 2018), and machine translation (Bahdanau et al. 2014).

For learning semantic similarity, Wang et al. (2017) propose an LSTM-based model called the bilateral multi-perspective matching model (BiMPM). The model encodes two sentences (P and Q) with a BiLSTM and matches them in two directions, P against Q and Q against P. Then another BiLSTM is used to aggregate the matching results into a matching vector. Finally, the matching vector is passed to a dense layer for the final decision. Another approach introduces two LSTM-based subnetworks to learn semantic similarity (Chen et al. 2019). They propose a generative model that relies on two latent variables: 1) one representing the syntax and 2) one representing the semantics. The model is trained with multiple losses that exploit the alignment of both sentences and word order information.

Next, we discuss in more detail two different approaches that use LSTMs to learn semantic similarity.

Manhattan LSTM

The LSTM architecture is naturally suited to variable-length sequences such as sentences. Mueller and Thyagarajan (2016) propose a Siamese network based on LSTMs to estimate the degree of similarity between two sentences. The model consists of two LSTM networks, LSTM_1 and LSTM_2, each of which processes one sentence of a given pair. The Siamese architecture relies on weight sharing (i.e., LSTM_1 = LSTM_2) to compare the two instances. Each LSTM learns to map a variable-length sequence of d-dimensional vectors into a fixed-size representation in R^(d_rep). Each sentence is represented as a sequence of word vectors x_1, …, x_T (i.e., pre-trained word embeddings) and is passed to the LSTM encoder defined by the LSTM equations above. The final representation of each sentence is the last hidden state h_T ∈ R^(d_rep). For each sentence pair, they apply a similarity function:

similarity function

where g is the similarity function. For each given pair this similarity function is applied to their encoder representation.

Unlike a standard LSTM used for language modeling, which predicts the next word from the previous ones, this LSTM is a simple encoder (Sutskever et al. 2014), as shown in Figure 3 (below). The LSTM encoder is trained to learn the similarity between the two sentence representations h_T and to predict the similarity score. The similarity function is based on the Manhattan distance:

g(h_T^(a), h_T^(b)) = exp(−‖h_T^(a) − h_T^(b)‖_1) ∈ [0, 1]

According to the authors, the Manhattan-distance-based similarity g outperforms more common alternatives, such as cosine similarity (Yih et al. 2011).

Figure 3. Example of learning the semantic similarity with LSTM. Learning the semantic similarity with Manhattan distance. Figure from [Mueller and Thyagarajan 2016].
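
As a small illustration of the Manhattan-distance similarity, the sketch below assumes the exponentiated negative L1 form used in the MaLSTM paper, which maps the distance into (0, 1]. The encodings are made-up stand-ins for the final hidden states of the two LSTMs.

```python
import math

def manhattan_similarity(h_a, h_b):
    """exp(-L1 distance): 1.0 for identical encodings, approaching 0 as they diverge."""
    l1_distance = sum(abs(a - b) for a, b in zip(h_a, h_b))
    return math.exp(-l1_distance)

# Made-up final hidden states for two sentence pairs.
identical = manhattan_similarity([0.2, -0.1], [0.2, -0.1])
different = manhattan_similarity([0.2, -0.1], [1.2, 0.9])
```

Because the score is bounded, it can be compared directly with human similarity labels after rescaling them to the same range.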


LSTMEmbed

Most recent approaches learn embeddings with feed-forward neural networks (FFNs). However, the advantage of LSTMs over FFNs is that they take into account both the word context and the word order. Iacobacci (2019) presents LSTMEmbed (Figure 4 below), a bidirectional LSTM (BiLSTM) based model for learning knowledge-based word embeddings. The model uses sense-tagged text to provide the input context in both directions: 1) the preceding context s_(i−W), …, s_(i−1) and 2) the posterior context s_(i+1), …, s_(i+W), where s_j (j ∈ [i − W, …, i + W]) is a word sense from an external knowledge resource (the existing inventory BabelNet (Navigli and Ponzetto 2012)). Each token is associated with an embedding vector v(s_j) ∈ R^n in a shared look-up table:

Next, the LSTMs are merged and projected linearly via a fully connected or dense layer:

where W^o ∈ R^(2m×m) is the weight matrix and m is the LSTM dimension. The model output out_LSTMEmbed is compared with a pre-trained embedding vector emb(s_i) (see Figure 4 below), such as GloVe or word2vec. The model is trained to maximize the similarity between out_LSTMEmbed and emb(s_i); therefore, the loss function is based on cosine similarity:

After the training, the model obtains joint representations in the same vector space from the look-up table. Precisely, senses and latent semantic representations of words are joined in the same vector space. Figure 4 shows an overview of the proposed architecture.

Figure 4. Example of learning the semantic similarity with LSTM. Knowledge-based embedding with LSTM: LSTMEmbed (Iacobacci 2019). The LSTM input is a look-up table from the existing inventory BabelNet (Navigli and Ponzetto 2012). Figure from Iacobacci (2019).

Attention Mechanism

Attention-based models have shown promising results in various NLP tasks, such as machine translation (Bahdanau et al. 2014) and caption generation (Xu et al. 2015). Such a mechanism learns to focus attention on a specific part of the input (e.g., a target word in a sentence). The encoder-decoder model that attention builds on can be described as follows:

First, the encoder processes the input sentence and encodes it into a context vector (the last hidden state), which acts as a summary of the input sentence. In other words, all intermediate hidden states are ignored, and only the final state is used as the initial state of the decoder.

Second, the decoder generates the output (e.g., the words or a summary of the words in that sentence) from the context vector.

However, one drawback is that the decoder depends on a single fixed-length context vector to generate the output, which is impractical for long sentences (the encoder often forgets the earliest elements by the time it has processed the entire sequence). For instance, in machine translation, a wrong context will lead to an incorrect translation. To solve this problem, rather than building a single fixed-length context vector from the last hidden state of the encoder, attention creates shortcuts between the context vector and the entire source input (Bahdanau et al. 2014). The weights of these shortcuts are adjusted for each output element. The model thus considers not only the context vector but also the relative importance of each element in the sequence. The context vector has access to the full input sequence, and the model learns the alignment between the target and the input source. The context vector depends on three pieces of information, as shown in Figure 5: the encoder hidden states, the decoder hidden states, and the alignment between the input source and the target.

Figure 5. A word alignment model (dotted-line box) that generates the context c_t for the target word y_t from a source sentence X_1, …, X_T. Figure adapted from Bahdanau et al. (2014).

The model introduced by Bahdanau et al. (2014) uses a bidirectional recurrent encoder that reads each sentence in both directions, producing a forward state →h_j and a backward state ←h_j for each position j. The two states are then concatenated as follows:

h_j = [→h_j; ←h_j]

The decoder computes its hidden state s_t = f(s_(t−1), y_(t−1), c_t) for the word at position t (t = 1, …, m), where c_t is the sum of the encoder hidden states weighted by the alignment scores:

c_t = Σ_(j=1..T) α_(tj) h_j

where α_(tj) is the weight computed at each time step t for the hidden state h_j, and T is the number of time steps in the input sequence. The context vector c_t is used to compute the new state s_t, which also depends on the previous state s_(t−1). The weights α_(tj) are computed as:

α_(tj) = exp(e_(tj)) / Σ_(k=1..T) exp(e_(tk))

The alignment model a is a feed-forward network, and the alignment score is calculated by:

e_(tj) = a(s_(t−1), h_j)

which scores how well the inputs around position j match the output at position t.
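
The computation of the context vector (alignment scores, softmax weights, weighted sum of encoder states) can be sketched as follows. The additive scoring form v_a · tanh(W_a s + U_a h) is the usual feed-forward alignment model; the parameter names `W_a`, `U_a`, `v_a` and all toy values are assumptions for illustration.

```python
import math

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    """Feed-forward alignment score v_a . tanh(W_a s_prev + U_a h_j)."""
    hidden = [
        math.tanh(sum(w * s for w, s in zip(W_row, s_prev))
                  + sum(u * h for u, h in zip(U_row, h_j)))
        for W_row, U_row in zip(W_a, U_a)
    ]
    return sum(v * x for v, x in zip(v_a, hidden))

def context_vector(s_prev, encoder_states, W_a, U_a, v_a):
    """Weight every encoder state by its alignment with the previous decoder state."""
    scores = [additive_score(s_prev, h_j, W_a, U_a, v_a) for h_j in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[i] for a, h in zip(alphas, encoder_states)) for i in range(dim)]
    return context, alphas

# Toy example: two encoder states, a 1-dimensional decoder state (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
s_prev = [1.0]
W_a, U_a, v_a = [[0.0]], [[1.0, -1.0]], [1.0]
context, alphas = context_vector(s_prev, encoder_states, W_a, U_a, v_a)
```

Unlike a single fixed context vector, this context is recomputed for every output position, so each target word can attend to different source positions.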

Next, we discuss the Transformer architecture or self-attention model without any recurrent network, which offers a significant improvement over the methods we described in this section, such as attention-based aligned recurrent models.


Transformer

In the previous section, we described a strong approach to capturing dependencies in long sequences: LSTMs with attention, in particular the encoder-decoder architecture known as the Seq2Seq model. The encoder maps the input sequence into a higher-dimensional space. That abstract vector is then fed into the decoder, which turns it into the output sequence.

However, although these models obtain state-of-the-art performance in sequence modeling (e.g., language modeling (Sutskever et al. 2014) and machine translation (Bahdanau et al. 2014)), recurrent networks (e.g., GRU and LSTM) have some drawbacks: 1) they are slow to train because they are difficult to parallelize, and 2) they are computationally expensive.

Vaswani et al. (2017) introduce the Transformer architecture, which implements the encoder and decoder without any recurrent network. The Transformer is based entirely on self-attention, in contrast to the attention-based recurrent alignment models described in the attention section above.

As shown in Figure 6 below, the Transformer uses scaled dot-product attention. The weights are determined by the dot product as follows:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

where the input is a set of key-value pairs (K, V) of dimension n, and 1/√d_k is the scaling factor. In the encoder, both K and V are the hidden states. In the decoder, the previous output is compressed into a query Q of dimension m, and the next output is produced by mapping Q against the set of keys and values.
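
Scaled dot-product attention can be sketched with plain lists as follows. This is a minimal single-head illustration without masking or learned projections; the toy matrices are made up so that the result can be checked by hand.

```python
import math

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-rows matrices."""
    d_k = len(K[0])
    outputs = []
    for query in Q:
        # Similarity of this query to every key, scaled by 1 / sqrt(d_k).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, V))
            for i in range(len(V[0]))
        ])
    return outputs

# One query and two identical keys, so both values get weight 0.5.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

The scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.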

The Transformer uses Multi-Head Attention (MHA). The main concept of MHA is that, rather than computing the attention once, it runs the scaled dot-product attention many times in parallel. As shown in Figure 6, the independent attention outputs are concatenated and linearly projected into the required dimension:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices learned by the network.

Figure 6. Full architecture of the transformer. Figure reproduced from Vaswani et al. (2017).

The encoder, the left-hand block in Figure 6 above, produces an attention-based representation with the ability to locate a particular piece of information in a large context. The authors stack N = 6 identical layers, and each layer has 1) a multi-head attention sub-layer and 2) a fully connected feed-forward sub-layer. Each sub-layer has a residual (skip) connection followed by layer normalization.

Finally, on top of the decoder stack, a linear layer followed by a softmax produces the final output. A sinusoidal positional encoding is introduced to preserve position information. The positional encoding can be added directly to the input, as it has the same dimension as the input embedding.
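
The sinusoidal positional encoding can be sketched as below, using the formulation from Vaswani et al. (2017): even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically. The function name and the toy embeddings are assumptions for illustration.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position codes."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=6)

# The encoding has the same width as the embeddings, so it is simply added.
toy_embeddings = [[0.5] * 6 for _ in range(4)]  # made-up token embeddings
encoder_input = [
    [e + p for e, p in zip(emb_row, pe_row)]
    for emb_row, pe_row in zip(toy_embeddings, pe)
]
```

Because the code for each position is deterministic, no extra parameters are learned, and the table can be extended to positions not seen during training.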

Learning Semantic Similarity with Transformer

In this section, we offer a simple introduction to the Universal Sentence Encoder, which uses a stack of Transformer layers to learn semantic similarity. In the next section, we discuss BERT (Devlin et al. 2019), the Transformer-based building block of current state-of-the-art models for sentence semantic similarity.

Universal Sentence Encoder

The Universal Sentence Encoder (Cer et al. 2018) is a transformer-based model that encodes sentences into sentence embedding vectors. The model constructs the sentence embedding with the encoding sub-graph of the Transformer architecture. The sub-graph uses the attention mechanism to compute context-aware representations of the words in a sentence, which take into account both the order and the identity of all the other words. These representations are converted into a fixed-length encoding vector by computing the element-wise sum of the representations at each word position. The encoder takes a lowercased, tokenized string as input and generates a 512-dimensional vector as the sentence embedding.

The model is designed to be as general-purpose as possible. This is achieved with multi-task learning, in which a single encoder feeds multiple downstream tasks. The supported tasks include a Skip-Thought-like task (Kiros et al. 2015), in which the Transformer replaces the LSTM of the original formulation, and classification tasks over supervised data.

The unsupervised training data for the sentence encoder are extracted from a variety of web sources, such as Wikipedia, question-answer pages, discussion forums, and web news. The unsupervised learning is augmented with training on supervised data.

For transfer learning tasks such as sentence similarity, the similarity can be computed directly from the two sentence embeddings u and v by converting their cosine similarity into an angular distance:

sim(u, v) = 1 − arccos( (u · v) / (‖u‖ ‖v‖) ) / π

where sim(u, v) is the similarity based on angular distance. Angular similarity distinguishes nearly parallel vectors better than raw cosine similarity.
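
The angular similarity between two sentence embeddings can be sketched as follows: the cosine similarity is converted to an angle with arccos and rescaled so that identical directions score 1 and opposite directions score 0. The vectors below are made-up stand-ins for real sentence embeddings.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angular_similarity(u, v):
    """1 - arccos(cosine) / pi, in the range [0, 1]."""
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos = max(-1.0, min(1.0, cosine_similarity(u, v)))
    return 1.0 - math.acos(cos) / math.pi

same = angular_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = angular_similarity([1.0, 0.0], [0.0, 1.0])
opposite = angular_similarity([1.0, 0.0], [-1.0, 0.0])
```

Near a cosine of 1 the arccos curve is steep, so small differences between very similar sentences are spread over a wider range of scores.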

Contextual Word Embedding Learning

Most current state-of-the-art approaches in many NLP tasks are based on the Transformer architecture, such as BERT (Devlin et al. 2019) and GPT-2 (Radford et al. 2019). Unlike static word embeddings, a Transformer-based pre-trained language model is able to learn dynamic embeddings. Next, we explore these dynamic embeddings, which are called contextual word embeddings.

Static word embedding approaches such as GloVe and fastText are context-insensitive (context-independent), i.e., each word has only one vector representation:

  • Each word always has the same vector representation, regardless of the sentence or context in which it occurs. For a word with more than one meaning or sense (a polysemous word), a single vector representation is not enough to capture its different meanings.
  • Even when a word has only one meaning or sense, its occurrences still differ in other aspects, such as syntactic behavior.

Some research directions have proposed injecting senses (the meanings of a word in different contexts) into the word embedding semantic space (Mancini et al. 2017, Iacobacci 2019). However, these approaches rely on large annotated corpora based on external resources, such as BabelNet (Navigli and Ponzetto 2012) and WordNet (Miller 1998), which makes them impractical in real applications. Also, despite their impressive performance over standard word embeddings (Mikolov et al. 2013b), these approaches are bounded by the limitations of the word2vec architecture, as they remain context-independent.

To solve this, dynamic word embeddings were introduced. Contextual word embedding learning computes the representation of each token based on its context, rather than using a static context-independent embedding. Next, we present two methods that led to a breakthrough in the field, with models outperforming humans on benchmarks such as the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2018).


ELMo

Embeddings from Language Models (ELMo) (Peters et al. 2018) are embeddings derived from a BiLSTM-based language model. ELMo uses a two-layer bidirectional language model (biLM). For each token t_k, an L-layer biLM computes a set of 2L + 1 representations:

R_k = {x_k, →h_(k,j), ←h_(k,j) | j = 1, …, L}

where L is the number of layers in the biLM, x_k is the context-independent token representation, and →h_(k,j) and ←h_(k,j) are the forward and backward LSTM outputs at layer j, with j = L being the top BiLSTM layer.

Table 2 shows that the biLM is able to disambiguate the correct sense of the word in the source sentence. In contrast, the GloVe neighbors are spread across several parts of speech (e.g., "game" as a noun and "playing" as a verb) but are concentrated in the sport-related sense of "play". This example shows the effectiveness of contextualized representations, which can be applied to a variety of NLP tasks.

| Source                                                | Nearest Neighbor                                                                                                                   |
|-------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------|
| GloVe: play                                           | playing, game, play, football, multiplayer                                                                                         |
| biLM: Chico Ruiz made a spectacular play on Alusik’s. | Kieffer, the only junior in the group, was commended for his ability to hit in the clutch as well as his all-round excellent play. |
Table 2. Example of nearest neighbours from Peters et al. (2018) for the highly polysemous word “play”, using the static embedding GloVe (Pennington et al. 2014) and the contextual embeddings from a biLM.


BERT

Bidirectional Encoder Representations from Transformers, or BERT (Devlin et al. 2019), is another contextual language model like ELMo. In contrast to ELMo, however, it does not rely on recurrent language models with static word embedding initialization, but provides an end-to-end language model that is based entirely on contextualized token embeddings. For this, a Transformer-based architecture (Vaswani et al. 2017) is combined with a masked language modeling objective, which allows training a model that sees the context to the left and to the right at the same time.

Figure 7. Different architectures of pre-training models. GPT is an autoregressive model that uses a left-to-right Transformer. BERT uses a bidirectional Transformer that conditions on both the left and right context in all layers. ELMo uses a BiLSTM and concatenates both LSTMs (left-to-right and right-to-left) to generate features. ELMo is the only feature-based method, while BERT and GPT are fine-tuning approaches. Figure reproduced from Devlin et al. (2019).

Combining self-attention in the Transformer with non-directional language modeling results in large performance gains compared to previous approaches.

The masked language model uses the output at the masked word’s position to predict the masked word. The masking covers 15% of the tokens in the input, and the model tries to predict the masked tokens:

Input [CLS] Rosa attended the play at the firework festival.

Randomly Mask [CLS] Rosa did not want to [MASK] card-game at the festival.

Unlike GPT (Radford et al. 2019), which uses a left-to-right Transformer (i.e., an autoregressive model), the Transformer encoder in BERT reads the entire sequence at once rather than left-to-right or right-to-left, as shown (red connections) in Figure 8. Therefore, although it is described as bidirectional, it could be regarded as a non-directional encoder.

Figure 8. Learning the semantic similarity with BERT (Devlin et al. 2019). Left) Fine-tuning the model with one FFNN layer with softmax. Right) Adding a pooling operation to derive a fixed-size sentence embedding by computing the mean of all output vectors, as in Sentence-BERT (SBERT) (Reimers and Gurevych 2019). Figures reproduced from Devlin et al. (2019) and Reimers and Gurevych (2019).

BERT has shown groundbreaking results in many tasks, such as question answering and Natural Language Inference (NLI) (Conneau et al., 2017). However, according to its main authors, it is not suited to the Semantic Textual Similarity (STS) task, since it does not generate a meaningful sentence vector for computing cosine similarity. The most common workaround, averaging the BERT output layer, is known as BERT embeddings. However, the result is often worse than averaging static embeddings, such as GloVe (Pennington et al. 2014). There are two common approaches to computing semantic similarity with BERT, as shown in Figure 8: Left) fine-tuning the model with one FFNN layer with softmax, and Right) adding a pooling operation that derives a fixed-size sentence embedding by computing the mean of all output vectors, as in Sentence-BERT (SBERT) (Reimers and Gurevych 2019).
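
The pooling approach can be illustrated with a short sketch: the token vectors of each sentence are averaged into one fixed-size vector, and two sentences are then compared with cosine similarity. The token vectors below are made-up stand-ins for encoder outputs; no real BERT model is involved, and padding masks are ignored.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors of one sentence into a single fixed-size vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up token outputs standing in for the encoder's last layer.
sentence_a = [[1.0, 0.0], [3.0, 0.0]]               # two tokens
sentence_b = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]   # three tokens

embedding_a = mean_pool(sentence_a)
embedding_b = mean_pool(sentence_b)
score = cosine_similarity(embedding_a, embedding_b)
```

Mean pooling gives sentences of different lengths embeddings of the same size, which is what makes a direct cosine comparison possible.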

BERT with Contrastive learning — unsupervised

As we mentioned above, to learn semantic similarity between two text fragments there are two common approaches: 1) fine-tuning with a linear layer and 2) adding a pooling layer. However, fine-tuning requires supervised labeled data such as NLI (Conneau et al., 2017), and labeled data is costly and not always available for downstream tasks. Next, we describe an unsupervised approach that uses contrastive learning.

The core idea of contrastive learning is to learn representations in an unsupervised manner by pulling semantically similar neighbors together as positive instances and pushing away non-related neighbors as negative instances (Hadsell et al., 2006). Note that in the supervised setting, the positive/negative pairs are defined by their labels.

SimCSE (Gao et al., 2021) applies contrastive learning via dropout (Srivastava et al., 2014) to build positive pairs (i.e., the positive is generated from the input sentence itself, and the negatives come from the rest of the batch), as shown in Figure 9 below.

Figure 9. Unsupervised SimCSE. The model uses the input sentence itself as positive pair (identical positive pairs) through dropout masking. Figures reproduced from (Gao et al., 2021).

SimCSE follows the SimCLR framework (Chen et al., 2020), which uses contrastive learning between images for computer vision tasks, with data augmentation to generate positive and negative pairs. It learns to maximize the agreement between differently augmented examples from the same data sample with a contrastive loss. In the SimCSE sentence scenario, dropout is employed as the data augmentation function.

Let h_i and h_i^+ be the representations of a semantically related pair (x_i, x_i^+). The training objective for this pair, with a mini-batch of N pairs, is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j^+)/\tau}}$$

where τ is a temperature hyperparameter and sim(h_1, h_2) is the cosine similarity:

$$\mathrm{sim}(h_1, h_2) = \frac{h_1^\top h_2}{\lVert h_1 \rVert \cdot \lVert h_2 \rVert}$$

where h = f_θ(x) is the representation of the input sentence x produced by the BERT encoder. All parameters of the encoder are then fine-tuned with this contrastive loss.

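In practice, dropout is applied to the fully connected layers:

$$h_i^{z} = f_\theta(x_i, z)$$

where z is a random dropout mask. The same input is fed to the encoder twice to get two different embeddings with different masks (z, z′). The loss function can be written as:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i^{z_i}, h_i^{z'_i})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i^{z_i}, h_j^{z'_j})/\tau}}$$

where z and z′ are dropout masks, and N is the number of sentences in the mini-batch.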
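The in-batch contrastive objective can be sketched in a few lines of Python. This is an illustration under assumptions, not the SimCSE code: the two lists of vectors stand in for the two dropout-perturbed encodings of the same batch, the function name `simcse_loss` is made up for this example, and the default temperature of 0.05 is an assumed value.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def simcse_loss(views_a, views_b, temperature=0.05):
    """Mean over i of -log( exp(sim(a_i, b_i)/t) / sum_j exp(sim(a_i, b_j)/t) )."""
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(views_a)

# Toy stand-ins (assumption) for two encodings of the same two sentences.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = simcse_loss(view_a, view_b)         # positives on the diagonal
shuffled_loss = simcse_loss(view_a, view_b[::-1])  # positives deliberately misplaced
```

When each sentence is closest to its own second view, the loss is near zero; misplacing the positives makes it large, which is the signal used to fine-tune the encoder.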


Text-To-Text Transfer Transformer (T5) (Raffel et al., 2020) is another Transformer-based model that achieves state-of-the-art results in different NLP sequence-to-sequence mapping tasks such as question answering. Unlike BERT, which is encoder-only, T5 is an encoder-decoder model. Sentence-T5 (Ni et al., 2021), however, explores both options: encoder-only, as in BERT, and encoder-decoder, as in the original T5.

Sentence-T5 (ST5) explores three methods to extract sentence-level representations from the original T5, shown in Figure 10 (a). As shown in Figure 10 (b, c), the first two methods follow BERT and use the encoder only, with a pooling strategy (e.g., mean or first). The third method uses the T5 encoder-decoder, as shown in (d).

Figure 10. Illustration of architecture diagrams of the original T5 (a) and three ST5 variants with encoders only (b, c), and in (d) encoder-decoder model. Figure reproduced from (Ni et al., 2021).

However, unlike BERT (see Figure 8), which uses a [CLS] token at the beginning of each sentence, T5 as a seq-to-seq model is assumed to be aware of the semantics of the entire sentence when generating the prediction.

For training, ST5 follows the same approach as SimCSE (Gao et al., 2021) and applies contrastive learning via cosine similarity to the original T5 to extract sentence representations.

Self-Supervised Mirror-BERT

As mentioned in the BERT section, BERT (i.e., a Masked Language Model) is not useful for semantic similarity tasks as an out-of-the-box pre-trained sentence encoder and needs supervised fine-tuning (e.g., on an NLI dataset) to generate a meaningful vector.

The next work (Liu et al., 2021) proposes Self-Supervised Mirror-BERT, which fine-tunes any BERT-style MLM with contrastive learning using only the model itself (self-supervised). More specifically, as shown in Figure 11, (1) random spans of the input are masked, (2) dropout is used to create mirrored positive examples, and finally (3) a contrastive learning loss is applied to such mirrored pairs, i.e., pulling positive pairs together and pushing negative pairs away.

Figure 11. Self-supervised Mirror-BERT. The model 1) masks random spans of the input, 2) applies dropout to extract identity-based (i.e., “mirrored”) positive examples for fine-tuning, and 3) uses a contrastive learning objective to pull such “mirrored” positive pairs together and push negative pairs away. Figure reproduced from (Liu et al., 2021).
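Step (1), random span masking, can be sketched as below. This is a simplified token-level illustration: the function name `random_span_mask`, the span length of 2, and the `[MASK]` string are assumptions for this example, not the settings used by Mirror-BERT.

```python
import random

def random_span_mask(tokens, span_length=2, mask_token="[MASK]", rng=random):
    """Replace one randomly chosen contiguous span of tokens with mask tokens."""
    if len(tokens) <= span_length:
        return list(tokens)  # too short to mask a span; returned unchanged
    start = rng.randrange(0, len(tokens) - span_length + 1)
    masked = list(tokens)
    masked[start:start + span_length] = [mask_token] * span_length
    return masked

rng = random.Random(0)  # seeded only to make the example repeatable
tokens = "Rosa attended the play at the festival".split()

# Two independently masked copies of the same sentence form a "mirrored" pair.
view_1 = random_span_mask(tokens, rng=rng)
view_2 = random_span_mask(tokens, rng=rng)
```

Both views are then encoded with dropout active, so the pair differs in the input (the masked span) and in the feature space (the dropout mask).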

Self-Guided Contrastive Learning-BERT

This work (Kim et al., 2021) also proposes a contrastive learning method, but with self-guidance from the model itself, without relying on data augmentation such as the dropout used in the works mentioned above.

Contrastive Learning with Self-Guidance. The idea of self-guidance is to use the model itself (i.e., its hidden layers) for contrastive learning without relying on data augmentation like the previous works mentioned above (i.e., SimCSE and Mirror-BERT). The process is summarized in Figure 12 and works as follows:

Figure 12. Self-guided contrastive learning framework. BERT is cloned into two copies at the beginning of training. However, only one copy, BERT_T (except Layer 0), is fine-tuned to optimize the sentence vector c_i. The light and dark gray colors indicate the fixed and fine-tuned layers, respectively. Figure reproduced from (Kim et al., 2021).

First, BERT is cloned into two copies, BERT_F (fixed) and BERT_T (tuned). BERT_F is frozen during training to provide a fixed signal from the original BERT, while BERT_T is fine-tuned to construct better sentence embeddings. The main idea is to take advantage of the embeddings from different layers and then introduce new information with extra training via a contrastive loss.

Also, note that, as in the original BERT, the [CLS] vector from the last layer is used after fine-tuning, shown as the output c_i in Figure 12.

Second, given b sentences in a mini-batch, say s_1, s_2, ..., s_b, each sentence s_i is fed into BERT_F to compute the token-level hidden representations:

$$H_{i,k} \in \mathbb{R}^{\mathrm{len}(s_i) \times d}$$

where 0 ≤ k ≤ n (0: the non-contextualized layer), n is the number of hidden layers, len(s_i) is the tokenized sentence length, and d is the dimension of the hidden representations.

Then a pooling function p is applied to H_{i,k}. To extract diverse sentence-level views from all BERT layers, the pooling function is applied to every layer, as shown in Figure 12, and finally a sampler function σ selects the view to use:

$$h_{i,k} = p(H_{i,k}), \qquad h_i = \sigma(\{h_{i,k} \mid 0 \le k \le n\})$$

Also, since each BERT layer specializes in capturing different linguistic concepts (Jawahar et al., 2019), max pooling is used for p and a uniform sampler for σ, which gives each layer's h_{i,k} the same importance.
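The pooling and sampling step can be illustrated as follows. This is a sketch under the assumption that p is max pooling over tokens and σ picks one layer uniformly at random; the nested lists are toy hidden states (layers × tokens × dimensions), and the function names are chosen for this example.

```python
import random

def max_pool(layer_tokens):
    """p: element-wise maximum over the token vectors of one layer."""
    dim = len(layer_tokens[0])
    return [max(tok[d] for tok in layer_tokens) for d in range(dim)]

def sample_view(hidden_layers, rng=random):
    """sigma: pool every layer, then pick one pooled vector uniformly at random."""
    pooled = [max_pool(layer) for layer in hidden_layers]
    return rng.choice(pooled)

# Toy hidden states (assumption): 2 layers, 2 tokens each, dimension 2.
hidden_layers = [
    [[0.0, 1.0], [2.0, 0.0]],  # layer 0
    [[4.0, 4.0], [1.0, 6.0]],  # layer 1
]
h_i = sample_view(hidden_layers, rng=random.Random(0))
```

Because the sampler is uniform, every layer has the same chance of supplying the view h_i for a given training step.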

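Third, the sentence embedding c_i for s_i is computed as follows:

$$c_i = \mathrm{BERT}_T(s_i)_{[\mathrm{CLS}]}$$

where BERT(·)_[CLS] corresponds to the [CLS] vector from the last layer of BERT. Then, a set X is collected as:

$$X = \{x \mid x \in \{c_i\} \cup \{h_i\}\}$$

and then the NT-Xent loss from the SimCLR framework (Chen et al., 2020) is computed for each x_m ∈ X as:

$$L_m = -\log \frac{\phi(x_m, \mu(x_m))}{Z}, \qquad \phi(u, v) = \exp\big(g(f(u), f(v))/\tau\big), \qquad Z = \sum_{n=1,\, n \ne m}^{2b} \phi(x_m, x_n)$$

where τ is the temperature, f is the projection head that consists of MLP layers, g(u, v) is the cosine similarity function, and μ(·) is the matching function defined as:

$$\mu(x) = \begin{cases} h_i & \text{if } x = c_i \\ c_i & \text{if } x = h_i \end{cases}$$

Finally, the L_m terms are summed and divided by 2b, and a regularizer is added to keep BERT_T close to BERT_F. The loss function is computed as:

$$L = \frac{1}{2b} \sum_{m=1}^{2b} L_m + \lambda \cdot \lVert \mathrm{BERT}_F - \mathrm{BERT}_T \rVert_2^2$$

where the coefficient λ is a hyperparameter. To conclude, this method refines BERT using the model itself (i.e., its hidden layers) so that c_i has a higher similarity with h_i, which is another representation of s_i, as shown in Figure 12.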
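A small numerical sketch of this loss may help. It makes several simplifying assumptions: the projection head f is the identity, the temperature of 0.1 is an arbitrary choice, the model parameters are flattened into plain lists, and the names `self_guided_loss` and `regularizer` are invented for this example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match(m, b):
    """Index form of the matching function: c_i <-> h_i inside x = c + h."""
    return m + b if m < b else m - b

def self_guided_loss(c, h, temperature=0.1):
    """NT-Xent averaged over the 2b vectors in x = c + h (projection head omitted)."""
    b = len(c)
    x = c + h
    total = 0.0
    for m in range(2 * b):
        phi = [math.exp(cosine(x[m], x[n]) / temperature) for n in range(2 * b)]
        z = sum(phi) - phi[m]  # sum over all n != m
        total += -math.log(phi[match(m, b)] / z)
    return total / (2 * b)

def regularizer(params_fixed, params_tuned, lam):
    """lam times the squared L2 distance between the two parameter vectors."""
    return lam * sum((f - t) ** 2 for f, t in zip(params_fixed, params_tuned))

c = [[1.0, 0.0], [0.0, 1.0]]                      # toy [CLS] vectors from the tuned model
h_good = [[1.0, 0.0], [0.0, 1.0]]                 # views that agree with c
h_bad = [[0.0, 1.0], [1.0, 0.0]]                  # views swapped between sentences
loss_good = self_guided_loss(c, h_good)
loss_bad = self_guided_loss(c, h_bad)
total_loss = loss_good + regularizer([1.0, 2.0], [1.0, 4.0], lam=0.5)
```

The loss is small when each c_i is closest to its own h_i and large when the views are swapped, and the regularizer grows as the tuned parameters drift away from the fixed ones.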

ConSERT-Self-Supervised- BERT

This work (Yan et al., 2021) is heavily inspired by the SimCLR framework (Chen et al., 2020). As shown in Figure 13 below, (1) data augmentation strategies (e.g., token shuffling, feature cutoff, and dropout) are used to generate different samples from the embedding layer; (2) a shared BERT with pooling layers encodes them; and finally (3) a contrastive loss keeps similar sentences together and pushes away non-related sentences as negative samples.

Figure 13. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer. The figure is simplified from (Yan et al., 2021).

Data Augmentation Strategies

Different data augmentation strategies are used to generate the positive and negative pairs for contrastive learning, as shown in Figure 14 below.

a) Adversarial Attack. The idea of an adversarial attack is to improve the robustness of the model by adding a worst-case perturbation to the input. In this case, Fast Gradient Value (FGV) (Rozsa et al., 2016) is used to compute the perturbed version in a supervised manner, as shown in Figure 14 (a).

b) Token Shuffling. A random shuffling of the input token sequence. In particular, the position ids are shuffled without changing the order of the tokens (Lee et al., 2020), as shown in Figure 14 (b).

c) Cutoff (Shen et al. 2020). A cutoff randomly erases part of the input. Two types of cutoff are employed: 1) a random token-level cutoff, and 2) a feature-level cutoff in the embedding matrix, as shown in Figure 14 (c, left figure).

d) Dropout. The most widely used data augmentation strategy in contrastive learning for sentence embedding. In this case, dropout randomly zeroes out some elements in the token embedding layer (each individually), as shown in Figure 14 (d).

Figure 14. Data augmentation strategies. Figure reproduced from (Yan et al., 2021)
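Strategies (b) to (d) are simple enough to sketch on a toy embedding matrix (rows are tokens, columns are features). This is an illustration, not ConSERT's code: the function names and the `rate` parameter are assumptions, token cutoff is taken to zero whole rows and feature cutoff whole columns, and the dropout here omits the usual rescaling of the surviving values.

```python
import random

def shuffle_position_ids(num_tokens, rng=random):
    """Token shuffling: permute the position ids; the token ids keep their order."""
    position_ids = list(range(num_tokens))
    rng.shuffle(position_ids)
    return position_ids

def token_cutoff(embeddings, rate, rng=random):
    """Zero out whole token rows, each chosen with probability `rate`."""
    return [[0.0] * len(row) if rng.random() < rate else list(row) for row in embeddings]

def feature_cutoff(embeddings, rate, rng=random):
    """Zero out whole feature columns, each chosen with probability `rate`."""
    dim = len(embeddings[0])
    dropped = {d for d in range(dim) if rng.random() < rate}
    return [[0.0 if d in dropped else v for d, v in enumerate(row)] for row in embeddings]

def dropout(embeddings, rate, rng=random):
    """Zero out individual elements independently with probability `rate`."""
    return [[0.0 if rng.random() < rate else v for v in row] for row in embeddings]

rng = random.Random(0)  # seeded only to make the example repeatable
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]

positions = shuffle_position_ids(len(embeddings), rng=rng)
view_tokens = token_cutoff(embeddings, rate=0.3, rng=rng)
view_features = feature_cutoff(embeddings, rate=0.3, rng=rng)
view_dropout = dropout(embeddings, rate=0.1, rng=rng)
```

Any two of these views of the same sentence can serve as a positive pair, while views of other sentences in the batch act as negatives.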


This is a follow-up to the Self-Supervised Mirror-BERT work mentioned above. This work (Liu et al., 2022) proposes a framework called TRANS-ENCODER that combines bi-encoders and cross-encoders on top of any Pre-trained Language Model (PLM) (e.g., SimCSE, Mirror-BERT, etc.). Next, we describe the process of TRANS-ENCODER in more detail.

Transform off-the-shelf PLMs into a bi-encoder. Contrastive learning similar to SimCSE (Gao et al., 2021) is used to convert an out-of-the-box PLM into a bi-encoder. Let f(·) be the encoder model, and X be a random batch of raw sentences. For any sentence x_i ∈ X, the encoder creates two representations of the same data point, f(x_i) and f(x̄_i). Then, an InfoNCE loss (Oord et al., 2018) is used to pull positive pairs together and push away negative pairs in each mini-batch:

$$L_{\mathrm{infoNCE}} = -\sum_{i=1}^{|X|} \log \frac{\exp\big(\cos(f(x_i), f(\bar{x}_i))/\tau\big)}{\sum_{x_j \in N_i} \exp\big(\cos(f(x_i), f(x_j))/\tau\big)}$$

where τ is the temperature parameter and N_i refers to all positives and negatives of x_i within the current batch. The numerator is the similarity of the self-duplicated positive pair (x_i, x̄_i), which needs to be maximized. The denominator sums the similarities between x_i and every sample in N_i, so the similarity to the negative samples is minimized.

Self-Distillation: Bi-to-Cross-Encoder. Knowledge distillation usually transfers knowledge from a teacher (bigger model) to a student (smaller model); in self-distillation, the same model plays both roles. The bi-encoder to cross-encoder self-distillation can be done as follows:

First, once a sufficiently good bi-encoder is obtained, it is used to label sentence pairs: the two sentences (sent1, sent2) are given as input separately to get two different embeddings, and their cosine similarity is taken as their relevance score. By doing this, we can generate a self-labelled sentence-pair scoring dataset (sent1, sent2, score).

Secondly, the same model is used to learn these scores, but as a cross-encoder. The objective is to minimize the KL divergence between the self-labelled scores from the bi-encoder (mentioned above) and the predictions of the cross-encoder. For this, a soft binary cross-entropy loss can be used:

$$L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{n=1}^{N} \big[ y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log(1 - \sigma(x_n)) \big]$$

where N is the data-batch size, σ(·) is the activation function (sigmoid), x_n is the prediction of the cross-encoder, and y_n is the self-labelled sentence-pair ground-truth score from the bi-encoder.
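The soft binary cross-entropy can be sketched as below. It is a minimal illustration, assuming the self-labelled scores already lie in [0, 1] and the cross-encoder outputs raw logits; the function names are chosen for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_bce(logits, soft_labels):
    """Mean binary cross-entropy where the labels are scores in [0, 1], not only 0 or 1."""
    total = 0.0
    for x, y in zip(logits, soft_labels):
        p = sigmoid(x)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(logits)

# Hypothetical self-labelled scores from the bi-encoder for two sentence pairs.
bi_encoder_scores = [0.9, 0.2]

# Cross-encoder logits that agree with the labels vs. logits that contradict them.
agreeing = soft_bce([2.0, -1.5], bi_encoder_scores)
contradicting = soft_bce([-2.0, 1.5], bi_encoder_scores)
```

Unlike hard 0/1 labels, a soft label such as 0.9 keeps the loss from pushing the logit towards infinity: the minimum is reached when σ(x_n) equals y_n.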

Note that the bi-encoder here can be viewed as the teacher and the cross-encoder as the student. However, in this case the student ends up better than the teacher, which in turn helps improve the learning of both encoders.

Figure 15. TRANS-ENCODER. The proposed model uses self-distillation learning from the same model (as shown in yellow). Figure reproduced from (Liu et al., 2022).

Self-Distillation: Cross-Encoder-to-Bi-Encoder (Backward process). Once a strong cross-encoder has been learned, a natural next step is to distill it back into the bi-encoder to recover the extra knowledge. In addition, a better encoder can generate more accurate self-labelled data. Figure 15 shows the process of distillation as a loop (in red) between the bi- and cross-encoders.

The cross-encoder now provides the self-labelled scores, and the cosine similarity of the bi-encoder's two embeddings is taken as the prediction. The bi-encoder is trained to match the self-labelled scores with a mean squared error (MSE) loss:

$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2$$

where N is the batch size, x_n is the cosine similarity between a sentence pair, and y_n is the self-labelled ground truth.

Mutual Distillation. Because the model learns from its own labels in a closed loop, its errors can be amplified, which hurts learning. This can be solved by (1) running self-distillation on multiple PLMs in parallel, (2) restricting the communication between them except when generating the self-labelled scores, and (3) taking the average of the predictions of all models.
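Step (3) amounts to averaging the scores that the models assign to the same sentence pairs. A minimal sketch, with made-up scores and a hypothetical function name `mutual_labels`:

```python
def mutual_labels(predictions_per_model):
    """Average the scores that several models assign to the same sentence pairs."""
    num_models = len(predictions_per_model)
    return [sum(scores) / num_models for scores in zip(*predictions_per_model)]

# Hypothetical scores from two PLMs for the same three sentence pairs.
model_a_scores = [0.9, 0.2, 0.6]
model_b_scores = [0.7, 0.4, 0.6]

shared_labels = mutual_labels([model_a_scores, model_b_scores])
```

Each model is then trained on `shared_labels` instead of its own predictions, so an error made by one model is dampened by the others rather than fed straight back into it.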

Feel free to cite our article if this literature review is helpful to you.

title = "Review: Deep Learning for Sentence Semantic Similarity",
author = "Sabir, Ahmed",
year = "2022",
url = ""


Please refer to each link attached with references for the pdf version.


