An application for Semantic Relatedness: Post OCR Correction

Feb 20, 2022

In this post, I discuss our work that uses a semantic relatedness measure as a post-processing method to improve text recognition in the wild (a.k.a. OCR in the wild). The same approach can also be applied to special cases of semantic relatedness, such as semantic similarity or duplicate question-and-answer detection.

Many applications, such as textual entailment, plagiarism detection, and document clustering, rely on the notion of semantic similarity and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. However, semantic similarity and semantic relatedness are two separate notions, which can be defined as follows:

Semantic Similarity: a special case of relatedness that is tied to the likeness of the concepts (e.g., car, truck).

Semantic Relatedness: a more general notion of how concepts are related. In particular, it refers to human-like judgments of the degree to which a given pair of concepts is related (e.g., car, parking).

Here, I will talk about a scenario where semantic similarity is not enough, and we need to design a neural approach to learn the semantic relatedness for this task.

Semantic Relatedness for Post OCR Correction

The scenario is OCR in the wild (a.k.a. text spotting in the wild), where text in an image (e.g., a street sign, an advertisement, or a bus destination) must be identified and recognized. The main goal is to improve the performance of vision systems by leveraging semantic information. However, this approach can be used to learn the semantic relatedness between any two texts. Our code is available on GitHub and Colab for a quick start.

Introduction

Deep learning has been successful in tasks related to deciding whether two short pieces of text refer to the same topic, e.g., semantic textual similarity (Cer et al., 2018), textual entailment (Parikh et al., 2016), or answer ranking for Q&A (Severyn and Moschitti, 2015).

However, other tasks require a broader perspective: deciding whether two text fragments are related, rather than whether they are similar. In this work, we describe one such task, and we retrain some of the existing sentence similarity approaches to learn this semantic relatedness. We also present a new neural approach that outperforms existing approaches when applied to this particular scenario.

Learning Semantic Relatedness for OCR Correction

To learn the semantic relatedness between the visual context information and the candidate word, we introduce a multi-channel convolutional LSTM with an attention mechanism. The network is fed with the candidate word plus several words describing the visual context of the image (object and place labels, and descriptive captions), and is trained to produce a relatedness score between the candidate word and the context.

The architecture is inspired by Severyn and Moschitti (2015), who proposed CNN-based re-rankers for Q&A. The network consists of two subnetworks, each with four channels with kernel sizes k = (3, 3, 5, 8), and an overlap layer, as shown below (Figure 1). Next, we describe the main components:

Figure 1. Detail of the proposed architecture to estimate the semantic relatedness between a candidate word provided by an off-the-shelf OCR approach and the context in the image. We consider several sources of context including object and place labels and textual descriptions obtained by pretrained caption generation networks.

Multi-Channel Convolution. The first subnetwork consists of only convolution kernels and aims to extract n-gram or keyword features from the caption sequence.

Multi-Channel Convolution-LSTM. Following C-LSTM (Zhou et al., 2015), we forward the output of the CNN layers into an LSTM, which captures the long-term dependencies over the features. We further introduce an attention mechanism to capture the most important features from that sequence. The advantage of this attention is that the model learns the sequence without relying on the temporal order. We describe the attention mechanism in more detail below.

Also, following Zhou et al. (2015), we do not use a pooling operation after the convolution feature map. A pooling layer is usually applied after the convolution layer to extract the most important features in the sequence. However, the output of our Convolutional-LSTM model is fed into an LSTM (Hochreiter and Schmidhuber, 1997) to learn the extracted sequence, and a pooling layer would break that sequence by downsampling it to a selected feature. In short, an LSTM is specialized in learning sequence data, and a pooling operation would break the sequence order. For the Multi-Channel Convolution model, we likewise learn the extracted word n-gram sequence directly, without feature selection through a pooling operation.
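To make the "no pooling" point concrete, here is a minimal sketch of a multi-channel 1D convolution in plain Python. It is not our actual implementation: the single random filter per channel, the ReLU activation, and the function names `conv1d` and `multi_channel_conv` are assumptions made for illustration. Only the kernel sizes (3, 3, 5, 8) come from the text above.

```python
import random


def conv1d(sequence, kernel, bias=0.0):
    """Valid 1D convolution over a sequence of embedding vectors.

    sequence: list of T vectors, each of dimension d
    kernel:   list of k vectors, each of dimension d
    Returns T - k + 1 activations, one per window position (no pooling).
    """
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = bias
        for vec, weights in zip(window, kernel):
            total += sum(x * w for x, w in zip(vec, weights))
        outputs.append(max(0.0, total))  # ReLU (assumed activation)
    return outputs


def multi_channel_conv(sequence, kernel_sizes=(3, 3, 5, 8), seed=0):
    """Run one randomly initialised filter per channel (illustrative only)."""
    rng = random.Random(seed)
    dim = len(sequence[0])
    feature_maps = []
    for k in kernel_sizes:
        kernel = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(k)]
        feature_maps.append(conv1d(sequence, kernel))
    return feature_maps


# Hypothetical input: a 10-token caption with 4-dimensional embeddings.
rng = random.Random(1)
caption = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
maps = multi_channel_conv(caption)
print([len(m) for m in maps])  # [8, 8, 6, 3]
```

Each channel keeps one value per window position, so the feature map is still an ordered sequence that an LSTM can consume. A max-pooling step would collapse each map to a single number and discard that order.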

Attention Mechanism. Attention-based models have shown promising results on various NLP tasks (Bahdanau et al., 2014). Such a mechanism learns to focus on a specific part of the input (e.g., a relevant word in a sentence). We apply an attention mechanism (Raffel and Ellis, 2015) via an LSTM that captures the temporal dependencies in the sequence.
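As a rough illustration of this kind of attention, the sketch below scores each hidden state, normalizes the scores with a softmax, and returns the weighted sum of the states. The scoring function (a single `tanh` unit with hand-picked weights) and the names `softmax` and `feed_forward_attention` are assumptions for the example, not the trained layer from our model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feed_forward_attention(hidden_states, weights, bias=0.0):
    """Score each state, normalise the scores, and average the states.

    hidden_states: list of T vectors (e.g. LSTM outputs)
    weights, bias: parameters of the (assumed) scoring unit
    Returns (attention weights, context vector).
    """
    scores = [
        math.tanh(sum(w * h for w, h in zip(weights, state)) + bias)
        for state in hidden_states
    ]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(a * state[i] for a, state in zip(alphas, hidden_states))
        for i in range(dim)
    ]
    return alphas, context


# Hypothetical hidden states for a three-step sequence.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, context = feed_forward_attention(states, weights=[1.0, 1.0])
print(alphas, context)
```

Because the context vector is a weighted average over all time steps, the model can pick out the most informative position regardless of where it occurs in the sequence.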

Overlap Layer. The overlap layer is just a frequency count dictionary that computes overlap information across the inputs. The idea is to give more weight to the most frequent visual element, especially when it is observed by more than one visual classifier. The dictionary output is a fully connected layer.
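A minimal sketch of such a frequency count, assuming that "overlap" means the number of context sources (object classifier, place classifier, caption) in which a token appears. The function name `overlap_counts` and the example labels are hypothetical.

```python
from collections import Counter


def overlap_counts(object_labels, place_labels, caption):
    """Count in how many context sources each token appears.

    Assumption: every source contributes at most one count per token,
    so a count of 2 or more means several classifiers agree.
    """
    sources = [object_labels, place_labels, caption.split()]
    counts = Counter()
    for tokens in sources:
        counts.update({token.lower() for token in tokens})
    return counts


counts = overlap_counts(
    object_labels=["airliner"],
    place_labels=["runway"],
    caption="a blue airliner on the runway",
)
print(counts["airliner"], counts["runway"], counts["blue"])  # 2 2 1
```

Tokens confirmed by more than one source ("airliner", "runway") get a higher count than tokens seen only once ("blue"), which is the signal the overlap layer is meant to expose.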

Finally, we merge all subnetworks into a joint layer that is fed to a loss function which calculates the semantic relatedness between both inputs. We call the combined model Fusion Dual Convolution-LSTM-Attention (FDCLSTMAT).

Implementation details

Masking. Since we have only one candidate word at a time, we apply a convolution with masking on the candidate word side (first channel). In this case, simply zero-padding the sequence has a negative impact on the learning stability of the network. We concatenate the CNN outputs with the additional feature, feed them into MLP layers, and finally into a sigmoid layer that performs binary classification.

Training. We trained the model with a binary cross-entropy loss (l), where the target value (in [0,1]) is the semantic relatedness between the word and the visual context. Instead of restricting ourselves to a simple similarity function, we let the network learn the margin between the two classes, i.e., the degree of similarity. For this, we increase the depth of the network after the MLP merge layer with more fully connected layers. The network is trained using Nesterov-accelerated Adam (Nadam) (Dozat, 2016), as it yields better results (especially in cases such as word vectors and neural language modelling) than optimizers that use only classical momentum, such as Adam. We apply batch normalization (BN) (Ioffe and Szegedy, 2015) after each convolution and between the MLP layers. We omitted the BN after the convolution for the model without attention (FDCLSTM), as BN deteriorated the performance. Additionally, we use 70% dropout (Srivastava et al., 2014) between the MLP layers for regularization purposes.
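For readers who want to see the loss spelled out, here is a small plain-Python version of binary cross-entropy averaged over a batch. The clipping constant `eps` and the function name are assumptions added for numerical safety in this sketch; deep learning frameworks ship their own implementations.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between targets in [0, 1] and sigmoid outputs."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(targets)


# One related pair (target 1) and one unrelated pair (target 0).
loss = binary_cross_entropy([1.0, 0.0], [0.9, 0.1])
print(round(loss, 4))  # 0.1054
```

Confident, correct predictions give a loss close to zero, while confident mistakes are penalized heavily, which pushes the sigmoid output toward a usable relatedness score.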

Dataset

We evaluate the performance of the proposed approach on the noisy COCO-text dataset (Veit et al., 2016). This dataset is based on Microsoft COCO (Common Objects in Context) (Lin et al., 2014) and consists of 63,686 images and 173,589 text instances (annotations of the images). The dataset does not include any visual context information, so we used out-of-the-box (1) object (He et al., 2016) and (2) place (Zhou et al., 2014) classifiers, and tuned a caption generator (Vinyals et al., 2015) on the same dataset to extract contextual information from each image. The dataset is available on GitHub.

Experiments and Results

In the following, we use different similarity or relatedness scorers to reorder the k-best hypotheses produced by an off-the-shelf state-of-the-art OCR system. We experimented with extracting k-best hypotheses for k = 1…10.

We use two pre-trained deep models, a CNN (Jaderberg et al., 2016) and an LSTM (Ghosh et al., 2017), as baselines (BL) to extract the initial list of word hypotheses. The CNN baseline uses a closed lexicon; therefore, it cannot recognize any word outside its 90K-word dictionary. Table 1 (below) presents four different accuracy metrics for this case: 1) full columns correspond to the accuracy on the whole dataset. 2) dict columns correspond to the accuracy over the cases where the target word is among the 90K words of the CNN dictionary (which correspond to 43.3% of the whole dataset). 3) list columns report the accuracy over the cases where the right word was among the k-best produced by the baseline. 4) Mean Reciprocal Rank (MRR), where rank k is the position of the correct answer in the hypotheses list proposed by the baseline. However, for the sake of clarity, we only discuss the CNN baseline.
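The four metrics above can be sketched as follows. This is an illustration rather than our evaluation script: the function name `evaluate`, the toy examples, and the choice to average the reciprocal rank over all examples (counting 0 when the target is not in the list) are assumptions.

```python
def evaluate(examples, lexicon):
    """Compute full / dict / list accuracy and MRR.

    examples: list of (target_word, ranked_hypotheses) pairs
    lexicon:  set of words known to the closed-lexicon baseline
    """
    n_dict = n_list = 0
    hit_full = hit_dict = hit_list = 0
    reciprocal_ranks = 0.0
    for target, hyps in examples:
        correct = bool(hyps) and hyps[0] == target
        hit_full += correct
        if target in lexicon:
            n_dict += 1
            hit_dict += correct
        if target in hyps:
            n_list += 1
            hit_list += correct
            reciprocal_ranks += 1.0 / (hyps.index(target) + 1)

    def ratio(hits, total):
        return hits / total if total else 0.0

    return {
        "full": ratio(hit_full, len(examples)),
        "dict": ratio(hit_dict, n_dict),
        "list": ratio(hit_list, n_list),
        "mrr": ratio(reciprocal_ranks, len(examples)),
    }


# Toy data: (ground truth, re-ranked hypotheses), not taken from the dataset.
examples = [
    ("delta", ["delta", "della"]),
    ("school", ["scnool", "school"]),
    ("ski", ["sk", "by"]),
    ("plaza", ["plaza"]),
]
scores = evaluate(examples, lexicon={"delta", "school", "plaza"})
print(scores)
```

On this toy data, full accuracy is 2/4, while dict and list accuracy are both 2/3, because "ski" is neither in the lexicon nor in its hypothesis list and is therefore excluded from those two denominators.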

Comparing with sentence-level models. We compare the results of our encoder with several state-of-the-art sentence encoders, tuned or trained on the same dataset. We use cosine similarity to compare the caption and the candidate word. Word-to-sentence representations are computed with the Universal Sentence Encoder with the Transformer, USE-T (Cer et al., 2018), and InferSent (Conneau et al., 2017) with GloVe (Pennington et al., 2014). The rest of the systems in Table 1 (below) are trained under the same conditions as our model, with GloVe initialization and dual-channel overlapping non-static pre-trained embeddings on the same dataset. Our model FDCLSTM without attention achieves a better result in the case of the second baseline (LSTM), whose output is full of false positives and short words. The advantage of the attention mechanism is its ability to integrate information over time, and it allows the model to refer to specific points in the sequence when computing its output.
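The cosine scoring used for these encoder baselines can be written in a few lines. The vectors below are made up for illustration; in practice they would come from the sentence encoder (e.g., USE-T or InferSent).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for a candidate word and a caption.
word_vec = [0.2, 0.7, 0.1]
caption_vec = [0.25, 0.6, 0.0]
print(cosine_similarity(word_vec, caption_vec))
```

Unlike the trained re-rankers, this score is fixed by the embedding space, so it can only be as good as the pre-trained vectors are at capturing relatedness rather than similarity.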

BERT. BERT (Bidirectional Encoder Representations from Transformers) has shown groundbreaking results in many tasks such as Q&A and Natural Language Inference. However, as mentioned by its authors, it is not suited for Semantic Textual Similarity (STS) tasks, as it does not generate a meaningful vector to compute the cosine distance. This can be seen in Table 1, BERT (feature). Therefore, we fine-tuned the model with one additional layer to compute the semantic score between the caption and the candidate word. In particular, we fed the sentence representation into a linear layer and a softmax for sentence pair tasks. A BERT model fine-tuned on the same dataset outperforms our model BL+FDCLSTM+AT+lexicon by a small, non-significant margin on the first baseline.

We believe that BERT has an advantage over our model in terms of the amount and diversity of the training data. For instance, it is able to solve cases without direct context such as Anderson(name)-plaza(place).

Comparing with word-level models. We also compare our results with current state-of-the-art word embeddings trained on large general text using GloVe and fastText. The word-level models use only object and place information and ignore the caption. Our proposed models achieve better performance than our previous TWE model (Sabir et al., 2018), which trained a word embedding (Mikolov et al., 2013) from scratch on the same task.

Human performance. To estimate an upper bound for the results, we picked 33 random pictures from the test dataset and had 16 human subjects try to select the right word among the top k = 5 candidates produced by the baseline OCR (i.e., text spotting) system. Our proposed model performance on the same images was 57%. Average human performance was 63% (highest 87%, lowest 39%).

Evaluation remarks. For evaluation, we used a less restrictive protocol than the standard one proposed by Wang et al. (2013) and adopted in most state-of-the-art benchmarks, which does not consider words with fewer than three characters. That protocol was introduced to overcome the false positives on short words that most current state-of-the-art systems struggle with, including our baseline. Instead, we consider all cases in the dataset, and words with fewer than three characters are also evaluated.

+------------------------+--------+--------+--------+-----+
| Model                  |  full  |  dict  |  list  |  k  |
+------------------------+--------+--------+--------+-----+
| Baselines                                               |
+------------------------+--------+--------+--------+-----+
| BL+Glove               |  22.0  |  62.5  |  75.8  |  7  |
| BL+C-LSTM              |  21.4  |  61.0  |  71.3  |  8  |
| BL+CNN-RNN             |  21.7  |  61.8  |  73.3  |  8  |
| BL+MVCNN               |  21.3  |  60.6  |  71.9  |  8  |
| BL+Attentive-LSTM      |  21.9  |  62.4  |  74.0  |  8  |
| BL+fasttext            |  21.9  |  62.2  |  75.4  |  7  |
| BL+InferSent           |  22.0  |  62.5  |  75.8  |  7  |
| BL+USE-T               |  22.0  |  62.5  |  78.3  |  6  |
| BL+W2V Trained         |  22.2  |  63.0  |  76.3  |  7  |
+------------------------+--------+--------+--------+-----+
| Proposed model (shallow model)                          |
+------------------------+--------+--------+--------+-----+
| BL+FDCLSTM             |  22.3  |  63.3  |  75.1  |  8  |
| BL+FDCLSTM+AT          |  22.4  |  63.7  |  75.5  |  8  |
| BL+FDCLSTM+lexicon     |  22.6  |  64.3  |  76.3  |  8  |
| BL+FDCLSTM+AT+lexicon  |  22.6  |  64.3  |  76.3  |  8  |
+------------------------+--------+--------+--------+-----+
| Pre-trained model                                       |
+------------------------+--------+--------+--------+-----+
| BL+BERT (feature)      |  21.7  |  61.6  |  74.6  |  7  |
| BL+BERT (fine-tuned)   |  22.7  |  64.6  |  76.6  |  8  |
+------------------------+--------+--------+--------+-----+
Table 1: Best results (%) after re-ranking with different re-rankers and different values of k for the k-best hypotheses extracted from the baseline output.

Figure 2. Examples of candidate re-ranking using object (c1), place (c2), and caption (c3) information. The top three examples are re-ranked based on the semantic relatedness score. The pair **delta-airliner**, which frequently co-occurs in the training data, is captured by the overlap layer. The pair **12-football** shows the relation between sport and numbers. Also, **program-school** has a much more distant relation, but our model is able to re-rank the most related words. The pairs **blue-runway** and **bus-private** show that the overlap layer can be effective when the visual context appears in more than one visual context classifier. Finally, **hotdog-a** and **by-ski** have no semantic correlation but are solved by the network thanks to the frequency count dictionary.

Discussion. Our proposed approach re-ranks candidate words based on their semantic correlation score with the visual context from the image. However, there are some cases where there is no direct semantic correlation, or no relation at all, between the visual context and the candidate word. Thus, we proposed the overlap layer to address this limitation by learning correlations from the training dataset. One example, shown in Figure 2 (above), is the company name Delta and the visual context airliner. Also, adding the unigram frequency (lexicon) helps to filter out short words or false positives, as shown in the examples a-hotdog and by-skiing.

Limitation. The limitation of this approach is that it depends on the baseline softmax output to re-rank the most related word. In particular, the semantic relatedness score suppresses the unrelated words and boosts the probability of the most related word by simple multiplication. Also, since text in images is not always related to its environment (e.g., a commercial ad for a popular soft drink may appear almost anywhere), there is only a fraction of cases that this approach may help solve. Given its low cost, however, it may be useful for domain adaptation of general post-OCR correction in the wild.
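The re-ranking step described in this paragraph can be sketched as an element-wise product between the baseline probabilities and the relatedness scores. The function name `rerank`, the default score of 0.0 for unscored words, and the numbers below are assumptions for illustration only.

```python
def rerank(hypotheses, relatedness):
    """Re-rank OCR hypotheses by baseline probability times relatedness.

    hypotheses:  list of (word, baseline_probability) pairs
    relatedness: dict mapping word -> relatedness score in [0, 1]
    Words without a relatedness score get 0.0 (an assumption of this sketch).
    """
    scored = [
        (word, prob * relatedness.get(word, 0.0))
        for word, prob in hypotheses
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical baseline output for an image whose context is an airliner.
hypotheses = [("della", 0.5), ("delta", 0.4), ("delia", 0.1)]
relatedness = {"della": 0.1, "delta": 0.9, "delia": 0.2}
ranked = rerank(hypotheses, relatedness)
print(ranked[0][0])  # delta
```

The sketch also makes the limitation visible: a word the baseline never proposes cannot be recovered, and a word with a very low baseline probability needs a large relatedness advantage to reach the top.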

Semantic Similarity Benchmark: A Case Study

What about semantic similarity? As a proof of concept, we also evaluate our model on the Semantic Textual Similarity (STS) benchmark from SemEval 2017, Track 5 (En-En). This dataset comprises sentence pairs from captions, news, and forums. To employ our binary re-ranker (i.e., similar or not similar), we convert the problem from a degree of similarity (from (1) not similar to (5) similar) into a binary problem (1 to 2.5 not similar, 2.5 to 5 similar). Although this architecture is designed to tackle a specific problem, as shown below (in Table 2) our model FDCLSTM without attention outperforms BERT on this semantic similarity task. The improvement is not statistically significant, as the test dataset is very small (test: 1379, dev: 1500), but the result supports our hypothesis that some tasks require a broader perspective: deciding whether two text fragments are related, rather than whether they are similar as in traditional semantic similarity.
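The conversion to a binary problem described above is a simple threshold. In this sketch, the name `binarize_sts` and the decision to treat a score of exactly 2.5 as "similar" are assumptions, since the text leaves the boundary case open.

```python
def binarize_sts(score, threshold=2.5):
    """Map a graded STS similarity score to a binary label.

    Returns 1 (similar) when score >= threshold, else 0 (not similar).
    Treating the boundary value as "similar" is an assumption of this sketch.
    """
    return 1 if score >= threshold else 0


# Hypothetical gold scores for four sentence pairs.
gold_scores = [1.0, 2.4, 2.5, 4.8]
labels = [binarize_sts(s) for s in gold_scores]
print(labels)  # [0, 0, 1, 1]
```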

+-------------------+--------+--------+
| Model             |  dev   |  test  |
+-------------------+--------+--------+
| BERT (fine-tuned) |  84.6  |  83.8  |
| FDCLSTM_AT        |  71.4  |  69.2  |
| FDCLSTM           |  89.5  |  84.4  |
+-------------------+--------+--------+
Table 2: Results on the SemEval 2017 semantic similarity benchmark.

Conclusion

In this work, we propose a simple deep learning architecture to learn semantic relatedness between word-to-word and word-to-sentence pairs, and show how it outperforms other semantic similarity scorers when used to re-rank candidate answers in the scenario of OCR in the wild (a.k.a. the text spotting problem). Note that this work can also be used to tackle similar problems, including lexical selection in machine translation and word sense disambiguation.