Similarity to Probability — Part I: Visual Word Embedding for OCR Post Correction

Ahmed Sabir
21 min readNov 3, 2022


In this post, I will revisit in more detail our previous work that uses human-inspired likelihood revision or similarity to probability [Blok et al. 2003] to re-rank or score any word or text fragment based on the semantic relation to an external context. We will use the most popular Semantic Similarity pre-trained model (e.g., w2v, GloVe, fasttext, etc.) to compute these relations.

Our code is publicly available on Github and for a quick start try Colab.

First, let’s take a step back and discuss the semantic similarity basic, (1) Similarity categories, and (2) Similarity Measure: and try to answer this question: how the similarity between two text fragments can be measured.

Semantic Similarity

Semantic similarity measure aims to capture how close are two text fragments (i.e., words, phrases, sentences) via a numerical score based on how close their respective meanings are. Semantic relatedness is a concept more general than similarity, assuming that two concepts may be related but not necessarily similar (e.g., car and traffic light are not similar, but they are clearly related). In computational linguistics, semantic relatedness is defined for any two concepts that have any kind of semantic relation [Budanitsky and Hirst 2001]. Semantic similarity is more specific and is a particular case of semantic relatedness [Turney and Pantel 2010].

Semantic similarity methods can be divided into two categories: corpus-based methods and knowledge-based methods [Mihalcea et al. 2006, Zhu and Iglesias 2016]. Corpus-based methods are a common approach where word relations are learned from a general corpus, while knowledge-based approaches require building and using extra resources like WordNet [Miller 1998], BabelNet [Navigli and Ponzetto 2012], FrameNet [Baker et al. 1998], etc.

Corpus-based methods. Corpus-based semantic similarity methods are based on word associations learned from large text collections. The two words are considered similar based on their most frequent surrounding contexts. The relation is learned based on word distribution in the corpus or word co-occurrences. The two possible metrics to compute semantic similarity are:

Knowledge-based methods. Knowledge-based semantic similarity methods use ontologies or other knowledge repositories (e.g., Wiki) to measure semantic similarity between words. The similarity measure between two words is determined by how close they are in the graph defined by the reference ontology or repository. WordNet [Miller 1998] is often used as a reference graph, since it is organized in synsets (sets of synonym words) which are connected via different semantic relations (hypernymy, meronymy, antonymy, etc.) to form a graph. Since each word may have different meanings, and thus, belong to more than one synset, knowledge-based similarity methods need to either disambiguate the right sense for the words (which is not always easy or possible) or check for the interpretation of each of the words that yield maximum similarity score [Sanchez et al. 2012].

Similarity Measure. Once we have words represented as vectors in a semantic space, semantic similarities can be derived from vector distance measurements. Next, we will describe these different distance measures that are able to measure the similarity distance between two dense vectors.

  • Minkowski Distance. Minkowski is a family of distance measures that includes two particular cases [Cha 2007]: (1) Euclidean distance and (2) Manhattan distance. The Minkowski distance is defined as follows:

where m is a real number, x_i and y_i are the two vectors in the dimension space.

(1) Manhattan Distance. The Manhattan distance is a particular case of the Minkowski when m=1.

(2) Euclidean Distance. The most commonly used distance for numerical data is the Euclidean distance. This is a particular case of the Minkowski when m=2. Although the Euclidean distance is used in many clustering algorithms [Xu and Wunsch 2005], it has a major drawback when comparing vectors that have a similar distance or have smaller attribute values.

  • Cosine Distance. The cosine similarity measure is the most commonly used for document similarity and word-sentence similarity. The cosine similarity is defined as:

where ∥y∥_2 is the Euclidean norm of vector ..y = (y_1, y_2, …, y_i), etc.

In this post, we only rely on the cosine measure to compute the distance between the two vectors in the semantic space.

Semantic Similarity based Text Scorer

The similarity measure that we described in the previous section is not suitable for word re-ranking. Therefore, to be able to re-rank words based on their probability, some strategy is needed to convert similarity to probability.

In this section, we discuss the theories behind similarity to probability conversion. Blok et al. (2003) introduce a conditional probability model that assumes the preliminary probability result is updated or revised to the degree that the hypothesis proof warrants. The range of revision is based on the informativeness of the argument and its degree of similarity. That is, the similarity to probability conversion can be defined in terms of belief revision. Belief revision is a process of forming a belief by bringing into account a new piece of information.

Let us consider the following statement:

1. Tigers can bite through wire, therefore Jaguars can bite through wire.

2. Kittens can bite through wire, therefore Jaguars can bite through wire.

Consider the observations (1) and (2). In the first case, the statement seems logical because it matches our expectations (jaguars are similar to tigers, so we expect them to be able to do similar things), so the statement is consistent with our previous belief, and there is no need to revise it. In the second case, the statement is surprising because our prior belief is that kittens are not so similar to jaguars, and thus, not so strong. But if we assume the veracity of the statement, then we need to revise and update our prior belief about kitten strength.

Blok et al. (2003) formalize belief as probabilities and revised belief as conditional probabilities and provide a framework to compute them based on the similarities of the involved objects. According to the authors, belief revision should be proportional to the similarity of the involved objects (i.e., in the example, the statement about kittens and jaguars would cause a stronger belief revision than e.g., the same statement involving pigeons and jaguars because they are less similar).

In our case, we use the same rationale and the same formulas to convert similarity (or relatedness) scores into probabilities suitable for reranking.

SimProb Model

To convert the similarity to probability, or SimProb model, we need three parameters:

Hypothesis: prior probabilities.
Informativeness: conclusion events.
Similarities: measure the relatedness between involved categories.

The main goal is to predict a conditional probability of statements, given one or more others. In order to predict the conditional probability of the argument’s conclusion, given its premise or hypothesis, we will need only the prior probabilities of the statements, as well as the similarities between the involved categories (e.g., kittens and tigers).

Single-Premise Formulation of SimProb

The conditional probability P(Qc|Qa) is expressed in terms of the prior probability of the conclusion statement P(Qc), the prior probability of the premise statement P(Qa), and the similarity between the conclusion and the premise categories sim(a, c).





There are two factors to determine the hypothesis probability revision:

  • The sufficient relatedness to the category –as sim(a, c) goes to 0, α goes to 1, and thus P(Qc|Qa) = P(Qc), i.e., no revision takes place (no changes in the original belief), while as sim(a,c) goes to 1, α goes to 0, and the hypothesis probability P(Qc) is revised and raised closer to closer to 1.
  • The informativeness of the new information 1 − P(Qa) as P(Qa) approaches 1 and in consequence, is less informative, α also goes to 1, since there is no new information, and thus no revision is needed either.

Application: Visual Word Embedding

In this post, we only demonstrate this approach with Corpus-based methods via GloVe. However, please refer to our Github for other methods such as Knowledge-based methods via Sense Embedding.

Word embeddings have become one of the popular representations of vocabulary semantics. They model word context in a semantic space, in such a way that words with similar contexts are close to each other in the semantic space. In order to make our approach universal, we based our work on general word embeddings, such as word2vec [Mikolov et al. 2013b], GloVe [Pennington et al. 2014], and fasttext [Joulin et al. 2016].

In this post, we will discuss an application of word embedding based on semantic relatedness for OCR post-correction in text in the wild. This approach uses SimProb Model to convert the similarity to probability via Visual Belief Revision. Note that, for the sake of simplicity we will use only Single-Premise (only one visual or context at a time). Next, we describe the task and the implementation of SimProb in this scenario.

Task: Visual Word Embedding for OCR Post Correction

Problem formulation. Given an input image with a text on it, we assume that a baseline approach provides a ranked list of potential words w_1…w_k. Our goal is to leverage the context information of the image to re-rank the list, moving most likely words up in the list, as well as moving false-positive candidates down. Figure 1 below shows an overview of the proposed methods, where the SimProb Model is the Semantic Relatedness Measure.

Figure 1. System Overview. We propose a simple post-process methodology that can be applied to the output of a state-of-the-art OCR in the wild system. Our system re-ranks the candidate words using their semantic relatedness with contextual image information such as objects and scenes.

To use word embedding for visual semantic relatedness as post-correction, we need the correct input as word-to-visual information. Therefore, a pre-process setup is required before computing the SimProb Model. We call the SimProb block (pink block in Figure 1) the Semantic Relatedness Measure Using Word Embedding (SWE).

We summarised the process in 3 steps:

First, we use a threshold δ to eliminate lower probabilities from the visual classifier (objects, scenes).

Secondly, following Lee et al. (2018) who show the benefits of finding the proper visual context of each region in a caption guided by text using semantic similarity, we compute the embedding-based similarity of each visual with the candidate word. The idea is to match the most semantically related visual context with the candidate word.

Thirdly, we take the max-highest similarity score and the most semantically related, to the candidate word c_max as:

where sim(w, c_i) is the cosine similarity, w is the candidate word, c_i is the set of visual objects found in the target image, and δ is the threshold to eliminate objects and scenes with lower probabilities.

Finally, following [Blok et al. 2003] with confirmation assumption P(w|c) ≥ p(w), we compute the conditional probability from similarity as introduced in the SimProb (above).

Where the SimProb in this scenario is a conditional probability, which assumes that the caption preliminary probability (hypothesis) P(w) is revised to the degree approved by the semantic similarity with visual context sim(w, c). The final output text hypothesis w for a given visual context c is written as:

Hypothesis: (Blue Block in Figure 1)

Informativeness: (Green Block in Figure 1)

Similarities: (Pink Block in Figure 1)

where P(w) is the hypothesis probability (candidate word from the baseline) and P(c) is the probability of the evidence that causes hypothesis probability revision (visual context from the image).

Hypothesis: In our case, we use the unigram language model to initialize the hypothesis probability P(w) based on a common observation (general text large corpus).

Informativeness: We obtain P(c) from the visual context object/place detectors. Table 1 (below) shows examples of hypothesis probability revision based on the visual context. For instance, the probability of the word electric is raised by the presence of streetcar in the image.

Similarities: We use out-of-the-box pre-trained word embedding models (see Table 2) to compute the cosine distance between the hypothesis and its related visual environmental context.

Next, let me explain how the belief revision works in our case:

Relatedness between hypothesis and additional information. As the similarity/relatedness between the text and the visual context sim(w,c) gets closer to 0 (not related) α gets closer to 1, and thus P(w|c) = P(w) i.e., no revision takes place (no changes in the original belief). Thus, no reranking will take place if the context is not related to the hypothesis. But, as sim(w,c) gets closer to 1 (higher hypothesis-context relatedness), α gets closer to 0, so the hypothesis probability is revised and raised closer to 1.

Informativeness of the additional information. As the probability of the visual context P(c) approaches 1 and in consequence is less informative (very frequent objects have no discriminative power since they may co-occur with anything) 1−P(c) approaches zero, causing α to get closer to 1, and thus, a smaller revision of P(w).

| Similarity to Probability |
| word | context | P(w)○ | P(c) | sim(w,c) | P(w|c)●|
|----------+-----------+--------+------+----------+-------- |
| plate | moving | 0.0029 | 0.53 | 0.30 | 0.012 |
| electric | streetcar | 0.0002 | 0.48 | 0.24 | 0.001 |
| computer | street | 0.0023 | 0.46 | 0.19 | 0.007 |
| way | downtown | 0.0129 | 0.35 | 0.18 | 0.094 |
| 12 | football | 0.0021 | 0.68 | 0.17 | 0.041 |
| bike | highway | 0.0005 | 0.40 | 0.44 | 0.014 |
| walk | street | 0.0016 | 0.30 | 0.53 | 0.061 |
Table 1. Example of visual context-based hypotheses revision. ○ and ● show the hypothesis probability before and after the revision, respectively. (better read this in the PC version)


The textual hypotheses extracted above will be re-ranked based on the context information. This will be parsed by two main subsystems that, given the input image, will provide: the label of the detected objects and location identifier of the entire image. We next describe each of these modules.

Object type. The first contextual information we consider is the objects appearing in the image. This is obtained from the output of visual classifiers trained with 1000 imagenet object classes. We consider the top-5 objects with the highest probabilities. Table 2 shows a sample of objects obtained from COCO-text images, along with their frequency. In our experiments, we evaluated two out-of-the-box pre-trained object classifiers: ResNet152 and Inception-Resnet-v2.

Scene Classifier [Bolei et al. 2017]. We also considered a scene classifier to extract image location (i.e., scene in the image) information. We use a pre-trained scene classifier able to recognize 365 different scene types. The original model is based on Places365-Standard a deep convolution network trained on 1.8 million images from 365 scene categories. This same work also proposed an improved model built upon a ResNet architecture, which is the one we will consider in this work.

| Object categories extracted from COCO-text dataset |
| word | # | word | # | word | # |
| bus | 7902 | stop | 2345 | street | 13849 |
| ball | 9744 | airliner | 4252 | park | 2663 |
| passenger | 3176 | electric | 2085 | minibus | 2066 |
| analog | 1859 | motor | 1561 | pizza | 1397 |
| cab | 1312 | restaurant | 1213 | desk | 1226 |
| plate | 1061 | tennis | 680 | baseball | 545 |
Table 2. Count of the object labels in the COCO-text dataset that we use as context information. (better read this in the PC version)

Test Dataset. We evaluate the performance of the proposed approach on the noisy COCO-text [Veit et al., 2016]. This dataset is based on Microsoft COCO [Lin et al., 2014] (Common Objects in Context), which consists of 63,686 images, and 173,589 text instances (annotations of the images). This dataset does not include any visual context information, thus we used out-of-the-box (1) object [He et al., 2016] and (2) place [Zhou et al., 2014] classifiers and tuned a caption generator [Vinyals et al., 2015] on the same dataset to extract contextual information from each image. (Datasat Github)

Experiments and Results

In the following, we use different similarity or relatedness scorers to reorder the k-best hypothesis produced by an off-the-shelf state-of-the-art OCR system. We experimented with extracting k-best hypotheses for k = 1…10.

We use two pre-trained deep models: a CNN [Jaderberg et al., 2016] and an LSTM [Ghosh et al., 2017] as baselines (BL) to extract the initial list of word hypotheses. The CNN baseline uses a closed lexicon; therefore, it cannot recognize any word outside its 90K-word dictionary. Table 1 (below) presents four different accuracy metrics for this case: 1) full columns correspond to the accuracy of the whole dataset. 2) dict columns correspond to the accuracy over the cases where the target word is among the 90K-words of the CNN dictionary (which correspond to 43.3% of the whole dataset. 3) list columns report the accuracy over the cases where the right word was among the k-best produced by the baseline. 4) Mean Reciprocal Rank (MRR), where rank k is the position of the correct answer in the hypotheses list proposed by the baseline. However, for sake of the clarity, we only discuss the CNN baseline.

Since these Baseline BLs need to be fed with images of cropped words (i.e., image regions containing only text), when evaluating on the ICDAR-2017-Task 3 dataset we use the ground truth bounding boxes of the words. Also, note that we are not evaluating a whole end-to-end text spotting system (detection + recognition), but only the influence of adding external natural language understanding knowledge to the recognition phase. Therefore, each baseline takes a text image bounding box (B) as input and produces k candidate words w_1…w_k plus a probability for each prediction P(w_K|B)= 1,…, k.

Next, let’s introduce the best model in Table 2 in more detail:

GloVe. Word embedding-based algorithm that overcomes the limitation of the original Word2Vec. The main idea of GloVe is that it can derive the semantic relationships between words from a co-occurrence matrix. The advantage of GloVe over Word2Vec is that it does not rely on local information, such as the local context of words, but it incorporates global co-occurrence statistics. For that reason, GloVe can derive better semantic relationships than other models.

Relational Word Embeddings (RWE). RWE is an enhanced version of Word2Vec and fastext that encodes complementary relational knowledge to the standard word embedding in the semantic space. This enhanced embedding is still learned from pure co-occurrence statistics and does not rely on any external knowledge. In this work, we employ two pre-trained models based on Word2Vec and fastext.

Training-data Word Embeddings (TWE). Semantic Relatedness with Word Embedding. A word embedding using Word2Vec is trained on our dataset, so it can learn associations between candidate words and their visual context. In particular, we trained a Skip-gram model that works well with small amounts of training data and is able to represent low-frequency words.

| Model | full | dict | list | k | MRR |
| BaseLine CNN |full: 19.7 (30.1*) dict: 56.0 |
| BL+ GloVe | 22.0 | 62.5 | 75.8 | 7 | 44.5 |
| BL+word2vec | 21.8 | 62.1 | 80.0 | 5 | 44.3 |
| BL+fasttext | 21.9 | 62.2 | 75.4 | 7 | 44.6 |
| BL+RWE-w2v | 21.9 | 62.4 | 75.6 | 7 | 44.5 |
| BL+RWE-fasttext | 22.0 | 62.5 | 75.8 | 7 | 44.6 |
| BL+TWE(trained) | 22.2 | 63.0 | 76.3 | 7 | 44.7 |
Table 2. Comparison results of different word embedding re-ranking
scenario. The * represents the upper bound of the baseline.
(better read this in the PC version)

Table 4 (below) shows some scenarios where the advantage of relying on a place classifier yielded a better understanding of the context of the scene.

| Examples of word level information based re-ranking |
| Word | Object | Place | ranking-object | ranking-place |
| way | street | hospital | 0.1771 | 0.3425 |
| canada | passenger | bus | 0.0004 | 0.0001 |
| member | ballplayer | baseball | 0.0006 | 0.0009 |
| bank | traffic | motel | 0.0112 | 0.1079 |
| dunkin | coffee | bakery | 0.0250 | 0.0142 |
Table 4. Examples of P(word|object/scene) for each re-ranker. Semantic Relatedness using general Word Embedding capture more relevant information to improve the baseline via object or scene information. (better read this in the PC version)

Limitation. The limitation of this approach is that it depends on the baseline softmax output to re-rank the most related word. In particular, the semantic relatedness score suppresses unrelated words and boosts the most related word probability by simple dot product multiplication. Also, as shown in Tables 5, 6, and 7 (see below) there are some cases where there is no semantic relation, (1) short word and non-word, (2) no semantic direct relation, and (3) low-frequency count word (seen text in the wild). In summary, text in images is not always related to its visual environment.

Another limitation of this approach is that when the language model re-ranker is strong, the visual context re-ranker is unable to re-rank the correct candidate word. For instance, the word ohh has a large frequency count in general text. This problem can be tackled by adjusting the weight of uncommon short words in the language model.

| Semantic similarity failure cases |
| Short word or non-word |
| word | visual | word | visual |
| e | umbrella | n900eb | airliner |
| 10100 | parking | zers | racket |
| jw2 | steam | rtbf | laptop |
Table 5. Example of some cases that our model struggles
(short word and non-word) to learn semantic relatedness.
(better read this in the PC version)
| Semantic similarity failure case |
| No semantic relation |
| word | visual | word | visual |
| taxi | cell | hote | pretzel |
| winery | warplane | sterilized | toilet |
Table 6. Example of some cases that our model struggles
(cases with No semantic relation) to learn semantic
relatedness. (better read this in the PC version)
| Semantic similarity failure cases |
| Commercial brand (low freq count) |
| word | visual | word | visual |
| evergreen | military | lacoste | racket |
| reebok | racket | linksys | modem |
Table 7. Example of some cases that our model struggles
(cases of commercial brand) to learn semantic
relatedness. (better read this in the PC version)

Knowledge-based methods

In the main post (above), we introduced one of the most common corpus-based methods word embedding. Here, we show our experiment with other types of embeddings, knowledge-based embedding. One of the limitations of the word embedding systems is that they compute a single representation for each word independently from the context in which they appear, context insensitive. In this part, we describe several recent approaches that attempt to enhance word embedding by adding external knowledge.

Knowledge-based semantic similarity methods use ontology to measure semantic similarity between words. The similarity measure between the two words is determined by how close to the given ontology is. External lexical database, such as WordNet, is used to extract the ontology, which is organized through synsets. Knowledge-based Methods use directly the synsets rather than the word, and therefore they need to convert into word similarity by taking the max over all the senses.

Recently, most works in knowledge-based methods are based on a neural network that embeds words into vector space, low-dimensional vector spaces, from text corpora –i.e., word embedding. Iacobaccet al. (2015) introduce a knowledge-based word2vec, that uses an external lexical database such as BabelNet they called it sense embedding. Sense-Embedding (SensEmd) is a word embedding that is injected with an external lexical database such as BabelNet. Next, we will discuss some of the models in more detail.

Senses and Words to Vectors (Sw2v). Sw2v addresses this issue (i.e., polysemy) by proposing a model that learns words with different meanings (sense embeddings) jointly in the same space. In particular, they use external knowledge BabelNet to extract multiple senses for each word. BabelNet is the largest semantic network with a multilingual encyclopedic dictionary, comprising approximately 16 million entries for named entities linked by semantic relations.

LSTMEmbed [Iacobacci et al. 2019]. LSTMEmbed is the most recent, non-BERT-based, model in senses word embedding, similar to Sw2v. However, this model utilizes BiLSTM based architecture to learn the word and senses representations from annotated corpora. The advantage of LSTMEmbed over Sw2v is that they take word order into account during the learning process. However, in our case, word ordering is not essential as we are interested in word-to-context only, and thus LSTMmbed performs poorly in our evaluation. We use the same approach that mention above to evaluate our model with 200-D senses word embedding that is trained on the English portion of BabelWiki and English Wikipedia (3 million unique tokens).

Table 5 (below) shows a comparison result between count-based and knowledge-based word embedding in our proposed task. The result indicates that count-based embedding leverage a better similarity score in the OCR post-correction scenario as shown in Table 6 examples.

| Model | full | dict | list | k | MRR |
| Baseline |full: 19.7 (30.1*) dict: 56.0 |
Corpus-based word embedding
| BL+GloVe | 22.0 | 62.5 | 75.8 | 7 | 44.5 |
| BL+word2vec | 21.8 | 62.1 | 80.0 | 5 | 44.3 |
| BL+fasttext | 21.9 | 62.2 | 75.4 | 7 | 44.6 |
| BL+RWE-w2v | 21.9 | 62.4 | 75.6 | 7 | 44.5 |
| BL+RWE-fasttext | 22.0 | 62.5 | 75.8 | 7 | 44.6 |
| BL+TWE(trained) | 22.2 | 63.0 | 76.3 | 7 | 44.7 |
Knowledge based word embedding
| BL+SensEmd | 20.9 | 59.4 | 71.9 | 7 | 42.8 |
| BL+w2v+Sense | 21.9 | 62.2 | 75.4 | 7 | 44.4 |
| BL+SEW | 21.9 | 62.4 | 86.2 | 3 | 44.5 |
| BL+Nasari | 21.8 | 62.1 | 85.7 | 3 | 44.2 |
| BL+Sw2v | 21.8 | 62.1 | 75.2 | 7 | 44.3 |
| BL+LSTMmebed | 21.6 | 60.8 | 73.7 | 7 | 44.0 |
Table 5. Best results after re-ranking using knowledge-based embedding re-rankers. The * represents the upper bound of the baseline. (better read this in the PC version)

Table 6. (below) shows that having a sense of the word does not contribute to the final accuracy with a computational cost.

| Examples of sense embedding |
| Word | visual | sense | sim(w,c) | sim(sense,c)|
| under | kimono | robe | 0.14 | 0.17 |
| electric | streetcar | vehicle | 0.24 | 0.15 |
| design | bathroom | bathroom | 0.33 | 0.10 |
| station | street | street | 0.21 | 0.23 |
| year | residential | district | 0.24 | 0.09 |
| riding | ski | sport | 0.40 | 0.30 |
Table 6. Comparison result between knowledge-based embedding and count-based embedding. (better read this in the PC version)


In this work, we demonstrate that the Belief Revision approach that works well with human judgment can be applied to Image-based OCR Post Correction by employing word embedding. We demonstrate the benefits of the approach by showing that an OCR model result is improved via simple language grounding with visual context information.

In the next post, I will discuss our most recent work October 2022 Sentence Level Belief Revision caption based visual beam re-ranker our code is publicly available on Github. Please refer to our huggingface demo (visual belief) and huggingface demo (caption-to-visual-context re-ranker) for a fast start.


Please refer to the original paper for the full references.

Table: Table Editor

Figure: OmniGraffle

Formula: LaTeX-to-PNG


(1) Word Level Semantic Relatedness (this post)

(2) Sentence Level Semantic Relatedness (an end-to-end neural approach)

(3) Datasat github



Ahmed Sabir

ML researcher interested in language & vision research. I’m using this Medium blog to write my learning notes.