Word to Sentence Visual Semantic Similarity for Caption Generation: Lesson learned

In this blog post, I will share some insights and lessons learned from a recent research idea of ours that should work in theory (i.e., BERT+GloVe) but did not work in practice in our scenario.

Recent state-of-the-art progress in pre-trained vision-and-language and image captioning models relies heavily on long training on abundant data. These accuracy improvements depend on many training iterations and on the availability of computational resources (e.g., GPUs, TPUs), which costs time and energy (Strubell et al., 2019). In some cases, the improvement after re-training is less than 1 point on the benchmark dataset. In this work, we introduce an approach that can be applied to any caption system as a post-processing method that only needs to be trained once. In particular, we propose to improve caption generation systems by choosing the output most closely related to the image rather than the most likely output produced by the model. Our model revises the beam search output of the language generator from a visual context perspective.

First, let me explain why this problem is important, with some background, and related work.

Image Captioning System. Automatic captioning is a fundamental task that incorporates vision and language. The task can be tackled in two stages: first, image-visual information extraction, and then linguistic description generation. Most models couple the relations between visual and linguistic information via a Convolutional Neural Network (CNN) to encode the input image and a Long Short-Term Memory (LSTM) network for language generation (Vinyals et al., 2015; Anderson et al., 2018). Recently, self-attention has been used to learn these relations via Transformers (Huang et al., 2019; Cornia et al., 2020) or Transformer-based models like Vision and Language BERT (Lu et al., 2020). These systems show promising results on benchmark datasets such as COCO (Lin et al., 2014). However, the lexical diversity of the generated caption remains a relatively unexplored research problem. Lexical diversity refers to how accurate the generated description is for a given image. An accurate caption should provide details about specific and relevant aspects of the image. Caption lexical diversity can be divided into three levels: word level (different words), syntactic level (word order), and semantic level (relevant concepts) (Wang and Chan, 2019). In this work, we approach word-level diversity by learning the semantic correlation between the caption and its visual context, as shown in Figure 1 (below), where the visual information from the image is used to learn the semantic relation from the caption at the word and sentence level.

Visual Context Image Captioning System. Modern sophisticated image captioning systems focus heavily on visual grounding to capture real-world scenarios. Early work (Fang et al., 2015) built a visual detector to guide and re-rank image captions with a global similarity. Wang et al. (2018) investigate the informativeness of object information (e.g., object frequency) in end-to-end caption generation. Cornia et al. (2019) propose controlled caption language grounding through visual regions from the image. Chen et al. (2020) rely on abstract scene concepts (object, relationship, and attribute) grounded in the image to learn accurate semantics without labels for image captioning. More recently, Zhang et al. (2021a) incorporate different concepts such as scene graph, object, and attribute to learn correct linguistic and visual relevance for better caption language grounding.

We are inspired by these works: Fang et al. (2015), who re-rank captions using visual information; Wang, Madhyastha, and Specia (2018), Cornia, Baraldi, and Cucchiara (2019), and Chen et al. (2020), who explore the benefit of object information in image captioning; Gupta et al. (2020), who use language modeling to extract contextualized word representations; and Zhang et al. (2021a), who exploit semantic coherence in caption language grounding. Building on them, we propose a visual grounding-based object scorer that re-ranks the most closely related caption using both static and contextualized semantic similarity.

Beam search caption extraction — Baselines. We employ the three most common architectures for caption generation to extract the top beam search. The first baseline is based on the standard shallow CNN-LSTM model (Vinyals et al., 2015). The second, VilBERT (Lu et al., 2020), is fine-tuned on a total of 12 different vision and language datasets such as caption image retrieval. Finally, the third baseline is a specialized Transformer based caption generator (Cornia et al., 2020).

Problem Formulation. Beam search is the dominant method for approximate decoding in structured prediction tasks such as machine translation, speech recognition, and image captioning. The larger beam size allows the model to perform a better exploration of the search space compared to greedy decoding. Our goal is to leverage the visual context information of the image to re-rank the candidate sequences obtained through the beam search, thereby moving the most visually relevant candidate up in the list, as well as moving incorrect candidates down.
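To make the re-ranking goal concrete, here is a minimal sketch, assuming we already have some function that scores how well a caption matches the visual context. The function name `rerank_beam` and the toy scores are hypothetical and only illustrate the idea of reordering beam candidates; they are not our actual scorer.

```python
from typing import Callable, List, Tuple


def rerank_beam(
    candidates: List[str],
    visual_score: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Reorder beam candidates by a visual relevance score, highest first.

    Python's sort is stable, so candidates with equal scores keep
    their original beam order.
    """
    scored = [(caption, visual_score(caption)) for caption in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Candidates in their original beam order (most likely first).
beam = [
    "a plate of food on a table",
    "a plate of food and a drink on a table",
]

# Hypothetical visual relevance scores, for illustration only.
toy_scores = {beam[0]: 0.41, beam[1]: 0.57}

reranked = rerank_beam(beam, lambda caption: toy_scores[caption])
for caption, score in reranked:
    print(f"{score:.2f}  {caption}")
```

The most visually relevant candidate moves to the top, while the rest of the beam keeps its relative order whenever scores tie.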

Figure 1. An overview of our visual semantic re-ranking. We employ the visual context from the image at both the word and sentence level to re-rank the caption most closely related to its visual context. An example from the caption Transformer (Cornia et al., 2020) shows how the visual re-ranker (Visual Beam) uses the semantic relation to re-rank the most descriptive caption.

Word level similarity. To learn the semantic relation between a caption and its visual context at the word level, we first employ a bidirectional LSTM-based CopyRNN keyphrase extractor (Meng et al., 2017) to extract keyphrases from the sentence as context. The model is trained on two combined pre-processed datasets: (1) wikidump (i.e., keyword, short sentence) and (2) SemEval 2017 Task 10 (keyphrases from scientific publications) (Augenstein et al., 2017). Second, GloVe is used to compute the cosine similarity between the visual context and its related context. For example, from “a woman in a red dress and a black skirt walks down a sidewalk” the model extracts dress and walks, which are the highlight keywords of the caption.
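The word-level step can be sketched as follows. This is only an illustration under two assumptions: the 3-dimensional vectors are made-up stand-ins for real GloVe embeddings, and averaging the keyphrase vectors before taking the cosine is one plausible way to aggregate them, not necessarily the exact one we used.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_vector(words: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the embeddings of the words found in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy embeddings; real GloVe vectors have 50 to 300 dimensions.
TOY_VECTORS = {
    "dress": [0.9, 0.1, 0.2],
    "walks": [0.1, 0.8, 0.3],
    "gown": [0.8, 0.2, 0.1],
}

keyphrases = ["dress", "walks"]  # extracted from the caption
visual_context = "gown"          # hypothetical label from the visual classifier

similarity = cosine(TOY_VECTORS[visual_context], mean_vector(keyphrases, TOY_VECTORS))
print(f"word-level similarity: {similarity:.3f}")
```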

Sentence level similarity. We fine-tune the BERT base model to learn the visual context information. The model learns a dictionary-like word-to-sentence relation. We use the visual data as context for the sentence via cosine distance.

  • BERT (Devlin et al., 2019). BERT achieves remarkable results on many sentence-level tasks, especially the textual semantic similarity task (STS-B) (Cer et al., 2017). Therefore, we fine-tuned BERT_base on the training dataset (textual information, 460k captions: 373k for training and 87k for validation), i.e., (visual, caption, label [semantically related or not related]), with a binary classification cross-entropy loss function [0,1], where the target is the semantic similarity between the visual context and the candidate caption.
  • Sentence RoBERTa (Reimers and Gurevych, 2019). RoBERTa is an improved version of BERT, and since RoBERTa Large is more robust, we rely on pre-trained SentenceRoBERTa-sts as it yields a better cosine score.

Fusion Similarity Expert. Product of experts (PoE) (Hinton 1999) combines the expertise of each expert (model) in a collaborative manner. It allows each expert to specialize in analyzing one particular aspect of the problem and to establish a judgment based on that aspect. Inspired by PoE, we combine the two experts, word level and sentence level, as a late fusion, as shown in Figure 1. PoE takes advantage of each expert and can produce much sharper distributions than a single model. The PoE is computed as follows:

$$P(w \mid \theta_1, \ldots, \theta_n) = \frac{\prod_m p_m(w \mid \theta_m)}{\sum_c \prod_m p_m(c \mid \theta_m)}$$

where w is a data vector in the discrete space, θ_m are the parameters of each model m, p_m(w|θ_m) is the probability of w under model m, and c indexes all possible vectors in the data space.

Since this approach is interested in retrieving the most related caption with the highest probability after re-ranking, the normalization step is not needed:

$$\arg\max_w \prod_m p_m(w \mid \theta_m)$$

where p_m(w|θ_m) are the probabilities assigned by each expert to the candidate word or sentence w.
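The fusion step can be sketched in a few lines. This is a minimal illustration, assuming each expert returns one probability per beam candidate; the function names and the numbers are hypothetical. Note that the sketch normalizes over the candidate list only, whereas the PoE definition normalizes over the whole data space. Either way, dividing by a positive constant does not change which candidate wins, which is why the normalization can be skipped.

```python
import math
from typing import List


def poe_scores(expert_probs: List[List[float]]) -> List[float]:
    """Unnormalized product-of-experts score for each candidate.

    expert_probs[m][i] is the probability expert m assigns to candidate i.
    """
    n_candidates = len(expert_probs[0])
    return [
        math.prod(probs[i] for probs in expert_probs)
        for i in range(n_candidates)
    ]


def normalize(scores: List[float]) -> List[float]:
    """Rescale scores so they sum to one over the candidate list."""
    total = sum(scores)
    return [s / total for s in scores]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical probabilities for three beam candidates.
word_expert = [0.30, 0.60, 0.45]      # e.g. GloVe-based similarity
sentence_expert = [0.70, 0.50, 0.20]  # e.g. BERT-based similarity

scores = poe_scores([word_expert, sentence_expert])
best = argmax(scores)

# The winner is the same with or without normalization.
assert best == argmax(normalize(scores))
print("selected candidate index:", best)
```

A candidate needs a reasonably high score from every expert to win, because a single low probability pulls the whole product down.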

We evaluate the proposed approach on two datasets of different sizes. The idea is to evaluate our method on the most common caption datasets in two scenarios: (1) a shallow CNN-LSTM model (i.e., less data), and (2) a system trained on a huge amount of data (i.e., Transformer).

  • Flickr 8K (Rashtchian et al., 2010). The dataset contains 8K images, and each image has five human-annotated captions. We use this data to train the shallow model (6270 train/1730 test).
  • COCO (Lin et al., 2014). It contains around 120K images, and each image is annotated with five different human-written captions. We use the most common split, provided by Karpathy and Fei-Fei (2015), where 5k images are used for testing, 5k for validation, and the rest for training the Transformer baseline.

Visual Context Dataset. Although there are many public datasets for captioning, they contain no textual visual information such as the objects in the image. We enrich the two datasets mentioned above with textual visual context information. In particular, to automate visual context generation without the need for human labeling, we use ResNet152 (He et al., 2016) to extract the top-k (k = 3) visual context labels for each image in the caption dataset.
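As a rough sketch of the top-k selection, assume the classifier output is already available as a mapping from label to probability. The labels and probabilities below are hypothetical, and running the actual ResNet152 model is out of scope here.

```python
from typing import Dict, List, Tuple


def top_k_visual_context(
    class_probs: Dict[str, float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable labels as (label, probability) pairs."""
    ranked = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output for one image.
probs = {"airliner": 0.62, "wing": 0.21, "warplane": 0.09, "cap": 0.01}

visual_context = top_k_visual_context(probs, k=3)
print(visual_context)
```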

Evaluation Metric. We use the official COCO offline evaluation suite, which produces several widely used caption quality metrics: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and BERTscore (B-S) (Zhang et al., 2020).

We use visual semantic information to re-rank candidate captions produced by out-of-the-box state-of-the-art caption generators. We extract top-20 beam search candidate captions from three different architectures (1) standard CNN+LSTM model (Vinyals et al., 2015), (2) a pre-trained language and vision model VilBERT (Lu et al., 2020), fine-tuned on a total of 12 different vision and language datasets such as caption image retrieval, and (3) a specialized caption-based Transformer (Cornia et al., 2020).

| Baseline Result with/without Semantic Re-ranking |
| Model | B-1 | B-4 | M | R | C | BERTscore |
|--------------+-------+-------+-------+-------+-------+-----------|
| Shallow model Show and Tell (Vinyals et al., 2015) ♠ |
| BeamS | 0.331 | 0.035 | 0.093 | 0.270 | 0.035 | 0.8871 |
| BERT+GloVe top-k Visual 1 and 2 |
| +VR_V1 B-G | 0.330 | 0.035 | 0.095 | 0.273 | 0.036 | 0.8855 |
| +VR_V2 B-G | 0.320 | 0.037 | 0.099 | 0.277 | 0.041 | 0.8850 |
| RoBERT+GloVe (SBERT) top-k Visual 1 and 2 |
| +VR_V1 R+G | 0.313 | 0.037 | 0.101 | 0.273 | 0.036 | 0.8839 |
| +VR_V2 R+G | 0.330 | 0.035 | 0.095 | 0.273 | 0.036 | 0.8869 |
|--------------+-------+-------+-------+-------+-------+-----------|
| Pre-trained model VilBERT (Lu et al., 2020) ♣ |
| BeamS | 0.739 | 0.336 | 0.271 | 0.543 | 1.027 | 0.9363 |
| BERT+GloVe top-k Visual 1 and 2 |
| +VR_V1 B-G | 0.739 | 0.334 | 0.273 | 0.544 | 1.034 | 0.9365 |
| +VR_V2 B-G | 0.740 | 0.334 | 0.273 | 0.545 | 1.034 | 0.9365 |
| RoBERT+GloVe (SBERT) top-k Visual 1 and 2 |
| +VR_V1 R+G | 0.738 | 0.335 | 0.273 | 0.544 | 1.036 | 0.9365 |
| +VR_V2 R+G | 0.740 | 0.338 | 0.272 | 0.545 | 1.040 | 0.9366 |
|--------------+-------+-------+-------+-------+-------+-----------|
| Specialized model Transformer (Cornia et al., 2020) ♣ |
| BeamS | 0.780 | 0.374 | 0.278 | 0.569 | 1.153 | 0.9399 |
| BERT+GloVe top-k Visual 1 and 2 |
| +VR_V1 B+G | 0.780 | 0.371 | 0.278 | 0.567 | 1.149 | 0.9398 |
| +VR_V2 B+G | 0.780 | 0.371 | 0.278 | 0.568 | 1.150 | 0.9399 |
| RoBERT+GloVe (SBERT) top-k Visual 1 and 2 |
| +VR_V2 R+G | 0.779 | 0.370 | 0.277 | 0.567 | 1.145 | 0.9395 |
| +VR_V2 R+G | 0.779 | 0.370 | 0.277 | 0.567 | 1.145 | 0.9395 |
Table 1. Performance of the compared baselines on the Karpathy test split ♣ (for the Transformer baselines) and Flickr 8K ♠ (for the Show and Tell CNN-LSTM baseline) with/without visual semantic re-ranking. At inference, we use only the top-2 visual context objects, one at a time.
(better read this in the PC version)

Experiments applying different rerankers to each base system are shown in Table 1 (above). The tested rerankers are: (1) VR_BERT+GloVe, which uses BERT and GloVe similarity between the candidate caption and the visual context (top-k V_1 and V_2 during the inference) to obtain the reranked score. (2) VR_RoBERTa+GloVe, which carries out the same procedure using similarity produced by Sentence RoBERTa.

Our re-ranker produced mixed results as the model struggles when the beam search is less diverse. The model is therefore not able to select the most closely related caption to its environmental context as shown in Figure 2/2_zoom (below), which is a visualization of the final visual beam re-ranking.

Figure 2. Visualization of the top-15 beam search after visual re-ranking. The colors white (≤ 0), salmon (≤ 0.4), and baby blue (≤ 0.8) represent the degree of change in probability after visual re-ranking. We can also observe that a less diverse beam negatively impacted the score, as in the case of the Transformer and Show and Tell baselines.
Figure 2_zoom. Here, we zoom in on Figure 2 (above) with a more sensitive setting to show exactly where the changes occur before and after re-ranking.

Evaluation of Lexical Diversity. As shown in Table 2 (below), we evaluate the model from a lexical diversity perspective. We can conclude that (1) the vocabulary is larger, and (2) the number of unique words per caption also improves, even with a lower Type-Token Ratio (TTR) (Brown, 2005). TTR is the number of unique words (types) divided by the total number of tokens in a text fragment.
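These diversity statistics are simple to compute. The sketch below assumes whitespace tokenization and a TTR computed per caption and then averaged; both are simplifying assumptions, so the exact numbers may differ from a tool that tokenizes or aggregates differently. The function name `lexical_diversity` is hypothetical.

```python
from typing import Dict, List


def lexical_diversity(captions: List[str]) -> Dict[str, float]:
    """Compute vocabulary size, mean TTR, unique words and words per caption."""
    tokenized = [c.lower().split() for c in captions]
    tokenized = [tokens for tokens in tokenized if tokens]  # skip empty captions
    if not tokenized:
        raise ValueError("no non-empty captions given")
    n = len(tokenized)
    vocabulary = {word for tokens in tokenized for word in tokens}
    return {
        "voc": len(vocabulary),
        # TTR: unique words (types) divided by total tokens, averaged over captions
        "ttr": sum(len(set(t)) / len(t) for t in tokenized) / n,
        "uniq": sum(len(set(t)) for t in tokenized) / n,
        "wpc": sum(len(t) for t in tokenized) / n,
    }


stats = lexical_diversity(
    [
        "a plate of food on a table",
        "a plate of food and a drink on a table",
    ]
)
print(stats)
```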

Although this approach favors more diverse captions when re-ranking, the improvement is not strong enough to positively impact the benchmark results shown in Table 1.

| Lexical Diversity |
| Model | Voc | TTR | Uniq | WPC |
| Show and tell ♠ |
| Tell BeamS | 304 | 0.79 | 10.4 | 12.7 |
| Tell+VR RoBERTa | 310 | 0.82 | 9.42 | 13.5 |
| VilBERT ♣ |
| Vil BeamS | 894 | 0.87 | 8.05 | 10.5 |
| Vil+VR RoBERTa | 953 | 0.85 | 8.86 | 10.8 |
| Transformer ♣ |
| Trans BeamS | 935 | 0.86 | 7.44 | 9.62 |
| Trans+VR BERT | 936 | 0.86 | 7.48 | 8.68 |
Table 2. Measuring the lexical diversity of captions before and after re-ranking. The Uniq and WPC columns indicate the average number of unique and total words per caption, respectively. (♠ refers to the Flickr 1730 test set, and ♣ refers to the COCO Karpathy 5K test set.) (better read this in the PC version)

Ablation Study. We performed an ablation study to investigate the effectiveness of each model. In the proposed architecture, each expert tries to learn a different representation, at the word level or at the sentence level. In this experiment, we trained each model separately, as shown in Table 3 (below). GloVe as a stand-alone model performed better than the combined model (and thus, combining the experts hurts accuracy). To investigate this further, we visualized each expert before the fusion layers, as shown in Figure 3.

Figure 3. (♠ Top) 1k random samples from the Flickr test set with the Show and Tell model. Each expert contributes a different probability confidence, and therefore the model learns the semantic relation at both the word level and the sentence level. (♣ Bottom) 5k random samples from COCO captions with the Transformer-based caption model. The GloVe score dominates the distribution and becomes the main expert.
| Ablation Study |
| Model | B-4 | M | R | C | BERTscore |
| Trans BeamS | 0.374 | 0.278 | 0.569 | 1.153 | 0.9399 |
| +VR_RoBERT-GloVe | 0.370 | 0.277 | 0.567 | 1.145 | 0.9395 |
| +VR_BERT-GloVe | 0.371 | 0.278 | 0.567 | 1.149 | 0.9398 |
| +VR_RoBERT+BERT | 0.369 | 0.278 | 0.567 | 1.144 | 0.9395 |
| +VR_V1 GloVe | 0.371 | 0.278 | 0.568 | 1.148 | 0.9398 |
| +VR_V2 GloVe | 0.371 | 0.278 | 0.568 | 1.149 | 0.9398 |
Table 3. Ablation study comparing different models with the GloVe-only visual re-ranker on the Transformer baseline. Figure 3 (♣ bottom) shows that BERT contributes less than GloVe to the final score, for two reasons: (1) short captions, and (2) a less diverse beam.

Limitation. In contrast to the CNN-LSTM case (♠ top of Figure 3), where each expert contributes to the final decision, we observed that a shorter caption (with less context) can negatively influence the BERT similarity score. Therefore, GloVe dominates as the main expert, as shown in Figure 3 (♣ bottom).

Finally, below are some visual semantic re-ranking examples with our VR_BERT+GloVe, Baseline Beam Search, and Greedy (scenarios when the greedy is more diverse than beam search).

BL+Beam: a computer monitor sitting on a desk with a keyboard. VR_BERT+GloVe: a desk with a computer monitor and a keyboard. Human: a computer that is on a wooden desk.
BL+Greedy: a green bus parked in front of a building. VR_BERT+GloVe: a green double decker bus parked in front of a building ✗. Human: a passenger bus that is parked in front of a library.
BL+Beam: a plate of food on a table. ✅ VR_BERT+GloVe: a plate of food and a drink on a table. Human: a white plate with some food on it.
BL+Greedy: a group of women sitting on a bench eating. VR_BERT+GloVe: a group of women eating hot dogs. Human: three people are pictured while they are eating.
BL+Greedy: a group of elephants under a shelter in a field. ✅ VR_BERT+GloVe: a group of elephants under a hut. Human: a group of elephants are standing under a roof cover.
Complex image. BL+Beam: a woman wearing a white dress holding a pair of scissors ✗. VR_BERT+GloVe: a woman with a pair of scissors on ✗. Human: a silver colored necklace with a pair of mini scissors on it.


In this work, we introduced an approach that overcomes the limitation of beam search and avoids re-training for better accuracy. We proposed a combined word-level and sentence-level visual beam search re-ranker. However, we discovered that word and sentence similarity disagree with each other when the beam search is less diverse. Our experiments also highlight the usefulness of the model by showing successful cases.

Lessons learned & temporary solution

By looking at the ablation study (Figure 3 and Table 3), we observed that the caption expert via BERT hurts accuracy. After manually checking some samples, we can also draw two observations: (1) generic or repetitive captions and (2) noisy visual context. We propose a solution to the two problems as follows:

  • (1) Human-inspired natural language understanding-based decision-making is needed. For example, does the caption make sense semantically and grammatically before we apply our visual re-ranking? One option is to use a language model, e.g., GPT-2 (Radford et al., 2019), to filter out non-human-like captions, as shown below.

caption 1: a blue and white bus parked at a bus stop (0.14)

caption 2: a white bus with blue and white on the side of a street (0.10)✗

  • (2) The visual classifier output also needs a visual grounding soft label (i.e., cosine distance) with the caption as shown below with the first caption Cosine(visual, caption):

Visual: Airliner, Caption: a white blue and yellow jet airliner in a runway., soft label 0.6223

Visual: Cap, Caption: a cell phone sitting on a table with a glass of water. soft label 0.058 ✗
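Both filters can be sketched as simple thresholding, assuming the language model scores and the cosine soft labels have already been computed. The scores below are the ones from the examples above; the function names and the two thresholds (0.12 and 0.2) are hypothetical and chosen only to reproduce the ✗ decisions shown.

```python
from typing import List, Tuple


def filter_by_lm_score(
    scored_captions: List[Tuple[str, float]], min_score: float
) -> List[str]:
    """Keep captions whose language model score reaches the threshold."""
    return [caption for caption, score in scored_captions if score >= min_score]


def filter_by_soft_label(
    pairs: List[Tuple[str, str, float]], min_similarity: float
) -> List[Tuple[str, str]]:
    """Keep (visual, caption) pairs whose cosine soft label reaches the threshold."""
    return [(v, c) for v, c, sim in pairs if sim >= min_similarity]


captions = [
    ("a blue and white bus parked at a bus stop", 0.14),
    ("a white bus with blue and white on the side of a street", 0.10),
]
pairs = [
    ("Airliner", "a white blue and yellow jet airliner in a runway.", 0.6223),
    ("Cap", "a cell phone sitting on a table with a glass of water.", 0.058),
]

# Hypothetical thresholds, for illustration only.
kept_captions = filter_by_lm_score(captions, min_score=0.12)
kept_pairs = filter_by_soft_label(pairs, min_similarity=0.2)

print(kept_captions)
print(kept_pairs)
```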

Now, let’s apply these two ideas, (1) red → (language model GPT-2) and (2) dark green → (soft label), and modify Figure 1 as shown below in Figure 4.

Figure 4. VR_BERT+GloVe_modified. Figure 1 with the new configuration to overcome some of the limitations (1) lack of semantic understanding (by using GPT-2 --> red color ) and (2) noisy visual context (by using soft label via SBERT cosine distance with the caption --> dark green color )

Now, let’s train the model again with the new modification and use the same images above for inference (🚨).

BL+Beam: a computer monitor sitting on a desk with a keyboard. VR_BERT+GloVe_modified: a desktop computer sitting on top of a desk. Human: a computer that is on a wooden desk. 🚨Better caption.
BL+Greedy: a green bus parked in front of a building. VR_BERT+GloVe_modified: a bus is parked in front of a building. Human: a passenger bus that is parked in front of a library. 🚨Same caption as beam search, and thus is not breaking the result.
BL+Beam: a plate of food on a table. ✅ VR_BERT+GloVe_modified: a plate of food and a drink on a table. Human: a white plate with some food on it. 🚨Same caption as before.
BL+Greedy: a group of women sitting on a bench eating. VR_BERT+GloVe_modified: a couple of women sitting next to each other. Human: three people are pictured while they are eating. 🚨Different caption.
BL+Greedy: a group of elephants under a shelter in a field. ✅ VR_BERT+GloVe_modified: a group of elephants standing under a wooden structure. Human: a group of elephants are standing under a roof cover. 🚨Diverse caption.
Complex image. BL+Beam: a woman wearing a white dress holding a pair of scissors. ✗ VR_BERT+GloVe_modified: a close up of a pair of scissors (better)✗. Human: a silver colored necklace with a pair of mini scissors on it. 🚨✗ Better caption.

Although this approach is better, as the accuracy of the baseline is not negatively impacted, the re-ranker still needs to outperform beam search to be usable. Even so, this is a good starting point. In addition, it is not feasible as post-processing (it is computationally expensive with three encoders), which is the main objective of this work. In future work, we will follow the same direction but in an end-to-end fashion, in particular by relying on an encoder-sharing strategy to reduce parameters and computational resources.

Finally, it is my hope that we learn something new from this negative result analysis and remember that a good idea doesn’t always work in practice.

Feel free to cite our article if this insight and experiment are helpful to you

Baseline GitHub links:

(1) Caption Transformer: https://github.com/aimagelab/meshed-memory-transformer

(2) Vilbert: https://github.com/facebookresearch/vilbert-multi-task


(1) Table Editor

(2) Figures OmniGraffle Pro

(3) LaTeX plot tools



ML researcher interested in language & vision research. I’m using this Medium blog to write my learning notes.

Ahmed Sabir
