Paper Summary: UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Ahmed Sabir
14 min read · Mar 9, 2022


In this post, I will outline some important aspects of this language and vision paper. At a high level, the authors try to unify different modalities (language and vision) in the same semantic space, to obtain a general model that performs well on a range of language and vision tasks such as visual question answering (VQA), image captioning, text prediction, and sentence similarity. In particular, they use cross-modal contrastive learning to align textual and visual information in the semantic space. As a result, the model learns generalized representations in which textual knowledge benefits visual tasks and vice versa. The authors' official code is available on GitHub.

UNIfied-MOdal — UNIMO

Like most recent state-of-the-art language and vision models, the UNIfied-MOdal pre-training architecture (UNIMO) uses a multi-layer self-attention Transformer to learn unified representations of textual and visual data, as shown in Figure 1 below.

Figure 1. The unified-modal pre-training architecture. Images, text, and image-text pairs can all be effectively utilized for representation learning.

The different modalities are handled in three setups: (1) text, (2) image, and (3) image-text pairs. Next, I describe each of them in more detail.

(1) A textual input W is split into a sequence of subwords using Byte-Pair Encoding (BPE) (Sennrich et al., 2016).

BPE is a form of data compression in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data. For example, if the byte pair “aa” occurs most often, it is replaced by an unused byte, say Z = aa, and the data is rewritten with Z in place of every “aa”.
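To make the merge step concrete, here is a minimal sketch of a single byte-pair substitution. The input string `aaabdaaabac` and the helper name `merge_most_frequent_pair` are my own illustrative choices, and this shows the compression view of BPE rather than the exact subword procedure of Sennrich et al.

```python
from collections import Counter
from typing import Tuple


def merge_most_frequent_pair(data: str, new_symbol: str) -> Tuple[str, str]:
    """Replace the most frequent adjacent pair in `data` with `new_symbol`."""
    if new_symbol in data:
        raise ValueError("new_symbol must not already occur in the data")
    # Count every adjacent pair of characters (overlapping windows).
    pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
    best_pair, _ = pairs.most_common(1)[0]
    # str.replace scans left to right and substitutes non-overlapping matches.
    return best_pair, data.replace(best_pair, new_symbol)


# Illustrative input (an assumption, not taken from the paper).
pair, compressed = merge_most_frequent_pair("aaabdaaabac", "Z")
print(pair, compressed)  # aa ZabdZabac
```

Repeating this step on the compressed string, with a fresh symbol each time, yields the full sequence of merges.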


After encoding, the self-attention mechanism is used to learn contextual token representations. The special tokens [CLS] and [SEP] mark the start and end of the text sequence.

(2) An image input V is first converted to a sequence of region features that begins with [IMG], which denotes the representation of the entire image. The self-attention mechanism is then used to learn contextual region representations.

Similar to previous work such as UNITER (Chen et al., 2020b), they use Faster R-CNN (Ren et al., 2016) to detect the salient image regions and extract the visual features (pooled ROI features) for each region.

(3) For an image-text pair (V, W), the visual features and textual tokens are concatenated into a single sequence.

The sequence is then fed into the multi-layer Transformer to learn cross-modal contextual representations for both the textual tokens and the image regions.
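As a compact summary of the three input formats, here is how I would write them out. The symbols h (contextual representation), n (number of subwords), and t (number of regions) are my notation and should be checked against the paper.

```latex
\begin{align*}
\text{text:}\quad & W = \{[\mathrm{CLS}], w_1, \ldots, w_n, [\mathrm{SEP}]\}
  \;\rightarrow\; \{h_{[\mathrm{CLS}]}, h_{w_1}, \ldots, h_{w_n}, h_{[\mathrm{SEP}]}\} \\
\text{image:}\quad & V = \{[\mathrm{IMG}], v_1, \ldots, v_t\}
  \;\rightarrow\; \{h_{[\mathrm{IMG}]}, h_{v_1}, \ldots, h_{v_t}\} \\
\text{image-text pair:}\quad & \{[\mathrm{IMG}], v_1, \ldots, v_t, [\mathrm{CLS}], w_1, \ldots, w_n, [\mathrm{SEP}]\}
\end{align*}
```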

Representations for images V, text W, and image-text pairs (V, W) are learned jointly through masked prediction and Cross-Modal Contrastive Learning (CMCL). According to the authors, this will “enable the textual knowledge and visual knowledge to enhance each other in the unified semantic space.”

Cross-Modal Contrastive Learning.

Contrastive learning (CL) is one of the most popular and successful self-supervised learning paradigms. In a nutshell, it pulls similar images (positive samples) closer together and pushes dissimilar images (negative samples) apart, without requiring labels.

The main idea of Cross-Modal Contrastive Learning (CMCL) is to bring the representations of a paired image and text close together in the representation space while keeping non-paired ones far apart. The representations of image V and text W are used to compute the similarity between the two instances, which measures their distance d(V, W).

Figure 2. Illustration of the CMCL. A series of text rewriting techniques are utilized to create the positive image-text pairs X+ and hard negative image-text pairs X−. Image and text retrieval are also utilized to obtain related images X_I and texts X_T from single-modal data, which are treated as single-modal positive samples during cross-modal learning. All of them are encoded by the same unified-modal Transformer, in pairs or individually, and the representations of images and texts are extracted to compute the contrastive loss.

As shown in Figure 2 (above), to facilitate semantic alignment between vision and language at different levels, the authors propose several text rewriting techniques that rephrase the original caption of an image at the word, phrase, or sentence level. In this way, they can create large volumes of positive examples (X+) and negative examples (X−) for each image-text pair (V, W). Moreover, text and image retrieval are applied as augmentation techniques to obtain various related texts X_T and images X_I for each image-text pair (V, W). The CMCL loss is computed over all of these samples.
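In the paper's notation, as I read it, the loss takes the following form. The set superscripts indicate which samples are included: positives (+), hard negatives (−), retrieved images (I), and retrieved texts (T).

```latex
\mathcal{L}_{\mathrm{CMCL}} =
\mathbb{E}_{V,W}\left[
  -\log
  \frac{\sum_{(V^{+},W^{+}) \in \mathcal{X}^{\{+,I,T\}}} \exp\!\left(d(V^{+},W^{+})/\tau\right)}
       {\sum_{(V',W') \in \mathcal{X}^{\{-,+,I,T\}}} \exp\!\left(d(V',W')/\tau\right)}
\right]
```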

where τ denotes the temperature parameter. Note that for the single-modal images X_I and texts X_T, the original text W and image V, respectively, are used to compute the cross-modal relevance.
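As a rough illustration of how such a loss behaves, here is a minimal pure-Python sketch for a single image-text pair. The function name `cmcl_loss`, the example scores, and the temperature value 0.1 are my own assumptions; the similarity scores d(V, W) are taken as given rather than computed from a model.

```python
import math


def cmcl_loss(pos_scores, neg_scores, tau=0.1):
    """Contrastive loss for one image-text pair.

    pos_scores: similarities d(V, W) of positive pairs (rewritten captions
                and any retrieved samples treated as positives)
    neg_scores: similarities of hard negative pairs
    tau:        temperature (0.1 is an arbitrary illustrative value)
    """
    if not pos_scores:
        raise ValueError("at least one positive pair is required")
    pos = sum(math.exp(s / tau) for s in pos_scores)
    neg = sum(math.exp(s / tau) for s in neg_scores)
    # Numerator: positives only. Denominator: positives plus negatives.
    return -math.log(pos / (pos + neg))


# Well-separated scores give a small loss; swapped scores give a large one.
loss_good = cmcl_loss(pos_scores=[0.9, 0.8], neg_scores=[0.1, 0.0])
loss_bad = cmcl_loss(pos_scores=[0.1, 0.0], neg_scores=[0.9, 0.8])
print(loss_good < loss_bad)  # True
```

A smaller τ sharpens the contrast, so the same score gap produces a larger difference between the two losses.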

Text Rewriting — paraphrasing. To enhance multi-granularity of semantic alignment between image and text, they rely upon a paraphrasing (rewriting) technique to change the caption of images at different levels, including (1) sentence-level and (2) phrase/word-level.

  • (1) For sentence-level rewriting. A back-translation technique (Edunov et al., 2018) is used to obtain several positive samples for each image-text pair. Specifically, each caption of an image is translated into another language and then translated back to the original language. In this way, several similar captions can be obtained for an image.

The idea of Edunov et al. (2018) is that, instead of standard back-translation with beam or greedy search (Sennrich et al., 2016a), synthetic data generated by sampling or noised beam search is used, which provides a stronger training signal.

  • (2) For phrase-level and word-level rewriting. First, they parse the image caption into a scene graph (Wang et al., 2018). In particular, they generate high-quality scene graphs directly from the textual description without relying on the image information. Then they randomly replace the object, attribute, or relation nodes of the scene graph with a different object, attribute, or relation from the corresponding vocabularies. This concept will make the training data more diverse at the word level.

Note that the parser proposed by Wang et al. (2018) outperformed SPICE (Anderson et al., 2016), which is a standard caption evaluation metric.

Also, instead of randomly sampling negative samples as in previous contrastive learning-based methods, rewriting is used to generate large volumes of hard negative samples. In this way, the model can learn more detailed semantic alignment between image and text at different levels.
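The word-level replacement idea can be sketched as follows. This is a toy version under my own assumptions: the scene graph is flattened to a list of typed nodes, the vocabularies in `VOCAB` are made up, and `make_hard_negative` is a hypothetical helper, not the authors' code.

```python
import random

# Hypothetical vocabularies; in practice they would be collected from the
# scene graphs parsed over the whole caption corpus.
VOCAB = {
    "object": ["dog", "cat", "horse", "ball"],
    "attribute": ["brown", "white", "small"],
    "relation": ["chasing", "holding", "next to"],
}


def make_hard_negative(nodes, rng):
    """Return a copy of `nodes` with one node swapped for a different word
    of the same type (object, attribute, or relation)."""
    idx = rng.randrange(len(nodes))
    node_type, word = nodes[idx]
    candidates = [w for w in VOCAB[node_type] if w != word]
    replaced = list(nodes)
    replaced[idx] = (node_type, rng.choice(candidates))
    return replaced


# Nodes for a caption like "brown dog chasing ball".
caption_nodes = [
    ("attribute", "brown"),
    ("object", "dog"),
    ("relation", "chasing"),
    ("object", "ball"),
]
negative = make_hard_negative(caption_nodes, random.Random(0))
print(" ".join(word for _, word in negative))
```

Because only one node changes, the negative caption stays close to the original, which is what makes it a hard negative.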

Image/Text Retrieval. In order to incorporate more single-modal information during cross-modal learning, each image-text pair is further augmented with various related images and texts retrieved from the single-modal data.

Concretely, each image-text pair is augmented through semantic search based on cosine similarity (the retrieval method), as follows:

  • Image. Images are retrieved from the image collections by visual similarity, i.e., images that are similar to the original image are collected.
  • Text. Sentences that are semantically related to the original text fragment or caption are extracted based on semantic similarity.

Finally, the retrieved images and texts are encoded individually by the unified-modal Transformer, as shown in Figure 2 (above), and their representations are extracted to compute the cross-modal contrastive loss in Equation 1 (above). The main idea is to provide rich background information from single-modal data for better cross-modal learning.
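A minimal sketch of retrieval by cosine similarity is shown below. The embeddings, the ids in `corpus`, and the helper names are illustrative assumptions; a real system would use learned image or sentence embeddings and an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, candidates, k):
    """Return the ids of the k candidates most similar to `query`.

    candidates: dict mapping an id to its embedding vector.
    """
    ranked = sorted(
        candidates,
        key=lambda cid: cosine_similarity(query, candidates[cid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d embeddings standing in for image (or sentence) features.
corpus = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.0, 1.0],
    "img_d": [-1.0, 0.0],
}
top2 = retrieve_top_k([1.0, 0.0], corpus, k=2)
print(top2)  # ['img_a', 'img_b']
```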

Visual & Language Learning

Visual learning. For visual learning, a BERT-like masked modeling objective is used, with 15% of the visual features masked. In particular, the features of masked regions are replaced by zeros, and the mask also covers regions that heavily intersect the masked ones, to avoid information leakage. This matters for vision because the regions of an image usually overlap heavily with each other. Masking anchors (anchor boxes) are chosen at random, and regions whose overlap ratio with an anchor is larger than 0.3 are masked as well. For an image V, the model is trained to reconstruct the masked regions v_m given the remaining regions v\m.

For an image-text pair (V, W), the model is trained to reconstruct the masked regions v_m given the text W and the remaining regions v\m.
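The overlap-aware masking can be sketched as follows. I am assuming here that the "overlap ratio" is intersection over union (IoU), that boxes are given as (x1, y1, x2, y2), and that the anchors are passed in explicitly instead of being sampled; `mask_regions` is a hypothetical helper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_regions(boxes, features, anchor_ids, overlap=0.3):
    """Zero out each anchor region and every region overlapping an anchor."""
    masked = set(anchor_ids)
    for a in anchor_ids:
        for i, box in enumerate(boxes):
            if iou(boxes[a], box) > overlap:
                masked.add(i)
    out = [
        [0.0] * len(feat) if i in masked else list(feat)
        for i, feat in enumerate(features)
    ]
    return out, sorted(masked)


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
features = [[0.5, 0.2], [0.4, 0.3], [0.9, 0.1]]
masked_features, masked_ids = mask_regions(boxes, features, anchor_ids=[0])
print(masked_ids)  # [0, 1]
```

Region 1 is masked together with its anchor because the two boxes share most of their area, so its features would otherwise leak the masked content.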

Language learning. The model is trained on two language modeling tasks: (1) Bidirectional prediction and (2) sequence-to-sequence (Seq2Seq) generation.

(1) Bidirectional prediction. Given a token sequence W = {[CLS], w_1, …, w_n, [SEP]}, they sample 15% of the tokens as spans. Following SpanBERT (Joshi et al., 2020), all tokens in the selected spans are replaced with the special [MASK] token, a random token, or the original token with probability 80%, 10%, and 10%, respectively. The objective is to predict the masked tokens w_m from their surrounding context w\m by minimizing the negative log-likelihood.
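Here is a minimal sketch of this span corruption step. The span-length range (`max_span`), the per-span replacement decision, and the helper name `mask_spans` are my assumptions for illustration; the input is assumed to be a non-empty list of word tokens without [CLS] and [SEP].

```python
import random

MASK = "[MASK]"


def mask_spans(tokens, vocab, rng, budget=0.15, max_span=3):
    """Select spans until about `budget` of the tokens are covered.

    Returns (corrupted_tokens, target_positions).
    """
    n = len(tokens)
    n_to_mask = max(1, round(n * budget))
    corrupted = list(tokens)
    selected = set()
    while len(selected) < n_to_mask:
        length = min(rng.randint(1, max_span), n_to_mask - len(selected))
        start = rng.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(i in selected for i in span):
            continue  # overlaps an earlier span, so resample
        # One decision per span: 80% [MASK], 10% random token, 10% unchanged.
        roll = rng.random()
        for i in span:
            if roll < 0.8:
                corrupted[i] = MASK
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            selected.add(i)
    return corrupted, sorted(selected)


tokens = ["w%d" % i for i in range(20)]
corrupted, targets = mask_spans(tokens, vocab=["x", "y"], rng=random.Random(0))
print(corrupted, targets)
```

The model would then be asked to predict the original tokens at `targets`, including the positions that were left unchanged.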

(2) Sequence-to-sequence (Seq2Seq) generation. For the Seq2Seq generation task, they iteratively sample fragments from the token sequence until the 25% budget has been spent, inspired by Xiao et al. (2020). For each iteration:

  • First, they sample a fragment length from a uniform distribution.
  • Second, they sample a fragment with the specified length. Every selected fragment {w_i, …, w_j} is wrapped with two special tokens, [CLS] and [SEP] (i.e., {[CLS], w_i, …, w_j, [SEP]}), which denote the beginning and end of the fragment.
  • Finally, all selected fragments are removed from the text and concatenated as the target sequence T, while the remaining parts are concatenated as the source sequence S. The model is trained to generate the target sequence auto-regressively, conditioned on the source sequence, by minimizing the negative log-likelihood of T given S under the model distribution P_θ(T|S).
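For reference, the two language objectives can be written as follows. This is my transcription of the standard forms in the paper's notation (D is the training data, T_j the j-th target token), so the exact typography may differ from the original.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{Bidirectional}} &= -\,\mathbb{E}_{W \in D}\, \log P_\theta\!\left(w_m \mid w_{\setminus m}\right) \\
\mathcal{L}_{\mathrm{Seq2Seq}} &= -\,\mathbb{E}_{(S,T) \in D}\, \log P_\theta\!\left(T \mid S\right) \\
P_\theta(T \mid S) &= \prod_{j=1}^{|T|} P_\theta\!\left(T_j \mid T_{<j}, S\right)
\end{align*}
```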

During pre-training, they alternate uniformly between the bidirectional prediction objective and the Seq2Seq generation objective. For image-text pairs, the two objectives are applied to the captions in the same way, to learn cross-modal understanding and generation.
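The fragment sampling procedure for the Seq2Seq objective can be sketched as below. The fragment-length range, the ordering of fragments by position in the target, and the helper name `split_source_target` are my assumptions; the input is assumed to be a non-empty list of word tokens.

```python
import random


def split_source_target(tokens, rng, budget=0.25, min_len=1, max_len=4):
    """Sample fragments until about `budget` of the tokens are selected.

    Returns (source, target): the target is the concatenation of the
    fragments, each wrapped in [CLS] ... [SEP]; the source is the rest.
    """
    n = len(tokens)
    n_target = max(1, int(n * budget))
    taken = [False] * n
    fragments = []
    while sum(taken) < n_target:
        length = min(rng.randint(min_len, max_len), n_target - sum(taken))
        start = rng.randrange(0, n - length + 1)
        if any(taken[start:start + length]):
            continue  # overlaps an earlier fragment, so resample
        for i in range(start, start + length):
            taken[i] = True
        fragments.append((start, length))
    fragments.sort()  # keep fragments in their original order
    target = []
    for start, length in fragments:
        target += ["[CLS]"] + tokens[start:start + length] + ["[SEP]"]
    source = [tok for tok, used in zip(tokens, taken) if not used]
    return source, target


tokens = ["w%d" % i for i in range(20)]
source, target = split_source_target(tokens, random.Random(0))
target_words = [t for t in target if t not in ("[CLS]", "[SEP]")]
print(source, target)
```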

Experimental Settings

Implementation Details. UNIMO-base uses a 12-layer Transformer and UNIMO-large uses a 24-layer Transformer. Here are some training details:

  • Maximum sequence length for text tokens: 512
  • Maximum sequence length for image-region features: 100
  • Adam optimizer with an initial learning rate of 5e-5 and a linear learning-rate decay schedule.
  • Parameters are initialized from RoBERTa (Liu et al., 2019).

As mentioned in the paper, training takes about 7 days for UNIMO-base with 32 Nvidia Tesla V100 32GB GPUs and 10 days for UNIMO-large with 64 Nvidia Tesla V100 32GB GPUs.

For Visual Learning. A Faster R-CNN (Ren et al., 2016) pre-trained on Visual Genome is used to extract the features of salient image regions. Regions are selected using a confidence threshold of 0.2 (detection probability), with a maximum of 100 boxes per image.
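The region selection rule can be sketched as a simple filter. The detection format (a box plus a list of class probabilities), the choice to keep the highest-scoring boxes first, and the helper name `select_regions` are my assumptions, not details from the paper.

```python
def select_regions(detections, threshold=0.2, max_boxes=100):
    """Keep boxes whose best class probability exceeds `threshold`.

    detections: list of (box, class_probs) tuples.
    Returns at most `max_boxes` boxes, highest-scoring first.
    """
    scored = [
        (max(probs), box)
        for box, probs in detections
        if max(probs) > threshold
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [box for _, box in scored[:max_boxes]]


detections = [
    ((0, 0, 5, 5), [0.10, 0.15]),   # below the threshold, dropped
    ((1, 1, 6, 6), [0.70, 0.10]),
    ((2, 2, 8, 8), [0.25, 0.05]),
]
kept = select_regions(detections)
print(kept)  # [(1, 1, 6, 6), (2, 2, 8, 8)]
```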

For the CMCL, they use back-translation to generate three positive samples and rewriting (paraphrasing) to obtain 100 hard negative samples for each image-text pair. Finally, they retrieve the 100 most similar images and the 100 most similar sentences for each image-text pair from the single-modal image collections and text corpus.

Pre-training Dataset. The pre-training datasets consist of three types:

  • Text Corpus. The text corpus includes two large-scale corpora, BookWiki and OpenWebText, which are part of the training data of RoBERTa. BookWiki is composed of English Wikipedia and BookCorpus (Zhu et al., 2015), and OpenWebText is an open recreation of the WebText corpus.
  • Image collections. The image collections are images without textual descriptions, including a subset of OpenImages (Krasin et al., 2017) and COCO unlabeled.
  • Image-text pairs. The image-text pairs are composed of four existing multi-modal datasets: COCO (Lin et al., 2014), Visual Genome (VG) (Krishna et al., 2017), Conceptual Captions (CC) (Sharma et al., 2018) and SBU Captions (Ordonez et al., 2011).

Fine-tuning Tasks. They fine-tune their model on two categories of downstream tasks:

  • Single-modal: generation and understanding tasks

Single-modal generation tasks include:

(a) Generative conversational question answering on the CoQA dataset (Reddy et al., 2019).

(b) Question generation on the SQuAD 1.1 dataset (Rajpurkar et al., 2016).

(c) Abstractive summarization on the CNN/DailyMail (CNNDM) dataset (Hermann et al., 2015).

(d) Sentence compression on the Gigaword dataset (Rush et al., 2015).

Single-modal understanding tasks include:

(a) Sentiment classification on the SST-2 dataset (Socher et al., 2013).

(b) Natural language inference on the MNLI dataset (Williams et al., 2017).

(c) Linguistic acceptability analysis on the CoLA dataset (Warstadt et al., 2019).

(d) Semantic similarity analysis on the STS-B dataset (Cer et al., 2017).

  • Multi-modal vision-language: understanding and generation tasks

The multi-modal tasks include:

(a) Visual question answering (VQA) on the VQA v2.0 dataset (Goyal et al., 2017).

(b) Image captioning on the Microsoft COCO Captions dataset (Chen et al., 2015).

(c) Visual entailment on the SNLI-VE dataset (Xie et al., 2019).

(d) Image-text retrieval on the Flickr30k dataset (Young et al., 2014).

Results and Analysis

Multi-Modal tasks. The evaluation results on the multi-modal tasks are shown in Table 1. They compare with most existing multi-modal pre-training models, including ViLBERT (Lu et al., 2019), VLP (Zhou et al., 2020), UNITER (Chen et al., 2020b), Oscar (Li et al., 2020), Villa (Gan et al., 2020), and ERNIE-ViL (Yu et al., 2020). The results show that UNIMO achieves the best results on almost all benchmarks at both the base and large model sizes. In particular, UNIMO-large outperforms the previous best-performing model, ERNIE-ViL-large, by 1.34 R@1 on image retrieval and 1.3 R@1 on text retrieval, which are substantial improvements for the image-text retrieval tasks. On the image captioning task, UNIMO outperforms the best-performing model, Oscar, by more than 2 BLEU-4 points*.

UNIMO achieves better performance on both the multi-modal understanding and generation tasks, while previous methods usually focus on either the understanding or generation tasks. The results demonstrate the effectiveness of the unified-modal learning architecture that takes advantage of the large scale of single-modal images and texts for cross-modal learning.

*Although the Oscar score they report is far below the one in the Oscar paper, fine-tuning Oscar on COCO Captions may produce a similar result.

Response from the authors 14–03–2022: “Regarding the caption evaluation, the OSCAR score is the cross-entropy evaluation score rather than the RL-based CIDEr optimization score, as shown in Table 2(e) in OSCAR paper. We only trained our model by cross-entropy learning without RL, so we chose the OSCAR score without RL for fair comparison”

Single-Modal tasks. Previous multi-modal pre-training models (e.g., ViLBERT, OSCAR) usually cannot adapt effectively to single-modal scenarios. To validate this, they remove the single-modal learning processes on the text corpus and image collections (i.e., “w/o single-modal”) from UNIMO and replace the CMCL with an image-text matching objective. The “w/o single-modal” model is then just a multi-modal pre-training method similar to UNITER (Chen et al., 2020b). As shown in Table 2 (above), the performance of this model on all the language understanding and generation tasks drops dramatically compared to UNIMO, which demonstrates that multi-modal pre-training on image-text pairs alone cannot adapt effectively to single-modal tasks.

Also, to show the effectiveness of UNIMO on the language understanding and generation tasks, they further compare it with existing pre-trained language models (PLMs), including BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), and UniLM (Dong et al., 2019). The comparison results in Table 2 (above) demonstrate that UNIMO achieves better or comparable performance relative to existing PLMs on both the language understanding and generation tasks.

To summarize, UNIMO achieves good performance on multi-modal tasks and also performs very well on single-modal (language-only) tasks, which demonstrates the advantage of the unified-modal learning architecture over models built for a single setting (vision-language models such as ViLBERT and OSCAR, and language models such as BERT).

Mutual Enhancement of Text and Vision

  • Text Enhance Vision. To explore whether the textual knowledge in the text corpus facilitates cross-modal learning, they remove the language-only learning on the text corpus (“w/o texts”) and then compare the results on the vision tasks, as shown in Table 3. The results demonstrate that textual information enhances cross-modal learning.

| Model | Flickr30k (R@1/R@5/R@10) | COCO Caption (B-4/C) |
| w/o texts | 72.09/91.69/95.30 | 38.3/123.2 |
| UNIMO | 74.66/93.40/96.08 | 38.8/124.4 |
Table 3. Text enhances vision: analyzing the effectiveness of textual knowledge on multi-modal tasks.

  • Vision Enhance Text. To further validate the benefit of visual information for text-only tasks, they remove the images and image-text pairs from the pre-training dataset (i.e., “w/o pairs and images”) and then compare performance on single-modal language tasks. Note that this model is trained with a BERT-like masked language modeling objective. Table 4 shows that visual information indeed helps the language model generalize better.

| Model | STS-B | SST-2 |
| w/o pairs and images | 90.6 | 94.7 |
| UNIMO | 91.0 | 95.1 |
Table 4. Vision enhances text: analyzing the effectiveness of visual knowledge on language tasks.


This work proposes a pre-training architecture that leverages large-scale text, images, and image-text pairs for cross-modal learning. The results show that textual and visual knowledge boost each other in the shared semantic space. The model can be adapted to both single-modal (text-only) and multi-modal settings for a variety of tasks, such as text understanding and text generation.

Comparison with current State-of-the-Art

Last but not least, the important question is: where does this work stand in comparison with the current state-of-the-art?

In this part, I will compare this paper’s result (table only) with the current state-of-the-art model in two popular tasks (1) Image Caption generation and (2) Semantic Similarity.

Image Captioning

| Model (COCO Caption, Karpathy test split) | B-4 | CIDEr |
| Transformer (Vaswani et al., 2017) | 38.7 | 124.7 |
| AoANet (Huang et al., 2019) | 38.9 | 129.8 |
| M² (Cornia et al., 2020) | 39.1 | 131.2 |
| VinVL (Zhang et al., 2021) | 38.2 | 133.3 |
| OSCAR (Li et al., 2020)* | 37.4 | 127.7 |
| LEMON (Hu et al., 2021) | 40.3 | 133.3 |
| BLIP (Li et al., 2022) | 39.7 | 133.3 |
| UNIMO (this paper) | 39.6 | 127.7 |
Table 5. Comparison of different caption models on the Karpathy test split. (* reproduced result, lower than in the original paper)

Semantic Similarity

| Model (STS-B dataset, Cer et al., 2017) | Correlation (%) |
| BERT_L (Devlin et al., 2019) | 86.5 |
| RoBERTa_L (Liu et al., 2019) | 91.9 |
| SRoBERTa (Reimers and Gurevych, 2019) | 77.7 |
| UNILM (Dong et al., 2019) | 87.7 |
| XLNet_L (Yang et al., 2019) | 92.5 |
| XLNet_ensemble* | 93.0 |
| ALBERT (Lan et al., 2019) | 92.6 |
| SpanBERT (Joshi et al., 2020) | 89.9 |
| ELECTRA (Clark et al., 2020) | 92.5 |
| ConvBERT (Jiang et al., 2021) | 87.7 |
| SRoBERTa-Whitening (Su et al., 2021) | 79.4 |
| RealFormer (He et al., 2021) | 89.8 |
| SimCSE-RoBERTa_L (unsupervised) (Gao et al., 2021) | 81.9 |
| SimCSE-RoBERTa_L (supervised) | 86.7 |
| UNIMO (this paper) | 92.6 |
Table 6. Comparison results on the STS-B dataset (Spearman’s correlation). Note that the ensemble* result is not included in the comparison.


Please refer to the original paper for the full references, figures, and formulas.



