ICLR 2022: Interesting Papers

Ahmed Sabir
May 1, 2022


In this short blog post, I will highlight a couple of interesting ideas presented at ICLR 2022 in both 🖼️ Computer Vision and 📖 Natural Language Processing.

Table of Contents
📖 Finetuned Language Models are Zero-Shot Learners 🧩
🖼️ BEiT: BERT Pre-Training of Image Transformers 🧩
🖼️ SimVLM: Simple Visual Language Model Pretraining
🖼️ How Much Can CLIP Benefit Vision-and-Language Tasks? 🧩
🖼️ Pix2seq: A Language Modeling Framework for Object Detection 🧩
📖 LoRA: Low-Rank Adaptation of Large Language Models 🧩
🖼️ Attention-based Interpretability with Concept Transformers 🧩
📖 Charformer: Fast Character Transformers 🧩
📖 Trans-Encoder: Unsupervised sentence-pair modelling 🧩
Please click on the 🖼️📖 to jump to the paper and 🧩 for GitHub.

1. 📖 Paper: Finetuned Language Models are Zero-Shot Learners 🧩

Figure. High-level comparison of different fine-tuning strategies for pre-trained language models. Figure reproduced from the authors' ICLR slides.

Large language models such as GPT-3 have been shown to perform remarkably well in few-shot learning. However, they are less successful at zero-shot learning. This paper introduces a trick to improve the zero-shot performance of large language models: Finetuned Language Net (FLAN). The idea of FLAN is to fine-tune the model on a collection of tasks phrased as instructions through input templates (see Figure), so that it can follow instructions for unseen tasks. Here are some examples of the templates given to the model.


Russian Cosmonaut Valery Polyakov set the record for the longest amount of time spent in space. Based on the paragraph above, can we conclude that Russians hold the record for the longest stay in space? OPTIONS


Read the following and determine if the hypothesis can be inferred from the premise:
Premise: <premise>
Hypothesis: <hypothesis>

Note that this method outperforms GPT-3 on some tasks with a smaller model.
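To make the template idea concrete, here is a minimal sketch of how one example could be rendered into several instruction prompts. The function name `render_instruction`, the `{premise}`/`{hypothesis}` placeholders, and the yes/no option list are my own assumptions for illustration; they are not taken from the FLAN code.

```python
# Hypothetical templates that paraphrase the two examples quoted above.
# The "- yes - no" option formatting is an assumption for illustration.
NLI_TEMPLATES = [
    "{premise} Based on the paragraph above, can we conclude that "
    "{hypothesis}? OPTIONS: {options}",
    "Read the following and determine if the hypothesis can be inferred "
    "from the premise:\nPremise: {premise}\nHypothesis: {hypothesis}\n"
    "OPTIONS: {options}",
]


def render_instruction(template, premise, hypothesis, options):
    """Fill one instruction template with a single NLI example."""
    return template.format(
        premise=premise,
        hypothesis=hypothesis,
        options=" ".join(f"- {o}" for o in options),
    )


example = {
    "premise": "Russian Cosmonaut Valery Polyakov set the record for the "
               "longest amount of time spent in space.",
    "hypothesis": "Russians hold the record for the longest stay in space",
    "options": ["yes", "no"],
}

# The same example is phrased in several ways, one prompt per template.
prompts = [render_instruction(t, **example) for t in NLI_TEMPLATES]
for prompt in prompts:
    print(prompt, end="\n\n")
```

Each rendered prompt would then be paired with the gold answer and used as one fine-tuning example.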

2. 🖼️ Paper: BEiT: BERT Pre-Training of Image Transformers 🧩

This paper proposes masked image modeling, a BERT-style pre-training task for vision Transformers, which uses a discrete variational autoencoder to tokenize the input image into discrete visual tokens.

Figure. Overview of BEiT pre-training. Figure reproduced from the paper.

The main concept is to recover the original image from a corrupted version, as follows:

  1. A discrete variational autoencoder is used to tokenize the image into discrete visual tokens.
  2. The model then tries to reconstruct the image from a learned vocabulary by conditioning on the visual tokens.

The reconstruction loss to recover the image can be computed as:

Equation. x is the original image, x_bar is the corrupted image, and z denotes the visual tokens.

In particular (explanation from the paper): (1) before the pre-training stage, they learn an image tokenizer via autoencoding-style reconstruction, where an image is tokenized into discrete visual tokens according to the vocabulary; (2) during pre-training, each image has two views, i.e., image patches and visual tokens. They randomly mask some proportion of image patches (shown as gray patches in the figure) and replace them with a special mask embedding M; (3) the patches are fed to a backbone vision Transformer. The pre-training task aims at predicting the visual tokens of the original image based on the encoding vectors of the corrupted image.
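The masking and prediction steps can be sketched in a few lines. This is only a toy illustration under my own assumptions: patches are masked uniformly at random, the tokenizer output is replaced by random token ids, and the Transformer is replaced by a uniform distribution over the vocabulary. The names `mask_patches` and `masked_token_loss` are hypothetical.

```python
import math
import random


def mask_patches(num_patches, mask_ratio, rng):
    """Pick a random subset of patch positions to replace with the mask embedding."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def masked_token_loss(token_probs, visual_tokens, masked_positions):
    """Average negative log-likelihood of the true visual tokens, masked positions only."""
    nll = [-math.log(token_probs[i][visual_tokens[i]]) for i in masked_positions]
    return sum(nll) / len(nll)


rng = random.Random(0)
num_patches, vocab_size = 16, 8  # toy sizes, not the paper's settings

# Stand-in for the image tokenizer: one visual token id per patch.
visual_tokens = [rng.randrange(vocab_size) for _ in range(num_patches)]
masked = mask_patches(num_patches, mask_ratio=0.4, rng=rng)

# Stand-in for the Transformer output: a uniform distribution per patch.
uniform = [[1.0 / vocab_size] * vocab_size for _ in range(num_patches)]
loss = masked_token_loss(uniform, visual_tokens, masked)
print(masked, round(loss, 4))
```

With a uniform prediction the loss equals log(vocab_size), which is the value a trained model should improve on.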

3. 🖼️ Paper: SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

The SimVLM paper proposes a generative language model that is conditioned on visual data and pre-trained on large-scale image-text data. The main idea of this work is to 1) train the model on billion-scale noisy web images, and then 2) transfer the same model to different vision-and-language tasks with ease, including in a zero-shot setting.

Figure. The SimVLM approach, which processes text as a language model for vision tasks. Figure reproduced from the paper.

The authors propose Prefix Language Modeling (PrefixLM). PrefixLM differs from the standard LM in that it enables bi-directional attention on the prefix sequence x_{<Tp}, while the model conducts autoregressive factorization only on the remaining tokens x_{≥Tp}.
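One way to picture PrefixLM is through its attention mask. The sketch below is my own illustration, not code from the paper: `prefix_lm_mask` is a hypothetical helper, and `prefix_len` plays the role of Tp.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Return mask[i][j], True when position i may attend to position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = j < prefix_len  # every position sees the whole prefix
            causal = j <= i             # otherwise only current and earlier tokens
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


mask = prefix_lm_mask(seq_len=5, prefix_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first `prefix_len` rows attend to each other in both directions, while the remaining rows follow the usual lower-triangular (causal) pattern.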

4. 🖼️ Paper: How Much Can CLIP Benefit Vision-and-Language Tasks? 🧩

This paper investigates CLIP as a visual encoder in place of the standard Bottom-Up and Top-Down (BUTD) backbone (Anderson et al. 2018). The benefits of this method are as follows:

1. No dependence on in-domain detection data, unlike BUTD, which is also computationally expensive.

2. A very straightforward design, and the model can be used directly for inference (unlike BUTD, which requires pre-computed features, as shown in the Figure below).

3. A significant improvement over BUTD, without the need to label a dataset.

Figure. Comparison between the currently prevailing approach and the CLIP-based method. Figure reproduced from the paper.

The results on different tasks such as Visual Question Answering, Visual Entailment, and Vision-and-Language Navigation suggest that CLIP is a viable alternative to existing visual representations (e.g., a ResNet pre-trained on ImageNet as a backbone).

5. 🖼️ Paper: Pix2seq: A Language Modeling Framework for Object Detection 🧩

This paper casts object detection as a language modeling problem. The model uses a sequence of tokens to describe each bounding box and then trains an auto-regressive decoder to generate the target sequence. This approach uses a more general architecture (unlike current work such as Faster R-CNN and DETR) and achieves competitive results on the COCO object detection dataset.

Figure. The Pix2seq architecture consists of an image encoder and a language decoder. Figure reproduced from the paper.

Pix2seq is trained like a language model to predict tokens given an image, with a maximum likelihood loss: maximize Σ_j w_j log P(y_hat_j | x, y_{1:j-1}) for j = 1, ..., L. Here x is the given image, y and y_hat are the input and target sequences associated with x, and L is the target sequence length. y and y_hat are the same in the standard language model setup, but they differ when the sequence is augmented (as shown above in the Figure). w_j is the pre-assigned weight for the j-th token in the sequence; the weights can also be set differently by token type (e.g., coordinate versus class tokens).
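As a rough illustration of how a box becomes a token sequence and how the weighted likelihood is computed, here is a small sketch. The quantization scheme, the `[ymin, xmin, ymax, xmax, class]` token order, and the helper names `box_to_tokens` and `weighted_nll` are simplifying assumptions of mine, so please check the paper for the exact recipe.

```python
import math


def box_to_tokens(box, class_id, image_size, num_bins):
    """Quantize (ymin, xmin, ymax, xmax) into bin indices and append a class token."""
    height, width = image_size
    ymin, xmin, ymax, xmax = box
    normalized = [ymin / height, xmin / width, ymax / height, xmax / width]
    coord_tokens = [min(int(v * num_bins), num_bins - 1) for v in normalized]
    # Class tokens are offset past the coordinate bins so both share one vocabulary.
    return coord_tokens + [num_bins + class_id]


def weighted_nll(target_probs, weights):
    """Negative of sum_j w_j * log P(y_hat_j | x, y_1:j-1)."""
    return -sum(w * math.log(p) for p, w in zip(target_probs, weights))


tokens = box_to_tokens(
    box=(50, 100, 200, 300), class_id=3, image_size=(400, 400), num_bins=100
)
print(tokens)

# Toy probabilities that a decoder might assign to each target token.
probs = [0.5] * len(tokens)
nll = weighted_nll(probs, weights=[1.0] * len(tokens))
print(round(nll, 4))
```

Setting every weight to 1 gives the plain language-model loss; changing the weight of the class token is one way to emphasize token types differently.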

6. 📖 Paper: LoRA: Low-Rank Adaptation of Large Language Models 🧩

The main concept of this paper is to add low-rank matrices to the pre-trained language model during fine-tuning, which keeps adaptation lightweight since it only adds a small number of parameters. In particular, the low-rank matrices are learned during fine-tuning while the original weight matrices of the model are frozen.

Figure. (Left) Only A and B are trainable (r is the LoRA rank). (Right) Only the q and v projections are trained with LoRA. Figure reproduced from the authors' ICLR slides.

As shown in the Figure (Left), the advantage of LoRA is that the pre-trained model stays frozen and switching tasks only requires replacing the matrices A and B, which have far fewer parameters. For a pre-trained W_0, the update is constrained to be low rank: W_0 + ∆W = W_0 + BA, where B ∈ R^{d×r}, A ∈ R^{r×k}, and the rank r ≪ min(d, k). During training, W_0 is frozen (no gradient updates) while A and B are trainable, so the modified forward pass is h = W_0 x + ∆W x = W_0 x + BAx, where ∆W is the accumulated gradient update during adaptation.
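Below is a minimal sketch of the LoRA forward pass h = W_0 x + BAx with plain Python lists, just to show the shapes and the parameter savings. The toy sizes and the helper names `matvec` and `lora_forward` are my own; initializing B to zero so that BA = 0 at the start follows my reading of the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x):
    """h = W_0 x + B A x, with W_0 frozen and only A, B trainable."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))  # apply A (r x k) first, then B (d x r)
    return [p + q for p, q in zip(base, update)]


d, k, r = 4, 3, 1  # toy sizes with r < min(d, k)
w0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen, d x k
a = [[0.1, 0.2, 0.3]]           # trainable, r x k
b = [[0.0] for _ in range(d)]   # trainable, d x r, zero so that BA = 0 at the start
x = [1.0, 2.0, 3.0]

h = lora_forward(w0, a, b, x)
full_params, lora_params = d * k, r * (d + k)
print(h, full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only r * (d + k) values need to be stored per task.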

7. 🖼️ Paper: Attention-based Interpretability with Concept Transformers 🧩

This paper proposes a simple generalization of attention for Transformers in which attention is computed over high-level concepts instead of low-level input features. This attention can be a drop-in replacement for the classification head of a Transformer (or any classifier that uses cross-attention to generate classification log-probabilities), and its attention weights act as intermediate network outputs that provide a form of interpretability for the model.

Figure. ConceptTransformer. The model uses a vision backbone (ResNet-50 features) with a tokenizer and cross-attention between patches and concepts. Figure reproduced from the paper.

The main drawback of this work is that the concepts are pre-defined from domain knowledge, which may reduce the potential benefit of the approach in scenarios where such information or concepts do not exist.

8. 📖 Paper: Charformer: Fast Character Transformers via Gradient-based Subword Tokenization 🧩

Unlike the traditional approach, which applies subword tokenization as a pre-processing step before model training, this paper proposes Charformer, a token-free model that uses Gradient-Based Subword Tokenization (GBST) trained end-to-end, as shown in the Figure below.

Figure. A high-level comparison between traditional subword Transformer models and Charformer, which uses gradient-based subword tokenization. Figure reproduced from the paper appendix.

Note that, as the authors mention, token-free models are inefficient because the sequence length is on average 4x longer than with the traditional SentencePiece subword model.

First, the model processes the character sequence with a single 1D convolution so that each position captures information from its local neighborhood. Second, non-overlapping blocks of characters are averaged, and each average is used as a block embedding. Doing this for every block size gives each character candidate n-gram blocks of length 1 to N. Third, each candidate block is scored with a linear transformation (the block scoring network), and the scores act as attention over the n-gram block sizes. Then, a weighted average of the candidate block embeddings is computed with these scores. Finally, the sequence is downsampled with a mean pooling operation.
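The GBST steps can be sketched on a toy input. This is a simplified illustration under my own assumptions: the 1D convolution is skipped, embeddings are two-dimensional lists, and the block scoring network is a single weight vector. The helper names (`block_candidates`, `gbst`) are hypothetical.

```python
import math


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def block_candidates(chars, block_size):
    """Mean-pool non-overlapping blocks, then repeat each block embedding per position."""
    out = []
    for start in range(0, len(chars), block_size):
        block = chars[start:start + block_size]
        out.extend([mean(block)] * len(block))
    return out


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gbst(chars, max_block_size, score_weights, downsample_rate):
    dim = len(chars[0])
    candidates = [block_candidates(chars, b) for b in range(1, max_block_size + 1)]
    mixed = []
    for pos in range(len(chars)):
        options = [cand[pos] for cand in candidates]
        # Block scoring: a linear map from each candidate embedding to a scalar.
        scores = [sum(w * v for w, v in zip(score_weights, opt)) for opt in options]
        probs = softmax(scores)
        # Weighted average of the candidate block embeddings at this position.
        mixed.append([sum(p * opt[i] for p, opt in zip(probs, options)) for i in range(dim)])
    # Downsample by mean-pooling consecutive positions.
    return [mean(mixed[i:i + downsample_rate]) for i in range(0, len(mixed), downsample_rate)]


chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 toy character embeddings
latent = gbst(chars, max_block_size=2, score_weights=[0.0, 0.0], downsample_rate=2)
print(latent)
```

Since the scoring weights receive gradients like any other parameter, the block-size choice is learned together with the rest of the model rather than fixed in pre-processing.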

9. 📖 Paper: Trans-Encoder: Unsupervised sentence-pair modelling through self- and mutual-distillations 🧩

This work combines bi-encoders and cross-encoders on top of any pre-trained language model (PLM) with distillation strategies for sentence similarity tasks. The process of Trans-Encoder can be summarized as follows (see the Figure below):

1. Convert the PLM into a bi-encoder via contrastive learning methods (SimCLR framework (Chen et al. 2020)).

2. Employ the bi-encoder to generate self-labeled silver data via similarity scores over unlabelled data. This silver data is then used to train the cross-encoder, initialized from the PLM.

3. After step 2, we have a fully trained cross-encoder. We repeat the process of step 2, but this time the cross-encoder labels the data, and we train the bi-encoder further, in a loop.

Figure. Trans-Encoder architecture. Figure reproduced from the paper.

As shown in the Figure above, two loss functions are used: (1) Mean Squared Error for the cross-to-bi encoder distillation, and (2) Binary Cross-Entropy for the bi-to-cross distillation. The main reason for the different losses is that the cross-encoder is not suited for sentence embedding tasks and would overfit the data generated by the bi-encoder.
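The two distillation losses can be written out in a few lines. This is only a sketch with made-up similarity scores: the function names `mse_loss` and `soft_bce_loss` and the numbers are my own, and the real models are replaced by fixed lists of predictions.

```python
import math


def mse_loss(predictions, targets):
    """Mean squared error, used here for cross-to-bi distillation."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def soft_bce_loss(predictions, targets, eps=1e-12):
    """Binary cross-entropy against soft labels in [0, 1], for bi-to-cross distillation."""
    total = 0.0
    for p, t in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(targets)


bi_scores = [0.9, 0.2, 0.6]    # hypothetical silver labels from the bi-encoder
cross_preds = [0.8, 0.3, 0.6]  # hypothetical cross-encoder predictions

bce = soft_bce_loss(cross_preds, bi_scores)  # cross-encoder learns from bi-encoder labels
mse = mse_loss(bi_scores, cross_preds)       # bi-encoder learns from cross-encoder labels
print(round(bce, 4), round(mse, 4))
```

In the full loop, the roles of teacher and student alternate, so each loss is applied in turn as the silver labels are regenerated.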

Other interesting papers:

📖 Paper: Language modeling via stochastic processes 🧩

🖼️ Paper: Learning Strides in Convolutional Neural Networks 🧩

🖼️ Paper: Image BERT Pre-training with Online Tokenizer 🧩

🖼️ Paper: ViTGAN: Training GANs with Vision Transformers 🧩?


Please refer to the original paper for the full references, figures, and formulas.


