In this blog post, I will share with you some insight and lessons learned from our recent research idea that should work in theory (i.e., BERT+GloVe), but in practice, it doesn’t work in our scenario. Recent state-of-the-art progress in pre-trained vision and language and image captioning models relies heavily on…