Multimodal Recurrent Neural Networks
m-RNN models the probability distribution of generating a word given the previous words and an image. In its framework, a deep recurrent neural network for sentences and a deep convolutional network for images interact with each other in a multimodal layer, as shown in the figure below.
Instead of concatenating them, m-RNN re-maps the activation of the last recurrent layer and adds it to the current word representation (the part executed in the red box). The multimodal layer accepts three inputs: the word embedding layer II, the recurrent layer, and the image representation (from the 7th layer of AlexNet or the 15th layer of VGGNet). These three inputs are re-mapped into the same multimodal feature space, added together element-wise, and passed through a scaled hyperbolic tangent activation for fast training. Finally, the probability distribution over the next word is produced by a softmax layer.
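The following is a minimal PyTorch sketch of such a multimodal layer, not the authors' implementation: the layer sizes (`embed_dim`, `rnn_dim`, `img_dim`, `mm_dim`, `vocab_size`) and the particular scaled-tanh constants are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    def __init__(self, embed_dim=256, rnn_dim=256, img_dim=4096,
                 mm_dim=512, vocab_size=10000):
        super().__init__()
        # Re-map the three inputs into a shared multimodal feature space.
        self.map_word = nn.Linear(embed_dim, mm_dim)
        self.map_rnn = nn.Linear(rnn_dim, mm_dim)
        self.map_img = nn.Linear(img_dim, mm_dim)
        self.out = nn.Linear(mm_dim, vocab_size)

    def forward(self, word_emb, rnn_state, img_feat):
        # Add the re-mapped inputs element-wise, then apply a scaled
        # hyperbolic tangent (constants assumed here: 1.7159 * tanh(2x/3)).
        m = (self.map_word(word_emb)
             + self.map_rnn(rnn_state)
             + self.map_img(img_feat))
        m = 1.7159 * torch.tanh((2.0 / 3.0) * m)
        # Softmax over the vocabulary gives P(next word | previous words, image).
        return torch.log_softmax(self.out(m), dim=-1)
```

In this sketch the image feature would come from a pretrained CNN (e.g. a 4096-dimensional AlexNet fc7 vector) and would stay fixed across all time steps of the caption.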
m-RNN achieved record-breaking results in image captioning. However, the ReLU activation only slightly mitigates gradient vanishing and exploding, so the long-term dependency problem still calls for better solutions. Moreover, m-RNN is trained with maximum likelihood estimation (MLE) and therefore also suffers from exposure bias.
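To make the exposure-bias point concrete, below is a hedged sketch of the standard MLE (teacher-forcing) training step such caption models use; `model`, `img_feat`, and `caption` are placeholder names, not part of the original work.

```python
import torch.nn.functional as F

def mle_step(model, img_feat, caption):
    # caption: (batch, T) gold word indices. Teacher forcing feeds the
    # ground-truth previous word at every step; at test time the model must
    # instead condition on its own (possibly wrong) predictions, which is
    # the mismatch known as exposure bias.
    inputs, targets = caption[:, :-1], caption[:, 1:]
    log_probs = model(img_feat, inputs)          # (batch, T-1, vocab)
    loss = F.nll_loss(log_probs.reshape(-1, log_probs.size(-1)),
                      targets.reshape(-1))
    return loss
```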