Image Captioning

Renjith Ms
3 min read · Apr 5, 2018


To develop an image captioning model, there are three parts to consider:

1) extracting image features for use in the model,

2) training the model on those features, and

3) using the trained model to generate caption text when given an input image's features.

For example, we can use two different techniques:

  1. a Visual Geometry Group network (VGG) for feature extraction
  2. a Recurrent Neural Network (RNN) to train on those features and generate caption text.

Step 1 -

Using the pre-trained VGG model, the image is read in, resized to 224×224, and fed into the VGG network, where the features are extracted as a NumPy array. Since the VGG network was built for image classification, instead of taking the output of the last layer, we take the output of the fc2 layer (the second fully connected layer), which contains the feature data of the image.
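A minimal sketch of this step using Keras' pre-trained VGG16 might look like the following (the helper name `extract_features` is just for illustration):

```python
# Step 1 sketch: extract fc2 features from a pre-trained VGG16.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet')  # full classification network
# Cut the network at the fc2 layer to get 4096-d feature vectors.
feature_extractor = Model(inputs=base.input,
                          outputs=base.get_layer('fc2').output)

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(224, 224))  # resize to 224x224
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)                  # VGG-specific preprocessing
    return feature_extractor.predict(x)[0]   # NumPy array of shape (4096,)
```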

Step 2 -

For captioning, we use Keras to create a single LSTM cell with 256 neurons. This cell has four inputs: the image features, the caption so far, a mask, and the current position. First, the caption input and the position input are merged (concatenated) and passed through a word embedding layer. Then the image features and the embedded words are merged, again by concatenation, with the mask input. Together they all go through the LSTM cell. The output of the LSTM cell then passes through Dropout and Batch Normalization layers to prevent the model from overfitting. Finally, a softmax Activation layer is applied, and we get the result.

LSTM model
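A simplified Keras sketch of a model along these lines is shown below. It is a merge-style captioning model only; the mask and position inputs described above are omitted for brevity, and the vocabulary size, caption length, and layer sizes are assumed values, not the article's actual settings.

```python
# Step 2 sketch: image features + partial caption -> next-word probabilities.
from tensorflow.keras.layers import (Input, Dense, Embedding, LSTM, Dropout,
                                     BatchNormalization, Concatenate, RepeatVector)
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 40, 4096   # assumed sizes

# Image branch: project the 4096-d VGG features and repeat them per time step.
img_in = Input(shape=(feat_dim,))
img_proj = Dense(256, activation='relu')(img_in)
img_seq = RepeatVector(max_len)(img_proj)

# Caption branch: embed the partial caption (word indices).
cap_in = Input(shape=(max_len,))
cap_emb = Embedding(vocab_size, 256)(cap_in)

# Merge both branches and run them through a single 256-unit LSTM.
merged = Concatenate()([img_seq, cap_emb])
x = LSTM(256)(merged)
x = Dropout(0.5)(x)
x = BatchNormalization()(x)
out = Dense(vocab_size, activation='softmax')(x)   # one probability per word

model = Model(inputs=[img_in, cap_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```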

The result is a vector in which each entry represents the probability of a word in the dictionary. The word with the largest probability becomes our current “best word”. Together with the pre-built dictionary, this vector is used to “interpret” the generated next word, which can be compared against the corresponding word of the true caption as ground truth during training. The mask plays an important role in all of this: it “records” the previous words of the caption so that the model knows which words come before the current one. The model is also given the current position in the sentence so that it does not fall into a loop.
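Interpreting the output vector amounts to an argmax over the vocabulary followed by a dictionary lookup. A small sketch, where `idx_to_word` is an assumed index-to-word dictionary built from the training captions:

```python
# Turn the softmax output into the current "best word".
import numpy as np

def next_word(probs, idx_to_word):
    best_idx = int(np.argmax(probs))   # entry with the largest probability
    return idx_to_word[best_idx]
```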

Captioning process to generate ‘a man in a bike down a dirt road’.

As in training, we also need the features of each image to be captioned, so the images first go through the VGG16 network to generate their features. For captioning, we use the same LSTM model. The first word input to this model is the ‘#start#’ tag, and each following input is the prediction from the previous iteration.
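A sketch of that generation loop is below. The `word_to_idx` / `idx_to_word` dictionaries, the `'#end#'` stop tag, and `max_len` are assumptions for illustration; the loop simply feeds each predicted word back in as the next input.

```python
# Caption generation sketch: start from '#start#' and feed predictions back in.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, img_features, word_to_idx, idx_to_word, max_len=40):
    words = ['#start#']
    for _ in range(max_len):
        seq = [word_to_idx[w] for w in words if w in word_to_idx]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([np.array([img_features]), seq])[0]
        word = idx_to_word[int(np.argmax(probs))]
        if word == '#end#':                 # assumed end-of-caption tag
            break
        words.append(word)
    return ' '.join(words[1:])              # drop the '#start#' tag
```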

