Image Captioning Studies

The aim of this project is to create a test bed for implementing and evaluating various state-of-the-art image captioning algorithms (visual, multimodal, attention-based, etc.). For a comprehensive review of the field up to circa 2018, please refer to this paper. We use the TensorFlow tutorial on image captioning as the starting point for code development.

Baseline model

Please refer to the Jupyter notebook in this repository for further details on the baseline model; here we only briefly summarize its key features. A minimal sketch of the main components follows the list below.

  • Model: Attention-based supervised learning similar to Xu et al.
    • Image encoder: features from the last convolutional layer of InceptionV3 pre-trained on ImageNet
    • Language model: Gated recurrent unit (GRU) with 512 units
    • Attention model: Bahdanau-style (additive) attention
  • Loss: SparseCategoricalCrossentropy
  • Dataset: 6,000 images with annotations from MS-COCO
  • Training/validation split: 80/20
  • Batch size: 64
  • Epochs: 10
  • Activations:
    • Images: ReLU
    • Language: tanh
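
For concreteness, the sketch below shows roughly how these pieces fit together, following the structure of the TensorFlow image-captioning tutorial that the baseline is derived from. It is illustrative only: class names and parameters such as embedding_dim, units, and vocab_size are assumptions for this sketch, not code quoted from the notebook.

import tensorflow as tf

class BahdanauAttention(tf.keras.Model):
    # Additive (Bahdanau-style) attention over the encoded image features
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features: (batch, num_regions, embedding_dim); hidden: (batch, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)
        return context_vector, attention_weights

class CNN_Encoder(tf.keras.Model):
    # Projects pre-extracted InceptionV3 features through a ReLU-activated dense layer
    def __init__(self, embedding_dim):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        return tf.nn.relu(self.fc(x))

class RNN_Decoder(tf.keras.Model):
    # Word embedding + GRU language model + attention, as in the baseline
    def __init__(self, embedding_dim, units, vocab_size):
        super().__init__()
        self.units = units  # 512 in the baseline
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,  # tanh activation by default
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(self.units)

The SparseCategoricalCrossentropy loss is then computed between the decoder's vocabulary logits and the target caption tokens at each step.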

Case 1: Comparison of GRU and Simple (Elman) RNN

  1. We made the following changes with respect to the baseline (a sketch of how the replacement slots into the decoder's call method follows this list):
self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

was replaced by

self.myRNN = tf.keras.layers.SimpleRNN(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

and

 output, state = self.gru(x)

by

 output, state = self.myRNN(x)
  2. We also turned off random shuffling of training samples before running the two models, to ensure identical training sets, i.e.,
#random.shuffle(img_paths)
#random.shuffle(img_keys)
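
The swap works as a drop-in because SimpleRNN, constructed with return_sequences=True and return_state=True, returns an (output, state) pair just like the GRU, so the rest of the decoder's call method is unchanged. Below is a rough sketch of that method in the style of the TensorFlow tutorial; helper layers such as fc1 and fc2 and the exact reshape are assumptions, not code quoted from the notebook.

def call(self, x, features, hidden):
    # Attend over the encoded image features using the previous decoder state
    context_vector, attention_weights = self.attention(features, hidden)
    # Embed the previous word token and prepend the attention context
    x = self.embedding(x)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
    # SimpleRNN returns (sequence_output, final_state), matching the GRU it replaces
    output, state = self.myRNN(x)
    # Project the RNN output to vocabulary logits
    x = self.fc1(output)
    x = tf.reshape(x, (-1, x.shape[2]))
    x = self.fc2(x)
    return x, state, attention_weights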

Results

The plot below compares the training loss by epoch for the simple RNN and the GRU. We note that by 10 epochs both models converge to approximately the same loss.

[Figure: training-loss comparison for simple RNN vs. GRU]
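
For reference, a minimal sketch of how such a comparison plot can be generated, assuming the per-epoch training losses from the two runs have been collected into Python lists (the names loss_gru, loss_rnn, and plot_loss_comparison are illustrative, not from the notebook):

import matplotlib.pyplot as plt

def plot_loss_comparison(loss_gru, loss_rnn, out_path='loss_comparison.png'):
    # loss_gru / loss_rnn: per-epoch training losses from the two runs
    epochs = range(1, len(loss_gru) + 1)
    plt.plot(epochs, loss_gru, marker='o', label='GRU')
    plt.plot(epochs, loss_rnn, marker='s', label='Simple RNN')
    plt.xlabel('Epoch')
    plt.ylabel('Training loss')
    plt.legend()
    plt.savefig(out_path)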

