
Introduction to Image Caption Generation using the Avenger’s Infinity War Characters

Deep learning can be a daunting field for beginners. And it was no different for me: most of the algorithms and terms sounded like they came from another world! I needed a way to understand the concepts from scratch in order to figure out how things actually work. And lo and behold, I found an interesting way to learn deep learning concepts.

The idea is pretty simple. To understand any deep learning concept, imagine this:

The mind of a newborn baby is capable of performing a trillion calculations. All you need is time (epochs) and nurture (algorithms) to make it understand a “thing” (the problem at hand). I personally call this the babifying technique.

This intuition inherently works because neural networks were inspired by the human brain in the first place. So, re-engineering the problem should definitely work! Let me explain that with an example.

What if we trained our model on images of American culture, and later asked it to predict labels for images of a traditional Indian folk dance?

Apply the re-engineering idea to the question. It would be akin to imagining a kid who was brought up in the USA and is visiting India on vacation. Guess what label an American kid would predict for this image? Keep that in mind before scrolling further.

Guess the caption?

This image is full of traditional Indian attire.

What caption would a kid born in America give it? Or a model that has only been exposed to an American dataset?

From my experiments, the model predicted the following caption:

A Man Wearing A Hat And A Tie

It might sound funny if you’re familiar with Indian culture, but that’s algorithmic bias for you. Image caption generation works in a similar manner. An image captioning model is built from two main components.

Understanding Image Caption Generation

The first is an image-based model that extracts the features of the image, and the other is a language-based model that translates the features and objects produced by the image-based model into a natural-language sentence.

In this article, we will use a pretrained CNN that was trained on the ImageNet dataset. Every image is transformed to a standard resolution of 224 x 224 x 3, so the model receives a fixed-size input regardless of the original image.

The condensed feature vector is produced by a convolutional neural network (CNN). In technical terms, this feature vector is called an embedding, and the CNN model is referred to as the encoder. In the next stage, these embeddings from the CNN are fed as input to an LSTM network, the decoder.
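To make this concrete, here is a minimal sketch of what such an encoder could look like, assuming a torchvision ResNet-152 backbone whose classification head is replaced by a linear projection to the embedding size. The class name SimpleEncoderCNN is purely illustrative and is not the repository's exact EncoderCNN, though the idea is the same:

import torch
import torch.nn as nn
import torchvision.models as models

class SimpleEncoderCNN(nn.Module):  # illustrative name, not the repo's class
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet152(pretrained=True)    # ImageNet-pretrained backbone
        modules = list(resnet.children())[:-1]        # drop the final classification layer
        self.resnet = nn.Sequential(*modules)
        self.linear = nn.Linear(resnet.fc.in_features, embed_size)  # project to embedding

    def forward(self, images):
        with torch.no_grad():                         # keep the pretrained features frozen
            features = self.resnet(images)            # (batch, 2048, 1, 1)
        features = features.reshape(features.size(0), -1)
        return self.linear(features)                  # (batch, embed_size)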

As a language model over sentences, the LSTM predicts the next word in a sequence. Given the initial embedding of the image, the LSTM is trained to predict the most probable next value of the sequence. It’s just like showing a person a series of pictures and asking them to remember the details, then showing them a new image with similar content and asking them to recall what they saw. This “recall” and “remember” job is done by our LSTM network.
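A hedged sketch of that decoding loop: starting from the image embedding, an LSTM emits one word id per step, and each predicted word is embedded and fed back in as the next input. The class below is for illustration only and is not the repository's exact DecoderRNN:

import torch
import torch.nn as nn

class SimpleDecoderRNN(nn.Module):  # illustrative decoder, not the repo's class
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def sample(self, features, max_len=20):
        """Greedy decoding: pick the most probable word at every step."""
        sampled_ids = []
        inputs = features.unsqueeze(1)   # (batch, 1, embed_size); the image acts as the first "word"
        states = None
        for _ in range(max_len):
            hiddens, states = self.lstm(inputs, states)   # (batch, 1, hidden_size)
            outputs = self.linear(hiddens.squeeze(1))     # (batch, vocab_size)
            predicted = outputs.argmax(1)                 # most probable next word id
            sampled_ids.append(predicted)
            inputs = self.embed(predicted).unsqueeze(1)   # feed the prediction back in
        return torch.stack(sampled_ids, 1)                # (batch, max_len)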

Technically, we also insert <start> and <end> tokens to mark the beginning and end of the caption.

['<start>', 'A', 'man', 'is', 'holding', 'a', 'stone', '<end>']
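Under the hood, these tokens are mapped to integer ids through a vocabulary wrapper before they ever reach the LSTM. A rough sketch of that lookup, using a tiny hand-made word-to-index dictionary (the real vocabulary is built from the COCO captions by build_vocab.py):

# Toy vocabulary for illustration; the real one is built from the training captions
word2idx = {'<pad>': 0, '<start>': 1, '<end>': 2, '<unk>': 3,
            'a': 4, 'man': 5, 'is': 6, 'holding': 7, 'stone': 8}

tokens = ['<start>', 'a', 'man', 'is', 'holding', 'a', 'stone', '<end>']
caption_ids = [word2idx.get(w, word2idx['<unk>']) for w in tokens]
print(caption_ids)  # [1, 4, 5, 6, 7, 4, 8, 2]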

This way, the model learns from many image-caption pairs and can finally predict captions for unseen images. To dig deeper, I highly recommend reading up on encoder-decoder architectures for image captioning.

Prerequisites

To replicate the results of this article, you’ll need to install the prerequisites. Make sure you have Anaconda installed. If you want to train the model from scratch, follow the steps below; otherwise, skip ahead to the Pretrained model section.

git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI/
make
python setup.py build
python setup.py install
cd ../../
git clone https://github.com/yunjey/pytorch-tutorial.git
cd pytorch-tutorial/tutorials/03-advanced/image_captioning/
pip install -r requirements.txt
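If the build succeeded, the COCO API should now be importable from Python. A quick sanity check; the annotation file path below is just a placeholder for wherever you downloaded the COCO captions:

from pycocotools.coco import COCO

# Placeholder path -- point this at your downloaded COCO captions file
coco = COCO('data/annotations/captions_train2014.json')
print(len(coco.anns), 'captions loaded')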

Pretrained model

You can download the pretrained model from here and the vocabulary file from here. You should extract pretrained_model.zip to ./models/ and vocab.pkl to ./data/ using the unzip command.

Now that you have the model ready, you can predict the captions using:

$ python sample.py --image='png/example.png'

The original repository exposes the code through a command-line interface, so you need to pass arguments to the Python script. To make it more intuitive, I have written a few handy functions so we can use the model from a Jupyter Notebook.

Let’s begin! Import all the libraries and make sure the notebook is in the root folder of the repository:

import torch
import matplotlib.pyplot as plt
import numpy as np
import argparse
import pickle
import os
from torchvision import transforms
from build_vocab import Vocabulary
from model import EncoderCNN, DecoderRNN
from PIL import Image

Add this configuration snippet and the load_image function to the notebook:

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Function to load and resize the image
def load_image(image_path, transform=None):
    image = Image.open(image_path)
    image = image.resize([224, 224], Image.LANCZOS)
    if transform is not None:
        image = transform(image).unsqueeze(0)
    return image
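A quick way to check that the function does what we expect, reusing the png/example.png path from the sample.py command above (any RGB image will do):

# Sanity check: the loaded tensor should be a single 224 x 224 RGB image
transform = transforms.Compose([transforms.ToTensor()])
img = load_image('png/example.png', transform)
print(img.shape)  # torch.Size([1, 3, 224, 224])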

Next, hard-code the constants with the pretrained model’s parameters. Note that these values should not be modified: the pretrained model was trained with exactly these settings. Change them only if you are training your own model from scratch.

# MODEL DIRS
ENCODER_PATH = './models/encoder-5-3000.pkl'
DECODER_PATH = './models/decoder-5-3000.pkl'
VOCAB_PATH = 'data/vocab.pkl'

# CONSTANTS
EMBED_SIZE = 256
HIDDEN_SIZE = 512
NUM_LAYERS = 1

Now, write a PyTorch function that uses the pretrained files to predict the caption:

def PretrainedResNet(image_path, encoder_path=ENCODER_PATH,
                     decoder_path=DECODER_PATH,
                     vocab_path=VOCAB_PATH,
                     embed_size=EMBED_SIZE,
                     hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS):
    # Image preprocessing
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406),
                             (0.229, 0.224, 0.225))])

    # Load vocabulary wrapper
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)

    # Build models
    encoder = EncoderCNN(embed_size).eval()  # eval mode (batchnorm uses moving mean/variance)
    decoder = DecoderRNN(embed_size, hidden_size, len(vocab), num_layers)
    encoder = encoder.to(device)
    decoder = decoder.to(device)

    # Load the trained model parameters
    encoder.load_state_dict(torch.load(encoder_path))
    decoder.load_state_dict(torch.load(decoder_path))

    # Prepare an image
    image = load_image(image_path, transform)
    image_tensor = image.to(device)

    # Generate a caption from the image
    feature = encoder(image_tensor)
    sampled_ids = decoder.sample(feature)
    sampled_ids = sampled_ids[0].cpu().numpy()  # (1, max_seq_length) -> (max_seq_length)

    # Convert word_ids to words
    sampled_caption = []
    for word_id in sampled_ids:
        word = vocab.idx2word[word_id]
        sampled_caption.append(word)
        if word == '<end>':
            break
    sentence = ' '.join(sampled_caption)[8:-5].title()

    # Return the generated caption along with the original image
    image = Image.open(image_path)
    return sentence, image
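The only non-obvious step above is the [8:-5] slice: it strips the leading '<start> ' (8 characters) and the trailing '<end>' (5 characters) from the joined caption before title-casing it. A tiny illustration:

raw = '<start> a man is holding a stone <end>'
print(raw[8:-5].title())  # 'A Man Is Holding A Stone '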

To predict the caption for an image, use:

plt.figure(figsize=(12, 12))
predicted_label, image = PretrainedResNet(image_path='IMAGE_PATH')
plt.imshow(image)
print(predicted_label)
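Since we are about to caption several movie stills, a small helper loop keeps things tidy. The png/ folder name is just an assumption about where the stills are saved:

import glob

# Caption every PNG in the png/ folder (assumed location of the saved stills)
for path in sorted(glob.glob('png/*.png')):
    caption, img = PretrainedResNet(image_path=path)
    plt.figure(figsize=(8, 8))
    plt.imshow(img)
    plt.title(caption)
    plt.axis('off')
plt.show()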

We had Hulk. Now we have ML!

Let us get started with producing captions on some scenes from Avenger’s Infinity War, and see how well it generalizes!

Test Image: Mark I

Have a look at the image shown below:

<HOLD A CAPTION IN YOUR MIND>