
How to Develop a Deep Learning Photo Caption Generator from Scratch

Develop a Deep Learning Model to Automatically Describe Photographs in Python with Keras, Step-by-Step.

Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.

It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn the understanding of the image into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is a single end-to-end model can be defined to predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

In this tutorial, you will discover how to develop a photo captioning deep learning model from scratch.

After completing this tutorial, you will know:

  • How to prepare photo and text data for training a deep learning model.
  • How to design and train a deep learning caption generation model.
  • How to evaluate a trained caption generation model and use it to caption entirely new photographs.

Let’s get started.

  • Update Nov/2017: Added note about a bug introduced in Keras 2.1.0 and 2.1.1 that impacts the code in this tutorial.
  • Update Dec/2017: Updated a typo in the function name when explaining how to save descriptions to file, thanks Minel.
  • Update Apr/2018: Added a new section that shows how to train the model using progressive loading for workstations with minimum RAM.
How to Develop a Deep Learning Caption Generation Model in Python from Scratch
Photo by Living in Monrovia, some rights reserved.

Tutorial Overview

This tutorial is divided into 7 parts; they are:

  1. Photo and Caption Dataset
  2. Prepare Photo Data
  3. Prepare Text Data
  4. Develop Deep Learning Model
  5. Train With Progressive Loading (NEW)
  6. Evaluate Model
  7. Generate New Captions

Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.

You must have Keras (2.1.5 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.
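As a quick sanity check of your environment, the following snippet (a minimal sketch) prints the installed versions of these libraries:

import keras
import sklearn
import pandas
import numpy
import matplotlib
print('keras: %s' % keras.__version__)
print('scikit-learn: %s' % sklearn.__version__)
print('pandas: %s' % pandas.__version__)
print('numpy: %s' % numpy.__version__)
print('matplotlib: %s' % matplotlib.__version__)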

If you need help with your environment, see this tutorial:

I recommend running the code on a system with a GPU. You can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

Let’s dive in.


Photo and Caption Dataset

A good dataset to use when getting started with image captioning is the Flickr8K dataset.

The reason is that it is realistic and relatively small, so you can download it and build models on your workstation using a CPU.

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to them for you, but the email expressly requests: “Please do not redistribute the dataset“.

You can use the link below to request the dataset:

Within a short time, you will receive an email that contains links to two files:

  • Flickr8k_Dataset.zip (1 Gigabyte): An archive of all photographs.
  • Flickr8k_text.zip (2.2 Megabytes): An archive of all text descriptions for the photographs.

Download the datasets and unzip them into your current working directory. You will have two directories:

  • Flicker8k_Dataset: Contains 8092 photographs in JPEG format.
  • Flickr8k_text: Contains a number of files containing different sources of descriptions for the photographs.

The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).
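These splits are defined by plain text files in the Flickr8k_text directory, one photo filename per line. A minimal sketch for reading the training identifiers (assuming the standard Flickr_8k.trainImages.txt filename) might look like this:

# load the photo identifiers for a pre-defined split (sketch)
def load_split(filename):
    with open(filename, 'r') as f:
        lines = f.read().split('\n')
    # drop the .jpg extension to get bare identifiers, skipping blank lines
    return set(line.split('.')[0] for line in lines if len(line) > 0)

train = load_split('Flickr8k_text/Flickr_8k.trainImages.txt')
print('Train: %d' % len(train))  # expect 6,000 identifiers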

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ball-park BLEU scores for skillful models when evaluated on the test dataset (taken from the 2017 paper “Where to put the Image in an Image Caption Generator“):

  • BLEU-1: 0.401 to 0.578.
  • BLEU-2: 0.176 to 0.390.
  • BLEU-3: 0.099 to 0.260.
  • BLEU-4: 0.059 to 0.170.

We will describe the BLEU metric in more detail later, when we work on evaluating our model; the short sketch below gives a taste of how these scores are computed.
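For illustration, here is a minimal sketch using NLTK's corpus_bleu function; the captions here are toy data, not from the dataset:

from nltk.translate.bleu_score import corpus_bleu

# toy example: one generated caption scored against two reference captions
references = [[['a', 'dog', 'runs', 'on', 'the', 'beach'],
               ['a', 'dog', 'is', 'running', 'along', 'the', 'sand']]]
candidate = [['a', 'dog', 'runs', 'along', 'the', 'beach']]
# the weights select the n-gram orders that contribute to the score
print('BLEU-1: %f' % corpus_bleu(references, candidate, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))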

Next, let’s look at how to load the images.

Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos.

There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that achieved top results in the 2014 ImageNet competition. Learn more about the model here:

Keras provides this pre-trained model directly. Note, the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.

We could use this model as part of a broader image caption model. The problem is, it is a large model and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the “photo features” using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different from running the photo through the full VGG model; we will simply have done it once in advance.

This is an optimization that will make training our models faster and consume less memory.
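As a sketch of that later loading step (the features are saved to ‘features.pkl‘ below), the cached dictionary can be restored with pickle and filtered to the photos in a given split; the load_photo_features name here is illustrative:

from pickle import load

# load pre-computed photo features for a given set of photo identifiers (sketch)
def load_photo_features(filename, dataset):
    # load the full dictionary of photo features
    all_features = load(open(filename, 'rb'))
    # keep only the entries for photos in this split
    return {k: all_features[k] for k in dataset}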

We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the “features” that the model has extracted from the photo.

Keras also provides tools for reshaping the loaded photo to the preferred input size for the model (e.g. a 3-channel, 224 x 224 pixel image).

Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096 element vector.

The function returns a dictionary of image identifier to image features.

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named ‘features.pkl‘.

The complete example is listed below.

from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

Running this data preparation step may take a while depending on your hardware, perhaps one hour on the CPU with a modern workstation.

At the end of the run, you will have the extracted features stored in ‘features.pkl‘ for later use. This file will be about 127 Megabytes in size.

Prepare Text Data

The dataset contains multiple descriptions for each photograph and the text of the descriptions requires some minimal cleaning.

First, we will load the file containing all of the descriptions.

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)

Each photo has a unique identifier. This identifier is used on the photo filename and in the text file of descriptions.
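For example, the Flickr8k.token.txt file pairs each identifier (suffixed with a caption number) with one description per line, roughly in this form:

1000268201_693b08cb0e.jpg#0  A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1  A girl going into a wooden building .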

Next, we will step through the list of photo descriptions. Below defines a function load_descriptions() that, given the loaded document text, will return a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

  • Convert all words to lowercase.
  • Remove all punctuation.
  • Remove all words that are one character or less in length (e.g. ‘a’).
  • Remove all words with numbers in them.

Below defines the clean_descriptions() function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.

import string

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # remove words that are one character or less in length
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)
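Once defined, the function cleans the loaded descriptions in place; a short usage sketch:

# clean the loaded descriptions in place
clean_descriptions(descriptions)
# spot-check one cleaned description
print(list(descriptions.values())[0][0])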
