
How to Develop a Deep Learning Photo Caption Generator from Scratch

Develop a Deep Learning Model to Automatically Describe Photographs in Python with Keras, Step-by-Step.

Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.

It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn the understanding of the image into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is a single end-to-end model can be defined to predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

In this tutorial, you will discover how to develop a photo captioning deep learning model from scratch.

After completing this tutorial, you will know:

  • How to prepare photo and text data for training a deep learning model.
  • How to design and train a deep learning caption generation model.
  • How to evaluate a trained caption generation model and use it to caption entirely new photographs.

Let’s get started.

  • Update Nov/2017: Added note about a bug introduced in Keras 2.1.0 and 2.1.1 that impacts the code in this tutorial.
  • Update Dec/2017: Updated a typo in the function name when explaining how to save descriptions to file, thanks Minel.
  • Update Apr/2018: Added a new section that shows how to train the model using progressive loading for workstations with minimum RAM.
How to Develop a Deep Learning Caption Generation Model in Python from Scratch
Photo by Living in Monrovia, some rights reserved.

Tutorial Overview

This tutorial is divided into 7 parts; they are:

  1. Photo and Caption Dataset
  2. Prepare Photo Data
  3. Prepare Text Data
  4. Develop Deep Learning Model
  5. Train With Progressive Loading (NEW)
  6. Evaluate Model
  7. Generate New Captions

Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.

You must have Keras (2.1.5 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.
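As a quick sanity check of your environment, the following snippet (a minimal sketch) prints the installed versions of these libraries:

import keras
import sklearn
import pandas
import numpy
import matplotlib
print('keras: %s' % keras.__version__)
print('scikit-learn: %s' % sklearn.__version__)
print('pandas: %s' % pandas.__version__)
print('numpy: %s' % numpy.__version__)
print('matplotlib: %s' % matplotlib.__version__)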

If you need help with your environment, see this tutorial:

I recommend running the code on a system with a GPU. You can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

Let’s dive in.


Photo and Caption Dataset

A good dataset to use when getting started with image captioning is the Flickr8K dataset.

The reason is that it is realistic and relatively small, so you can download it and build models on your workstation using a CPU.

The authors describe the dataset as follows:

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to them for you, but the email expressly requests: “Please do not redistribute the dataset“.

You can use the link below to request the dataset:

Within a short time, you will receive an email that contains links to two files:

  • Flickr8k_Dataset.zip (1 Gigabyte): An archive of all photographs.
  • Flickr8k_text.zip (2.2 Megabytes): An archive of all text descriptions for the photographs.

Download the datasets and unzip them into your current working directory. You will have two directories:

  • Flicker8k_Dataset: Contains 8092 photographs in JPEG format.
  • Flickr8k_text: Contains a number of files containing different sources of descriptions for the photographs.

The dataset has a pre-defined training dataset (6,000 images), development dataset (1,000 images), and test dataset (1,000 images).
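These splits are defined by plain text files in the Flickr8k_text directory, one photo filename per line. A minimal sketch for reading the training identifiers (assuming the standard Flickr_8k.trainImages.txt filename) might look like this:

# load the photo identifiers for a pre-defined split (sketch)
def load_split(filename):
    with open(filename, 'r') as f:
        lines = f.read().split('\n')
    # drop the .jpg extension to get bare identifiers, skipping blank lines
    return set(line.split('.')[0] for line in lines if len(line) > 0)

train = load_split('Flickr8k_text/Flickr_8k.trainImages.txt')
print('Train: %d' % len(train))  # expect 6,000 identifiers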

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ball-park BLEU scores for skillful models when evaluated on the test dataset (taken from the 2017 paper “Where to put the Image in an Image Caption Generator“):

  • BLEU-1: 0.401 to 0.578.
  • BLEU-2: 0.176 to 0.390.
  • BLEU-3: 0.099 to 0.260.
  • BLEU-4: 0.059 to 0.170.

We will describe the BLEU metric in more detail later, when we work on evaluating our model; the short sketch below gives a taste of how these scores are computed.
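For illustration, here is a minimal sketch using NLTK's corpus_bleu function; the captions here are toy data, not from the dataset:

from nltk.translate.bleu_score import corpus_bleu

# toy example: one generated caption scored against two reference captions
references = [[['a', 'dog', 'runs', 'on', 'the', 'beach'],
               ['a', 'dog', 'is', 'running', 'along', 'the', 'sand']]]
candidate = [['a', 'dog', 'runs', 'along', 'the', 'beach']]
# the weights select the n-gram orders that contribute to the score
print('BLEU-1: %f' % corpus_bleu(references, candidate, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))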

Next, let’s look at how to load the images.

Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos.

There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that achieved top results in the 2014 ImageNet competition. Learn more about the model here:

Keras provides this pre-trained model directly. Note, the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.

We could use this model as part of a broader image caption model. The problem is, it is a large model and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the “photo features” using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different from running the photo through the full VGG model; we will simply have done it once in advance.

This is an optimization that will make training our models faster and consume less memory.
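As a sketch of that later loading step (the features are saved to ‘features.pkl‘ below), the cached dictionary can be restored with pickle and filtered to the photos in a given split; the load_photo_features name here is illustrative:

from pickle import load

# load pre-computed photo features for a given set of photo identifiers (sketch)
def load_photo_features(filename, dataset):
    # load the full dictionary of photo features
    all_features = load(open(filename, 'rb'))
    # keep only the entries for photos in this split
    return {k: all_features[k] for k in dataset}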

We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the “features” that the model has extracted from the photo.

Keras also provides tools for reshaping the loaded photo to the preferred input size for the model (e.g. a 3-channel, 224 x 224 pixel image).

Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096 element vector.

The function returns a dictionary of image identifier to image features.

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named ‘features.pkl‘.

The complete example is listed below.

from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
    # load the model
    model = VGG16()
    # re-structure the model
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # summarize
    print(model.summary())
    # extract features from each photo
    features = dict()
    for name in listdir(directory):
        # load an image from file
        filename = directory + '/' + name
        image = load_img(filename, target_size=(224, 224))
        # convert the image pixels to a numpy array
        image = img_to_array(image)
        # reshape data for the model
        image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
        # prepare the image for the VGG model
        image = preprocess_input(image)
        # get features
        feature = model.predict(image, verbose=0)
        # get image id
        image_id = name.split('.')[0]
        # store feature
        features[image_id] = feature
        print('>%s' % name)
    return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

Running this data preparation step may take a while depending on your hardware, perhaps one hour on the CPU with a modern workstation.

At the end of the run, you will have the extracted features stored in ‘features.pkl‘ for later use. This file will be about 127 Megabytes in size.

Prepare Text Data

The dataset contains multiple descriptions for each photograph and the text of the descriptions requires some minimal cleaning.

First, we will load the file containing all of the descriptions.

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)

Each photo has a unique identifier. This identifier is used on the photo filename and in the text file of descriptions.
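For example, the Flickr8k.token.txt file pairs each identifier (suffixed with a caption number) with one description per line, roughly in this form:

1000268201_693b08cb0e.jpg#0  A child in a pink dress is climbing up a set of stairs in an entry way .
1000268201_693b08cb0e.jpg#1  A girl going into a wooden building .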

Next, we will step through the list of photo descriptions. Below defines a function load_descriptions() that, given the loaded document text, will return a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.

# extract descriptions for images
def load_descriptions(doc):
    mapping = dict()
    # process lines
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        if len(line) < 2:
            continue
        # take the first token as the image id, the rest as the description
        image_id, image_desc = tokens[0], tokens[1:]
        # remove filename from image id
        image_id = image_id.split('.')[0]
        # convert description tokens back to string
        image_desc = ' '.join(image_desc)
        # create the list if needed
        if image_id not in mapping:
            mapping[image_id] = list()
        # store description
        mapping[image_id].append(image_desc)
    return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

  • Convert all words to lowercase.
  • Remove all punctuation.
  • Remove all words that are one character or less in length (e.g. ‘a’).
  • Remove all words with numbers in them.

Below defines the clean_descriptions() function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.

import string

def clean_descriptions(descriptions):
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for key, desc_list in descriptions.items():
        for i in range(len(desc_list)):
            desc = desc_list[i]
            # tokenize
            desc = desc.split()
            # convert to lower case
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [w.translate(table) for w in desc]
            # remove words that are one character or less in length
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # store as string
            desc_list[i] = ' '.join(desc)
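Once defined, the function cleans the loaded descriptions in place; a short usage sketch:

# clean the loaded descriptions in place
clean_descriptions(descriptions)
# spot-check one cleaned description
print(list(descriptions.values())[0][0])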
