
A Gentle Introduction to Deep Learning Caption Generation Models

Caption generation is the challenging artificial intelligence problem of generating a human-readable textual description given a photograph.

It requires both image understanding from the domain of computer vision and a language model from the field of natural language processing.

It is important to consider and test multiple ways to frame a given predictive modeling problem and there are indeed many ways to frame the problem of generating captions for photographs.

In this tutorial, you will discover 3 ways that you could frame caption generation and how to develop a model for each.

The three caption generation models we will look at are:

  • Model 1: Generate the Whole Sequence
  • Model 2: Generate Word from Word
  • Model 3: Generate Word from Sequence

We will also review some best practices to consider when preparing data and developing caption generation models in general.

Let’s get started.


Model 1: Generate the Whole Sequence

The first approach involves generating the entire textual description directly from the photograph.

  • Input: Photograph
  • Output: Complete textual description.

This is a one-to-many sequence prediction model that generates the entire output in a one-shot manner.

Model 1 – Generate the Whole Sequence

This model puts a heavy burden on the language model to generate the right words in the right order.

The photograph passes through a feature extraction model such as a model pre-trained on the ImageNet dataset.

A one hot encoding is used for the output sequence, allowing the model to predict the probability distribution of each word in the sequence over the entire vocabulary.

All sequences are padded to the same length. This means that the model is forced to generate multiple “no word” time steps in the output sequence.
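
To make this structure concrete, below is a minimal Keras sketch of Model 1. It assumes the photo features have already been pre-extracted as a 4,096-element vector (e.g. from a pre-trained VGG16 model), and the vocabulary size, sequence length, and layer sizes are placeholder values for illustration, not a definitive configuration.

```python
from keras.models import Model
from keras.layers import Input, Dense, RepeatVector, LSTM, TimeDistributed

vocab_size = 5000   # assumed size of the caption vocabulary
max_length = 25     # assumed length of the padded output sequences

# input: pre-extracted photo feature vector
photo_input = Input(shape=(4096,))
photo_feat = Dense(256, activation='relu')(photo_input)
# feed the same photo representation to every output time step
repeated = RepeatVector(max_length)(photo_feat)
decoded = LSTM(256, return_sequences=True)(repeated)
# output: a probability distribution over the vocabulary at each time step
outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoded)

model = Model(inputs=photo_input, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```

The RepeatVector layer copies the photo representation across the output time steps, and the TimeDistributed wrapper lets the same dense softmax layer predict a word at each step of the one-shot output.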

Testing this method, I found that a very large language model is required, and even then it is hard to get past the model generating the NLP equivalent of persistence, e.g. the same word repeated for the entire length of the output sequence.

Model 2: Generate Word from Word

This is a different approach, where an LSTM-based model generates a prediction of one word at a time, given a photograph and a single word as input.

  • Input 1: Photograph.
  • Input 2: Previously generated word, or start of sequence token.
  • Output: Next word in sequence.

This is a one-to-one sequence prediction model that generates the textual description via recursive calls to the model.

Model 2 – Generate Word From Word

The one word input is either a token to indicate the start of the sequence in the case of the first time the model is called, or is the word generated from the previous time the model was called.

The photograph passes through a feature extraction model such as a model pre-trained on the ImageNet dataset. The input word is integer encoded and passes through a word embedding.

The output word is one hot encoded to allow the model to predict the probabilities of words over the whole vocabulary.

The recursive word generation process is repeated until an end of sequence token is generated.
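
As an illustration, a minimal Keras sketch of Model 2 might look like the following. Again, the photo features are assumed to be pre-extracted as a 4,096-element vector, the previous word is integer encoded, and the vocabulary and layer sizes are placeholder values.

```python
from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, concatenate

vocab_size = 5000   # assumed size of the caption vocabulary

# input 1: pre-extracted photo feature vector
photo_input = Input(shape=(4096,))
photo_feat = Dense(256, activation='relu')(photo_input)

# input 2: the previously generated word (or start-of-sequence token), integer encoded
word_input = Input(shape=(1,))
word_embed = Embedding(vocab_size, 256)(word_input)
word_feat = LSTM(256)(word_embed)

# merge the two representations and predict the next word over the whole vocabulary
merged = concatenate([photo_feat, word_feat])
hidden = Dense(256, activation='relu')(merged)
outputs = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[photo_input, word_input], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```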

Testing this method, I found that the model does generate some good n-gram sequences, but gets caught in a loop repeating the same sequences of words for long descriptions. There is insufficient memory in the model to remember what has been generated previously.

Model 3: Generate Word from Sequence

Given a photograph and a sequence of words already generated for the photograph as input, predict the next word in the description.

  • Input 1: Photograph.
  • Input 2: Previously generated sequence of words, or start of sequence token.
  • Output: Next word in sequence.

This is a many-to-one sequence prediction model that generates a textual description via recursive calls to the model.

Model 3 – Generate Word From Sequence

It is a generalization of the above Model 2 where the input sequence of words gives the model a context for generating the next word in the sequence.

The photograph passes through a feature extraction model such as a model pre-trained on the ImageNet dataset. The photograph may be provided to the model at every time step along with the sequence, or only once at the beginning, which is likely the preferred approach.

The input sequence is padded to a fixed-length and integer encoded to pass through a word embedding.

The output word is one hot encoded to allow the model to predict the probabilities of words over the whole vocabulary.

The recursive word generation process is repeated until an end of sequence token is generated.
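
Below is a minimal Keras sketch of this merge-style Model 3, under the same assumptions as the earlier sketches (pre-extracted 4,096-element photo features, placeholder vocabulary and layer sizes). Note the mask_zero argument on the embedding, which implements the masking of padded input discussed below.

```python
from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, add

vocab_size = 5000   # assumed size of the caption vocabulary
max_length = 25     # assumed length of the padded input sequences

# input 1: pre-extracted photo feature vector, provided once
photo_input = Input(shape=(4096,))
photo_feat = Dense(256, activation='relu')(photo_input)

# input 2: the padded, integer-encoded sequence of words generated so far
seq_input = Input(shape=(max_length,))
# mask_zero=True ignores the zero-valued "no word" time steps
seq_embed = Embedding(vocab_size, 256, mask_zero=True)(seq_input)
seq_feat = LSTM(256)(seq_embed)

# merge the photo and sequence representations, then predict the next word
merged = add([photo_feat, seq_feat])
hidden = Dense(256, activation='relu')(merged)
outputs = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[photo_input, seq_input], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```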

This appears to be the preferred model described in papers on the topic and might be the best structure we have for this type of problem for now.

Testing this method, I found that the model readily generates readable descriptions, the quality of which is often improved by larger models trained for longer. Key to the skill of this model is the masking of padded input sequences. Without masking, the resulting sequences of words are terrible, e.g. the end of sequence token repeated over and over.
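
For completeness, the recursive generation procedure could be sketched as below. It assumes a model with the two-input interface above, a Keras Tokenizer fitted on the training descriptions, a (1, 4096) array of photo features, and hypothetical 'startseq' and 'endseq' tokens wrapped around each training caption.

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_length):
    # seed the process with the start-of-sequence token
    text = 'startseq'
    for _ in range(max_length):
        # integer encode and pad the sequence generated so far
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        # predict a probability distribution over the vocabulary
        yhat = model.predict([photo_features, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None:
            break
        text += ' ' + word
        # stop once the end-of-sequence token is generated
        if word == 'endseq':
            break
    return text
```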

Modeling Best Practices

This section lists some general tips when developing caption generation models.

  • Pre-trained Photo Feature Extraction Model. Use a photo feature extraction model pre-trained on a large dataset like ImageNet. This is called transfer learning. The Oxford Visual Geometry Group (VGG) models that performed so well in the 2014 ImageNet competition are a good start.
  • Pre-trained Word Embedding Model. Use a pre-trained word embedding model with vectors either trained on a very large corpus or trained on your specific text data.
  • Fine Tune Pre-trained Models. Explore making the pre-trained models trainable in your model to see if they can be dialed-in for your specific problem and result in a slight lift in skill.
  • Pre-Processing Text. Pre-process textual descriptions to reduce the vocabulary of words to generate, and in turn, the size of the model.
  • Pre-Processing Photos. Pre-process photos for the photo feature extraction model, and even pre-extract features so the full feature extraction model is not required when training your model (see the VGG16 sketch after this list).
  • Padding Text. Pad input sequences to a fixed length; this is in fact a requirement of vectorizing your input for deep learning libraries.
  • Masking Padding. Use masking on the embedding layer to ignore “no word” time steps, often a zero value when words are integer encoded.
  • Attention. Use attention on the input sequence when generating the output word in order to both achieve better performance and understand where the model is “looking” when each word is being generated.
  • Evaluation. Evaluate the model using standard text translation metrics like BLEU and compare generated descriptions against multiple reference image captions (see the BLEU sketch after this list).
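
To illustrate the photo pre-processing and feature pre-extraction points above, here is a small sketch using the pre-trained VGG16 model bundled with Keras. The second-to-last (fc2) layer provides a 4,096-element feature vector, and the file name is hypothetical.

```python
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.models import Model

# load the pre-trained VGG16 model and keep the 4,096-element fc2 output
# as the photo feature vector, dropping the final classification layer
base = VGG16()
extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

def extract_features(filename):
    # load and pre-process the photo the way VGG16 expects
    image = load_img(filename, target_size=(224, 224))
    image = img_to_array(image)
    image = image.reshape((1,) + image.shape)
    image = preprocess_input(image)
    return extractor.predict(image, verbose=0)

# features = extract_features('example.jpg')  # hypothetical file name
```

Pre-extracting features once for all photos means the large convolutional model does not have to be run during every training epoch.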
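
And to illustrate evaluation with BLEU, the corpus_bleu() function from NLTK can score generated captions against multiple reference captions per photo; the tokenized captions below are toy examples for illustration only.

```python
from nltk.translate.bleu_score import corpus_bleu

# one list of reference captions per photo, plus the generated caption for each photo
references = [[['a', 'dog', 'runs', 'on', 'the', 'beach'],
               ['a', 'dog', 'running', 'across', 'the', 'sand']]]
candidates = [['a', 'dog', 'runs', 'on', 'the', 'sand']]

# cumulative BLEU-1 to BLEU-4 scores, as commonly reported for captioning
print('BLEU-1: %f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
print('BLEU-2: %f' % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
print('BLEU-3: %f' % corpus_bleu(references, candidates, weights=(0.33, 0.33, 0.33, 0)))
print('BLEU-4: %f' % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```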

Do you have your own best practices for developing robust captioning models?
Let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered 3 sequence prediction models that can be used to address the problem of generating human-readable textual descriptions for photographs.

Have you experimented with any of these models?
Share your experiences in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



