Image Caption Walkthrough Series (2): "Show, Attend and Tell_Neural Image Caption"

阿新 • Published: 2018-12-13

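1. Related Work

2. Basic Idea

The paper adds an attention mechanism on top of NIC (the Show and Tell model).

3. Model Structure

Only the LSTM part is changed; everything else is the same as in NIC.

4. Code Analysis

(0) Preprocessing

First, every caption longer than 20 tokens is removed; this is the first filter. A vocabulary is then built (5000 words in the code), and the data is filtered a second time so that only sentences whose words all appear in the vocabulary are kept. The resulting dataset has 361258 image-caption pairs and contains the following parts: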

        self.image_ids = np.array(image_ids)      # id of each image
        self.image_files = np.array(image_files)  # file path of each image
        self.word_idxs = np.array(word_idxs)      # (361258, 20): each caption's words replaced by their vocabulary indices (including punctuation and <start>)
        self.masks = np.array(masks)              # (361258, 20): records the caption length, 1 where there is a word, 0 where there is none
        self.batch_size = batch_size              # batch size used when reading the data
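
The two filtering passes and the padded index/mask arrays described above can be sketched in plain Python. This is only an illustration under stated assumptions: `build_dataset` is a hypothetical helper, captions are assumed to be already tokenized into lists of strings, the vocabulary is assumed to be the most frequent tokens of the captions that survive the first filter, and padding uses index 0 as the text describes.

```python
from collections import Counter

MAX_LEN = 20       # maximum caption length kept by the first filter
VOCAB_SIZE = 5000  # vocabulary size quoted in the text

def build_dataset(captions, max_len=MAX_LEN, vocab_size=VOCAB_SIZE):
    """Filter tokenized captions, then encode them as padded ids plus 0/1 masks.

    Hypothetical helper that mirrors the preprocessing steps in the text.
    """
    # First filter: drop captions longer than max_len tokens.
    short = [c for c in captions if len(c) <= max_len]
    # Vocabulary: the vocab_size most frequent tokens (an assumption).
    counts = Counter(tok for c in short for tok in c)
    word2idx = {tok: i for i, (tok, _) in enumerate(counts.most_common(vocab_size))}
    # Second filter: keep captions whose tokens are all in the vocabulary.
    kept = [c for c in short if all(tok in word2idx for tok in c)]
    word_idxs, masks = [], []
    for c in kept:
        pad = max_len - len(c)
        word_idxs.append([word2idx[tok] for tok in c] + [0] * pad)  # pad ids with 0
        masks.append([1.0] * len(c) + [0.0] * pad)                  # 1 = word, 0 = padding
    return word2idx, word_idxs, masks
```

Each row of `word_idxs` and `masks` has exactly `max_len` entries, which is what gives the (361258, 20) shapes quoted above.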

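(1) First, a CNN (the VGG network) extracts features. The final feature map has shape (batch_size, 14, 14, 512); the 14*14 grid corresponds to 196 regions of the original image, each represented by a 512-dimensional feature. It is reshaped to (batch_size, 196, 512):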

reshaped_conv5_3_feats = tf.reshape(conv5_3_feats,[config.batch_size, 196, 512])
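
To make the region layout concrete, here is a small NumPy sketch of the same reshape. The random array stands in for real conv5_3 activations, and the variable names are illustrative rather than taken from the code base.

```python
import numpy as np

batch_size, grid, channels = 2, 14, 512  # 14 * 14 = 196 regions
conv_feats = np.random.rand(batch_size, grid, grid, channels).astype(np.float32)

# Flatten the spatial grid so that each row holds one region's 512-dim feature.
regions = conv_feats.reshape(batch_size, grid * grid, channels)

# Region k corresponds to grid cell (k // 14, k % 14) in row-major order.
assert np.array_equal(regions[0, 15], conv_feats[0, 1, 1])
```

Because the reshape is row-major, an attention weight for region k can be mapped back to a grid cell, which is what makes attention maps easy to visualize.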

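(2) Build the embedding_matrix

Its shape is (5000, 512): each of the 5000 words in the vocabulary is represented by a 512-dimensional vector, and the matrix is initialized (using a pretrained word_embedding).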
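
An embedding lookup is just row selection from this matrix. A minimal NumPy sketch, assuming a randomly filled stand-in for the real matrix (in TensorFlow the equivalent operation is `tf.nn.embedding_lookup`):

```python
import numpy as np

vocab_size, dim = 5000, 512
rng = np.random.default_rng(0)
# Stand-in values; the real matrix would be loaded from pretrained vectors or learned.
embedding_matrix = rng.normal(scale=0.01, size=(vocab_size, dim)).astype(np.float32)

last_word = np.array([3, 42, 4999])       # one word id per batch element
word_embed = embedding_matrix[last_word]  # row lookup, shape (3, 512)
```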

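(3) Build the RNN

The image features a are no longer used directly. Instead, different regions receive different weights, which yields the context z.

First, the mean of the image features over the 196 regions serves as the initial context. Two fully connected layers then produce the initial memory (c0) and output (o0), which together form the LSTM's initial state.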

            context_mean = tf.reduce_mean(self.conv_feats, axis = 1)        # mean image feature used as the initial context, (batch_size, 512)
            initial_memory, initial_output = self.initialize(context_mean)  # two fully connected layers give the initial memory (c) and output (o)
            initial_state = initial_memory, initial_output                  # initial LSTM state
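
The initialization can be sketched in NumPy as follows. This is an assumption-laden illustration, not the repository's `initialize`: it uses one plain linear layer per output with no activation, and the weight names `W_c`, `b_c`, `W_o`, `b_o` are hypothetical.

```python
import numpy as np

def initialize(context_mean, W_c, b_c, W_o, b_o):
    """Map the mean region feature to an initial (memory, output) pair."""
    memory = context_mean @ W_c + b_c  # initial cell state c0
    output = context_mean @ W_o + b_o  # initial hidden output o0
    return memory, output

rng = np.random.default_rng(0)
conv_feats = rng.random((2, 196, 512))  # stand-in region features
context_mean = conv_feats.mean(axis=1)  # average over the 196 regions -> (2, 512)
W_c = rng.normal(scale=0.01, size=(512, 512))
W_o = rng.normal(scale=0.01, size=(512, 512))
b_c = np.zeros(512)
b_o = np.zeros(512)
initial_state = initialize(context_mean, W_c, b_c, W_o, b_o)
```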

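The input captions have shape (batch_size, max_length), with max_length set to 20 here. Captions shorter than 20 are padded with 0 at the end, and the padding is marked in masks.

For the word at each time step, attention is applied first to compute the weights, which gives the weighted context and the masked weights.

α_t has dimension L = 196 and records how much attention each location of the annotation a receives.

The weights α_t can be obtained by passing the hidden state h from the previous step through several fully connected layers. The encoding e_t stores information from the previous step. (In the original figure, which is not reproduced here, gray marks the modules that have parameters to optimize.)

"Where to look" depends not only on the actual image but also on what has been seen before.

In the first step, the weights are determined entirely by the image features a.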
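
For reference, the soft (deterministic) attention in the Show, Attend and Tell paper can be written as below, where a_i is the feature of region i, h_{t-1} is the previous hidden state, f_att is a small learned scoring network, and L = 196 is the number of regions:

```latex
e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}), \qquad
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}, \qquad
z_t = \sum_{i=1}^{L} \alpha_{ti}\, a_i
```

The softmax guarantees that the weights over the regions are non-negative and sum to 1, so z_t is a weighted average of the region features.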

                alpha = self.attend(contexts, last_output)  # attention weights over the 196 regions, (batch_size, 196)
                context = tf.reduce_sum(contexts*tf.expand_dims(alpha, 2),
                                        axis = 1)  # weighted context, (batch_size, 512)
                if self.is_train:
                    tiled_masks = tf.tile(tf.expand_dims(masks[:, idx], 1),
                                         [1, self.num_ctx])  # (batch_size, 196); masks[:, idx] is the mask of the whole batch at this time step
                    masked_alpha = alpha * tiled_masks   # apply the mask: where the mask is 0, the weights become 0 as well
                    alphas.append(tf.reshape(masked_alpha, [-1]))  # masked_alpha: (batch_size, 196)

The TensorFlow version of the code on GitHub implements the attend step slightly differently, so its details are not given here.
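
As a rough illustration of what an attend step can look like, here is a NumPy sketch of one common additive formulation. It is not the repository's implementation: the projection weights `W_a`, `W_h`, the scoring vector `w`, and the single tanh layer are all assumptions chosen for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(contexts, last_output, W_a, W_h, w):
    """Additive soft attention over image regions (illustrative sketch).

    contexts:    (batch, L, D) region features
    last_output: (batch, H) previous LSTM output
    Returns alpha (batch, L) and the weighted context (batch, D).
    """
    # Project regions and the previous output into a shared space of size K.
    proj = np.tanh(contexts @ W_a + (last_output @ W_h)[:, None, :])  # (batch, L, K)
    scores = proj @ w                                                  # (batch, L)
    alpha = softmax(scores, axis=1)                                    # weights sum to 1 per image
    context = (contexts * alpha[:, :, None]).sum(axis=1)               # (batch, D)
    return alpha, context
```

With all weights set to zero, every region gets the same score, so `alpha` is uniform and the context reduces to the plain mean of the region features.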

The weights of the current time step are stored in a list.

alphas.append(tf.reshape(masked_alpha, [-1]))  # masked_alpha: (batch_size, 196)
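
The effect of the mask on the stored weights can be checked with a toy NumPy example. The shapes are shrunk for readability (4 regions instead of 196, 3 time steps instead of 20), and all values are made up.

```python
import numpy as np

alpha = np.full((2, 4), 0.25)  # uniform attention over 4 regions, batch of 2
masks = np.array([[1.0, 1.0, 0.0],   # caption 0 has only 2 real words
                  [1.0, 1.0, 1.0]])  # caption 1 has 3 real words
idx = 2                              # current time step

# Repeat the per-caption mask value across all regions, then zero out padded steps.
tiled_masks = np.tile(masks[:, idx][:, None], (1, alpha.shape[1]))  # (2, 4)
masked_alpha = alpha * tiled_masks
flat = masked_alpha.reshape(-1)  # same flattening as tf.reshape(..., [-1])
```

At step 2 the first caption is already padding, so its whole row of weights is zeroed, while the second caption's weights are unchanged.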

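The word_embedding and the weighted context are concatenated and used as the input of the current time step, which produces the output and the state.

                current_input = tf.concat([context, word_embed], 1)  # input at this step: weighted context joined with the word embedding, (batch_size, 1024)
                output, state = lstm(current_input, last_state)  # (batch_size, 512)
                memory, _ = state   # the rest is the same as Show and Tell; memory and output are each (batch_size, 512)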
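
The sketch below shows one step of a standard LSTM cell in NumPy, fed with the concatenated input. It is a textbook cell written for illustration, with a hypothetical single weight matrix `W` and bias `b`, and it is not claimed to match the exact TensorFlow cell (for example, it has no forget-gate bias offset or dropout).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, state, W, b):
    """One step of a standard LSTM cell (illustrative sketch).

    x:     (batch, input_dim) current input
    state: (memory, output), each (batch, hidden)
    W:     (input_dim + hidden, 4 * hidden); b: (4 * hidden,)
    """
    memory, output = state
    gates = np.concatenate([x, output], axis=1) @ W + b
    i, j, f, o = np.split(gates, 4, axis=1)  # input gate, candidate, forget gate, output gate
    new_memory = sigmoid(f) * memory + sigmoid(i) * np.tanh(j)
    new_output = sigmoid(o) * np.tanh(new_memory)
    return new_output, (new_memory, new_output)

# Usage with the shapes from the text: 512-dim context + 512-dim embedding -> 1024-dim input.
rng = np.random.default_rng(0)
context = rng.random((2, 512))
word_embed = rng.random((2, 512))
current_input = np.concatenate([context, word_embed], axis=1)  # (2, 1024)
W = rng.normal(scale=0.01, size=(1024 + 512, 4 * 512))
b = np.zeros(4 * 512)
last_state = (np.zeros((2, 512)), np.zeros((2, 512)))
output, state = lstm_step(current_input, last_state, W, b)
```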

The output and the weighted context are then used to compute the probability of the next word and to make a prediction.

                logits = self.decode(expanded_output)  # (batch_size, 5000)
                probs = tf.nn.softmax(logits)
                prediction = tf.argmax(logits, 1)
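
A minimal NumPy sketch of this last step, assuming that the decoder is a single linear layer over `expanded_output` (the real `decode` may stack more layers) and that `W` and `b` are hypothetical weights:

```python
import numpy as np

def decode_step(expanded_output, W, b):
    """Linear decoder followed by softmax and a greedy argmax prediction."""
    logits = expanded_output @ W + b                      # (batch, vocab_size)
    shifted = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)          # each row sums to 1
    prediction = logits.argmax(axis=1)                    # most likely word id per example
    return logits, probs, prediction
```

Taking the argmax of the logits gives the same word as taking the argmax of the probabilities, because softmax preserves the ordering within each row.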

Finally, the output and state of this time step are handed over to the next time step.

                last_output = output
                last_memory = memory
                last_state = state
                last_word = sentences[:, idx]  # move on to the next word
