
Pseudo-3D Residual Networks: Algorithm Notes

This is an ICCV 2017 paper. In video classification and understanding, it is natural to extend the 2D convolutions of the image domain to 3D convolutions. Although 3D convolution extracts spatial and temporal features jointly, its computational cost and model size are both very large. This paper therefore reworks the 3D convolutions used in the video domain and proposes the Pseudo-3D Residual Net (P3D ResNet). The idea is similar in spirit to Inception v3, which replaced a 3*3 convolution with stacked 1*3 and 3*1 convolutions; here a 3*3*3 convolution is replaced by a 1*3*3 convolution and a 3*1*1 convolution. The former captures features along the spatial dimensions and is essentially the same as a 2D convolution, while the latter captures features along the temporal dimension, since the third-to-last dimension of the input is the number of frames. This decomposition greatly reduces computation, whereas with full 3D convolution speed and storage are exactly the bottleneck, which is why a network such as C3D is only 11 layers deep (see Figure 1). The network in this paper can be obtained directly by modifying a 3D ResNet. As a side note, besides 3D convolution, temporal features can also be extracted with an LSTM, which is another active direction in video research.
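As a rough illustration of this decomposition, here is a minimal PyTorch-style sketch (not the official implementation; the channel counts and tensor sizes are only for demonstration):

```python
import torch
import torch.nn as nn

# Input layout: (batch, channels, frames, height, width)
x = torch.randn(1, 64, 16, 56, 56)

# Full 3D convolution: one 3*3*3 kernel over (frames, height, width)
conv_3d = nn.Conv3d(64, 64, kernel_size=(3, 3, 3), padding=(1, 1, 1))

# Pseudo-3D decomposition: spatial 1*3*3 ("S") followed by temporal 3*1*1 ("T")
conv_s = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))  # acts like a 2D conv on every frame
conv_t = nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # mixes information across neighboring frames

print(conv_3d(x).shape)          # torch.Size([1, 64, 16, 56, 56])
print(conv_t(conv_s(x)).shape)   # same shape, but 9 + 3 = 12 weights per channel pair instead of 27
```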

Figure 1 compares several models in depth, model size, and video classification accuracy on the Sports-1M dataset. The P3D ResNet there is obtained by modifying ResNet-152. Its depth is not 152 because after the modification each residual unit contains 3 or 4 convolutional layers instead of the 3 layers of the original ResNet family (see Figure 3 for details), so the final network is 199 layers deep; the network in the official GitHub code is also the 199-layer version. The ResNet-152 entry is obtained by fine-tuning directly on Sports-1M. Although the 199-layer P3D ResNet is somewhat larger than this fine-tuned ResNet-152, its accuracy gain is substantial, and compared with C3D (trained from scratch on Sports-1M) it improves considerably in both accuracy and model size. Speed is another highlight; a detailed speed comparison appears later.


[Figure 1]

Since a 3*3*3 convolution is to be replaced by a 1*3*3 and a 3*1*1 convolution, how to combine the two becomes a design question. Figure 2 shows the three residual structures used in P3D ResNet. S denotes the spatial 2D filters, i.e. the 1*3*3 convolution; T denotes the temporal 1D filters, i.e. the 3*1*1 convolution.
[Figure 2]

Figure 3 details the three residual structures of P3D ResNet and compares them with the residual unit of ResNet. The extra depth of P3D ResNet mainly comes from the P3D-A and P3D-C blocks.

[Figure 3]
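A minimal PyTorch-style sketch of the three variants as described above (P3D-A: S followed by T; P3D-B: S and T in parallel; P3D-C: S in series with T plus a direct path from S to the output). The 1*1*1 bottleneck layers, batch normalization, downsampling and most activations of the real 199-layer network are omitted, so this only illustrates the connection patterns:

```python
import torch.nn as nn
import torch.nn.functional as F

def spatial_conv(c):   # "S": 1*3*3 convolution
    return nn.Conv3d(c, c, kernel_size=(1, 3, 3), padding=(0, 1, 1))

def temporal_conv(c):  # "T": 3*1*1 convolution
    return nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0))

class P3DA(nn.Module):
    """x + T(S(x)): S and T stacked in series."""
    def __init__(self, c):
        super().__init__()
        self.S, self.T = spatial_conv(c), temporal_conv(c)
    def forward(self, x):
        return x + self.T(F.relu(self.S(x)))

class P3DB(nn.Module):
    """x + S(x) + T(x): S and T applied in parallel."""
    def __init__(self, c):
        super().__init__()
        self.S, self.T = spatial_conv(c), temporal_conv(c)
    def forward(self, x):
        return x + self.S(x) + self.T(x)

class P3DC(nn.Module):
    """x + S(x) + T(S(x)): serial S->T plus a direct path from S to the output."""
    def __init__(self, c):
        super().__init__()
        self.S, self.T = spatial_conv(c), temporal_conv(c)
    def forward(self, x):
        s = F.relu(self.S(x))
        return x + s + self.T(s)
```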

Table 1 compares P3D ResNet with baselines in speed and in accuracy on the UCF101 dataset. The ResNet-50 entry is fine-tuned on UCF101 as follows: We set the input as 224 × 224 image which is randomly cropped from the resized 240 × 320 video frame. After fine-tuning ResNet-50, the networks will predict one score for each frame and the video-level prediction score is calculated by averaging all frame-level scores.
P3D-A ResNet, P3D-B ResNet and P3D-C ResNet are trained as follows: The architectures of three P3D ResNet variants are all initialized with ResNet-50 except for the additional temporal convolutions and are further fine-tuned on UCF101. In other words, the 1*3*3 convolutions can be initialized from the original 3*3 convolutions of ResNet-50, but the 3*1*1 convolutions cannot, since ResNet-50 has no kernels of that size; they are therefore randomly initialized and then fine-tuned directly on the video dataset. For each P3D ResNet variant, the dimension of input video clip is set as 16 × 160 × 160 which is randomly cropped from the resized non-overlapped 16-frame clip with the size of 16 × 182 × 242. When training P3D, each batch contains 128 clips, each clip contains 16 frames, and each frame is 160*160, so the input has dimensions 128*3*16*160*160. Why 16 frames? This is mainly tied to the network structure: the code shows four pooling operations that reduce 16 exactly down to 1. At test time, 20 clips are sampled from each video, each clip consisting of 16 frames; this is described in detail later. On the input size, the paper says: Given a video clip with the size of c×l×h×w where c, l, h and w denotes the number of channels, clip length, height and width of each frame, respectively. The clip length is the 16 frames mentioned here.
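A small sketch of that initialization idea, assuming PyTorch layers (the layer names are illustrative, not taken from the official code): the pre-trained 2D 3*3 weights are copied into the 1*3*3 spatial kernel by inserting a temporal dimension of size 1, while the 3*1*1 temporal kernel keeps its random initialization.

```python
import torch
import torch.nn as nn

conv2d = nn.Conv2d(64, 64, kernel_size=3, padding=1)                  # stands in for a pre-trained ResNet-50 layer
conv_s = nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1))  # spatial "S" kernel
conv_t = nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # temporal "T" kernel

with torch.no_grad():
    # (out, in, 3, 3) -> (out, in, 1, 3, 3): reuse the 2D weights for the spatial conv
    conv_s.weight.copy_(conv2d.weight.unsqueeze(2))
    conv_s.bias.copy_(conv2d.bias)
# conv_t has no 2D counterpart and stays randomly initialized, to be learned from video data.

# During training, each mini-batch has shape (128, 3, 16, 160, 160):
# 128 clips, 3 channels, 16 frames, 160*160 pixels per frame.
```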
Table 1 shows that, with only a small increase in model size, the speed improves greatly (9 clip/s is roughly 144 frame/s) and the accuracy gain is also clear. In addition: By additionally pursuing structural diversity, P3D ResNet makes the absolute improvement over P3D-A ResNet, P3D-B ResNet and P3D-C ResNet by 0.5%, 1.4% and 1.2% in accuracy respectively, indicating that enhancing structural diversity with going deep could improve the power of neural networks.
[Table 1]

The final P3D ResNet is built by interleaving the three variants in an alternating order, as shown in Figure 4.
[Figure 4]
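A toy sketch of that alternating arrangement, reusing the P3DA/P3DB/P3DC classes from the earlier sketch (again only illustrative; the real network also changes channel widths and strides and uses bottleneck layers):

```python
import torch
import torch.nn as nn

def make_p3d_stage(num_blocks: int, channels: int) -> nn.Sequential:
    """Chain residual units in the repeating order A -> B -> C."""
    variants = [P3DA, P3DB, P3DC]
    blocks = [variants[i % 3](channels) for i in range(num_blocks)]
    return nn.Sequential(*blocks)

stage = make_p3d_stage(num_blocks=6, channels=64)   # A, B, C, A, B, C
out = stage(torch.randn(1, 64, 16, 56, 56))         # shape preserved: (1, 64, 16, 56, 56)
```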

Table 2 compares results on the Sports-1M dataset, which contains 487 classes and roughly 1.13 million videos. Clip hit@1 is the clip-level top-1 classification accuracy, Video hit@1 is the video-level top-1 accuracy, and Video hit@5 is the video-level top-5 accuracy. In Table 2, Deep Video classifies with an AlexNet-like network; the difference between its Single Frame and Slow Fusion settings is the number of input frames, the latter effectively computing clip- and video-level accuracy from 10 frames and therefore scoring higher. Convolutional Pooling exploits max-pooling over the final convolutional layer of GoogleNet across each clip's frames, that is, it max-pools over 120 frames, so its accuracy is higher but it is obviously much slower. C3D can either be trained from scratch or pre-trained on the I380K dataset and then fine-tuned on Sports-1M. ResNet-152 is fine-tuned and employed on one frame from each clip to produce a clip-level score, meaning its clip-level score is determined by a single frame; the only difference between this ResNet-152 and Deep Video (Single Frame) is the network architecture. The speed of P3D ResNet (199 layers) should be above 2 clip/s, since the paper states that processing each clip takes less than 0.5 s.
To summarize how Table 2 is obtained at test time: We randomly sample 20 clips from each video and adopt a single center crop per clip, which is propagated through the network to obtain a clip-level prediction score. The video-level score is computed by averaging all the clip-level scores of a video. Clip-level accuracy is straightforward: one clip (16 frames) is fed into the trained model, the pool5 layer yields a 2048-dimensional output, and a final fully-connected layer produces a 487-dimensional output (the 487 Sports-1M classes). For video-level accuracy, the 2048-dimensional outputs of all clips from the same video are averaged, and the averaged 2048-dimensional feature is passed through the 487-way fully-connected layer to obtain the output.
[Table 2]
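A hedged sketch of that test-time procedure and of the hit@k metrics (the helper names and the `model` are hypothetical; per the description above, clip-level scores come from the 487-way fully-connected layer on top of pool5):

```python
import torch

def video_prediction(model, clips):
    """clips: 20 center-cropped clips from one video, shape (20, 3, 16, 160, 160)."""
    with torch.no_grad():
        clip_scores = model(clips)          # (20, 487) clip-level scores for the 487 Sports-1M classes
    return clip_scores.mean(dim=0)          # average clip-level scores -> video-level score, shape (487,)

def hit_at_k(video_score, true_label, k):
    """hit@k for one video: 1 if the ground-truth class is among the top-k predictions."""
    topk = video_score.topk(k).indices.tolist()
    return int(true_label in topk)

# Usage (with a trained network standing in for `model`):
# video_score = video_prediction(p3d_resnet, clips)
# hit1, hit5 = hit_at_k(video_score, label, 1), hit_at_k(video_score, label, 5)
```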

Experimental results:
First, the five datasets used in the experiments:
UCF101 and ActivityNet are two of the most popular video action recognition benchmarks.
UCF101 consists of 13,320 videos from 101 action categories. Three training/test splits are provided by the dataset organisers and each split in UCF101 includes about 9.5K training and 3.7K test videos.
The ActivityNet dataset is a large-scale video benchmark for human activity understanding.
The latest released version of the dataset (v1.3) is exploited, which contains 19,994 videos from 200 activity categories. The 19,994 videos are divided into 10,024, 4,926 and 5,044 videos for training, validation and test set, respectively.
ASLAN is a dataset on action similarity labeling task, which is to predict the similarity between videos. The dataset includes 3,697 videos from 432 action categories.
YUPENN and Dynamic Scene are two datasets for the scene recognition scenario. YUPENN comprises 14 scene categories, each containing 30 videos, and Dynamic Scene consists of 13 scene classes with 10 videos per class.

Table 3 compares this paper's method with other methods on the UCF101 dataset. The methods are grouped into three categories: end-to-end CNN architecture with fine-tuning, CNN-based representation extractor + linear SVM, and method fusion with IDT. In the Accuracy column, the numbers in parentheses are obtained with optical flow as an additional input besides the video frames. The P3D ResNet here should be the 199-layer model. IDT is a hand-crafted feature. TSN is an ECCV 2016 method and is among the strongest current approaches. P3D ResNet with only video frames as input even outperforms some networks that take both video frames and optical flow as input, e.g. references 25, 29 and 37. Comparing P3D ResNet with C3D, the advantage of the former is quite clear. In addition, by performing the recent state-of-the-art encoding method [22] on the activations of res5c layer in P3D ResNet, the accuracy can achieve 90.5%, making the improvement over the global representation from pool5 layer in P3D ResNet by 1.9%. For some reason this number is not listed as a result in the table.
[Table 3]

Table 4 compares this paper's method with other methods on the ActivityNet dataset.
[Table 4]

Table 5 compares action similarity results on the ASLAN dataset, whose task is to decide: does a pair of videos present the same action.
[Table 5]
For more experimental results, please refer to the original paper.

The authors also list three directions for future improvement, all well worth trying, especially the third one, i.e. taking both video frames and optical flow as model input, since that approach has brought clear gains for other methods (the parenthesized numbers in Table 3). In the authors' words: Our future works are as follows. First, attention mechanism will be incorporated into our P3D ResNet for further enhancing representation learning. Second, an elaborated study will be conducted on how the performance of P3D ResNet is affected when increasing the frames in each video clip in the training. Third, we will extend P3D ResNet learning to other types of inputs, e.g., optical flow or audio.