Study Notes: Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment

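The network architecture designed in this paper is used in the SeetaFace face recognition engine.

The authors propose a Coarse-to-Fine Auto-encoder Network (CFAN), which cascades several Stacked Auto-encoder Networks (SANs).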

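1. The first stage takes a low-resolution version of the whole detected face as input, so that the first SAN can predict the landmarks quickly and with sufficient accuracy. This is the global SAN.

2. The remaining SANs then refine the result step by step, taking as input local features extracted around the current landmarks (the output of the previous SAN) at increasingly higher resolution. These are the local SANs.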

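In the local SANs, SIFT features are extracted around each landmark.

Each SAN tries to learn a nonlinear mapping from face images at a different scale to the face shape, starting from the shape predicted by the previous SAN.

Using global features as the input to the first SAN avoids the error introduced by starting from the mean shape.

After the first SAN produces the face shape estimate S0, the successive SANs (called local SANs) improve the shape by progressively regressing the deviation ΔS between the current positions and the ground-truth positions.

To capture fine variations, shape-indexed features are extracted from the current shape at higher resolution, which corresponds to a smaller search step and a smaller search region.

The shape-indexed features of all facial landmarks are concatenated so that all landmarks are updated simultaneously. This keeps the result reasonable even under partial occlusion.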
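The cascade described above can be sketched in a few lines of Python. This is only an illustration of the update rule (initial shape S0 from the global SAN, then S ← S + ΔS at each local stage); `cfan_predict`, `global_san`, `local_sans`, and `extract_features` are hypothetical names, and the stages here are toy stand-ins rather than trained networks.

```python
import numpy as np

def cfan_predict(image, global_san, local_sans, extract_features):
    """Run a coarse-to-fine cascade (illustrative sketch, not the authors' code).

    global_san:       callable, image -> initial shape S0, array of length 2p
    local_sans:       list of callables, joint feature vector -> deviation ΔS
    extract_features: callable (image, shape, stage) -> joint feature vector
    """
    shape = global_san(image)                        # coarse estimate S0
    for stage, san in enumerate(local_sans):
        phi = extract_features(image, shape, stage)  # features around current landmarks
        shape = shape + san(phi)                     # refine: S <- S + ΔS
    return shape

# Toy usage: every local stage moves the shape halfway towards a known target.
target = np.array([10.0, 20.0, 30.0, 40.0])
global_san = lambda img: np.zeros(4)
extract = lambda img, s, stage: s                    # "features" are just the current shape
local_sans = [lambda phi: 0.5 * (target - phi)] * 3
result = cfan_predict(None, global_san, local_sans, extract)
```

In the toy run each stage halves the remaining error, which mirrors the idea that later stages only need to correct small residual deviations.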

Generally speaking, existing alignment methods can be divided into holistic-feature-based methods [7,21,14,34,19,6] and local-feature-based methods [8,10,15,23,9,25,35,32,31,2,28,11].


Fig. 1. Overview of our Coarse-to-Fine Auto-encoder Networks (CFAN) for real-time face alignment. H1, H2 are hidden layers. Through function FΦ, the joint local features Φ(Si) are extracted around facial landmarks of current shape Si.

With this progressive, variable-resolution strategy, the search space of each SAN, or in other words the difficulty of each SAN's task, is well controlled and therefore easier to handle.
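On a desktop machine with an Intel i7-3770 (3.4 GHz CPU), the authors' method (implemented in Matlab) takes about 23 ms per image to predict 68 facial landmarks; this figure appears to exclude face detection time.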

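Suppose we have a face image $x \in \mathbb{R}^d$ with $d$ pixels, and let $S_g(x) \in \mathbb{R}^p$ denote the ground-truth positions of the $p$ landmarks. Facial landmark detection then amounts to learning a mapping function $F$ from the image to the face shape:

$$F : S \leftarrow x.$$

In general, F is complex and nonlinear.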

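To realize this mapping, k hidden-layer auto-encoders are stacked into a deep neural network that maps the image to the corresponding shape.

Specifically, the face alignment task is formulated as minimizing the following objective: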

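Here F = {f1, f2, ..., fk}, fi is the mapping function of the i-th layer of the deep network, σ is the sigmoid function, and ai is the feature representation of each layer.
However, the output range of the sigmoid function is [0, 1], which does not match the range of the landmark positions, so linear regression is used in the last layer fk to obtain an accurate shape estimate S0.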

To prevent overfitting, a regularization term (a weight-decay term) is added to the objective to reduce the magnitude of the weights.
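Written out from the definitions above, the regularized objective plausibly takes the following form. This is a reconstruction under assumptions, not a quotation: the squared L2 loss, the Frobenius-norm weight decay with coefficient α, and the parameter names $W_i$, $b_i$ are assumed notation.

```latex
\begin{aligned}
F^{*} &= \arg\min_{F}\; \bigl\| S_g(x) - f_k(f_{k-1}(\cdots f_1(x))) \bigr\|_2^2
         + \alpha \sum_{i=1}^{k} \| W_i \|_F^2, \\
f_i(a_{i-1}) &= \sigma(W_i a_{i-1} + b_i) = a_i, \qquad i = 1, \dots, k-1, \\
f_k(a_{k-1}) &= W_k a_{k-1} + b_k = S_0 .
\end{aligned}
```

The first k-1 layers are sigmoid layers, while the last layer is linear so that its output is not confined to [0, 1].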


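F contains a large number of parameters, so the optimization can easily get stuck in a local minimum.

To obtain a better optimum, an unsupervised pre-training procedure first initializes the first k-1 layers, while the k-th layer is initialized randomly. The whole network is then fine-tuned in a supervised manner.

For the i-th layer, pre-training is done by optimizing the following formula: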


where

The output of each hidden layer serves as the input to the next layer. For the first layer, a0 = x.
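For orientation, a standard auto-encoder pre-training objective for layer i, consistent with the description above, could be written as follows. This is an assumed form: the decoder $g_i$, its parameters $W_i'$, $b_i'$, and the reuse of the weight-decay coefficient α are illustrative notation rather than something stated in these notes.

```latex
\begin{aligned}
\min_{W_i,\, b_i,\, W_i',\, b_i'} \;&
  \bigl\| a_{i-1} - g_i\bigl(f_i(a_{i-1})\bigr) \bigr\|_2^2
  + \alpha \bigl( \| W_i \|_F^2 + \| W_i' \|_F^2 \bigr), \\
f_i(a_{i-1}) &= \sigma(W_i a_{i-1} + b_i) = a_i, \\
g_i(a_i) &= \sigma(W_i' a_i + b_i') .
\end{aligned}
```

Each layer is trained to reconstruct its own input $a_{i-1}$; after pre-training, only the encoder $f_i$ is kept and stacked.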

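The local feature of a single landmark captures information only about that landmark and ignores its correlation with the other landmarks. For this reason, all local shape-indexed features are concatenated and used together as the input.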
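A minimal sketch of such a joint feature vector is shown below. To keep it self-contained, raw pixel patches stand in for the SIFT descriptors mentioned earlier; the function name `joint_patch_features` and the patch size are assumptions made for illustration.

```python
import numpy as np

def joint_patch_features(image, shape, half=2):
    """Concatenate a (2*half+1) x (2*half+1) pixel patch around every landmark.

    image: 2-D grayscale array; shape: (p, 2) array of (x, y) positions.
    Raw pixels are a stand-in for SIFT descriptors (an assumption of this sketch).
    """
    padded = np.pad(image, half, mode="edge")    # keeps border patches in bounds
    h, w = image.shape
    feats = []
    for x, y in np.rint(shape).astype(int):
        x = int(np.clip(x, 0, w - 1))
        y = int(np.clip(y, 0, h - 1))
        # image[y, x] sits at padded[y + half, x + half], i.e. the patch centre
        patch = padded[y:y + 2 * half + 1, x:x + 2 * half + 1]
        feats.append(patch.ravel())
    return np.concatenate(feats)                 # one vector for all landmarks

image = np.arange(100, dtype=float).reshape(10, 10)
shape = np.array([[2.0, 3.0], [9.0, 0.0], [5.2, 5.7]])
phi = joint_patch_features(image, shape)
```

Because the regressor sees one vector covering all landmarks, its update for an occluded landmark can draw on the features of the visible ones.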


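When the current positions are still far from the ground-truth positions, the earlier local SANs need to use a large search step to approximate them.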

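The global SAN has four layers: three hidden layers followed by a linear regression layer. It learns a nonlinear mapping from the whole face image of 50×50 pixels to the face shape. The hidden layers have 1600, 900, and 400 units respectively.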
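The layer sizes above translate into a forward pass like the following NumPy sketch. The weights are random rather than trained, and the output size of 2 × 68 coordinates assumes the 68-landmark setting mentioned earlier; `init_san` and `san_forward` are hypothetical helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_san(layer_sizes, rng):
    """Random (untrained) weights and zero biases for consecutive layer pairs."""
    return [(rng.normal(0.0, 0.01, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def san_forward(x, params):
    """Sigmoid hidden layers followed by a linear regression output layer."""
    a = x
    for W, b in params[:-1]:
        a = sigmoid(W @ a + b)       # hidden representation a_i
    W, b = params[-1]
    return W @ a + b                 # linear layer, output not limited to [0, 1]

rng = np.random.default_rng(0)
sizes = [50 * 50, 1600, 900, 400, 2 * 68]   # input, three hidden layers, output
params = init_san(sizes, rng)
face = rng.random(50 * 50)                  # flattened 50x50 image with values in [0, 1)
s0 = san_forward(face, params)
```

Keeping the last layer linear is what lets the network output pixel coordinates directly instead of values squashed by the sigmoid.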

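In each local SAN, the hidden layers have 1296, 780, and 400 units respectively.

In both the global SAN and the local SANs, the weight-decay coefficient is α = 0.001.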

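Summary: The authors mainly follow a divide-and-conquer strategy. The global features of a face image are fed into the global SAN to find a relatively accurate shape. The shape-indexed features of all facial landmarks are then concatenated and used as the input to the local SANs. The local stage consists of several local SANs of the same size that work at different resolutions; each one extracts SIFT features around the current facial landmarks and regresses the deviation between the current positions and the ground-truth positions so as to minimize it. Through this coarse-to-fine process the face is aligned well. According to the paper, CFAN outperforms both SDM and DCNN, and nonlinear regression yields a lower regression error.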

1. 300 faces in-the-wild challenge, http://ibug.doc.ic.ac.uk/resources/300-W/
2. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Robust discriminative response
map fitting with constrained local models. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3451 (2013)
3. Belhumeur, P.N., Jacobs, D.W., Kriegman, D.J., Kumar, N.: Localizing parts of
faces using a consensus of exemplars. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 545–552 (2011)
4. Bengio, Y.: Learning deep architectures for AI. Foundations and Trends® in Machine Learning 2(1), 1–127 (2009)
5. Burgos-Artizzu, X.P., Perona, P., Dollár, P.: Robust face landmark estimation
under occlusion. In: IEEE International Conference on Computer Vision, ICCV
(2013)
6. Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression.
In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp.
2887–2894 (2012)
7. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 23(6), 681–685
(2001)
8. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models - their training and application. Computer Vision and Image Understanding
(CVIU) 61(1), 38–59 (1995)
9. Cristinacce, D., Cootes, T.F.: Feature detection and tracking with constrained
local models. In: British Machine Vision Conference (BMVC), vol. 17, pp. 929–938
(2006)
10. Cristinacce, D., Cootes, T.F.: Boosted regression active shape models. In: British
Machine Vision Conference (BMVC), pp. 1–10 (2007)
11. Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time facial feature detection
using conditional regression forests. In: IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 2578–2585 (2012)
12. Dollár, P., Welinder, P., Perona, P.: Cascaded pose regression. In: IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), pp. 1078–1085 (2010)
13. Grangier, D., Bottou, L., Collobert, R.: Deep convolutional networks for scene parsing. In: International Conference on Machine Learning Workshops, vol. 3 (2009)
14. Gross, R., Matthews, I., Baker, S.: Generic vs. person specific active appearance
models. Image and Vision Computing (IVC) 23(12), 1080–1093 (2005)
15. Gu, L., Kanade, T.: A generative shape regularization model for robust face alignment. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS,
vol. 5302, pp. 413–426. Springer, Heidelberg (2008)
16. Jesorsky, O., Kirchberg, K.J., Frischholz, R.W.: Robust face detection using
the hausdorff distance. In: International Conference on Audio-and Video-based
Biometric Person Authentication (AVBPA), pp. 90–95 (2001)
17. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems
(NIPS), pp. 1106–1114 (2012)
18. Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: Interactive facial feature
localization. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C.
(eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 679–692. Springer, Heidelberg
(2012)
19. Liu, X.: Discriminative face alignment. IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI) 31(11), 1941–1954 (2009)
20. Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2480–2487
(2012)
21. Matthews, I., Baker, S.: Active appearance models revisited. International Journal
of Computer Vision (IJCV) 60(2), 135–164 (2004)
22. Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: Xm2vtsdb: The extended
m2vts database. In: International Conference on Audio and Video-based Biometric
Person Authentication (AVBPA), vol. 964, pp. 965–966 (1999)
23. Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape
model. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part IV. LNCS,
vol. 5305, pp. 504–513. Springer, Heidelberg (2008)
24. Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: A semi-automatic
methodology for facial landmark annotation. In: IEEE Conference on Computer
Vision and Pattern Recognition Workshops (CVPRW), pp. 896–903 (2013)
25. Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained
mean-shifts. In: IEEE International Conference on Computer Vision (ICCV), pp.
1034–1041 (2009)
26. Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point
detection. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 3476–3483 (2013)
27. Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural
networks. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR (2014)
28. Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using
boosted regression and graph models. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pp. 2729–2736 (2010)
29. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
vol. 1, p. I–511 (2001)
30. Wu, Y., Wang, Z., Ji, Q.: Facial feature tracking under varying facial expressions
and face poses based on restricted boltzmann machines. In: IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp. 3452–3459 (2013)
31. Xiong, X., De la Torre, F.: Supervised descent method and its applications to face
alignment. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR (2013)
32. Yu, X., Huang, J., Zhang, S., Yan, W., Metaxas, D.N.: Pose-free facial landmark
fitting via optimized part mixtures and cascaded deformable shape model. In: IEEE
International Conference on Computer Vision, ICCV (2013)
33. Zhao, X., Kim, T.K., Luo, W.: Unified face analysis by iterative multi-output
random forests. In: IEEE Conference on Computer Vision and Pattern Recognition,
CVPR (2014)
34. Zhao, X., Shan, S., Chai, X., Chen, X.: Locality-constrained active appearance
model. In: Asian Conference on Computer Vision (ACCV), pp. 636–647 (2013)
35. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization
in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pp. 2879–2886 (2012)