Viola–Jones object detection framework--Rapid Object Detection using a Boosted Cascade of Simple Features中文翻譯 及 matlab實現(見文末連結)


ACCEPTED CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION 2001

Rapid Object Detection using a Boosted Cascade of Simple Features

簡單特徵的提升級聯在快速目標檢測中的應用

Paul Viola                                                            Michael Jones

Mitsubishi Electric Research Labs                                        Compaq CRL               

三菱電氣實驗室                                                      康柏劍橋研究所

201 Broadway, 8th FL                                            One Cambridge Center

Cambridge, MA 02139                                           Cambridge, MA 02142

Abstract

This paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection rates. This work is distinguished by three key contributions. The first is the introduction of a new image representation called the “Integral Image” which allows the features used by our detector to be computed very quickly. The second is a learning algorithm, based on AdaBoost, which selects a small number of critical visual features from a larger set and yields extremely efficient classifiers [6]. The third contribution is a method for combining increasingly more complex classifiers in a “cascade” which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. The cascade can be viewed as an object specific focus-of-attention mechanism which unlike previous approaches provides statistical guarantees that discarded regions are unlikely to contain the object of interest. In the domain of face detection the system yields detection rates comparable to the best previous systems. Used in real-time applications, the detector runs at 15 frames per second without resorting to image differencing or skin color detection.

摘要

本文描述了一種視覺目標檢測的機器學習方法,它能夠極快地處理影象並達到很高的檢測率。這項工作有三個關鍵貢獻。第一是引入了一種新的影象表示,稱為“積分影象”,它使我們檢測器所用的特徵可以被非常快速地計算。第二是一個基於AdaBoost的學習演算法,它能從一個較大的特徵集中選出少量關鍵的視覺特徵,從而產生極為高效的分類器[6]。第三個貢獻是一種以“級聯”形式組合越來越複雜的分類器的方法,它允許影象的背景區域被迅速丟棄,而把更多計算花在更像目標的有希望區域上。這個級聯可以視作一種針對特定目標的注意力集中機制;與以往方法不同,它提供統計保證,確保被丟棄的區域不太可能包含感興趣的目標。在人臉檢測領域,本系統的檢測率可與此前最好的系統相媲美。用於實時應用時,檢測器以每秒15幀的速度執行,且不依賴幀間差分或膚色檢測。

1.  Introduction

This paper brings together new algorithms and insights to construct a framework for robust and extremely rapid object detection. This framework is demonstrated on, and in part motivated by, the task of face detection. Toward this end we have constructed a frontal face detection system which achieves detection and false positive rates which are equivalent to the best published results [16, 12, 15, 11, 1]. This face detection system is most clearly distinguished from previous approaches in its ability to detect faces extremely rapidly. Operating on 384 by 288 pixel images, faces are detected at 15 frames per second on a conventional 700 MHz Intel Pentium III. In other face detection systems, auxiliary information, such as image differences in video sequences, or pixel color in color images, have been used to achieve high frame rates. Our system achieves high frame rates working only with the information present in a single grey scale image. These alternative sources of information can also be integrated with our system to achieve even higher frame rates.

1.引言

本文彙集了新的演算法和見解,構建了一個魯棒且極為快速的目標檢測框架。本框架在人臉檢測任務上得到了驗證,其部分動機也正來源於此。為此,我們構建了一個正面人臉檢測系統,其檢測率和正誤視率與已發表的最佳結果相當[16,12,15,11,1]。這一人臉檢測系統與以往方法最顯著的區別,在於其極快的人臉檢測速度:在常規的700 MHz英特爾奔騰III上處理384×288畫素的影象,人臉檢測速度達到每秒15幀。其他人臉檢測系統往往藉助輔助資訊(如視訊序列中的幀間差異,或彩色影象中的畫素顏色)來實現高幀率,而我們的系統僅憑單幅灰度影象中的資訊就達到了高幀率。上述其他資訊來源也可以與我們的系統整合,以獲得更高的幀率。

There are three main contributions of our object detection framework. We will introduce each of these ideas briefly below and then describe them in detail in subsequent sections.

本文的目標檢測框架包含三個主要創新性成果。下面將簡短介紹這三個概念,之後將分章節對它們一一進行詳細描述。

The first contribution of this paper is a new image representation called an integral image that allows for very fast feature evaluation. Motivated in part by the work of Papageorgiou et al. our detection system does not work directly with image intensities [10]. Like these authors we use a set of features which are reminiscent of Haar Basis functions (though we will also use related filters which are more complex than Haar filters). In order to compute these features very rapidly at many scales we introduce the integral image representation for images. The integral image can be computed from an image using a few operations per pixel. Once computed, any one of these Haar-like features can be computed at any scale or location in constant time.

本文的第一個貢獻是一種新的影象表示,稱為積分影象,它允許非常快速的特徵求值。部分受Papageorgiou等人工作的啟發,我們的檢測系統並不直接使用影象強度資訊[10]。和這些作者一樣,我們使用一組類似Haar基函式的特徵(儘管我們也會使用比Haar濾波器更復雜的相關濾波器)。為了在多種尺度下非常快速地計算這些特徵,我們引入了積分影象這種影象表示。積分影象只需對每個畫素做少量操作即可從原影象算出;一旦算出,任何一個類Haar特徵都可以在任意尺度或位置上以常數時間計算。

The second contribution of this paper is a method for constructing a classifier by selecting a small number of important features using AdaBoost [6]. Within any image sub-window the total number of Haar-like features is very large, far larger than the number of pixels. In order to ensure fast classification, the learning process must exclude a large majority of the available features, and focus on a small set of critical features. Motivated by the work of Tieu and Viola, feature selection is achieved through a simple modification of the AdaBoost procedure: the weak learner is constrained so that each weak classifier returned can depend on only a single feature [2]. As a result each stage of the boosting process, which selects a new weak classifier, can be viewed as a feature selection process. AdaBoost provides an effective learning algorithm and strong bounds on generalization performance [13, 9, 10].

本文的第二個貢獻是一種用AdaBoost選擇少量重要特徵來構建分類器的方法[6]。任何一個影象子視窗內的類Haar特徵總數都非常大,遠遠超過畫素數。為了保證快速分類,學習過程必須剔除絕大部分可用特徵,只關注一小部分關鍵特徵。受Tieu和Viola工作的啟發,特徵選擇通過對AdaBoost過程的簡單修改來實現:約束弱學習器,使每次返回的弱分類器只能依賴單一特徵[2]。因此,提升過程的每個階段(即選擇一個新弱分類器的階段)都可以視作一次特徵選擇。AdaBoost提供了一種有效的學習演算法,並對泛化效能有很強的理論界[13,9,10]。

The third major contribution of this paper is a method for combining successively more complex classifiers in a cascade structure which dramatically increases the speed of the detector by focusing attention on promising regions of the image. The notion behind focus of attention approaches is that it is often possible to rapidly determine where in an image an object might occur [17, 8, 1]. More complex processing is reserved only for these promising regions. The key measure of such an approach is the “false negative” rate of the attentional process. It must be the case that all, or almost all, object instances are selected by the attentional filter.

本文的第三個主要貢獻是在級聯結構中逐級組合更復雜分類器的方法:通過把注意力集中到影象中有希望的區域,大幅提高了檢測器的速度。注意力集中方法背後的想法是,往往可以快速判斷目標在影象中可能出現的位置[17,8,1],而更復雜的處理只保留給這些有希望的區域。衡量這種方法的關鍵指標是注意力過程的“負誤視”率(在模式識別中,指把目標錯判為非目標)。必須保證所有或幾乎所有的目標例項都能被注意力濾波器選中。

We will describe a process for training an extremely simple and efficient classifier which can be used as a “supervised” focus of attention operator. The term supervised refers to the fact that the attentional operator is trained to detect examples of a particular class. In the domain of face detection it is possible to achieve fewer than 1% false negatives and 40% false positives using a classifier constructed from two Haar-like features. The effect of this filter is to reduce by over one half the number of locations where the final detector must be evaluated.

我們將描述一個訓練過程,得到一個極其簡單而高效的分類器,用作“有監督的”注意力聚焦運算元。術語“有監督”是指該注意力運算元被訓練用於檢測某一特定類別的樣本。在人臉檢測領域,使用由兩個類Haar特徵構建的分類器,可以達到低於1%的負誤視率和40%的正誤視率。該濾波器的作用是把最終檢測器必須評估的位置減少一半以上。

Those sub-windows which are not rejected by the initial classifier are processed by a sequence of classifiers, each slightly more complex than the last. If any classifier rejects the sub-window, no further processing is performed.  The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Geman and colleagues [1, 4].

未被第一個分類器剔除的子視窗,將由後續一系列分類器處理,每個分類器都比前一個稍複雜。只要任何一個分類器剔除了該子視窗,就不再做進一步處理。級聯檢測過程的結構本質上是一棵退化決策樹,因此與Geman及其同事的工作相關[1,4]。

An extremely fast face detector will have broad practical applications. These include user interfaces, image databases, and teleconferencing. In applications where rapid frame-rates are not necessary, our system will allow for significant additional post-processing and analysis. In addition our system can be implemented on a wide range of small low power devices, including hand-helds and embedded processors. In our lab we have implemented this face detector on the Compaq iPaq handheld and have achieved detection at two frames per second (this device has a low power 200 mips Strong Arm processor which lacks floating point hardware).

一個極快的人臉檢測器會有廣泛的實際應用,包括使用者介面、影象資料庫和電話會議。在不需要高幀率的應用中,我們的系統還允許進行大量額外的後處理和分析。此外,我們的系統可以在各種低功耗小型裝置上實現,包括手持裝置和嵌入式處理器。在我們的實驗室裡,我們已在Compaq iPaq手持裝置上實現了該人臉檢測器,達到了每秒兩幀的檢測速度(該裝置使用低功耗的200 MIPS StrongARM處理器,沒有浮點硬體)。

The remainder of the paper describes our contributions and a number of experimental results, including a detailed description of our experimental methodology.  Discussion of closely related work takes place at the end of each section.

本文餘下部分將描述我們的貢獻和若干實驗結果,包括對實驗方法的詳細說明。每節結尾會討論密切相關的工作。

2.  Features

Our object detection procedure classifies images based on the value of simple features.  There are many motivations for using features rather than the pixels directly. The most common reason is that features can act to encode ad-hoc domain knowledge that is difficult to learn using a finite quantity of training data.  For this system there is also a second critical motivation for features:  the feature based system operates much faster than a pixel-based system.

2.特徵

我們的目標檢測過程基於簡單特徵的值對影象進行分類。使用特徵而不直接使用畫素有多方面的理由。最常見的原因是,特徵可以編碼那些難以從有限訓練資料中學到的特定領域知識。對本系統而言,使用特徵還有第二個關鍵動機:基於特徵的系統比基於畫素的系統執行得快得多。

The simple features used are reminiscent of Haar basis functions which have been used by Papageorgiou et al. [10]. More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature computes the difference between diagonal pairs of rectangles.

所用的簡單特徵借鑑了Papageorgiou等人使用過的Haar基函式[10]。具體而言,我們使用三類特徵。雙矩形特徵的值是兩個矩形區域內畫素和之差,這兩個區域具有相同的尺寸和形狀,且水平或垂直相鄰(見圖1)。三矩形特徵計算兩個外側矩形的畫素和減去中間矩形的畫素和。最後,四矩形特徵計算兩組對角矩形之和的差。

Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.1

鑑於檢測器的基本解析度為24×24,矩形特徵的窮舉集合非常大,超過180,000個。注意,與Haar基不同,矩形特徵集是過完備的1。

 Figure 1: Example rectangle features shown relative to the enclosing detection window. The sum of the pixels which lie within the white rectangles are subtracted from the sum of pixels in the grey rectangles. Two-rectangle features are shown in (A) and (B). Figure (C) shows a three-rectangle feature, and (D) a four-rectangle feature.

相對於包圍它們的檢測視窗顯示的示例矩形特徵。白色矩形內的畫素和減去灰色矩形內的畫素和即得到特徵值。(A)和(B)是雙矩形特徵,(C)是三矩形特徵,(D)是四矩形特徵。

圖 1

2.1. Integral Image

Rectangle features can be computed very rapidly using an intermediate representation for the image which we call the integral image.2 The integral image at location (x, y) contains the sum of the pixels above and to the left of (x, y), inclusive:

ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)

where ii(x, y) is the integral image and i(x, y) is the original image.

矩形特徵可以利用一種影象的中間表示快速計算,我們稱之為積分影象2。位置(x, y)處的積分影象值,是(x, y)上方和左方(含本身)所有畫素之和:

ii(x, y) = Σ_{x′ ≤ x, y′ ≤ y} i(x′, y′)

1 A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of 180,000 features is many times over-complete.

2 There is a close relation to “summed area tables” as used in graphics [3]. We choose a different name here in order to emphasize its use for the analysis of images, rather than for texture mapping.

1 完備基的基元素之間不存線上性相關,且元素個數與影象空間的維數相同,此處為576。而全部180,000個矩形特徵的集合是多倍過完備的。

2 在圖形學中,它與“區域求和表”(summed-area table)密切相關[3]。這裡我們選擇一個不同的名稱,是為了強調它用於影象分析,而非紋理對映。

Figure 2: The sum of the pixels within rectangle D can be computed with four array references. The value of the integral image at location 1 is the sum of the pixels in rectangle A. The value at location 2 is A+B , at location 3 is A+C, and at location 4 is A+B+C+D. The sum within D can be computed as 4+1-(2+3).

矩形D內的畫素和只需四次陣列引用即可算出。位置1處的積分影象值是矩形A內的畫素和。位置2處的值是A+B,位置3處是A+C,位置4處是A+B+C+D。因此D內的畫素和為4+1−(2+3)。

圖 2

其中ii(x,y)是積分影象,i(x,y)是原始影象。利用下面一對遞推式:

s(x, y) = s(x, y−1) + i(x, y)
ii(x, y) = ii(x−1, y) + s(x, y)

(這裡s(x,y)是行累積和,且s(x,−1)=0,ii(−1,y)=0),只需對原始影象掃描一遍即可求出積分影象。

Using the integral image any rectangular sum can be computed in four array references (see Figure 2).  Clearly the difference between two rectangular sums can be computed in eight references. Since the two-rectangle features defined above involve adjacent rectangular sums they can be computed in six array references, eight in the case of the three-rectangle features, and nine for four-rectangle features.

利用積分影象,任何矩形區域的畫素和都可以通過四次陣列引用計算(見圖2)。顯然,兩個矩形和之差可用八次引用算出。由於上面定義的雙矩形特徵涉及相鄰的矩形和,只需六次陣列引用即可算出;三矩形特徵需要八次,四矩形特徵需要九次。
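(譯者注)上述構建積分影象和用四次陣列引用求矩形和的過程,可以用下面這段Python示意程式碼說明。這只是按正文描述給出的草圖,並非論文或文末MATLAB實現的一部分,函式名均為譯者假設:

```python
def integral_image(img):
    """一遍掃描構建積分影象。img 是二維列表(灰度影象)。"""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        s = 0  # 行累積和 s(x, y)
        for x in range(w):
            s += img[y][x]
            # ii(x, y) = ii(x, y-1) 方向的上一行值 + 本行累積和
            ii[y][x] = (ii[y - 1][x] if y > 0 else 0) + s
    return ii

def rect_sum(ii, x, y, w, h):
    """矩形 [x, x+w) × [y, y+h) 內的畫素和,僅需四次陣列引用。"""
    def at(px, py):
        if px < 0 or py < 0:
            return 0
        return ii[py][px]
    return (at(x + w - 1, y + h - 1) - at(x - 1, y + h - 1)
            - at(x + w - 1, y - 1) + at(x - 1, y - 1))
```

例如,對2×2影象[[1,2],[3,4]],積分影象右下角的值為10,即全部畫素之和;雙矩形特徵的值即可由兩次`rect_sum`相減得到。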

2.2. Feature Discussion

Rectangle features are somewhat primitive when compared with alternatives such as steerable filters [5, 7]. Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features do however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for their limited flexibility.

2.2特徵討論

與導向濾波器(steerable filters)等替代方法相比,矩形特徵顯得有些原始[5,7]。導向濾波器及其同類非常適合對邊界做精細分析,以及影象壓縮和紋理分析。相比之下,矩形特徵雖然對邊緣、條紋及其他簡單影象結構敏感,卻相當粗糙。與導向濾波器不同,它可用的方向只有垂直、水平和對角線。儘管如此,矩形特徵集確實提供了豐富的影象表徵,足以支援有效的學習。與積分影象相結合,矩形特徵集的高效性為其有限的靈活性提供了充分的補償。

3.  Learning Classification Functions

Given a feature set and a training set of positive and negative images, any number of machine learning approaches could be used to learn a classification function. In our system a variant of AdaBoost is used both to select a small set of features and train the classifier [6]. In its original form, the AdaBoost learning algorithm is used to boost the classification performance of a simple (sometimes called weak) learning algorithm. There are a number of formal guarantees provided by the AdaBoost learning procedure. Freund and Schapire proved that the training error of the strong classifier approaches zero exponentially in the number of rounds. More importantly a number of results were later proved about generalization performance [14]. The key insight is that generalization performance is related to the margin of the examples, and that AdaBoost achieves large margins rapidly.

3. 學習分類函式

給定一個特徵集以及由正例影象和負例影象構成的訓練集,可以用多種機器學習方法來學習分類函式。在我們的系統中,使用AdaBoost的一個變種,既用來選擇小規模特徵集,也用來訓練分類器[6]。在其原始形式中,AdaBoost學習演算法用於提升一個簡單(有時稱為弱)學習演算法的分類效能。AdaBoost學習過程提供了許多形式化保證。Freund和Schapire證明,強分類器的訓練誤差隨輪數呈指數趨近於零。更重要的是,此後又有許多關於泛化效能的結果得到證明[14]。其關鍵結論是:泛化效能與樣本的間隔(margin)有關,而AdaBoost能迅速獲得較大的間隔。

Recall that there are over 180,000 rectangle features associated with each image sub-window, a number far larger than the number of pixels. Even though each feature can be computed very efficiently, computing the complete set is prohibitively expensive. Our hypothesis, which is borne out by experiment, is that a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features.

回想一下,每個影象子視窗都關聯著超過180,000個矩形特徵,遠多於畫素數。儘管單個特徵的計算效率很高,計算完整的特徵集仍然代價過高。我們的假設(已被實驗證實)是:極少數特徵結合起來即可構成有效的分類器,主要挑戰在於找到這些特徵。

In support of this goal, the weak learning algorithm is designed to select the single rectangle feature which best separates the positive and negative examples (this is similar to the approach of [2] in the domain of image database retrieval). For each feature, the weak learner determines the optimal threshold classification function, such that the minimum number of examples are misclassified. A weak classifier hj(x) thus consists of a feature fj, a threshold θj and a parity pj indicating the direction of the inequality sign:

hj(x) = 1 if pj fj(x) < pj θj, and 0 otherwise

Here x is a 24x24 pixel sub-window of an image. See Table 1 for a summary of the boosting process.

為實現這一目標,我們設計的弱學習演算法用於選擇能把正例和負例分得最開的單個矩形特徵(這與[2]中用於影象資料庫檢索的方法類似)。對於每個特徵,弱學習器確定最優的閾值分類函式,使被錯分的樣本數最少。因此,一個弱分類器hj(x)由特徵fj、閾值θj和指示不等號方向的極性pj構成:

hj(x) = 1,若 pj fj(x) < pj θj;否則 hj(x) = 0


這裡x是影象中一個24×24畫素的子視窗。表1是提升過程的概述。
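(譯者注)上式的弱分類器可以直接寫成如下的Python草圖(僅作示意,名稱為譯者假設):

```python
def weak_classify(f_j, theta_j, p_j, x):
    """弱分類器 h_j(x):若 p_j * f_j(x) < p_j * theta_j 則輸出1(判為人臉),否則輸出0。
    f_j 是特徵函式(如某個矩形特徵的值),theta_j 是閾值,p_j 取 +1 或 -1,
    用於翻轉不等號方向。"""
    return 1 if p_j * f_j(x) < p_j * theta_j else 0
```

例如取特徵為恆等函式、閾值為5:當p=+1時,特徵值小於5的子視窗被判為正例;當p=−1時判定方向相反。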

In practice no single feature can perform the classification task with low error. Features which are selected in early rounds of the boosting process had error rates between 0.1 and 0.3. Features selected in later rounds, as the task becomes more difficult, yield error rates between 0.4 and 0.5.

在實踐中,沒有任何單個特徵能以低錯誤率完成分類任務。在提升過程早期被選中的特徵,錯誤率在0.1到0.3之間;隨著任務變難,後期被選中的特徵錯誤率在0.4到0.5之間。

3.1. Learning Discussion

Many general feature selection procedures have been proposed (see chapter 8 of [18] for a review). Our final application demanded a very aggressive approach which would discard the vast majority of features. For a similar recognition problem Papageorgiou et al. proposed a scheme for feature selection based on feature variance [10]. They demonstrated good results selecting 37 features out of a total 1734 features.

3.1 學習的討論

許多通用的特徵選擇程式已被提出(綜述見[18]第8章)。我們的最終應用需要一種非常激進的、能拋棄絕大多數特徵的方法。針對類似的識別問題,Papageorgiou等人提出了一種基於特徵方差的特徵選擇方案[10]。他們從總共1734個特徵中選出37個,取得了良好的結果。

Roth et al.   propose a feature selection process based on the Winnow exponential perceptron learning rule [11]. The Winnow learning process converges to a solution where many of these weights are zero. Nevertheless a very large number of features are retained (perhaps a few hundred or thousand).

Roth等人提出了一種基於Winnow指數感知機學習規則的特徵選擇過程[11]。Winnow學習過程會收斂到一個許多權重為零的解。儘管如此,仍會保留相當多的特徵(可能有幾百甚至上千個)。

Table 1: The AdaBoost algorithm for classifier learning. Each round of boosting selects one feature from the 180,000 potential features.

表1:用於分類器學習的AdaBoost演算法。每一輪提升從180,000個候選特徵中選擇一個特徵。
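(譯者注)表1的提升流程可按AdaBoost的標準形式概括為下面的Python草圖,並非表1原文:其中“訓練單特徵弱分類器並搜尋最優閾值”被簡化為在給定的候選弱分類器中選出加權錯誤率最低者,變數名均為譯者假設:

```python
import math

def adaboost_round(samples, labels, weights, weak_learners):
    """一輪提升:歸一化權重,選出加權錯誤率最低的弱分類器,並更新權重。
    weak_learners 是候選的單特徵分類器列表,每個接收一個樣本並返回0/1。
    假設最優錯誤率落在 (0, 0.5) 區間內。"""
    total = sum(weights)
    weights = [w / total for w in weights]            # 1. 歸一化權重
    best, best_err = None, float("inf")
    for h in weak_learners:                           # 2. 選擇加權誤差最小的弱分類器
        err = sum(w for w, x, y in zip(weights, samples, labels) if h(x) != y)
        if err < best_err:
            best, best_err = h, err
    best_err = max(best_err, 1e-10)                   # 避免完美分類器導致除零
    beta = best_err / (1.0 - best_err)                # 3. beta_t = e_t / (1 - e_t)
    new_weights = [w * (beta if best(x) == y else 1.0)  # 分類正確的樣本權重乘以beta
                   for w, x, y in zip(weights, samples, labels)]
    alpha = math.log(1.0 / beta)                      # 該弱分類器在強分類器中的票重
    return best, alpha, new_weights
```

最終的強分類器按 Σ alpha_t·h_t(x) 是否超過半數票重來判定;這一步此處從略。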

3.2. Learning Results

While details on the training and performance of the final system are presented in Section 5, several simple results merit discussion. Initial experiments demonstrated that a frontal face classifier constructed from 200 features yields a detection rate of 95% with a false positive rate of 1 in 14084. These results are compelling, but not sufficient for many real-world tasks. In terms of computation, this classifier is probably faster than any other published system, requiring 0.7 seconds to scan an 384 by 288 pixel image. Unfortunately, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time.

3.2 學習結果

最終系統的訓練和效能細節將在第5節介紹,這裡先討論幾個簡單的結果。初步實驗表明,由200個特徵構成的正面人臉分類器可達到95%的檢測率,正誤視率為14084分之一。這些結果令人矚目,但對許多實際任務而言仍不夠。就計算而言,這個分類器可能比任何其他已發表的系統都快,掃描一幅384×288畫素的影象需要0.7秒。不幸的是,提升檢測效能最直接的辦法,即給分類器新增特徵,會直接增加計算時間。

For the task of face detection, the initial rectangle features selected by AdaBoost are meaningful and easily interpreted. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks (see Figure 3). This feature is relatively large in comparison with the detection sub-window, and should be somewhat insensitive to size and location of the face. The second feature selected relies on the property that the eyes are darker than the bridge of the nose.

對於人臉檢測任務,AdaBoost選出的最初幾個矩形特徵是有意義且易於解釋的。選出的第一個特徵似乎著眼於這樣的性質:眼睛區域往往比鼻子和臉頰區域更暗(見圖3)。相對於檢測子視窗,該特徵比較大,因而對人臉的大小和位置應當不太敏感。選出的第二個特徵則依賴於眼睛比鼻樑更暗這一性質。

Figure 3: The first and second features selected by AdaBoost. The two features are shown in the top row and then overlayed on a typical training face in the bottom row. The first feature measures the difference in intensity between the region of the eyes and a region across the upper cheeks. The feature capitalizes on the observation that the eye region is often darker than the cheeks. The second feature compares the intensities in the eye regions to the intensity across the bridge of the nose.

圖3:AdaBoost選出的第一和第二個特徵。這兩個特徵顯示在最上面一行,下面一行則把它們疊加在一張典型的訓練人臉上。第一個特徵度量眼睛區域與上臉頰區域之間的強度差,利用了眼睛區域往往比臉頰更暗這一觀察。第二個特徵比較眼睛區域與鼻樑處的強度。

4.  The Attentional Cascade

This section describes an algorithm for constructing a cascade of classifiers which achieves increased detection performance while radically reducing computation time. The key insight is that smaller, and therefore more efficient, boosted classifiers can be constructed which reject many of the negative sub-windows while detecting almost all positive instances (i.e. the threshold of a boosted classifier can be adjusted so that the false negative rate is close to zero). Simpler classifiers are used to reject the majority of sub-windows before more complex classifiers are called upon to achieve low false positive rates.

4.注意力級聯

本章描述一種構建級聯分類器的演算法,它在顯著減少計算時間的同時提高了檢測效能。其關鍵思想是:可以構建更小、因而更高效的提升分類器,在檢測出幾乎所有正例的同時剔除大量負子視窗(即,可以調整提升分類器的閾值,使負誤視率接近零)。在呼叫更復雜的分類器以獲得低正誤視率之前,先用較簡單的分類器剔除大多數子視窗。

The overall form of the detection process is that of a degenerate decision tree, what we call a “cascade” (see Figure 4). A positive result from the first classifier triggers the evaluation of a second classifier which has also been adjusted to achieve very high detection rates. A positive result from the second classifier triggers a third classifier, and so on. A negative outcome at any point leads to the immediate rejection of the sub-window.

檢測過程的整體形式是一棵退化決策樹,我們稱之為“級聯”(見圖4)。第一個分類器的正結果會觸發第二個分類器的評估,第二個分類器同樣被調整到很高的檢測率;第二個分類器的正結果又觸發第三個分類器,依此類推。任何一處出現負結果,都會導致該子視窗被立即剔除。
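(譯者注)級聯的判定邏輯可以用幾行Python示意(草圖,名稱為譯者假設):

```python
def cascade_classify(sub_window, stages):
    """stages 是按複雜度遞增排列的階段分類器列表,
    每個分類器接收子視窗並返回 True(通過)/False(拒絕)。"""
    for stage in stages:
        if not stage(sub_window):
            return False   # 負結果:立即剔除該子視窗,不再做後續計算
    return True            # 通過所有階段,判定為目標
```

由於絕大多數子視窗在前一兩個階段就被拒絕,平均每個子視窗只需極少的計算。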

Stages in the cascade are constructed by training classifiers using AdaBoost and then adjusting the threshold to minimize false negatives. Note that the default AdaBoost threshold is designed to yield a low error rate on the training data. In general a lower threshold yields higher detection rates and higher false positive rates.

級聯中各階段的構建方法是:先用AdaBoost訓練分類器,再調整閾值使負誤視最少。注意,AdaBoost的預設閾值旨在訓練資料上獲得低錯誤率。一般而言,較低的閾值會帶來較高的檢測率和較高的正誤視率。

Figure 4: Schematic depiction of the detection cascade. A series of classifiers are applied to every sub-window. The initial classifier eliminates a large number of negative examples with very little processing. Subsequent layers eliminate additional negatives but require additional computation. After several stages of processing the number of sub-windows has been reduced radically. Further processing can take any form such as additional stages of the cascade (as in our detection system) or an alternative detection system.

一系列分類器依次作用於每個子視窗。最初的分類器以極少的處理剔除大量負例;後續各層剔除更多負例,但需要更多計算。經過若干階段的處理後,子視窗的數量急劇減少。進一步的處理可以採取任何形式,例如級聯的更多階段(如我們的檢測系統)或另一種檢測系統。

For example an excellent first stage classifier can be constructed from a two-feature strong classifier by reducing the threshold to minimize false negatives. Measured against a validation training set, the threshold can be adjusted to detect 100% of the faces with a false positive rate of 40%. See Figure 3 for a description of the two features used in this classifier.

例如,把一個兩特徵的強分類器的閾值調低以使負誤視最少,就能構成一個出色的第一階段分類器。在一個驗證訓練集上測得,調整閾值後可以檢測出100%的人臉,正誤視率為40%。這個分類器所用的兩個特徵見圖3的說明。

Computation of the two feature classifier amounts to about 60 microprocessor instructions. It seems hard to imagine that any simpler filter could achieve higher rejection rates. By comparison, scanning a simple image template, or a single layer perceptron, would require at least 20 times as many operations per sub-window.

計算這個兩特徵分類器大約相當於60條微處理器指令。很難想象還有更簡單的濾波器能達到更高的剔除率。相比之下,掃描一個簡單的影象模板或執行一個單層感知機,每個子視窗至少需要20倍的運算量。

The structure of the cascade reflects the fact that within any single image an overwhelming majority of sub-windows are negative. As such, the cascade attempts to reject as many negatives as possible at the earliest stage possible. While a positive instance will trigger the evaluation of every classifier in the cascade, this is an exceedingly rare event.

級聯的結構反映了這樣一個事實:在任何單幅影象中,絕大多數子視窗都是負例。因此,級聯試圖在儘可能早的階段剔除儘可能多的負例。雖然一個正例會觸發級聯中每一個分類器的評估,但這是極其罕見的事件。

Much like a decision tree, subsequent classifiers are trained using those examples which pass through all the previous stages. As a result, the second classifier faces a more difficult task than the first. The examples which make it through the first stage are “harder” than typical examples. The more difficult examples faced by deeper classifiers push the entire receiver operating characteristic (ROC) curve downward. At a given detection rate, deeper classifiers have correspondingly higher false positive rates.

與決策樹類似,後續分類器使用通過了之前所有階段的樣本進行訓練。因此,第二個分類器面臨的任務比第一個更難;能通過第一階段的樣本比典型樣本更“難”。更深層分類器面對的更難樣本會把整條受試者工作特徵(ROC)曲線往下壓:在給定檢測率下,更深層的分類器有相應更高的正誤視率。

4.1. Training a Cascade of Classifiers

The cascade training process involves two types of trade-offs. In most cases classifiers with more features will achieve higher detection rates and lower false positive rates. At the same time classifiers with more features require more time to compute. In principle one could define an optimization framework in which: i) the number of classifier stages, ii) the number of features in each stage, and iii) the threshold of each stage, are traded off in order to minimize the expected number of evaluated features. Unfortunately finding this optimum is a tremendously difficult problem.

4.1 訓練分類器級聯

級聯的訓練過程涉及兩類權衡。在大多數情況下,特徵更多的分類器能達到更高的檢測率和更低的正誤視率;同時,特徵更多的分類器也需要更多的計算時間。原則上可以定義一個優化框架,對以下三者進行權衡:一)分類器的階段數,二)每個階段的特徵數,三)每個階段的閾值,以最小化期望評估的特徵數。不幸的是,找到這個最優解是極其困難的問題。

In practice a very simple framework is used to produce an effective classifier which is highly efficient. Each stage in the cascade reduces the false positive rate and decreases the detection rate. A target is selected for the minimum reduction in false positives and the maximum decrease in detection. Each stage is trained by adding features until the target detection and false positives rates are met (these rates are determined by testing the detector on a validation set). Stages are added until the overall target for false positive and detection rate is met.

在實踐中,我們用一個非常簡單的框架來產生有效且高效的分類器。級聯中的每個階段都會降低正誤視率,同時也會降低檢測率。我們為正誤視率的最小降幅和檢測率的最大降幅設定目標,通過不斷新增特徵來訓練每個階段,直到達到該階段的目標檢測率和正誤視率(這些比率通過在驗證集上測試檢測器來確定);再不斷新增階段,直到滿足正誤視率和檢測率的總體目標。
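(譯者注)上述訓練框架可以概括為下面的Python草圖;`train_stage`與`evaluate`是譯者假設的佔位函式,分別代表“訓練一個含n個特徵、閾值已調低的階段分類器”和“在驗證集上評估整個級聯”,並假設每個階段總能達到其目標比率:

```python
def train_cascade(f_target, f_max_per_stage, d_min_per_stage,
                  train_stage, evaluate):
    """逐階段訓練級聯,直到總體正誤視率降到 f_target 以下。
    f_max_per_stage: 每個階段允許的正誤視率上限(相對上一階段的比例)
    d_min_per_stage: 每個階段要求的最低檢測率(逐階段累乘)
    train_stage(n): 返回含 n 個特徵的階段分類器
    evaluate(stages): 返回驗證集上的 (總體正誤視率, 總體檢測率)"""
    stages, f_overall = [], 1.0
    while f_overall > f_target:
        n = 0
        while True:
            n += 1                              # 不斷給本階段新增特徵……
            stage = train_stage(n)
            f, d = evaluate(stages + [stage])
            if (f <= f_overall * f_max_per_stage and
                    d >= d_min_per_stage ** (len(stages) + 1)):
                break                           # ……直到達到本階段的目標比率
        stages.append(stage)
        f_overall = f
    return stages
```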

4.2. Detector Cascade Discussion

The complete face detection cascade has 38 stages with over 6000 features. Nevertheless the cascade structure results in fast average detection times. On a difficult dataset, containing 507 faces and 75 million sub-windows, faces are detected using an average of 10 feature evaluations per sub-window. In comparison, this system is about 15 times faster than an implementation of the detection system constructed by Rowley et al.3 [12]

4.2 探測器級聯的探討

完整的人臉檢測級聯有38個階段,共6000多個特徵。儘管如此,級聯結構仍帶來了很快的平均檢測時間。在一個困難的資料集(包含507張人臉和7500萬個子視窗)上,檢測人臉平均每個子視窗只需評估10個特徵。相比之下,本系統比Rowley等人構建的檢測系統的一個實現快約15倍3[12]。

A notion similar to the cascade appears in the face detection system described by Rowley et al. in which two detection networks are used [12]. Rowley et al. used a faster yet less accurate network to prescreen the image in order to find candidate regions for a slower more accurate network. Though it is difficult to determine exactly, it appears that Rowley et al.’s two network face system is the fastest existing face detector.4

一個與級聯類似的概念出現在Rowley等人描述的人臉檢測系統中,該系統使用了兩個檢測網路[12]。Rowley等人用一個更快但精度較低的網路對影象進行預篩選,為更慢但更準確的網路找出候選區域。雖然很難精確判斷,但Rowley等人的雙網路人臉系統似乎是現有最快的人臉檢測器。4

The structure of the cascaded detection process is essentially that of a degenerate decision tree, and as such is related to the work of Amit and Geman [1]. Unlike techniques which use a fixed detector, Amit and Geman propose an alternative point of view where unusual co-occurrences of simple image features are used to trigger the evaluation of a more complex detection process. In this way the full detection process need not be evaluated at many of the potential image locations and scales. While this basic insight is very valuable, in their implementation it is necessary to first evaluate some feature detector at every location. These features are then grouped to find unusual co-occurrences. In practice, since the form of our detector and the features that it uses are extremely efficient, the amortized cost of evaluating our detector at every scale and location is much faster than finding and grouping edges throughout the image.

級聯檢測過程的結構本質上是一棵退化決策樹,因此與Amit和Geman[1]的工作相關。與使用固定檢測器的技術不同,Amit和Geman提出了另一種思路:利用簡單影象特徵的不尋常共現來觸發更復雜檢測過程的評估。這樣,完整的檢測過程就不必在許多潛在的影象位置和尺度上進行評估。儘管這一基本思想很有價值,但在他們的實現中,必須先在每個位置評估某些特徵檢測器,再將這些特徵分組以尋找不尋常的共現。實際上,由於我們的檢測器形式及其使用的特徵極為高效,在每個尺度和位置評估我們檢測器的攤銷成本,遠低於在整幅影象中尋找並分組邊緣的成本。

In recent work Fleuret and Geman have presented a face detection technique which relies on a “chain” of tests in order to signify the presence of a face at a particular scale and location [4]. The image properties measured by Fleuret and Geman, disjunctions of fine scale edges, are quite different than rectangle features which are simple, exist at all scales, and are somewhat interpretable. The two approaches also differ radically in their learning philosophy. The motivation for Fleuret and Geman’s learning process is density estimation and density discrimination, while our detector is purely discriminative. Finally the false positive rate of Fleuret and Geman’s approach appears to be higher than that of previous approaches like Rowley et al. and this approach. Unfortunately the paper does not report quantitative results of this kind. The included example images each have between 2 and 10 false positives.

在最近的工作中,Fleuret和Geman提出了一種人臉檢測技術,它依靠一“鏈”測試來判斷人臉是否出現在特定的尺度和位置[4]。Fleuret和Geman度量的影象屬性(細尺度邊緣的析取)與矩形特徵大不相同:矩形特徵簡單、存在於所有尺度,而且在一定程度上可以解釋。兩種方法在學習理念上也有根本差異:Fleuret和Geman學習過程的動機是密度估計和密度判別,而我們的檢測器是純判別式的。最後,Fleuret和Geman方法的正誤視率似乎高於Rowley等人等先前方法以及本文方法。遺憾的是,該論文沒有報告此類定量結果;文中給出的示例影象每張有2到10個正誤視。

5.  Results

A 38 layer cascaded classifier was trained to detect frontal upright faces. To train the detector, a set of face and non-face training images were used. The face training set consisted of 4916 hand labeled faces scaled and aligned to a base resolution of 24 by 24 pixels. The faces were extracted from images downloaded during a random crawl of the world wide web. Some typical face examples are shown in Figure 5. The non-face subwindows used to train the detector come from 9544 images which were manually inspected and found to not contain any faces. There are about 350 million subwindows within these non-face images.

5.實驗結果

我們訓練了一個38層的級聯分類器來檢測正面直立的人臉。訓練檢測器使用了一組人臉和非人臉訓練影象。人臉訓練集由4916張手工標註的人臉組成,都經過縮放和對齊,基本解析度為24×24畫素。這些人臉是從通過隨機爬取全球資訊網下載的影象中提取的。一些典型的人臉樣例如圖5所示。用於訓練檢測器的非人臉子視窗來自9544張經人工檢查確認不含任何人臉的影象。這些非人臉影象中共有約3.5億個子視窗。

The number of features in the first five layers of the detector is 1, 10, 25, 25 and 50 features respectively. The remaining layers have increasingly more features. The total number of features in all layers is 6061.

檢測器前五層的特徵數分別為1、10、25、25和50。其餘各層的特徵數逐層增多。所有層的特徵總數為6061個。

Each classifier in the cascade was trained with the 4916 training faces (plus their vertical mirror images for a total of 9832 training faces) and 10,000 non-face sub-windows (also of size 24 by 24 pixels) using the Adaboost training procedure. For the initial one feature classifier, the non-face training examples were collected by selecting random sub-windows from a set of 9544 images which did not contain faces. The non-face examples used to train subsequent layers were obtained by scanning the partial cascade across the non-face images and collecting false positives. A maximum of 10000 such non-face sub-windows were collected for each layer.

級聯中的每個分類器都使用AdaBoost訓練程式,用4916個訓練人臉(加上它們的垂直映象,共9832個訓練人臉)和10000個無人臉子視窗(尺寸同樣是24×24畫素)進行訓練。對於最初只含一個特徵的分類器,無人臉訓練例項是從9544張不含人臉的圖片中隨機選擇子視窗得到的。用來訓練後續各層的無人臉例項,則是用部分級聯掃描這些無人臉影象並收集正誤視而得到的。每一層最多收集10000個這樣的無人臉子視窗。

Figure 5: Example of frontal upright face images used for training

Speed of the Final Detector

The speed of the cascaded detector is directly related to the number of features evaluated per scanned sub-window. Evaluated on the MIT+CMU test set [12], an average of 10 features out of a total of 6061 are evaluated per sub-window. This is possible because a large majority of sub-windows are rejected by the first or second layer in the cascade. On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds (using a starting scale of 1.25 and a step size of 1.5 described below). This is roughly 15 times faster than the Rowley-Baluja-Kanade detector [12] and about 600 times faster than the Schneiderman-Kanade detector [15].

最終檢測器的速度

級聯檢測器的速度與每個掃描子視窗中需要評估的特徵數量直接相關。在MIT+CMU測試集[12]上評估時,每個子視窗平均只需評估全部6061個特徵中的10個。這是可能的,因為絕大多數子視窗都被級聯的第一層或第二層剔除了。在700MHz的奔騰III處理器上,該人臉檢測器處理一幅384×288畫素的影象大約只需0.067秒(使用下文描述的1.25的初始尺度和1.5的步長)。這大約是Rowley-Baluja-Kanade檢測器[12]速度的15倍,是Schneiderman-Kanade檢測器[15]速度的約600倍。

Image Processing

All example sub-windows used for training were variance normalized to minimize the effect of different lighting conditions. Normalization is therefore necessary during detection as well. The variance of an image sub-window can be computed quickly using a pair of integral images. Recall that σ² = (1/N)·Σx² − m², where σ is the standard deviation, m is the mean, N is the number of pixels, and x is the pixel value within the sub-window. The mean of a sub-window can be computed using the integral image. The sum of squared pixels is computed using an integral image of the image squared (i.e. two integral images are used in the scanning process). During scanning the effect of image normalization can be achieved by post-multiplying the feature values rather than pre-multiplying the pixels.

影象處理

所有用來訓練的子視窗例項都經過方差歸一化,以儘量減少不同光照條件的影響。因此,在檢測時也必須進行歸一化。一個影象子視窗的方差可以使用一對積分影象快速計算。回想 σ² = (1/N)·Σx² − m²,其中σ是標準差,m是均值,N是畫素個數,x是子視窗中的畫素值。子視窗的均值可以用積分影象計算得出。畫素的平方和則用影象平方的積分影象計算(即掃描過程中使用兩個積分影象)。在掃描過程中,影象歸一化的效果可以通過對特徵值進行後乘來實現,而不必預先乘畫素值。
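The variance computation described above can be sketched in Python. This is an illustrative sketch, not the authors' code: the function names (`integral_images`, `box_sum`, `window_variance`) and the use of NumPy are our own assumptions; only the underlying identity σ² = (1/N)·Σx² − m² and the two-integral-image trick come from the text.

```python
import numpy as np

def integral_images(img):
    """Build the two integral images used during scanning:
    one over pixel values and one over squared pixel values."""
    img = img.astype(np.float64)
    ii = img.cumsum(axis=0).cumsum(axis=1)
    sq_ii = (img ** 2).cumsum(axis=0).cumsum(axis=1)
    return ii, sq_ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum over the inclusive rectangle [r0..r1] x [c0..c1] in O(1)
    using four lookups into an integral image."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

def window_variance(ii, sq_ii, r0, c0, size=24):
    """Variance of a size x size sub-window via the identity
    sigma^2 = (1/N) * sum(x^2) - m^2."""
    n = size * size
    r1, c1 = r0 + size - 1, c0 + size - 1
    mean = box_sum(ii, r0, c0, r1, c1) / n
    sq_mean = box_sum(sq_ii, r0, c0, r1, c1) / n
    return sq_mean - mean ** 2
```

Because each rectangle sum costs only four array lookups, the variance of every scanned sub-window can be obtained in constant time regardless of window size.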

Scanning the Detector

The final detector is scanned across the image at multiple scales and locations. Scaling is achieved by scaling the detector itself, rather than scaling the image. This process makes sense because the features can be evaluated at any scale with the same cost. Good results were obtained using a set of scales a factor of 1.25 apart.

掃描檢測器

最終檢測器在多個尺度和位置上對影象進行掃描。尺度縮放是通過縮放檢測器自身而不是縮放影象來實現的。這個過程之所以合理,是因為特徵可以在任意尺度下以相同的代價計算。使用相鄰尺度間隔為1.25倍的一組尺度可以得到良好的結果。

The detector is also scanned across location. Subsequent locations are obtained by shifting the window some number of pixels Δ. This shifting process is affected by the scale of the detector: if the current scale is S the window is shifted by [SΔ] , where [] is the rounding operation.

檢測器也在不同位置上掃描。後續位置是通過將視窗平移若干畫素Δ得到的。這個平移過程受檢測器尺度的影響:若當前尺度為S,視窗將平移[SΔ],這裡[]表示取整操作。

The choice of Δ affects both the speed of the detector as well as accuracy. The results we present are for  Δ = 1.0 . We can achieve a significant speedup by setting Δ = 1.5 with only a slight decrease in accuracy.

Δ的選擇既影響檢測器的速度,也影響檢測精度。我們給出的結果取Δ=1.0。若設定Δ=1.5,可以在精度僅略微下降的情況下實現顯著提速。
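The multi-scale, multi-location scanning scheme above (detector scaled by factors of 1.25, window shifted by [SΔ]) can be sketched as follows. The function `scan_positions` and its default arguments are illustrative assumptions, not the authors' code.

```python
def scan_positions(img_w, img_h, base=24, scale_factor=1.25,
                   start_scale=1.0, delta=1.0):
    """Enumerate (x, y, window_size) for every scale and location at
    which the detector would be evaluated. The detector itself is
    scaled (base * s), and the shift at scale s is [s * delta],
    where [] denotes rounding."""
    positions = []
    s = start_scale
    while base * s <= min(img_w, img_h):
        w = int(round(base * s))             # scaled window side
        step = max(1, int(round(s * delta))) # shift grows with scale
        for y in range(0, img_h - w + 1, step):
            for x in range(0, img_w - w + 1, step):
                positions.append((x, y, w))
        s *= scale_factor
    return positions
```

Note how larger scales take proportionally larger steps, which is part of why the overall scan stays cheap despite covering many scales.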

Integration of Multiple Detections

Since the final detector is insensitive to small changes in translation and scale, multiple detections will usually occur around each face in a scanned image. The same is often true of some types of false positives. In practice it often makes sense to return one final detection per face. Toward this end it is useful to postprocess the detected sub-windows in order to combine overlapping detections into a single detection.

多檢測的整合

因為最終檢測器對平移和尺度的微小變化不敏感,在一幅掃描影象中每個人臉周圍通常會出現多個檢測結果,某些型別的正誤視也常常如此。在實際應用中,每個人臉只返回一個最終檢測結果往往更有意義。為此,對檢測到的子視窗進行後處理、將重疊的檢測結果合併成單一檢測結果是很有用的。

In these experiments detections are combined in a very simple fashion.  The set of detections are first partitioned into disjoint subsets. Two detections are in the same subset if their bounding regions overlap.  Each partition yields a single final detection.  The corners of the final bounding region are the average of the corners of all detections in the set.

在這些實驗中,我們以非常簡單的方式合併檢測結果。首先把檢測結果集合劃分成若干不相交的子集。若兩個檢測結果的邊界區域重疊,它們就屬於同一個子集。每個子集產生一個最終檢測結果,其最終邊界區域的角點是該子集中所有檢測結果角點的平均值。
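The merging step above can be sketched as follows. The paper only states that overlapping detections share a subset and each subset's corners are averaged; grouping via a union-find over pairwise overlaps is our own (natural) reading of how the partition into disjoint subsets is formed, and all names here are illustrative.

```python
def overlap(a, b):
    """True if two boxes (x0, y0, x1, y1) overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def merge_detections(boxes):
    """Partition boxes into disjoint subsets (overlapping boxes share a
    subset, transitively) and return one averaged box per subset."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlap(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, b in enumerate(boxes):
        groups.setdefault(find(i), []).append(b)

    # Final bounding region: average of the corners in each subset.
    return [tuple(sum(b[k] for b in grp) / len(grp) for k in range(4))
            for grp in groups.values()]
```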

Experiments on a Real-World Test Set

We tested our system on the MIT+CMU frontal face test set [12]. This set consists of 130 images with 507 labeled frontal faces. A ROC curve showing the performance of our detector on this test set is shown in Figure 6. To create the ROC curve the threshold of the final layer classifier is adjusted from -∞ to +∞. Adjusting the threshold to +∞ will yield a detection rate of 0.0 and a false positive rate of 0.0. Adjusting the threshold to -∞, however, increases both the detection rate and false positive rate, but only to a certain point. Neither rate can be higher than the rate of the detection cascade minus the final layer. In effect, a threshold of -∞ is equivalent to removing that layer. Further increasing the detection and false positive rates requires decreasing the threshold of the next classifier in the cascade. Thus, in order to construct a complete ROC curve, classifier layers are removed. We use the number of false positives as opposed to the rate of false positives for the x-axis of the ROC curve to facilitate comparison with other systems. To compute the false positive rate, simply divide by the total number of sub-windows scanned. In our experiments, the number of sub-windows scanned is 75,081,800.

在現實測試集中實驗

我們在MIT+CMU正面人臉測試集[12]上對系統進行了測試。這個集合由130幅影象組成,共包含507個標註好的正面人臉。圖6中的ROC曲線顯示了檢測器在該測試集上的效能。為了繪製ROC曲線,末層分類器的閾值從−∞調整到+∞。把閾值調到+∞時,檢測率為0.0,正誤視率也為0.0。而把閾值調向−∞時,檢測率和正誤視率都會增長,但只能增長到某一點:兩者都不會超過去掉末層之後的級聯檢測器的相應比率。實際上,閾值取−∞就等價於移除這一層。要使檢測率和正誤視率進一步增長,就需要減小級聯中下一個分類器的閾值。因此,為了構建完整的ROC曲線,需要逐層移除分類器。為了便於與其它系統比較,我們用正誤視的數目而不是正誤視率作為ROC曲線的x軸。要計算正誤視率,只需將正誤視數目除以掃描的子視窗總數即可。在我們的實驗中,掃描的子視窗總數為75,081,800。
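The conversion between the ROC curve's x-axis (false positive counts) and a false positive rate described above is just a division by the number of scanned sub-windows; a trivial helper makes the relationship explicit (the function name is our own):

```python
def false_positive_rate(num_false_positives, num_subwindows=75_081_800):
    """Convert a false-positive count into a false-positive rate by
    dividing by the total number of sub-windows scanned (75,081,800
    in the MIT+CMU experiments)."""
    return num_false_positives / num_subwindows
```

For example, a point on the curve with 95 false positives corresponds to a rate of 95 / 75,081,800, i.e. roughly one false positive per 790,000 sub-windows.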

Unfortunately, most previous published results on face detection have only included a single operating regime (i.e. a single point on the ROC curve). To make comparison with our detector easier we have listed our detection rate for the false positive rates reported by the other systems. Table 2 lists the detection rate for various numbers of false detections for our system as well as other published systems. For the Rowley-Baluja-Kanade results [12], a number of different versions of their detector were tested, yielding a number of different results; they are all listed under the same heading. For the Roth-Yang-Ahuja detector [11], they reported their result on the MIT+CMU test set minus 5 images containing line drawn faces.

不幸的是,大多數先前公佈的人臉檢測結果都只包含單一工作點(即ROC曲線上的單一點)。為了便於與我們的檢測器比較,我們列出了在其它系統所報告的正誤視率下我們系統的檢測率。表2列出了我們的系統以及其它已公佈系統在不同正誤視數目下的檢測率。對於Rowley-Baluja-Kanade的結果[12],他們測試了多個不同版本的檢測器,得到了多個不同的結果,這些結果都列在同一標題下。對於Roth-Yang-Ahuja檢測器[11],他們報告的是在去掉5幅含線繪人臉影象後的MIT+CMU測試集上的結果。

Figure 6:   ROC curve for our face detector on the MIT+CMU test set. The detector was run using a step size of 1.0 and starting scale of 1.0 (75,081,800 sub-windows scanned).

圖6:檢測器在MIT+CMU測試集上的ROC曲線。檢測器執行時使用1.0的步長和1.0的初始尺度(共掃描75,081,800個子視窗)。

Figure 7 shows the output of our face detector on some test images from the MIT+CMU test set.

圖7則展示了對於一些來自MIT+CMU測試集中的測試圖片,我們的人臉檢測器的輸出結果。

Figure 7: Output of our face detector on a number of test images from the MIT+CMU test set.

圖7:我們的人臉檢測器在一些來自MIT+CMU測試集的測試影象上的輸出結果。