faced: CPU Real Time face detection using Deep Learning

What is the problem?

There are many scenarios where single-class object detection is needed. This means that we want to detect the location of all objects that belong to a specific class in an image. For example, we could be detecting faces for a face identification system, or people for pedestrian tracking.

What is more, most of the time we would like to run these models in real time. In order to achieve this, we have a feed of images providing samples at a rate of x frames per second, and we need the model to process each sample in less than 1/x seconds. Then, we can process each image as soon as it is available.
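For instance, a 30 fps feed leaves a budget of 1/30 s (roughly 33 ms) per frame. Here is a minimal sketch of that check; `detect` and `frame` are hypothetical placeholders, not part of any particular library:

```python
import time

FPS = 30            # assumed feed rate
BUDGET = 1.0 / FPS  # per-frame time budget in seconds (~33 ms at 30 fps)

def runs_in_real_time(detect, frame, warmup=3, runs=20):
    """Average the latency of `detect` over several runs and compare
    it to the per-frame budget of the feed."""
    for _ in range(warmup):
        detect(frame)  # warm-up runs (lazy initialization, caches)
    start = time.perf_counter()
    for _ in range(runs):
        detect(frame)
    latency = (time.perf_counter() - start) / runs
    return latency < BUDGET
```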

The most accessible and most widely used solution nowadays for this task (and many others in computer vision) is to perform transfer learning on previously trained models (in general, standard models trained on huge datasets like those found in Tensorflow Hub or in the TF Object Detection API).

There are plenty of trained object detection architectures (e.g. FasterRCNN, SSD or YOLO) that achieve impressive accuracy with real-time performance running on GPUs.

[Figure extracted from the SSD paper]

GPUs are expensive but necessary in the training phase. In inference, however, having a dedicated GPU to achieve real-time performance is not viable. All of the general object detection models (such as those mentioned above) fail to run in real time without a GPU.

Then, how can we revisit the object detection problem for single-class objects to achieve real-time performance on a CPU?

Main idea: simpler tasks require fewer learnable features

All of the above-mentioned architectures were designed to detect multiple object classes (trained on the COCO or PASCAL VOC datasets). In order to classify each bounding box into its appropriate class, these architectures require a massive amount of feature extraction. This translates into a huge number of learnable parameters, filters, and layers. In other words, these networks are big.

If we define simpler tasks (rather than multiple-class bounding box classification), then we can expect the network to need fewer features to perform them. Detecting a face in an image is obviously simpler than detecting cars, people, traffic signs and dogs (all within the same model). The number of features a Deep Learning model requires in order to recognize faces (or any single-class object) is smaller than the number required to detect tens of classes at the same time. The first task requires less information than the latter.

Single-class object detection models need fewer learnable features. Fewer parameters mean a smaller network, and smaller networks run faster because they require fewer computations.
Then, the question is: how small can we go while achieving real-time performance on CPU and keeping accuracy?

This is faced's main concept: building the smallest possible network that (hopefully) runs in real time on a CPU while keeping accuracy.

The architecture

faced is an ensemble of two neural networks, both implemented using Tensorflow.

Main network

faced's main architecture is heavily based on YOLO's. Basically, it's a Fully Convolutional Network (FCN) that runs a 288x288 input image through a series of convolutional and pooling layers (no other layer types are involved).

Convolutional layers are in charge of extracting spatially-aware features. Pooling layers increase the receptive field of subsequent convolutional layers.
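To make this concrete, here is a minimal Keras sketch of such a conv/pool stack. The number of layers and the filter counts are illustrative assumptions, not faced's actual values; note how five 2x2 poolings shrink the 288x288 input down to 9x9:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_backbone(input_size=288):
    """Convolutional/pooling stack: each 2x2 pooling halves the spatial
    resolution (288 -> 144 -> 72 -> 36 -> 18 -> 9), widening the receptive
    field of the convolutions that follow it."""
    inputs = layers.Input(shape=(input_size, input_size, 3))
    x = inputs
    for filters in (8, 16, 32, 64, 128):  # illustrative filter counts
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    return inputs, x  # x has shape (None, 9, 9, 128)
```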

The architecture's output is a 9x9 grid (versus the 13x13 grid in YOLO). Each grid cell is in charge of predicting whether a face is inside it (versus YOLO, where each cell can detect up to 5 different objects).

Each grid cell has 5 associated values. The first one is the probability p of that cell containing the center of a face. The other 4 values are the (x_center, y_center, width, height) of the detected face (relative to the cell).
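Continuing the sketch above, the prediction head can be a 1x1 convolution mapping each of the 9x9 cells to its 5 values, and decoding the grid into absolute pixel boxes just offsets the cell-relative center by the cell coordinates. The sigmoid activation and the assumption that width and height are normalized by the image size are mine, not necessarily faced's exact scheme:

```python
from tensorflow.keras import layers, models

def build_face_detector(input_size=288):
    inputs, features = build_backbone(input_size)
    # A 1x1 convolution maps each 9x9 cell to 5 predictions:
    # (p, x_center, y_center, width, height), each squashed into [0, 1]
    outputs = layers.Conv2D(5, 1, activation="sigmoid")(features)
    return models.Model(inputs, outputs)

def decode_grid(preds, img_size=288, grid=9, threshold=0.85):
    """Turn one (grid, grid, 5) prediction array into absolute pixel boxes.
    (x, y) are taken as relative to the cell; (w, h) as relative to the image."""
    cell = img_size / grid
    boxes = []
    for row in range(grid):
        for col in range(grid):
            p, x, y, w, h = preds[row, col]
            if p < threshold:
                continue
            cx, cy = (col + x) * cell, (row + y) * cell  # face center in pixels
            bw, bh = w * img_size, h * img_size
            boxes.append((cx - bw / 2, cy - bh / 2,
                          cx + bw / 2, cy + bh / 2, float(p)))
    return boxes
```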