
Real-Time Face Pose Estimation with Deep Learning

Introduction to Real-Time Face Pose Estimation with Deep Learning

Facial recognition is a thriving application of deep learning. From phones to airport cameras, it has been adopted rapidly, both commercially and in research. And when you combine it with pose estimation, you get a very powerful combination.

Last week, I worked on a project to estimate the pose of a face presented to the camera. In this article, I'll start by introducing the face pose estimation problem. Then, you'll find out how I trained a deep learning model to solve it. Finally, you'll also have the chance to train your own model and try it out on your own images.

Setting up the Problem Statement

Since the face of a person is a 3D object, it can rotate about all three axes, with some limitations, of course. In a face pose estimation problem, we call these movements roll, pitch, and yaw, as visualized in the figure below:

Estimating these poses is useful for liveness detection systems. For instance, a system may ask a user to perform some predefined random movements (e.g., "rotate your face to the right") to check their liveness. You can also use it to see which students are paying attention while the teacher explains a concept, to estimate where a driver is looking, and so on.

Understanding our Dataset

For this problem, I created a toy dataset with 6,288 images drawn from different face datasets. For each image, I detected the facial landmarks with Dlib (68 points, see the figure below) and computed the pairwise Euclidean distance between all points. Thus, given 68 points, we end up with (68*67)/2 = 2,278 features. The roll, pitch, and yaw of each face were labeled by Amazon's Face Detection API.

Face landmarks computed by the Dlib library.
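To give an idea of how these features can be computed, here is a minimal sketch using dlib and SciPy. It assumes dlib's standard shape_predictor_68_face_landmarks.dat model is available locally, and the image path is just a placeholder; the preprocessing in my notebook may differ in the details.

```python
# Minimal sketch of the feature extraction: 68 dlib landmarks -> 2,278 pairwise distances.
# Assumes dlib's standard shape_predictor_68_face_landmarks.dat file is in the working directory.
import numpy as np
import dlib
from scipy.spatial.distance import pdist

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_distance_features(image_path):
    """Return the (68*67)/2 = 2,278 pairwise Euclidean distances for the first detected face."""
    img = dlib.load_rgb_image(image_path)
    faces = detector(img, 1)
    if not faces:
        return None  # no face found
    shape = predictor(img, faces[0])
    points = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)], dtype=np.float64)
    return pdist(points)  # condensed distance vector of length 2,278

features = landmark_distance_features("face.jpg")  # "face.jpg" is a placeholder path
```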

Training the model

Now, it's time to train our model. I used Keras for this step, but later exported the model to the TensorFlow format in order to use it in a C++ application. After a few attempts, I ended up with the following configuration (a short Keras sketch of the training call follows the list):

  • batch size: 32
  • #epochs: 100
  • optimizer: Adam
  • early stopping with patience = 25
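Assuming the network described in the next section is already built and stored in `model`, and that `X_train`, `y_train`, `X_val`, and `y_val` hold the 2,278-dimensional feature vectors and the roll/pitch/yaw labels, the training call looks roughly like this (a sketch, not the exact notebook code):

```python
# Rough Keras training sketch with the hyperparameters listed above.
# `model`, `X_train`, `y_train`, `X_val`, `y_val` are assumed to exist already.
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer="adam", loss="mse")

early_stopping = EarlyStopping(monitor="val_loss", patience=25)

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=100,
    callbacks=[early_stopping],
)
```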

And here is the final network architecture:

Model summary

There are a few details about my model architecture I would like to highlight:

  • Strong regularization: Since we have lots of features (2,278), there are a few types of regularization present in my network to prevent overfitting and handle the curse of dimensionality. First, all layers have L2 regularization, since we don't want the model to give high importance to only some features, especially in the first layer. In addition, the model's architecture itself is a kind of regularization: it follows the pattern of autoencoders, where each layer has fewer neurons than the previous one, to "force" the network to learn relevant information and ignore the irrelevant parts. A rough sketch of such a network appears right after this list.
  • I didn't use dropout due to the low number of neurons in each layer.
  • I could have used L1 regularization in the first layer to force the network to ignore unnecessary features (since this kind of regularization tends to set the weights associated with those features to zero). However, since the network is already strongly regularized, I preferred not to. In my tests, L2 regularization also worked a little better. :)
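To make the funnel-shaped, L2-regularized design more concrete, here is a rough Keras sketch. The layer sizes and the weight-decay factor are my own illustrative assumptions; the real architecture is the one shown in the model summary above.

```python
# Illustrative sketch of an autoencoder-like "funnel" regressor with L2 regularization.
# Layer sizes and the weight-decay factor are assumptions, not the exact architecture above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.regularizers import l2

def build_model(n_features=2278, weight_decay=1e-4):
    return Sequential([
        Input(shape=(n_features,)),
        Dense(256, activation="relu", kernel_regularizer=l2(weight_decay)),
        Dense(128, activation="relu", kernel_regularizer=l2(weight_decay)),
        Dense(64, activation="relu", kernel_regularizer=l2(weight_decay)),
        Dense(3, kernel_regularizer=l2(weight_decay)),  # outputs: roll, pitch, yaw
    ])

model = build_model()
```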

You can check the full source code here.

Results

Here is the graph of train/val loss after training for 100 epochs:

Train loss: 29.21 | Val loss: 33.39 | Test loss: 39.93

As we can see, our model achieved a good result even on the test set, and the curves follow the pattern expected when training deep learning models. Since we are using MSE as the loss function, the square root of the test loss suggests the model is off by roughly ±6º per angle.
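As a quick sanity check on that ±6º figure (assuming the angles are labeled in degrees), taking the square root of each reported MSE gives the corresponding RMSE:

```python
# The reported losses are MSE values, so a typical per-angle error is roughly their square root.
import math

for split, mse in [("train", 29.21), ("val", 33.39), ("test", 39.93)]:
    print(f"{split}: RMSE ~ {math.sqrt(mse):.1f} degrees")
# test RMSE is about 6.3 degrees, which is where the +/-6 degree estimate comes from
```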

In the graphs below, we can see the difference between the real and predicted angles for each sample in the test set. As we can observe, yaw was the easiest to predict, followed by roll and then pitch. However, there are some outliers in all the graphs; I'll investigate them in a future article.

Results on the test set for each pose (roll, pitch, and yaw). Each point in the graph represents the difference between the real and predicted angles for one sample in the test set.

End Notes

I uploaded a Jupyter Notebook where you can see all the steps I went through to solve this problem. You can also follow the instructions available in the same repository to play around with the toy dataset I created and then test the model on your own image(s). Try it out!
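If you want a rough idea of what the inference step looks like on a single photo, here is a sketch. It reuses the hypothetical `landmark_distance_features` helper from the dataset section, and both the saved-model file name (`pose_model.h5`) and the image path are placeholders; follow the repository instructions for the actual workflow.

```python
# Sketch of running a trained regressor on one image.
# `landmark_distance_features` is the hypothetical helper from the dataset section;
# "pose_model.h5" and "my_photo.jpg" are placeholder file names.
from tensorflow.keras.models import load_model

model = load_model("pose_model.h5")
features = landmark_distance_features("my_photo.jpg")
if features is None:
    print("No face detected.")
else:
    roll, pitch, yaw = model.predict(features.reshape(1, -1))[0]
    print(f"roll={roll:.1f}, pitch={pitch:.1f}, yaw={yaw:.1f} (degrees)")
```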

As next steps, I have a few ideas:

  • Try to find a better model. If you do, please let me know!
  • Inspect the outliers in the results of the test set
  • Create the same tutorial using TensorFlow
  • Since my architecture is the beginning of an autoencoder, I could replicate and transpose my weights to use them as the "decoder". Then, I would feed roll, pitch, and yaw as inputs, and the model would output the 2,278 features. I could try using those features to draw the 68 landmark points and visualize the result. That would be awesome!

Finally, take a look at the video of the model running in real time on a CPU: