Head Pose Estimation

Facial Feature Tracking

We employ a hierarchical method for tracking the facial features used by the head-orientation recovery model. We assume the initial locations of these points are given; this initialization problem has been addressed in the literature. Given the points in the first frame, we track each point independently using image-based parameterized tracking [3], which performs both global region tracking of the whole face and local region tracking of the individual facial features. We then refine the position estimates by color correlation. The figure below shows the hierarchical framework for facial feature tracking.



From face and feature tracking, the locations of the feature points in the next frame are estimated. The positional refinement step creates a template of each feature point from the current frame and then performs a color correlation over a search region around the estimated location in the next frame. The correlation computes a mis-match score, M, which measures the difference between the template, T, and the target image, I. M is defined as follows:

M = Σ(i,j) max( |RT(i,j) - RI(i,j)|, |GT(i,j) - GI(i,j)|, |BT(i,j) - BI(i,j)| )

where R, G, B represent the red, green, and blue pixel values respectively, and i, j index the pixels.
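A minimal sketch of this refinement step, assuming square RGB templates; the function names and search-window parameter are our own, not from the original implementation:

```python
import numpy as np

def mismatch_score(template, target):
    """Mis-match score M: sum over pixels of the largest per-channel
    absolute difference (R, G, or B) between template and target patch."""
    diff = np.abs(template.astype(int) - target.astype(int))  # H x W x 3
    return int(diff.max(axis=2).sum())

def refine_position(template, image, est_x, est_y, search=5):
    """Slide the template over a (2*search+1)^2 neighbourhood of the
    estimated location and return (M, x, y) for the smallest M found."""
    h, w, _ = template.shape
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = est_y + dy, est_x + dx
            patch = image[y:y + h, x:x + w]
            if patch.shape != template.shape:
                continue  # search window fell outside the image
            m = mismatch_score(template, patch)
            if best is None or m < best[0]:
                best = (m, x, y)
    return best
```

An exhaustive scan like this is cheap because the search region is only a few pixels wide around each predicted feature location.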

Our orientation computation model assumes that the corners of the eyes are collinear, and we exploit this constraint while the corners of the eyes are being tracked. While performing correlation over the search region, the mis-match scores and their corresponding pixel positions are tabulated. We then consider the best m candidates for each feature. Next, a minimum-squared-error line-fitting algorithm is used to determine a single "eye" line. More precisely, given the set of 4m candidate points (xi, yi), we find c0 and c1 that minimize the error function

Σ(i=1 to 4m) [(c0 + c1 xi) - yi]^2

Once the best-fit line is obtained, we choose, among the candidates for each eye corner, the one that minimizes

|d|/|dmax| + M/Mmax

where |d| is the vertical distance from the point to the line, and |dmax| and Mmax are the maximum values of |d| and M over the candidates, respectively. Representative results of our tracking method are shown in the figure below.
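The line fit and candidate selection above can be sketched as follows; the function names and the (x, y, M) candidate-tuple layout are our own:

```python
import numpy as np

def fit_eye_line(points):
    """Least-squares fit of y = c0 + c1*x through the pooled candidate
    corner points, minimising sum_i [(c0 + c1*x_i) - y_i]^2."""
    xs, ys = np.asarray(points, float).T
    A = np.stack([np.ones_like(xs), xs], axis=1)
    (c0, c1), *_ = np.linalg.lstsq(A, ys, rcond=None)
    return c0, c1

def pick_corner(candidates, c0, c1):
    """candidates: list of (x, y, M) tuples for one eye corner.
    Return the candidate minimising |d|/|d_max| + M/M_max, where |d|
    is the vertical distance from the point to the fitted line."""
    ds = [abs((c0 + c1 * x) - y) for x, y, _m in candidates]
    d_max = max(ds) or 1.0          # guard against division by zero
    m_max = max(m for _x, _y, m in candidates) or 1.0
    scores = [d / d_max + m / m_max
              for d, (_x, _y, m) in zip(ds, candidates)]
    return candidates[int(np.argmin(scores))]
```

Normalising both terms by their maxima keeps the geometric and photometric costs on comparable scales before summing them.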






Roll Recovery

Roll is straightforward to recover from the image positions of the eye corners. From the figure, we immediately see that the head roll is

γ = arctan(Δy / Δx)

where Δy and Δx are the vertical and horizontal distances between E1 and E4, respectively.
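This computation is small enough to state directly in code. A sketch, using atan2 rather than a bare arctangent so a vertical eye line does not divide by zero; the sign convention depends on the image coordinate system:

```python
import math

def head_roll(e1, e4):
    """Roll angle gamma = arctan(dy / dx) from the outer eye corners
    E1 = (x1, y1) and E4 = (x4, y4), returned in radians."""
    dx = e4[0] - e1[0]
    dy = e4[1] - e1[1]
    return math.atan2(dy, dx)
```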




Yaw Recovery

Let D1 denote the width of the eyes and D2 denote half of the distance between the two inner eye corners. The head yaw is recovered based on the assumptions that

  • E1E2 = E3E4 (i.e., the eyes are of equal width)
  • E1, E2, E3, and E4 are collinear.

Then, from the well-known projective invariance of the cross-ratio, we have


which yields

where

From perspective projection we obtain


where f is the focal length of the camera. From (6), we obtain


where


From (9), we can determine the head yaw (β).

However, since e3 is measured relative to the projection of the midpoint of E1E4 (which is unknown), we need to determine S and e3 from the relative distances among the projections of the four eye corners.

From (3) - (8), we obtain a quadratic equation in S:


So,

where
A = (1 - Du/Dv)
B = -((2/Q) + 2)(1 + Du/Dv)
C = ((2/Q) + 1)(1 - Du/Dv)
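The quadratic itself was lost with the page's equation images; assuming the standard form A·S² + B·S + C = 0 with the coefficients above (Q and Du/Dv taken as given inputs), its roots can be sketched as:

```python
import math

def solve_S(Q, du_over_dv):
    """Real roots of A*S^2 + B*S + C = 0 using the coefficients quoted
    in the text. Which root is physically meaningful must be decided
    from the geometry; the quadratic's form here is an assumption."""
    r = du_over_dv
    A = 1 - r
    B = -((2 / Q) + 2) * (1 + r)
    C = ((2 / Q) + 1) * (1 - r)
    if A == 0:                      # Du/Dv = 1: equation degenerates to linear
        return (-C / B,)
    disc = B * B - 4 * A * C
    if disc < 0:
        raise ValueError("no real solution for S")
    sq = math.sqrt(disc)
    return ((-B + sq) / (2 * A), (-B - sq) / (2 * A))
```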

To determine e3, we employ two more cross-ratio invariants


From (13) and (14) it can be shown that

By substituting (12) and (15) into (10), we can now determine the head yaw angle (β). Note that β depends only on the relative distances among the four eye corners and the focal length; it is independent of the face structure and of the distance of the face from the camera. It is also unaffected by translation of the face along any axis.




Pitch Recovery

Let D denote the 3D length of the nasal bridge, p0 denote the projected length of the nasal bridge when it is parallel to the image plane, and p1 denote the observed length of the nasal bridge at the unknown pitch. Let (X0, Y0, Z0) and (X5, Y5, Z5) denote the 3D coordinates of the tip of the nose at zero pitch and at the current pitch α, respectively. From perspective projection, we obtain



From (16) and (17) it can be shown that

The estimated pitch angle, α, can be computed by:


where

Computing T requires estimating p0, which is not generally known. Instead, we obtain it by first categorizing the observed face with respect to gender, race, and age, and then using tabulated anthropometric data to estimate the mean of p0. Let N denote the average length of the nasal bridge and E denote the average length of the eye fissure (biocular width).

By employing these statistical estimates for the face structure variables, p0 can be estimated:


where w is the projected length of the eye fissure in the image plane.
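The exact forms of the pitch equations were lost with the page's equation images. Under a weak-perspective approximation (an assumption, not the paper's exact perspective derivation), and assuming the anthropometric estimate scales the measured eye-fissure projection by the ratio N/E, the step can be sketched as:

```python
import math

def estimate_p0(w, N, E):
    """Estimate the fronto-parallel nasal-bridge projection p0 by
    scaling the measured eye-fissure projection w with the
    anthropometric ratio N/E (assumed form of the lost equation)."""
    return w * N / E

def estimate_pitch(p1, p0):
    """Weak-perspective sketch: a bridge of projected rest length p0
    foreshortens to p1 = p0*cos(alpha), so alpha = arccos(p1/p0).
    The paper's full perspective formula differs slightly."""
    return math.acos(min(1.0, p1 / p0))
```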


