- paper link
- Recurrent CNN
- Appearance Cues
- Shape Cues
1 Background
Gaze estimation has become an established line of research in computer vision, being a key feature in human-computer interaction (HCI) and usability research.
2 Motivation
Gaze behavior is a dynamic process, yet existing work ignores the temporal information of the human gaze.
3 Contribution
A recurrent-CNN gaze estimation network is proposed, which combines the appearance feature and temporal information of full face, eye region, and facial landmarks from still images.
4 Method
The proposed architecture consists of two parts: a static network and a temporal network. In addition to appearance features, the facial landmarks encode global structural information. The static network therefore has three input streams: the full-face image, the eye patch, and the facial landmarks. The eye-patch image has the advantage of capturing subtle appearance changes around the eyes, while the global structure is represented by 68 3D facial landmarks.
Each stream of the static network is based on the VGG-16 network. Both the full-face stream and the eye-patch stream use FC layers to obtain their feature vectors. The 204-D (68×3) landmark coordinate vector is concatenated with the feature vector from each appearance stream, and the result passes through two FC layers to obtain the final feature vector of the input frame. The per-frame feature vectors are then fed to the temporal network to predict the gaze vector of the last frame. The loss function is the average Euclidean distance between the predicted gaze vector and the ground truth.
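The feature fusion and the loss described above can be sketched as follows. The feature dimensions are illustrative assumptions (only the 204-D landmark vector and the average-Euclidean-distance loss come from the paper summary):

```python
import numpy as np

def fuse_features(appearance_feat, landmarks_3d):
    """Concatenate an appearance feature vector with the flattened
    68x3 landmark coordinates (204-D). The 512-D appearance size
    below is an assumed placeholder, not taken from the paper."""
    landmark_vec = landmarks_3d.reshape(-1)   # (68, 3) -> (204,)
    return np.concatenate([appearance_feat, landmark_vec])

def gaze_loss(pred, target):
    """Average Euclidean distance between predicted and
    ground-truth 3D gaze vectors over a batch."""
    return np.mean(np.linalg.norm(pred - target, axis=-1))

fused = fuse_features(np.zeros(512), np.zeros((68, 3)))
print(fused.shape)  # (716,)
```

The landmark vector is appended after each appearance stream's FC output, so the structural cue is available to both the face and eye branches before the final FC layers.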
5 Experimental Results
The proposed model is tested on the EYEDIAP dataset, and the results are evaluated by average estimation error. They first evaluate the contribution of each static component. Experimental results show that the fusion model, which combines face, eyes, and landmarks, outperforms all other variants. In addition, data normalization plays a key role in accurate gaze estimation.
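Gaze estimation error is conventionally reported as the angular difference between the predicted and ground-truth 3D gaze vectors; a minimal sketch of this metric (an assumed formulation, not taken from the paper's code):

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Angle in degrees between a predicted and a ground-truth
    3D gaze vector."""
    pred = pred / np.linalg.norm(pred)
    gt = gt / np.linalg.norm(gt)
    cos_sim = np.clip(np.dot(pred, gt), -1.0, 1.0)  # guard against rounding
    return np.degrees(np.arccos(cos_sim))

print(angular_error_deg(np.array([0.0, 0.0, 1.0]),
                        np.array([0.0, 1.0, 0.0])))  # 90.0
```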
Second, they evaluate the contribution of the temporal unit, i.e., the RNN module. In the FT (floating target) scenario, the estimation error is significantly lower than that of the static model. In the CS (continuous screen target) scenario, the temporal model does not outperform the static one, presumably because there is less head movement.
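The many-to-one temporal aggregation being evaluated here can be illustrated with a vanilla RNN in NumPy: per-frame feature vectors are folded into a hidden state, and only the final state is mapped to a gaze vector. This is an illustrative stand-in; the paper's actual recurrent unit and all dimensions below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_gaze(frame_feats, Wx, Wh, Wo):
    """Minimal many-to-one vanilla RNN: aggregate the per-frame
    feature vectors over time, then map the last hidden state to
    a 3D gaze vector for the final frame."""
    h = np.zeros(Wh.shape[0])
    for x in frame_feats:              # iterate over the frame sequence
        h = np.tanh(Wx @ x + Wh @ h)   # recurrent state update
    return Wo @ h                      # gaze prediction for the last frame

T, D, H = 7, 16, 8                     # frames, feature dim, hidden dim (assumed)
Wx = rng.standard_normal((H, D))
Wh = rng.standard_normal((H, H))
Wo = rng.standard_normal((3, H))
gaze = rnn_gaze(rng.standard_normal((T, D)), Wx, Wh, Wo)
print(gaze.shape)  # (3,)
```

When head pose barely changes, consecutive frame features are nearly identical, so the recurrent state adds little beyond the static prediction, which is consistent with the CS-scenario result.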
Finally, they visualize the estimation error distribution across gaze and head orientations. The accuracy of the static model drops for extreme head poses and gaze directions. The visualization also shows that the temporal model contributes most in situations with large angular changes.
6 Discussion
The clear strength of this work is that it leverages temporal information to estimate eye gaze. A limitation is that a more capable temporal model remains to be explored.
Reader's comment: besides the temporal information, another notable point of this paper is the use of the 3D coordinates of facial landmarks as the global structural information of the face, which is a convenient way to fuse structural cues into an appearance-based gaze estimation task!