2008 Head Pose Estimation in Computer Vision: A Survey (Erik Murphy-Chutorian et al.)
人脸识别英文专业词汇 (face recognition terminology, English-Chinese):

- gallery set: 参考图像集
- probe set (= test set): 测试图像集
- face rendering
- facial landmark detection: 人脸特征点检测
- 3D morphable model: 3D形变模型
- AAM (active appearance model): 主动外观模型
- aging modeling: 老化建模
- aging simulation: 老化模拟
- analysis by synthesis: 综合分析
- aperture stop: 孔径光阑
- appearance feature: 表观特征
- baseline: 基准系统
- benchmarking: 确定基准
- bidirectional relighting: 双向重光照
- camera calibration: 摄像机标定(校正)
- cascade of classifiers: 级联分类器
- face detection: 人脸检测
- facial expression: 面部表情
- depth of field: 景深
- edgelet: 小边特征
- eigen light-fields: 本征光场
- eigenface: 特征脸
- exposure time: 曝光时间
- expression editing: 表情编辑
- expression mapping: 表情映射
- partial expression ratio image (PERI): 局部表情比率图
- extrapersonal variations: 类间变化
- eye localization: 眼睛定位
- face image acquisition: 人脸图像获取
- face aging: 人脸老化
- face alignment: 人脸对齐
- face categorization: 人脸分类
- frontal faces: 正面人脸
- face identification: 人脸识别
- face recognition vendor test: 人脸识别供应商测试
- face tracking: 人脸跟踪
- facial action coding system: 面部动作编码系统
- facial aging: 面部老化
- facial animation parameters: 脸部动画参数
- facial expression analysis: 人脸表情分析
- facial landmark: 面部特征点
- facial definition parameters: 人脸定义参数
- field of view: 视场
- focal length: 焦距
- geometric warping: 几何扭曲
- street view: 街景
- head pose estimation: 头部姿态估计
- harmonic reflectances: 谐波反射
- horizontal scaling: 水平伸缩
- identification rate: 识别率
- illumination cone: 光照锥
- inverse rendering: 逆向绘制技术
- iterative closest point: 迭代最近点
- Lambertian model: 朗伯模型
- light-field: 光场
- local binary patterns: 局部二值模式
- mechanical vibration: 机械振动
- multi-view videos: 多视点视频
- band selection: 波段选择
- capture systems: 获取系统
- frontal lighting: 正面光照
- open-set identification: 开集识别
- operating point: 操作点
- person detection: 行人检测
- person tracking: 行人跟踪
- photometric stereo: 光度立体技术
- pixellation: 像素化
- pose correction: 姿态校正
- privacy concern: 隐私关注
- privacy policies: 隐私策略
- profile extraction: 轮廓提取
- rigid transformation: 刚体变换
- sequential importance sampling: 序贯重要性抽样
- skin reflectance model: 皮肤反射模型
- specular reflectance: 镜面反射
- stereo baseline: 立体基线
- super-resolution: 超分辨率
- facial side-view: 面部侧视图
- texture mapping: 纹理映射
- texture pattern: 纹理模式

Rama Chellappa, PhD study plan: 1. Complete the earlier work on statistical modeling of fingerprint minutiae.
A Complete Guide to Deep-Learning Human Pose Estimation: From DeepPose to HRNet
Human pose estimation has received attention in the computer vision community for decades.
It is a key step toward understanding the behavior of people in images and videos.
With the rise of deep learning in recent years, the field of human pose estimation has changed dramatically.
This post starts with DeepPose, the seminal work that brought deep learning to 2D human pose estimation, and reviews the most important papers in this area from recent years.
What is human pose estimation?
Human pose estimation (HPE) is defined as the problem of localizing human joints (also known as keypoints: elbows, wrists, and so on) in images or videos.
It can also be framed as searching the space of all articulated poses for the particular pose present in the image.
2D pose estimation: estimate the 2D location (x, y) of each joint in an RGB image.
3D pose estimation: estimate the 3D location (x, y, z) of each joint in an RGB image.
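As a concrete illustration, a 2D pose is simply an ordered set of joint coordinates, and predictions are commonly scored with a PCK-style correctness check. The sketch below uses a hypothetical 16-joint skeleton and synthetic data; the joint names, the reference distance, and the 0.5 threshold are illustrative conventions, not part of any specific paper.

```python
import numpy as np

# A hypothetical 16-joint skeleton; the order is arbitrary but must be consistent.
JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist", "l_shoulder",
          "l_elbow", "l_wrist", "pelvis", "r_hip", "r_knee", "r_ankle",
          "l_hip", "l_knee", "l_ankle", "thorax"]

def pck(pred, gt, ref_dist, alpha=0.5):
    """Percentage of Correct Keypoints: a joint is correct when its prediction
    lies within alpha * ref_dist of the ground truth.
    pred, gt: (16, 2) arrays of (x, y); ref_dist: e.g. a head-segment length."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists < alpha * ref_dist))

pred = np.random.rand(16, 2) * 100   # stand-in network output
gt = pred + np.random.randn(16, 2)   # stand-in annotations
print(f"PCK@0.5: {pck(pred, gt, ref_dist=30.0):.3f}")
```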
HPE has some very cool applications and is widely used in action recognition, animation, gaming, and other fields.
For example, HomeCourt, a popular deep-learning app, uses pose estimation to analyze the movements of basketball players.
Why is human pose estimation hard?
Flexible body configurations, small and barely visible joints, occlusion, clothing, and lighting changes all add to the difficulty of human pose estimation.
Approaches to 2D human pose estimation
Classical methods
The classical approach to articulated pose estimation uses the pictorial structures framework.
The basic idea is to represent the target object as a collection of "parts" whose spatial arrangement is deformable rather than rigid; a minimal sketch of this scoring follows below.
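To make the idea concrete, here is a minimal sketch of a pictorial-structures score for one candidate configuration: each part contributes a unary appearance score, and each connected pair of parts pays a quadratic deformation cost for deviating from its rest offset. The part list, rest offsets, and weights are invented for illustration.

```python
import numpy as np

# Hypothetical 3-part chain: torso -> upper arm -> lower arm.
EDGES = [(0, 1), (1, 2)]                        # tree-structured connections
REST_OFFSET = {(0, 1): np.array([0.0, -20.0]),  # expected child-minus-parent offset
               (1, 2): np.array([0.0, -15.0])}

def score(locations, unary, weight=0.05):
    """Pictorial-structures score: appearance minus deformation penalty.
    locations: (3, 2) candidate part positions; unary: per-part appearance scores."""
    s = float(np.sum(unary))
    for i, j in EDGES:
        d = (locations[j] - locations[i]) - REST_OFFSET[(i, j)]
        s -= weight * float(d @ d)              # quadratic "spring" cost
    return s

locs = np.array([[50.0, 80.0], [50.0, 61.0], [52.0, 47.0]])
print(score(locs, unary=np.array([1.2, 0.9, 0.7])))
```

In the full framework, the best configuration over all part placements is found efficiently with dynamic programming over the tree; the sketch only evaluates one candidate.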
Face tracking with automatic model construction
Jesus Nuevo, Luis M. Bergasa, David F. Llorca, Manuel Ocaña
Department of Electronics, Universidad de Alcala, Esc. Politecnica, Crta Madrid-Barcelona Km 33,600, 28871 Alcala de Henares, Madrid, Spain
Image and Vision Computing 29 (2011) 209–218

Article history: Received 12 November 2009; revised 26 August 2010; accepted 15 November 2010.
Keywords: Face tracking; Appearance modeling; Incremental clustering; Robust fitting; Driver monitoring

Abstract: This paper describes an active model with a robust texture model built on-line. The model uses one camera and is able to operate without active illumination. The texture model is defined by a series of clusters, built over a video sequence from previously encountered samples; this model is used to search for the corresponding elements in the following frames. An on-line clustering method, named leaderP, is described and evaluated on a face tracking application. A 20-point shape model is used. This model is built offline, and a robust fitting function is used to constrain the positions of the points. Our proposal is intended to serve as one of the stages of a driver monitoring system. To test it, a new set of sequences of drivers, recorded outdoors and in a realistic simulator, has been compiled. Experimental results are presented for typical outdoor driving scenarios, with frequent head movements, turns and occlusions. Our approach is tested and compared with Simultaneous Modeling and Tracking (SMAT) [1] and the recently presented Stacked Trimmed Active Shape Model (STASM) [2]; it shows better results than SMAT and fitting error levels similar to STASM, with much faster execution times and improved robustness.

1. Introduction

Driver inattention is a major cause of traffic accidents, and it has been found to be involved in some form in 80% of the crashes and 65% of the near-crashes within 3 s of the event [3]. Monitoring a driver to detect inattention is a complex problem that involves physiological and behavioural elements. Different works have been presented in recent years, focused mainly on drowsiness, using a broad range of techniques. Physiological measurements such as electro-encephalography (EEG) [4] or electro-oculography (EOG) provide the best data for detection [4]. The problem with these techniques is that they are intrusive to the subject; moreover, medical equipment is expensive. Lateral position of the vehicle inside the lane, steering wheel movements and time-to-line crossing are commonly used, and some commercial systems have been developed [5,6]. These techniques are not invasive, and to date they obtain the most reliable results. However, the measurements they use may not reflect behaviours such as the so-called micro-sleeps [7]. They also require a training period for each person, and thus are not applicable to the occasional driver.

Drivers in fatigue exhibit changes in the way their eyes perform some actions, like moving or blinking. These actions are known as visual behaviours, and they are readily observable in drowsy and distracted drivers. Face pose [8] and gaze direction also contain information, and have been used as another element of inattention detection systems [9]. Computer vision has been the tool of choice of many researchers for monitoring visual behaviours, as it is non-intrusive. Most systems use one or two cameras to track the head and eyes of the subject [10–14]. A few companies commercialize systems [15,16] as accessories for installation in vehicles. These systems require user-specific calibration, and some of them use near-IR lighting, which is known to produce eye fatigue. The reliability of these systems is still not high enough for car companies to take on the responsibility of their production, and the possible liability in case of malfunction.

Face location and tracking are the first processing stages of most computer vision systems for driver monitoring. Some of the most successful systems to date use near-IR active illumination [17–19] to simplify the detection of the eyes thanks to the bright pupil effect. Near-IR illumination is not as useful during the day, because sunlight also has a near-IR component. As mentioned above, near-IR can produce eye fatigue, which limits the amount of time these systems can be used on a person. Given the complexity of the problem, it has been divided into parts, and in this work only the problem of face tracking is addressed.

This paper presents a new active model in which the texture model is built incrementally. We use it to characterize and track the face in video sequences. The tracker can operate without active illumination.
The texture model of the face is created online, and is thus specific to each person without requiring a training phase. A new online clustering algorithm is described, and its performance is compared with the method proposed in [1]. Two shape models, trained online and offline, are compared. This paper also presents a new video sequence database, recorded in a car moving outdoors and in a simulator. The database is used to assess the performance of the proposed face tracking method in the challenging environment that a driver monitoring application would meet. No evaluations of face pose estimation or driver inattention detection are performed.

The rest of the paper is structured as follows. Section 2 presents a few remarkable works on face tracking in the literature that are related to our proposal. Section 3 describes our approach. Section 4 describes the video dataset used for performance evaluation, and the experimental results. The paper closes with conclusions and future work.

2. Background

Human face tracking is a broad field in computing research [20], and a myriad of techniques have been developed over the last decades. It is of the greatest interest, as vast amounts of information are contained in face features, movements and gestures, which are constantly used for human communication. Systems that work on such data often use face tracking [21,22]. Non-rigid object tracking has been a major focus of research in recent years, and general-purpose template-based trackers have been used successfully to track faces in the literature. Several efficient approaches have been presented [23–26].

Statistical models have been used for face modeling and tracking. Active Shape Models (ASM) [27] are similar to active contours (snakes), but include constraints from a Point Distribution Model (PDM) [28] computed in advance from a training set. Recent advances have increased their robustness and precision to remarkable levels (STASM [2]). Extensions of ASM that include modeling of texture have been presented, of which Active Appearance Models (AAMs) [29] are arguably the best known. AAMs are global models in the sense that the minimization is performed over all pixels that fall inside the mesh defined by the mean of the PDM. All these models have an offline training phase, which requires comprehensive training sets so that they generalize properly to unseen instances of the object. This is a time-consuming process, and there is still the risk that perfectly valid instances of the object will not be modeled correctly.

Several methods that work without a priori models have been presented in the literature. Most of them focus on patch tracking in a video sequence. The classic approach is to use the image patch extracted from the first frame of the sequence to search for similar patches in the following frames. The Lucas–Kanade method [30] was one of the first proposed solutions, and it is still widely used. Jepson et al. [31] presented a system with an appearance model based on three components: a stable component that is learned over a long period based on wavelets, a 2-frame tracker, and an outlier rejection process. Yin and Collins [32] build an adaptive, view-dependent appearance model on-line. The model is made of patches selected around Harris corners. Model and target patches are matched using correlation, and the change in position, rotation and scale is obtained with the Procrustes algorithm.

Another successful line of work in object tracking without a priori training is based on classification instead of modeling. Collins and Liu [33] presented a system based on background/foreground discrimination. Avidan [34] presents one of the many systems that use machine learning to classify patches [35,36]; Avidan uses weak classifiers trained every frame, combined with AdaBoost. Pilet et al. [37] train keypoint classifiers using Random Trees that are able to recognize hundreds of keypoints in real time.

Simultaneous Modeling and Tracking (SMAT) [1] is in line with methods like Lucas–Kanade, relying on matching to track patches. Lucas–Kanade extracts a template at the beginning of the sequence and uses it for tracking, and will fail if the appearance of the patch changes considerably. Matthews et al. [38] proposed a strategic update of the template, which keeps the template from the first frame to correct errors that appear in the localization; when the error is too high, the update is blocked. In [39], a solution is proposed with a fixed template that adaptively detects and selects the window around the features. SMAT builds a more complex model based on incremental clustering.

In this paper we combine concepts from active models with the incremental clustering proposed in SMAT. The texture model is created online, making the model adaptive, while the shape model is learnt offline. The clustering used by SMAT has some limitations, and we propose some modifications to obtain a more robust model and better tracking. We name the approach Robust SMAT (R-SMAT) for this reason.

Evaluation of face tracking methods is performed in most works with images captured indoors. Some authors use freely available image sets, but most test on internal datasets of their own, which limits the validity of comparisons with other systems. Only a few authors [40,41] have used images recorded in a vehicle, but the number of samples is limited. To the best of our knowledge, there is no publicly available video dataset of people driving, either in a simulator or on a real road. We propose a new dataset that covers such scenarios.

3. Robust simultaneous modeling and tracking

This section describes the Simultaneous Modeling and Tracking (SMAT) of Dowson and Bowden [1], and some modifications that we propose to improve its performance. SMAT tries to build a model of the appearance of features, and of how their positions are related (the structure model, or shape), from samples of texture and shape obtained in previous frames. The models of appearance and shape are independent. Fitting is performed in the same fashion as ASM: the features are first found separately using correlation, and then their final positions are constrained by the shape model. If the final positions are found to be reliable and not caused by fitting errors, the appearance model is updated; otherwise it is left unchanged. Fig. 1 shows a flow chart of the algorithm.

(Fig. 1. SMAT block diagram.)

3.1. Appearance modeling

Each one of the possible appearances of an object, or of a feature of it, can be considered as a point in a feature space. Similar appearances will be close in this space, away from other points representing dissimilar appearances of the object. These groups of points, or clusters, form a mixture model that can be used to define the appearance of the object. SMAT builds a library of exemplars obtained from previous frames (image patches, in this case). Dowson and Bowden defined a series of clusters by their median patch, also known as the representative, and their variance. A new incoming patch is made part of a cluster if the distance between it and the median of the cluster is below a threshold that is a function of the variance. The median and variance of a cluster are recalculated every time a patch is added to it. Up to M exemplars per cluster are kept; if the size limit is reached, the element most distant from the representative is removed. Every time a cluster is updated, the weights of the clusters are recalculated as in Eq. (1):

$$w_k^{(t+1)} = \begin{cases} \dfrac{w_k^{(t)} + \alpha}{1+\alpha} & \text{if } k = k_u \\[4pt] \dfrac{w_k^{(t)}}{1+\alpha} & \text{otherwise} \end{cases} \tag{1}$$

where $\alpha \in [0,1)$ is the learning rate, and $k_u$ is the index of the updated cluster. The number of clusters is also limited, to K. If K is reached, the cluster with the lowest weight is discarded.

In a later work, Dowson et al. [42] introduced a different condition for membership, which compares the probabilities of the exemplar belonging to the foreground (a cluster) or to the background, as in Eq. (2):

$$\frac{p\left(\mathrm{fg} \mid d(x,\mu_n), \sigma_n^{\mathrm{fg}}\right)}{p\left(\mathrm{bg} \mid d(x,\mu_n), \sigma_n^{\mathrm{bg}}\right)} \tag{2}$$

where $\sigma_n^{\mathrm{fg}}$ is obtained from the distances between the representative and the other exemplars in the cluster, and $\sigma_n^{\mathrm{bg}}$ is obtained from the distances between the representative and the exemplars in the cluster offset by 1 pixel.

We have found that this clustering method can be improved in several ways. The adaptive nature of the clusters could theoretically lead two or more clusters to overlap. However, in our tests we have observed that the opposite is much more frequent: the representative of the cluster rarely changes after the cluster has reached a certain number of elements. Outliers can be introduced into the model in the event of an occlusion of the face by a hand or by other elements, like a scarf. In most cases, these exemplars will be far from the representative of the cluster. To remove them and reduce the memory footprint, SMAT keeps up to M exemplars per cluster, removing the element most distant from the representative when the size limit is reached. When very similar patches are constantly introduced, one of them will eventually be chosen as the median and the variance will decrease, overfitting the cluster and discarding valuable exemplars. At a frame rate of 30 fps, with M set to 50, the cluster will overfit in less than 2 s. This would happen even if the exemplar to be removed were chosen randomly. This procedure discards valuable information, and future subtle changes to the feature will lead to the creation of another cluster.

We propose an alternative clustering method, named leaderP, to partially solve these and other problems. The method is a modification of the leader algorithm [43,44], arguably the simplest and most frequently used incremental clustering method. In leader, each cluster C_i is defined by only one exemplar and a fixed membership threshold T. It starts by making the first exemplar the representative of a cluster. If an incoming exemplar falls within the threshold T, it is marked as a member of that cluster; otherwise it becomes a cluster on its own. The pseudocode is shown in Algorithm 1.

Algorithm 1. Leader clustering.
1:  Let C = {C_1, ..., C_n} be a set of n clusters, with weights {w_1^t, ..., w_n^t}
2:  procedure leader(E, C)                      # cluster patch E
3:    for all C_k in C do
4:      if d(C_k, E) < T then                   # check if patch E belongs to C_k
5:        UpdateWeights(w_1^t, ..., w_n^t)      # as in Eq. (1)
6:        return
7:      end if
8:    end for
9:    Create new cluster C_{n+1}, with E as representative
10:   Set w_{n+1}^{t+1} <- 0                    # weight of new cluster C_{n+1}
11:   C <- C u C_{n+1}                          # add new cluster to the model
12:   if n + 1 > K then                         # remove the cluster with lowest weight
13:     Find C_k such that w_k <= w_i, i = 1, ..., n
14:     C <- C \ C_k
15:   end if
16: end procedure

LeaderP, on the other hand, keeps the first few exemplars added to the cluster, up to P. The median of those exemplars is chosen as the representative, as in the original clustering of Dowson and Bowden. When the number of exemplars in the cluster reaches P, all exemplars but the representative are discarded, and the cluster then operates under the leader algorithm. P is chosen to be a small number (we use P = 10). The membership threshold, however, is flexible: the distance between the representative and each exemplar found to be a member of the cluster is saved, and the variance of those distances is used to calculate the threshold. Because the representative is fixed and each distance is a scalar, many values can be kept in memory without impacting overall performance. Keeping more values reduces the risk of overfitting.
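To make the leaderP update concrete, here is a minimal sketch in Python. It follows the description above (a warm-up of P exemplars, a representative fixed as the medoid, scalar member distances kept to adapt the threshold, weights updated as in Eq. (1), and at most K clusters), but several details are assumptions: plain L2 distance stands in for the paper's mutual information, and the threshold rule (mean plus two standard deviations of the stored distances) is an illustrative choice, since the paper does not give an explicit formula.

```python
import numpy as np

def medoid(patches):
    """Exemplar with the smallest total L2 distance to all others (the 'median patch')."""
    stack = np.stack(patches)                                   # (n, h, w)
    d = np.linalg.norm(stack[:, None] - stack[None, :], axis=(2, 3))
    return stack[int(np.argmin(d.sum(axis=1)))]

class LeaderPCluster:
    def __init__(self, patch, P=10, init_thresh=12.0):
        self.members = [patch]       # kept only until P exemplars are collected
        self.rep = patch             # representative; fixed after warm-up
        self.dists = []              # scalar distances of accepted members
        self.P, self.init_thresh = P, init_thresh
        self.weight = 0.0

    def _threshold(self):
        if len(self.dists) < 2:
            return self.init_thresh
        d = np.asarray(self.dists)
        return d.mean() + 2.0 * d.std()   # assumed rule; the paper keeps the variance

    def try_add(self, patch):
        d = float(np.linalg.norm(self.rep - patch))
        if d >= self._threshold():
            return False
        self.dists.append(d)
        if self.members is not None:          # still warming up
            self.members.append(patch)
            if len(self.members) >= self.P:
                self.rep = medoid(self.members)   # fix the representative...
                self.members = None               # ...and drop the exemplars
        return True

def leaderp_update(clusters, patch, K=5, alpha=0.1):
    """Insert one patch; reweight as in Eq. (1); cap the model at K clusters,
    mirroring Algorithm 1 (a new cluster starts with weight 0)."""
    for i, c in enumerate(clusters):
        if c.try_add(patch):
            for j, cj in enumerate(clusters):     # Eq. (1)
                cj.weight = (cj.weight + alpha if j == i else cj.weight) / (1 + alpha)
            return clusters
    clusters.append(LeaderPCluster(patch))
    if len(clusters) > K:
        clusters.remove(min(clusters, key=lambda c: c.weight))
    return clusters
```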
The original proposal of SMAT used Mutual Information (MI) as the distance measure to compare the image patches, and found it to perform better than the Sum of Squared Differences (SSD), and slightly better than correlation in some tests. Any definition of distance could be used; we have also tested Zero-mean Normalized Cross-Correlation (ZNCC). Several types of warping were tested in [42]: translation, Euclidean, similarity and affine. The results showed an increasing failure rate as the degrees of freedom of the warps increased. Based on this, we have chosen the simplest: the patches are only translated, depending on the point distribution model.

3.2. Shape model

In the original SMAT of Dowson and Bowden, the shape was also learned on-line. The same clustering algorithm was used, but the membership of a new shape to a cluster was calculated using the Mahalanobis distance. Our method relies on a pre-learned shape model. The restrictions imposed by using a pre-learned model for shape are smaller than those for an appearance model, as the shape is of lower dimensionality and its deformations are easier to model. It has been shown [45] that location and tracking errors are mainly due to appearance, and that a generic shape model for faces is easier to construct.

We use the method of classic ASM [27], which applies PCA to a set of hand-labeled sample shapes and extracts the mean $s_0$ and an orthogonal vector basis $(s_1, \ldots, s_N)$. The shapes are first normalized and aligned using Generalized Procrustes Analysis [46]. Let $s = (x_0, y_0, \ldots, x_{n-1}, y_{n-1})$ be a shape. A shape can be generated from this basis as

$$s = s_0 + \sum_{i=1}^{m} p_i \, s_i. \tag{3}$$

Using the L2 norm, the coefficients $p = (p_1, \ldots, p_N)$ can be obtained for a given shape $s$ as a projection of $s$ onto the vector basis:

$$p = S^{T}(s - s_0), \qquad p_i = (s - s_0) \cdot s_i \tag{4}$$

where $S$ is the matrix whose columns are the eigenvectors $s_i$.

The estimation of $p$ with Eq. (4) is very sensitive to the presence of outlier points: a high error value at one point will severely influence the values of $p$. We use M-estimators [47] to solve this problem. This technique has been applied to ASM and AAM in previous works [48,49], so it is only briefly presented here. Let $s$ be a shape, obtained by fitting each feature independently. The function to minimize is

$$\arg\min_{p} \sum_{i=1}^{2n} \rho(r_i; \theta) \tag{5}$$

where $\rho: \mathbb{R} \times \mathbb{R}^{+} \to \mathbb{R}^{+}$ is an M-estimator, and $\theta$ is obtained from the standard deviation of the residues [50]. $r_i$ is the residue for coordinate $i$ of the shape:

$$r_i = x_i - \left(s_0^i + \sum_{j=1}^{m} p_j \, s_j^i\right) \tag{6}$$

where $x_i$ are the coordinates of the shape $s$, and $s_j^i$ is the $i$-th element of the vector $s_j$. Minimizing Eq. (5) is a case of re-weighted least squares. The weight decreases more rapidly than the square of the residue, and thus a point with error tending to infinity will have zero weight in the estimation. Several robust estimators have been tested: the Huber, Cauchy, Gaussian and Tukey functions [50]. A study in [19], in a scenario similar to that of this paper, found similar performance for all of them, and the Huber function was chosen. The Huber function performs correctly with up to 50% of the points being outliers.

We use the 20-point distribution of the BioID database [51]; data from this database was used to train the model. This distribution places the points at some of the most salient locations of the face, and it has been used in several other works [40].

(Fig. 2. Trajectory of the vehicle.)
(Fig. 3. Samples of outdoor videos.)

4. Tests and results

This section presents the video sequences used to test the different tracking algorithms in a driving scenario. The dataset contains most actions that appear in everyday driving situations. A comparison between our approach and SMAT is presented. Additionally, we compare the R-SMAT results with the recently introduced Stacked Trimmed ASM (STASM).

4.1. Test set

Driving scenarios present a series of challenges for a face tracking algorithm. Drivers move constantly, rotate their head (self-occluding part of the face) or occlude their face with their hands (or other elements, such as glasses). If other people are in the car, talking and gesturing are common. There are also constant background changes and, more importantly, frequent illumination changes, produced by the shadows of trees or buildings, street lights, other vehicles, etc. A considerable amount of test data is needed to properly evaluate the performance of a system in all these situations.

A new video dataset has been created, with sequences of subjects driving outdoors and in a simulator. The RobeSafe Driver Monitoring Video (RS-DMV) dataset contains 10 sequences, 7 recorded outdoors (Type A) and 3 in a simulator (Type B). Outdoor sequences were recorded in RobeSafe's vehicle moving around the campus of the University of Alcala. Drivers were fully awake, talked frequently with other passengers in the vehicle, and were asked to look regularly at the rear-view mirrors and to operate the car sound system. The cameras were placed over the dashboard, to avoid occlusions caused by the wheel. All subjects drove the same streets, shown in Fig. 2. The length of the track is around 1.1 km. The weather conditions during the recordings were mostly sunny, which made noticeable shadows appear on the face. Fig. 3 shows a few samples from these video sequences.

Type B sequences were recorded in a realistic truck simulator. Drivers were fully awake, and were presented with a demanding driving environment where many other vehicles were present and potentially dangerous situations took place. These situations increase the probability of small periods of distraction leading to crashes or near-crashes. The sequences try to capture both distracted behaviour and the reaction to dangerous driving situations. A few images from Type B sequences can be seen in Fig. 4. The recordings took place in a low-light scenario that approached nighttime conditions. This forced the camera to increase the exposure time to its maximum, which led to motion blur during head movements. Low-power near-IR illumination was used in some of the sequences to increase the available light.

(Fig. 4. Samples of sequences in the simulator.)
(Fig. 5. Performance of different shape models, with leaderP clustering; cumulative distributions of the m_e17 error are shown for panels (a) and (b).)

The outdoor sequences are around 2 minutes long, and the sequences in the simulator are close to 10 minutes in length. The algorithms in this paper were tested on images of approximately 320x240 pixels, but higher-resolution images were acquired so that they can be used in other research projects. The images are 960x480 pixels for the outdoor sequences and 1392x480 for the simulator sequences, and they are stored without compression. The frame rate is 30 frames per second in both cases. The camera has a 2/3" sensor and used standard 9 mm lenses. Images are grayscale. The recording software controlled the camera gain using the values of the pixels that fell directly on the face of the driver. RS-DMV is publicly available, free of charge, for research purposes. Samples and information on how to obtain the database are available on the authors' webpage (.../personal/jnuevo).

4.2. Performance evaluation

The performance of the algorithms is evaluated as the error between the estimated positions of the features and their actual positions, as given by a human operator. Hand-marking is a time-consuming task, and thus not all frames in all videos have been marked. Approximately 1 in 30 frames (1 per second) has been marked in the sequences of RS-DMV. We call these frames keyframes. We use the metric $m_e$, introduced by Cristinacce and Cootes [40]. Let $x_i$ be the points of the ground-truth shape $s$, and let $\hat{x}_i$ be the points of the estimated shape $\hat{s}$. Then

$$m_e = \frac{1}{n s} \sum_{i=1}^{n} d_i, \qquad d_i = \sqrt{(x_i - \hat{x}_i)^{T}(x_i - \hat{x}_i)} \tag{7}$$

where $n$ is the number of points and $s$ is the inter-ocular distance. We discard the point on the chin and those at the exterior corners of the eyes, because their locations change greatly from person to person; moreover, the variance of their positions when marked by human operators is greater than for the other points. Because only 17 points are used, we denote the metric $m_{e17}$. In the event of a tracking loss, or if the face cannot be found, the value of $m_{e17}$ for that frame is set to $\infty$. During head turns, the inter-eye distance shrinks with the cosine of the angle; in these frames $s$ is not valid, and it is calculated from its value in previous frames. The hand-marked points, and the software used to ease the marking process, are distributed with the RS-DMV dataset.
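A minimal sketch of the $m_{e17}$ metric of Eq. (7) follows. The indices of the discarded points and of the eye points used for the inter-ocular distance depend on the 20-point BioID layout and are placeholders here, not the paper's actual indices.

```python
import numpy as np

DISCARD = {8, 9, 19}         # hypothetical indices: chin + exterior eye corners
LEFT_EYE, RIGHT_EYE = 0, 1   # hypothetical indices of the eye-center points

def me17(gt, est):
    """Mean point-to-point error of Eq. (7), normalized by the inter-ocular
    distance and averaged over the 17 retained points.
    gt, est: (20, 2) arrays; returns inf on a tracking loss (est is None)."""
    if est is None:
        return np.inf
    s = np.linalg.norm(gt[LEFT_EYE] - gt[RIGHT_EYE])   # inter-ocular distance
    keep = [i for i in range(len(gt)) if i not in DISCARD]
    d = np.linalg.norm(gt[keep] - est[keep], axis=1)
    return float(d.mean() / s)
```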
4.3. Results

We tested the performance of the R-SMAT approach on the RS-DMV dataset, as well as that of SMAT. We compared these results with those obtained by STASM, using the implementation of [2]. One of the most remarkable problems of (R-)SMAT is that it needs to be properly initialized, and the first frames of the sequence are key to building a good model. We propose using STASM to initialize (R-)SMAT in the first frame. STASM has been shown to be very accurate when the face is frontal. Nonetheless, a slightly incorrect initialization will make (R-)SMAT track the (slightly) erroneous points. To decouple this error from the evaluation of the accuracy of (R-)SMAT, in these tests the shape was initialized in the first frame with positions from the ground-truth data. At the end of this section, the performance of R-SMAT with automatic initialization is evaluated.

First, a comparison of the shape models is presented. With the best shape model, the original clustering algorithm and the proposed alternative are then evaluated. Results are presented separately for outdoor and simulator sequences, as each has specific characteristics of its own.

The incremental shape model of SMAT was found to produce much higher error than the pre-learned model. Fig. 5 shows the cumulative error distribution of the incremental shape model (on-line) and of the robust pre-learned model using the Huber function (robust). For comparison purposes, the figure also shows the performance of the pre-learned shape model fitted using the L2 norm (non-robust). All models use leaderP clustering and patches of 15x15 pixels. Clear improvements in performance are obtained by the change to a pre-learned model with robust fitting. The robust, pre-learned shape model is most important in the first frames, because it gives the model greater certainty that the patches being included correspond to correct positions. The robust shape model is used in the rest of the experiments in this paper. Fig. 6 shows the plot of the $m_{e17}$ distance of both models over a sequence. A clear example of the benefits of the robust model is depicted in Fig. 7: the online model diverges as soon as a few points are occluded by the hand, while the robust model keeps track of the face. The method is also able to keep track of the face during head rotations, although with increased fitting error. This is quite remarkable for a model that has only been trained with fully frontal faces.

(Fig. 6. $m_{e17}$ error for a sequence.)
(Fig. 7. Samples of Type A sequence #1. Outlier points are drawn in red.)

Fig. 8 shows the performance of the original SMAT clustering compared with the proposed leaderP clustering algorithm, as implemented in R-SMAT. R-SMAT presents much better performance than the original SMAT clustering. This is especially clear in Fig. 8(b). We stated in Section 3.1 that the original clustering method could lead to overfitting, and Type B sequences are especially prone to this: the patches are usually dark and do not change much from frame to frame, and the subject does not move frequently. When a movement takes place, it leads to high error values, because the model has problems finding the features.

Table 1. Track losses for different clustering methods.

                 Mean     Maximum             Minimum
  R-SMAT
    Type A       0.99%    2.08% (seq. #7)     0% (seq. #1, #2, #6)
    Type B       0.71%    1.96% (seq. #9)     0% (seq. #10)
  SMAT
    Type A       1.77%    5.03% (seq. #4)     0% (seq. #1, #2, #5)
    Type B       1.03%    2.45% (seq. #9)     0% (seq. #10)

(Fig. 8. Comparison of the performance of clustering algorithms.)
(Fig. 9. Comparison of the performance of STASM and SMAT.)
Head Pose Estimation in Computer Vision: A Survey
Erik Murphy-Chutorian, Student Member, IEEE, and Mohan Manubhai Trivedi, Fellow, IEEE

Abstract: The capacity to estimate the head pose of another person is a common human ability that presents a unique challenge for computer vision systems. Compared to face detection and recognition, which have been the primary focus of face-related vision research, identity-invariant head pose estimation has fewer rigorously evaluated systems or generic solutions. In this paper, we discuss the inherent difficulties in head pose estimation and present an organized survey describing the evolution of the field. Our discussion focuses on the advantages and disadvantages of each approach and spans 90 of the most innovative and characteristic papers that have been published on this topic. We compare these systems by focusing on their ability to estimate coarse and fine head pose, highlighting approaches that are well suited for unconstrained environments.

Index Terms: Head pose estimation, human-computer interfaces, gesture analysis, facial landmarks, face analysis.

1 INTRODUCTION

From an early age, people display the ability to quickly and effortlessly interpret the orientation and movement of a human head, thereby allowing one to infer the intentions of others who are nearby and to comprehend an important nonverbal form of communication. The ease with which one accomplishes this task belies the difficulty of a problem that has challenged computational systems for decades. In a computer vision context, head pose estimation is the process of inferring the orientation of a human head from digital imagery. It requires a series of processing steps to transform a pixel-based representation of a head into a high-level concept of direction. Like other facial vision processing steps, an ideal head pose estimator must demonstrate invariance to a variety of image-changing factors. These factors include physical phenomena like camera distortion, projective geometry, and multisource non-Lambertian lighting, as well as biological appearance, facial expression, and the presence of accessories like glasses and hats.

Although it might seem like an explicit specification of a vision task, head pose estimation has a variety of interpretations. At the coarsest level, head pose estimation applies to algorithms that identify a head in one of a few discrete orientations, e.g., a frontal versus a left/right profile view. At the fine (i.e., granular) level, a head pose estimate might be a continuous angular measurement across multiple degrees of freedom (DOF). A system that estimates only a single DOF, perhaps the left-to-right movement, is still a head pose estimator, as is the more complex approach that estimates the full 3D orientation and position of a head while incorporating additional DOF for the movement of the facial muscles and jaw.

In the context of computer vision, head pose estimation is most commonly interpreted as the ability to infer the orientation of a person's head relative to the view of a camera. More rigorously, head pose estimation is the ability to infer the orientation of a head relative to a global coordinate system, but this subtle difference requires knowledge of the intrinsic camera parameters to undo the perceptual bias from perspective distortion. The range of head motion for an average adult male encompasses a sagittal flexion and extension (i.e., forward-to-backward movement of the neck) from -60.4° to 69.6°, a frontal lateral bending (i.e., right-to-left bending of the neck) from -40.9° to 36.3°, and a horizontal axial rotation (i.e., right-to-left rotation of the head) from -79.8° to 75.3° [26]. The combination of muscular rotation and relative orientation is an often overlooked ambiguity (e.g., a profile view of a head does not look exactly the same when a camera is viewing from the side as when the camera is viewing from the front and the head is turned sideways). Despite this problem, it is often assumed that the human head can be modeled as a disembodied rigid object. Under this assumption, the human head is limited to three DOF in pose, which can be characterized by the pitch, roll, and yaw angles pictured in Fig. 1.

(Fig. 1. The three degrees of freedom of a human head can be described by the egocentric rotation angles pitch, roll, and yaw.)
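Under the rigid-head assumption, the three angles determine a rotation matrix. The sketch below builds it with a common Z-Y-X (roll, yaw, pitch) composition; the axis conventions and composition order vary between papers and are an assumption here, not something the survey prescribes.

```python
import numpy as np

def head_rotation(pitch, yaw, roll):
    """Rotation matrix for a rigid head, angles in radians.
    Assumed convention: x right, y down, z forward; R = Rz(roll) @ Ry(yaw) @ Rx(pitch)."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll
    return Rz @ Ry @ Rx

# Where the "nose direction" (0, 0, 1) points after a 30° yaw and 10° pitch:
print(head_rotation(np.radians(10), np.radians(30), 0.0) @ np.array([0.0, 0.0, 1.0]))
```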
Head pose estimation is intrinsically linked with visual gaze estimation, i.e., the ability to characterize the direction and focus of a person's eyes. By itself, head pose provides a coarse indication of gaze that can be estimated in situations where the eyes of a person are not visible (as in low-resolution imagery, or in the presence of eye-occluding objects like sunglasses). When the eyes are visible, head pose becomes a requirement for accurately predicting gaze direction. Physiological investigations have demonstrated that a person's prediction of gaze comes from a combination of both head pose and eye direction [59]. By digitally composing images of specific eye directions onto different head orientations, the authors established that an observer's interpretation of gaze is skewed in the direction of the target's head. A graphic example of this effect is the 19th-century drawing shown in Fig. 2 [134]. In this sketch, two views of a head are presented at different orientations, but the eyes are drawn in an identical configuration in both. Glancing at this image, it is quite clear that the perceived direction of gaze is highly influenced by the pose of the head. If the head is removed entirely and only the eyes remain, the perceived direction is similar to that of the frontal head configuration.

(Fig. 2. Wollaston illusion: Although the eyes are the same in both images, the perceived gaze direction is dictated by the orientation of the head [134].)

Based on this observation, and on our belief that the human gaze estimation ability is appropriately processing the visual information, we hypothesize that a passive camera sensor with no prior knowledge of the lighting conditions has insufficient information to accurately estimate the orientation of an eye without knowledge of the head orientation as well. To support this statement, consider the proportions of visible sclera (i.e., the white area) surrounding the eye. The high contrast between the sclera and the iris is discernible from a distance, and might have evolved to facilitate gaze perception [54]. An eye direction model that uses this scleral-iris cue would need a head pose estimate to interpret the gaze direction, since any head movement introduces a gaze shift that does not affect the visible sclera. Consequently, to computationally estimate human gaze in any configuration, an eye tracker should be supplemented with a head pose estimation system.

This paper presents a survey of the head pose estimation methods and systems that have been published over the past 14 years. The work is organized by common themes and trends, paired with a discussion of the advantages and disadvantages inherent in each approach. Previous literature surveys have considered general human motion [76], [77], face detection [41], [143], face recognition [149], and affect recognition [25]. In this paper, we present a similar treatment for head pose estimation.

The remainder of this paper is structured as follows: Section 2 describes the motivation for head pose estimation approaches. Section 3 contains an organized survey of head pose estimation approaches. Section 4 discusses the ground truth tools and data sets that are available for evaluation, and compares the systems described in our survey based on published results and general applicability. Section 5 presents a summary and concluding remarks.

2 MOTIVATION

People use the orientation of their heads to convey rich, interpersonal information. For example, a person will point the direction of his head to indicate who is the intended target of a conversation. Similarly, in a dialogue, head direction is a nonverbal communique that cues a listener when to switch roles and begin speaking. There is important meaning in the movement of the head as a form of gesturing in a conversation. People nod to indicate that they understand what is being said, and they use additional gestures to indicate dissent, confusion, consideration, and agreement. Exaggerated head movements are synonymous with pointing a finger, and they are a conventional way of directing someone to observe a particular location.

In addition to the information that is implied by deliberate head gestures, there is much that can be inferred by observing a person's head. For instance, quick head movements may be a sign of surprise or alarm. In people, these commonly trigger reflexive responses from an observer, which are very difficult to ignore even in the presence of contradicting auditory stimuli [58]. Other important observations can be made by establishing the visual focus of attention from a head pose estimate. If two people focus their visual attention on each other, sometimes referred to as mutual gaze, this is often a sign that the two are engaged in a discussion. Mutual gaze can also be used as a sign of awareness, e.g., a pedestrian will wait for a stopped automobile driver to look at him before stepping into a crosswalk. Observing a person's head direction can also provide information about the environment. If a person shifts his head toward a specific direction, there is a high likelihood that it is in the direction of an object of interest. Children as young as six months exploit this property, known as gaze following, by looking toward the line of sight of a caregiver as a saliency filter for the environment [79]. Like speech recognition, which has already become entwined in many widely available technologies, head pose estimation will likely become an off-the-shelf tool to bridge the gap between humans and computers.

3 HEAD POSE ESTIMATION METHODS

It is both a challenge and our aspiration to organize all of the varied methods for head pose estimation into a single ubiquitous taxonomy. One approach we had considered was a functional taxonomy that organized each method by its operating domain. This approach would have separated methods that require stereo depth information from systems that require only monocular video. Similarly, it would have segregated approaches that require a near-field view of a person's head from those that can adapt to the low resolution of a far-field view. Another important consideration is the degree of automation that is provided by each system. Some systems estimate head pose automatically, while others assume challenging prerequisites, such as locations of facial features that must be known in advance. It is not always clear whether these requirements can be satisfied precisely with the vision algorithms that are available today.

Instead of a functional taxonomy, we have arranged each system by the fundamental approach that underlies its implementation. This organization allows us to discuss the evolution of different techniques, and it allows us to avoid ambiguities that arise when approaches are extended beyond their original functional boundaries. Our evolutionary taxonomy consists of the following eight categories that describe the conceptual approaches that have been used to estimate head pose:

- Appearance template methods compare a new image of a head to a set of exemplars (each labeled with a discrete pose) in order to find the most similar view.
- Detector array methods train a series of head detectors, each attuned to a specific pose, and assign a discrete pose to the detector with the greatest support.
- Nonlinear regression methods use nonlinear regression tools to develop a functional mapping from the image or feature data to a head pose measurement.
- Manifold embedding methods seek low-dimensional manifolds that model the continuous variation in head pose. New images can be embedded into these manifolds and then used for embedded template matching or regression.
- Flexible models fit a nonrigid model to the facial structure of each individual in the image plane. Head pose is estimated from feature-level comparisons or from the instantiation of the model parameters.
- Geometric methods use the location of features such as the eyes, mouth, and nose tip to determine pose from their relative configuration.
- Tracking methods recover the global pose change of the head from the observed movement between video frames.
- Hybrid methods combine one or more of these aforementioned methods to overcome the limitations inherent in any single approach.

Table 1 provides a list of representative systems for each of these categories. In this section, each category is described in detail. Comments are provided on the functional requirements of each approach and on the advantages and disadvantages of each design choice.

(Table 1. Taxonomy of Head Pose Estimation Approaches.)

3.1 Appearance Template Methods

Appearance template methods use image-based comparison metrics to match a view of a person's head to a set of exemplars with corresponding pose labels. In the simplest implementation, the queried image is given the same pose that is assigned to the most similar of these templates. An illustration is presented in Fig. 3. Some characteristic examples include the use of normalized cross-correlation at multiple image resolutions [7] and mean squared error (MSE) over a sliding window [91].

(Fig. 3. Appearance template methods compare a new head view to a set of training examples (each labeled with a discrete pose) and find the most similar view.)

Appearance templates have some advantages over more complicated methods. The templates can be expanded to a larger set at any time, allowing systems to adapt to changing conditions. Furthermore, appearance templates do not require negative training examples or facial feature points; creating a corpus of training data requires only cropping head images and providing head pose annotations. Appearance templates are also well suited for both high- and low-resolution imagery.

There are many disadvantages to appearance templates. Without the use of some interpolation method, they are only capable of estimating discrete pose locations. They typically assume that the head region has already been detected and localized, and localization error can degrade the accuracy of the head pose estimate. They can also suffer from efficiency concerns, since as more templates are added to the exemplar set, more computationally expensive image comparisons need to be computed. One proposed solution to these last two problems was to train a set of Support Vector Machines (SVMs) to detect and localize the face, and to subsequently use the support vectors as appearance templates to estimate head pose [88], [89].

Despite those limitations, the most significant problem with appearance templates is that they operate under the faulty assumption that pairwise similarity in the image space can be equated to similarity in pose. Consider two images of the same person in slightly different poses, and two images of different people in the same pose. In this scenario, the effect of identity can cause more dissimilarity in the image than a change in pose, and template matching would improperly associate the image with the incorrect pose. Although this effect would likely lessen for widely varying poses, there is still no guarantee that pairwise similarity corresponds to similarity in the pose domain (e.g., a right-profile image of a face may be more similar to a left-profile image than to a frontal view). Thus, even with a homogeneous set of discrete poses for every individual, errors in template comparison can lead to highly erroneous pose estimates.

To decrease the effect of the pairwise similarity problem, many approaches have experimented with various distance metrics and image transformations that reduce the head pose estimation error. For example, the images can be convolved with a Laplacian-of-Gaussian filter [32] to emphasize some of the more common facial contours while removing some of the identity-specific texture variation across different individuals. Similarly, the images can be convolved with a complex Gabor wavelet to emphasize directed features, such as the vertical lines of the nose and the horizontal orientation of the mouth [110], [111]. The magnitude of this complex convolution also provides some invariance to shift, which can conceivably reduce the appearance errors that arise from the variance in facial feature locations between people.
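As a concrete sketch of the simplest appearance-template estimator described above: normalize the query crop and each exemplar, score them by normalized cross-correlation, and return the pose label of the best match. The exemplar set, crop size, and yaw labels are placeholders, not data from any cited system.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equal-size grayscale patches."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float(np.mean(a * b))

def template_pose(query, templates, poses):
    """Assign the query head crop the pose label of its most similar exemplar."""
    scores = [ncc(query, t) for t in templates]
    return poses[int(np.argmax(scores))]

# Toy exemplar set: three 64x64 crops labeled with discrete yaw angles.
templates = [np.random.rand(64, 64) for _ in range(3)]
poses = [-45, 0, 45]   # degrees of yaw
print(template_pose(templates[1] + 0.05 * np.random.rand(64, 64), templates, poses))
```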
3.2 Detector Arrays

Many methods have been introduced in the last decade for frontal face detection [97], [104], [126]. Given the success of these approaches, it seems a natural extension to estimate head pose by training multiple face detectors, each specific to a different discrete pose. Fig. 4 illustrates this approach. For arrays of binary classifiers, successfully detecting the face specifies the pose of the head, assuming that no two classifiers are in disagreement. For detectors with continuous output, pose can be estimated by the detector with the greatest support. Detector arrays are similar to appearance templates in that they operate directly on an image patch; instead of comparing an image to a large set of individual templates, however, the image is evaluated by a detector trained on many images with a supervised learning algorithm. An early example of a detector array used three SVMs for three discrete yaws [47]. A more recent system trained five FloatBoost classifiers operating in a far-field multicamera setting [146].

(Fig. 4. Detector arrays are comprised of a series of head detectors, each attuned to a specific pose, and a discrete pose is assigned to the detector with the greatest support.)

An advantage of detector array methods is that a separate head detection and localization step is not required, since each detector is also capable of making the distinction between head and nonhead. Simultaneous detection and pose estimation can be performed by applying the detector to many subregions of the image. Another improvement is that, unlike appearance templates, detector arrays employ training algorithms that learn to ignore the appearance variation that does not correspond to pose change. Detector arrays are also well suited for high- and low-resolution images.

Disadvantages to detector array methods also exist. It is burdensome to train many detectors, one for each discrete pose. For a detector array to function as a head detector and pose estimator, it must also be trained on many negative nonface examples, which requires substantially more training data. In addition, there may be systematic problems that arise as the number of detectors is increased. If two detectors are attuned to very similar poses, the images that are positive training examples for one must be negative training examples for the other, and it is not clear whether prominent detection approaches can learn a successful model when the positive and negative examples are very similar in appearance. Indeed, in practice, these systems have been limited to one DOF and fewer than 12 detectors. Furthermore, since the majority of the detectors have binary outputs, there is no way to derive a reliable continuous estimate from the result, allowing only coarse head pose estimates and creating ambiguities when multiple detectors simultaneously classify an image as positive. Finally, the computational requirements increase linearly with the number of detectors, making it difficult to implement a real-time system with a large array. As a solution to this last problem, it has been suggested that a router classifier can be used to pick a single subsequent detector to use for pose estimation [103]. In this case, the router effectively determines pose (i.e., assuming this is a face, what is its pose?), and the subsequent detector confirms the choice (i.e., is this a face in the pose specified by the router?). Although this technique sounds promising in theory, it should be noted that it has not been demonstrated for pitch or yaw change, but rather only for rotation in the camera plane, first using neural-network-based face detectors [103] and later with cascaded AdaBoost detectors [53].
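A minimal sketch of the detector-array decision rule: run each pose-specific detector on the same crop and keep the pose whose detector responds most strongly, treating the crop as a non-face when no detector fires. The detectors here are stand-in random scoring functions, not trained classifiers; in a real system each would be an SVM or boosted cascade as cited above.

```python
import numpy as np

# Stand-in pose-specific detectors: each maps a 64x64 crop to a real-valued
# score, intended to be positive when the crop looks like a head at that pose.
def make_detector(seed):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(64 * 64)
    return lambda crop: float(w @ crop.ravel() / crop.size)

DETECTORS = {yaw: make_detector(i) for i, yaw in enumerate((-45, 0, 45))}

def array_estimate(crop, threshold=0.0):
    """Return (pose, score) for the strongest detector, or None when no
    detector exceeds the face/non-face threshold."""
    yaw, score = max(((y, det(crop)) for y, det in DETECTORS.items()),
                     key=lambda t: t[1])
    return (yaw, score) if score > threshold else None

print(array_estimate(np.random.rand(64, 64)))
```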
The perceptron can be trained by backpropagation,which is a supervised learning procedure that propagates the error backward through each of the hidden layers in the network to update the weights and biases.Head pose can be estimated from cropped images of a head using an MLP in different configurations.For instance,the output nodes can correspond to discretized poses,in which case a Gaussian kernel can be used to smooth the training poses to account for the similarity of nearby poses[10],[106], [148].Although this approach works,it is similar to detector arrays and appearance templates in that it provides only a coarse estimate of pose at discrete locations.An MLP can also be trained for fine head pose estimation over a continuous pose range.In this configuration,the network has one output for each DOF and the activation of the output is proportional to its corresponding orientation [108],[115],[117],[129],[130].Alternatively,a set of MLP networks with a single output node can be trained individually for each DOF.This approach has been used for heads viewed from multiple far-field cameras in indoor environments,using background subtraction or a color filter to detect the facial region and Bayesian filtering to fuse and smooth the estimates of each individual camera[120], [128],[129],[130].A locally linear map(LLM)is another popular neural network consisting of many linear maps[100].To build the network,the input data is compared to a centroid sample for each map and used to learn a weight matrix.Head poseMURPHY-CHUTORIAN ANDTRIVEDI:HEAD POSE ESTIMATION IN COMPUTER VISION:A SURVEY611Fig.5.Nonlinear regression provides a functional mapping from theimage or feature data to a head pose measurement.estimation requires a nearest-neighbor search for the closest centroid,followed by linear regression with the correspond-ing map.This approach can be extended with difference vectors and dimensionality reduction[11]as well as decomposition with Gabor wavelets[56].As was mentioned earlier with SVR,it is possible to train a neural network using data from facial feature locations. This approach has been evaluated with associative neural networks[33],[34].The advantages of neural network approaches are numerous.These systems are very fast,only require cropped labeled faces for training,work well in near-field and far-field imagery,and give some of the most accurate head pose estimates in practice(see Section4).The main disadvantage to these methods is that they are to prone to error from poor head localization.As a suggested solution,a convolutional network[62]that extends the MLP by explicitly modeling some shift,scale, and distortion invariance can be used to reduce this source of error[95],[96].3.4Manifold Embedding MethodsAlthough an image of a head can be considered a data sample in high-dimensional space,there are inherently many fewer dimensions in which pose can vary(Fig.6). 
With a rigid model of the head, this can be as few as three dimensions for orientation and three for position. Therefore, it is possible to consider that each high-dimensional image sample lies on a low-dimensional continuous manifold constrained by the allowable pose variations. For head pose estimation, the manifold must be modeled and an embedding technique is required to project a new sample into the manifold. This low-dimensional embedding can then be used for head pose estimation with techniques such as regression in the embedded space or embedded template matching. Any dimensionality reduction algorithm can be considered an attempt at manifold embedding, but the challenge lies in creating an algorithm that successfully recovers head pose while ignoring other sources of image variation. Two of the most popular dimensionality reduction techniques, PCA and its nonlinear kernelized version, KPCA, discover the primary modes of variation from a set of data samples [23]. Head pose can be estimated with PCA, e.g., by projecting an image into a PCA subspace and comparing the results to a set of embedded templates [75]. It has been shown that similarity in this low-dimensional space is more likely to correlate with pose similarity than appearance template matching with Gabor wavelet preprocessing [110], [111]. Nevertheless, PCA and KPCA are inferior techniques for head pose estimation [136]. Besides the linear limitations of standard PCA that cannot adequately represent the nonlinear image variations caused by pose change, these approaches are unsupervised techniques that do not incorporate the pose labels that are usually available during training. As a result, there is no guarantee that the primary components will relate to pose variation rather than to appearance variation. Probably, they will correspond to both. To mitigate these problems, the appearance information can be decoupled from the pose by splitting the training data into groups that each share the same discrete head pose. Then, PCA and KPCA can be applied to generate a separate projection matrix for each group. These pose-specific eigenspaces, or pose-eigenspaces, each represent the primary modes of appearance variation and provide a decomposition that is independent of the pose variation. Head pose can be estimated by normalizing the image and projecting it into each of the pose-eigenspaces, thus finding the pose with the highest projection energy [114]. Alternatively, the embedded samples can be used as the input to a set of classifiers, such as multiclass SVMs [63]. As testament to the limitations of KPCA, however, it has been shown that, by skipping the KPCA projection altogether and using local Gabor binary patterns, one can greatly improve pose estimation with a set of multiclass SVMs [69].
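The pose-eigenspace scheme lends itself to a compact sketch: fit one PCA subspace per discrete pose group and pick the pose whose subspace captures the most projection energy. The following NumPy sketch is a hedged illustration of that idea, not the implementation of [114]; the group contents, component count, and image sizes are assumed.

import numpy as np

def fit_pose_eigenspaces(groups, n_components=20):
    """Fit one PCA subspace per discrete pose group.

    groups: dict mapping pose label -> (n_samples, n_pixels) array.
    Returns dict mapping pose label -> (mean, basis) with orthonormal rows.
    """
    spaces = {}
    for pose, X in groups.items():
        mean = X.mean(axis=0)
        # SVD of the centered data; rows of Vt are the principal directions.
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        spaces[pose] = (mean, Vt[:n_components])
    return spaces

def estimate_pose(image, spaces):
    """Return the pose whose subspace captures the most projection energy."""
    best_pose, best_energy = None, -np.inf
    for pose, (mean, basis) in spaces.items():
        coeffs = basis @ (image - mean)      # project into the pose-eigenspace
        energy = float(np.sum(coeffs ** 2))  # projection energy
        if energy > best_energy:
            best_pose, best_energy = pose, energy
    return best_pose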
Pose-eigenspaces have an unfortunate side effect. The ability to estimate fine head pose is lost since, like detector arrays, the estimate is derived from a discrete set of measurements. If only coarse head pose estimation is desired, it is better to use multiclass linear discriminant analysis (LDA) or the kernelized version, KLDA [23], since these techniques can be used to find the modes of variation in the data that best account for the differences between discrete pose classes [15], [136]. Other manifold embedding approaches have shown more promise for head pose estimation. These include Isometric feature mapping (Isomap) [101], [119], Locally Linear Embedding (LLE) [102], and Laplacian Eigenmaps (LE) [6]. To estimate head pose with any of these techniques, there must be a procedure to embed a new data sample into an existing manifold. Raytchev et al. [101] described such a procedure for an Isomap manifold, but, for out-of-sample embedding in an LLE and LE manifold, there has been no explicit solution. For these approaches, a new sample must be embedded with an approximate technique, such as a Generalized Regression Neural Network [4].

Fig. 6. Manifold embedding methods try to directly project a processed image onto the head pose manifold using linear and nonlinear subspace techniques.
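As a concrete illustration of the embedding pipeline, the sketch below uses scikit-learn's Isomap, which exposes an out-of-sample transform, together with a k-nearest-neighbor regressor in the embedded space; the neighbor counts, the three-dimensional embedding, and the random stand-in data are illustrative assumptions, not choices prescribed by the surveyed papers.

import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsRegressor

# Stand-in training data: 200 flattened head crops with known yaw angles.
X_train = np.random.rand(200, 32 * 32)
yaw_train = np.random.uniform(-90.0, 90.0, size=200)

# Learn a low-dimensional manifold from the image samples.
embedder = Isomap(n_neighbors=10, n_components=3)
Z_train = embedder.fit_transform(X_train)

# Regress head pose in the embedded space.
regressor = KNeighborsRegressor(n_neighbors=5).fit(Z_train, yaw_train)

# A new sample is first projected into the manifold, then mapped to a pose.
x_new = np.random.rand(1, 32 * 32)
yaw_estimate = regressor.predict(embedder.transform(x_new))
print(yaw_estimate)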
ACM International Collegiate Programming Contest 2008
East Central Regional Contest
McMaster University
University of Cincinnati
University of Michigan–Ann Arbor
Youngstown State University
November 1, 2008
Sponsored by IBM
Rules:

Problem A: Bordering on Madness

Bob Roberts owns a design business which creates custom artwork for various corporations. One technique that his company likes to use is to take a simple rectilinear figure (a figure where all sides meet at 90 or 270 degrees and which contains no holes) and draw one or more rectilinear borders around them. Each of these borders is drawn so that it is a set distance d away from the previously drawn border (or the original figure if it is the first border) and then the new area outlined by each border is painted a unique color. Some examples are shown below (without the coloring of the borders). The example on the left shows a simple rectilinear figure (grey) with two borders drawn around it. The one on the right is a more complicated figure; note that the border may become disconnected.

These pieces of art can get quite large, so Bob would like a program which can draw prototypes of the finished pieces in order to judge how aesthetically pleasing they are (and how much money they will cost to build). To simplify things, Bob never starts with a figure that results in a border where 2 horizontal (or vertical) sections intersect, even at a point. This disallows such cases as those shown below:

Input
Input will consist of multiple test cases. The first line of the input file will contain a single integer indicating the number of test cases. Each test case will consist of two or more lines. The first will contain three positive integers n, m and d indicating the number of sides of the rectilinear figure, the number of borders to draw, and the distance between each border, where n ≤ 100 and m ≤ 20.
The remaining lines will contain the n vertices of the figure, each represented by two positive integers indicating the x and y coordinates. The vertices will be listed in clockwise order starting with the vertex with the largest y value and (among those vertices) the smallest x value.

Output
For each test case, output three lines: the first will list the case number (as shown in the examples), the second will contain m integers indicating the length of each border, and the third will contain m integers indicating the additional area contributed to the artwork by each border. Both of these sets of numbers should be listed in order, starting from the border nearest the original figure. Lines two and three should be indented two spaces and labeled as shown in the examples. Separate test cases with a blank line.

Sample Input
2
6 2 10
20 30
100 30
100 0
0 0
0 10
20 10
10 1 7
20 50
70 50
70 0
0 0
0 30
20 30
20 10
60 10
60 40
20 40

Sample Output
Case 1:
  Perimeters: 340 420
  Areas: 3000 3800

Case 2:
  Perimeters: 380
  Areas: 2660

Problem B: Jack of All Trades

Jack Barter is a wheeler-dealer of the highest sort. He'll trade anything for anything, as long as he gets a good deal. Recently, he wanted to trade some red agate marbles for some goldfish. Jack's friend Amanda was willing to trade him 1 goldfish for 2 red agate marbles. But Jack did some more digging and found another friend Chuck who was willing to trade him 5 plastic shovels for 3 marbles, while Amanda was willing to trade 1 goldfish for 3 plastic shovels. Jack realized that he could get a better deal going through Chuck (1.8 marbles per goldfish) than by trading his marbles directly to Amanda (2 marbles per goldfish). Jack revels in transactions like these, but he limits the number of other people involved in a chain of transactions to 9 (otherwise things can get a bit out of hand). Normally Jack would use a little program he wrote to do all the necessary calculations to find the optimal deal, but he recently traded away his computer for a fine set of ivory-handled toothpicks. So Jack needs your help.

Input
Input will consist of multiple test cases. The first line of the file will contain an integer n indicating the number of test cases in the file. Each test case will start with a line containing two strings and a positive integer m ≤ 50. The first string denotes the items that Jack wants, and the second string identifies the items Jack is willing to trade. After this will be m lines of the form

a1 name1 a2 name2

indicating that some friend of Jack's is willing to trade an amount a1 of item name1 for an amount a2 of item name2. (Note this does not imply the friend is also willing to trade a2 of item name2 for a1 of item name1.) The values of a1 and a2 will be positive and ≤ 20. No person will ever need more than 2^31 − 1 items to complete a successful trade.

Output
For each test case, output the phrase Case i: (where i is the case number starting at 1) followed by the best possible ratio that Jack can obtain. Output the ratio using 5 significant digits, rounded. Follow this by a single space and then the number of ways that Jack could obtain this ratio.

Sample Input
2
goldfish marbles 3
1 goldfish 2 marbles
5 shovels 3 marbles
1 goldfish 3 shovels
this that 4
7 this 2 that
14 this 4 that
7 this 2 theother
1 theother 1 that

Sample Output
Case 1: 1.8000 1
Case 2: 0.28571 3

Problem C: LCR

LCR is a simple game for three or more players. Each player starts with three chips and the object is to be the last person to have any chips. Starting with Player 1, each person rolls a set of three dice. Each die has six faces, one face with an L, one with a C, one with an R and three with a dot. For each L rolled, the player must pass a chip to the player
on their left (Player 2 is considered to be to the left of Player 1); for each R rolled, the player passes a chip to the player on their right; and for each C rolled, the player puts a chip in a central pile which belongs to no player. No action is taken for any dot that is rolled. Play continues until only one player has any chips left. In addition, the following rules apply:

1. A player with no chips is not out of the game, since they may later gain chips based on other players' rolls.
2. A player with only 1 or 2 chips left only rolls 1 or 2 dice, respectively. A player with no chips left does not roll but just passes the dice to the next player.

Your job is to simulate this game given a sequence of dice rolls.

Input
Input will consist of multiple test cases. Each test case will consist of one line containing an integer n (indicating the number of players in the game) and a string (specifying the dice rolls). There will be at most 10 players in any game, and the string will consist only of the characters 'L', 'C', 'R' and '.'. In some test cases, there may be more dice rolls than are needed (i.e., some player wins the game before you use all the dice rolls). If there are not enough dice rolls left to complete a turn (for example, only two dice rolls are left for a player with 3 or more chips) then those dice rolls should be ignored. A value of n = 0 will indicate end of input.

Output
For each test case, output the phrase "Game i:" on a single line (where i is the case number starting at 1) followed by a description of the state of the game. This description will consist of n + 1 lines of the form

Player 1:c1
Player 2:c2
...
Player n:cn
Center:ct

where c1, ... are the number of chips each player has at the time the simulation ended (either because some player has won or there are no more remaining dice rolls) and ct is the number of chips in the center pile. In addition, if some player has won, you should append the string "(W)" after their chip count; otherwise you should append the string "(*)" after the chip count of the player who is the next to roll. The only blank on any line should come before the game number or the player number. Use a single blank line to separate test cases.

Sample Input
3 R.L.RLLLCLR.LL..R...CLR.
5 RL....C.L
0

Sample Output
Game 1:
Player 1:0
Player 2:0
Player 3:6(W)
Center:3

Game 2:
Player 1:1
Player 2:4
Player 3:1
Player 4:4(*)
Player 5:4
Center:1

Problem D: Party Party Party

Emma has just graduated high school and it is the custom for the new graduates to throw parties for themselves and invite everyone in school to attend. Naturally, Emma wishes to attend as many parties as possible. This is not such a problem on a weekday since usually there are only two or three parties in the evening. But, Saturdays are packed! Typically some parties start at 8 AM (breakfast is served) while others might end at midnight (much to the annoyance of the neighbors). Emma naturally wants to know how many parties she can attend.

Each party has a starting and stopping time, which are on the hour. These are listed via a 24-hour clock. For example, a party might start at 10 AM (10) and end at 2 PM (14). The earliest a party can start is 8 AM (8) and the latest it can end is midnight (24). In order not to be rude, Emma stays at each party at least one half hour and will consider traveling time between parties to be instantaneous. If there are times during the day when there are no parties to attend, she'll simply go home and rest.
Input
There will be multiple test cases. Each test case starts with a line containing an integer p (≤ 100) indicating the number of parties on the given day. (A value of p = 0 indicates end of input.) The following p lines are each of the form s e, both integers, where 8 ≤ s < e ≤ 24, indicating a party that starts at time s and ends at time e. Note there may be multiple parties with the same starting and ending time.

Output
For each input set output a line of the form

On day d Emma can attend as many as n parties.

where you determine the value of n and d is the number of the test case starting at 1.

Sample Input
8
12 13
13 14
12 13
9 10
9 10
12 13
12 14
9 11
3
14 15
14 15
14 15
0

Sample Output
On day 1 Emma can attend as many as 7 parties.
On day 2 Emma can attend as many as 2 parties.

Problem E: Su-Su-Sudoku

By now, everyone has played Sudoku: you're given a 9-by-9 grid of boxes which you are to fill in with the digits 1 through 9 so that 1) every row has all nine digits, 2) every column has all nine digits, and 3) all nine 3-by-3 subgrids have all nine digits. To start the game you are given a partially completed grid and are asked to fill in the remainder of the boxes. One such puzzle is shown below. 415676985972453919853752472398187286231

In this problem, you will be given Sudoku grids which you have nearly completed; indeed you've filled in every box except five. You are asked to complete the grid, or determine that it's impossible. (You might have already made an error!)

Input
The first line of input will contain a positive integer indicating the number of test cases to follow. Each test case will be a nearly completed Sudoku grid consisting of 9 lines, each containing 9 characters from the set of digits 0 through 9. There will be exactly five 0's in each test case, indicating the five unfilled boxes.

Output
Output for each test case should be either

Could not complete this grid.

if it is impossible to complete the grid according to the rules of the game, or the completed grid, in the form given below. (There are no blank spaces in the output.) If there is a way to complete the grid, it will be unique. Separate test cases with a blank line.

Sample Input
2
481253697
267948105
539671204
654389712
908704563
173562849
702136958
315897426
896425371
481253697
267948105
539671284
654289710
908704562
173562849
702136958
315897426
896425371

Sample Output
481253697
267948135
539671284
654389712
928714563
173562849
742136958
315897426
896425371

Could not complete this grid.

Problem F: Tanks a Lot

Imagine you have a car with a very large gas tank - large enough to hold whatever amount you need. You are traveling on a circular route on which there are a number of gas stations. The total gas in all the stations is exactly the amount it takes to travel around the circuit once. When you arrive at a gas station, you add all of that station's gas to your tank. Starting with an empty tank, it turns out there is at least one station to start, and a direction (clockwise or counter-clockwise) where you can make it around the circuit. (On the way home, you might ponder why this is the case - but trust us, it is.) Given the distance around the circuit, the locations of the gas stations, and the number of miles your car could go using just the gas at each station, find all the stations and directions you can start at and make it around the circuit.

Input
There will be a sequence of test cases. Each test case begins with a line containing two positive integers c and s, representing the total circumference, in miles, of the circle and the total number of gas stations.
Following this are s pairs of integers t and m. In each pair, t is an integer between 0 and c − 1 measuring the clockwise location (from some arbitrary fixed point on the circle) around the circumference of one of the gas stations, and m is the number of miles that can be driven using all of the gas at the station. All of the locations are distinct and the maximum value of c is 100,000. The last test case is followed by a pair of 0's.

Output
For each test case, print the test case number (in the format shown in the example below) followed by a list of pairs of values in the form i d, where i is the gas station location and d is either C, CC, or CCC, indicating that, when starting with an empty tank, it is possible to drive from location i around in a clockwise (C) direction, counterclockwise (CC) direction, or either direction (CCC), returning to location i. List the stations in order of increasing location.

Sample Input
10 4
2 3
4 3
6 1
9 3
5 5
0 1
4 1
2 1
3 1
1 1
0 0

Sample Output
Case 1: 2 C 4 CC 9 C
Case 2: 0 CCC 1 CCC 2 CCC 3 CCC 4 CCC

Problem G: The Worm Turns

Winston the Worm just woke up in a fresh rectangular patch of earth. The rectangular patch is divided into cells, and each cell contains either food or a rock. Winston wanders aimlessly for a while until he gets hungry; then he immediately eats the food in his cell, chooses one of the four directions (north, south, east, or west) and crawls in a straight line for as long as he can see food in the cell in front of him. If he sees a rock directly ahead of him, or sees a cell where he has already eaten the food, or sees an edge of the rectangular patch, he turns left or right and once again travels as far as he can in a straight line, eating food. He never revisits a cell. After some time he reaches a point where he can go no further so Winston stops, burps and takes a nap.

For instance, suppose Winston wakes up in the following patch of earth (X's represent stones, all other cells contain food): 01234X13X

If Winston starts eating in row 0, column 3, he might pursue the following path (numbers represent order of visitation): 0123442X11816620143X2281012

In this case, he chose his path very wisely: every piece of food got eaten. Your task is to help Winston determine where he should begin eating so that his path will visit as many food cells as possible.
Input
Input will consist of multiple test cases. Each test case begins with two positive integers, m and n, defining the number of rows and columns of the patch of earth. Rows and columns are numbered starting at 0, as in the figures above. Following these is a non-negative integer r indicating the number of rocks, followed by a list of 2r integers denoting the row and column number of each rock. The last test case is followed by a pair of zeros. This should not be processed. The value m × n will not exceed 625.

Output
For each test case, print the test case number (beginning with 1), followed by four values:

amount row column direction

where amount is the maximum number of pieces of food that Winston is able to eat, (row, column) is the starting location of a path that enables Winston to consume this much food, and direction is one of E, N, S, W, indicating the initial direction in which Winston starts to move along this path. If there is more than one starting location, choose the one that is lexicographically least in terms of row and column numbers. If there are optimal paths with the same starting location and different starting directions, choose the first valid one in the list E, N, S, W. Assume there is always at least one piece of food adjacent to Winston's initial position.

Sample Input
5 5
3 0 4 3 1 3 2
0 0

Sample Output
Case 1: 22 0 3 W

Problem H: You'll be Working on the Railroad

Congratulations! Your county has just won a state grant to install a rail system between the two largest towns in the county - Acmar and Ibmar. This rail system will be installed in sections, each section connecting two different towns in the county, with the first section starting at Acmar and the last ending at Ibmar. The provisions of the grant specify that the state will pay for the two largest sections of the rail system, and the county will pay for the rest (if the rail system consists of only two sections, the state will pay for just the larger section; if the rail system consists of only one section, the state will pay nothing). The state is no fool and will only consider simple paths; that is, paths where you visit a town no more than once. It is your job, as a recently elected county manager, to determine how to build the rail system so that the county pays as little as possible. You have at your disposal estimates for the cost of connecting various pairs of cities in the county, but you're short one very important requirement - the brains to solve this problem. Fortunately, the lackeys in the computing services division will come up with something.

Input
Input will contain multiple test cases. Each case will start with a line containing a single positive integer n ≤ 50, indicating the number of railway section estimates. (There may not be estimates for tracks between all pairs of towns.) Following this will be n lines each containing one estimate. Each estimate will consist of three integers s e c, where s and e are the starting and ending towns and c is the cost estimate between them. (Acmar will always be town 0 and Ibmar will always be town 1. The remaining towns will be numbered using consecutive numbers.) The costs will be symmetric, i.e., the cost to build a railway section from town s to town e is the same as the cost to go from town e to town s, and costs will always be positive and no greater than 1000. It will always be possible to somehow travel from Acmar to Ibmar by rail using these sections. A value of n = 0 will signal the end of input.

Output
For each test case, output a single line of the form

c1 c2 ... cm cost

where each ci is a city on the cheapest path and cost is the cost to the
county (note c1 will always be 0 and cm will always be 1, and ci and ci+1 are connected on the path). In case of a tie, print the path with the shortest number of sections; if there is still a tie, pick the path that comes first lexicographically.

Sample Input
7
0 2 10
0 3 6
2 4 5
3 4 3
3 5 4
4 1 7
5 1 8
0

Sample Output
0 3 4 1 3
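As a worked illustration of the turn mechanics in Problem C above (chip passing, the reduced dice count, and the ignored incomplete final turn), here is a minimal Python simulation sketch; it is not a full judged solution, since it omits input parsing and the exact output formatting the problem requires.

def play_lcr(n, rolls):
    """Simulate one LCR game with n players (3 chips each) and a fixed roll string.

    Returns (chips, center, winner, next_player); winner and next_player are
    0-based player indices, and exactly one of them is None at game end.
    """
    chips = [3] * n
    center = 0
    pos = 0      # next unused character of the roll string
    player = 0   # whose turn it is (0-based; the problem numbers players from 1)

    while True:
        holders = [i for i in range(n) if chips[i] > 0]
        if len(holders) == 1:                       # last player holding chips wins
            return chips, center, holders[0], None
        dice = min(chips[player], 3)                # 1 or 2 chips -> fewer dice
        if dice > 0:
            if pos + dice > len(rolls):             # incomplete turn: rolls ignored
                return chips, center, None, player
            for r in rolls[pos:pos + dice]:
                if r == 'L':                        # player i+1 sits to the left
                    chips[player] -= 1
                    chips[(player + 1) % n] += 1
                elif r == 'R':                      # player i-1 sits to the right
                    chips[player] -= 1
                    chips[(player - 1) % n] += 1
                elif r == 'C':                      # chip goes to the center pile
                    chips[player] -= 1
                    center += 1
            pos += dice
        player = (player + 1) % n                   # chipless players just pass the dice

# On the second sample game this reproduces the listed final state:
# play_lcr(5, "RL....C.L") -> ([1, 4, 1, 4, 4], 1, None, 3), i.e. Player 4 rolls next.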
Random Forests for Real Time Head Pose Estimation

Dataset description: Fast and reliable algorithms for estimating the head pose are essential for many applications and higher-level face analysis tasks. We address the problem of head pose estimation from depth data, which can be captured using the ever more affordable 3D sensing technologies available today.
Keywords: algorithms, estimation, real time, head pose, high-quality, low-quality
Data format: TEXT

Detailed description:
Random Forests for Real Time Head Pose Estimation
Fast and reliable algorithms for estimating the head pose are essential for many applications and higher-level face analysis tasks. We address the problem of head pose estimation from depth data, which can be captured using the ever more affordable 3D sensing technologies available today. To achieve robustness, we formulate pose estimation as a regression problem. While detecting specific face parts like the nose is sensitive to occlusions, we learn the regression on rather generic face surface patches. We propose to use random regression forests for the task at hand, given their capability to handle large training datasets.

In this page, our research work on head pose estimation is presented, source code is made available, and an annotated database can be downloaded for evaluating other methods trying to tackle the same problem.

Real time head pose estimation from high-quality depth data
In our CVPR paper Real Time Head Pose Estimation with Random Regression Forests, we trained a random regression forest on a very large, synthetically generated face database. In our experiments, we show that our approach can handle real data presenting large pose changes, partial occlusions, and facial expressions, even though it is trained only on synthetic neutral face data. We have thoroughly evaluated our system on a publicly available database on which we achieve state-of-the-art performance without having to resort to the graphics card. The video shows the algorithm running in real time, on a frame by frame basis (no temporal smoothing), using as input high resolution depth images acquired with the range scanner of Weise et al.

CODE
The discriminative random regression forest code used for the DAGM'11 paper is made available for research purposes. Together with the basic head pose estimation code, a demo is provided to run the estimation directly on the stream of depth images coming from a Kinect, using OpenNI. A sample forest is provided which was trained on the Biwi Kinect Head Pose Database. Because the software is an adaptation of the Hough forest code, the same licence applies:

By installing, copying, or otherwise using this Software, you agree to be bound by the terms of the Microsoft Research Shared Source License Agreement (non-commercial use only). If you do not agree, do not install, copy, or use the Software. The Software is protected by copyright and other intellectual property laws and is licensed, not sold.

THE SOFTWARE COMES "AS IS", WITH NO WARRANTIES. THIS MEANS NO EXPRESS, IMPLIED OR STATUTORY WARRANTY, INCLUDING WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, ANY WARRANTY AGAINST INTERFERENCE WITH YOUR ENJOYMENT OF THE SOFTWARE OR ANY WARRANTY OF TITLE OR NON-INFRINGEMENT. THERE IS NO WARRANTY THAT THIS SOFTWARE WILL FULFILL ANY OF YOUR PARTICULAR PURPOSES OR NEEDS.
ALSO, YOU MUST PASS THIS DISCLAIMER ON WHENEVER YOU DISTRIBUTE THE SOFTWARE OR DERIVATIVE WORKS. NEITHER MICROSOFT NOR ANY CONTRIBUTOR TO THE SOFTWARE WILL BE LIABLE FOR ANY DAMAGES RELATED TO THE SOFTWARE OR THIS MSR-SSLA, INCLUDING DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL OR INCIDENTAL DAMAGES, TO THE MAXIMUM EXTENT THE LAW PERMITS, NO MATTER WHAT LEGAL THEORY IT IS BASED ON. ALSO, YOU MUST PASS THIS LIMITATION OF LIABILITY ON WHENEVER YOU DISTRIBUTE THE SOFTWARE OR DERIVATIVE WORKS.

If you do use the code, please acknowledge our papers:

Real Time Head Pose Estimation with Random Regression Forests
@InProceedings{fanelli_CVPR11,
  author = {G. Fanelli and J. Gall and L. Van Gool},
  title = {Real Time Head Pose Estimation with Random Regression Forests},
  booktitle = {Computer Vision and Pattern Recognition (CVPR)},
  year = {2011},
  month = {June},
  pages = {617-624}
}

Real Time Head Pose Estimation from Consumer Depth Cameras
@InProceedings{fanelli_DAGM11,
  author = {G. Fanelli and T. Weise and J. Gall and L. Van Gool},
  title = {Real Time Head Pose Estimation from Consumer Depth Cameras},
  booktitle = {33rd Annual Symposium of the German Association for Pattern Recognition (DAGM'11)},
  year = {2011},
  month = {September}
}

If you have questions concerning the source code, please contact Gabriele Fanelli.

Biwi Kinect Head Pose Database
The database was collected as part of our DAGM'11 paper Real Time Head Pose Estimation from Consumer Depth Cameras. Because cheap consumer devices (e.g., Kinect) acquire low-resolution, noisy depth data, we could not train our algorithm on clean, synthetic images as was done in our previous CVPR work. Instead, we recorded several people sitting in front of a Kinect (at about one meter distance). The subjects were asked to freely turn their head around, trying to span all possible yaw/pitch angles they could perform. To be able to evaluate our real-time head pose estimation system, the sequences were annotated using the automatic system of , i.e., each frame is annotated with the center of the head in 3D and the head rotation angles.

The dataset contains over 15K images of 20 people (6 females and 14 males; 4 people were recorded twice). For each frame, a depth image, the corresponding rgb image (both 640x480 pixels), and the annotation is provided. The head pose range covers about +-75 degrees yaw and +-60 degrees pitch. Ground truth is provided in the form of the 3D location of the head and its rotation angles. Even though our algorithms work on depth images alone, we provide the RGB images as well.

The database is made available for research purposes only. You are required to cite our work whenever publishing anything directly or indirectly using the data:

@InProceedings{fanelli_DAGM11,
  author = {G. Fanelli and T. Weise and J. Gall and L. Van Gool},
  title = {Real Time Head Pose Estimation from Consumer Depth Cameras},
  booktitle = {33rd Annual Symposium of the German Association for Pattern Recognition (DAGM'11)},
  year = {2011},
  month = {September}
}

Files: Data (5.6 GB, .tgz compressed), Readme file, Sample code for reading depth images and ground truth.
If you have questions concerning the data, please contact Gabriele Fanelli.
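For working with the downloaded database, a minimal Python sketch for reading one ground-truth file is given below. It assumes the commonly described layout of the per-frame pose annotation (three lines holding a 3x3 head rotation matrix, then the 3D head-center position in millimeters) and a hypothetical file path; verify both against the dataset's readme before relying on them.

import numpy as np

def read_biwi_pose(path):
    """Parse a Biwi per-frame pose annotation file.

    Assumed layout (check the dataset readme): the first three non-empty
    lines hold a 3x3 rotation matrix, the next non-empty line holds the
    3D head-center position in millimeters.
    """
    values = [[float(v) for v in line.split()]
              for line in open(path) if line.strip()]
    rotation = np.array(values[:3])   # 3x3 head rotation matrix
    center = np.array(values[3])      # head center (x, y, z) in mm
    return rotation, center

# Example with a hypothetical path inside the extracted archive:
# R, c = read_biwi_pose("hpdb/01/frame_00003_pose.txt")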
3D Facial Landmark Localization under Pose and Expression Variations
Liang Yan; Zhang Yun

[Abstract] Landmark localization on 3D facial scans is important for face recognition, tracking, modeling, expression analysis, and so on. However, landmark localization in the presence of large pose and expression variations is still a great challenge. In this paper, a method for 3D facial landmark localization is presented. The method is insensitive to pose and expression. Candidate landmarks are detected using HK curvature analysis. According to the priori knowledge on facial shape, a facial geometrical structure-based classification strategy is proposed to subdivide the candidate landmarks. Landmark localization is obtained by matching candidate landmarks with a facial landmark model (FLM). The landmark localization accuracy of our method is first experimented on the CASIA dataset. Then, our method is compared with the state-of-the-art methods on the UND/FRGC v2.0 dataset. Experimental results confirm that our method achieves high accuracy and robustness both to large pose and expression variations.

[Journal] Control Theory & Applications (《控制理论与应用》)
[Year (volume), issue] 2017, 34(6)
[Pages] 9 (820-828)
[Keywords] facial landmark localization; HK curvature analysis; facial landmark model; classification strategy; mesh segmentation
[Authors] Liang Yan (School of Automation, Guangdong University of Technology, Guangzhou 510006; School of Software, South China Normal University, Foshan 528225); Zhang Yun (School of Automation, Guangdong University of Technology, Guangzhou 510006)
[Language] Chinese
[CLC] TP391

Three-dimensional (3D) facial landmark localization aims at automatically localizing facial feature points on 3D facial scans. It has a large number of applications such as face recognition, head pose estimation, face tracking, face expression analysis, 3D face reconstruction, lip reading, and so on [1]. Reliable facial landmark localization is essential. Despite the fact that a large number of works have focused on facial feature detection in recent years, facial landmark localization remains an open issue [2-4]. It can be well done in the case of neutral frontal facial scans [5-7]. However, in the presence of large pose variations, 3D facial scans suffer from the problem of missing parts due to self-occlusion, which degrades landmark localization performance. Additionally, 3D facial scans are sensitive to expressions. When expression changes, the point positions on the facial surface can exhibit large variations, up to 25 mm of linear displacement around the mouth [8]. Some examples highlighting these issues are illustrated in Fig. 1. Therefore, landmark localization in the presence of large pose and expression variations is still a challenging task.

Bagchi et al. [9] presented an HK curvature-based approach for nose tip and eye corner detection. They detected the feature points through an analysis of the curvature of a 3D facial surface. The method was tested on the FRAV3D database of 752 scans, with pose variations. Although they obtained a localization success rate of 98.80% for the nose tip, they only obtained a localization success rate of 72.20% for eye corners. It indicates that the curvature alone is not sufficiently robust for detecting eye corners. An HK curvature-based method was also mentioned in [11] to detect nose tip and eye corners. The method was assessed on the CASIA 3D dataset [10]. It achieved a localization success rate of 99.3% for the nose tip and 98.2% for eye corners. But the authors did not define what a correct detection is.
Yang et al. [12] proposed an automatic nose tip location method to deal with the problem of pose variations. They believed that the depth value of the nose tip was always the largest when a 3D facial scan was projected onto the corrected pose direction. Based on this assumption, they detected candidate nose tip points by directional maximum. A similar approach was also mentioned in [13]. However, in the presence of large pose variations, e.g. rotation along the X-axis direction, this assumption was not always correct and it led to the failure of nose tip detection.

Statistical shape model-based approaches are a popular choice for landmark localization [14]. The common denominator of these methods is the availability of a reliable low-dimensional model for the shapes of interest. They combine the responses of local feature detectors with shape constraints that ensure plausibility of the result at a global level. Creusot et al. [15] presented a method called local shape dictionary to detect facial landmarks. Local shapes were learnt at a set of 14 manually-placed landmark positions and characterized by a set of 10 shape descriptors. But the method had limitations, e.g. only linear combinations were used for the descriptors. A shape index-based statistical shape model (SI-SSM) was presented to detect and track landmarks on 3D and 4D data [16]. The SI-SSM was constituted by the global shape of landmarks and the local features from patches around each landmark. However, ICP was used to get the initialized patches, which made this method sensitive to yaw variations. Zhao et al. [17] proposed a general learning-based framework for detecting facial features on 3D facial data. The method relied on a statistical facial feature model which was constructed by applying PCA to the global 3D face landmark configurations, the local texture, and the local shape around each landmark. The method addressed the problem of 3D landmark detection under expression and occlusion variations. However, it was sensitive to pose changes.

Facial landmark model (FLM) was used in some works for facial landmark localization [18-20]. For example, Perakis et al. [18] used shape index and spin images to extract candidate landmarks. The candidate landmarks were matched with the FLM and the positions of feature points were located. The method addressed the problem of large yaw and expression variations. However, they used three FLMs (i.e. FLM8, FLM5L, and FLM5R) to deal with pose variations, which increased the complexity of the calculation. Additionally, in contrast to HK curvature, the calculation of shape index was more complicated.

In this paper, we propose an approach of automatic facial landmark localization which deals with large pose and expression variations based on the facial geometrical structure. The approach consists of three main steps. Firstly, candidate landmark points (i.e. nose tip, chin tip, eye corners and mouth corners) are detected by using HK curvature analysis. Secondly, candidate landmarks are subdivided according to a facial geometrical structure-based classification strategy. Finally, landmarks are detected using an FLM. The experimental results on the CASIA 3D face dataset and the FRGC v2.0 dataset [21] combined with the UND Ear dataset [22] demonstrate the effectiveness of our method. The main contributions of this paper are the following:
1) We propose a facial geometrical structure-based classification strategy to subdivide candidate landmarks.
2) Compared with [18], we only use one FLM to deal with pose and expression variations.
3) Our method is accurate and robust to large yaw
variations (up to 60 degrees) and strong expression variations.

The rest of this paper is organized as follows: Section 2 describes the proposed approach for 3D facial landmark localization. Section 3 analyzes the experimental results. Comments and conclusions are provided in Section 4.

The aim of our work is to localize eight landmarks, i.e. left eye outer corner (LEOC), left eye inner corner (LEIC), right eye inner corner (REIC), right eye outer corner (REOC), nose tip (NT), mouth left corner (MLC), mouth right corner (MRC), and chin tip (CT), on a facial scan (Fig. 2). However, some landmarks are not always visible due to self-occlusion. For example, only five of these landmarks are visible on side scans. Therefore, it is very difficult to precisely localize landmarks when the pose of the scan is unknown.

HK classification is a method for curvature analysis on a 3D object proposed by Besl and Jain [23]. It can detect fundamental elements on a facial surface based on the signs of Gaussian and mean curvature, which makes it translation and rotation invariant. It is widely adopted in many works. However, there are multiple points which have significant curvatures. This method cannot provide the exact landmark localization results.

In the traditional FLM-based method, the plausible landmark combinations are selected by matching with the FLM. They must have the same dimensions. However, eight landmarks are visible on frontal facial scans, while only five landmarks are visible on side facial scans. The traditional FLM-based method cannot handle frontal and side facial scans simultaneously.

In this section, we describe a facial geometrical structure-based method to localize landmarks on 3D facial scans which exhibit large pose and expression variations. Candidate landmarks are detected by HK classification. Based on the priori knowledge of the facial geometrical structure, we can subdivide the candidate landmarks and determine which landmark point is visible. If eight landmarks are visible, candidate landmark sets are matched with the FLM directly. Otherwise, candidate landmark sets should be complemented first, and then matched with the FLM.

We use HK classification to detect candidate landmarks. H stands for mean curvature and K stands for Gaussian curvature. The mean and Gaussian curvature of a 3D point p are defined as

H(p) = (k1(p) + k2(p)) / 2,  K(p) = k1(p) · k2(p),

where k1(p) and k2(p) are the maximum and minimum principal curvature of p, respectively. Each point can be labeled into a basic geometric shape type based on the combination of the signs of mean and Gaussian curvature, as shown in Table 1 [24].
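In code, this labeling step reduces to sign tests on H and K computed from the principal curvatures defined above. The NumPy sketch below is an illustration only; the zero tolerance eps and the exact naming of the shape types are assumptions to be checked against Table 1 of the paper.

import numpy as np

def hk_classify(k1, k2, eps=1e-6):
    """Label surface points by the signs of mean (H) and Gaussian (K) curvature.

    k1, k2: arrays of maximum/minimum principal curvature per vertex.
    The sign-to-name mapping follows the usual HK table; eps and the
    naming are illustrative assumptions.
    """
    H = (k1 + k2) / 2.0
    K = k1 * k2
    labels = np.full(H.shape, "flat/minimal", dtype=object)
    labels[(K > eps) & (H < -eps)] = "peak (elliptical convex)"    # e.g. nose tip
    labels[(K > eps) & (H > eps)] = "pit (elliptical concave)"     # e.g. eye inner corner
    labels[(K < -eps) & (H < -eps)] = "saddle ridge"               # e.g. nasion
    labels[(K < -eps) & (H > eps)] = "saddle valley"               # hyperbolic concave
    return H, K, labels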
We use a thresholding process to retain the most significant candidate landmarks. It is given as follows, where θHN, θHC (θHC′), θHEI, and θHEO are the mean curvature thresholds of nose tip, chin tip, eye inner corner/mouth corner, and eye outer corner, respectively, and θKN, θKC (θKC′), θKEI, and θKEO are the Gaussian curvature thresholds of nose tip, chin tip, eye inner corner/mouth corner, and eye outer corner, respectively. Thresholds are set by the rule that the retained candidates account for a certain proportion of a face surface: 0.5% for nose tip, 1% for chin tip, 2% for eye inner corner/mouth corner, and 0.5% for eye outer corner [25]. According to this rule, the threshold values in our experiments are: θHN = −0.1, θHC = −0.05, θHC′ = −0.01, θHEI = 0.01, θHEO = 0.01, θKN = 0.01, θKC = 0.001, θKC′ = 0.005, θKEI = 0.001, and θKEO = −0.01. Then, candidate landmark points can be detected and divided into four classes: nose tip, chin tip, eye inner corner/mouth corner, and eye outer corner. However, since candidate landmarks of mouth corner and eye inner corner are elliptical concave, they belong to the same class. In addition, it cannot catch the relative position. For example, candidate landmarks of left and right eye outer corner are in the same class. Therefore, the HK classification alone is insufficient. Candidate landmarks need to be further classified.

The nose is the most stable facial region under pose and expression variations [26]. Once the nose is identified, we can further classify the candidate points according to the priori knowledge on facial shape [27]. It is helpful to the subsequent step and can improve the accuracy of landmark localization. Here, we propose a facial geometrical structure-based classification strategy. Firstly, the areas of nose tip and nasion are detected. The centroids of the two areas are regarded as the coarse positions of nose tip and nasion. Secondly, an intrinsic coordinate system is defined by nose tip, nasion, and the normal vector at the nose tip. Then, according to the position in the coordinate system, the candidate landmarks can be further subdivided.

In order to identify the nose, we localize nose tip and nasion as auxiliary points. The nose tip is a typical peak region and the nasion is a typical saddle ridge region. Thus, the candidates of nose tip and nasion can be detected by using the method mentioned in Section 2.1. However, the candidate points are discrete. Furthermore, there are a large number of outliers. Therefore, we first perform mesh segmentation to detect the areas of nose tip and nasion. The segmentation process is similar to the region growing algorithm in image segmentation, which uses the adjacency relation in topology information for area extension. The connected regions whose mean and Gaussian curvature conform to certain conditions are separated from the original data. A 3D mesh M consists of the vertex set V = {pi} and the set T of triangles formed by adjacent vertices. N is the number of mesh vertices. AV(pi) is the set of adjacent vertices of pi. C = {ci} is the set of HK classification labels. If the values of mean and Gaussian curvature of pi conform to a certain combination (Section 2.1), ci = 1, otherwise ci = 0. The method of mesh segmentation is summarized in Algorithm 1.

Algorithm 1 Mesh segmentation.
Input: a 3D mesh M, the set of adjacent vertices AV, and the set of HK classification labels C.
Procedure:
1) Set a label li for each vertex. Initialize li = 0 and the number of connected regions n = 1;
2) Choose a vertex pi for which li = 0 and ci = 1. Set p = pi and li = 1. pi is added to the set of connected regions REn;
3) For each vertex pj ∈ AV(p), if cj = 1 and lj = 0, pj is added to the set of connected regions REn;
4) For each unprocessed
vertex pk ∈ REn, set p = pk and lk = 1, and repeat step 3);
5) n = n + 1. Repeat steps 2), 3), and 4) until every vertex has been processed.
Output: the set of connected regions REn.

We usually get more than one connected region because of outliers. But the nose tip and nasion connected regions are the largest (having the largest number of points). Fig. 3 shows examples of nose tip and nasion connected regions in red and outlier connected regions in green. Therefore, we choose the largest peak connected region and saddle ridge connected region as the nose tip area and nasion area, respectively. Once the areas of nose tip and nasion are detected, the centroid of the nose tip area Gtip and the centroid of the nasion area Gn are computed as the coarse positions of nose tip and nasion, respectively.

Nose tip, nasion, and the normal vector at the nose tip are needed to define an intrinsic coordinate system. Note that the centroid of the nose tip area Gtip obtained in Section 2.2.1 may not be located on the facial surface, and then there is no normal vector. Thus, we choose the closest vertex on the surface, denoted by G̃tip, to replace Gtip. The normal vector at G̃tip is given by NV = PD1 × PD2, where PD1 and PD2 are the two principal directions at G̃tip. Then, the intrinsic coordinate system can be defined by G̃tip, Gn, and NV. Its origin is G̃tip. As can be seen from Fig. 4, a facial surface is divided into four areas by the XZ and YZ planes. Left eye outer/inner corner is located above the XZ plane and on the left of the YZ plane. Right eye outer/inner corner is located above the XZ plane and on the right of the YZ plane. Mouth left corner is located below the XZ plane and on the left of the YZ plane. Mouth right corner is located below the XZ plane and on the right of the YZ plane. Thus, according to the position in the coordinate system, the candidate landmarks obtained in Section 2.1 are subdivided into eight classes: nose tip, chin tip, left eye outer corner, right eye outer corner, left eye inner corner, right eye inner corner, mouth left corner, and mouth right corner. It can improve the efficiency of landmark localization. In addition, the candidate points which do not conform to the facial geometrical structure will be filtered out. For example, the candidate points which are labeled hyperbolic concave while located below the XZ plane cannot be the eye outer corners. Therefore, they will be regarded as outliers and filtered out.

An FLM is created to represent the landmark positions. To train the model, we manually labeled the target landmarks for each facial scan of the training set. Each example is represented by a shape vector of concatenated 3D coordinates

Ω = (ω1, ω2, ..., ωm),

where ωi = (xi, yi, zi) is the coordinates of each landmark and m is the number of landmarks. The shape vector is a 3m-dimensional vector. In this paper, m = 8. In order to remove the translational, rotational, and scale effects, we use Procrustes analysis to align the training shapes to their mean shape. It is performed by minimizing the Procrustes distance

D = Σk ||Ωk − Ω̄||²,

where Ωk is the kth training shape and Ω̄ is the mean shape, which is defined as

Ω̄ = (1/L) Σk Ωk,

where L is the number of examples in the training set. PCA is then applied to the aligned shape vectors. It provides an efficient parameterization of the shape model through dimensionality reduction [28]. If φ contains the f eigenvectors φi corresponding to the f largest eigenvalues λi of the covariance matrix CM of the aligned shape vectors, then the FLM can be modeled by

Ω ≈ Ω̄ + φb,

where b is a set of parameters controlling the shape variation of the FLM. It is an f-dimensional vector. The variance of the ith parameter, bi, across the training set is given by λi [29]. The value of f is typically determined by the number of components which is required to account for
99% of the training variation, and it is computed as

Σ (i = 1 to f) λi ≥ 0.99 Σ λi.

In the traditional FLM-based method, the plausible landmark combinations and the FLM must have the same dimensions. However, large yaw variations will result in missing data. Eight landmarks are not always visible on facial scans. The traditional FLM-based method cannot handle frontal and side facial scans simultaneously. Perakis et al. [18] used three FLMs to deal with this problem. In this work, candidate landmarks have been subdivided into eight classes according to the facial geometrical structure-based classification strategy. If a candidate landmark set is an empty set, the corresponding landmark is invisible. Fig. 5 shows a facial scan at approximately 60° yaw. Elliptical concave regions which can correspond to eye inner corners and mouth corners are labeled in red. It is clear that the candidate landmark sets of right eye inner corner and mouth right corner are empty sets. Therefore, we can easily estimate whether the landmark points are visible and only use one FLM to deal with large pose variations, which greatly decreases the complexity of the calculation.

We select the plausible landmark combinations by matching them with the FLM. Let α be the number of nonempty candidate landmark sets of a test facial scan. The nonempty landmark class is referred to as LN and the empty landmark class is referred to as LE. We create combinations Ψ by selecting one candidate landmark point from each LN class (Section 2.2.2). In order to filter out the translational and rotational effects, Ψ is aligned to Ω̄′ by minimizing the Procrustes distance, where Ω̄′ is the subset of Ω̄ and contains the landmarks belonging to LN. Note that the dimensions of Ψ and Ω̄ may be different, so Ψ cannot be projected onto the PCA feature subspace directly. Therefore, we use Ψ′ instead of Ψ, where Ψ′ contains the landmarks of Ψ (α landmarks) and the landmarks of Ω̄ which belong to LE (8 − α landmarks). Project Ψ′ onto the PCA feature subspace and its deformation parameters can be computed as

b = φᵀ(Ψ′ − Ω̄).

We assume that the bi are independent and Gaussian. Then we set deformation constraints |bi| ≤ 3√λi on each element [18, 29]. The candidate landmark shape Ψ belongs to the shape class with probability Pr(Ψ), where λi are the eigenvalues that satisfy the deformation constraints and Esum is the sum of the eigenvalues which are incorporated into the FLM. If Pr(Ψ) exceeds 0.99, the landmark set Ψ is considered plausible; otherwise it is rejected. Since the plausible landmark set may not be unique, we use the Procrustes distance as a distance metric. The plausible landmark set which has the minimum Procrustes distance from the FLM is selected as the landmark localization result.

In this section, we present three sets of experiments to evaluate the performance of the proposed method for landmark localization. The first one is to demonstrate the method's robustness to pose variations. The second one is to demonstrate the method's tolerance to expression variations. Both experiments are conducted on the CASIA 3D face dataset [10]. The third one is to compare our method with other state-of-the-art methods on the FRGC v2.0 dataset [21] combined with the Ear dataset from the University of Notre Dame (UND), collections F and G [22].
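A hedged sketch of the FLM training stage described above, in NumPy: simplified alignment (translation and scale removal only, without the iterative rotation fitting of full Procrustes analysis), PCA on the aligned shape vectors, and selection of the f components covering 99% of the variance. The simplifications are illustrative shortcuts, not the paper's exact procedure.

import numpy as np

def train_flm(shapes, var_kept=0.99):
    """Build a facial landmark model from training shapes.

    shapes: (L, 3m) array, one row of concatenated landmark coordinates
    per training scan. Returns (mean_shape, basis, eigenvalues).
    Alignment here is simplified (translation + scale only); a full
    implementation would iterate Procrustes alignment with rotation.
    """
    aligned = []
    for s in shapes:
        pts = s.reshape(-1, 3)
        pts = pts - pts.mean(axis=0)        # remove translation
        pts = pts / np.linalg.norm(pts)     # remove scale
        aligned.append(pts.ravel())
    X = np.asarray(aligned)
    mean_shape = X.mean(axis=0)

    # PCA via SVD of the centered, aligned shape vectors.
    _, svals, Vt = np.linalg.svd(X - mean_shape, full_matrices=False)
    eigvals = svals ** 2 / (len(X) - 1)

    # Keep the f leading components explaining var_kept of the variance.
    f = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), var_kept)) + 1
    return mean_shape, Vt[:f], eigvals[:f]

# A candidate shape psi (aligned the same way) gets deformation parameters
# b = basis @ (psi - mean_shape), checked against |b_i| <= 3 * sqrt(lambda_i).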
The CASIA dataset consists of 4,624 scans of 123 persons, with different expressions, illumination and pose variations. The scans are given as point clouds of about 1.5 × 10^4 vertices. The FRGC v2.0 dataset is composed of 4,007 range images of 466 persons. The images have frontal poses and various facial expressions. All images have a resolution of 480 × 640. The UND dataset contains side scans with yaw rotations of 45, 60, and 90 degrees. The institution collecting the UND dataset is the same that defined the FRGC v2.0 dataset. There is a partial overlap between subjects in the two datasets.

In these experiments, the error is computed as the Euclidean distance between the detected landmark and the corresponding manually labeled landmark. If the error is under a certain threshold (TH = 10 mm), it is deemed to be a successful detection.

In this experiment, we use the subset of the CASIA dataset which contains scans with pose variations. Note that the CASIA dataset contains side scans with yaw rotations of 30°, 60°, and 90°. However, both sides of the face have missing data and noses are missed for the 90° side scans. Therefore, we use only the 30° and 60° side scans. The training set is composed of one frontal scan per subject (123 scans). In order to evaluate the performance of our method against pose variations, we define five testing sets:
1) TS00, contains five frontal scans per subject (615 scans);
2) TS30-y, contains two scans with yaw rotations of 30° (left and right) per subject (246 scans);
3) TS60-y, contains two scans with yaw rotations of 60° (left and right) per subject (246 scans);
4) TS30-p, contains two scans with pitch rotations of 30° (looking-up and looking-down) per subject (246 scans);
5) TS30-r, contains two scans with roll rotations of 30° (tilt left and tilt right) per subject (246 scans).
These scans are independent of the 123 scans used in training the model. Scans in the training set and the testing sets are all neutral. Fig. 6 illustrates several landmark localization examples with pose variations. Note that only five landmarks are localized in facial scans with large yaw variations (Fig. 6(d)). Table 2 shows the mean error, standard deviation, and success rates of landmark localization on testing sets with pose variations. The mean error is less than 4.92 mm and the standard deviation is less than 2.90 mm in all testing sets. The success rate of landmark localization is at least 96.54%. The results show that our method has high accuracy under pose variations. Note that the accuracy of landmark localization decreases when the yaw angle increases. This is because as the yaw angle increases, the proportion of non-facial regions increases, too. It will bring in more noise and outliers and lead to the failure of landmark localization.

In this experiment, we analyzed the effect of expressions on the landmark localization accuracy. The same FLM is used in Experiment 1 and Experiment 2. We define three testing sets:
1) TS-neutral, contains five neutral frontal scans per subject (615 scans);
2) TS-small, contains four small expression frontal scans per subject (including smile and closed eyes) (492 scans);
3) TS-large, contains six large expression frontal scans per subject (including laugh, anger, and surprise) (738 scans).
Some landmark localization examples with expression variations are shown in Fig. 7. Table 3 provides an exhaustive summary of results obtained using testing sets with expression variations. As can be seen from the table, the best result is achieved for the neutral scans category, with a 99.03% accuracy. When the testing set consists of large expression scans, the success
rate of landmark localization is 92.98%, which represents a high performance. In addition, it can be observed that landmarks with less deformation in expressions are better localized (i.e., nose tip and eye inner corners). Mouth corners are detected with the worst accuracy on the TS-large dataset because they have large displacement and deformation resulting from facial expressions.

In this experiment, our method is compared with several state-of-the-art methods on the UND/FRGC v2.0 dataset. We use the FRGC v2.0 dataset for frontal facial scans and the UND dataset for side facial scans. We test our method on 1,500 facial scans which are randomly selected from the FRGC v2.0 dataset. Many of these scans have facial expressions. In addition, we test our method on the 45° side scans (118 left and 118 right from 118 subjects) and the 60° side scans (87 left and 87 right from 87 subjects) from the UND dataset. The localization results with different methods are presented in Tables 4 and 5.

Table 4 shows the performance comparison on the FRGC v2.0 dataset, which contains scans with frontal poses and various expressions. Our method outperforms the state-of-the-art except [17]. Zhao's method [17] exhibits the minimum mean error and standard deviation for eye inner corners and mouth corners. However, it cannot deal with large yaw variations which will make some landmarks invisible. Moreover, it needs not only range images but also texture images. Table 5 presents comparative results on the UND/FRGC v2.0 dataset, which contains scans with various poses and expressions. Compared to [18, 30], our method achieves smaller mean localization distance errors and standard deviations for all landmarks. These results indicate that our method is more accurate and robust to pose and expression variations.

In this paper, we present a pose and expression independent approach for 3D facial landmark localization. Candidate landmarks are detected using HK curvature analysis and further classified according to a facial geometrical structure-based classification strategy. They are identified and labeled by being matched with an FLM. We perform experiments on the CASIA dataset which contains scans with yaw variations up to 60 degrees and varying degrees of expressions. Furthermore, our method is compared with state-of-the-art methods on the UND/FRGC v2.0 dataset. The results show that our method has high accuracy and robustness both to large pose and expression variations.

[1] MARTINEZ B, VALSTAR M F, BINEFA X, et al. Local evidence aggregation for regression-based facial point detection [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(5): 1149-1163.
[2] SANGINETO E. Pose and expression independent facial landmark localization using dense-SURF and the Hausdorff distance [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(3): 624-638.
[3] NGUYEN D H, NA J, YI J. Nose tip detection from 3D facial mesh data using a rotationally invariant local shape descriptor [C] // Proceedings of the 5th IAPR International Conference on Biometrics. New York: IEEE, 2012: 91-96.
[4] KAUSHIK V D, SINGH S, TIWARI H, et al. Automatic 3D facial feature extraction algorithm [C] // Proceedings of IEEE International Conference on New Technologies, Mobility & Security. Tangier, Morocco: IEEE, 2008: 106-110.
[5] JU Q, O'KEEFE S, AUSTIN J. Binary neural network based 3D facial feature localization [C] // Proceedings of International Joint Conference on Neural Networks. New Jersey: IEEE, 2009: 1462-1469.
[6] SZEPTYCKI P, ARDABILIAN M, CHEN L. A coarse-to-fine curvature analysis-based rotation invariant 3D face landmarking [C] // Proceedings
Coarse Head Pose Estimation

Data format: VIDEO
Keywords: head pose, estimation, coarse, probabilistic model, neural network

Description: We evaluate two coarse pose estimation schemes, based on (1) a probabilistic model approach based on [1] and (2) a neural network approach. We compare the results of the two techniques for varying resolution, head localization accuracy and required pose accuracy using the CMU PIE database. The probabilistic model approach was more robust to head localization accuracy, but did not perform as well on very low-resolution head images. The neural network method was able to perform pose estimation even for head images as small as 8x8 pixels. At this resolution, the neural network was able to determine the head pan angle class 88% of the time for 9 poses with a step size of 22.5°; for 5 poses with a step size of 45°, recognition was 96%. The neural network was also considerably faster, running at over 300 Hz on a standard PC. Both methods were very sensitive to head tilt. In our initial tests, the probabilistic model approach appeared to be more extensible to data acquired under different conditions. We conjecture that there is a tradeoff between model complexity, extensibility and accuracy. In general, for head pose estimation and tracking (fine or coarse), there is a consistent tradeoff between complex models (i.e., 3D geometric models with elaborate initialization or specialized training sets), accuracy, and lack of extensibility, i.e., to people who do not fit the model, for whom initialization is not as good, or to individuals or lighting conditions that differ from the training set. The details of the implementation specifications for resolution and localization accuracy can be found in paper [2].
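To make the pose-class granularity above concrete, here is a small sketch (not part of the dataset's released tooling) of how a continuous pan angle can be quantized into the 9-class (22.5° step) or 5-class (45° step) schemes described:

```python
def pan_angle_class(pan_deg, step_deg):
    """Map a continuous head pan angle (degrees, 0 = frontal) to the
    nearest discrete pose class center. step_deg=22.5 gives the 9
    classes spanning -90..+90; step_deg=45.0 gives the 5 classes."""
    idx = int(round(pan_deg / step_deg))
    n_side = int(90 // step_deg)          # classes on each side of frontal
    idx = max(-n_side, min(n_side, idx))  # clamp to the legal range
    return idx * step_deg                 # class center in degrees

print(pan_angle_class(30.0, 22.5))  # -> 22.5 (nearest 22.5-degree class)
print(pan_angle_class(30.0, 45.0))  # -> 45.0 (nearest 45-degree class)
```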
Head Pose Estimation in Computer Vision: A Survey
Erik Murphy-Chutorian, Student Member, IEEE, and Mohan Manubhai Trivedi, Senior Member, IEEE

Abstract—The capacity to estimate the head pose of another person is a common human ability that presents a unique challenge for computer vision systems. Compared to face detection and recognition, which have been the primary foci of face-related vision research, identity-invariant head pose estimation has fewer rigorously evaluated systems or generic solutions. In this paper, we discuss the inherent difficulties in head pose estimation and present an organized survey describing the evolution of the field. Our discussion focuses on the advantages and disadvantages of each approach and spans 90 of the most innovative and characteristic papers that have been published on this topic. We compare these systems by focusing on their ability to estimate coarse and fine head pose, highlighting approaches that are well suited for unconstrained environments.

Index Terms—Head Pose Estimation, Human Computer Interfaces, Gesture Analysis, Facial Landmarks, Face Analysis

I. INTRODUCTION

From an early age, people display the ability to quickly and effortlessly interpret the orientation and movement of a human head, thereby allowing one to infer the intentions of others who are nearby and to comprehend an important nonverbal form of communication. The ease with which one accomplishes this task belies the difficulty of a problem that has challenged computational systems for decades. In a computer vision context, head pose estimation is the process of inferring the orientation of a human head from digital imagery. It requires a series of processing steps to transform a pixel-based representation of a head into a high-level concept of direction. Like other facial vision processing steps, an ideal head pose estimator must demonstrate invariance to a variety of image-changing factors.
These factors include physical phenomena like camera distortion, projective geometry, and multi-source non-Lambertian lighting, as well as biological appearance, facial expression, and the presence of accessories like glasses and hats. Although it might seem like an explicit specification of a vision task, head pose estimation has a variety of interpretations. At the coarsest level, head pose estimation applies to algorithms that identify a head in one of a few discrete orientations, e.g., a frontal versus left/right profile view. At the fine (i.e., granular) level, a head pose estimate might be a continuous angular measurement across multiple Degrees of Freedom (DOF). A system that estimates only a single DOF, perhaps the left to right movement, is still a head pose estimator, as is the more complex approach that estimates a full 3D orientation and position of a head while incorporating additional degrees of freedom, including movement of the facial muscles and jaw.

In the context of computer vision, head pose estimation is most commonly interpreted as the ability to infer the orientation of a person's head relative to the view of a camera. More rigorously, head pose estimation is the ability to infer the orientation of a head relative to a global coordinate system, but this subtle difference requires knowledge of the intrinsic camera parameters to undo the perceptual bias from perspective distortion. The range of head motion for an average adult male encompasses a sagittal flexion and extension (i.e., forward to backward movement of the neck) from −60.4° to 69.6°, a frontal lateral bending (i.e., right to left bending of the neck) from −40.9° to 36.3°, and a horizontal axial rotation (i.e., right to left rotation of the head) from −79.8° to 75.3° [26]. The combination of muscular rotation and relative orientation is an often overlooked ambiguity (e.g., a profile view of a head does not look exactly the same when a camera is viewing from the side as compared to when the camera is viewing from the front and the head is turned sideways). Despite this problem, it is often assumed that the human head can be modeled as a disembodied rigid object. Under this assumption, the human head is limited to 3 degrees of freedom (DOF) in pose, which can be characterized by pitch, roll, and yaw angles, as pictured in Fig. 1.
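As a concrete illustration of this rigid 3-DOF parameterization (not code from the survey itself), the sketch below composes pitch, yaw, and roll into a head rotation matrix; the axis assignment and rotation order are assumptions, since these conventions vary between systems:

```python
import numpy as np

def head_rotation(pitch, yaw, roll):
    """Compose a 3x3 rotation matrix from pitch (about x), yaw (about y)
    and roll (about z), all in radians. The R = Rz @ Ry @ Rx order and
    the axis assignment are one common convention, not a universal one."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

# A head turned 30 degrees to the side (yaw), otherwise level:
R = head_rotation(0.0, np.deg2rad(30.0), 0.0)
```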
Head pose estimation is intrinsically linked with visual gaze estimation, i.e., the ability to characterize the direction and focus of a person's eyes. By itself, head pose provides a coarse indication of gaze that can be estimated in situations when the eyes of a person are not visible (as in low-resolution imagery, or in the presence of eye-occluding objects like sunglasses). When the eyes are visible, head pose becomes a requirement to accurately predict gaze direction. Physiological investigations have demonstrated that a person's prediction of gaze comes from a combination of both head pose and eye direction [59]. By digitally composing images of specific eye directions on different head orientations, the authors established that an observer's interpretation of gaze is skewed in the direction of the target's head. A graphic example of this effect was demonstrated in the nineteenth-century drawing shown in Fig. 2 [134]. In this sketch, two views of a head are presented at different orientations, but the eyes are drawn in an identical configuration in both. Glancing at this image, it is quite clear that the perceived direction of gaze is highly influenced by the pose of the head. If the head is removed entirely, and only the eyes remain, the perceived direction is similar to that in which the head is in a frontal configuration. Based on this observation and our belief that the human gaze estimation ability is appropriately processing the visual information, we hypothesize that a passive camera sensor with no prior knowledge of the lighting conditions has insufficient information to accurately estimate the orientation of an eye without knowledge of the head orientation as well. To support this statement, consider the proportions of visible sclera (i.e., the white area) surrounding the eye. The high contrast between the sclera and the iris is discernible from a distance, and might have evolved to facilitate gaze perception [54].

Fig. 1. The three degrees of freedom of a human head can be described by the egocentric rotation angles pitch, roll, and yaw.

An eye direction model that uses this scleral iris cue would need a head pose estimate to interpret the gaze direction, since any head movement introduces a gaze shift that would not affect the visible sclera. Consequently, to computationally estimate human gaze in any configuration, an eye tracker should be supplemented with a head pose estimation system.

This paper presents a survey of the head pose estimation methods and systems that have been published over the past 14 years. This work is organized by common themes and trends, paired with a discussion of the advantages and disadvantages inherent in each approach. Previous literature surveys have considered general human motion [76,77], face detection [41,143], face recognition [149], and affect recognition [25]. In this paper, we present a similar treatment for head pose estimation. The remainder of the paper is structured as follows: Section II describes the motivation for head pose estimation approaches; Section III contains an organized survey of head pose estimation approaches; Section IV discusses the ground truth tools and datasets that are available for evaluation and compares the systems described in our survey based on published results and general applicability; Section V presents a summary and concluding remarks.

II. MOTIVATION

People use the orientation of their heads to convey rich, interpersonal information. For example, a person will point the direction of his head to indicate who is the intended target of a conversation. Similarly in a dialogue, head direction is a nonverbal communique that cues a listener when to switch roles and begin speaking. There is important meaning in the movement of the head as a form of gesturing in a conversation. People nod to indicate that they understand what is being said, and they use additional gestures to indicate dissent, confusion, consideration, and agreement. Exaggerated head movements are synonymous with pointing a finger, and they are a conventional way of directing someone to observe a particular location. In addition to the information that is implied by deliberate head gestures, there is much that can be inferred by observing a person's head. For instance, quick head movements may be a sign of surprise or alarm. In people, these commonly trigger reflexive responses from an observer, which are very difficult to ignore even in the presence of contradicting
auditory stimuli [58]. Other important observations can be made by establishing the visual focus of attention from a head pose estimate.

Fig. 2. Wollaston Illusion: although the eyes are the same in both images, the perceived gaze direction is dictated by the orientation of the head [134].

If two people focus their visual attention on each other, sometimes referred to as mutual gaze, this is often a sign that the two people are engaged in a discussion. Mutual gaze can also be used as a sign of awareness, e.g., a pedestrian will wait for a stopped automobile driver to look at him before stepping into a crosswalk. Observing a person's head direction can also provide information about the environment. If a person shifts his head towards a specific direction, there is a high likelihood that it is in the direction of an object of interest. Children as young as six months exploit this property, known as gaze following, by looking towards the line of sight of a caregiver as a saliency filter for the environment [79]. Like speech recognition, which has already become entwined in many widely available technologies, head pose estimation will likely become an off-the-shelf tool to bridge the gap between humans and computers.

III. HEAD POSE ESTIMATION METHODS

It is both a challenge and our aspiration to organize all of the varied methods for head pose estimation into a single ubiquitous taxonomy. One approach we had considered was a functional taxonomy that organized each method by its operating domain. This approach would have separated methods that require stereo depth information from systems that require only monocular video. Similarly, it would have segregated approaches that require a near-field view of a person's head from those that can adapt to the low resolution of a far-field view. Another important consideration is the degree of automation that is provided by each system. Some systems estimate head pose automatically, while others assume challenging prerequisites, such as locations of facial features that must be known in advance. It is not always clear whether these requirements can be satisfied precisely with the vision algorithms that are available today. Instead of a functional taxonomy, we have arranged each system by the fundamental approach that underlies its implementation. This organization allows us to discuss the evolution of different techniques, and it allows us to avoid ambiguities that arise when approaches are extended beyond their original functional boundaries. Our evolutionary taxonomy consists of the following eight categories that describe the conceptual approaches that have been used to estimate head pose:
• Appearance Template Methods compare a new image of a head to a set of exemplars (each labeled with a discrete pose) in order to find the most similar view.
• Detector Array Methods train a series of head detectors, each attuned to a specific pose, and assign a discrete pose to the detector with the greatest support.
• Nonlinear Regression Methods use nonlinear regression tools to develop a functional mapping from the image or feature data to a head pose measurement.
• Manifold Embedding Methods seek low-dimensional manifolds that model the continuous variation in head pose. New images can be embedded into these manifolds and then used for embedded template matching or regression.
• Flexible Models fit a non-rigid model to the facial structure of each individual in the image plane. Head pose is estimated from feature-level comparisons or from the instantiation of the model parameters.
• Geometric Methods use the location of features such as the eyes, mouth, and nose tip to determine pose from their relative configuration.
• Tracking Methods recover the global pose change of the head from the observed movement between video frames.
• Hybrid Methods combine one or more of these aforementioned methods to overcome the limitations inherent in any single approach.

Table I provides a list of representative systems for each of these categories.

TABLE I. Taxonomy of Head Pose Estimation Approaches (Approach: Representative Works)

Appearance Template Methods
• Image Comparison: Mean Squared Error [91], Normalized Cross-Correlation [7]
• Filtered Image Comparison: Gabor-wavelets [111]
Detector Arrays
• Machine Learning: SVM [47], Adaboost Cascades [53]
• Neural Networks: Router Networks [103]
Nonlinear Regression Methods
• Regression Tools: SVR [65,86], Feature-SVR [78]
• Neural Networks: MLP [108,115,130], Convolutional Network [96], LLM [100], Feature-MLP [33]
Manifold Embedding Methods
• Linear Subspaces: PCA [111], Pose-eigenspaces [114], LEA [27]
• Kernelized Subspaces: KPCA [136], KLDA [136]
• Nonlinear Subspaces: Isomap [45,101], LE [6], LLE [102], Biased Manifold Embedding [4], SSE [142]
Flexible Models
• Feature Descriptors: Elastic Graph Matching [55]
• Active Appearance Models: ASM [60], AAM [18,140]
Geometric Methods
• Facial Features: Planar & 3D Methods [30], Projective Geometry [42], Vanishing Point [132]
Tracking Methods
• Feature Tracking: RANSAC [31,147], Weighted Least Squares [145], Physiological Constraint [144]
• Model Tracking: Parameter Search [98,107,138], Least-squares [14], Dynamic Templates [139]
• Affine Transformation: Brightness and Depth Constancy [80]
• Appearance-based Particle Filters: Adaptive Diffusion [94], Dual-linear State Model [85]
Hybrid Methods
• Geometric & Tracking: Local Feature Configuration + SSD Tracking [43,52]
• Appearance Templates & Tracking: AVAM with Keyframes [82], Template Matching w/ Particle Filtering [1]
• Nonlinear Regression & Tracking: SVR + 3D Particle Filter [85]
• Manifold Embedding & Tracking: PCA Comparison + Continuous Density HMM [49]
• Other Combinations: Nonlinear Regression + Geometric + Particle Filter [109], KLDA + EGM [136]

In this section, each category is described in detail. Comments are provided on the functional requirements of each approach and the advantages and disadvantages of each design choice.

A. Appearance Template Methods

Appearance template methods use image-based comparison metrics to match a view of a person's head to a set of exemplars with corresponding pose labels. In the simplest implementation, the queried image is given the same pose that is assigned to the most similar of these templates. An illustration is presented in Fig. 3. Some characteristic examples include the use of normalized cross-correlation at multiple image resolutions [7] and mean squared error (MSE) over a sliding window [91].

Appearance templates have some advantages over more complicated methods. The templates can be expanded to a larger set at any time, allowing systems to adapt to changing conditions. Furthermore, appearance templates do not require negative training examples or facial feature points. Creating a corpus of training data requires only cropping head images and providing head pose annotations. Appearance templates are also well suited for both high and low-resolution imagery.
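The core of these methods fits in a few lines. The sketch below (a minimal illustration, not code from any surveyed system) matches a cropped head image against pose-labeled templates with normalized cross-correlation and returns the pose of the best match:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equal-size gray images."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def template_pose(query, templates, poses):
    """Return the pose label of the template most similar to `query`.
    templates: list of HxW arrays; poses: matching list of pose labels."""
    scores = [ncc(query, t) for t in templates]
    return poses[int(np.argmax(scores))]
```

New exemplars can be appended to `templates` and `poses` at any time, which is exactly the adaptability the text describes, at the cost of one extra comparison per query.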
There are many disadvantages with appearance templates. Without the use of some interpolation method, they are only capable of estimating discrete pose locations. They typically assume that the head region has already been detected and localized, and localization error can degrade the accuracy of the head pose estimate. They can also suffer from efficiency concerns, since as more templates are added to the exemplar set, more computationally expensive image comparisons need to be computed. One proposed solution to these last two problems was to train a set of Support Vector Machines (SVMs) to detect and localize the face, and subsequently use the support vectors as appearance templates to estimate head pose [88,89].

Fig. 3. Appearance template methods compare a new head view to a set of training examples (each labeled with a discrete pose) and find the most similar view.

Despite those limitations, the most significant problem with appearance templates is that they operate under the faulty assumption that pairwise similarity in the image space can be equated to similarity in pose. Consider two images of the same person at slightly different poses and two images of different people at the same pose. In this scenario, the effect of identity can cause more dissimilarity in the image than a change in pose, and template matching would improperly associate the image with the incorrect pose. Although this effect would likely lessen for widely varying poses, there is still no guarantee that pairwise similarity corresponds to similarity in the pose domain (e.g., a right-profile image of a face may be more similar to a left-profile image than to a frontal view). Thus, even with a homogeneous set of discrete poses for every individual, errors in template comparison can lead to highly erroneous pose estimates. To decrease the effect of the pairwise similarity problem, many approaches have experimented with various distance metrics and image transformations that reduce the head pose estimation error.
For example, the images can be convolved with a Laplacian-of-Gaussian filter [32] to emphasize some of the more common facial contours while removing some of the identity-specific texture variation across different individuals. Similarly, the images can be convolved with a complex Gabor-wavelet to emphasize directed features, such as the vertical lines of the nose and the horizontal orientation of the mouth [110,111]. The magnitude of this complex convolution also provides some invariance to shift, which can conceivably reduce the appearance errors that arise from the variance in facial feature locations between people.

B. Detector Arrays

Many methods have been introduced in the last decade for frontal face detection [97,104,126]. Given the success of these approaches, it seems like a natural extension to estimate head pose by training multiple face detectors, each specific to a different discrete pose. Fig. 4 illustrates this approach. For arrays of binary classifiers, successfully detecting the face will specify the pose of the head, assuming that no two classifiers are in disagreement. For detectors with continuous output, pose can be estimated by the detector with the greatest support. Detector arrays are similar to appearance templates in that they operate directly on an image patch. Instead of comparing an image to a large set of individual templates, the image is evaluated by a detector trained on many images with a supervised learning algorithm. An early example of a detector array used three SVMs for three discrete yaws [47]. A more recent system trained five FloatBoost classifiers operating in a far-field, multi-camera setting [146].

Fig. 4. Detector arrays comprise a series of head detectors, each attuned to a specific pose, and a discrete pose is assigned to the detector with the greatest support.

An advantage of detector array methods is that a separate head detection and localization step is not required, since each detector is also capable of making the distinction between head and non-head. Simultaneous detection and pose estimation can be performed by applying the detector to many subregions in the image. Another improvement is that, unlike appearance templates, detector arrays employ training algorithms that learn to ignore the appearance variation that does not correspond to pose change. Detector arrays are also well suited for high and low-resolution images.

Disadvantages to detector array methods also exist. It is burdensome to train many detectors for each discrete pose. For a detector array to function as a head detector and pose estimator, it must also be trained on many negative non-face examples, which requires substantially more training data. In addition, there may be systematic problems that arise as the number of detectors is increased. If two detectors are attuned to very similar poses, the images that are positive training examples for one must be negative training examples for another. It is not clear whether prominent detection approaches can learn a successful model when the positive and negative examples are very similar in appearance. Indeed, in practice these systems have been limited to one degree of freedom and fewer than 12 detectors. Furthermore, since the majority of the detectors have binary outputs, there is no way to derive a reliable continuous estimate from the result, allowing only coarse head pose estimates and creating ambiguities when multiple detectors simultaneously classify a positive image. Finally, the computational requirements increase linearly with the number of detectors, making it difficult to implement a real-time system with a large array. As a solution to this last problem, it has been suggested that a router classifier can be used to pick a single subsequent detector to use for pose estimation [103]. In this case the router effectively determines pose (i.e., assuming this is a face, what is its pose?), and the subsequent detector confirms the choice (i.e., is this a face at the pose specified by the router?). Although this technique sounds promising in theory, it should be noted that it has not been demonstrated for pitch or yaw change, but rather only for rotation in the camera plane, first using neural network-based face detectors [103], and later with cascaded AdaBoost detectors [53].
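At decision time a detector array reduces to scoring the same image window with every pose-specific detector and keeping the strongest response. This minimal sketch assumes hypothetical detector objects with a decision_function-style `score` method; it illustrates the scheme itself, not any surveyed implementation:

```python
import numpy as np

def detector_array_pose(window, detectors, poses, face_threshold=0.0):
    """Score one image window with every pose-specific detector.
    detectors: hypothetical objects whose .score(window) returns a real
    value (higher = more face-like at that detector's pose).
    Returns (pose, score), or (None, score) if no detector fires."""
    scores = np.array([d.score(window) for d in detectors])
    best = int(np.argmax(scores))
    if scores[best] < face_threshold:
        return None, float(scores[best])   # window rejected: not a head
    return poses[best], float(scores[best])
```

Note how detection and pose estimation fall out of the same argmax, which is why no separate localization step is needed, and why the cost grows linearly with the number of pose classes.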
Fig. 5. Nonlinear regression provides a functional mapping from the image or feature data to a head pose measurement.

C. Nonlinear Regression Methods

Nonlinear regression methods estimate pose by learning a nonlinear functional mapping from the image space to one or more pose directions. An illustration is provided in Fig. 5. The allure of these approaches is that, with a set of labeled training data, a model can be built that will provide a discrete or continuous pose estimate for any new data sample. The caveat is that it is not clear how well a specific regression tool will be able to learn the proper mapping. The high dimensionality of an image presents a challenge for some regression tools. Success has been demonstrated using Support Vector Regressors (SVRs) if the dimensionality of the data can be reduced, for example with Principal Component Analysis (PCA) [64,65], or with localized gradient orientation histograms [86], the latter giving better accuracy for head pose estimation. Alternatively, if the locations of facial features are known in advance, regression tools can be used on relatively low-dimensional feature data extracted at these points [70,78].
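As an illustration of the PCA-plus-SVR pipeline just described, the following scikit-learn sketch regresses a single yaw angle from flattened face crops; the library choice, image size, and hyperparameters are assumptions for the example, not settings from any surveyed system:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

# X: N flattened, gray, cropped head images; y: yaw angles in degrees.
rng = np.random.default_rng(0)
X = rng.random((200, 32 * 32))        # placeholder for real face crops
y = rng.uniform(-90, 90, size=200)    # placeholder pose labels

# Reduce dimensionality with PCA, then learn a nonlinear mapping with SVR.
model = make_pipeline(PCA(n_components=30), SVR(kernel="rbf", C=10.0))
model.fit(X, y)

yaw_pred = model.predict(X[:1])       # continuous yaw for a new crop
```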
Of the nonlinear regression tools used for head pose estimation, neural networks have been the most widely used in the literature. An example is the multi-layer perceptron (MLP), consisting of many feed-forward cells defined in multiple layers (i.e., the output of the cells in one layer comprises the input for the subsequent layer) [8,23]. The perceptron can be trained by backpropagation, a supervised learning procedure that propagates the error backwards through each of the hidden layers in the network to update the weights and biases. Head pose can be estimated from cropped images of a head using an MLP in different configurations. For instance, the output nodes can correspond to discretized poses, in which case a Gaussian kernel can be used to smooth the training poses to account for the similarity of nearby poses [10,106,148]. Although this approach works, it is similar to detector arrays and appearance templates in that it provides only a coarse estimate of pose at discrete locations. An MLP can also be trained for fine head pose estimation over a continuous pose range. In this configuration, the network has one output for each DOF, and the activation of each output is proportional to its corresponding orientation [108,115,117,129,130]. Alternatively, a set of MLP networks with a single output node can be trained individually for each DOF. This approach has been used for heads viewed from multiple far-field cameras in indoor environments, using background subtraction or a color filter to detect the facial region and Bayesian filtering to fuse and smooth the estimates of each individual camera [120,128–130]. A locally-linear map (LLM) is another popular neural network consisting of many linear maps [100]. To build the network, the input data is compared to a centroid sample for each map and used to learn a weight matrix. Head pose estimation requires a nearest-neighbor search for the closest centroid, followed by linear regression with the corresponding map. This approach can be extended with difference vectors and dimensionality reduction [11] as well as decomposition with Gabor-wavelets [56]. As was mentioned earlier with SVR, it is possible to train a neural network using data from facial feature locations. This approach has been evaluated with associative neural networks [33,34]. The advantages of neural network approaches are numerous. These systems are very fast, only require cropped, labeled faces for training, work well in near-field and far-field imagery, and give some of the most accurate head pose estimates in practice (see Section IV). The main disadvantage of these methods is that they are prone to error from poor head localization. As a suggested solution, a convolutional network [62] that extends the MLP by explicitly modeling some shift, scale, and distortion invariance can be used to reduce this source of error [95,96].
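A continuous-pose MLP of the kind described above, with one output per DOF, can be sketched with scikit-learn as follows (a toy illustration under assumed sizes and hyperparameters, not a surveyed system):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# X: flattened head crops; Y: one column per DOF (pitch, yaw, roll).
rng = np.random.default_rng(1)
X = rng.random((500, 24 * 24))            # placeholder image data
Y = rng.uniform(-60, 60, size=(500, 3))   # placeholder (pitch, yaw, roll)

# One hidden layer; three linear outputs, one per degree of freedom.
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
mlp.fit(X, Y)

pitch, yaw, roll = mlp.predict(X[:1])[0]  # continuous pose for a new crop
```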
Fig. 6. Manifold embedding methods try to directly project a processed image onto the head pose manifold using linear and non-linear subspace techniques.

D. Manifold Embedding Methods

Although an image of a head can be considered a data sample in high-dimensional space, there are inherently many fewer dimensions in which pose can vary. With a rigid model of the head, this can be as few as three dimensions for orientation and three for position. Therefore, it is possible to consider that each high-dimensional image sample lies on a low-dimensional continuous manifold constrained by the allowable pose variations. For head pose estimation, the manifold must be modeled, and an embedding technique is required to project a new sample into the manifold. This low-dimensional embedding can then be used for head pose estimation with techniques such as regression in the embedded space or embedded template matching. Any dimensionality reduction algorithm can be considered an attempt at manifold embedding, but the challenge lies in creating an algorithm that successfully recovers head pose while ignoring other sources of image variation.

Two of the most popular dimensionality reduction techniques, principal component analysis (PCA) and its nonlinear kernelized version, KPCA, discover the primary modes of variation from a set of data samples [23]. Head pose can be estimated with PCA, e.g., by projecting an image into a PCA subspace and comparing the result to a set of embedded templates [75]. It has been shown that similarity in this low-dimensional space is more likely to correlate with pose similarity than appearance template matching with Gabor-wavelet preprocessing [110,111]. Nevertheless, PCA and KPCA are inferior techniques for head pose estimation [136]. Besides the linear limitations of standard PCA, which cannot adequately represent the nonlinear image variations caused by pose change, these approaches are unsupervised techniques that do not incorporate the pose labels that are usually available during training. As a result, there is no guarantee that the primary components will relate to pose variation rather than to appearance variation; most probably, they will correspond to both. To mitigate these problems, the appearance information can be decoupled from the pose by splitting the training data into groups that each share the same discrete head pose. Then, PCA and KPCA can be applied to generate a separate projection matrix for each group. These pose-specific eigenspaces, or pose-eigenspaces, each represent the primary modes of appearance variation and provide a decomposition that is independent of the pose variation. Head pose can be estimated by normalizing the image and projecting it into each of the pose-eigenspaces, thus finding the pose with the highest projection energy [114]. Alternatively, the embedded samples can be used as the input to a set of classifiers, such as multi-class SVMs [63]. As testament to the limitations of KPCA, however, it has been shown that by skipping the KPCA projection altogether and using local Gabor binary patterns, one can greatly improve pose estimation with a set of multi-class SVMs [69]. Pose-eigenspaces have an unfortunate side effect: the ability to estimate fine head pose is lost since, like detector arrays, the estimate is derived from a discrete set of measurements. If only coarse head pose estimation is desired, it is better to use multi-class linear discriminant analysis (LDA) or its kernelized version, KLDA [23], since these techniques can be used to find the modes of variation in the data that best account for the differences between discrete pose classes [15,136].
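The pose-eigenspace scheme above can be sketched in a few lines: fit one PCA per discrete pose, then pick the pose whose subspace retains the most energy of the query image, which for normalized images is equivalent to the smallest reconstruction error. A minimal illustration under assumed data shapes (each pose group is assumed to contain at least n_components samples):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pose_eigenspaces(images_by_pose, n_components=20):
    """Fit one PCA subspace per discrete pose.
    images_by_pose: dict mapping pose label -> (N_i, D) array of crops."""
    return {pose: PCA(n_components=n_components).fit(X)
            for pose, X in images_by_pose.items()}

def pose_by_projection_energy(x, eigenspaces):
    """Pick the pose whose eigenspace keeps the most energy of x,
    i.e., the subspace with the smallest reconstruction error."""
    errors = {pose: np.linalg.norm(
                  x - pca.inverse_transform(pca.transform(x[None]))[0])
              for pose, pca in eigenspaces.items()}
    return min(errors, key=errors.get)
```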
Other manifold embedding approaches have shown more promise for head pose estimation. These include Isometric feature mapping (Isomap) [101,119], Locally Linear Embedding (LLE) [102], and Laplacian Eigenmaps (LE) [6]. To estimate head pose with any of these techniques, there must be a procedure to embed a new data sample into an existing manifold. Raytchev et al. [101] described such a procedure for an Isomap manifold, but for out-of-sample embedding in an LLE or LE manifold there has been no explicit solution. For these approaches, a new sample must be embedded with an approximate technique, such as a Generalized Regression Neural Network [4]. Alternatively, LLE and LE can be replaced by their linear approximations, locally embedded analysis (LEA) [27] and locality preserving projections (LPP) [39].

There are still some remaining weaknesses in the manifold embedding approaches mentioned thus far. With the exception of LDA and KLDA, each of these techniques operates in an unsupervised fashion, ignoring the pose labels that might be available during training. As a result, they have the tendency to build manifolds for identity as well as pose [4]. As one solution to this problem, identity can be separated from pose by creating a separate manifold for each subject; these manifolds can then be aligned together. For example, a high-dimensional ellipse can be fit to the data in a set of Isomap manifolds and then used to normalize the manifolds [45]. To map from the feature space to the embedded space, nonlinear interpolation can be performed with radial basis functions. Nevertheless, even this approach has its weaknesses, because appearance variation can be due to factors other than identity and pose, such as lighting. For a more general solution, instead of making separate manifolds for each variation, a single manifold can be created that uses a distance metric biased towards samples with smaller pose differences [4]. This change was shown to improve the head pose estimation performance of Isomap, LLE, and LE.

Another difficulty to consider is the heterogeneity of the training data that is common in many real-world training scenarios. To model identity, multiple people are needed to train a manifold, but it is often impossible to obtain a regular sampling of poses from each individual. Instead, the training images comprise a disjoint set of poses for each person sampled from some continuous measurement device. A proposed remedy to this problem is to create individualized submanifolds for each subject, and use them to render virtual reconstructions of the discrete poses that are missing between subjects [142]. This work introduced Synchronized Submanifold Embedding (SSE), a linear embedding that creates a projection matrix minimizing the distances between each sample and its nearest reconstructed neighbors (based on the pose label), while maximizing the distances between samples from the same subject.

All of the manifold embedding techniques described in this section are linear or nonlinear approaches. The linear techniques have the advantage that embedding can be performed by matrix multiplication, but they lack the representational ability of the nonlinear techniques. As a middle ground between these approaches, the global head pose manifold can be approximated by a set of localized linear manifolds. This has been demonstrated for head pose estimation with PCA, LDA, and LPP [66].
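For the nonlinear manifold route, the sketch below embeds training images with Isomap and regresses pose in the embedded space with nearest neighbors; scikit-learn's Isomap happens to provide the out-of-sample transform that the text notes is needed for new samples. Shapes, neighbor counts, and dimensions are assumptions for illustration only:

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.random((300, 24 * 24))           # placeholder head crops
yaw = rng.uniform(-90, 90, size=300)     # placeholder pose labels

# Learn a low-dimensional manifold, then regress pose in that space.
iso = Isomap(n_neighbors=10, n_components=3).fit(X)
knn = KNeighborsRegressor(n_neighbors=5).fit(iso.transform(X), yaw)

X_new = rng.random((1, 24 * 24))             # a new, unseen head crop
yaw_new = knn.predict(iso.transform(X_new))  # out-of-sample embedding
```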
E. Flexible Models

The previous methods have considered head pose estimation as a signal detection problem, mapping a rectangular region of image pixels to a specific pose orientation. Flexible models take a different approach. With these techniques, a non-rigid model is fit to the image such that it conforms to the facial structure of each individual. In addition to pose labels, these methods require training data with annotated facial features, but this enables them to make comparisons at the feature level rather than at the global appearance level. A conceptual illustration is presented in Fig. 7.

Recall the appearance template methods from Section III-A. To estimate pose, the view of a new head is overlaid on each template and a pixel-based metric is used to compare the images. Even with perfect registration, however, the images of two different people will not line up exactly, since the locations of facial features vary between people. Now, consider a template based on a deformable graph of local feature points (eye corners, nose, mouth corners, etc.). To train this system, facial feature locations are manually labeled in each training image, and local feature descriptors, such as Gabor-jets, can be extracted at each location. These features can be extracted from views of multiple people, and extra invariance can be achieved by storing a bunch of