Model-based Stereo with Occlusions
Pose Estimation Algorithms: A Survey of Methods Based on RGB, RGB-D, and Point-Cloud Data

Author: Tom Hardy (Zhihu); edited by 3D Vision Workshop.

Pose estimation algorithms fall into several main families: holistic methods, Hough-voting methods, keypoint-based methods, and dense-correspondence methods. Implementations may be classical or deep-learning based, and they differ in the data they consume (RGB images, RGB-D images, or point clouds) and, accordingly, in the annotation tooling they require.

Holistic methods
Holistic methods estimate the 3D position and orientation of an object directly from a given image. Classical template-based methods construct a rigid template and scan the image to compute the best-matching pose; such hand-crafted templates are unreliable in cluttered scenes. More recently, deep neural networks have been proposed that regress the 6D pose of the camera or object directly. However, the nonlinearity of the rotation space makes such direct regression hard for data-driven DNNs to learn and generalize.

References:
1. Discriminative mixture-of-templates for viewpoint classification
2. Gradient response maps for real-time detection of textureless objects
3. Comparing images using the Hausdorff distance
4. Implicit 3D orientation learning for 6D object detection from RGB images
5. Instance- and category-level 6D object pose estimation

Model-based:
- Deep model-based 6D pose refinement in RGB

Keypoint-based methods
Current keypoint-based methods first detect the 2D keypoints of an object in the image and then estimate the 6D pose with a PnP algorithm (a minimal sketch of this pipeline follows at the end of this section).

References:
1. SURF: Speeded-up robust features
2. Object recognition from local scale-invariant features
3. 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints
5. Stacked hourglass networks for human pose estimation
6. Making deep heatmaps robust to partial occlusions for 3D object pose estimation
7. BB8: A scalable, accurate, robust-to-partial-occlusion method for predicting the 3D poses of challenging objects without using depth
8. Real-time seamless single-shot 6D object pose prediction
9. Discovery of latent 3D keypoints via end-to-end geometric reasoning
10. PVNet: Pixel-wise voting network for 6DoF pose estimation

Dense correspondence / Hough voting methods
References:
1. Independent object class detection using 3D feature maps
2. Depth-encoded Hough voting for joint object detection and shape recovery
3. Viewpoint-aware object detection and pose estimation
4. Learning 6D object pose estimation using 3D object coordinates
5. Global hypothesis generation for 6D object pose estimation
6. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation
7. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation
8. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation
9. Normalized object coordinate space for category-level 6D object pose and size estimation
10. Recovering 6D object pose and predicting next-best-view in the crowd

Segmentation- and deep-learning-based methods
1. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes
2. Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views
6. Robust 6D object pose estimation in cluttered scenes using semantic segmentation and pose regression networks - Arul Selvam Periyasamy, Max Schwarz, and Sven Behnke

By data format, these algorithms can further be divided into methods based on RGB images, RGB-D images, and point-cloud data.
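To make the keypoint-based pipeline above concrete, here is a minimal sketch (an illustration, not code from any of the surveyed papers) using OpenCV's RANSAC PnP solver. The object keypoints, ground-truth pose, and intrinsics are hypothetical placeholders; in a real system the 2D points would come from a keypoint detector.

```python
import numpy as np
import cv2

# Hypothetical object model: 8 known 3D keypoints (object frame, meters).
object_points = np.array([
    [-0.05, -0.05, 0.0], [0.05, -0.05, 0.0], [0.05, 0.05, 0.0], [-0.05, 0.05, 0.0],
    [-0.05, -0.05, 0.1], [0.05, -0.05, 0.1], [0.05, 0.05, 0.1], [-0.05, 0.05, 0.1],
])
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])  # camera intrinsics

# Simulate detected 2D keypoints by projecting with a known ground-truth pose;
# in practice these come from a 2D keypoint detection network.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.01, 0.6])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

# PnP with RANSAC: recover the object's 6D pose from 2D-3D correspondences,
# tolerating outlier keypoints.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points, K, None)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
print(ok, rvec.ravel(), tvec.ravel())  # should match the ground-truth pose
```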
Three-dimensional Reconstruction

Introduction
Three-dimensional reconstruction is the process of creating a three-dimensional model of an object or scene from a set of two-dimensional images or data points. The technique finds applications in fields such as computer vision, medical imaging, archaeology, and virtual reality. This article explores the concept of three-dimensional reconstruction and its significance in different domains.

1. Principles of Three-dimensional Reconstruction
Three-dimensional reconstruction transforms two-dimensional information into a three-dimensional representation through a series of steps:

1.1 Image Acquisition: High-quality images or data points are collected using appropriate sensors or cameras. These images provide the visual information needed for the reconstruction process.

1.2 Feature Extraction: Distinctive features are identified and extracted from the acquired images. These features can be edges, corners, or keypoints that enable matching between different images.

1.3 Image Matching: Corresponding features from multiple images are matched to establish their spatial relationship. This step is crucial for reconstructing the three-dimensional model accurately.

1.4 Camera Calibration: Calibration parameters are estimated to determine the precise position and orientation of the camera. This calibration aligns the images correctly during the reconstruction process.

1.5 Depth Estimation: The depth of each feature point is estimated using techniques such as stereo vision, structure from motion, or depth sensors. This depth information is essential for creating the three-dimensional model.

1.6 Surface Reconstruction: Based on the depth information, a three-dimensional model is reconstructed by connecting the points in space. Algorithmic approaches such as triangulation, point-cloud merging, or mesh generation can be employed for this purpose.

2. Applications of Three-dimensional Reconstruction

2.1 Medical Imaging: Three-dimensional reconstruction plays a vital role in medical imaging, especially in computed tomography (CT), magnetic resonance imaging (MRI), and ultrasound. It helps medical professionals visualize internal organs and tissues accurately, aiding diagnosis and treatment planning.

2.2 Archaeology and Cultural Heritage: Three-dimensional reconstruction enables archaeologists to digitally recreate historical artifacts, buildings, and archaeological sites. By reconstructing these objects in a virtual environment, researchers can study and analyze them in detail without the risk of damage or loss.

2.3 Robotics and Autonomous Systems: Three-dimensional reconstruction is crucial in robotics for mapping environments and recognizing objects. Robots equipped with cameras or sensors can reconstruct their surroundings, enabling them to navigate and interact with the environment more efficiently.

2.4 Virtual Reality and Entertainment: Three-dimensional reconstruction is used extensively in virtual reality (VR) and the entertainment industry to create realistic virtual environments, characters, and scenery, enhancing the user's immersion and engagement.

3. Challenges and Future Developments

3.1 Data Acquisition: Obtaining high-quality, accurate data is crucial for accurate three-dimensional reconstruction.
Challenges such as noise, occlusions, and limited viewpoints can affect the reconstruction process; advances in sensors and imaging techniques continue to address them.

3.2 Computational Complexity: Three-dimensional reconstruction algorithms often require significant computational power and time, particularly for large-scale reconstructions. Developing efficient, scalable algorithms remains an active research area.

3.3 Reconstruction Accuracy: Achieving high reconstruction accuracy is an ongoing challenge. Factors such as lighting conditions, image resolution, and calibration errors affect the final model; continued refinement of algorithms and calibration techniques will improve accuracy.

3.4 Fusion of Multimodal Data: Integrating data from different sources, such as images, depth sensors, and thermal cameras, offers the potential for more comprehensive and accurate reconstructions. Future work should focus on effectively fusing and exploiting multimodal data.

Conclusion
Three-dimensional reconstruction has revolutionized many fields by enabling the creation of virtual three-dimensional models from two-dimensional data. Its applications in medicine, archaeology, robotics, and entertainment have opened new possibilities for research, diagnosis, and virtual experiences. With ongoing advances in technology and algorithms, three-dimensional reconstruction will continue to gain accuracy and realism.
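To illustrate the depth-estimation and triangulation steps described in Sections 1.5-1.6, here is a minimal linear-triangulation (DLT) sketch; the projection matrices and matched pixel coordinates are hypothetical, and this is one standard approach rather than the only one.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two views.

    P1, P2: 3x4 camera projection matrices; x1, x2: matched pixel
    coordinates (u, v) of the same scene point in each image.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # de-homogenize

# Hypothetical calibrated rig: identical intrinsics, second camera
# translated 0.1 m along x (a simple rectified stereo pair).
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])
print(triangulate_point(P1, P2, (352.0, 240.0), (312.0, 240.0)))  # ~(0.08, 0, 2.0)
```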
A Brief Look at Motion Capture (Part 1): Optical Motion Capture

1. The optical approach
Optical motion capture records movement by monitoring and tracking specific light points on the target. Most common optical systems are based on computer-vision principles: in theory, for any point in space that is simultaneously visible to two cameras, the images captured by the two cameras at the same instant, together with the camera parameters, determine that point's 3D position at that instant. When the cameras capture continuously at a sufficiently high frame rate, the point's motion trajectory can be recovered from the image sequence.

2. Background material
The PS optical human motion capture system is described by its vendor as the least expensive and best-performing optical system currently available, with what the vendor claims is the best price-performance ratio on the market. It is widely used in game production, gait analysis, sports medicine and rehabilitation, motion analysis, ergonomics research, simulation training, and biomechanics research; its customers include universities, research institutes, sports research institutions, and laboratories. It differs substantially from mechanical and electronic motion capture systems: its markers are light-emitting diodes about the size of a one-yuan coin, and these active LEDs allow fast, high-precision, convenient acquisition of motion data for every part of the body, in contrast to passive reflective markers.

Main features of the PS system:
- It can acquire the trajectories of up to 120 active LED markers in real time, which gives it several advantages over traditional passive-marker optical systems.
- It uses patented active LED markers, each of which is uniquely identified, which largely eliminates marker misidentification (swap) errors; even if an LED marker falls off, the system still identifies the remaining ones.
- High resolution: 3600 x 3600 pixels (12.96 megapixels); using a "sub-pixel" technique, the effective resolution reaches 30000 x 30000 pixels.
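The "sub-pixel" refinement mentioned above is commonly implemented as an intensity-weighted centroid over each detected marker blob. The following sketch is a generic illustration of that idea, not the PS system's proprietary method:

```python
import numpy as np

def subpixel_centroid(image, mask):
    """Locate a marker blob at sub-pixel accuracy.

    image: 2D grayscale array; mask: boolean array marking the blob's
    pixels. Returns the intensity-weighted centroid (row, col) as floats.
    """
    rows, cols = np.nonzero(mask)
    weights = image[rows, cols].astype(np.float64)
    total = weights.sum()
    return (rows @ weights / total, cols @ weights / total)

# Tiny synthetic example: a bright 2x2 blob whose true center falls
# between pixel centers.
img = np.zeros((5, 5))
img[1:3, 2:4] = [[90, 110], [100, 100]]
print(subpixel_centroid(img, img > 0))  # approx (1.5, 2.53)
```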
Optical Motion Capture

Yiannis Aloimonos and Gutemberg Guerra-Filho
Computer Vision Laboratory, Center for Automation Research
Institute for Advanced Computer Studies, Department of Computer Science
University of Maryland, College Park, Maryland 20742-3275, USA

Motion capture is the process of recording real-life movement of a subject as sequences of Cartesian coordinates in 3D space. Optical motion capture (OMC) uses cameras to reconstruct the body posture of the performer. One approach employs a set of multiple synchronized cameras to capture markers placed in strategic locations on the body. A motion capture system has applications in computer graphics for character animation, in virtual reality for human control-interface, and in video games for realistic simulation of human motion.

In this tutorial, we discuss the theoretical and empirical aspects of an optical motion capture system. Basically, for a motion capture system implementation, the resources required consist of a number of synchronized cameras, an image acquisition system, a capturing area, and a special suit with markers. The locations of the markers on the suit are designed such that the required body parts (e.g. joints) are covered. We present our motion capture system using a framework that identifies different sub-problems to be solved in a modular way. The sub-problems involved in OMC are initialization, marker detection, spatial correspondence, temporal correspondence, and post-processing. In this tutorial, we discuss the theory involved in each sub-problem and the corresponding novel techniques used in the current implementation.

The initialization includes setting up a human model and the computation of intrinsic and extrinsic camera calibration. Marker detection involves finding the 2D pixel coordinates of markers in the images. The spatial correspondence problem consists in finding pairs of detected markers in different images captured at the same time with different viewpoints such that each pair corresponds to the projections of the same scene point. Given camera calibration and the spatial matching, the 3D reconstruction of markers (translational data) is achieved by triangulating the various camera views. The temporal correspondence problem (tracking) involves matching two clouds of 3D points representing detected markers at two consecutive frames, respectively. The temporal correspondence module builds a track for each marker where the marker's 3D coordinates are concatenated according to time. Post-processing consists in labeling each track with a marker code, finding missing markers lost by occlusions, correcting possible gross errors, and filtering noise. Once the translational data is processed, a hierarchical human model may be used to compute rotational data (joint angles). We consider standard data formats available for motion capture data (e.g. bvh, acclaim). Other important techniques used to improve consistency in the motion data are volumetric reconstruction, inverse kinematics, and inverse dynamics.
We also cover topics related to editing and manipulation of motion data.

Tutorial Slides: The Language of Human Movement

Outline
- Introduction
  - Realistic Movement: Synthesis and Analysis
  - Motion Capture Technologies
  - Applications
- Required Resources
  - Capture Room
  - Body Suit
  - Camera Equipment
  - Acquisition System
- Initialization
  - Markers' Configuration
  - Camera Calibration
  - World Coordinate System Alignment
  - Background Subtraction
  - Kinematic Human Body Model
- Marker/Feature Detection
  - Edges
  - Corners
  - SIFT Features
- Spatial Correspondence
  - Stereo Matching
  - Wide Baseline
  - Dense Correspondence
  - Triangulation
- Temporal Correspondence
  - Tracking with Appearance
  - 2D and 3D Tracking
- Post-Processing
  - Labeling
  - Missing Markers
  - Rigidity Test
  - Motion Data Filtering
  - Translational and Rotational Data
  - Data File Formats
- Advanced Topics
  - Visual Hull Reconstruction
  - Monocular Markerless MoCap

Original Multiview Video
One approach employs a set of multiple synchronized cameras to capture markers placed in strategic locations on the body. The original videos for the human activities jump and tiptoe are presented in videos 1a and 1b, respectively.
Video 1a: Original jump action. Video 1b: Original tiptoe action.

Marker Detection
Marker detection involves finding the 2D pixel coordinates of markers in the images. In our system, the subject wears a black suit with white markers in a squared shape. Red circles represent the markers detected by our system in videos 2a and 2b.
Video 2a: Markers detected in jump action. Video 2b: Markers detected in tiptoe action.

Spatial Correspondence
The spatial correspondence problem consists in finding pairs of detected markers in different images captured at the same time with different viewpoints such that each pair corresponds to the projections of the same scene point. The pairs of markers computed by our system are displayed in videos 3a and 3b. The matches are represented by disparity vectors for markers in consecutive cameras.
Video 3a: Disparity vectors in jump action. Video 3b: Disparity vectors in tiptoe action.

3D Reconstruction
Given camera calibration and the spatial matching, the 3D reconstruction of markers (translational data) is achieved by triangulating the various camera views. The reconstructed points are shown in videos 4a and 4b, where the points are virtually inserted in the original background. In videos 5a and 5b, the reconstructed points are projected into different viewpoints.
Video 4a: 3D points in the original background (jump action). Video 4b: 3D points in the original background (tiptoe action).
Video 5a: 3D points from different viewpoints (jump action). Video 5b: 3D points from different viewpoints (tiptoe action).

Temporal Correspondence
The temporal correspondence problem (tracking) involves matching two clouds of 3D points representing detected markers at two consecutive frames, respectively. Given the correspondence between consecutive frames, a time series of 3D coordinates is built. Videos 6a and 6b draw the trajectories of some markers.
Video 6a: Trajectories of markers in jump action. Video 6b: Trajectories of markers in tiptoe action.

Post-Processing
Post-processing consists in labeling each track with a marker code, filling track gaps caused by occlusions, correcting possible gross errors, filtering or smoothing noise, and interpolating data along time.
The final result of our Optical Motion Capture System is shown in videos 7a and 7b, where a humanoid model, called "flat head", performs the actions.

Due to copyright restrictions, more detailed material is available on the Souvr main site.
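Returning to the temporal-correspondence step described in the tutorial above, the frame-to-frame matching of two 3D marker clouds can be posed as a minimum-cost assignment problem. The following is a minimal sketch of one plausible approach, not necessarily the tutorial's exact method:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_frames(prev_pts, curr_pts, max_dist=0.05):
    """Match 3D markers in consecutive frames by minimum total distance.

    prev_pts, curr_pts: (N, 3) and (M, 3) arrays of reconstructed marker
    positions. Returns a list of (i, j) index pairs; pairs farther apart
    than max_dist (meters) are rejected as likely occlusions/new markers.
    """
    # Pairwise Euclidean distance matrix between the two point clouds.
    diff = prev_pts[:, None, :] - curr_pts[None, :, :]
    cost = np.linalg.norm(diff, axis=2)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]

prev_pts = np.array([[0.0, 0.0, 1.0], [0.2, 0.0, 1.0], [0.0, 0.3, 1.0]])
curr_pts = np.array([[0.21, 0.0, 1.0], [0.01, 0.3, 1.01], [0.0, 0.01, 1.0]])
print(match_frames(prev_pts, curr_pts))  # [(0, 2), (1, 0), (2, 1)]
```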
Principles and Methods of Stereo Matching

Stereo matching is a fundamental problem in computer vision that aims to establish correspondences between points in a pair of stereo images. It is a crucial step in tasks such as depth estimation, visual odometry, and 3D reconstruction.

The principle of stereo matching is to find corresponding points in two images taken from different viewpoints. By comparing these points, the depth information of the scene can be inferred.
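For a rectified stereo pair, the standard relation between a match's horizontal disparity and its depth makes this inference concrete: with focal length $f$ (in pixels) and camera baseline $B$,

$$ Z = \frac{f\,B}{d}, \qquad d = x_{\text{left}} - x_{\text{right}}, $$

so nearer points have larger disparities, and the finite disparity resolution limits depth resolution at long range.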
One common method for stereo matching is pixel-based matching. These algorithms compare the intensity or color of pixels in the two images to find correspondences. However, pixel-based methods often struggle with textureless regions and occlusions in the images.
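A minimal sketch of pixel/block-based matching using OpenCV's block matcher follows; the file names, parameter values, and calibration constants are illustrative placeholders:

```python
import cv2
import numpy as np

# Load a rectified stereo pair as grayscale (hypothetical file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: for each block in the left image, search along the same
# row of the right image for the best match; the horizontal shift of the
# best match is the disparity.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disp = matcher.compute(left, right).astype(np.float32) / 16.0  # fixed point -> pixels

# Depth from disparity, Z = f*B/d, with hypothetical calibration values
# (f in pixels, B in meters). Non-positive disparities mark failed matches.
f, B = 800.0, 0.1
valid = disp > 0
depth = np.zeros_like(disp)
depth[valid] = f * B / disp[valid]
```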
Model-based Stereo with Occlusions

Fabiano Romeiro and Todd Zickler
School of Engineering and Applied Sciences, Harvard University
romeiro@ zickler@

Abstract. This paper addresses the recovery of face models from stereo pairs of images in the presence of foreign-body occlusions. In the proposed approach, a 3D morphable model (3DMM) for faces is augmented by an occlusion map defined on the model shape, and occlusion is detected with minimal computational overhead by incorporating robust estimators in the fitting process. Additionally, the method uses an explicit model for texture (or reflectance) in addition to shape, which is in contrast to most existing multi-view methods that use a shape model alone. We argue that both model components are required to handle certain classes of occluders, and we present empirical results to support this claim. In fact, the empirical results in this paper suggest that even in the absence of occlusions, stereo reconstruction using existing shape-only face models can perform poorly by some measures, and that the inclusion of an explicit texture model may be worth its computational expense.

1 Introduction

Being able to automatically recognize faces, track them, and estimate their expression and pose are important for many applications. Performing these tasks reliably requires the ability to represent the appearance of faces over large variations in illumination and viewpoint. It also requires the ability to model the effects of occlusions: both self-occlusions caused by the face itself and occlusions caused by "foreign bodies" (eye glasses, long facial hair, clothing, hands and limbs, etc.) in the environment.

Illumination effects can often be well-represented using purely image-based methods (e.g. [1-4]), but to effectively handle extreme changes in 3D pose, one typically requires a mechanism for "warping" 2D images. 3D morphable models (3DMMs), which are parametric models of shape and reflectance, are useful for this purpose because they explicitly represent 3D shape and therefore handle self-occlusions in a natural way.

In a 3D model-based approach, one is faced with the problem of finding the parameters of the model that best explain the input data. The estimated model parameters can then be used to perform recognition, track the face, detect expressions, synthesize new images, etc. The fitting problem is complicated in the presence of foreign-body occluders, because unlike self-occlusions, the image effects induced by foreign bodies cannot be explained by the face model.

In this paper we present a 3D model-based method for face reconstruction and recognition that exploits stereo imaging to handle foreign-body occlusions.
In the proposed approach, occlusion is represented using a single occlusion map defined on the 3D shape model, and this occlusion map is recovered efficiently by incorporating robust estimators in the fitting process.

In addition to including an occlusion map, we differentiate between two types of constraints for fitting a model to multiple views. According to the first constraint, each image should agree with a given model's shape and reflectance; and according to the second, the images should agree with each other given the model's shape. We find that the importance of these two constraints (roughly speaking, the "texture match" and the "stereo match") varies depending on the type of foreign-body occluders that are present. We also find that even in the absence of occluders, explicitly enforcing the texture match constraint significantly improves fitting accuracy in comparison to an approach that uses the stereo match constraint alone (suggested in [5]).

1.1 Related Work

3D Morphable Models (3DMMs) [6] use high-resolution linear 3D shape and texture models to represent faces. Typically, this model is fit to an input image by minimizing an energy function that measures the difference between intensities in the observed image and those predicted by the model. Recognition can be performed based on the model parameters [7] or by using the model to synthesize new views of the face in a canonical pose and lighting configuration [8].

Using a stereo pair for the fitting of a 3DMM imposes additional geometric constraints on the face shape, which can improve the quality of results. Also, by imposing a stereo matching constraint, the fitting of the shape and texture parameters can be decoupled [5]. According to this approach, the shape parameters are recovered by minimizing the per-vertex intensity differences between two calibrated views, and the texture is estimated separately using this shape. While the decoupling of shape and texture is appealing from an efficiency standpoint, the results we show here suggest that there are significant benefits to estimating both components jointly.

Explicit handling of foreign-body occlusions has been addressed for the case of monocular fitting of 3DMMs in [9], where a generalized EM algorithm is used to alternate between the estimation of a visibility map given the model and the model parameters given the visibility map. To account for spatial coherence of occluders, the visibility map is modeled by a Markov random field (MRF) on the image plane. In contrast, we model occlusions using a visibility map on the surface, and approximate the occlusion process using a robust estimator. While it gives up the preference for spatial coherence, the proposed approach can be implemented with little computational overhead. In addition, it can be easily extended to more views, since the occlusion map is on the surface.

Also related to this work are 2D active appearance models (AAMs), which trade precision for speed and are often used for tracking. 2D AAMs [10] typically use low-resolution 2D deformable shapes along with linear texture models. The fitting is done by matching a warped face image (with the warping being given by the linear shape model) against the linear texture model, and solving for the shape and texture parameters that give the best fit. Performance can be improved using an extension to the inverse compositional image alignment algorithm [11], by including 3D constraints [12], or by using multiple views [13, 14]. Fitting AAMs in the presence of occlusions can also be approached using robust estimators [15]. The main advantages of the 3D approach over 2D AAMs are the ability to directly model lighting effects, because it has access to surface normals, and to more easily handle self-occlusions.
2 Background

2.1 3D Morphable Models for Faces

As a 3D morphable model for faces, we use the shape and texture bases (3DFS-100) made available by the University of Freiburg [6]. These bases were obtained by first concatenating the N vertices (or RGB color values in the case of texture) of each scan i of a large set of high-resolution 3D face scans into vectors (FS_i for shape, and FT_i for texture), and putting them into correspondence. That is, the vectors are made such that the same entry in each vector corresponds to the same facial feature [16-18]. These vectors are denoted:

$$FS_i = [[X_{i1} Y_{i1} Z_{i1}] \ldots [X_{iN} Y_{iN} Z_{iN}]], \qquad FT_i = [[R_{i1} G_{i1} B_{i1}] \ldots [R_{iN} G_{iN} B_{iN}]].$$

Principal component analysis (PCA) is performed on this set of vectors, and the most significant eigenvectors are used as bases for shape and texture. Shape and texture are then expressed as linear combinations of these basis elements:

$$S = S_0 + \sum_{i=1}^{m} \alpha_i S_i, \qquad T = T_0 + \sum_{i=1}^{m} \beta_i T_i,$$

where S_0 and T_0 are the average face shape and texture and (S_1, ..., S_m) and (T_1, ..., T_m) are the eigenvectors of shape and texture respectively. Here, S_i, T_i ∈ R^{3N}. Thus, in this model, faces are represented by the set of coefficients α = (α_1, ..., α_m) and β = (β_1, ..., β_m) that correspond to their shape and texture.

If one assumes the coefficients are drawn from independent normal distributions, PCA also gives an estimate of their probability distributions:

$$P(\alpha) \propto \exp\left(-\frac{1}{2}\sum_{i=1}^{m} \frac{\alpha_i^2}{\sigma_i^2}\right), \qquad P(\beta) \propto \exp\left(-\frac{1}{2}\sum_{i=1}^{m} \frac{\beta_i^2}{\gamma_i^2}\right), \tag{1}$$

where σ_i and γ_i are determined by the respective eigenvalues of the covariance matrices of {FS_i} and {FT_i}.
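To make the linear model concrete, here is a minimal numpy sketch of drawing a face shape from such a PCA model. The array sizes and random stand-in bases are hypothetical placeholders, not the actual 3DFS-100 data:

```python
import numpy as np

# Hypothetical pre-loaded PCA model: mean shape S0 (3N,), basis (m, 3N),
# and per-coefficient standard deviations sigma (m,); texture is analogous.
m, N = 40, 10000
rng = np.random.default_rng(0)
S0 = rng.normal(size=3 * N)
S_basis = rng.normal(size=(m, 3 * N))
sigma = np.linspace(1.0, 0.1, m)

def sample_shape(alpha):
    """Shape instance S = S0 + sum_i alpha_i * S_i."""
    return S0 + alpha @ S_basis

def shape_log_prior(alpha):
    """Log of the PCA prior in Eq. (1), up to an additive constant."""
    return -0.5 * np.sum(alpha**2 / sigma**2)

alpha = rng.normal(scale=sigma)               # draw coefficients from the prior
vertices = sample_shape(alpha).reshape(N, 3)  # N vertices, xyz per vertex
print(vertices.shape, shape_log_prior(alpha))
```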
2.2 Image Formation Model

We assume faces to be in or close to the space spanned by the shape and texture bases of Sect. 2.1. Then, given a face's shape parameters α and a suitable rigid-body transformation (rotation R and translation t, which align the face model with the actual face), the true color value γ(k) of the face at the position corresponding to the face model's vertex k will equal that predicted by the model:

$$\gamma(k) \approx I_m(k), \tag{2}$$

where I_m(k) is the RGB value of the texture at v_k as given by the texture parameters β and a suitable set of lighting parameters.

For a lighting model, we assume the surface is Lambertian, and use (R_amb, G_amb, B_amb) for the ambient light color, (R_dir, G_dir, B_dir) for the directional light color, (R_offset, G_offset, B_offset) for the color channel offsets, and l for the directional light direction. Then we have:

$$I_m(k)_R = R_{\text{offset}} + t_k^R \cdot (R_{\text{amb}} + R_{\text{dir}} \cdot (n_k \cdot l)), \tag{3}$$

with similar definitions for the G and B channels. The symbol t_k represents the k-th RGB value in the face model's texture vector representation given the texture coefficients β, and n_k represents the surface normal at v_k.

Assume we are given a stereo pair (I_1, I_2) of face images captured from a pair of calibrated cameras. Letting P_1 and P_2 denote the two camera projection matrices, and assuming we are given the shape parameters α and rigid-body transformation parameters (R, t), we have two available measurements of γ(k). These can be written I_1(P_1(R(v_k - c) + c + t)) and I_2(P_2(R(v_k - c) + c + t)), where c is the centroid of the average face shape. Assuming that the cameras are radiometrically calibrated (i.e., have the same exposure, white balance, etc.) with additive Gaussian noise, a reasonable estimator for γ(k) is:

$$\hat{\gamma}(k) = \bar{I}(v_k, R, t) = \frac{I_1(P_1(R(v_k - c) + c + t)) + I_2(P_2(R(v_k - c) + c + t))}{2}. \tag{4}$$

Thus a simple approximation for the distribution of I_m(k), given I_1, I_2, α, R, t, is a normal distribution with mean Ī and standard deviation σ_t (say):

$$I_m(k) \sim N(\bar{I}(v_k, R, t), \sigma_t). \tag{5}$$

In addition, when α, I_2, R, t are given, and again assuming that the cameras are radiometrically calibrated, we can use the following model for the noisy observation in I_1 of a vertex v_k that is visible in both images:

$$I_1(P_1(R(v_k - c) + c + t)) \sim N(I_2(P_2(R(v_k - c) + c + t)), \sigma_s). \tag{6}$$

Note that if the cameras are not radiometrically calibrated, this can be generalized by incorporating camera-dependent gains and offsets into I_1 and I_2.

For simplicity, we make use of the following notation in the next section:
ρ - the 6 parameters of the rigid-body transformation (3 for R, 3 for t).
τ - the 11 lighting parameters (3 for i_amb, 3 for i_dir, 3 for i_offset, i = {R, G, B}, and 2 for l).
s_k - the position of the k-th model vertex given pose parameters (R, t) and shape parameters α; s_k = R(v_k - c) + c + t.

3 Robust Stereo Fitting of 3DMMs

3.1 Joint Shape and Texture Stereo Fitting

We use an energy function that incorporates both a shape model and a texture model by combining terms derived from Eqs. 5 and 6, with regularization:

$$E = \underbrace{\sum_{k \mid v_k \in V} \frac{\|I_1(P_1 s_k) - I_2(P_2 s_k)\|^2}{\sigma_s^2}}_{\text{stereo match}} + \underbrace{\sum_{i=1}^{m} \frac{\alpha_i^2}{\sigma_i^2}}_{\text{shape prior}} + \underbrace{\sum_{k \mid v_k \in V} \frac{\|I_m(k) - \bar{I}(s_k)\|^2}{\sigma_t^2}}_{\text{texture model match}} + \underbrace{\sum_{i=1}^{m} \frac{\beta_i^2}{\gamma_i^2}}_{\text{texture prior}}. \tag{7}$$

Here, the symbol V is used to denote the set of vertices v_k of the face model with parameters (α, ρ) that are visible in both I_1 and I_2. Model-fitting is performed by finding parameters α, β, ρ, τ that minimize E.
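For illustration only (this is not the authors' implementation), the energy of Eq. 7 could be evaluated for a candidate parameter set along the following lines; the helpers project1, project2, and model_rgb, which sample image colors at projected vertices and evaluate the model-predicted texture, are hypothetical placeholders:

```python
import numpy as np

def energy(s, visible, project1, project2, model_rgb,
           sigma_s, sigma_t, alpha, beta, sigma, gamma):
    """Evaluate Eq. (7): stereo match + texture match + shape/texture priors.

    s: (K, 3) posed vertex positions s_k; visible: boolean mask of vertices
    seen in both views; sigma/gamma: PCA prior standard deviations.
    """
    sk = s[visible]
    c1 = project1(sk)            # sampled colors I1(P1 s_k), shape (Kv, 3)
    c2 = project2(sk)            # sampled colors I2(P2 s_k)
    Ibar = 0.5 * (c1 + c2)       # Eq. (4) estimator of the true color
    Im = model_rgb(visible)      # model texture + lighting, Eq. (3)

    stereo = np.sum((c1 - c2) ** 2) / sigma_s**2     # stereo match term
    texture = np.sum((Im - Ibar) ** 2) / sigma_t**2  # texture match term
    shape_prior = np.sum(alpha**2 / sigma**2)
    texture_prior = np.sum(beta**2 / gamma**2)
    return stereo + shape_prior + texture + texture_prior
```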
This can be interpreted in a MAP framework as a search for parameters (α, β, ρ, τ) for which the posterior P(α, β, ρ, τ | I_1, I_2) is maximal, and such an interpretation highlights the assumptions underlying our approach. First, we expand the posterior as P(α, β, ρ, τ | I_1, I_2) = P(α, ρ | I_1, I_2) · P(β, τ | I_1, I_2, α, ρ). The first term is then rewritten P(α, ρ | I_1, I_2) ∝ P(I_1 | α, ρ, I_2) · P(α), which, by Bayes' rule, assumes that α, ρ, I_2 are mutually independent and that the distribution of face poses (ρ) is uniform. The assumption that shape (α) and pose (ρ) are independent from I_2 may seem non-trivial. But without knowledge of face texture (β), little can be inferred about I_2, because any image I_2 can be explained by a suitably selected texture. Using Eq. 6 we write:

$$P(I_1 \mid \alpha, \rho, I_2) \propto \prod_{k \mid v_k \in V} \exp\left(-\frac{1}{2} \frac{\|I_1(P_1 s_k) - I_2(P_2 s_k)\|^2}{\sigma_s^2}\right), \tag{8}$$

and using Eq. 5 (assuming the texture (β) and scene lighting (τ) independent, and τ uniformly distributed), we write:

$$P(\beta, \tau \mid I_1, I_2, \alpha, \rho) \propto P(\beta) \cdot \prod_{k \mid v_k \in V} \exp\left(-\frac{1}{2} \frac{\|I_m(k) - \bar{I}(s_k)\|^2}{\sigma_t^2}\right). \tag{9}$$

Finally, we obtain the energy E by substituting Eqs. 1, 8 and 9 into our expression for the posterior, taking the logarithm, negating it, and ignoring constant factors.

One can make the following observations about this energy function. First, suppose one were to include only the last three terms in Eq. 7, which would correspond to maximizing P(I_1 | α, β, ρ, τ) · P(I_2 | α, β, ρ, τ) · P(α, β). This approach would not account for the correlation between I_1 and I_2. The two images are not independent given (α, β, ρ, τ) because the true appearance of the face deviates from that given by the face model, and consequently, the two prediction errors are correlated.

Second, suppose we were to ignore the third and the fourth terms in Eq. 7. This is the approach taken in [5], and it corresponds to maximizing P(α, ρ | I_1, I_2) without including a texture model. As we will show experimentally in Sect. 4, this approach can perform poorly because it does not necessarily ensure that important features (eyes, eyebrows, lips) are properly aligned.

Finally, we can compare our approach to an uncalibrated case in which one has no information about the stereo cameras. In this case, separate pose parameters (ρ_1, ρ_2) could be used for each image, and one might seek to maximize P(α, β, τ, ρ_1, ρ_2 | I_1, I_2). In this case, by the same argument as in the first observation, I_1 and I_2 are still not independent given α, β, τ, ρ_1, ρ_2; therefore maximizing P(I_1 | α, β, τ, ρ_1) · P(I_2 | α, β, τ, ρ_2) · P(α, β) (which would be the trivial extension of the monocular fitting case to two images [6]) does not necessarily maximize P(α, β, τ, ρ_1, ρ_2 | I_1, I_2).

3.2 Handling Occlusion

While the approach in the previous section correctly handles cases of self-occlusion (where one part of the face occludes another), it does not account for the possibility of foreign-body occlusions. To handle such situations, we use a modified version of the energy function in Eq. 7, introducing a robust estimator h_a:

$$E' = \sum_{k \mid v_k \in V} h_a\!\left(\frac{\|I_1(P_1 s_k) - I_2(P_2 s_k)\|^2}{\sigma_s^2} + \frac{\|I_m(k) - \bar{I}(s_k)\|^2}{\sigma_t^2}\right) + \sum_{i=1}^{m} \frac{\alpha_i^2}{\sigma_i^2} + \sum_{i=1}^{m} \frac{\beta_i^2}{\gamma_i^2}. \tag{10}$$

This modification requires little change in the optimization procedure, and allows the fitting to be significantly more robust to foreign-body occlusions (see Sect. 4.2). Intuitively, by introducing the robust estimator we are limiting the impact in the energy function of vertices whose stereo matching term or texture matching term are high.

More formally, this approach can be justified by introducing a binary occlusion map O : {1, ..., N} → {0, 1}, defined on the set of all vertices of the face model. This map dictates whether a vertex of the face model is occluded by a foreign body in at least one of the images (O(k) = 1) or not occluded in either (O(k) = 0). Thus, the image formation model is altered so that the visible parts of the face present in the images are generated only by vertices v_k for which O(k) = 0.

In this setting, it can be shown that minimizing E' corresponds to searching for α, β, ρ, τ, O for which P(α, β, ρ, τ, O | I_1, I_2) is maximal. Again, we can write P(α, β, ρ, τ, O | I_1, I_2) = P(α, ρ, O | I_1, I_2) · P(β, τ | I_1, I_2, α, ρ, O). We expand the first term by making the same assumptions as those used in the previous section, obtaining P(α, ρ, O | I_1, I_2) ∝ P(I_1 | α, ρ, O, I_2) · P(α, O). The term P(I_1 | α, ρ, O, I_2) is then approximated as in Eq. 8, where the product is now over {k | v_k ∈ V, O(k) = 0}.

In favor of simplicity and efficiency, we ignore spatial coherence of occlusions, and assume O(k) ~ i.i.d. Bernoulli, obtaining the following prior on O:

$$P(O) \propto \prod_{k \mid v_k \in V} \exp(-\eta_o \cdot O(k)). \tag{11}$$

Using this prior avoids the trivial labeling of all vertices being occluded during the optimization process. Combining these terms and assuming the shape (α) and occlusion map (O) to be independent, we obtain an expression for P(α, ρ, O | I_1, I_2). Substituting this expression into the posterior along with an expression for the posterior's second term similar to Eq. 9 (but with the product over {k | v_k ∈ V, O(k) = 0}), one sees that maximizing the posterior corresponds to minimizing:

$$E'' = \sum_{k \mid v_k \in V} f(\alpha, \beta, \rho, \tau, O, k) + \sum_{i=1}^{m} \frac{\alpha_i^2}{\sigma_i^2} + \sum_{i=1}^{m} \frac{\beta_i^2}{\gamma_i^2}, \tag{12}$$

where

$$f(\alpha, \beta, \rho, \tau, O, k) = g(\alpha, \beta, \rho, \tau, k) \cdot (1 - O(k)) + 2\eta_o \cdot O(k), \tag{13}$$

and

$$g(\alpha, \beta, \rho, \tau, k) = \frac{\|I_1(P_1 s_k) - I_2(P_2 s_k)\|^2}{\sigma_s^2} + \frac{\|I_m(k) - \bar{I}(s_k)\|^2}{\sigma_t^2}. \tag{14}$$

The minimization of E'' can be rearranged as:

$$\min_{\alpha, \beta, \rho, \tau, O} E'' = \min_{\alpha, \beta, \rho, \tau} \left\{ \min_{O} \left\{ \sum_{k \mid v_k \in V} f(\alpha, \beta, \rho, \tau, O, k) \right\} + \sum_{i=1}^{m} \frac{\alpha_i^2}{\sigma_i^2} + \sum_{i=1}^{m} \frac{\beta_i^2}{\gamma_i^2} \right\} \tag{15}$$

$$= \min_{\alpha, \beta, \rho, \tau} \left\{ \sum_{k \mid v_k \in V} h(g(\alpha, \beta, \rho, \tau, k), k) + \sum_{i=1}^{m} \frac{\alpha_i^2}{\sigma_i^2} + \sum_{i=1}^{m} \frac{\beta_i^2}{\gamma_i^2} \right\}, \tag{16}$$

where

$$h(g(\alpha, \beta, \rho, \tau, k), k) = \min_{O(k)} \{ g(\alpha, \beta, \rho, \tau, k) \cdot (1 - O(k)) + 2\eta_o \cdot O(k) \}. \tag{17}$$

Relaxing the binary process O(k) to an outlier process that varies continuously, 0 ≤ O_a(k) ≤ 1, we can approximate h(g, k) by a robust function h_a,

$$h_a(g) = -\sigma_o \cdot \ln\left(\left(1 - e^{-e_o/\sigma_o}\right) \cdot e^{-g/\sigma_o} + e^{-e_o/\sigma_o}\right), \tag{18}$$

with suitable parameters e_o and σ_o. These parameters are determined empirically to provide a smooth approximation of the min function (see Fig. 1). This leads to E' as in Eq. 10, where the minimization is over α, β, ρ, τ. Following optimization, the occlusion map is recovered from (for v_k ∈ V):

$$O^*(k) = \begin{cases} 1, & \text{if } h_a(g(\alpha^*, \beta^*, \rho^*, \tau^*, k)) \geq 2\eta_o - \epsilon \\ 0, & \text{if } h_a(g(\alpha^*, \beta^*, \rho^*, \tau^*, k)) < 2\eta_o - \epsilon, \end{cases} \quad \text{where } (\alpha^*, \beta^*, \rho^*, \tau^*) = \arg\min_{\alpha, \beta, \rho, \tau} E'. \tag{19}$$

Fig. 1. Robust estimator h_a(g) (Eq. 18) used to handle foreign-body occlusions in the fitting process: (a) e_o = 300, σ_o = 1; (b) e_o = 300, σ_o = 50.

3.3 On Foreign-Body Occlusions

In a stereo setup, there can be several cases of foreign-body occlusion of a vertex of the face model. We can classify these cases with respect to the positioning of the occluder (see Fig. 2): half-occlusion (HO), where the vertex is occluded in one of I_1 or I_2; full-occlusion-near (FO_n), where the vertex is occluded in both I_1 and I_2 and the occluding object is close to the face; and full-occlusion-far (FO_f), where the occluder is far from the face relative to the face size. We can also classify occluders with respect to their texture, which can be one of: textureless (non-skin-colored); textureless (skin-colored); and textured.

Depending on the type of occlusion, we expect either the stereo match term or the texture match term to play a more prominent role in the fitting process (see Table 1). For example, in the case of half-occlusion (HO) by a non-skinlike surface, one can expect the stereo match term to provide an important cue as to whether a vertex is occluded. This is because the observed intensities at the projections of a half-occluded vertex correspond to observations of two very different surfaces. When the occlusion is of type full-occlusion-near (FO_n), on the other hand, the stereo match term will not provide much help in determining an occlusion because the two observed intensities will come from nearby locations on the occluder and will be very similar. In this case, provided that the occluder has non-skinlike color, the texture match will be the most helpful in determining its presence. Of course, when the occluder lacks texture and is skinlike, there is little visual information to discriminate between it and the face. Experimental results are shown in Sect. 4.2.

Occluder classification        HO    FO_n    FO_f
textureless (non-skincolor)    S     T       T
textureless (skincolor)        X     X       X
textured                       S     T       S+T

Table 1. Most relevant terms in the energy for each of the occlusion cases: S for stereo match term and T for texture match term (see Fig. 2).

Fig. 2. Categories of foreign-body occlusions. From left to right, occlusions can be one of: half-occlusion (HO), full-occlusion-near (FO_n), full-occlusion-far (FO_f). The stereo and texture terms play different roles in each case (see Table 1).

3.4 Optimization Procedure

Initial Fit. Like previous approaches [5, 17], we assume that either by user selection, or by means of an automated detection process, image coordinates of a subset of specific feature points of the face (e.g. corners of the eyes, corners of the mouth, tip of the nose, corners of the ears) in both I_1 and I_2 are available.
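The robust estimator of Eq. 18 is simple to implement and to sanity-check against the hard minimum of Eq. 17. A small sketch, using the parameter values quoted in Fig. 1 (this is an editorial illustration, not the authors' code):

```python
import numpy as np

def h_a(g, e_o=300.0, sigma_o=50.0):
    """Robust estimator of Eq. (18): a smooth approximation of
    min(g, e_o) that saturates for large residuals g."""
    w = np.exp(-e_o / sigma_o)
    return -sigma_o * np.log((1.0 - w) * np.exp(-g / sigma_o) + w)

g = np.array([0.0, 100.0, 300.0, 1000.0])
print(h_a(g))                 # grows like g for small g, then flattens near e_o
print(np.minimum(g, 300.0))   # the hard min it approximates
```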
(Some of the feature points may be occluded in one or both images.) Let j_1, ..., j_p denote the indices of the vertices in the face model corresponding to these feature points. Starting from the average shape parameters (α = 0), we use a quasi-Newton gradient descent method to minimize

$$E_f = \sum_{i=1}^{p} \left( \delta_{1i} \|P_1 s_{j_i} - p_{1i}\|^2 + \delta_{2i} \|P_2 s_{j_i} - p_{2i}\|^2 \right), \tag{20}$$

and obtain a rough initial estimate of the shape and rigid-body transformation parameters. Here, δ_{1i} = 1 if the i-th feature is visible in image I_1 and 0 otherwise (and similarly for δ_{2i} and I_2), p_{1i} is the image coordinate of the i-th feature in image I_1, and p_{2i} is the image coordinate of the i-th feature in image I_2.

Optimization. For comparison purposes we evaluate the fitting performance of E and E' with and without the texture model terms. In experiments where we utilize only the stereo terms in E (or E'), we start with model parameters α, ρ from the initial fit. In experiments that include texture we also start with the average texture parameters (β = 0), and lighting parameters τ such that i_amb = 1, i_dir = 1 (i.e., white ambient and directional lights), and i_offset = 0 (zero offset), where i = R, G, B. The lighting direction l is initialized to be the bisector of the two cameras' viewing directions. We minimize:

$$E + \lambda \cdot E_f \tag{21}$$

with respect to the suitable parameters, using a stochastic quasi-Newton gradient descent method.

To avoid local minima, we use a coarse-to-fine approach, with 3 levels of resolution. At the coarsest resolution, we use versions of I_1 and I_2 that are downsampled by a factor of four, together with a corresponding low-resolution version of the 3D face model. As we progress toward the finest level of resolution, we use smaller and smaller values for λ, σ_s and σ_t, which gives smaller weights to the feature term and the shape and texture priors. At regular intervals (more frequently at coarser levels), we recompute the self-occluded vertices (and thus V) as well as the normals (n_k). Instead of computing the energy using all the vertices v_k ∈ V, at each iteration we randomly select a subset of these vertices on which to compute the energy (we use 1000, 2000 and 3000, at each level of resolution). In this selection process, we select vertices with probability proportional to the average (over the stereo pair) foreshortened area of the patch around them. When we utilize the complete E or E', we sample at the barycenters of the triangles of the mesh instead of the vertices because that allows for easier computation of the gradient of the energy. In this case, both V and the occlusion map are defined over the set of triangles, and k indexes the triangles that compose the model.

4 Experimental Results

We evaluated the procedure of Sect. 3.4 using the original energy (E) and the robust energy (E'), along with modifications of these energies obtained by excluding the texture terms. Throughout this section, we refer to these as stereo + texture, stereo, robust stereo + texture, and robust stereo, respectively. To ensure a valid comparison between the different cases, we used equivalent parameters for the feature match weight (λ) and the model priors (σ_s and σ_t) in each experiment. Only the first 40 shape and texture basis vectors were used, since this was found to provide adequate results.

4.1 Accuracy in the Absence of Occlusions

To evaluate the benefits of incorporating a texture model in the absence of occlusions, testing was performed on a subset of sixty individuals from the K.U. Leuven stereo face database [5], which contains stereo pairs of each individual in eight different positions. We obtained fitting results using the stereo and the stereo + texture methods for all eight poses in each of the sixty people, for a total of 480 model fits. Note that the stereo fitting approach is that proposed in [5].
Figures 3 and 4 exemplify the differences between the fits obtained using stereo (the first two terms of E) and stereo + texture (E). At first glance, the results in Fig. 3 suggest that the shape estimates using both methods are quite similar. The stereo matching cost,

$$\frac{1}{|V|} \sum_{k \mid v_k \in V} \|I_1(P_1 s_k) - I_2(P_2 s_k)\|^2,$$

was computed to be 280.77 for the stereo method and 340.17 for the stereo + texture method, so the shape obtained using only the stereo term is better in terms of the per-vertex stereo intensity match. However, from Fig. 4 it is clear that the eye, eyebrow and mouth alignment between the model and the images is significantly more accurate when the texture model is included.

These results suggest that either approach may be sufficient if the desired output is a depth map or a 3D model for image synthesis. For recognition, however, where one links shape parameters to identity, it is important for features in the fitted model to be aligned with the features in the database models. Our experiments suggest that one way to ensure this alignment is to include a texture model in the fitting procedure.

The same effect can be observed by studying the distribution of the 480 recovered shape models (60 individuals under 8 poses) in the forty-dimensional whitened shape parameter space. Two statistics relate to the quality of the fitting procedure from a recognition standpoint. First, for a single individual, we would like the difference between the fits for different poses to be small. Second, we would like the difference between fits for distinct individuals to be large. These can be measured based on the within-class (within-subject) scatter matrix (S_w) and the between-class scatter matrix (S_b). Roughly speaking, the larger the determinant and trace of S_w^{-1} S_b are, the more accurate a classifier based on these fits will be. Using results from the 480 fits we found the determinants of S_w^{-1} S_b to be 2.9640e-5 and 1.3418e-11, and the traces of S_w^{-1} S_b to be 104.0478 and 69.4101, for the stereo + texture method and the stereo method, respectively. These quantitative results support the qualitative observations in Figs. 3 and 4 and suggest that fits obtained with the inclusion of the texture model are significantly more robust to pose changes.

Fig. 3. Comparison of a fit using both stereo and texture to that obtained using stereo alone. Rows indicate left and right images of the stereo pair. First column: shape estimate using stereo; second column: input images; third column: shape estimate using stereo and texture.
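The scatter statistics above are straightforward to compute from the fitted coefficient vectors. A hedged numpy sketch with random stand-in data follows (the labels mimic 60 subjects in 8 poses each; this is not the paper's data):

```python
import numpy as np

def scatter_statistics(X, labels):
    """Trace and determinant of S_w^{-1} S_b for fitted shape coefficients.

    X: (n_fits, dim) array of whitened shape parameters; labels: subject
    identity per fit. Larger values indicate fits that cluster by identity.
    """
    mean_all = X.mean(axis=0)
    dim = X.shape[1]
    S_w = np.zeros((dim, dim))
    S_b = np.zeros((dim, dim))
    for subject in np.unique(labels):
        Xc = X[labels == subject]
        mu = Xc.mean(axis=0)
        S_w += (Xc - mu).T @ (Xc - mu)   # within-subject scatter
        d = (mu - mean_all)[:, None]
        S_b += len(Xc) * (d @ d.T)       # between-subject scatter
    M = np.linalg.solve(S_w, S_b)        # S_w^{-1} S_b
    return np.trace(M), np.linalg.det(M)

rng = np.random.default_rng(1)
labels = np.repeat(np.arange(60), 8)     # 60 subjects, 8 poses each
class_means = rng.normal(scale=0.3, size=(60, 40))
X = class_means[labels] + rng.normal(size=(480, 40))
print(scatter_statistics(X, labels))
```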
4.2 Accuracy with Occlusions

We also tested the occlusion cases described in Sect. 3.3 by applying the robust fitting process to captured data. For these fitting results, a value of η_o = 250 was used for the robust stereo method, and a value of η_o = 800 was used for the robust stereo + texture method.

Fig. 4. Same comparison as that in Fig. 3, but mapped with estimated textures and rendered semi-transparently over input images. While both the shape obtained using stereo (top) and that obtained using stereo and texture (bottom) provide reasonable depth maps for the input stereo pair (Fig. 3), only the joint use of stereo and texture ensures feature alignment.

Figure 5 shows results obtained using the robust stereo and robust stereo + texture methods in the case of half-occlusion (case HO) by a textureless foreign body. As described in Sect. 3.3, in this case we expect the results for both methods to be similar because the stereo cue is sufficient to detect the occluder. As shown in the figure, this is indeed the case. Notice that the occlusion map captures not only the occluder, but also artifacts that are not predicted by the model, including specular highlights and cast shadows.

Figure 6 shows similar results for the case of a textured occluder that is close to the surface (case FO_n). In this case, the stereo constraint is insufficient for detecting the occluder, and the addition of a texture term provides substantial improvement.

The results from the two occlusion cases are compared to the 'ground truth' shape obtained in the absence of occlusion in Fig. 7. The results obtained by the robust stereo + texture method are relatively consistent over all cases, but the same cannot be said for those obtained using the stereo match alone. Notice that in all cases, the recovered models deviate from the unoccluded model in the unobserved regions of the face. This is to be expected, since there is no shape or texture information available in these regions.

5 Conclusions

We have presented a method for the recovery of face models from stereo pairs of images in the presence of foreign-body occlusions. In this approach, a face model (a 3DMM) is augmented by an occlusion map defined on the model shape, and foreign-body occlusions are detected efficiently using robust estimators.