Human Pose Estimation From Silhouettes: A Consistent Approach Using Distance Level Sets
(Reposted) Human pose estimation with AlphaPose. Paper: RMPE: Regional Multi-Person Pose Estimation.

1. Overview

The paper points out that detection and localization errors are unavoidable, and that these errors propagate into the single-person pose estimator (SPPE), especially in pose estimation algorithms that rely entirely on a human detector.
The paper therefore proposes the Regional Multi-Person Pose Estimation (RMPE) framework.
It mainly comprises a symmetric spatial transformer network (SSTN), parametric pose non-maximum suppression (p-NMS), and a pose-guided proposals generator (PGPG).
These techniques, the symmetric spatial transformer network (SSTN), the deep proposals generator (DPG), and parametric pose non-maximum suppression (p-NMS), are used together to address multi-person pose estimation in unconstrained ("in the wild") scenes.
2. Problems with previous approaches

2.1 Bounding-box localization errors

As illustrated in the original figure, the red box is the ground-truth box and the yellow box is the detected box (IoU > 0.5). Because of the localization error, the heatmap computed from the yellow box fails to detect the joints. Fix: enlarge the boxes used during training by roughly 20-30% (see the sketch below).
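A minimal sketch of the box-enlargement trick (a hypothetical helper, not the AlphaPose code; the 0.3 factor is just the upper end of the range quoted above):

```python
# Grow a detected person box by ~20-30% so that a slightly mis-localized
# detection still contains all of the joints needed by the SPPE.
def enlarge_box(x1, y1, x2, y2, img_w, img_h, scale=0.3):
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    w, h = w * (1.0 + scale), h * (1.0 + scale)
    # Clamp the enlarged box to the image bounds.
    nx1 = max(0.0, cx - w / 2.0)
    ny1 = max(0.0, cy - h / 2.0)
    nx2 = min(float(img_w), cx + w / 2.0)
    ny2 = min(float(img_h), cy + h / 2.0)
    return nx1, ny1, nx2, ny2
```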
2.2 Redundant detection boxes

The same person may be detected by multiple boxes. Fix: use parametric pose NMS (p-NMS) to handle pose estimation when the human detection boxes are inaccurate or duplicated.
3. Network structure

3.1 Overall structure

The overall network is: Symmetric STN = STN + SPPE + SDTN (a minimal warping sketch follows this list).
STN: the spatial transformer network takes inaccurate detection proposals as input and produces accurate, high-quality person regions.
SPPE: the single-person pose estimator produces the estimated pose.
SDTN: the spatial de-transformer network maps the estimated pose back into the original image coordinates.
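A minimal sketch of the STN-style warping step (PyTorch; the affine parameters below are invented for illustration, and this is not the AlphaPose implementation):

```python
import torch
import torch.nn.functional as F

# An affine transform theta crops/warps a detected region into a canonical person
# box for the SPPE; the inverse transform (the SDTN role) maps results back.
def stn_warp(image, theta, out_size=(256, 192)):
    """image: (N, C, H, W); theta: (N, 2, 3) affine matrices."""
    grid = F.affine_grid(theta, [image.size(0), image.size(1), *out_size],
                         align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

image = torch.randn(1, 3, 480, 640)
theta = torch.tensor([[[0.5, 0.0, 0.1],      # illustrative affine parameters
                       [0.0, 0.5, -0.2]]])
cropped = stn_warp(image, theta)             # canonical crop fed to the SPPE
```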
From DeepPose to HRNet: a complete guide to deep-learning human pose estimation
For decades, human pose estimation has attracted wide attention in the computer vision community.
It is a key step toward understanding the behavior of people in images and videos.
With the rise of deep learning in recent years, the field has changed dramatically.
Starting with DeepPose, the seminal work on deep-learning-based 2D human pose estimation, this guide reviews the most important papers in the area from the past few years.
What is human pose estimation?

Human pose estimation (HPE) is defined as the problem of localizing human joints (also called keypoints: elbows, wrists, and so on) in images or videos.
It can also be phrased as searching for a particular pose in the space of all articulated poses.
2D pose estimation estimates the 2D coordinates (x, y) of each joint in an RGB image.
3D pose estimation estimates the 3D coordinates (x, y, z) of the joints from an RGB image.
HPE has some very useful applications and is widely used in action recognition, animation, gaming, and other areas.
For example, HomeCourt, a popular deep-learning app, uses pose estimation to analyze the movements of basketball players.
Why is human pose estimation hard?

Flexible, small, and barely visible joints, occlusion, clothing, and lighting changes all add to the difficulty of human pose estimation.

Different approaches to 2D human pose estimation

Classical methods

The classical approach to articulated pose estimation is the pictorial structures framework.
The basic idea is to represent an object as a collection of "parts" arranged in a deformable (non-rigid) configuration.
CVPR 2020 human pose estimation papers: a roundup of the CVPR 2020 papers that involve human pose estimation.
The papers fall into two groups: 2D (6 papers) and 3D (11 papers).
2D human pose estimation

[1] UniPose: Unified Human Pose Estimation in Single Images and Videos
Authors | Bruno Artacho, Andreas Savakis
Affiliation | Rochester Institute of Technology
Abstract: We propose UniPose, a unified framework for human pose estimation based on our "Waterfall" Atrous Spatial Pooling architecture, which achieves state-of-the-art results on several pose estimation metrics. UniPose unifies contextual segmentation and joint localization to estimate human pose in a single stage, with high accuracy and without relying on statistical post-processing methods. The Waterfall module in UniPose exploits the efficiency of progressive filtering in a cascade architecture while obtaining multi-scale fields of view comparable to spatial pyramid configurations. In addition, the method is extended to an LSTM variant (UniPose-LSTM) for multi-frame processing and obtains state-of-the-art results for temporal pose estimation in video. Results on multiple datasets demonstrate that UniPose, with a ResNet backbone and the Waterfall module, is a robust and efficient pose estimation architecture, achieving state-of-the-art results for single-person pose detection.
In short: a single-person pose estimation method that needs no post-processing and extends to video.

[2] The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation
Authors | Junjie Huang, Zheng Zhu, Feng Guo, Guan Huang
Affiliations | XForwardAI Technology Co., Ltd; Tsinghua University
GitHub: https://github.com/HuangJunJie2017/UDP-Pose
Abstract: Recently, top-down methods have dominated human pose estimation.
To the best of our knowledge, data processing, although a fundamental component of both training and inference, has not been systematically considered in the pose estimation community.
English-language literature on human pose estimation models

Unfortunately, I don't have access to the specific article database that you're referencing. However, I can provide you with a general outline and structure for an English literature review on human pose estimation models, which you can use as a starting point for your research. Please note that this is a general template, and you will need to conduct your own research and analysis to fill in the specific details and references.

Title: A Review of Human Pose Estimation Models

Abstract: This article presents a comprehensive review of human pose estimation models, focusing on the recent advancements and challenges in this field. It discusses various techniques, including deep-learning-based methods and traditional computer vision approaches, and their applications in real-world scenarios. The article also highlights the importance of pose estimation in areas such as human-computer interaction, sports analysis, and healthcare.

Introduction: Human pose estimation is a crucial task in computer vision that aims to detect and localize the key body joints of a person in an image or video. It has applications in various domains, including action recognition, sports analysis, virtual reality, and healthcare. In recent years, significant progress has been made in this field, especially with the advent of deep learning techniques. This article aims to provide a comprehensive review of human pose estimation models, focusing on their principles, recent advancements, and potential challenges.

Section 1: Principles of Human Pose Estimation. This section introduces the fundamental concepts and principles of human pose estimation. It explains the importance of keypoint detection and the challenges involved in accurately estimating pose in different scenarios.

Section 2: Traditional Computer Vision Approaches. This section reviews traditional computer vision techniques used for human pose estimation. It discusses methods such as feature extraction, shape models, and optimization algorithms. It also highlights the limitations of these approaches and their inability to handle complex poses and backgrounds.

Section 3: Deep Learning-Based Methods. This section presents a detailed overview of deep-learning-based human pose estimation models. It covers convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformer-based architectures. It also discusses the advantages of these methods, such as their ability to learn complex representations and handle diverse poses and backgrounds.

Section 4: Applications of Human Pose Estimation. This section explores the various applications of human pose estimation in real-world scenarios. It covers areas such as human-computer interaction, sports analysis, virtual reality, and healthcare, and discusses the potential impact of pose estimation in these domains and the challenges associated with its implementation.

Section 5: Challenges and Future Directions. This section highlights the current challenges and future directions in human pose estimation research. It identifies areas such as robustness to occlusions, pose estimation in crowded scenes, and real-time performance as key areas for further exploration, and discusses potential advancements in deep learning techniques and the integration of pose estimation with other computer vision tasks.

Conclusion: Human pose estimation has emerged as a crucial task in computer vision, with significant progress made in recent years. Deep-learning-based methods have demonstrated remarkable performance in estimating poses in diverse scenarios. However, there are still challenges to be addressed, such as robustness to occlusions and real-time performance. Future research in this field is expected to bring further advancements in pose estimation techniques and their applications in various domains.

This outline provides a general structure for a literature review on human pose estimation models. You can expand each section by adding more details and by discussing specific models, techniques, and applications. Remember to include relevant references and citations to support your arguments and analysis.
Computer Simulation, Vol. 38, No. 3, March 2021. Article ID: 1006-9348(2021)03-0292-06

Human Pose Estimation Based on the SMPL Model
LI Jian, YANG Biao-biao, ZHANG Hao-ruo (College of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology, Xi'an, Shaanxi 710021, China)

Abstract: Current pose estimation algorithms for deformable human body models are prone to errors and missing information. To address this, a method is proposed that optimizes the model using three-dimensional information of the human body acquired with a depth camera. The 3D skeleton information captured by the Kinect depth camera is registered against the SMPL model to correct the original model pose, yielding a model close to the true pose of the human body. Experimental results show that, after fusing the 3D body information, the accuracy of the model improves to a certain extent.

Keywords: deformable model; pose estimation; depth camera; 3D skeleton

1 Introduction
3D human body modeling has long been a challenging and important topic in 3D reconstruction research. Its essence is to store and represent the real-world human body in the computer in a digital 3D form.
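As a rough illustration of the registration step described in the abstract above, the sketch below rigidly aligns a set of Kinect 3D joints to corresponding SMPL joint locations with a Kabsch/Procrustes fit. The paper's actual registration procedure may differ, and the joint correspondences are assumed to be given:

```python
import numpy as np

# Rigidly align a Kinect skeleton (N, 3) to corresponding SMPL joints (N, 3).
def rigid_align(kinect_joints, smpl_joints):
    mu_k = kinect_joints.mean(axis=0)
    mu_s = smpl_joints.mean(axis=0)
    A = kinect_joints - mu_k
    B = smpl_joints - mu_s
    U, _, Vt = np.linalg.svd(A.T @ B)          # covariance between the two point sets
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T    # rotation mapping Kinect -> SMPL frame
    t = mu_s - R @ mu_k
    return R, t

# Usage: aligned = (R @ kinect_joints.T).T + t
```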
Estimating body pose with dependent joint regressors

In this work, we estimate the 2D pose of a human body from a static image.
Recent, highly successful approaches to this task rely on discriminatively trained collections of deformable parts organized in a tree model.
Within such a pictorial-structure framework, we address the problem of obtaining good part templates by introducing novel, non-linear joint regressors.
In particular, we use two-layer random forests as joint regressors.
The first layer acts as an independent body-part classifier.
The second layer takes the estimated distributions of the first layer into account and thereby models dependencies and co-occurrences between parts.
This yields a pose estimation framework that takes dependencies between body parts into account for joint localization and can therefore circumvent the ambiguities of tree structures, such as between legs and arms.
In experiments, we show that our body-part-dependent joint regressors achieve higher joint localization accuracy than tree-based state-of-the-art methods.
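A hedged sketch of the two-layer idea using scikit-learn forests; this only illustrates the structure described above (a second forest consuming the first forest's part distributions), not the authors' implementation:

```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# First forest: classify body parts from local appearance features.
# Second forest: also sees the first layer's class probabilities, so it can
# model part co-occurrence and dependencies.
def fit_two_layer_forest(X_appearance, y_parts):
    layer1 = RandomForestClassifier(n_estimators=50).fit(X_appearance, y_parts)
    P1 = layer1.predict_proba(X_appearance)           # per-sample part distributions
    X2 = np.hstack([X_appearance, P1])                # appearance + context features
    layer2 = RandomForestClassifier(n_estimators=50).fit(X2, y_parts)
    return layer1, layer2

def predict_two_layer(layer1, layer2, X_appearance):
    P1 = layer1.predict_proba(X_appearance)
    return layer2.predict(np.hstack([X_appearance, P1]))
```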
Background. Estimating a person's pose from a static image is an active research area because of its wide range of application domains.
One of the most popular approaches in this area is the pictorial structures framework, which models the spatial relations of the individual parts using a tree model.
Pictorial structures have been improved in many respects, for example in how poses or the appearance of body parts are modeled.
In object detection, one of the best-performing approaches relies on so-called deformable part models, which use mixtures of part templates in a star model.
It has recently been shown that mixtures of part templates can also be used effectively within tree models, yielding very powerful pose estimation models.
In particular, instead of modeling the rotation of a part template as in classical pictorial-structure models, the rotations of the limbs are encoded by different deformable templates for the body parts.
Although this approach outperforms classical pictorial-structure models for human pose estimation, the templates, scanning-window templates trained as linear SVMs on HOG features, are very sensitive to noise, which limits performance.
In this work, we therefore address the problem of obtaining better part templates within a pictorial-structure context.
Likewise, we do not explicitly model the rotations of the limbs,
but handle pose variations of the limbs implicitly through the learned templates.
Instead of noise-sensitive scanning-window templates, we propose non-linear regressors for the joint positions.
We rely on random forests, which have been shown to deliver fast, robust, and accurate predictions of joint positions or body parts from depth data.
While previous work trains templates for all body parts independently and uses the pictorial-structure framework to model the spatial and orientational relations of the part templates,
we propose a more discriminative template representation that takes co-occurrences of and relations to other parts into account, as illustrated in Fig. 1.
Transducer and Microsystem Technologies, Vol. 40, No. 2, 2021, p. 23. DOI: 10.13873/J.1000-9787(2021)02-0023-03

Fusion of 3D spatiotemporal features for gait recognition based on deep learning
ZHAO Liming, ZHANG Rong, ZHANG Chaoyue (College of Information Science and Engineering, Ningbo University, Ningbo 315211, China)

Abstract: Existing gait recognition methods based on silhouettes are easily disturbed by clothing and other external conditions. Methods based on 3D models resist external interference to a certain extent, but place additional requirements on the camera equipment and involve complicated model computation. To address these problems, a "light" model of pedestrian motion is built using 3D pose estimation; a neural network framework extracts the spatiotemporal information of the pedestrian's motion in 3D space, and this information is fused with the silhouette information to further enrich the gait features. Experimental results on the CASIA-B dataset show that fusing the 3D spatiotemporal motion information makes the gait features more robust and further improves the recognition rate.

Keywords: deep learning; gait recognition; 3D pose; spatiotemporal feature fusion

0 Introduction
Gait recognition uses gait information to identify a person [1].
A survey of human pose capture methods

Human pose capture (human pose estimation) is the process of extracting the pose of the human body from images or video.
It plays an important role in many applications, such as human-computer interaction, multimedia retrieval, and human action analysis.
With the development of computer vision and deep learning, pose capture methods keep evolving and improving.
This article surveys human pose capture methods and systematically introduces the main families of approaches.

Traditional pose capture methods fall into model-based methods and feature-based methods.
Model-based methods attempt to solve the capture problem by building a human pose model and using optimization algorithms to fit the correspondence between the model and the input image.
Feature-based methods instead extract features directly from the input image and estimate the pose with classification or regression algorithms.

Model-based methods include predefined models and flexible models.
Predefined models are pose models defined in advance, such as joint models and skeleton models.
These models are generally built from anatomical knowledge, and optimization algorithms fit the correspondence between the model and the image.
Flexible models are learned automatically from the input images, for example image representation models and probabilistic graphical models.
Such models adapt to different inputs and improve the accuracy and robustness of pose estimation.

Feature-based methods include hand-crafted features and deep-learning features.
Hand-crafted features reduce the complex pose estimation problem to a feature classification or regression problem by extracting and dimensionality-reducing features from the input image.
Commonly used hand-crafted features include HOG (Histogram of Oriented Gradients) and SIFT (Scale-Invariant Feature Transform).
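For instance, a HOG descriptor can be computed with scikit-image; the parameter values below are illustrative defaults rather than settings from any particular pose estimation paper:

```python
from skimage import data, color
from skimage.feature import hog

# Extract a HOG descriptor from a grayscale image.
image = color.rgb2gray(data.astronaut())
descriptor = hog(image, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), feature_vector=True)
print(descriptor.shape)   # a flat gradient-orientation histogram feature vector
```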
Deep-learning features are learned automatically by deep neural networks, and the pose is then estimated with classification or regression algorithms.
Deep-learning features have achieved remarkable results in pose capture, for example with convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Besides purely model-based and feature-based methods, some approaches combine the two, such as hybrid methods and end-to-end methods.
Hybrid methods fuse the traditional model-based and feature-based methods, combining model building with feature extraction to solve the capture problem.
End-to-end methods start directly from the raw input image and learn both the image features and the pose estimation model with a single deep neural network, yielding an integrated pose capture pipeline.
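To make the end-to-end idea concrete, here is a minimal, hedged sketch (PyTorch; the layer sizes are illustrative only and do not correspond to any published architecture) of a fully convolutional network that regresses one heatmap per joint directly from an image:

```python
import torch
import torch.nn as nn

# A tiny fully-convolutional network mapping an RGB image to per-joint heatmaps.
# Real systems use much deeper backbones (ResNet, HRNet); this only shows the
# input/output structure of heatmap-based end-to-end pose estimation.
class TinyHeatmapNet(nn.Module):
    def __init__(self, num_joints=17):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_joints, 1)   # one heatmap channel per joint

    def forward(self, x):
        return self.head(self.backbone(x))

heatmaps = TinyHeatmapNet()(torch.randn(1, 3, 256, 256))  # -> (1, 17, 64, 64)
```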
HUMAN POSE ESTIMATION FROM SILHOUETTES: A CONSISTENT APPROACH USING DISTANCE LEVEL SETS

C. Sminchisescu and A. Telea

INRIA Rhone-Alpes, 655 avenue de l'Europe, 38330 Montbonnot, France, Cristian.Sminchisescu@inria.fr
Eindhoven University of Technology, Department of Mathematics and Computer Science, Den Dolech 2, 5600 MB Eindhoven, The Netherlands, alext@win.tue.nl

ABSTRACT

We present a novel similarity measure (likelihood) for estimating three-dimensional human pose from image silhouettes in model-based vision applications. One of the challenges in such approaches is the construction of a model-to-image likelihood that truly reflects the good configurations of the problem. This is hard, commonly due to violations of the consistency principle, which introduce spurious, unrelated peaks/minima that make the search for model localization difficult. We introduce an entirely continuous formulation which enforces model estimation consistency by means of an attraction/explanation silhouette-based term pair. We subsequently show how the proposed method provides significant consolidation and an improved attraction zone around the desired likelihood configurations, and eliminates some of the spurious ones. Finally, we present a skeleton-based smoothing method for the image silhouettes that stabilizes and accelerates the search process.

Keywords: human tracking, model-based estimation, constrained optimization, level set methods, fast marching methods

1 INTRODUCTION AND PREVIOUS WORK

Human pose estimation from images is an active area of computer vision research with many potential applications, ranging from computer interfaces to motion capture for character animation, biometrics, and intelligent surveillance. One promising approach, called model based [Smin01b, Deut00, Heap98, Smin01a, Gavr96, Breg98, Kakad96, Rehg95], relies on a 3D articulated volumetric model of the human body to constrain the localization process in one or several images. The goal in human pose estimation applications is to estimate the model's articulation and possibly structural parameters such that the projection of the 3D geometrical model closely fits a human in one or several images. Typically, model localization is an expensive multi-dimensional search in the model parameter space for good cost configurations, defined in terms of maxima of a likelihood or minima of an energy function. Such costs are defined in terms of the association of model predictions with extracted image features. The search process produces a parameter configuration which brings the 3D model close to the tracked 2D image in the metric of the predefined likelihood model. The above problem is hard, since likelihood cost surfaces are typically multi-peaked, due to factors like multiple scene objects, ambiguous feature assignments, occlusions, and depth uncertainties.

Search strategies for locating good peaks in the model parameter space, based on local and global search methods, possibly in temporal sequences, have received significant attention [Smin01b, Deut00, Heap98, Smin01a, Gavr96, Breg98] and are not addressed here. However, the difficulty and intrinsically ill-posed nature of such search problems raise two complementary questions about the design of the cost surface whose minima are to be found:

- what are good image features which will readily qualify for likelihood terms for sampling and continuous evaluation?
- how to define such terms to limit the number of spurious minima in parameter space and render the search more efficient and effective?

Likelihood models defined in terms of edges [Deut00,
Smin01a, Kakad96], silhouettes [Deut00, Smin01a, Heap98] or intensities [Smin01a, Side00, Rehg95] are the most common. While image intensities seem to be good cues for various types of optical-flow based local search, they are not invariant to lighting changes and typically rely on low inter-frame intensity variation and motion. It is consequently difficult to sample configurations outside the region where such a photometric model is valid. Edges and/or silhouettes have therefore been used more in approaches that employ, at least partially, some form of parameter-space sampling [Deut00, Smin01a, Heap98].

Deutscher [Deut00] uses a silhouette-based term for his cost function design in a multi-camera setting. However, this term peaks if the model is inside the silhouette, without demanding that the silhouette area be fully explained (see Sec. 4.1). Consequently, an entire family of undesired configurations situated inside the silhouette will generate good costs under this likelihood model. Moreover, the term is purely discrete and not suitable for continuous estimation. The situation is alleviated by the use of additional cues and sensor fusion from multiple cameras, with good results. Delamarre [Dela99] uses silhouette contours in a multi-camera setting and computes assignments using a form of ICP (Iterative Closest Point) algorithm together with knowledge of normal contour directions. The method is local and does not necessarily enforce globally consistent assignments, but again relies on fusing information from many cameras to ensure consistency. Brand [Bran99] and Rosales [Rosa00] use silhouettes to infer temporal and static human poses. However, their motivation is slightly different, in that they use silhouettes as inputs to a system which directly learns 2D-to-3D mappings.

Summarizing, many likelihood terms used in model-based vision applications have the undesirable property that they peak not only around the desired model configurations, which correspond to subject localization in the image, but also in totally unrelated, false configurations. This poses huge burdens on any search algorithm, as the number of spurious minima can grow unbounded, and discriminating them from "good peaks" can then only be done via temporal processing. Consequently, any finite samples/hypotheses estimator has a great chance of missing significant, true minima.

In practice, extracting pose from silhouettes using single images remains an under-constrained problem with potentially multiple solutions. A more global search method, multiple cameras, temporal disambiguation, and/or additional features thus have to be used in conjunction with the local method we propose in this work, to robustify the search for good cost configurations [Smin01b, Deut00, Heap98, Smin01a, Gavr96]. In this paper, we assume a reasonable initialisation and restrict our attention to the design of likelihoods with larger basins of attraction and globally consistent responses around the desirable cost minima. We achieve this by means of an entirely continuous formulation and a new likelihood term for silhouettes in model-based applications. The proposed term allows a globally consistent response for the subject localization in the image by means of a pair of attraction/explanation components that a) push the geometric model inside the subject's silhouette and b) demand that the area associated with the silhouette is entirely explained by the model. We subsequently show how this proposal significantly improves the pose estimation results compared to previously used similarity measures.

In Section 2, we describe
the human body model we employ. Section 3 outlines the search process for optimal configurations. Section 4 introduces our new likelihood terms and details their two components. Section 5 presents a new technique for smoothing the image-acquired silhouettes that stabilizes and accelerates the search process. Finally, Section 6 concludes the paper and proposes directions for future work.

2 HUMAN MODEL

2.1 MODEL DESCRIPTION

Our human body model (Fig. 1) consists of kinematic 'skeletons' of articulated joints controlled by angular joint parameters, covered by 'flesh' built from superquadric ellipsoids with additional tapering and bending parameters [Barr84]. A typical model has around 30 joint parameters, plus 8 internal proportion parameters encoding the positions of the hip, clavicle and skull tip joints, plus 9 deformable shape parameters for each body part, gathered into a vector. The state of a complete model is thus given as a single parameter vector. We note, however, that only the joint parameters are typically estimated during object localization and tracking, the other parameters remaining fixed. Although this model is far from photo-realistic, it suffices for a high-level interpretation and realistic occlusion prediction. Moreover, it offers a good trade-off between computational complexity and coverage in typical motion tracking applications.

Figure 1: Human model: flat shaded (a, b) and discretization (c, d)

2.2 MODEL TO IMAGE FITTING

The model is used in the human pose estimation application as follows (see also Fig. 2 for an overview of the application pipeline).

Figure 2: Human pose estimation application pipeline

The pipeline starts by extracting a human silhouette (see the example in Fig. 3b) from the camera-acquired images (Fig. 3a) by subtracting the scene background and thresholding the result to a bilevel image. To stabilize the subsequent parameter estimation step, a special smoothing is applied to the extracted image; this smoothing is described separately in Sec. 5. The model's superquadric surfaces are discretized as meshes parameterized by angular coordinates in a 2D topological domain. Mesh nodes are transformed into 3D points and then into predicted image points using composite nonlinear transformations: a sequence of parametric deformations that construct the corresponding part in its own reference frame, a chain of rigid transformations that map it through the kinematic chain to its 3D position, and finally the perspective projection.
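The prediction chain can be pictured with the following hedged sketch; the deformation is reduced to a placeholder scaling, and the transform values are invented for illustration (the real model applies superquadric tapering/bending [Barr84]):

```python
import numpy as np

# A surface point in a body part's local frame is deformed, mapped through the
# kinematic chain, and perspectively projected into the image.
def predict_image_point(u_local, part_scale, chain_transforms, focal):
    p = part_scale * u_local                         # (placeholder) parametric deformation
    p_h = np.append(p, 1.0)                          # homogeneous 3D point
    for T in chain_transforms:                       # rigid 4x4 transforms along the chain
        p_h = T @ p_h
    X, Y, Z = p_h[:3]
    return np.array([focal * X / Z, focal * Y / Z])  # perspective projection

T_root = np.eye(4); T_root[2, 3] = 3.0               # toy transform: 3 m in front of camera
print(predict_image_point(np.array([0.1, 0.2, 0.0]), 1.0, [T_root], focal=500.0))
```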
During parameter estimation (see Sec. 3), prediction-to-image matching cost metrics are evaluated for the predicted image features, and the results are summed to produce the image contribution to the overall parameter-space cost function. For certain likelihood terms, like edge-based ones, predictions are associated with nearby image features; the cost is then a function of the prediction errors. For other likelihood terms (like the silhouette attraction term we employ here), a potential surface is built for the current image, and the prediction is only evaluated at a certain location on this surface.

3 PARAMETER ESTIMATION

We aim towards a probabilistic interpretation and optimal estimates of the model parameters by maximizing the total probability according to Bayes' rule:

(1)

where the two likelihood factors are the new silhouette terms we propose, defining similarity criteria between the model projection and the image silhouette (detailed in the next section), and the remaining factor is a prior on model parameters. The prior encodes static knowledge about humans, such as anatomical joint angle limits or non-penetration constraints on the body parts (see [Smin01b, Smin01a] for details).

In a maximum a-posteriori (MAP) approach, we spatially discretize the continuous formulation in Eqn. 1 and attempt to minimize the negative log-likelihood, or 'energy', of the total posterior probability. The energy is expressed as a cost function summing the negative log-likelihoods of the observation terms and the negative log of the model prior. In the following, we shall concentrate on the behavior and properties of the negative log-likelihood.

Various search methods attempt to identify the minima of this function, by local continuous descent, stochastic search, parameter-space subdivision, or combinations of these [Smin01b, Deut00, Heap98, Smin01a, Gavr96, Breg98]. All these methods require evaluating the cost function. Continuous methods require supplementary evaluations of its first-order gradient and sometimes of its second-order Hessian.

In this paper, we use a second-order local continuous method, where a descent direction is chosen by solving a regularized, constrained subproblem [Flet87] involving:
- a symmetric positive-definite stabilization matrix (often set to the identity);
- a dynamically chosen weighting factor λ;
- a matrix of joint-angle-limit constraints acting as effective priors, defining an admissible subspace in which to search for model parameters (see [Smin01b, Smin01a] for details).

The parameter λ controls the descent type: a large λ leads to a gradient descent step, while λ → 0 leads to a Newton-Raphson step. The optimization routine automatically decides on the type and size of the optimal step within the admissible trust radius (see [Flet87, Trig00] for details).
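The descent step can be sketched as a generic damped (Levenberg-Marquardt style) Newton update with clipped joint limits; this is an illustration of the regularization idea described above, not the authors' exact solver:

```python
import numpy as np

# One damped second-order step: lam blends between gradient descent (large lam)
# and a Newton-Raphson step (lam -> 0); joint-angle limits are enforced crudely
# by clipping, standing in for the constrained admissible subspace.
def damped_step(x, grad, hess, lam, lower, upper):
    W = np.eye(len(x))                         # stabilization matrix (identity here)
    delta = np.linalg.solve(hess + lam * W, -grad)
    return np.clip(x + delta, lower, upper)
```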
4 OBSERVATION LIKELIHOOD

Whether continuous or discrete, the search process depends critically on the observation likelihood component of the parameter-space cost function. Besides smoothness properties, necessary for the stability of local continuous descent search, the likelihood should be designed to limit the number of spurious local minima in parameter space. We propose a new likelihood term based on two components:
- the first component maximizes the model-image silhouette area overlap;
- the second component pushes the model inside the image silhouette.

This pair of cost terms produces a global and consistent response. In other words, the term enforces the model to remain within the image silhouette, but also demands that the image silhouette be entirely explained, i.e. that all silhouette parts contribute to the cost function that drives the fitting process. In the following, we detail the two cost components.

4.1 SILHOUETTE-MODEL AREA OVERLAP TERM

This term maximizes the model-image area overlap. The area of the predicted model can be computed from the model's projected triangulation by summing over all visible triangles (triangles having all their vertices visible):

A_model = (1/2) Σ_{visible triangles} Σ_{i=1..3} ( x_i y_{(i mod 3)+1} - x_{(i mod 3)+1} y_i )    (2)

where mod denotes the modulo operation, and the computation assumes the triangle vertices (x_i, y_i) are sorted in counter-clockwise order to preserve a positive area sign. In subsequent derivations we drop the modulo notation for simplicity. Let A_silhouette be the area of the target silhouette. The area alignment cost, i.e. the difference between the model and image silhouette areas, is:

(7)

4.2 SILHOUETTE ATTRACTION TERM

The attraction term is defined by:

(9)

where d is the distance from a predicted model point to the given silhouette. We estimate d by computing the distance transform D of the silhouette S and evaluating it at the predicted points:

(10)

We use a level-set based approach to quickly and robustly estimate D, as follows. We initialize D to zero on S, i.e. we regard S as the zero level set of the function D. Next, we compute D by solving the Eikonal equation [Seth99]:

|∇D| = 1    (11)

for all points outside S. The solution of equation 11 has the property that its isolines, or level sets, are at equal distance from each other in the 2D space (Fig. 3). Consequently, D is a good approximation of the distance transform.

Figure 3: Distance transform computation: original image (a), silhouette (b), distance plot (c) and distance level sets (d)

Equation 11 is efficiently solved by using the fast marching method (FMM), introduced by Sethian in [Seth96]. We briefly outline the FMM here; a detailed description, up to the implementation details we have ourselves used, is given in [Seth96, Seth99]. First, D is initialized to zero in all points on the silhouette S. Next, the solution is built outwards, starting from the smallest known value. This is done by evolving a so-called narrow band of pixels, initially identical to S, in the normal direction to S, with unit constant speed. As the narrow band evolves, it takes the shape of the consecutive, equidistant level sets, or isolines, of the function D (Fig. 3d).

Using the FMM to compute the distance D has several advantages. First, the function obtained is continuous over the 2D plane, which is important as we need to evaluate its first and second order derivatives, as explained below. Secondly, the FMM performs robustly even for noisy silhouettes. This is essential for practical applications, as the silhouettes extracted from real images have many disconnected, spurious pixels (Fig. 3b is a typical example). Thirdly, the FMM is very efficient, needing O(N log B) operations, where N is the number of image pixels and B is the average number of pixels in the narrow band, of the same order as the number of pixels on the silhouette's contour. D is computed in real time for our images on an SGI O2 R5000 machine. Finally, implementing the FMM is straightforward, as described in [Seth96]. Overall, we believe that using the FMM to compute D is a more efficient and effective method than, e.g., the chamfer-based methods widely used in vision and imaging applications.
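For illustration, the distance transform D of a binary silhouette can be obtained with off-the-shelf tools; the sketch below uses SciPy's exact Euclidean distance transform as a readily available stand-in for the fast marching solution of Eqn. (11) (scikit-fmm's skfmm.distance would solve the Eikonal equation proper):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_distance(silhouette):
    """silhouette: boolean 2D array, True on silhouette pixels. Returns D >= 0."""
    # distance_transform_edt measures the distance to the nearest zero-valued pixel,
    # so invert the mask: background pixels get their distance to the silhouette.
    return distance_transform_edt(~silhouette)

sil = np.zeros((64, 64), dtype=bool)
sil[20:40, 30:34] = True                  # toy "silhouette"
D = silhouette_distance(sil)
print(D[0, 0], D[25, 31])                 # far pixel has a large distance, on-silhouette is 0
```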
The gradient and Hessian of the corresponding silhouette attraction term are computed from the model-image Jacobian, as follows:

(12) (13)

Figure 4 shows the effect of the silhouette attraction and area overlap terms for two images taken from a longer tracking sequence. The figure shows the initial images (a, e), the initial model configuration (b, f), and the fitting results obtained when using only the silhouette attraction term (c, g), and finally both the silhouette attraction and the area overlap terms (d, h). One can notice that the silhouette attraction term alone does not suffice for a good fit: indeed, any parameter configuration which places the model inside the image silhouette can potentially be chosen. Adding the area overlap term stabilizes the estimation and drives it towards relatively satisfactory results. Moreover, the cost term has the desired property of a wide attraction zone. This makes it a good candidate in tracking applications, where recovery from tracking failures is highly desirable.

Figure 4: Model estimation based on various silhouette terms: original images (a, e), initial models (b, f), silhouette attraction term only (c, g), silhouette attraction and area overlap terms (d, h)

5 SILHOUETTE SMOOTHING

The gradient and Hessian introduced in the previous sections are at the core of the optimization process that fits the model to the observed image features. The stability of the optimization is influenced by the behavior of these terms: if the silhouette data are noisy, then the cost terms and their derivatives are not smooth functions. In such cases, the optimization process might fail, take too long to converge, or fit the model erroneously to the image silhouette.

We alleviate this problem by smoothing the silhouettes acquired from the image data. The smoothing aims to produce silhouettes that can be more easily approximated by our human body models than the original 'raw' silhouettes. The process runs as follows (see Fig. 6 for an overview).

Figure 5: Examples of raw silhouettes, skeletons, and smoothed silhouettes

First, the raw silhouettes are extracted from the image data, as explained earlier. Due to the limitations of the extraction process, these silhouettes may have a jagged boundary, contain spurious pixels, or miss pixels on the real silhouette, as in Fig. 6b.

In the second step, we compute the skeleton of the silhouette, as follows. We apply the FMM algorithm inwards on the raw silhouette and compute the distance map of all points inside the silhouette to its boundary (Fig. 6b). The silhouette skeleton is then computed as those points of the evolving narrow band that meet other similar points during the band's evolution under normal speed. In other words, the skeleton points are the points where the narrow band collapses onto itself during its evolution driven by the FMM algorithm. We identify these points using a technique similar to the ones described in [Sidd99, Ogni95b].

In the third step, the obtained skeleton (Fig. 6c) is pruned of its small, less significant branches by retaining only the points that originate from points on the initial narrow band situated at a distance larger than a given threshold [Ogni95a, Ogni95b, Sidd99]. This pruning scheme is based on two observations: a) every skeleton point is generated by the collapse of a compact segment of the original boundary [Sidd99, Kimm95], and b) the importance of a skeleton point can be measured by the length of the boundary segment from which it originates [Ogni95a, Ogni95b].

In the last step, we 'inflate' the pruned skeleton to obtain the smoothed silhouette. To do this, we execute the FMM algorithm again, outwards from the skeleton, as follows. We initialize the narrow band to the skeleton points, and the function D at those points to minus the distance from the skeleton to the silhouette computed in the previous step. We stop the FMM execution when the points of the outwards-evolving narrow band reach the value zero. At that moment, the inflated skeleton matches the initial silhouette (Fig. 6d). However, due to the pruning, most of the noise of the initial raw silhouette has been removed, as seen in the examples in Fig. 5.

Figure 6: Skeleton-based silhouette smoothing pipeline
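A hedged sketch of the skeletonize-prune-inflate pipeline, using scikit-image's medial axis in place of the FMM skeleton; note that the pruning rule below (a plain radius threshold) only stands in for the boundary-segment-length criterion of [Ogni95a, Ogni95b]:

```python
import numpy as np
from skimage.morphology import medial_axis

# Compute a skeleton plus each skeleton pixel's distance to the boundary, prune
# weak skeleton pixels, then "inflate" the remaining pixels back by their stored
# radius (a union of disks), which reconstructs a smoothed silhouette.
def smooth_silhouette(silhouette, prune_radius=2.0):
    skel, dist = medial_axis(silhouette, return_distance=True)
    ys, xs = np.nonzero(skel & (dist > prune_radius))      # pruned skeleton pixels
    radii = dist[ys, xs]
    yy, xx = np.mgrid[0:silhouette.shape[0], 0:silhouette.shape[1]]
    smoothed = np.zeros_like(silhouette, dtype=bool)
    for y, x, r in zip(ys, xs, radii):                      # union of disks = inflation
        smoothed |= (yy - y) ** 2 + (xx - x) ** 2 <= r ** 2
    return smoothed
```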
Since the FMM algorithm performs in real time, as noted in Sec. 4.2, the whole skeleton-based smoothing process takes less than a second for our typical images. By adjusting the skeleton pruning threshold, we obtain different smoothing levels. Smoother silhouettes, produced by a higher threshold, lead in practice to more stable and sensibly faster convergence of the model parameter estimation. Moreover, pruned skeletons typically lead, due to the properties of the Eikonal equation used in the reconstruction, to silhouettes with rounded edges. These shapes are more easily approximated by the superquadric shapes used in our human body model than the raw, arbitrarily shaped silhouettes. However, if the skeletons are pruned too much, the smoothed silhouettes might miss important image cues, such as the orientation of a limb. Conversely, less smoothed silhouettes are closer to the observed data, and thus more accurate, but, as mentioned, may lead to numerically unstable derivative estimations. Currently we estimate, by trial and error, a good value for the pruning threshold for a given application configuration (camera parameters, lighting, raw silhouette extraction parameters, optimization method parameters, etc.). This value works well for the various images we have tried it on. However, a better strategy, which we plan to investigate, is to use an adaptively optimal threshold for each image.

6 CONCLUSIONS

We have presented a method to build more consistent likelihood terms for silhouettes, and applied it to human pose estimation in a model-based context. Aiming to build cost surfaces whose minima accurately reflect the good configurations of the problem, we define a novel likelihood model composed of an attraction term and an area overlap term, which ensures consistent model localization in the image with improved attraction zones. Secondly, we propose a smoothing method for the silhouettes extracted from the image data that stabilizes the optimization process used for pose estimation. Both the likelihood attraction term and the silhouette smoothing method are based on distance functions computed with level-set techniques for evolving boundaries at constant speed in the normal direction. In particular, the fast marching method allows us to calculate distance transforms and skeletons, and to reconstruct silhouettes from their skeletons, in a simple-to-implement and efficient way.

Our future work aims at employing silhouette skeletons, extracted with level set methods, directly as likelihood terms for human pose estimation applications. Together with this, we aim to develop an automatic procedure for setting the pruning threshold of the skeleton-based smoothing we apply to the image-extracted silhouettes.

REFERENCES

[Barr84] Barr, A.: Global and local deformations of solid primitives, Computer Graphics, no. 18, pp. 21–30, 1984.
[Bran99] Brand, M.: Shadow Puppetry, Proc. ICCV, pp. 1237–1244, 1999.
[Breg98] Bregler, C. and Malik, J.: Tracking People with Twists and Exponential Maps, Proc. CVPR, 1998.
[Dela99] Delamarre, Q. and Faugeras, O.: 3D Articulated Models and Multi-View Tracking with Silhouettes, Proc. ICCV, pp. 716–721, 1999.
[Deut00] Deutscher, J. and Blake, A. and Reid, I.: Articulated Body Motion Capture by Annealed Particle Filtering, Proc. CVPR, vol. 2, pp. 126–133, 2000.
[Flet87] Fletcher, R.: Practical Methods of Optimization, John Wiley & Sons, 1987.
[Gavr96] Gavrila, D. and Davis, L.: 3-D Model Based Tracking of Humans in Action: a Multi-view Approach, Proc. CVPR, pp. 73–80, 1996.
[HAWG] H-Anim - Humanoid Animation Working Group: Specifications for a standard humanoid, available at http://www.h-/Specifications/H-Anim1.1/
[Heap98] Heap, T. and Hogg, D.: Wormholes in Shape Space: Tracking through Discontinuous Changes in Shape, Proc. ICCV, pp. 334–349, 1998.
[Howe99] Howe, N. and Leventon, M. and Freeman, W.: Bayesian Reconstruction of 3D Human Motion from Single-Camera Video, Proc. ANIPS, 1999.
[Kakad96] Kakadiaris, I. and Metaxas, D.: Model-Based Estimation of 3D Human Motion with Occlusion Prediction Based on Active Multi-Viewpoint Selection, Proc. CVPR, pp. 81–87, 1996.
[Kimm95] Kimmel, R. and Shaked, D. and Kiryati, N. and Bruckstein, A. M.: Skeletonization via Distance Maps and Level Sets, Computer Vision and Image Understanding, vol. 62, no. 3, pp. 382–391, 1995.
[MacC00] MacCormick, J. and Isard, M.: Partitioned sampling, articulated objects, and interface-quality hand tracker, Proc. ECCV, vol. 2, pp. 3–19, 2000.
[Ogni95a] Ogniewicz, R. L.: Automatic Medial Axis Pruning by Mapping Characteristics of Boundaries Evolving under the Euclidean Geometric Heat Flow onto Voronoi Skeletons, Harvard Robotics Laboratory, Technical Report 95-4, 1995.
[Ogni95b] Ogniewicz, R. L. and Kubler, O.: Hierarchic Voronoi Skeletons, Pattern Recognition, no. 28, pp. 343–359, 1995.
[Rehg95] Rehg, J. and Kanade, T.: Model-Based Tracking of Self-Occluding Articulated Objects, Proc. ICCV, pp. 612–617, 1995.
[Rosa00] Rosales, R. and Sclaroff, S.: Inferring Body Pose without Tracking Body Parts, Proc. CVPR, pp. 721–727, 2000.
[Seth96] Sethian, J. A.: A Fast Marching Level Set Method for Monotonically Advancing Fronts, Proc. Nat. Acad. Sci., vol. 93, no. 4, pp. 1591–1595, 1996.
[Seth99] Sethian, J. A.: Level Set Methods and Fast Marching Methods, Cambridge University Press, 2nd edition, 1999.
[Sidd99] Siddiqi, K. and Bouix, S. and Tannenbaum, A. and Zucker, S. W.: The Hamilton-Jacobi Skeleton, Proc. ICCV, pp. 828–834, 1999.
[Side00] Sidenbladh, H. and Black, M. and Fleet, D.: Stochastic Tracking of 3D Human Figures Using 2D Image Motion, Proc. ECCV, pp. 702–718, 2000.
[Smin01a] Sminchisescu, C. and Triggs, B.: A Robust Multiple Hypothesis Approach to Monocular Human Motion Tracking, Research Report INRIA-RR-4208, June 2001.
[Smin01b] Sminchisescu, C. and Triggs, B.: Covariance-Scaled Sampling for Monocular 3D Body Tracking, Proc. CVPR, pp. 447–454, 2001.
[Smin01c] Sminchisescu, C. and Telea, A.: A Framework for Generic State Estimation in Computer Vision Applications, Proc. ICVS, Springer Verlag, 2001.
[Sull99] Sullivan, J. and Blake, A. and Isard, M. and MacCormick, J.: Object Localization by Bayesian Correlation, Proc. ICCV, pp. 1068–1075, 1999.
[Terz88] Terzopoulos, D. and Witkin, A. and Kass, M.: Constraints on Deformable Models: Recovering 3-D Shape and Non-rigid Motion, Artificial Intelligence, 36(1), pp. 91–123, 1988.
[Trig00] Triggs, B. and McLauchlan, P. and Hartley, R. and Fitzgibbon, A.: Bundle Adjustment - A Modern Synthesis, Vision Algorithms: Theory and Practice, Springer-Verlag, LNCS 1883, pp. 298–372, 2000.
[Wach99] Wachter, S. and Nagel, H.: Tracking Persons in Monocular Image Sequences, CVIU, 74(3), pp. 174–192, 1999.
[Wren] Wren, C. and Pentland, A.: DYNAMAN: A Recursive Model of Human Motion, MIT Media Lab Technical Report No. 451, 2000.
[Zhu97] Zhu, S. C. and Mumford, D.: Learning Generic Prior Models for Visual Computation, IEEE Trans. PAMI, 19(11), pp. 1236–1250, 1997.