3D Object Recognition in Cluttered Scenes with Local Surface Features: A Survey
Object Recognition from Local Scale-Invariant FeaturesDavid G.LoweComputer Science DepartmentUniversity of British ColumbiaV ancouver,B.C.,V6T1Z4,Canadalowe@cs.ubc.caAbstractAn object recognition system has been developed that uses a new class of local image features.The features are invariant to image scaling,translation,and rotation,and partially in-variant to illumination changes and affine or3D projection. These features share similar properties with neurons in in-ferior temporal cortex that are used for object recognition in primate vision.Features are efficiently detected through a stagedfiltering approach that identifies stable points in scale space.Image keys are created that allow for local ge-ometric deformations by representing blurred image gradi-ents in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches.Final veri-fication of each match is achieved byfinding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images witha computation time of under2seconds.1.IntroductionObject recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion.The features must be at least partially in-variant to illumination,3D projective transforms,and com-mon object variations.On the other hand,the features must also be sufficiently distinctive to identify specific objects among many alternatives.The difficulty of the object recog-nition problem is due in large part to the lack of success in finding such image features.However,recent research on the use of dense local features(e.g.,Schmid&Mohr[19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.This paper presents a new method for image feature gen-eration called the Scale Invariant Feature Transform(SIFT). This approach transforms an image into a large collection of local feature vectors,each of which is invariant to image translation,scaling,and rotation,and partially invariant to illumination changes and affine or3D projection.Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change.The SIFT features share a number of properties in common with the responses of neurons in infe-rior temporal(IT)cortex in primate vision.This paper also describes improved approaches to indexing and model ver-ification.The scale-invariant features are efficiently identified by using a stagedfiltering approach.Thefirst stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. 
Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space co-ordinate frame.The features achieve partial invariance to local variations,such as affine or3D projections,by blur-ring image gradient locations.This approach is based on a model of the behavior of complex cells in the cerebral cor-tex of mammalian vision.The resulting feature vectors are called SIFT keys.In the current implementation,each im-age generates on the order of1000SIFT keys,a process that requires less than1second of computation time.The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candi-date object models.Collections of keys that agree on a po-tential model pose arefirst identified through a Hough trans-form hash table,and then through a least-squaresfit to afinal estimate of model parameters.When at least3keys agree on the model parameters with low residual,there is strong evidence for the presence of the object.Since there may be dozens of SIFT keys in the image of a typical object,it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.The current object models are represented as2D loca-tions of SIFT keys that can undergo affine projection.Suf-ficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a60degree rotation away from the camera or to allow up to a20degree rotation of a3D object.2.Related researchObject recognition is widely used in the machine vision in-dustry for the purposes of inspection,registration,and ma-nipulation.However,current commercial systems for object recognition depend almost exclusively on correlation-based template matching.While very effective for certain engi-neered environments,where object pose and illumination are tightly controlled,template matching becomes computa-tionally infeasible when object rotation,scale,illumination, and3D pose are allowed to vary,and even more so when dealing with partial visibility and large model databases.An alternative to searching all image locations for matches is to extract features from the image that are at least partially invariant to the image formation process and matching only to those features.Many candidate feature types have been proposed and explored,including line seg-ments[6],groupings of edges[11,14],and regions[2], among many other proposals.While these features have worked well for certain object classes,they are often not de-tected frequently enough or with sufficient stability to form a basis for reliable recognition.There has been recent work on developing much denser collections of image features.One approach has been to use a corner detector(more accurately,a detector of peaks in local image variation)to identify repeatable image loca-tions,around which local image properties can be measured. 
Zhang et al.[23]used the Harris corner detector to iden-tify feature locations for epipolar alignment of images taken from differing viewpoints.Rather than attempting to cor-relate regions from one image against all possible regions in a second image,large savings in computation time were achieved by only matching regions centered at corner points in each image.For the object recognition problem,Schmid&Mohr [19]also used the Harris corner detector to identify in-terest points,and then created a local image descriptor at each interest point from an orientation-invariant vector of derivative-of-Gaussian image measurements.These image descriptors were used for robust object recognition by look-ing for multiple matching descriptors that satisfied object-based orientation and location constraints.This work was impressive both for the speed of recognition in a large database and the ability to handle cluttered images.The corner detectors used in these previous approaches have a major failing,which is that they examine an image at only a single scale.As the change in scale becomes sig-nificant,these detectors respond to different image points. Also,since the detector does not provide an indication of the object scale,it is necessary to create image descriptors and attempt matching at a large number of scales.This paper de-scribes an efficient method to identify stable key locations in scale space.This means that different scalings of an im-age will have no effect on the set of key locations selected.Furthermore,an explicit scale is determined for each point, which allows the image description vector for that point to be sampled at an equivalent scale in each image.A canoni-cal orientation is determined at each location,so that match-ing can be performed relative to a consistent local2D co-ordinate frame.This allows for the use of more distinctive image descriptors than the rotation-invariant ones used by Schmid and Mohr,and the descriptor is further modified to improve its stability to changes in affine projection and illu-mination.Other approaches to appearance-based recognition in-clude eigenspace matching[13],color histograms[20],and receptivefield histograms[18].These approaches have all been demonstrated successfully on isolated objects or pre-segmented images,but due to their more global features it has been difficult to extend them to cluttered and partially occluded images.Ohba&Ikeuchi[15]successfully apply the eigenspace approach to cluttered images by using many small local eigen-windows,but this then requires expensive search for each window in a new image,as with template matching.3.Key localizationWe wish to identify locations in image scale space that are invariant with respect to image translation,scaling,and ro-tation,and are minimally affected by noise and small dis-tortions.Lindeberg[8]has shown that under some rather general assumptions on scale invariance,the Gaussian ker-nel and its derivatives are the only possible smoothing ker-nels for scale space analysis.To achieve rotation invariance and a high level of effi-ciency,we have chosen to select key locations at maxima and minima of a difference of Gaussian function applied in scale space.This can be computed very efficiently by build-ing an image pyramid with resampling between each level. 
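As a rough illustration of the pyramid construction described here (the smoothing values are garbled in this copy of the text, but the original paper uses incremental smoothing of sigma = sqrt(2) and bilinear resampling with a pixel spacing of 1.5), the Python sketch below builds difference-of-Gaussian levels and marks local extrema. The function names, the fixed number of levels, and the simple 3x3 extremum test are our own simplifications rather than the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter, zoom

def dog_pyramid(image, sigma=np.sqrt(2), n_levels=4):
    """Build difference-of-Gaussian levels by repeated smoothing,
    subtraction, and resampling of the already-smoothed image."""
    image = image.astype(np.float64)
    levels = []
    for _ in range(n_levels):
        a = gaussian_filter(image, sigma)   # first smoothing pass
        b = gaussian_filter(a, sigma)       # further incremental smoothing
        levels.append(a - b)                # difference of the two Gaussians
        # next pyramid level: resample B with a pixel spacing of 1.5
        image = zoom(b, 1.0 / 1.5, order=1)
    return levels

def dog_extrema(dog):
    """Mark pixels that are maxima or minima of their 3x3 neighbourhood
    (the paper additionally compares against neighbouring pyramid levels)."""
    return (dog == maximum_filter(dog, size=3)) | (dog == minimum_filter(dog, size=3))
```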
Furthermore,it locates key points at regions and scales of high variation,making these locations particularly stable for characterizing the image.Crowley&Parker[4]and Linde-berg[9]have previously used the difference-of-Gaussian in scale space for other purposes.In the following,we describe a particularly efficient and stable method to detect and char-acterize the maxima and minima of this function.As the2D Gaussian function is separable,its convolution with the input image can be efficiently computed by apply-ing two passes of the1D Gaussian function in the horizontal and vertical directions:For key localization,all smoothing operations are done us-ingThe input image isfirst convolved with the Gaussian function usingto give a new image,B,which now has an effective smoothing of.The difference of Gaussian function is obtained by subtracting image B from A,result-ing in a ratio of between the two Gaussians.To generate the next pyramid level,we resample the al-ready smoothed image B using bilinear interpolation with a pixel spacing of1.5in each direction.While it may seem more natural to resample with a relative scale ofThe pixel differences are efficient to compute and provide sufficient accuracy due to the substantial level of previous smoothing.The effective half-pixel shift in position is com-pensated for when determining key location.Robustness to illuminationchange is enhanced by thresh-olding the gradient magnitudes at a value of0.1timesthe Figure1:The second image was generated from thefirst by rotation,scaling,stretching,change of brightness and con-trast,and addition of pixel noise.In spite of these changes, 78%of the keys from thefirst image have a closely match-ing key in the second image.These examples show only a subset of the keys to reduce clutter.maximum possible gradient value.This reduces the effect of a change in illumination direction for a surface with3D relief,as an illumination change may result in large changes to gradient magnitude but is likely to have less influence on gradient orientation.Each key location is assigned a canonical orientation so that the image descriptors are invariant to rotation.In or-der to make this as stable as possible against lighting or con-trast changes,the orientation is determined by the peak in a histogram of local image gradient orientations.The orien-tation histogram is created using a Gaussian-weighted win-dow with of3times that of the current smoothing scale. 
These weights are multiplied by the thresholded gradient values and accumulated in the histogram at locations corre-sponding to the orientation,.The histogram has36bins covering the360degree range of rotations,and is smoothed prior to peak selection.The stability of the resulting keys can be tested by sub-jecting natural images to affine projection,contrast and brightness changes,and addition of noise.The location of each key detected in thefirst image can be predicted in the transformed image from knowledge of the transform param-eters.This framework was used to select the various sam-pling and smoothing parameters given above,so that max-Image transformation Ori%89.0B.Decrease intensity by0.285.985.4D.Scale by0.780.383.5F.Stretch by1.565.090.3H.All of A,B,C,D,E,G.71.8 Figure2:For various image transformations applied to a sample of20images,this table gives the percent of keys that are found at matching locations and scales(Match%)and that also match in orientation(Ori%).imum efficiency could be obtained while retaining stability to changes.Figure1shows a relatively small number of keys de-tected over a2octave range of only the larger scales(to avoid excessive clutter).Each key is shown as a square,with a line from the center to one side of the square indicating ori-entation.In the second half of thisfigure,the image is ro-tated by15degrees,scaled by a factor of0.9,and stretched by a factor of1.1in the horizontal direction.The pixel inten-sities,in the range of0to1,have0.1subtracted from their brightness values and the contrast reduced by multiplication by0.9.Random pixel noise is then added to give less than 5bits/pixel of signal.In spite of these transformations,78% of the keys in thefirst image had closely matching keys in the second image at the predicted locations,scales,and ori-entationsThe overall stability of the keys to image transformations can be judged from Table2.Each entry in this table is gen-erated from combining the results of20diverse test images and summarizes the matching of about15,000keys.Each line of the table shows a particular image transformation. 
Thefirstfigure gives the percent of keys that have a match-ing key in the transformed image within in location(rel-ative to scale for that key)and a factor of1.5in scale.The second column gives the percent that match these criteria as well as having an orientation within20degrees of the pre-diction.4.Local image descriptionGiven a stable location,scale,and orientation for each key,it is now possible to describe the local image region in a man-ner invariant to these transformations.In addition,it is desir-able to make this representation robust against small shifts in local geometry,such as arise from affine or3D projection.One approach to this is suggested by the response properties of complex neurons in the visual cortex,in which a feature position is allowed to vary over a small region while orienta-tion and spatial frequency specificity are maintained.Edel-man,Intrator&Poggio[5]have performed experiments that simulated the responses of complex neurons to different3D views of computer graphic models,and found that the com-plex cell outputs provided much better discrimination than simple correlation-based matching.This can be seen,for ex-ample,if an affine projection stretches an image in one di-rection relative to another,which changes the relative loca-tions of gradient features while having a smaller effect on their orientations and spatial frequencies.This robustness to local geometric distortion can be ob-tained by representing the local image region with multiple images representing each of a number of orientations(re-ferred to as orientation planes).Each orientation plane con-tains only the gradients corresponding to that orientation, with linear interpolation used for intermediate orientations. Each orientation plane is blurred and resampled to allow for larger shifts in positions of the gradients.This approach can be efficiently implemented by using the same precomputed gradients and orientations for each level of the pyramid that were used for orientation selection. 
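Both the canonical orientation assignment of Section 3 and the orientation planes just described draw on the same precomputed gradient magnitudes and orientations. A minimal sketch of the former, the Gaussian-weighted 36-bin histogram of local gradient orientations whose smoothed peak gives the key's orientation, is shown below; grad_mag and grad_ori stand for those precomputed maps (magnitudes assumed already thresholded as in the text), and the truncation radius is our own choice.

```python
import numpy as np

def canonical_orientation(grad_mag, grad_ori, key_rc, scale, n_bins=36):
    """Peak of a Gaussian-weighted histogram of gradient orientations
    around the keypoint; grad_ori is assumed to lie in [0, 2*pi)."""
    r0, c0 = key_rc
    win_sigma = 3.0 * scale                     # window sigma = 3x current smoothing scale
    radius = int(round(2 * win_sigma))          # our own truncation radius
    hist = np.zeros(n_bins)
    for r in range(r0 - radius, r0 + radius + 1):
        for c in range(c0 - radius, c0 + radius + 1):
            if not (0 <= r < grad_mag.shape[0] and 0 <= c < grad_mag.shape[1]):
                continue
            w = np.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2 * win_sigma ** 2))
            b = int(grad_ori[r, c] / (2 * np.pi) * n_bins) % n_bins
            hist[b] += w * grad_mag[r, c]
    # smooth the histogram (circularly) before selecting the peak
    hist = np.convolve(np.r_[hist[-1], hist, hist[0]], [0.25, 0.5, 0.25], 'valid')
    return (np.argmax(hist) + 0.5) * 2 * np.pi / n_bins
```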
For each keypoint,we use the pixel sampling from the pyra-mid level at which the key was detected.The pixels that fall in a circle of radius8pixels around the key location are in-serted into the orientation planes.The orientation is mea-sured relative to that of the key by subtracting the key’s ori-entation.For our experiments we used8orientation planes, each sampled over a grid of locations,with a sample spacing4times that of the pixel spacing used for gradient detection.The blurring is achieved by allocating the gradi-ent of each pixel among its8closest neighbors in the sam-ple grid,using linear interpolation in orientation and the two spatial dimensions.This implementation is much more effi-cient than performing explicit blurring and resampling,yet gives almost equivalent results.In order to sample the image at a larger scale,the same process is repeated for a second level of the pyramid one oc-tave higher.However,this time a rather than a sample region is used.This means that approximately the same image region will be examined at both scales,so that any nearby occlusions will not affect one scale more than the other.Therefore,the total number of samples in the SIFT key vector,from both scales,is or160 elements,giving enough measurements for high specificity.5.Indexing and matchingFor indexing,we need to store the SIFT keys for sample im-ages and then identify matching keys from new images.The problem of identifyingthe most similar keys for high dimen-sional vectors is known to have high complexity if an ex-act solution is required.However,a modification of the k-d tree algorithm called the best-bin-first search method(Beis &Lowe[3])can identify the nearest neighbors with high probability using only a limited amount of computation.To further improve the efficiency of the best-bin-first algorithm, the SIFT key samples generated at the larger scale are given twice the weight of those at the smaller scale.This means that the larger scale is in effect able tofilter the most likely neighbours for checking at the smaller scale.This also im-proves recognition performance by giving more weight to the least-noisy scale.In our experiments,it is possible to have a cut-off for examining at most200neighbors in a probabilisticbest-bin-first search of30,000key vectors with almost no loss of performance compared tofinding an exact solution.An efficient way to cluster reliable model hypotheses is to use the Hough transform[1]to search for keys that agree upon a particular model pose.Each model key in the database contains a record of the key’s parameters relative to the model coordinate system.Therefore,we can create an entry in a hash table predicting the model location,ori-entation,and scale from the match hypothesis.We use a bin size of30degrees for orientation,a factor of2for scale, and0.25times the maximum model dimension for location. 
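A minimal sketch of the Hough-transform hashing described here, using the quoted bin sizes (30 degrees for orientation, a factor of 2 for scale, 0.25 times the maximum model dimension for location) and hashing each hypothesis into the two closest bins along every dimension, which yields the 16 entries discussed next. The dictionary-of-lists layout and the toy numbers are our own.

```python
import itertools
from collections import defaultdict
import numpy as np

def pose_bin_keys(x, y, theta_deg, log2_scale, max_model_dim):
    """Return the 2 closest bin indices along each pose dimension, so each
    hypothesis votes into 2**4 = 16 hash-table entries."""
    coords = [x / (0.25 * max_model_dim),   # location: 0.25 x maximum model dimension
              y / (0.25 * max_model_dim),
              theta_deg / 30.0,             # orientation: 30-degree bins
              log2_scale]                   # scale: bins of a factor of 2
    per_dim = []
    for v in coords:
        lo = int(np.floor(v))
        per_dim.append((lo, lo + 1) if v - lo > 0.5 else (lo - 1, lo))
    return itertools.product(*per_dim)

votes = defaultdict(list)
# hypothetical pose predicted from a single feature correspondence
for key in pose_bin_keys(x=120.0, y=45.0, theta_deg=75.0,
                         log2_scale=0.8, max_model_dim=200.0):
    votes[key].append("match 0")
clusters = [m for m in votes.values() if len(m) >= 3]   # clusters of at least 3 entries
```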
These rather broad bin sizes allow for clustering even in the presence of substantial geometric distortion,such as due to a change in3D viewpoint.To avoid the problem of boundary effects in hashing,each hypothesis is hashed into the2clos-est bins in each dimension,giving a total of16hash table entries for each hypothesis.6.Solution for affine parametersThe hash table is searched to identify all clusters of at least 3entries in a bin,and the bins are sorted into decreasing or-der of size.Each such cluster is then subject to a verification procedure in which a least-squares solution is performed for the affine projection parameters relating the model to the im-age.The affine transformation of a model point to an image point can be written aswhere the model translation is and the affine rota-tion,scale,and stretch are represented by the parameters.We wish to solve for the transformation parameters,so Figure3:Model images of planar objects are shown in the top row.Recognition results below show model outlinesand image keys used for matching.the equation above can be rewritten as...This equation shows a single match,but any number of fur-ther matches can be added,with each match contributing two more rows to thefirst and last matrix.At least3matches are needed to provide a solution.We can write this linear system asThe least-squares solution for the parameters x can be deter-Figure4:Top row shows model images for3D objects with outlines found by background segmentation.Bottom image shows recognition results for3D objects with model outlines and image keys used for matching.mined by solving the corresponding normal equations, which minimizes the sum of the squares of the distances from the projected model locations to the corresponding im-age locations.This least-squares approach could readily be extended to solving for3D pose and internal parameters of articulated andflexible objects[12].Outliers can now be removed by checking for agreement between each image feature and the model,given the param-eter solution.Each match must agree within15degrees ori-entation,Figure6:Stability of image keys is tested under differingillumination.Thefirst image is illuminated from upper leftand the second from center right.Keys shown in the bottomimage were those used to match second image tofirst.the top row of Figure4.The models were photographed on ablack background,and object outlinesextracted by segment-ing out the background region.An example of recognition isshown in the samefigure,again showing the SIFT keys usedfor recognition.The object outlines are projected using theaffine parameter solution,but this time the agreement is notas close because the solution does not account for rotationin depth.Figure5shows more examples in which there issignificant partial occlusion.The images in these examples are of size pix-els.The computation times for recognition of all objects ineach image are about1.5seconds on a Sun Sparc10pro-cessor,with about0.9seconds required to build the scale-space pyramid and identify the SIFT keys,and about0.6seconds to perform indexing and least-squares verification.This does not include time to pre-process each model image,which would be about1second per image,but would onlyneed to be done once for initial entry into a model database.The illumination invariance of the SIFT keys is demon-strated in Figure6.The two images are of the same scenefrom the same viewpoint,except that thefirst image is il-luminated from the upper left and the second from the cen-ter right.The full recognition 
system is run to identify thesecond image using thefirst image as the model,and thesecond image is correctly recognized as matching thefirst.Only SIFT keys that were part of the recognition are shown.There were273keys that were verified as part of thefinalmatch,which means that in each case not only was the samekey detected at the same location,but it also was the clos-est match to the correct corresponding key in the second im-age.Any3of these keys would be sufficient for recognition.While matching keys are not found in some regions wherehighlights or shadows change(for example on the shiny topof the camera)in general the keys show good invariance toillumination change.8.Connections to biological visionThe performance of human vision is obviously far superiorto that of current computer vision systems,so there is poten-tially much to be gained by emulating biological processes.Fortunately,there have been dramatic improvements withinthe past few years in understanding how object recognitionis accomplished in animals and humans.Recent research in neuroscience has shown that objectrecognition in primates makes use of features of intermedi-ate complexity that are largely invariant to changes in scale,location,and illumination(Tanaka[21],Perrett&Oram[16]).Some examples of such intermediate features foundin inferior temporal cortex(IT)are neurons that respond toa darkfive-sided star shape,a circle with a thin protrudingelement,or a horizontal textured region within a triangularboundary.These neurons maintain highly specific responsesto shape features that appear anywhere within a large por-tion of the visualfield and over a several octave range ofscales(Ito et.al[7]).The complexity of many of these fea-tures appears to be roughly the same as for the current SIFTfeatures,although there are also some neurons that respondto more complex shapes,such as faces.Many of the neu-rons respond to color and texture properties in addition toshape.The feature responses have been shown to dependon previous visual learning from exposure to specific objectscontaining the features(Logothetis,Pauls&Poggio[10]).These features appear to be derived in the brain by a highlycomputation-intensive parallel process,which is quite dif-ferent from the stagedfiltering approach given in this paper.However,the results are much the same:an image is trans-formed into a large set of local features that each match asmall fraction of potential objects yet are largely invariantto common viewing transformations.It is also known that object recognition in the brain de-pends on a serial process of attention to bind features to ob-ject interpretations,determine pose,and segment an objectfrom a cluttered background[22].This process is presum-ably playing the same role in verification as the parametersolving and outlier detection used in this paper,since theaccuracy of interpretations can often depend on enforcing asingle viewpoint constraint[11].9.Conclusions and commentsThe SIFT features improve on previous approaches by beinglargely invariant to changes in scale,illumination,and localaffine distortions.The large number of features in a typical image allow for robust recognition under partial occlusion in cluttered images.Afinal stage that solves for affine model parameters allows for more accurate verification and pose determination than in approaches that rely only on indexing.An important area for further research is to build models from multiple views that represent the3D structure of ob-jects.This would have the further advantage 
that keys from multiple viewing conditions could be combined into a single model,thereby increasing the probability offinding matches in new views.The models could be true3D representations based on structure-from-motion solutions,or could repre-sent the space of appearance in terms of automated cluster-ing and interpolation(Pope&Lowe[17]).An advantage of the latter approach is that it could also model non-rigid de-formations.The recognition performance could be further improved by adding new SIFT feature types to incorporate color,tex-ture,and edge groupings,as well as varying feature sizes and offsets.Scale-invariant edge groupings that make local figure-ground discriminations would be particularly useful at object boundaries where background clutter can interfere with other features.The indexing and verification frame-work allows for all types of scale and rotation invariant fea-tures to be incorporated into a single model representation. Maximum robustness would be achieved by detecting many different feature types and relying on the indexing and clus-tering to select those that are most useful in a particular im-age.References[1]Ballard,D.H.,“Generalizing the Hough transform to detectarbitrary patterns,”Pattern Recognition,13,2(1981),pp.111-122.[2]Basri,Ronen,and David.W.Jacobs,“Recognition using re-gion correspondences,”International Journal of Computer Vision,25,2(1996),pp.141–162.[3]Beis,Jeff,and David G.Lowe,“Shape indexing usingapproximate nearest-neighbour search in high-dimensional spaces,”Conference on Computer Vision and Pattern Recog-nition,Puerto Rico(1997),pp.1000–1006.[4]Crowley,James L.,and Alice C.Parker,“A representationfor shape based on peaks and ridges in the difference of low-pass transform,”IEEE Trans.on Pattern Analysis and Ma-chine Intelligence,6,2(1984),pp.156–170.[5]Edelman,Shimon,Nathan Intrator,and Tomaso Poggio,“Complex cells and object recognition,”Unpublished Manuscript,preprint at /edelman/mirror/nips97.ps.Z[6]Grimson,Eric,and Thom´a s Lozano-P´e rez,“Localizingoverlapping parts by searching the interpretation tree,”IEEE Trans.on Pattern Analysis and Machine Intelligence,9 (1987),pp.469–482.[7]Ito,Minami,Hiroshi Tamura,Ichiro Fujita,and KeijiTanaka,“Size and position invariance of neuronal responses in monkey inferotemporal cortex,”Journal of Neurophysiol-ogy,73,1(1995),pp.218–226.[8]Lindeberg,Tony,“Scale-space theory:A basic tool foranalysing structures at different scales”,Journal of Applied Statistics,21,2(1994),pp.224–270.[9]Lindeberg,Tony,“Detecting salient blob-like image struc-tures and their scales with a scale-space primal sketch:a method for focus-of-attention,”International Journal ofComputer Vision,11,3(1993),pp.283–318.[10]Logothetis,Nikos K.,Jon Pauls,and Tomaso Poggio,“Shaperepresentation in the inferior temporal cortex of monkeys,”Current Biology,5,5(1995),pp.552–563.[11]Lowe,David G.,“Three-dimensional object recognitionfrom single two-dimensional images,”Artificial Intelli-gence,31,3(1987),pp.355–395.[12]Lowe,David G.,“Fitting parameterized three-dimensionalmodels to images,”IEEE Trans.on Pattern Analysis and Ma-chine Intelligence,13,5(1991),pp.441–450.[13]Murase,Hiroshi,and Shree K.Nayar,“Visual learning andrecognition of3-D objects from appearance,”International Journal of Computer Vision,14,1(1995),pp.5–24. [14]Nelson,Randal C.,and Andrea Selinger,“Large-scale testsof a keyed,appearance-based3-D object recognition sys-tem,”Vision Research,38,15(1998),pp.2469–88. 
[15] Ohba, Kohtaro, and Katsushi Ikeuchi, "Detectability, uniqueness, and reliability of eigen windows for stable verification of partially occluded objects," IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 9 (1997), pp. 1043–48.
[16] Perrett, David I., and Mike W. Oram, "Visual recognition based on temporal cortex cells: viewer-centered processing of pattern configuration," Zeitschrift für Naturforschung C, 53c (1998), pp. 518–541.
[17] Pope, Arthur R., and David G. Lowe, "Learning probabilistic appearance models for object recognition," in Early Visual Learning, eds. Shree Nayar and Tomaso Poggio (Oxford University Press, 1996), pp. 67–97.
[18] Schiele, Bernt, and James L. Crowley, "Object recognition using multidimensional receptive field histograms," Fourth European Conference on Computer Vision, Cambridge, UK (1996), pp. 610–619.
[19] Schmid, C., and R. Mohr, "Local grayvalue invariants for image retrieval," IEEE PAMI, 19, 5 (1997), pp. 530–534.
[20] Swain, M., and D. Ballard, "Color indexing," International Journal of Computer Vision, 7, 1 (1991), pp. 11–32.
[21] Tanaka, Keiji, "Mechanisms of visual object recognition: monkey and human studies," Current Opinion in Neurobiology, 7 (1997), pp. 523–529.
[22] Treisman, Anne M., and Nancy G. Kanwisher, "Perceiving visually presented objects: recognition, awareness, and modularity," Current Opinion in Neurobiology, 8 (1998), pp. 218–226.
[23] Zhang, Z., R. Deriche, O. Faugeras, and Q. T. Luong, "A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry," Artificial Intelligence, 78 (1995), pp. 87–119.
Object Detection with Discriminatively Trained Part Based Models
Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan
Abstract: This paper describes an object detection system based on mixtures of multiscale deformable part models.
The system can represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges.
Although deformable part models have become quite popular, their value had not previously been demonstrated on difficult benchmarks such as the PASCAL datasets.
The system relies on new methods for discriminative training with partially labeled data.
We propose a margin-sensitive approach to data-mining hard negative examples, together with a formalism we call latent SVM (LSVM), a reformulation of MI-SVM in terms of latent variables.
The LSVM training problem is semi-convex, but it becomes convex once the latent variables of the positive examples are fixed.
Training can therefore be carried out with an iterative algorithm that alternates between fixing the latent values of the positive examples and optimizing the objective function.
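A minimal sketch of this alternation is given below, assuming each training example has been reduced to a list of candidate feature vectors, one per admissible latent value, so that an example's score is the maximum of w . f over its candidates. The plain subgradient steps stand in for the authors' actual solver, and the constants are arbitrary.

```python
import numpy as np

def latent_svm_train(pos, neg, dim, C=0.01, epochs=20, inner=10, lr=0.01):
    """Coordinate-descent sketch of latent SVM training: alternately fix the
    latent value of each positive example to its best-scoring candidate,
    then run a few subgradient steps on the resulting convex objective."""
    w = np.zeros(dim)
    for _ in range(epochs):
        # (1) relabel positives with their highest-scoring latent candidate
        pos_fixed = [max(cands, key=lambda f: float(w @ f)) for cands in pos]
        # (2) subgradient descent on the regularized hinge loss
        for _ in range(inner):
            grad = w.copy()                               # gradient of 0.5 * ||w||^2
            for f in pos_fixed:                           # positives, label +1
                if w @ f < 1:
                    grad -= C * f
            for cands in neg:                             # negatives keep the max over latents
                f = max(cands, key=lambda g: float(w @ g))
                if w @ f > -1:
                    grad += C * f
            w -= lr * grad
    return w

# toy usage: 5-D features, two latent choices per example
rng = np.random.default_rng(0)
pos = [[rng.normal(+1, 1, 5) for _ in range(2)] for _ in range(20)]
neg = [[rng.normal(-1, 1, 5) for _ in range(2)] for _ in range(20)]
w = latent_svm_train(pos, neg, dim=5)
```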
Keywords: Object Recognition, Deformable Models, Pictorial Structures, Discriminative Training, Latent SVM
1 Introduction
Object detection is a fundamental task in computer vision.
This paper considers the problem of detecting and localizing objects of a generic category, such as people or cars, in static images.
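As a simplified illustration of how a star-structured deformable part model of the kind described in the abstract is scored (not the authors' implementation, which works on a HOG feature pyramid, places parts at twice the root resolution, and uses a generalized distance transform for the inner search), the sketch below adds to the root-filter response, for each part, its best response within a small displacement window minus a quadratic deformation cost.

```python
import numpy as np

def score_component(root_resp, part_resps, anchors, defs, win=3):
    """Score one star-structured component at every root location.
    root_resp  : 2-D array of root-filter responses.
    part_resps : list of 2-D part-filter response maps on the same grid.
    anchors    : list of (dy, dx) anchor offsets of each part w.r.t. the root.
    defs       : list of (a, b) quadratic deformation weights per part."""
    total = root_resp.astype(np.float64).copy()
    for resp, (ay, ax), (a, b) in zip(part_resps, anchors, defs):
        best = np.full(root_resp.shape, -np.inf)
        for dy in range(-win, win + 1):
            for dx in range(-win, win + 1):
                # response of this part placed at anchor + (dy, dx), minus its cost
                shifted = np.roll(resp, (-(ay + dy), -(ax + dx)), axis=(0, 1))
                best = np.maximum(best, shifted - (a * dy * dy + b * dx * dx))
        total += best
    return total
```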
Computer Applications and Software, Vol. 34, No. 4, April 2017
Object Recognition Fusing Depth and Boundary Information
Yuan Yuxin, Zhou Xiangdong (School of Computer Science and Technology, Fudan University, Shanghai 200433)
Abstract: To locate candidate objects precisely and improve recognition performance, an object recognition method that fuses image boundary information and depth information is proposed; it generates fewer and more accurately localized object candidates.
Deep learning features are then extracted and classified with a support vector machine model to perform object recognition.
Comparative experiments on two widely used datasets show that, compared with the baseline and selective search methods, the proposed method significantly improves recognition performance.
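A minimal sketch of the recognition stage outlined in this abstract, deep features of candidate regions classified with a support vector machine, is given below. It assumes the CNN features have already been extracted and saved to the hypothetical .npy files named in the code, and scikit-learn's LinearSVC stands in for whatever SVM implementation the authors used.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical inputs: each row is the deep feature of one candidate region,
# labels are object classes (0 could denote background).
train_feats = np.load("train_features.npy")    # shape (N, D), assumed to exist
train_labels = np.load("train_labels.npy")     # shape (N,)
test_feats = np.load("test_features.npy")

clf = LinearSVC(C=1.0)                         # one-vs-rest linear SVMs on deep features
clf.fit(train_feats, train_labels)
predictions = clf.predict(test_feats)          # predicted class per candidate region
```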
Keywords: object recognition, region merging, depth information, deep learning, support vector machine. CLC number: TP3; Document code: A; DOI: 10.3969/j.issn.1000-386x.2017.04.031
OBJECT RECOGNITION COMBINED WITH DEPTH AND BOUNDARY INFORMATION
Yuan Yuxin, Zhou Xiangdong (School of Computer Science and Technology, Fudan University, Shanghai 200433, China)
Abstract: In order to locate the candidate object accurately and improve the recognition effect, an object recognition method combining depth and boundary information is proposed. The proposed method can generate fewer but better object candidates with more accurate locations. The deep learning features are then extracted, and an SVM classification model is used to perform recognition. Experimental results on two common data sets show that, compared with the baseline and selective search, this method improves the performance of object recognition significantly.
Keywords: Object recognition, Region merge, Depth information, Deep learning, SVM
0 Introduction
Object recognition has important application value and broad prospects in security surveillance, human-computer interaction, medicine, national defense and public safety.
计算机研究与发展ISSN 100021239ΠCN 1121777ΠTPJournal of Computer Research and Development 46(6):100921018,2009 收稿日期:2008-06-25;修回日期:2008-11-25 通讯作者:孙艳丰(yf sun @ ) 基金项目:国家自然科学基金项目(60533030,60825203);北京市自然科学基金项目(4061001);国家科技支撑计划基金项目(2007BA H13B01)BJUT 23D 三维人脸数据库及其处理技术尹宝才 孙艳丰 王成章 盖 赟(北京工业大学计算机学院多媒体与智能软件技术北京市重点实验室 北京 100124)(yinbc @ )BJ UT 23D Large Scale 3D F ace Database and Information ProcessingYin Baocai ,Sun Yanfeng ,Wang Chengzhang ,and Ge Yun(B ei j ing M unici pal Key L aboratory of M ultimedia and I ntelli gent S of tw are Technolog y College of Com p uter S cience and Technology ,B ei j ing Universit y of Technolog y ,B ei j ing 100124)Abstract 3D face recognition has become one of t he most active research topics in face recognition due to it s robust ness in t he variation on po se and illumination.3D database is t he basis of t his work.Design and ruction of t he face database mainly include acquisition of prototypical 3D face data ,p reprocessing and standardizing of t he data and t he st ruct ure design.Currently ,BJ U T 23D database is t he largest Chinese 3D face database in t he world.It contains 1200Chinese 3D face images and p rovides bot h t he text ure and shape information of human faces.This data resource plays an important role in 3D face recognition and face model.In t his paper ,t he data description ,data collection schema and t he po st 2p rocessing met hods are provided to help using t he data and f ut ure extension.A 3D face data dense correspondence met hod is int roduced.Dense correspondence means t hat t he key facials point s are caref ully labeled and aligned among different faces ,which can be used for a broad range of face analysis tasks.As an applicatio n ,a pose estimation and face recognition algorit hm acro ss different po ses is p ropo sed.Eexp remental result s show t hat t he propo sed algorit hm has a good performance.K ey w ords 3D face database ;face recognition ;3D face model ;morp hable model ;mesh resampling摘 要 BJ U T 23D 是目前国际上最大的中国人的三维人脸数据库,其中包括经过预处理的1200名中国人的三维人脸数据,这一数据资源对于三维人脸识别与建模方面的研究有重要意义.首先介绍了BJ U T 23D 数据库的数据获取条件、数据形式,并针对数据库建立过程中数据预处理技术进行了讨论.最后作为数据库的直接应用,进行了多姿态人脸识别和人脸姿态估计算法的研究.实验结果证实,该算法具有良好的性能.关键词 三维人脸数据库;人脸识别;三维人脸模型;形变模型;网格重采样中图法分类号 TP391 经过40多年的发展,尤其是近10年的研究,人脸识别的理论和算法均取得了长足的进步,但这些理论和算法主要针对输入是二维人脸图像而开展的.理论和实验研究已经证实,二维图像中人脸姿态或成像时光照条件的变化对算法的识别性能有很大影响.而更实用的人脸识别算法应该是在摄像环境不可控、用户不配合的情况下使用.所以目前算法的缺陷大大限制了人脸识别技术在实际中的广泛应用.如何解决不同姿态、不同光照条件下的人脸识别问题是二维人脸识别研究的瓶颈,也是当前的研究热点.与二维人脸图像数据相比,三维人脸数据中包含人脸的空间信息,这是人脸本身固有的特征信息,对姿态、光照条件的变化具有鲁棒性.因此,近年来利用三维人脸数据进行人脸识别的途径已经引起人们的广泛关注,也出现了一些识别算法[1].与二维图像不同,三维人脸数据有多种不同的形式,如人脸的深度数据、曲面点的三维坐标及其点之间的连接关系、面部轮廓线数据等.针对不同形式人脸数据的识别算法也需要相同形式的数据资源.人脸数据库对人脸识别算法的研究与开发、模型训练、算法性能比较测试是不可缺少的数据资源,尤其在基于统计学习算法占主导地位的人脸识别领域,模型训练所采用的人脸库的规模、覆盖的人脸数据的变化很大程度上影响算法精度和鲁棒性;不同算法性能测试所用到的数据库的规模和属性同样决定了评测的合理性和测试结果的有效性.所以,随着三维人脸识别研究的不断深入,建立各种数据形式的三维人脸数据库,为同行提供模型训练数据资源、算法研究与比较的数据平台,具有重要的意义.经过长期的研究积累,我们研究小组采用Cyberware3030R G BΠPS激光扫描仪获取三维人脸原始数据,通过对齐算法构建了可进行线性计算的三维人脸数据库BJ U T23D[2],该库包含1200个中性表情的中国人的三维人脸样本数据,其中部分数据有多个样本.扫描后的数据是由点的纹理信息、三维坐标信息及其点之间的连接关系构成.该数据库目前可以为诸如人脸跟踪、识别、动画等研究人员提供很好的数据资源.本文先对三维人脸数据的采集环境、条件、数据形式进行了介绍,然后研究了数据库建立过程中的数据获取、数据处理、数据对齐等相关技术.这些技术为数据库的使用及其相关的研究工作会提供一些有益的帮助.1 
相关的三维人脸数据库综述目前已经有一些包含三维信息的三维人脸数据库,按着数据库的构造方法可以将它们分为基于多视角几何信息的方法、基于结构光的方法和基于三维扫描仪的方法.CMU的FIA数据库是基于多视角几何信息的三维数据库[3],其中数据是用6个摄像机从3个不同角度获取20s的视频信息,然后用计算机视觉的方法恢复三维信息得到的人脸数据.由于没有对视频人脸进行标定,这类方法是用复杂的人脸跟踪算法重构人脸的形状信息,所以其效果受人脸跟踪效果的影响较大.3D2RAM是基于结构光的方法建立的三维人脸数据库[4],它用一个照相机和放映机获取人的3D坐标信息,建立一个含129人的3D人脸数据库.该库样本的坐标信息精度高,但对于面部的眼睛或阴影部分无法获取其3D信息,导致面部曲面形状不完全.由于三维扫描仪能够获取人脸部较精确的形状和纹理信息,因此成为建立三维人脸数据库非常好的工具.在GavabDB数据库中[5],使用Minolta V I2700数字转换器获取61个有表情变化的从不同视角扫描的人脸数据.由于有些视角具有不可见部分,为获取完整的三维人脸表面信息还需要进行适当的后处理.Cyberware扫描仪通过一次扫描可以获取人不同视角的完整数据,因此获取的数据准确性好,大大简化了后处理工作,用该设备建立的U SF三维人脸数据库[6]有200人的三维人脸数据,由于每个样本的形状和纹理信息维数很高,因此对于人脸数据处理与分析方面的研究,这样规模的数据还远远满足不了需要.2007年, Huang的研究小组利用Cyberware扫描仪建立了一个含有475人的三维人脸数据库[7],样本主要有中性和微笑两种表情,年龄分布在19~25岁之间,这一数据库可以缓解现有数据库规模小的缺陷,也为人脸识别、跟踪、对齐、动画等相关研究工作提供重要基础.2 BJUT23D数据库介绍BJ U T23D的三维人脸数据通过Cyberware 3030R G BΠPS激光扫描仪获取.扫描时,一条红色激光线从扫描仪里面发射出来,照射到头部Π脸部,经过激光线的反射,被仪器接收和计算.扫描时要求被扫描者端坐在旋转平台的一个高度适中的椅子上,并直视前方,以保证头部在扫描仪的中部.扫描期间需保持端坐不动和静止的脸部表情直至扫描结束.该扫描仪通过一次扫描得到人头部的几何信息和彩色纹理信息,并使用柱面坐标记录几何信息.扫描精度为圆周方向(用φ表示,0≤φ≤2π)489个采样点,轴方向(用h表示,0≤h≤300mm)478个采样点,扫描半径(用r表示)在260mm~340mm之间.每一0101计算机研究与发展 2009,46(6)个几何采样点对应一个24位(用R,G,B表示)纹理像素点,并以489×478大小的纹理图像存储.Fig.1 Cyberware laser scanner.图1 Cyberware激光扫描仪1)光照条件用激光扫描仪扫描人脸时可以同时获取人脸的三维几何信息和彩色纹理信息,人脸纹理的好坏直接影响到所创建人脸库的质量及应用价值,并给基于人脸库进行的人脸建模、人脸识别、人脸动画等方面的研究带来很大的影响.为了得到统一的、较为真实的纹理信息,我们的数据采集在同一个扫描间进行,并对光照条件做了一定的限制.扫描间是一个特定、封闭的环境,其四周设置4盏专用的照明灯,由前后左右4个方向指向被扫描对象,并保证扫描对象各个方向具有相同的光照强度.为了模拟正常的环境光,扫描间的4盏灯都是60W的白炽灯,同时设置扫描间的墙壁为通体白色,这样4盏灯相互照射后,从墙壁上返回的光形成了一个统一对环境光的模拟制式.由于镜面反射对模型的生成会产生较大的影响,所以要求光的强度在一定的范围内.所有扫描工作都在扫描间完成,这样既保证对环境光的光照条件近似模拟,也保证所有三维人脸数据的光照条件完全相同.2)饰物由于扫描仪对头发等深色部位的扫描效果比较差,而人脸研究仅对人的面部区域感兴趣,因此要求被扫描者佩戴泳帽并将头发全部包住.该泳帽一般应选择颜色较鲜明的色彩以便和面部区域分离,方便后期处理.此外还要求被扫描者不能化妆、不戴眼镜等任何饰物.3)数据规模及形式BJ U T23D三维人脸数据库共包括1200名中国人的三维人脸数据,其中500人的数据对外公开发布,男女各250人,年龄分布在16岁~49岁之间,所有人脸数据均是中性表情.部分人脸有3个样本,以便于人脸识别研究.三维扫描仪进行一次柱面扫描就是对人的头部表面进行高密度采样,采样信息包括空间几何信息和彩色纹理信息.空间几何信息由两部分组成,既空间三维采样点的坐标信息(用(X,Y,Z)表示,约2×105个点),和由网格描述的这些点之间的连接关系,网格组成的三角面片约有4×105个.彩色纹理信息是采样点柱面投影得到的二维图像,以普通图像格式存储,图像的长和宽由投影参数、扫描设备硬件与操作平台决定,本文得到的纹理分辨率为478×489,如图2(c)所示.为建立几何信息同纹理信息之间的联系,在几何信息中还存储几何采样点在纹理信息文件中对应纹理点的归一化坐标,归一化坐标表明本采样点在纹理信息文件中对应纹理点位置的索引信息,几何信息和纹理信息之间的关系就是通过该索引信息建立起来的.图2是扫描后的三维人脸及其对应的几何、纹理信息.Fig.2 3D prototytical face data.(a)Scanned3D face;(b) Shape data;and(c)Texture image.图2 三维原始人脸数据.(a)三维人脸;(b)几何数据;(c)纹理图像4)人脸数据的命名规则在数据库中,每个三维人脸数据由单一的文件组成,文件按照统一的规则进行命名.文件名有6部分信息,命名规则为性别+I D+年龄+表情+内容+发布情况.具体表示形式如下:x_xxxx_Ax_Ex_Cxxxx_Rx1 2 3 4 5 6每部分的具体含义为:1表示性别区域,由一个字母组成.“M”表示男性,“F”表示女性.2表示I D区域,由4个数字组成.表示该文件在数据库中的I D,当组成文件I D所需数字不足4位时剩余高位用0补齐.3表示年龄区域,由一个字母“A”和一位数字组成.A是年龄的英文Age的首字母.由于研究时1101尹宝才等:BJ U T23D三维人脸数据库及其处理技术关心的是人脸数据所处的年龄段,所以只记录每个人脸数据所属的年龄段,并用1位数字表示.每个年龄段的代表数字如表1所示:T able1 Correspondence of N otation and Age表1 年龄符号对应表Notation Age Range110-19220-29330-39440-494表示表情区域,由一个字母“E”和一位代表表情的字母组成.表情字母表示人脸数据具有的表情.每个表情代表字母的含义如表2所示.目前数据库中所有人脸都是中性表情.T able2 Correspondence of N otation and Expression表2 表情符号对应表Notation ExpressionN NormalH HappyP SurpriseA Angry5表示数据内容区域,由5位字母组成.C是Content的首字母,后面的4位字母“t rim”表示该数据经过预处理.6表示发布标记区域,由两位字母组成.首字母为R,第2个数字表示是否已经发布,其中“0”表示未发布,“1”表示已发布.目前发布的数据是无法直接读取的,用户需要使用我们提供的工具将原始数据转换成可读的文本形式.转换后的文本数据包含3个部分信息:顶点信息、纹理信息、网格信息.①顶点信息:顶点信息由密集采样点组成,三维人脸模型的顶点信息就是由这些采样点构成的.数据的表示形式为Vertex1:X=-87.616997,Y=-12.994000,Z=37.046001, Vertex1表示序号为1的顶点,X,Y,Z分别表示该点的3个坐标值.②纹理信息:纹理信息描述了每个顶点的对应的纹理值.数据表示形式为Text ure1:R=144,G=99,B=85,Text ure1表示顶点1的像素值,R,G,B分别表示点在3个颜色通道的值.③网格信息:网格信息描述顶点之间的连接关系.库中的数据使用三角网格来描述顶点之间的连接关系.数据的表示形式为Triangle1:Fi rst V ertex=36407,Second V ertex=36310,Thi r d V ertex=36392,Triangle1表示第1个三角网格,其后的3部分信息分别表示依附该三角网格的3个顶点的标号.3 
建立BJUT23D的信息处理技术扫描后的数据还有许多信息缺失和不平滑的情况,另外肩部和头部的信息对于人脸识别及相关研究是无用的,它们的存在将会增加数据规模,为后续数据库的应用增加计算量,所以需要对扫描后的数据进行预处理.3.1 面部数据的分离和预处理扫描人脸时,由于光照条件的细微变化、人脸表面的不光滑性以及头发等复杂结构的影响,射在人脸表面的光线在返回时运动轨迹发生偏离,会使扫描后得到的三维人脸数据发生变形,出现一些毛刺和空洞等现象.在对耳朵、下巴等部位扫描采样时,捕捉不到的三维信息也会形成空洞,有些地方则因为局部表面不光滑会产生毛刺.对此,我们采用交互的方式,使用插值、平滑等预处理方法弥补三维人脸上的空洞并去掉毛刺.面部数据的分离是将人脸面部区域从整个头部扫描数据中分离出来,去除头发、肩等部位的三维数据.我们使用的方法[8]首先确定分离的边界.由于在三维人脸几何数据上直接进行边界关键点标定和边缘自动检测十分困难,所以借助人脸的纹理图像来进行不规则边界的确定,即在三维人脸对应的二维纹理图像上确定面部发际边界和耳朵部位的边界,然后通过纹理几何的对应关系,找到三维人脸几何数据相应的分割边界.对于耳下的垂直切面和脖子下的水平切面则直接在几何数据上确定,用来去除肩部以下和耳朵后面的数据.确定了人脸的分离边界后,即可将人脸的面部区域从原始扫描数据中分离出来.如图3所示为分离后的三维人脸,图3(a)是分离后的几何形状及其对应的纹理图像,图3(b)是分离后不同角度下的三维人脸面部图像.2101计算机研究与发展 2009,46(6)Fig.3 3D face data.(a )The cutted shape and texture for 3D face and (b )Frontal and side 3D face.图3 三维人脸数据.(a )分离后的三维人脸几何信息和纹理信息;(b )正面、侧面3D 人脸 为保证三维人脸数据的一致性,在数据获取时要求被扫描者保持指定的姿态和位置,既目视前方,头部保持垂直.但实际扫描得到的人脸样本的姿态不可避免地存在一定偏差,因此需要对不同的人脸数据进行坐标矫正,将不同的三维人脸数据统一到同一个坐标系.切割后的三维人脸数据接近一个柱面分布,所以用三维人脸数据的离散点集来拟合一个柱面,用柱面的中心轴作为三维人脸数据的新的垂直坐标轴(Z 轴),过鼻尖点且与新的垂直坐标轴垂直相交的直线作为新的前向坐标轴(Y 轴),新的X 坐标轴则由Y 轴和Z 轴的叉乘运算确定.通过坐标变换可以得到每个三维人脸在新的坐标系下的坐标值,经过坐标变换的所有三维人脸数据均变换到朝向、姿态相同的坐标系下.如图4是三维人脸的坐标矫正示意图,其中Z 是矫正后的垂直轴,Z 0是矫正前的垂直轴,X ,Y ,Z 是矫正后的坐标轴.Fig.4 Recorrected face by a cylinder.图4 人脸柱面矫正3.2 人脸数据的规格化由于人脸的个性化差异,扫描得到的人脸数据有很大差别.首先是构成三维人脸的点数和面数不同,这样的数据使基于形变模型的三维人脸重建无法进行,也不利于人脸的统一表示;其次是点或面的排列与人脸特征无关.因此建库时对预处理过的三维人脸数据进行了规格化,规格化后的数据既可以用统一的向量形式来表示,又保证所有的三维人脸数据特征对齐.规格化[9]的第1步是建立不同三维人脸数据间的稠密对应,既根据人脸面部特征建立不同的三维人脸数据间点到点的一一对应关系.例如,已知一个人脸上的鼻尖点可以根据对应关系找到另外一个人脸上的鼻尖点,如果以某一个人脸作为标准人脸,就可以将人脸数据根据标准人脸的点和面进行有序化.事实上,在三维数据上建立基于特征的点对点的稠密对应非常困难.首先不同人脸的个性差异导致三维人脸的几何差异很大,而且还要考虑纹理特征信息的对应;其次三维人脸数据是稠密点集,数据量很大,因此很难使用一般方法建立这种对应关系.文献[9]考虑到扫描人脸数据是以柱面的形式表示,将三维人脸展开为二维形式,借助在二维图像上光流对应计算的方法建立三维数据的对应.但光流算法的前提假设是两幅图像间光流的变化是连续光滑的,对于比较相像的两幅人脸可以近似地看做视频序列的相邻两帧图像,此时对应计算效果比较好.但对于形状差别较大的人脸数据,光流算法的前提假设不满足,对应计算将产生较大的误差.另外,这种将复杂三维几何进行柱面展开形成二维图像的方法实际上损失了很多三维信息,所以其对应计算的效果不是很好.为此,BJ U T 23D 数据库采用基于网格重采样的对齐方法.网格重采样是通过原始数据建立网格和曲面的常用方法,它摒弃了在二维图像上的处理方法,直接在三维空间进行,能够更多更精确地保留原数据的三维信息.利用重采样可以将不规则的多边形网格转化为规则的网格的特点,该方法将不同网格数和空间点数的原型人脸全部规格化为采样点数、网格数、拓扑完全一致的原形人脸,且重采样后的人脸同一相对位置的点都固定地代表了同一个面部特征,在此基础上能够直接进行不同人脸的点与点的线性组合,从特征的角度更具有线性组合的合3101尹宝才等:BJ U T 23D 三维人脸数据库及其处理技术理性.人脸对齐主要由人脸分片和网格重采样两个计算过程组成.1)人脸分片人脸分片将三维人脸分割成多个面片为网格重采样做准备.目前自动分片算法[10]的研究主要是针对纹理映射领域,虽然能够达到自动,但分片的形状不确定,无法保证所有人脸分出的同一片包含的人脸特征相同或相近.Krishnamurt hy 等人[11]提出的交互的人工分片方法,由用户选取一序列点,然后采用贪心图算法,在网格连线上寻找相邻点的最短路径,这些路径则形成分片的边界.该方法以网格的连接关系为基础进行分片操作,实现比较复杂.本文根据三维人脸数据包含三维几何与纹理两部分数据的特点,基于面部纹理图像手工交互标定特征点,然后以特征点的连线作为分片边界,划分特征区域,最后通过柱面映射找到三维人脸网格上的分割结点和分割线.考虑到重采样后网格要求比较均匀,所以采用面积比较接近的矩形进行分割.如图5所示是三维人脸分割的结果,一个人脸被分为122个面片.Fig.5 Divide the 3D face into patches.图5 三维人脸分片2)三维人脸网格重采样对于初始分片后的三维人脸通过网格重采样进行网格细分.重采样时首先要确定每个面片的4个角点.对于规格的矩形面片,直接使用其4个顶点作为角点;对分割后处于边界的不规格面片,利用最小内角法或长宽比法确定4个角点.为了能够进行均匀重采样,对所有矩形的边长度进行统计,然后进行等形线的均匀初始化,这样不仅使边界边的划分更均匀,还可以减少边界曲线提取的计算量.对等形线初始化后的网格进一步的细分,利用点的合力调整新获得弹性点的位置,从而获得了每一面片的均匀重采样网格.对每个面片重复以上重采样过程,直到重采样的密度与原始三维人脸数据的密度比较接近为止.如图6(c )是对人脸数据进行5次重采样的结果,约由13×104个点,25×104个三角面组成.详细的三维人脸重采样过程参见文献[8].Fig.6 Face mesh resampling.(a )The ioslines initialized ;(b )One time mesh resampling ;and (c )Five times mesh resampling.图6 人脸重采样.(a )初始化网格;(b )1次重采样的结果;(c )5次重采样结果经过上面的重采样处理,所有三维人脸具有相同数量的点和三角面片,且整个网格的拓扑结构完全相同,从而可以建立三维人脸数据间严格的一一对应,这样的对应可以将所有三维人脸表示为统一的表示形式.另外,由于这里的分片是基于特征的分片,因此重采样后点的对应也是基于特征的稠密对应.图7是分别基于网格重采样的方法和光流的方法进行人脸对齐的结果.从图中可以看出,基于网格重采样方法的对齐效果好于光流的算法.Fig.7 The correspondence based on mesh resampling and optical flow.(a )The correspondence based onmesh resampling and (b )The correspondence based on optical flow.图7 基于重采样算法和光流算法的对齐效果比较.(a )基于网格重采样方法的对齐结果;(b )基于光流方法的对齐结果4101计算机研究与发展 
2009,46(6)4 BJUT 23D 的应用———多姿态人脸识别算法研究[12] 实用的人脸识别系统应该是在用户不配合的情况下使用,此时人的头部会以多种姿态的形式出现,所以进行人脸识别必须考虑头部姿态的变化,多姿态人脸识别也一直是人脸识别研究的难点.作为三维人脸数据库BJ U T 23D 的直接应用成果,我们小组进行了多姿态人脸识别研究,并借助于三维人脸形变模型[9]实现了对人脸的姿态估计.4.1 算法整体框架根据二维人脸库(gallery )中的人脸图像(每个人只需要一幅二维人脸图像),采用三维人脸形变模型重建其对应的三维人脸.在识别阶段采用该三维人脸模型估计二维测试图像中人脸的旋转角度,并以测试图像中人脸在3个方向上的旋转角度为基准,将人脸库(gallery )中重建的三维人脸旋转到相同视角的同一姿态.最后,采用相同姿态下人脸图像进行人脸对象的分类识别.算法的整体框架如图8所示:Fig.8 The f ramework for multipose face recognition.图8 算法整体框架4.2 三维人脸形变模型形变模型的基础是线性组合理论,即使用一类对象中若干典型样本张成该类对象的一个子空间,用子空间基底的组合近似地表示该类对象的特定实例.使用形变模型进行三维人脸建模分为两个过程:一是建立模型,包括原始人脸数据的获取、人脸数据的对应和建立组合模型;二是针对特定人脸图像进行二维人脸图像与模型的优化匹配,实现三维人脸的重建.建立形变模型使用的三维人脸数据源于BJ U T 23D 数据库,所有数据均经过前述的规格化处理,实现了三维人脸的点到点的对应.第i 个三维人脸数据用形状和纹理向量表示为S i =(X i 1,Y i 1,Z i 1,X i 2,…,X in ,X in ,X in ,)T,T i =(R i 1,G i 1,B i 1,R i 2,…,R in ,G in ,B in )T,1≤i ≤N ,(1)其中N 三维人脸的总数,n 是三维人脸顶点的个数.由于原型人脸数量比较大(N =200),且人脸数据间有一定相关性,因此使用主元分析方法(PCA )对人脸形状和纹理向量进行处理,压缩数据量,消除数据间的相关性,得到形变模型的表示形式:S model =S -+∑m-1i αi s i,T model =T -+∑m-1iβi t i,(2)其中S -,T -是原型三维人脸的平均形状和纹理向量,m 是主元个数,s =(s 1,s 2,…,s m -1),t =(t 1,t 2,…,t m -1)是形状和纹理的主元向量组,α=(α1,α2,…,αm -1),β=(β1,β2,…,βm -1)是模型的组合参数.4.3 模型匹配模型匹配就是将形变模型与输入二维人脸图像进行优化匹配,使模型人脸与输入人脸的匹配误差最小,得到模型的组合参数.本文用图像对应像素点的灰度差的平方和作为两图像的匹配误差,即E I =∑x ,y|I input (x ,y )-I mod el (x ,y )|2,(3)其中I input 是输入的人脸图像,I mod el 是三维模型人脸在某视点观察得到的人脸图像,可通过投影模型和5101尹宝才等:BJ U T 23D 三维人脸数据库及其处理技术。
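Section 4.2 above represents a 3D face as the average prototype plus a linear combination of PCA basis vectors (Eq. (2)). The sketch below builds the shape part of such a basis from aligned prototype vectors with a plain SVD and synthesizes a face from a coefficient vector; the texture model and the image-matching step of Eq. (3) are omitted, and all array sizes and names are illustrative.

```python
import numpy as np

def build_shape_model(S, m):
    """S: (N, 3n) matrix whose rows are aligned prototype shape vectors
    (X1, Y1, Z1, X2, ...).  Returns the mean shape and the first m
    principal shape components."""
    mean = S.mean(axis=0)
    _, _, Vt = np.linalg.svd(S - mean, full_matrices=False)
    return mean, Vt[:m]

def synthesize_shape(mean, basis, alpha):
    """S_model = mean + sum_i alpha_i * s_i  (shape half of Eq. (2))."""
    return mean + alpha @ basis

# toy usage: 200 prototypes with 1000 vertices each (random stand-ins)
rng = np.random.default_rng(1)
prototypes = rng.normal(size=(200, 3 * 1000))
mean, basis = build_shape_model(prototypes, m=50)
new_face = synthesize_shape(mean, basis, alpha=rng.normal(size=50))
```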
3D Object Recognition in Cluttered Sceneswith Local Surface Features:A SurveyYulan Guo,Mohammed Bennamoun,Ferdous Sohel,Min Lu,and Jianwei Wan Abstract—3D object recognition in cluttered scenes is a rapidly growing research area.Based on the used types of features,3D object recognition methods can broadly be divided into two categories—global or local feature based methods.Intensive research has been done on local surface feature based methods as they are more robust to occlusion and clutter which are frequently present in areal-world scene.This paper presents a comprehensive survey of existing local surface feature based3D object recognition methods.These methods generally comprise three phases:3D keypoint detection,local surface feature description,and surface matching.This paper covers an extensive literature survey of each phase of the process.It also enlists a number of popular and contemporarydatabases together with their relevant attributes.Index Terms—3D object recognition,keypoint detection,feature description,range image,local featureÇ1I NTRODUCTIONO BJECT recognition in cluttered scenes is a fundamental research area in computer vision.It has numerous applications,such as intelligent surveillance,automatic assembly,remote sensing,mobile manipulation,robotics, biometric analysis and medical treatment[1],[2],[3],[4],[5]. In the last few decades,2D object recognition has been extensively investigated and is currently a relatively mature research area[6].Compared to2D images,range images have shown to exhibit several advantages for object recogni-tion.For instance,(i)range images provide more geometri-cal(depth)information compared to2D images.Range images also encode surface metric dimensions unambigu-ously.(ii)Features extracted from range images are com-monly not affected by scale,rotation and illumination[7]. (iii)An estimated3D pose of an object from range images is more accurate compared to an estimated pose from2D images.Accordingly,range images have the potential to overcome many of the difficulties faced by2D images in the context of object recognition[8].These advantages make3D object recognition an active research topic[9].Moreover,the rapid technological development of low cost3D acquisition systems(e.g.,Microsoft Kinect)make range images more accessible[10],[11],[12].Furthermore,advances in comput-ing devices enable the processing of any computationally intensive3D object recognition algorithm to run in a fairly acceptable manner.All these combined factors have contrib-uted to theflourishment of research towards the develop-ment of3D object recognition systems.Existing3D object recognition methods can be divided into two broad categories:global feature based methods and local feature based methods[13],[14].The global fea-ture based methods process the object as a whole for recog-nition.They define a set of global features which effectively and concisely describe the entire3D object(or model)[15]. 
These methods have been widely used in the context of3D shape retrieval and classification[16],[17].Examples in this category include geometric3D moments[18],shape distri-bution[19],viewpoint feature histogram[20],and potential well space embedding[21].They,however,ignore the shape details and require a priori segmentation of the object from the scene[11].They are therefore not suitable for the recognition of a partially visible object from cluttered scenes [14].On the other hand,the local feature based methods extract only local surfaces around specific keypoints.They generally handle occlusion and clutter better compared to the global feature based methods[14].This type has also proven to perform notably better in the area of2D object recognition[22].This conclusion has also been extended to the area of3D object recognition[23].On that basis,the focus of this paper is on3D object recognition in cluttered scenes with local surface features.Several survey papers were published on3D object recog-nition and its relatedfields,such as3D shape matching and modeling.Among these,several survey papers on3D object recognition are also available.For instance,the articles in [24],[25],[26],[27],[28]and[29].Two reviews on3D model-ing and range image registration by[30]and[31]are also worth mentioning.However,none of these papers specifi-cally focuses on local feature based3D object recognition.Bronstein et al.[32]presented a review on keypoint detection and local surface feature description methods. They however,covered only four keypoint detectors and seven feature descriptors.An article on the evaluation ofY.Guo is with the College of Electronic Science and Engineering,NationalUniversity of Defense Technology,Changsha,Hunan410073,China,andwith the School of Computer Science and Software Engineering,The Uni-versity of Western Australia,Perth,WA6009,Australia.E-mail:yulan.guo@.M.Bennamoun and F.Sohel are with the School of Computer Science andSoftware Engineering,The University of Western Australia,Perth,WA6009,Australia.M.Lu and J.Wan are with the College of Electronic Science and Engineer-ing,National University of Defense Technology,Changsha,Hunan410073,China.Manuscript received6May2013;revised24Nov.2013;accepted3Apr.2014.Date of publication10Apr.2014;date of current version9Oct.2014.Recommended for acceptance by A.Fitzgibbon.For information on obtaining reprints of this article,please send e-mail to:reprints@,and reference the Digital Object Identifier below.Digital Object Identifier no.10.1109/TPAMI.2014.23168280162-8828ß2014IEEE.Personal use is permitted,but republication/redistribution requires IEEE permission.See /publications_standards/publications/rights/index.html for more information.keypoint detection methods has just been published[33]. 
However,this article only covers eight keypoint detection methods.A large number of significant research contribu-tions on local feature based methods is available and there is no review article which comprehensively analyzes these methods.Compared with the existing literature,the main contribu-tions of this paper are as follows:i)To the best of our knowledge,this is thefirst surveypaper in the literature that focuses on3D object recog-nition based on local surface features.ii)As opposed to previous reviews,e.g.,in[32]and[33],we adequately cover the contemporary litera-ture of keypoint detection and local surfacedescription methods.We present a comprehensivereview of29keypoint detection and38featuredescription methods.iii)This paper also provides an insightful analysis on all aspects of surface matching,including featurematching,hypothesis generation and verification. iv)Compared to the earlier surveys,this paper covers the most recent and advanced work.It therefore,pro-vides the reader with the state-of-the-art methods. v)A comparative summary of attributes is reported in tabular forms(e.g.,Tables3,4and5).The rest of this paper is organized as follows.Section2 describes the background concepts and terminology of3D object recognition based on local surface features.Sections 3and4provide a comprehensive survey of the existing methods for3D keypoint detection and local surface fea-ture description,respectively.Section5presents a review of3D object recognition methods.Section6presents a brief discussion on potential future research directions.Finally, Section7concludes this paper.2B ACKGROUND C ONCEPTS AND T ERMINOLOGY2.1Background ConceptsA range image can be represented in three types,namely a depth image,a point cloud or a polygonal mesh.Given a range image,the goal of3D object recognition is to correctly identify objects present in the range image,and determine their poses(i.e.positions and orientations)[34].At a conceptual level,a typical local feature based3D object recognition system consists of three main phases: 3D keypoint detection,local surface feature description and surface matching.In the3D keypoint detection phase, the3D points with rich information content are identi-fied as keypoints.The inherent scale of each keypoint is also detected.Both the location and scale(i.e.,neighbor-hood size)of a keypoint define a local surface which is used in the subsequent feature description phase[35].In the local surface feature description phase,the geometric information of the neighborhood surface of the keypoint is encoded into a representative feature descriptor.Dur-ing the surface matching phase,the scene features are matched against all model features in the library,result-ing in a set of feature correspondences and hypotheses. 
These hypotheses arefinally verified to infer the identity and pose of the object.2.2Databases and EvaluationMany databases have been built to test various algorithms.A set of popular2.5D range image and3D model databases are enlisted together with their major attributes in Tables1 and2,respectively.The variations(var.),including occlu-sion(o),clutter(c)and deformation(d),in each database are also provided.The symbol‘-’denotes that the corre-sponding item is not reported.In addition to range images/ models,registered2D color(usually RGB)images are also simultaneously provided in several databases(shown in Tables1and2).There are series of evaluation criteria which are fre-quently used to assess the performance of each phase of a3D object recognition system[3],[36],[37],[38].Repeat-ability score is a frequently used criteria for3D keypoint detector evaluation.It is computed as the ratio of the number of corresponding keypoints to the minimum number of keypoints in the two images[33],[39],[40], [41],[42].Recall vs1-Precision is a frequently used crite-ria for local surface feature descriptor evaluation.It is generated by varying the thresholds for feature matching and calculating the feature recall and precision under each threshold[3],[43],[44],[45],[46].Recognition rate is commonly used to evaluate the overall performance of a recognition system[38],[44],[47],[48].Popular Range ImageDatabases33D K EYPOINT D ETECTIONKeypoint detection is thefirst major phase of a local surface feature based3D object recognition system.The simplest keypoint detection methods are surface sparse sampling and mesh decimation[38],[63],[64].However,these meth-ods do not result in qualified keypoints in terms of repeat-ability and informativeness.That is because they give no or little consideration to the richness of discriminative infor-mation of these detected keypoints[4].Therefore,it is neces-sary to detect keypoints according to their distinctiveness.Based on whether the scale is predetermined or adap-tively detected,keypoint detection methods can be classi-fied into two categories:fixed-scale keypoint detection methods and adaptive-scale keypoint detection methods. 
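Before turning to the individual detectors, the two main evaluation criteria of Section 2.2 can be sketched as follows, under the assumption that a ground-truth transform between the two range images and correctness labels for feature matches are available; the distance tolerance and all helper names are our own.

```python
import numpy as np

def repeatability(kp_scene, kp_model, transform, tol=2.0):
    """Ratio of corresponding keypoints to the minimum keypoint count:
    a scene keypoint corresponds if, after the ground-truth transform,
    it lies within `tol` of some model keypoint."""
    mapped = transform(kp_scene)                               # (N, 3) -> (N, 3)
    d = np.linalg.norm(mapped[:, None, :] - kp_model[None, :, :], axis=2)
    n_corr = int(np.sum(d.min(axis=1) < tol))
    return n_corr / min(len(kp_scene), len(kp_model))

def recall_vs_1_precision(match_dist, is_correct, thresholds):
    """Sweep the feature-matching threshold and record (1-precision, recall);
    is_correct is a boolean array marking true correspondences."""
    curve = []
    total_true = max(int(is_correct.sum()), 1)
    for t in thresholds:
        accepted = match_dist < t
        tp = int(np.sum(accepted & is_correct))
        recall = tp / total_true
        precision = tp / max(int(accepted.sum()), 1)
        curve.append((1.0 - precision, recall))
    return curve
```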
We adopt the same classification as[33]in this paper.3.1Fixed-Scale Keypoint DetectionFixed-scale keypoint detection methods define a point, which is distinctive within a predetermined neighborhood, as a keypoint.The neighborhood size is determined by the scale,which is an input parameter to the algorithm[35].As described in the following subsections,distinctiveness measures can either be curvatures or other surface variation (OSV)measures.3.1.1Curvature Based MethodsThese methods use different curvatures as distinctiveness measures to detect keypoints.Mokhtarian et al.[65]detected keypoints using the Gaussian and mean curvatures.They declared a point p as a keypoint if its curvature value was larger than the curvature values of its1-ring neighbors(k-ring neighbors are defined as the points which are distant from p by k edges).Yamany and Farag[66]used simplex angles to detect keypoints.A simplex angle’is related to the mean curvature.The key-points are detected at the locations where the simplex angles satisfy the constraint j sinð’Þj!t.Their threshold t is crucial for the performance of their keypoint detection,and choosing an appropriate threshold is still an unresolved issue[66].Gal and Cohen-Or[67]proposed a saliency grade for keypoint detection.A saliency grade for a point p is a lin-ear combination of two terms.Thefirst term is the sum of the curvatures of its neighboring points,and the second term is the variance of the curvature values in the neighbor-hood.The points with high saliency grades are selected as keypoints.Chen and Bhanu[68]detected keypoints based on shape index values.That is,within a neighborhood,the point p is marked as a keypoint only if its shape index value is a local optimum(maximum/minimum).Experimental results showed that the detected keypoints were quite uni-formly distributed over the surface[33].However,this method is sensitive to noise[33].3.1.2Other Surface Variation Measure Based Methods These methods use other surface variation measures rather than just curvatures to detect keypoints.Matei et al.[69]used the smallest eigenvalue 3of the covariance matrix of the neighboring points to measure the surface variation around a point p.The points were sorted according to their surface variations.Zhong[70]employed the ratio of two successive eigenvalues to prune the points. Only the points which satisfy the constraints 21<t21and 32<t32are retained and further detected based on the smallest eigenvalue 1.Guo et al.[44]first decimated a range image and then chose the points,which satisfied the con-straint 12>t from the decimated image,as keypoints.These methods achieve good results in terms of repeatability.More-over,they are particularly computationally efficient[33].Glomb[71]introduced four propositions to extend the popular Harris detector[72]from2D images to3D meshes.They found that the Harris detector,which used the derivative of afitted quadratic surface,achieved the best results.Following this proposition,Sipiran and Bus-tos[40]proposed a“Harris3D”detector.Given a point p, the neighboring points werefirst translated to the centroid and then rotated to align the normal at p with the z axis. 
Next, the transformed neighboring points were fitted to a quadratic surface described by $f(u,v) = \mathbf{a}^{T}(u^2, uv, v^2, u, v, 1)^{T}$. A symmetric matrix E was then defined using the derivatives of this function. The Harris 3D operator value at the point p was calculated as $V(p) = \det(E) - a\,(\mathrm{tr}(E))^2$, where $\det(E)$ and $\mathrm{tr}(E)$ represent the determinant and trace of the matrix E, respectively, and a is a parameter which needs to be tuned experimentally. Finally, a fixed percentage of the points with the largest values of V(p) were selected as keypoints. Experimental results showed that the detector was robust to several transformations, including noise, change of tessellation, local scaling, shot noise, and the presence of holes. It also outperformed [73] (described in Section 3.2.4) and [15] (described in Section 3.2.1) in many aspects, especially in the cases of high levels of local scaling and shot noise.

Since only a fixed scale is used to find the keypoints, the implementation of these methods is straightforward. However, they have several major drawbacks. First, they may detect too few keypoints, particularly on the less curved parts of a 3D object [4], which is a problem for object recognition. Second, fixed-scale methods determine the scale empirically. They do not fully exploit the scale information encoded in the local geometric structures to detect the inherent scale of a keypoint. Therefore, the neighborhood size cannot be adaptively determined [48].

TABLE 2: Popular 3D Model Databases.

3.2 Adaptive-Scale Keypoint Detection

Adaptive-scale keypoint detection methods first build a scale-space for a given range image. They then pick the points with extreme distinctiveness measures in both the spatial and scale neighborhoods as keypoints. As a result, both the locations and scales of the keypoints are detected. According to the scale-space construction technique, these methods can be divided into four categories: coordinate smoothing (CS), geometric attribute smoothing (GAS), surface variation, and transform based methods.

3.2.1 Coordinate Smoothing Based Methods

These methods construct a scale-space by successively smoothing the 3D coordinates of a range image. They are broadly based on the 2D scale-space theory first introduced in [74]. Akagündüz and Ulusoy [75] first obtained the scale-space of a 3D surface by constructing a Gaussian pyramid of the surface. They then computed the mean and Gaussian curvature values at all points for all scales, and classified each point as one of the eight surface types based on its Gaussian and mean curvatures [24]. Within the classified scale-space, each connected volume that had the same surface type was detected. The center of the connected volume was chosen as the location of a keypoint, and the weighted average of the scale values within the connected volume was selected as the scale of that keypoint. This algorithm is invariant to scale and rotation. The results showed that, for scale-varying databases, the adaptive-scale features ensured a superior recognition performance to the fixed-scale features [13].

Castellani et al. [15] resampled the mesh M at $N_O$ levels to yield a set of octave meshes $M^j$ ($j = 1, 2, \ldots, N_O$). They then applied $N_S$ Gaussian filters to each octave mesh $M^j$ to obtain a set of filtering maps $F_i^j$ ($i = 1, 2, \ldots, N_S$). Next, they projected $F_i^j(p)$ onto the normal of p to get a scalar map $M_i^j(p)$, which was further normalized to an inhibited saliency map $\hat{M}_i^j$. Finally, keypoints were detected as the local maxima within the inhibited saliency map $\hat{M}_i^j$. Only the potential keypoints which appeared in at least
three octaves were considered as validate keypoints.Experimen-tal results showed that this method detected only a limited number of keypoints,which were well-localized and often at the extremities of long protrusions of the surface [15],[33].Darom and Keller [76]followed the work of [15].They however used density-invariant Gaussian filters to obtain octave meshes,which was more robust to varying mesh res-olutions and nonuniform sampling compared to [15].Li and Guskov [77]projected the 3D points onto a series of increasingly smoothed versions of the shape to build a scale-space.They then computed the normal differences between neighboring scales at each point p .The points whose normal differences were larger (or smaller)than all of their spatial and scale neighbors were detected as key-points.Experimental results showed that the keypoints atcoarse scales were more repeatable compared to their coun-terparts at finer scales.Lo and Siebert [78]detected keypoints on a depth image using an enhanced version of the Scale Invariant Feature Transform (SIFT)algorithm [22],namely 2.5D SIFT.They created a discrete scale-space representation of the depth image by using Gaussian smoothing and Difference Of Gaussian (DOG)pyramid techniques.The signal maxima and minima were detected within the DOG scale-space.The keypoints were finally validated and located by comparing the ratio of the principal curva-tures (i.e.,k 1k 2)with a predefined threshold.The 2.5D SIFT achieved a superior matching performance compared to the 2D SIFT [78].Knopp et al.[23]first voxelized a mesh to a 3D voxel image.Next,they calculated the second-order derivatives Lðv ;s Þat each voxel v using box filters with increasing stan-dard deviations s .They defined a saliency measure s for each voxel v and scale s based on the Hessian matrix.Local extrema over the space s ðv ;s Þwere used to detect the “3D SURF”keypoints and their corresponding scales.Coordinate smoothing based methods directly apply the 2D scale-space theory to 3D geometric data by replacing pixel intensities with 3D point coordinates.Consequently,the extrinsic geometry and topology of a 3D shape are altered,and the causality property of a scale-space is vio-lated [79].The causality property is an important axiom for a scale-space representation.It implies that any structure (feature)in a coarse scale should be able to find its cause in the finer scales [80].3.2.2Geometric Attribute Smoothing Based Methods These methods construct a scale-space by successively smoothing the geometric attributes of a range image.Since the filtering is applied to the geometric attributes rather than the range image itself,no modification is made to the extrinsic geometry of the 3D shape.Therefore,the causality property of the scale-space is preserved.Novatnack and Nishino [81]represented a surface using its normals and parameterized the surface on a 2D plane to obtain a dense and regular 2D representation.They then constructed a scale-space of the normal field by successively convolving the 2D normal map with geo-desic Gaussian kernels of increasing standard deviations.Keypoints and their corresponding scales were detected by identifying the points in the scale-space where the corner responses were maxima along both the spatial and scale axes.This work was one of the first to consider a geometric scale space that can directly be used on range images,preceding several methods including [82]and [83].Experimental results showed that the keypoints could be detected and localized robustly,and the 
num-ber of detected keypoints was sufficiently large [48],[84].One major limitation of this method is that it requires accurate estimations of the surface normals to construct the 2D scale-space [41],[85].Flint et al.[86]convolved the 3D voxel image of a range image with a set of Gaussian kernels to build a density scale-space.The keypoints were detected over the scale-space using the determinant of the Hessian matrix.Tests ona number of scenes showed that,this THRIFT method can repeatably extract the same keypoints under a range of transformations.However,it is sensitive to the variations of point density[87].Moreover,regular resampling in a3D space throughout the data is very time-consuming[88].Hua et al.[89]first mapped a3D surface to a canonical 2D domain by using a non-linear optimization method. They then encoded the surface curvatures and conformal factors into the rectangular2D domain,resulting in a shape vector image.Next,they built a vector-valued scale-space on the shape vector image through a geodesic distance-weighted inhomogeneous linear diffusion and a Difference of Diffusion(DoD)computation.They detected keypoints as the points that had maxima/minima DoD values across the scales in both the curvature and conformal factor chan-nels.Experiments showed that this method achieved a very high repeatability.It was also superior to the regular anisotropic diffusion and isotropic diffusion methods[89]. However,it is difficult to apply this method to large and topologically complicated surfaces[83],[90]and certain high-genus surfaces(i.e.,surfaces which have a large num-ber of holes.)[89].Zou et al.[90]proposed a Geodesic Scale-Space(GSS) based on the convolution of a variable-scale geodesic Gauss-ian kernel with the surface geometric attributes.Keypoints were detected by searching the local extrema in both the spatial and scale domains in the GSS.These keypoints were further pruned based on their contrasts and anisotropies. 
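The geometric-attribute-smoothing strategy shared by the methods in this subsection can be illustrated with a short, generic sketch. It is a simplified version rather than any single cited method: it smooths a precomputed per-point attribute with Gaussian weights over Euclidean rather than geodesic neighborhoods, and then takes difference-of-attribute extrema over space and scale, in the spirit of the DoG-style detectors discussed here.

```python
import numpy as np
from scipy.spatial import cKDTree

def attribute_scale_space(points, attr, sigmas):
    # Smooth a precomputed per-point attribute (e.g., curvature) with Gaussian
    # weights of increasing sigma; point coordinates are never modified, so the
    # extrinsic geometry is untouched.  Brute force, for clarity only.
    tree = cKDTree(points)
    levels = []
    for s in sigmas:
        smoothed = np.empty_like(attr, dtype=float)
        for i, p in enumerate(points):
            idx = tree.query_ball_point(p, 3.0 * s)
            d2 = np.sum((points[idx] - p) ** 2, axis=1)
            w = np.exp(-d2 / (2.0 * s * s))
            smoothed[i] = np.sum(w * attr[idx]) / np.sum(w)
        levels.append(smoothed)
    return np.stack(levels)                 # (num_scales, num_points)

def dog_extrema(points, levels, radius):
    # Keypoints and scales = extrema of the difference-of-attribute maps over
    # both the spatial neighborhood and the two adjacent scales.
    # Needs len(sigmas) >= 4 so that at least one interior DoG level exists.
    dog = np.diff(levels, axis=0)
    tree = cKDTree(points)
    keys = []
    for s in range(1, dog.shape[0] - 1):
        for i, p in enumerate(points):
            idx = tree.query_ball_point(p, radius)
            block = dog[s - 1:s + 2][:, idx]
            if dog[s, i] >= block.max() or dog[s, i] <= block.min():
                keys.append((i, s))
    return keys
```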
Experiments demonstrated that the GSS method was robust to noise and varying mesh resolutions. However, the cost of computing geodesic distances becomes extremely high as the scale increases [83].

Zaharescu et al. [91] first defined a scalar field (a photometric or geometric attribute) for each point p. They then convolved the scalar field with a set of geodesic Gaussian kernels and performed DOG calculations to obtain their scale-space. MeshDOG keypoints were declared as the maxima in the DOG scale-space. These keypoints were finally pruned by a set of operations including non-maximum suppression, thresholding, and corner detection. This method offers a canonical formulation for detecting both photometric and geometric keypoints on a mesh surface. It is capable of detecting a sufficient number of repeatable keypoints. It is, however, sensitive to varying mesh resolutions [33], [83].

Zou et al. [82] formalized an Intrinsic Geometric Scale-Space (IGSS) of a 3D surface by gradually smoothing the Gaussian curvatures via shape diffusion. Keypoints were identified as extrema of the normalized Laplacian of the IGSS with respect to both the spatial and scale domains. The IGSS representation is invariant to conformal deformations (i.e., transformations which preserve both the size and the sign of angles). Experimental results showed that the detected keypoints spread across various scales and were robust to noise.

Hou and Qin [83] downsampled the surface as the scale increased, and built a scale-space of a scalar field (e.g., curvature or texture) on the surface by using a Gaussian kernel. Keypoints were finally detected as local extrema in the scale-space. The capability of simultaneous sampling in both the spatial and scale domains makes this method efficient and stable. Experimental results showed that it significantly reduced the processing time compared to GSS. It is also more stable under different mesh resolutions than MeshDOG. Another merit is its invariance to isometric deformations (i.e., transformations which preserve distance).

3.2.3 Surface Variation Based Methods

These methods first calculate surface variations over a set of varying neighborhood sizes. They then detect keypoints by finding the maxima of surface variation in local neighborhoods of different sizes. They are based on the assumption that the neighborhood size can be regarded as a discrete scale parameter, and that increasing the local neighborhood size is similar to applying a smoothing filter [92]. These methods avoid making direct changes to the 3D surfaces, and they are straightforward to implement.

Pauly et al. [92] measured the surface variation $\delta$ using the eigenvalues $\lambda_1$, $\lambda_2$ and $\lambda_3$ of the covariance matrix of the neighboring points, that is, $\delta = \lambda_3 / (\lambda_1 + \lambda_2 + \lambda_3)$. Local maxima in the surface variation space were determined as keypoints.
Experiments demonstrated that the surface variation corresponded well to the smoothed surface obtained with standard Gaussian filtering. Two major drawbacks of this method are that the surface variation is sensitive to noise, and that a heuristic pre-smoothing procedure is required to detect the maxima in the surface variation space [41], [88].

Ho and Gibbins [88] used the standard deviation of the shape index values of the neighboring points to measure the surface variation. The keypoints detected on the Dragon model are illustrated in Fig. 1a. Experimental results showed that this method was effective and robust to minor noise, achieving high repeatability even on noisy data. Later, Ho and Gibbins [85] estimated the curvedness at different scales and picked the points with extreme values in the scale-space as keypoints. Similarly, Ioannou et al. [93] proposed a multi-scale Difference of Normals (DoN) operator for the segmentation of large unorganized 3D point clouds. The DoN operator provides a substantial reduction in points, which reduces the computational cost of any subsequent processing phase of the scene (when processing is performed on the segmented parts).

Unnikrishnan and Hebert [41] proposed an integral operator $B(p,r)$ to capture the surface variation at a point p with a neighborhood size r. In effect, $B(p,r)$ displaces the point p along its normal direction n, with a magnitude of displacement proportional to the mean curvature. They then defined the surface variation $d(p,r)$ as:

$$d(p,r) = \frac{2\,\|p - B(p,r)\|}{r}\,\exp\!\left(-\frac{2\,\|p - B(p,r)\|}{r}\right). \qquad (1)$$

Fig. 1. Keypoints detected on the Dragon model. (a) Keypoints detected by [88]; different sizes of the spheres correspond to different scales of the keypoints. (b), (c), (d) Keypoints and their neighborhoods detected at three scales by [41]; each colored patch corresponds to the neighborhood of a keypoint, and the sizes of the blue spheres correspond to the scales.
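The eigenvalue-based surface variation that underlies [92] and the related detectors above can be sketched compactly. The following is a minimal illustration, not any author's reference code: it assumes an unorganized point cloud and a fixed neighborhood radius; evaluating the measure over several radii gives the multi-scale behaviour exploited by [41], [85], [88].

```python
import numpy as np
from scipy.spatial import cKDTree

def surface_variation(points, radius):
    # delta = lambda3 / (lambda1 + lambda2 + lambda3), with lambda3 the smallest
    # eigenvalue of the neighborhood covariance matrix (notation as above).
    tree = cKDTree(points)
    delta = np.zeros(len(points))
    for i, p in enumerate(points):
        idx = tree.query_ball_point(p, radius)
        if len(idx) < 4:                    # too few neighbors for a stable estimate
            continue
        cov = np.cov(points[idx].T)         # 3x3 covariance of the neighborhood
        evals = np.linalg.eigvalsh(cov)     # ascending order: evals[0] is smallest
        delta[i] = evals[0] / max(evals.sum(), 1e-12)
    return delta
```

Keypoints are then taken as the local maxima of the returned measure.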
EFFICIENT FEATURE EXTRACTION FOR2D/3D OBJECTSIN MESH REPRESENTATIONCha Zhang and Tsuhan ChenDept.of Electrical and Computer Engineering,Carnegie Mellon University 5000Forbes Avenue,Pittsburgh,PA15213,USA{czhang,tsuhan}@ABSTRACTMeshes are dominantly used to represent3D models as they fit well with graphics rendering hardware.Features such as volume,moments,and Fourier transform coeffi-cients need to be calculated from the mesh representation efficiently.In this paper,we propose an algorithm to cal-culate these features without transforming the mesh into other representations such as the volumetric representa-tion.To calculate a feature for a mesh,we show that we can first compute it for each elementary shape such as a triangle or a tetrahedron,and then add up all the values for the mesh.The algorithm is simple and efficient,with many potential applications.1.INTRODUCTION3D scene/object browsing is becoming more and more popular as it engages people with much richer experience than2D images.The Virtual Reality Modeling Language (VRML)[1],which uses mesh models to represent the3D content,is rapidly becoming the standard file format for the delivery of3D contents across the Internet.Tradition-ally,in order to fit graphics rendering hardware well,a VRML file models the surface of a virtual object or envi-ronment with a collection of3D geometrical entities,such as vertices and polygons.In many applications,there is a high demand to calcu-late some important features for a mesh model,e.g.,the volume of the model,the moments of the model,or even the Fourier transform coefficients of the model.One ex-ample application is the search and retrieval of3D models in a database[2][3][9].Another example is shape analysis and object recognition[4].Intuitively,we may calculate these features by first transforming the3D mesh model into its volumetric representation and then finding these features in the voxel space.However,transforming a3D mesh model into its volumetric representation is a time-consuming task,in addition to a large storage requirement [5][6][7].Work supported in part by NSF Career Award9984858.In this paper,we propose to calculate these features from the mesh representation directly.We calculate a fea-ture for a model by first finding it for the elementary shapes,such as triangles or tetrahedrons,and then add them up.The computational complexity is proportional to the number of elementary shapes,which is typically much smaller than the number of voxels in the equivalent volu-metric representation.Both2D and3D meshes are consid-ered in this paper.The result is general and has many po-tential applications.The paper is organized as follows.In Section2we discuss the calculation of the area/volume of a mesh.Sec-tion3extends this idea and presents the method to com-pute moments and Fourier transform for a mesh.Some applications are provided in Section4.Conclusions and discussions are given in Section5.2.AREA/VOLUME CALCULATIONThe computation of the volume of a3D model is not a trivial work.One can convert the model into a discrete3D binary image.The grid points in the discrete space are called voxels.Each voxel is labeled with‘1’or‘0’to indi-cate whether this point is inside or outside the object.The number of voxels inside the object,or equivalently the summation of all the voxel values in the discrete space, can be an approximation for the volume of the model. 
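The voxel-counting approximation just described is straightforward to write down; the sketch below (our illustration, assuming a binary occupancy grid is already available) makes the cost trade-off concrete and motivates the mesh-based computation developed in the rest of the paper.

```python
import numpy as np

def volume_from_occupancy(occupancy, voxel_size):
    # occupancy: 3D boolean array, True ('1') for voxels inside the object.
    inside = np.count_nonzero(occupancy)
    return inside * voxel_size ** 3         # each voxel contributes voxel_size^3

# Example of the cost: a 256^3 grid already stores ~16.7M voxel labels for one
# model, and halving voxel_size multiplies both storage and counting work by 8.
```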
However, transforming a 3D mesh model into a binary image is very time-consuming. Moreover, in order to improve the accuracy, the resolution of the 3D binary image needs to be very high, which further increases the computational load.

2.1. 2D Mesh Area

We explain our approach starting from the computation of areas for 2D meshes. A 2D mesh is simply a 2D shape with polygonal contours. As shown in Figure 1, suppose we have a 2D mesh with bold lines representing its edges. Although we can discretize the 2D space into a binary image and calculate the area of the mesh by counting the pixels inside the polygon, doing so is very computationally intensive.

Figure 1: The calculation of a 2D polygon area ("positive" and "negative" elementary areas).

To start with our algorithm, let us assume that the polygon is closed. If it is not, a contour closing process can be performed first [9]. Since we know all the vertices and edges of the polygon, we can calculate the normal of each edge easily. For example, edge AB in Figure 1 has the normal:

$$N_{AB} = \frac{(y_2 - y_1)\,\hat{x} - (x_2 - x_1)\,\hat{y}}{\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}} \qquad (1)$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of vertices A and B, respectively, and $\hat{x}$ and $\hat{y}$ are the unit vectors of the axes. We define the normal here as a normalized vector which is perpendicular to the corresponding edge and points outwards from the mesh. In the computer graphics literature, there are different ways to check whether a point is inside or outside a polygon [8], so it is easy to find the correct direction of the normal. Later we will show that even if we only know that all the normals point to the same side of the mesh (either inside or outside, as long as they are consistent), we are still able to find the correct area of the mesh.

After obtaining the normals, we construct a set of triangles by connecting all the polygon vertices with the origin. Each edge and the origin form an elementary triangle, which is the smallest unit of computation. We define the signed area of each elementary triangle as follows: the magnitude of this value is the area of the triangle, while the sign of the value is determined by checking the position of the origin with respect to the edge and the direction of the normal. Take the triangle OAB in Figure 1 as an example. The area of OAB is:

$$S_{OAB} = \frac{1}{2}\left(x_1 y_2 - x_2 y_1\right). \qquad (2)$$

The sign of $S_{OAB}$ is the same as the sign of the inner product $\vec{OA} \cdot N_{AB}$, which is positive in this case. The total area of the polygon can be computed by summing up all the signed areas. That is,

$$S_{total} = \sum_i S_i \qquad (3)$$

where i goes through all the edges or elementary triangles. Following the above steps, the result of equation (3) is guaranteed to be positive, no matter whether the origin is inside or outside the mesh. Note that we do not make any assumption that the polygon is convex.

In a real implementation, we do not need to check the signs of the areas each time. Let:

$$S'_i = \frac{1}{2}\left(x_{i1} y_{i2} - x_{i2} y_{i1}\right), \qquad S'_{total} = \sum_i S'_i \qquad (4)$$

where i stands for the index of the edges or elementary triangles, and $(x_{i1}, y_{i1})$, $(x_{i2}, y_{i2})$ are the coordinates of the starting point and the end point of edge i. When we loop through all the edges, we need to keep moving forward so that the inside of the mesh is always kept on the left-hand side or the right-hand side. According to the final sign of the result $S'_{total}$, we can tell whether we are looping along the right direction (the right direction should give a positive result), and the final result can be obtained simply by taking the magnitude of $S'_{total}$.

2.2. 3D Case

We can extend the above algorithm to the 3D case. In a VRML file, the mesh
is represented by a set of vertices and polygons. Before we calculate the volume, we do some preprocessing on the model to make sure that all the polygons are triangles. Such preprocessing, called triangulation, is commonly used in mesh coding, mesh signal processing, and mesh editing. The direction of the normal of a triangle can be determined by the order of its vertices and the right-hand rule, as shown in Figure 2. The consistency condition is very easy to satisfy: for two neighboring triangles, if the common edge has different directions, then the normals of the two triangles are consistent. For example, in Figure 2, AB is the common edge of triangles ACB and ABD. In triangle ACB, the direction is from B to A, and in triangle ABD, the direction is from A to B, thus $N_{ACB}$ and $N_{ABD}$ are consistent.

Figure 2: Normals and order of vertices.

In the 3D case, the elementary calculation unit is a tetrahedron. For each triangle, we connect each of its vertices with the origin to form a tetrahedron, as shown in Figure 3. As in the 2D case, we define the signed volume of each elementary tetrahedron as follows: the magnitude of this value is the volume of the tetrahedron, and the sign of the value is determined by checking whether the origin is on the same side as the normal with respect to the triangle.

Figure 3: The calculation of 3D volume.

In Figure 3, triangle ACB has a normal $N_{ACB}$. The volume of tetrahedron OACB is:

$$V_{OACB} = \frac{1}{6}\left(x_1 y_2 z_3 - x_1 y_3 z_2 - x_2 y_1 z_3 + x_2 y_3 z_1 + x_3 y_1 z_2 - x_3 y_2 z_1\right). \qquad (5)$$

As the origin O is on the opposite side of $N_{ACB}$, the sign of this tetrahedron is positive. The sign can also be calculated from the inner product $\vec{OA} \cdot N_{ACB}$.

In a real implementation, again we only need to compute:

$$V'_i = \frac{1}{6}\left(x_{i1} y_{i2} z_{i3} - x_{i1} y_{i3} z_{i2} - x_{i2} y_{i1} z_{i3} + x_{i2} y_{i3} z_{i1} + x_{i3} y_{i1} z_{i2} - x_{i3} y_{i2} z_{i1}\right), \qquad V'_{total} = \sum_i V'_i \qquad (6)$$

where i stands for the index of the triangles or elementary tetrahedra, and $(x_{i1}, y_{i1}, z_{i1})$, $(x_{i2}, y_{i2}, z_{i2})$ and $(x_{i3}, y_{i3}, z_{i3})$ are the coordinates of the vertices of triangle i, ordered so that the normal of triangle i is consistent with the others. The volume of a 3D mesh model is always positive, so the final result can be obtained by taking the absolute value of $V'_{total}$. In order to compute other 3D model features such as moments or Fourier transform coefficients, we reverse the sequence of vertices of each triangle if $V'_{total}$ turns out to be negative.

3. MOMENTS AND FOURIER TRANSFORM

The above algorithm can be generalized to calculate other features of 2D and 3D mesh models. Whenever the feature to be calculated can be written as a signed sum of features of the elementary shapes (triangles in the 2D case and tetrahedra in the 3D case), and the feature of the elementary shape can be derived in an explicit form, the proposed algorithm applies. Although this seems to be a strong constraint, many commonly used features fall into this category. For example, all features that take the form of an integration over the space inside the object can be calculated with this algorithm. This includes moments, the Fourier transform, the wavelet transform, and many others.

In classical mechanics and statistical theory, the concept of moments is used extensively. In this paper, the moments of a 3D mesh model are defined as:

$$M_{pqr} = \iiint x^p y^q z^r \rho(x,y,z)\, dx\, dy\, dz \qquad (7)$$

where $\rho(x,y,z)$ is an indicator function:

$$\rho(x,y,z) = \begin{cases} 1, & \text{if } (x,y,z) \text{ is inside the mesh,} \\ 0, & \text{otherwise,} \end{cases} \qquad (8)$$

and p, q, r are the orders of the moment. Central moments can be obtained easily from the result of equation (7). Since the integration can be rewritten as the sum of integrations over
each elementary shape:

$$M_{pqr} = \sum_i s_i \iiint x^p y^q z^r \rho_i(x,y,z)\, dx\, dy\, dz \qquad (9)$$

where $\rho_i(x,y,z)$ is the indicator function of elementary shape i, and $s_i$ is the sign of the signed volume of shape i. We can use the same process as in Section 2 to calculate a number of low-order moments of triangles and tetrahedra that are used extensively. A few examples of the moments of a tetrahedron are given in the Appendix; more examples can be found in [9].

The Fourier transform is a very powerful tool in many signal processing applications. The Fourier transform of a 2D or 3D mesh model is defined as the Fourier transform of its indicator function:

$$\Theta(u,v,w) = \iiint e^{-i(xu + yv + zw)} \rho(x,y,z)\, dx\, dy\, dz. \qquad (10)$$

Since the Fourier transform is also an integration over the space inside the object, it can likewise be calculated by decomposing the integration into integrations over each elementary shape. The explicit form of the Fourier transform of a tetrahedron is given in the Appendix.

As the moments and Fourier transform coefficients of an elementary shape have explicit forms, the above computation is very efficient. The computational complexity is O(N), where N is the number of edges or triangles in the mesh. Note that in the volumetric approach, where a 2D or 3D binary image is obtained before computing any of the features, the computational complexity is O(M), where M is the number of grid points inside the model, not counting the cost of transforming the data representation. It is obvious that M is typically much larger than N, especially when a relatively accurate result is required and the resolution of the binary image has to be large. The storage space required by our algorithm is also much smaller.

Previous work by Lien and Kajiya [10] provides a similar method for calculating the moments of tetrahedra. Our work gives more explicit forms of the moments and extends their approach to calculating the Fourier transform.

4. APPLICATIONS

A good application of our algorithm is finding the principal axes of a 3D mesh model. This is useful when we want to compare two 3D models that are not well aligned. In a 3D model retrieval system [2][9], this is required because some of the features may not be invariant to arbitrary rotations. We construct a 3x3 matrix from the second-order moments of the 3D model:

$$S = \begin{bmatrix} M_{200} & M_{110} & M_{101} \\ M_{110} & M_{020} & M_{011} \\ M_{101} & M_{011} & M_{002} \end{bmatrix}. \qquad (11)$$

The principal axes are obtained by computing the eigenvectors of the matrix S, which is also known as principal component analysis (PCA). The eigenvector corresponding to the largest eigenvalue is taken as the first principal axis, the eigenvector corresponding to the second largest eigenvalue is the second principal axis, and so on. In order to make the final result unique, we further make sure that the third-order moments, $M_{300}$ and $M_{030}$, are positive after the transform. Figure 4 shows the results of this algorithm.

Figure 4: 3D models before and after PCA.

The Fourier transform of a 3D mesh model can be used in many applications. For example, the coefficients can be directly used as features in a retrieval system [9]. Other applications include shape analysis, object recognition, and model matching. Note that in our algorithm the resulting Fourier transform is in continuous form: there is no discretization aliasing, since we can evaluate a Fourier transform coefficient from the continuous form directly.

5. CONCLUSIONS AND DISCUSSIONS

In this paper, we propose an algorithm for computing features of a 2D or 3D mesh model. Explicit methods to compute
the volume, moments, and Fourier transform directly from a mesh representation are given. The algorithm is very efficient and has many potential applications.

The proposed algorithm still has some room for improvement. For example, it is still difficult to obtain the explicit form of a high-order moment for triangles and tetrahedra. Also, the Fourier transform may lose its computational efficiency if many coefficients are required simultaneously. More research is in progress to speed this up.

REFERENCES

[1] R. Carey, G. Bell, and C. Marrin, "The Virtual Reality Modeling Language", Apr. 1997, ISO/IEC DIS 14772-1. [Online]: /Specifications/.
[2] Eric Paquet and Marc Rioux, "A Content-based Search Engine for VRML Database", Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, pp. 541-546, 1998.
[3] Sylvie Jeannin, Leszek Cieplinski, Jens Rainer Ohm, and Munchurl Kim, MPEG-7 Visual part of eXperimentation Model Version 7.0, ISO/IEC JTC1/SC29/WG11/N3521, Beijing, July 2000.
[4] Anthony P. Reeves, R. J. Prokop, Susan E. Andrews, and Frank P. Kuhl, "Three-Dimensional Shape Analysis Using Moments and Fourier Descriptors", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 10, No. 6, pp. 937-943, Nov. 1988.
[5] Homer H. Chen and Thomas S. Huang, "A Survey of Construction and Manipulation of Octrees", Computer Vision, Graphics, and Image Processing, Vol. 43, pp. 409-431, 1988.
[6] Shi-Nine Yang and Tsong-Wuu Lin, "A New Linear Octree Construction by Filling Algorithms", Computers and Communications, 1991. Conference Proceedings. Tenth Annual International Phoenix Conference on, pp. 740-746, 1991.
[7] Yoshifumi Kitamura and Fumio Kishino, "A Parallel Algorithm for Octree Generation from Polyhedral Shape Representation", Pattern Recognition, 1996. Proceedings of the 13th International Conference on, Vol. 3, pp. 303-309, 1996.
[8] James D. Foley, Andries van Dam, Steven K. Feiner, and John F. Hughes, Computer Graphics: Principles and Practice, Second Edition, Addison-Wesley Publishing Company, Inc., 1996.
[9] /projects/3DModelRetrieval/.
[10] Sheue-ling Lien and James T. Kajiya, "A Symbolic Method for Calculating the Integral Properties of Arbitrary Nonconvex Polyhedra", IEEE Computer Graphics and Applications, pp. 35-41, Oct. 1984.

APPENDIX

For an elementary tetrahedron with one vertex at the origin and the other three vertices at $(x_1, y_1, z_1)$, $(x_2, y_2, z_2)$, $(x_3, y_3, z_3)$:

$$M_{000} = \frac{1}{6}\left(x_1 y_2 z_3 - x_1 y_3 z_2 - x_2 y_1 z_3 + x_2 y_3 z_1 + x_3 y_1 z_2 - x_3 y_2 z_1\right)$$

$$M_{100} = \frac{1}{4}\left(x_1 + x_2 + x_3\right) M_{000}$$

$$M_{200} = \frac{1}{10}\left(x_1^2 + x_2^2 + x_3^2 + x_1 x_2 + x_1 x_3 + x_2 x_3\right) M_{000}$$

$$M_{300} = \frac{1}{20}\left(x_1^3 + x_2^3 + x_3^3 + x_1^2(x_2 + x_3) + x_2^2(x_1 + x_3) + x_3^2(x_1 + x_2) + x_1 x_2 x_3\right) M_{000}$$

With the phases $\varphi_0 = 0$ and $\varphi_j = u x_j + v y_j + w z_j$ ($j = 1, 2, 3$), the Fourier transform of the elementary tetrahedron can be written compactly as

$$\Theta(u,v,w) = 6\,M_{000} \sum_{j=0}^{3} \frac{e^{-i\varphi_j}}{\prod_{l \neq j} i\,(\varphi_l - \varphi_j)}.$$
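The elementary-shape summation described in Sections 2 and 3 of the paper above is easy to prototype. Below is a minimal sketch (our illustration, not the authors' code) that computes the volume and the first-order moments of a closed, consistently oriented triangle mesh by summing signed tetrahedra formed with the origin; dividing the first-order moments by the volume gives the centroid, e.g., for centering a model before the PCA alignment of Section 4.

```python
import numpy as np

def signed_tetra_volume(v1, v2, v3):
    # Signed volume of the tetrahedron (O, v1, v2, v3); cf. Eqs. (5)-(6).
    return np.linalg.det(np.stack([v1, v2, v3])) / 6.0

def volume_and_first_moments(vertices, triangles):
    # Sum signed elementary tetrahedra formed with the origin over all triangles
    # of a closed, consistently oriented mesh.
    M000 = M100 = M010 = M001 = 0.0
    for i, j, k in triangles:
        v1, v2, v3 = vertices[i], vertices[j], vertices[k]
        v = signed_tetra_volume(v1, v2, v3)
        M000 += v
        # First-order moments of one tetrahedron: integral of x = V*(x1+x2+x3)/4.
        M100 += v * (v1[0] + v2[0] + v3[0]) / 4.0
        M010 += v * (v1[1] + v2[1] + v3[1]) / 4.0
        M001 += v * (v1[2] + v2[2] + v3[2]) / 4.0
    s = 1.0 if M000 >= 0 else -1.0          # winding gave a negative total: flip
    return abs(M000), s * M100, s * M010, s * M001

# Usage: V, Mx, My, Mz = volume_and_first_moments(vertices, triangles)
#        centroid = (Mx / V, My / V, Mz / V)
```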
VoxelNet:End-to-End Learning for Point Cloud Based3D Object DetectionYin ZhouApple Inc****************Oncel TuzelApple Inc****************AbstractAccurate detection of objects in3D point clouds is a central problem in many applications,such as autonomous navigation,housekeeping robots,and augmented/virtual re-ality.To interface a highly sparse LiDAR point cloud with a region proposal network(RPN),most existing efforts have focused on hand-crafted feature representations,for exam-ple,a bird’s eye view projection.In this work,we remove the need of manual feature engineering for3D point clouds and propose VoxelNet,a generic3D detection network that unifies feature extraction and bounding box prediction into a single stage,end-to-end trainable deep network.Specifi-cally,VoxelNet divides a point cloud into equally spaced3D voxels and transforms a group of points within each voxel into a unified feature representation through the newly in-troduced voxel feature encoding(VFE)layer.In this way, the point cloud is encoded as a descriptive volumetric rep-resentation,which is then connected to a RPN to generate detections.Experiments on the KITTI car detection bench-mark show that VoxelNet outperforms the state-of-the-art LiDAR based3D detection methods by a large margin.Fur-thermore,our network learns an effective discriminative representation of objects with various geometries,leading to encouraging results in3D detection of pedestrians and cyclists,based on only LiDAR.1.IntroductionPoint cloud based3D object detection is an important component of a variety of real-world applications,such as autonomous navigation[11,14],housekeeping robots[26], and augmented/virtual reality[27].Compared to image-based detection,LiDAR provides reliable depth informa-tion that can be used to accurately localize objects and characterize their shapes[21,5].However,unlike im-ages,LiDAR point clouds are sparse and have highly vari-able point density,due to factors such as non-uniform sampling of the3D space,effective range of the sensors, occlusion,and the relative pose.To handle these chal-lenges,many approaches manually crafted featurerepresen-Figure1.V oxelNet directly operates on the raw point cloud(no need for feature engineering)and produces the3D detection re-sults using a single end-to-end trainable network.tations for point clouds that are tuned for3D object detec-tion.Several methods project point clouds into a perspec-tive view and apply image-based feature extraction tech-niques[28,15,22].Other approaches rasterize point clouds into a3D voxel grid and encode each voxel with hand-crafted features[41,9,37,38,21,5].However,these man-ual design choices introduce an information bottleneck that prevents these approaches from effectively exploiting3D shape information and the required invariances for the de-tection task.A major breakthrough in recognition[20]and detection[13]tasks on images was due to moving from hand-crafted features to machine-learned features.Recently,Qi et al.[29]proposed PointNet,an end-to-end deep neural network that learns point-wise features di-rectly from point clouds.This approach demonstrated im-pressive results on3D object recognition,3D object part segmentation,and point-wise semantic segmentation tasks.In[30],an improved version of PointNet was introduced which enabled the network to learn local structures at dif-ferent scales.To achieve satisfactory results,these two ap-proaches trained feature transformer networks on all input points(∼1k points).Since typical point clouds obtained 
using LiDARs contain∼100k points,training the architec-1Figure2.V oxelNet architecture.The feature learning network takes a raw point cloud as input,partitions the space into voxels,and transforms points within each voxel to a vector representation characterizing the shape information.The space is represented as a sparse 4D tensor.The convolutional middle layers processes the4D tensor to aggregate spatial context.Finally,a RPN generates the3D detection.tures as in[29,30]results in high computational and mem-ory requirements.Scaling up3D feature learning networks to orders of magnitude more points and to3D detection tasks are the main challenges that we address in this paper.Region proposal network(RPN)[32]is a highly opti-mized algorithm for efficient object detection[17,5,31, 24].However,this approach requires data to be dense and organized in a tensor structure(e.g.image,video)which is not the case for typical LiDAR point clouds.In this pa-per,we close the gap between point set feature learning and RPN for3D detection task.We present V oxelNet,a generic3D detection framework that simultaneously learns a discriminative feature represen-tation from point clouds and predicts accurate3D bounding boxes,in an end-to-end fashion,as shown in Figure2.We design a novel voxel feature encoding(VFE)layer,which enables inter-point interaction within a voxel,by combin-ing point-wise features with a locally aggregated feature. Stacking multiple VFE layers allows learning complex fea-tures for characterizing local3D shape information.Specif-ically,V oxelNet divides the point cloud into equally spaced 3D voxels,encodes each voxel via stacked VFE layers,and then3D convolution further aggregates local voxel features, transforming the point cloud into a high-dimensional volu-metric representation.Finally,a RPN consumes the vol-umetric representation and yields the detection result.This efficient algorithm benefits both from the sparse point struc-ture and efficient parallel processing on the voxel grid.We evaluate V oxelNet on the bird’s eye view detection and the full3D detection tasks,provided by the KITTI benchmark[11].Experimental results show that V oxelNet outperforms the state-of-the-art LiDAR based3D detection methods by a large margin.We also demonstrate that V oxel-Net achieves highly encouraging results in detecting pedes-trians and cyclists from LiDAR point cloud.1.1.Related WorkRapid development of3D sensor technology has moti-vated researchers to develop efficient representations to de-tect and localize objects in point clouds.Some of the earlier methods for feature representation are[39,8,7,19,40,33, 6,25,1,34,2].These hand-crafted features yield satisfac-tory results when rich and detailed3D shape information is available.However their inability to adapt to more complex shapes and scenes,and learn required invariances from data resulted in limited success for uncontrolled scenarios such as autonomous navigation.Given that images provide detailed texture information, many algorithms infered the3D bounding boxes from2D images[4,3,42,43,44,36].However,the accuracy of image-based3D detection approaches are bounded by the accuracy of the depth estimation.Several LIDAR based3D object detection techniques utilize a voxel grid representation.[41,9]encode each nonempty voxel with6statistical quantities that are de-rived from all the points contained within the voxel.[37] fuses multiple local statistics to represent each voxel.[38] computes the truncated signed distance on the voxel grid.[21]uses 
binary encoding for the 3D voxel grid. [5] introduces a multi-view representation for a LiDAR point cloud by computing a multi-channel feature map in the bird's eye view and the cylindrical coordinates in the frontal view. Several other studies project point clouds onto a perspective view and then use image-based feature encoding schemes [28, 15, 22].

There are also several multi-modal fusion methods that combine images and LiDAR to improve detection accuracy [10, 16, 5]. These methods provide improved performance compared to LiDAR-only 3D detection, particularly for small objects (pedestrians, cyclists) or when the objects are far, since cameras provide an order of magnitude more measurements than LiDAR. However, the need for an additional camera that is time synchronized and calibrated with the LiDAR restricts their use and makes the solution more sensitive to sensor failure modes. In this work we focus on LiDAR-only detection.

1.2. Contributions

• We propose a novel end-to-end trainable deep architecture for point-cloud-based 3D detection, VoxelNet, that directly operates on sparse 3D points and avoids information bottlenecks introduced by manual feature engineering.
• We present an efficient method to implement VoxelNet which benefits both from the sparse point structure and efficient parallel processing on the voxel grid.
• We conduct experiments on the KITTI benchmark and show that VoxelNet produces state-of-the-art results on the LiDAR-based car, pedestrian, and cyclist detection benchmarks.

2. VoxelNet

In this section we explain the architecture of VoxelNet, the loss function used for training, and an efficient algorithm to implement the network.

2.1. VoxelNet Architecture

The proposed VoxelNet consists of three functional blocks: (1) a feature learning network, (2) convolutional middle layers, and (3) a region proposal network [32], as illustrated in Figure 2. We provide a detailed introduction to VoxelNet in the following sections.

2.1.1 Feature Learning Network

Voxel Partition. Given a point cloud, we subdivide the 3D space into equally spaced voxels as shown in Figure 2. Suppose the point cloud encompasses a 3D space with range D, H, W along the Z, Y, X axes respectively. We define each voxel to be of size $v_D$, $v_H$, and $v_W$ accordingly. The resulting 3D voxel grid is of size $D' = D/v_D$, $H' = H/v_H$, $W' = W/v_W$. Here, for simplicity, we assume D, H, W are multiples of $v_D$, $v_H$, $v_W$.

Grouping. We group the points according to the voxel they reside in. Due to factors such as distance, occlusion, the object's relative pose, and non-uniform sampling, the LiDAR point cloud is sparse and has highly variable point density throughout the space. Therefore, after grouping, a voxel will contain a variable number of points. An illustration is shown in Figure 2, where Voxel-1 has significantly more points than Voxel-2 and Voxel-4, while Voxel-3 contains no point.

Figure 3. Voxel feature encoding layer.

Random Sampling. Typically a high-definition LiDAR point cloud is composed of ~100k points. Directly processing all the points not only imposes increased memory/efficiency burdens on the computing platform, but the highly variable point density throughout the space might also bias the detection. To this end, we randomly sample a fixed number, T, of points from those voxels containing more than T points. This sampling strategy has two purposes: (1) computational savings (see Section 2.3 for details); and (2) it decreases the imbalance of points between the voxels, which
reduces the sampling bias,and adds more variation to training.Stacked Voxel Feature Encoding The key innovation is the chain of VFE layers.For simplicity,Figure2illustrates the hierarchical feature encoding process for one voxel. Without loss of generality,we use VFE Layer-1to describe the details in the following paragraph.Figure3shows the architecture for VFE Layer-1.Denote V={p i=[x i,y i,z i,r i]T∈R4}i=1...t as a non-empty voxel containing t≤T LiDAR points,where p i contains XYZ coordinates for the i-th point and r i is the received reflectance.Wefirst compute the local mean as the centroid of all the points in V,denoted as(v x,v y,v z). Then we augment each point p i with the relative offset w.r.t. the centroid and obtain the input feature set V in={ˆp i= [x i,y i,z i,r i,x i−v x,y i−v y,z i−v z]T∈R7}i=1...t.Next, eachˆp i is transformed through the fully connected network (FCN)into a feature space,where we can aggregate in-formation from the point features f i∈R m to encode the shape of the surface contained within the voxel.The FCN is composed of a linear layer,a batch normalization(BN) layer,and a rectified linear unit(ReLU)layer.After obtain-ing point-wise feature representations,we use element-wise MaxPooling across all f i associated to V to get the locally aggregated feature˜f∈R m for V.Finally,we augmenteach f i with˜f to form the point-wise concatenated featureas f outi =[f T i,˜f T]T∈R2m.Thus we obtain the outputfeature set V out={f outi }i...t.All non-empty voxels areencoded in the same way and they share the same set of parameters in FCN.We use VFE-i(c in,c out)to represent the i-th VFE layer that transforms input features of dimension c in into output features of dimension c out.The linear layer learns a ma-trix of size c in×(c out/2),and the point-wise concatenation yields the output of dimension c out.Because the output feature combines both point-wise features and locally aggregated feature,stacking VFE lay-ers encodes point interactions within a voxel and enables thefinal feature representation to learn descriptive shape information.The voxel-wise feature is obtained by trans-forming the output of VFE-n into R C via FCN and apply-ing element-wise Maxpool where C is the dimension of the voxel-wise feature,as shown in Figure2.Sparse Tensor Representation By processing only the non-empty voxels,we obtain a list of voxel features,each uniquely associated to the spatial coordinates of a particu-lar non-empty voxel.The obtained list of voxel-wise fea-tures can be represented as a sparse4D tensor,of size C×D ×H ×W as shown in Figure2.Although the point cloud contains∼100k points,more than90%of vox-els typically are empty.Representing non-empty voxel fea-tures as a sparse tensor greatly reduces the memory usage and computation cost during backpropagation,and it is a critical step in our efficient implementation.2.1.2Convolutional Middle LayersWe use Conv M D(c in,c out,k,s,p)to represent an M-dimensional convolution operator where c in and c out are the number of input and output channels,k,s,and p are the M-dimensional vectors corresponding to kernel size,stride size and padding size respectively.When the size across the M-dimensions are the same,we use a scalar to represent the size e.g.k for k=(k,k,k).Each convolutional middle layer applies3D convolution,BN layer,and ReLU layer sequentially.The convolutional middle layers aggregate voxel-wise features within a pro-gressively expanding receptivefield,adding more context to the shape description.The detailed sizes of 
the filters in the convolutional middle layers are explained in Section 3.

2.1.3 Region Proposal Network

Recently, region proposal networks [32] have become an important building block of top-performing object detection frameworks [38, 5, 23]. In this work, we make several key modifications to the RPN architecture proposed in [32], and combine it with the feature learning network and convolutional middle layers to form an end-to-end trainable pipeline.

The input to our RPN is the feature map provided by the convolutional middle layers. The architecture of this network is illustrated in Figure 4. The network has three blocks of fully convolutional layers. The first layer of each block downsamples the feature map by half via a convolution with a stride size of 2, followed by a sequence of convolutions of stride 1 (×q means q applications of the filter). After each convolution layer, BN and ReLU operations are applied. We then upsample the output of every block to a fixed size and concatenate the results to construct a high-resolution feature map. Finally, this feature map is mapped to the desired learning targets: (1) a probability score map and (2) a regression map.

2.2. Loss Function

Let $\{a_i^{pos}\}_{i=1...N_{pos}}$ be the set of $N_{pos}$ positive anchors and $\{a_j^{neg}\}_{j=1...N_{neg}}$ be the set of $N_{neg}$ negative anchors. We parameterize a 3D ground truth box as $(x_c^g, y_c^g, z_c^g, l^g, w^g, h^g, \theta^g)$, where $x_c^g, y_c^g, z_c^g$ represent the center location, $l^g, w^g, h^g$ are the length, width, and height of the box, and $\theta^g$ is the yaw rotation around the Z-axis. To retrieve the ground truth box from a matching positive anchor parameterized as $(x_c^a, y_c^a, z_c^a, l^a, w^a, h^a, \theta^a)$, we define the residual vector $u^* \in \mathbb{R}^7$ containing the 7 regression targets corresponding to the center location $\Delta x, \Delta y, \Delta z$, the three dimensions $\Delta l, \Delta w, \Delta h$, and the rotation $\Delta\theta$, which are computed as:

$$\Delta x = \frac{x_c^g - x_c^a}{d^a}, \quad \Delta y = \frac{y_c^g - y_c^a}{d^a}, \quad \Delta z = \frac{z_c^g - z_c^a}{h^a},$$
$$\Delta l = \log\left(\frac{l^g}{l^a}\right), \quad \Delta w = \log\left(\frac{w^g}{w^a}\right), \quad \Delta h = \log\left(\frac{h^g}{h^a}\right), \qquad (1)$$
$$\Delta\theta = \theta^g - \theta^a$$

where $d^a = \sqrt{(l^a)^2 + (w^a)^2}$ is the diagonal of the base of the anchor box. Here, we aim to directly estimate the oriented 3D box and normalize $\Delta x$ and $\Delta y$ homogeneously with the diagonal $d^a$, which is different from [32, 38, 22, 21, 4, 3, 5]. We define the loss function as follows:

$$L = \alpha \frac{1}{N_{pos}} \sum_i L_{cls}(p_i^{pos}, 1) + \beta \frac{1}{N_{neg}} \sum_j L_{cls}(p_j^{neg}, 0) + \frac{1}{N_{pos}} \sum_i L_{reg}(u_i, u_i^*) \qquad (2)$$

where $p_i^{pos}$ and $p_j^{neg}$ represent the softmax outputs for positive anchor $a_i^{pos}$ and negative anchor $a_j^{neg}$ respectively, while $u_i \in \mathbb{R}^7$ and $u_i^* \in \mathbb{R}^7$ are the regression output and ground truth for positive anchor $a_i^{pos}$. The first two terms are the normalized classification losses for $\{a_i^{pos}\}_{i=1...N_{pos}}$ and $\{a_j^{neg}\}_{j=1...N_{neg}}$, where $L_{cls}$ stands for the binary cross entropy loss and $\alpha$, $\beta$ are positive constants balancing their relative importance. The last term $L_{reg}$ is the regression loss, for which we use the SmoothL1 function [12, 32].

Figure 5. Illustration of efficient implementation.

2.3. Efficient Implementation

GPUs are optimized for processing dense tensor structures. The problem with working directly on the point cloud is that the points are sparsely distributed across space and each voxel has a variable number of points. We devised a method that converts the point cloud into a dense tensor structure where stacked VFE operations can be processed in parallel across points and voxels.

The method is summarized in Figure 5. We initialize a K × T × 7 dimensional tensor structure to store the voxel input feature buffer
where K is the maximum number of non-empty voxels,T is the maximum number of points per voxel,and 7is the input encoding dimension for each point.The points are randomized before processing.For each point in the point cloud,we check if the corresponding voxel already exists.This lookup operation is done effi-ciently in O (1)using a hash table where the voxel coordi-nate is used as the hash key.If the voxel is already initial-ized we insert the point to the voxel location if there are less than T points,otherwise the point is ignored.If the voxel is not initialized,we initialize a new voxel,store its coordi-nate in the voxel coordinate buffer,and insert the point to this voxel location.The voxel input feature and coordinate buffers can be constructed via a single pass over the point list,therefore its complexity is O (n ).To further improve the memory/compute efficiency it is possible to only store a limited number of voxels (K )and ignore points coming from voxels with few points.After the voxel input buffer is constructed,the stacked VFE only involves point level and voxel level dense oper-ations which can be computed on a GPU in parallel.Note that,after concatenation operations in VFE,we reset the features corresponding to empty points to zero such that they do not affect the computed voxel features.Finally,using the stored coordinate buffer we reorganize the com-puted sparse voxel-wise structures to the dense voxel grid.The following convolutional middle layers and RPN oper-ations work on a dense voxel grid which can be efficiently implemented on a GPU.3.Training DetailsIn this section,we explain the implementation details of the V oxelNet and the training procedure.work DetailsOur experimental setup is based on the LiDAR specifi-cations of the KITTI dataset [11].Car Detection For this task,we consider point clouds within the range of [−3,1]×[−40,40]×[0,70.4]meters along Z,Y ,X axis respectively.Points that are projected outside of image boundaries are removed [5].We choose a voxel size of v D =0.4,v H =0.2,v W =0.2meters,which leads to D =10,H =400,W =352.We set T =35as the maximum number of randomly sam-pled points in each non-empty voxel.We use two VFE layers VFE-1(7,32)and VFE-2(32,128).The final FCN maps VFE-2output to R 128.Thus our feature learning net generates a sparse tensor of shape 128×10×400×352.To aggregate voxel-wise features,we employ three convo-lution middle layers sequentially as Conv3D(128,64,3,(2,1,1),(1,1,1)),Conv3D(64,64,3,(1,1,1),(0,1,1)),andConv3D(64,64,3,(2,1,1),(1,1,1)),which yields a4D ten-sor of size64×2×400×352.After reshaping,the input to RPN is a feature map of size128×400×352,where the dimensions correspond to channel,height,and width of the3D tensor.Figure4illustrates the detailed network ar-chitecture for this task.Unlike[5],we use only one anchor size,l a=3.9,w a=1.6,h a=1.56meters,centered at z a c=−1.0meters with two rotations,0and90degrees. 
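The voxel partition, grouping, and random sampling steps of Section 2.1.1, with the car-detection settings just quoted, can be sketched as follows. This is a minimal NumPy illustration rather than the paper's GPU implementation (which builds the K × T × 7 buffer described in Section 2.3); the function name and the fixed random seed are our own.

```python
import numpy as np

# Car-detection settings quoted above (Z, Y, X ranges in meters; vD, vH, vW).
Z_RANGE, Y_RANGE, X_RANGE = (-3.0, 1.0), (-40.0, 40.0), (0.0, 70.4)
VOXEL = np.array([0.4, 0.2, 0.2])   # (vD, vH, vW)
T = 35                               # max points kept per voxel

def group_points_into_voxels(points, seed=0):
    # points: (N, 4) array of x, y, z, reflectance.  Returns a dict mapping a
    # voxel index (d, h, w) to at most T points, mirroring grouping + sampling.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= X_RANGE[0]) & (x < X_RANGE[1]) &
            (y >= Y_RANGE[0]) & (y < Y_RANGE[1]) &
            (z >= Z_RANGE[0]) & (z < Z_RANGE[1]))
    pts = points[keep]
    d = ((pts[:, 2] - Z_RANGE[0]) / VOXEL[0]).astype(int)
    h = ((pts[:, 1] - Y_RANGE[0]) / VOXEL[1]).astype(int)
    w = ((pts[:, 0] - X_RANGE[0]) / VOXEL[2]).astype(int)
    voxels = {}
    for idx, p in zip(zip(d, h, w), pts):
        voxels.setdefault(idx, []).append(p)
    rng = np.random.default_rng(seed)
    return {k: np.stack(v) if len(v) <= T
            else rng.permutation(np.stack(v))[:T]
            for k, v in voxels.items()}
```

Each retained point would then be augmented with its offset to the voxel centroid to form the 7-dimensional VFE input described in Section 2.1.1.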
Our anchor matching criteria is as follows:An anchor is considered as positive if it has the highest Intersection over Union(IoU)with a ground truth or its IoU with ground truth is above0.6(in bird’s eye view).An anchor is considered as negative if the IoU between it and all ground truth boxes is less than0.45.We treat anchors as don’t care if they have 0.45≤IoU≤0.6with any ground truth.We setα=1.5 andβ=1in Eqn.2.Pedestrian and Cyclist Detection The input range1is [−3,1]×[−20,20]×[0,48]meters along Z,Y,X axis re-spectively.We use the same voxel size as for car detection, which yields D=10,H=200,W=240.We set T=45 in order to obtain more LiDAR points for better capturing shape information.The feature learning network and con-volutional middle layers are identical to the networks used in the car detection task.For the RPN,we make one mod-ification to block1in Figure4by changing the stride size in thefirst2D convolution from2to1.This allowsfiner resolution in anchor matching,which is necessary for de-tecting pedestrians and cyclists.We use anchor size l a= 0.8,w a=0.6,h a=1.73meters centered at z a c=−0.6 meters with0and90degrees rotation for pedestrian detec-tion and use anchor size l a=1.76,w a=0.6,h a=1.73 meters centered at z a c=−0.6with0and90degrees rota-tion for cyclist detection.The specific anchor matching cri-teria is as follows:We assign an anchor as postive if it has the highest IoU with a ground truth,or its IoU with ground truth is above0.5.An anchor is considered as negative if its IoU with every ground truth is less than0.35.For anchors having0.35≤IoU≤0.5with any ground truth,we treat them as don’t care.During training,we use stochastic gradient descent (SGD)with learning rate0.01for thefirst150epochs and decrease the learning rate to0.001for the last10epochs. We use a batchsize of16point clouds.3.2.Data AugmentationWith less than4000training point clouds,training our network from scratch will inevitably suffer from overfitting. To reduce this issue,we introduce three different forms of data augmentation.The augmented training data are gener-ated on-the-fly without the need to be stored on disk[20].1Our empirical observation suggests that beyond this range,LiDAR returns from pedestrians and cyclists become very sparse and therefore detection results will be unreliable.Define set M={p i=[x i,y i,z i,r i]T∈R4}i=1,...,N as the whole point cloud,consisting of N points.We parame-terize a3D bouding box b i as(x c,y c,z c,l,w,h,θ),where x c,y c,z c are center locations,l,w,h are length,width, height,andθis the yaw rotation around Z-axis.We de-fineΩi={p|x∈[x c−l/2,x c+l/2],y∈[y c−w/2,y c+ w/2],z∈[z c−h/2,z c+h/2],p∈M}as the set con-taining all LiDAR points within b i,where p=[x,y,z,r] denotes a particular LiDAR point in the whole set M.Thefirst form of data augmentation applies perturbation independently to each ground truth3D bounding box to-gether with those LiDAR points within the box.Specifi-cally,around Z-axis we rotate b i and the associatedΩi with respect to(x c,y c,z c)by a uniformally distributed random variable∆θ∈[−π/10,+π/10].Then we add a translation (∆x,∆y,∆z)to the XYZ components of b i and to each point inΩi,where∆x,∆y,∆z are drawn independently from a Gaussian distribution with mean zero and standard deviation1.0.To avoid physically impossible outcomes,we perform a collision test between any two boxes after the per-turbation and revert to the original if a collision is detected. 
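The first augmentation form (per-box perturbation) can be sketched as follows. This is an illustrative re-implementation based only on the description above, not the authors' code: it rotates one box and its points about the box center by a uniform Δθ ∈ [−π/10, π/10] and translates them by zero-mean, unit-variance Gaussian offsets; the collision test against the other boxes is indicated by a comment.

```python
import numpy as np

def perturb_box(points_in_box, box, rng, max_rot=np.pi / 10.0, sigma_t=1.0):
    # box = (xc, yc, zc, l, w, h, theta); points_in_box = (M, 4) LiDAR points.
    xc, yc, zc, l, w, h, theta = box
    dtheta = rng.uniform(-max_rot, max_rot)
    c, s = np.cos(dtheta), np.sin(dtheta)
    R = np.array([[c, -s], [s, c]])
    xy = points_in_box[:, :2] - [xc, yc]
    points_in_box = points_in_box.copy()
    points_in_box[:, :2] = xy @ R.T + [xc, yc]   # rotate points about box center
    dx, dy, dz = rng.normal(0.0, sigma_t, size=3)  # N(0, 1) translation offsets
    points_in_box[:, :3] += [dx, dy, dz]
    new_box = (xc + dx, yc + dy, zc + dz, l, w, h, theta + dtheta)
    # A collision test between new_box and all other boxes should follow;
    # revert to the original box and points if an overlap is detected.
    return points_in_box, new_box
```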
Since the perturbation is applied to each ground truth box and the associated LiDAR points independently,the net-work is able to learn from substantially more variations than from the original training data.Secondly,we apply global scaling to all ground truth boxes b i and to the whole point cloud M.Specifically, we multiply the XYZ coordinates and the three dimen-sions of each b i,and the XYZ coordinates of all points in M with a random variable drawn from uniform distri-bution[0.95,1.05].Introducing global scale augmentation improves robustness of the network for detecting objects with various sizes and distances as shown in image-based classification[35,18]and detection tasks[12,17].Finally,we apply global rotation to all ground truth boxes b i and to the whole point cloud M.The rotation is applied along Z-axis and around(0,0,0).The global ro-tation offset is determined by sampling from uniform dis-tribution[−π/4,+π/4].By rotating the entire point cloud, we simulate the vehicle making a turn.4.ExperimentsWe evaluate V oxelNet on the KITTI3D object detection benchmark[11]which contains7,481training images/point clouds and7,518test images/point clouds,covering three categories:Car,Pedestrian,and Cyclist.For each class, detection outcomes are evaluated based on three difficulty levels:easy,moderate,and hard,which are determined ac-cording to the object size,occlusion state,and truncation level.Since the ground truth for the test set is not avail-able and the access to the test server is limited,we con-duct comprehensive evaluation using the protocol described in[4,3,5]and subdivide the training data into a training setMethod ModalityCar Pedestrian CyclistEasy Moderate Hard Easy Moderate Hard Easy Moderate HardMono3D[3]Mono 5.22 5.19 4.13N/A N/A N/A N/A N/A N/A 3DOP[4]Stereo12.639.497.59N/A N/A N/A N/A N/A N/A VeloFCN[22]LiDAR40.1432.0830.47N/A N/A N/A N/A N/A N/A MV(BV+FV)[5]LiDAR86.1877.3276.33N/A N/A N/A N/A N/A N/A MV(BV+FV+RGB)[5]LiDAR+Mono86.5578.1076.67N/A N/A N/A N/A N/A N/A HC-baseline LiDAR88.2678.4277.6658.9653.7951.4763.6342.7541.06 V oxelNet LiDAR89.6084.8178.5765.9561.0556.9874.4152.1850.49 Table1.Performance comparison in bird’s eye view detection:average precision(in%)on KITTI validation set.Method ModalityCar Pedestrian CyclistEasy Moderate Hard Easy Moderate Hard Easy Moderate HardMono3D[3]Mono 2.53 2.31 2.31N/A N/A N/A N/A N/A N/A 3DOP[4]Stereo 6.55 5.07 4.10N/A N/A N/A N/A N/A N/A VeloFCN[22]LiDAR15.2013.6615.98N/A N/A N/A N/A N/A N/A MV(BV+FV)[5]LiDAR71.1956.6055.30N/A N/A N/A N/A N/A N/A MV(BV+FV+RGB)[5]LiDAR+Mono71.2962.6856.56N/A N/A N/A N/A N/A N/A HC-baseline LiDAR71.7359.7555.6943.9540.1837.4855.3536.0734.15 V oxelNet LiDAR81.9765.4662.8557.8653.4248.8767.1747.6545.11 Table2.Performance comparison in3D detection:average precision(in%)on KITTI validation set.and a validation set,which results in3,712data samples for training and3,769data samples for validation.The split avoids samples from the same sequence being included in both the training and the validation set[3].Finally we also present the test results using the KITTI server.For the Car category,we compare the proposed method with several top-performing algorithms,including image based approaches:Mono3D[3]and3DOP[4];LiDAR based approaches:VeloFCN[22]and3D-FCN[21];and a multi-modal approach MV[5].Mono3D[3],3DOP[4]and MV[5]use a pre-trained model for initialization whereas we train V oxelNet from scratch using only the LiDAR data provided in KITTI.To analyze the importance of end-to-end learning,we implement a 
strong baseline that is derived from the VoxelNet architecture but uses hand-crafted features instead of the proposed feature learning network. We call this model the hand-crafted baseline (HC-baseline). HC-baseline uses the bird's eye view features described in [5], which are computed at 0.1 m resolution. Different from [5], we increase the number of height channels from 4 to 16 to capture more detailed shape information; further increasing the number of height channels did not lead to performance improvement. We replace the convolutional middle layers of VoxelNet with similarly sized 2D convolutional layers, namely Conv2D(16, 32, 3, 1, 1), Conv2D(32, 64, 3, 2, 1), and Conv2D(64, 128, 3, 1, 1). Finally, the RPN is identical in VoxelNet and HC-baseline. The total numbers of parameters in HC-baseline and VoxelNet are very similar. We train the HC-baseline using the same training procedure and data augmentation described in Section 3.

4.1. Evaluation on the KITTI Validation Set

Metrics. We follow the official KITTI evaluation protocol, where the IoU threshold is 0.7 for the class Car and 0.5 for the classes Pedestrian and Cyclist. The IoU threshold is the same for both bird's eye view and full 3D evaluation. We compare the methods using the average precision (AP) metric.

Evaluation in Bird's Eye View. The evaluation results are presented in Table 1. VoxelNet consistently outperforms all the competing approaches across all three difficulty levels. HC-baseline also achieves satisfactory performance compared to the state of the art [5], which shows that our base region proposal network (RPN) is effective. For the Pedestrian and Cyclist detection tasks in bird's eye view, we compare the proposed VoxelNet with the HC-baseline. VoxelNet yields substantially higher AP than the HC-baseline for these more challenging categories, which shows that end-to-end learning is essential for point-cloud-based detection.

We would like to note that [21] reported 88.9%, 77.3%, and 72.7% for the easy, moderate, and hard levels respectively, but these results were obtained on a different split of 6,000 training frames and roughly 1,500 validation frames, so they are not directly comparable with the algorithms in Table 1. Therefore, we do not include these results in the table.

Evaluation in 3D. Compared to bird's eye view detection, which requires only accurate localization of objects in the 2D plane, 3D detection is a more challenging task as it requires finer localization of shapes in 3D space. Table 2 summarizes the comparison. For the class Car, VoxelNet significantly outperforms all other approaches in AP across all difficulty levels. Specifically, using only LiDAR, VoxelNet significantly outperforms the …
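The global scaling and rotation augmentations described in Section 3 above are straightforward to reproduce on raw point clouds and boxes. The following is a minimal NumPy sketch, not the authors' code: the array layouts and function names are our own assumptions, with boxes stored as (x, y, z, l, w, h, yaw) rows and points as (x, y, z) rows.

```python
import numpy as np

def global_scaling(points, boxes, low=0.95, high=1.05, rng=np.random):
    """Scale the whole scene and all boxes by one random factor drawn from [low, high]."""
    s = rng.uniform(low, high)
    points, boxes = points.copy(), boxes.copy()
    points[:, :3] *= s          # XYZ of every LiDAR point
    boxes[:, :3] *= s           # box centers
    boxes[:, 3:6] *= s          # box dimensions (l, w, h)
    return points, boxes

def global_rotation(points, boxes, max_angle=np.pi / 4, rng=np.random):
    """Rotate the whole scene about the Z-axis through the origin by a random angle."""
    theta = rng.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    points, boxes = points.copy(), boxes.copy()
    points[:, :3] = points[:, :3] @ R.T
    boxes[:, :3] = boxes[:, :3] @ R.T
    boxes[:, 6] += theta        # yaw angles rotate together with the scene
    return points, boxes

# Toy usage: one point and one box
pts = np.array([[10.0, 2.0, -1.0]])
bxs = np.array([[10.0, 2.0, -0.5, 3.9, 1.6, 1.5, 0.0]])  # x, y, z, l, w, h, yaw
pts, bxs = global_scaling(pts, bxs)
pts, bxs = global_rotation(pts, bxs)
```

The per-box perturbation mentioned first would be applied in the same spirit, but to each box and its enclosed points independently rather than to the whole scene.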
3D Model based Object Class Detection in An Arbitrary View Pingkun Yan,Saad M.Khan,Mubarak ShahSchool of Electrical Engineering and Computer ScienceUniversity of Central Florida/˜visionAbstractIn this paper,a novel object class detection method based on3D object modeling is presented.Instead of using a complicated mechanism for relating multiple2D training views,the proposed method establishes spatial connections between these views by mapping them directly to the surface of3D model.The3D shape of an object is reconstructed by using a homographic framework from a set of model views around the object and is represented by a volume consisting of binary slices.Features are computed in each2D model view and mapped to the3D shape model using the same ho-mographic framework.To generalize the model for object class detection,features from supplemental views are also considered.A codebook is constructed from all of these fea-tures and then a3D feature model is built.Given a2D test image,correspondences between the3D feature model and the testing view are identified by matching the detected fea-tures.Based on the3D locations of the corresponding fea-tures,several hypotheses of viewing planes can be made. The one with the highest confidence is then used to detect the object using feature location matching.Performance of the proposed method has been evaluated by using the PASCAL VOC challenge dataset and promising results are demonstrated.1.IntroductionIn recent years,the problem of object detection has re-ceived considerable attention from both the computer vi-sion and machine learning communities.The key challenge of this problem is the ability to recognize any member in a category of objects in spite of wide variations in visual appearance due to geometrical transformations,change in viewpoint,or illumination.In this paper,a novel3D feature model based object class detection method is proposed to deal with these challenges. The objective of this work is to detect the object given an arbitrary2D view using a general3D feature model of the class.In our work,the objects can be arbitrarily transformed (with translation and rotation),and the viewing position and orientation of the camera is arbitrary as well.In addition, camera parameters are assumed to be unknown.Object detection in such a setting has been considered a very challenging problem due to various difficulties of ge-ometrically modeling relevant3D object shapes and the ef-fects of perspective projection.In this paper,we exploit a recently proposed3D reconstruction method using ho-mographic framework for3D object shape reconstruction. 
Given a set of2D images of an object taken from differ-ent viewpoints around the object with unknown camera pa-rameters,which are called model views,the3D shape of this specific object can be reconstructed using the homo-graphic framework proposed in[10].In our work,3D shape is represented by a volume consisting of binary slices with 1denoting the object and0for background.By using this method,we can not only reconstruct3D shapes for the ob-jects to be detected,but also have access to the homogra-phies between the2D views and the3D models,which are then used to build the3D feature model for object class de-tection.In the feature modeling phase of the proposed method, SIFT features[12]are computed for each of the2D model views and mapped to the surface of the3D model.Since it is difficult to accurately relate2D coordinates to a3D model by projecting the3D model to a2D view(with unknown camera parameters),we propose to use a homography trans-formation based algorithm.Since the homographies have been obtained during the3D shape reconstruction process, the projection of a3D model can be easily computed by in-tegrating the transformations of slices from the model to a particular view,as opposed to directly projecting the entire model by estimation of the projection matrix.To generalize the model for object class detection,images of other objects of the class are used as supplemental views.Features from these views are mapped to the3D model in the same way as for those model views.A codebook is constructed from all of these features and then a3D feature model is built.The 3D feature model thus combines the3D shape information and appearance features for robust object class detection.Given a new2D test image,correspondences between the3D feature model and this testing view are identified by matching feature.Based on the3D locations of the cor-responding features,several hypotheses of viewing planes can be made.For each hypothesis,the feature points are projected to the viewing plane and aligned with the features in the2D testing view.A confidence is assigned to each hypothesis and the one with the highest confidence is then used to produce the object detection result.The rest of the paper is organized as follows.Section2 provides a brief review of related work.A summary of the 3D shape model construction method is presented in Sec-tion3.The3D feature model and our object class detection approach are presented in Sections4and5,respectively. Section6provides the detection results,andfinally,Sec-tion7concludes the work.2.Related WorkAs the approaches for recognizing an object class from some particular viewpoint or detecting a specific object from an arbitrary view are advancing toward maturity[3, 9,11],solutions to the problem of object class detection using multiple views are still relatively far behind.Object detection can be considered even more difficult than clas-sification,since it is expected to provide accurate location and size of the object.Researchers in computer vision have studied the problem of multi-view object class detection resulting successful ap-proaches following two major directions.One path attempts to use increasing number of local features by applying mul-tiple feature detectors simultaneously[1,6,13,14,15].It has been shown that the recognition performance can be benefited by providing more feature support.However,the spatial connections of the features in each view and/or be-tween different views have not been pursued in these works. 
These connections can be crucial in object class detection tasks. Recently, much attention has been drawn to the second direction, related to multiple views for object class detection [5, 7, 8]. The early methods apply several single-view detectors independently and combine their responses via some arbitration logic. Features are shared among the different single-view detectors to limit the computational overload. Most recently, Thomas et al. [16] developed a single integrated multi-view detector that accumulates evidence from different training views. Their work combines a multi-view specific object recognition system [9] and the Implicit Shape Model for object class detection [11], where single-view codebooks are strongly connected by the exchange of information via sophisticated activation links between each other.

In this paper, we present a unified method to relate multiple 2D views based on 3D object modeling. The main contribution of our work is an efficient object detection system capable of recognizing and localizing objects from the same class under different viewing conditions. Thus, the 3D locations of the features are considered during detection and better accuracy is obtained.

3. Construction of 3D Shape Model

A recently proposed homographic framework is employed in our work to construct 3D models from multiple 2D views [10]. A summary of the 3D reconstruction algorithm is provided as follows.

3.1. View Warping and Intersection

Let I_i denote the foreground likelihood map (where each pixel value is the likelihood of that pixel being foreground) in the i-th view of a total of M views. Considering a reference plane π_r in the scene with homography H_{π_r,i} from the i-th view to π_r, warping I_i to π_r gives the warped foreground likelihood map:

\hat{I}_{i,r} = [H_{\pi_r,i}] \, I_i .   (1)

The visual hull intersection on π_r (AND-fusion of the shadow regions) is achieved by multiplying these warped foreground likelihood maps:

\theta_r = \prod_{i=1}^{M} \hat{I}_{i,r} ,   (2)

where θ_r is the grid of object occupancy likelihoods on plane π_r. Each value in θ_r gives the likelihood of this grid location being inside the body of the object, indeed representing a slice of the object cut out by plane π_r. It should be noted that, due to the multiplication step in (2), the locations outside the visual hull intersection region will be penalized and thus have a much lower occupancy likelihood.

3.2. Construction in Parallel Planes

The grid of object occupancy likelihood can be computed at an arbitrary number of planes in the scene with different heights, each giving a slice of the object. Naturally this does not apply to planes that do not pass through the object's body, since visual hull intersection on these planes will be empty; therefore a separate check is not necessary.

Let v_x, v_y, and v_z denote the vanishing points for the X, Y, and Z directions, respectively, and l be the normalized vanishing line of the reference plane in the XYZ coordinate space. The reference plane to image view homography can be represented as

\hat{H}_{ref} = [\, v_x \; v_y \; l \,] .   (3)

Supposing that another plane π has a translation of z along the reference direction Z from the reference plane, it is easy to show that the homography of plane π to the image view can be computed by

\hat{H}_{\pi} = [\, v_x \; v_y \; \alpha z\, v_z + l \,] = \hat{H}_{ref} + [\, 0 \,|\, \alpha z\, v_z \,] ,   (4)

where α is a scaling factor. The image to plane homography H_π is obtained by inverting \hat{H}_π.

Figure 1. Illustration of the equivalence of 3D to 2D projection and plane transformation using homographies. (a) Volume projection: a 2D view of a 3D volume V is generated by projecting the volume onto an image plane. (b) Plane transformation: the same view can be obtained by integrating the transformation of each slice in the volume to the image plane using homographies.

Starting with a reference plane in the scene (typically the ground plane), visual hull intersection is performed on successively parallel planes in the up direction along the body of the object. The occupancy grids θ_i are stacked up to create a three-dimensional data structure Θ = [θ_1; θ_2; ...; θ_M]. Θ represents a discrete sampling of a continuous occupancy space encapsulating the object shape. The object structure is then segmented out from Θ by dividing the space into object and background regions using the geodesic active contour method [2]. By using the above homography-based framework, 3D models for different objects can be constructed.

4. 3D Feature Model Description and Training

In the proposed method, not only the 3D shape of the target object is exploited, but also its appearance features. We relate the features with the 3D model to construct a feature model for object class detection.

4.1. Attaching 2D Features to the 3D Model

The features used in our work are computed using the SIFT feature detector [12]. Feature vectors are computed for all of the training images. In order to efficiently relate the features computed from different views and different objects, all the detected features are attached to the 3D surface of the previously built model. By using the 3D feature model, we avoid storing all the 2D training views; thus there is no need to build complicated connections between the views. The spatial relationships between the feature points from different views are readily available and can be easily retrieved when matched feature points are found.

The features computed in 2D images are attached to the 3D model by using the novel homographic framework. Instead of directly finding the 3D location of each 2D feature, we map the 3D points from the model's surface to the 2D views and find the corresponding features. Our method does not require the estimation of a projection matrix from the 3D model to a 2D image plane, which is a non-trivial problem. In our work, the problem is successfully solved by transforming the model to various image planes using homographies. Since the homographies between the model and the image planes have already been obtained during the construction of the 3D model, we are able to map the 3D points to 2D planes using homography transformations.

In our work, a 3D shape is represented by a binary volume V, which consists of K slices S_j, j ∈ [1, K]. As shown in Fig. 1(b), each slice of the object is transformed to a 2D image plane by using the corresponding homography \hat{H} in (4). The transformed slice accounts for a small patch of the object projection. Integrating all these K patches together, the whole projection of the 3D object in the 2D image plane can be produced. In this way, we obtain the model projection by using a series of simple homography transformations, and the hard problem of estimating the projection matrix of a 3D model to a 2D view is avoided.

In our method, the 3D shapes are represented using binary volumes with a stack of slices along the reference direction. Thus, the surface points can be easily obtained by applying edge detection techniques. After transforming the surface points to 2D planes, feature vectors computed in 2D can be related to the 3D points according to their locations. That is the way the 3D feature model is built.

4.2. Beyond the Model Views

The training images in our work come from two sources.
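As a concrete illustration of the per-slice projection just described in Section 4.1 (and, run in the opposite direction with inverted homographies and multiplication, of the AND-fusion in Eqs. (1)-(2)), the sketch below warps each binary slice with its plane-to-image homography and takes the union of the patches, as in Fig. 1(b). It is only a sketch under our own assumptions: the function name, array layouts, and the use of OpenCV's warp are ours, and the homographies are assumed to be available from the reconstruction of [10].

```python
import numpy as np
import cv2  # OpenCV, used only for the homography warps

def project_volume(slices, homographies, image_size):
    """
    Project a binary volume V = {S_1, ..., S_K} into one image view by warping
    each slice with its plane-to-image homography H_j (Eq. (4)) and taking the
    union of the warped patches, mirroring Fig. 1(b).

    slices       : list of HxW binary (0/1) arrays, one per plane
    homographies : list of 3x3 arrays mapping plane j -> image coordinates
    image_size   : (width, height) of the output view
    """
    proj = np.zeros((image_size[1], image_size[0]), dtype=np.float32)
    for S_j, H_j in zip(slices, homographies):
        patch = cv2.warpPerspective(S_j.astype(np.float32), H_j, image_size)
        proj = np.maximum(proj, patch)   # union of the per-slice patches
    return proj > 0.5                    # binary silhouette of the projected model
```

Warping foreground likelihood maps from the images onto a plane with the inverse homographies and multiplying them, instead of taking the maximum, yields the occupancy grids θ_r of Section 3.1.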
One set of images is taken around a specific object of the target class to reconstruct it in3D as shown in Fig.2. These images are called model views,which provide multi-ple views of the object but are limited to the specific object. To generalize the model for recognizing other objects in the same class,another set of training images is obtained by using Google image search.Images of objects in the same class with different appearances and postures are selected. These images are denoted as the supplemental views.Since the homographies between the supplemental im-ages and the3D model are unknown,features computed from the supplemental images cannot be directly attached to the feature model.Instead,we utilize the model views as bridges to connect the supplemental images to the model as illustrated in Fig.2.For each supplemental image,the model view,which has the most similar viewpoint is speci-fied.The supplemental images are deformed to their speci-fied view by using an affine transformation alignment.Then we can assume that each supplemental image will have theFigure2.Construction of3D feature model for motorbikes.3D shape model of motorbike(at center)is constructed using the model views (images on the inner circle)taken around the object from different viewpoints.Supplemental images(outer circle)of different motorbikes are obtained by using Google’s image search.The supplemental images are aligned with the model views for feature mapping.Feature vectors are computed from all the training images and then attached to the3D model surface by using the homography transformation.same homography as the model’s corresponding view.The2D features computed from all of the supplemental train-ing images can now be correctly attached to the3D modelsurface using the same method as discussed for the model views.A codebook is constructed by combining all themapped features with their3D locations.5.Object Class DetectionGiven a new test image,our objective is to detect objects belonging to the same class in this image by using the learnt3D feature model M.Each entry of M consists of a codeand its3D locations{c,l3c}.Let F denote the SIFT features computed from the input image,which is composed by thefeature descriptor and its2D location in the image{f,l2f}. 
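Given the model M and the scene features F just defined, the first matching phase described below amounts to a nearest-neighbor search in descriptor space. The following is a minimal sketch under our own assumptions (function name, array layouts, and the rejection threshold are illustrative, not values from the paper):

```python
import numpy as np

def match_to_codebook(scene_desc, code_desc, code_loc3d, max_dist=0.6):
    """
    For every scene SIFT descriptor, find the nearest codebook entry in
    Euclidean space and return its 3D location on the model.

    scene_desc : NxD array of descriptors from the test image (F)
    code_desc  : MxD array of codebook descriptors (the codes c in M)
    code_loc3d : Mx3 array of the codes' 3D locations l3_c
    max_dist   : illustrative rejection threshold
    """
    matches = []
    for i, f in enumerate(scene_desc):
        d = np.linalg.norm(code_desc - f, axis=1)   # Euclidean distances to all codes
        j = int(np.argmin(d))
        if d[j] < max_dist:
            matches.append((i, j, code_loc3d[j]))   # (scene idx, code idx, 3D location)
    return matches
```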
Object O_n is detected by matching the features F to the 3D feature model M.

In our work, feature matching is achieved in three phases. In the first phase, we match the features by comparing all the input features to the codebook entries in Euclidean space. However, not all the matched codebook entries in 3D are visible at the same time from a particular viewpoint. So, in the second phase, matched codes in 3D are projected to viewing planes, and hypotheses of viewpoints are made by selecting the viewing planes with the largest number of visible projected points. In the third phase, for each hypothesis, the projected points are compared to the 2D matched feature points using both feature descriptors and locations. This is done by iteratively estimating the affine transformation between the feature point sets and removing outliers with a large distance between corresponding points. Outliers belonging to the background can be rejected during this matching process. The object location and bounding box are then determined according to the 2D locations of the final matched feature points. The confidence of detection is given by the degree of match.

6. Experimental Results

The proposed method has been tested on two object classes: motorbikes and horses. For the motorbikes, we took 23 model views around a motorbike and obtained 45 supplemental views by using Google's image search. Some training images of the motorbikes and the 3D shape model are shown in Fig. 2. For the horses, 18 model views were taken and 51 supplemental views were obtained.

Figure 3. Detection of motorbikes and horses using the proposed approach. The ground truth is shown in green and red boxes display our detected results.

To measure the performance of our 3D feature model based object class detection technique, we have evaluated the method on the PASCAL VOC Challenge 2006 test dataset [4], which has become a standard testing dataset for objective evaluation of object classification and detection algorithms. The dataset is very challenging due to the large variability in scale and pose, the extensive clutter, and poor imaging conditions. Some successful detection results are shown in Fig. 3. The green box indicates the ground truth, while our results are shown in red boxes.

For quantitative evaluation, we adopt the same evaluation criteria used in the PASCAL VOC challenge, so that our results are directly comparable with [8, 16, 4]. By this criterion, a detection is considered correct if the area of overlap between the predicted bounding box B_p and the ground truth bounding box B_gt exceeds 50%, using the formula

\frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})} > 0.5 .   (5)

The average precision (AP) and precision-recall (PR) curve can then be computed for performance evaluation.

Fig. 4(a) shows the PR curves of our approach and the methods in [8, 16] for motorbike detection. The curve of our approach shows a substantial improvement in precision compared to the method in [8], which is also indicated by the AP value (0.182). Although our performance is lower than that of [16], considering the smaller training image set used in our experiments, this can be regarded as satisfactory. Fig. 4(b) shows the performance curves for horse detection. While there is no result reported in the VOC challenge using researchers' own training dataset for this task, we compared our result to those using the provided training dataset. Our approach performs better than the reported methods and obtained an AP value of 0.144. It is noted that the absolute performance level is lower than that of motorbike detection, which might be caused by the non-rigid body deformation of horses.

7. Conclusion
In this paper, we have proposed a novel multi-view generic object class detection method based on 3D object shape and appearance modeling. The proposed method builds a 3D feature model for establishing spatial connections between different image views by mapping appearance features to the surfaces of a 3D shape. Through a novel homographic framework, our method exploits the relationship between multiple 2D views in a more unified manner. Experimental evaluation of the proposed method suggests that the collaborative information in the 2D training images can be represented in a more unified way through a 3D feature model of the object. We have also shown that both appearance and shape can be salient properties to assist in object detection. The performance of the proposed method has been evaluated using the PASCAL VOC challenge dataset and promising results have been demonstrated. In our future work, we plan to extend our method by taking supplemental views in a more automated fashion, so that more supplemental views can be easily incorporated to improve the performance. We will also test the method on object classes with more complex shapes and appearances.

Figure 4. The PR curves for (a) motorbike detection and (b) horse detection using our 3D feature model based approach. The curves reported in [4] on the same test dataset are also included for comparison.

Acknowledgments

This research was funded in part by the U.S. Government VACE program.

References

[1] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondences. In CVPR (1), pages 26-33, 2005.
[2] V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. International Journal of Computer Vision, 22(1):61-79, 1997.
[3] H. Chang and D.-Y. Yeung. Graph Laplacian kernels for object classification from a single example. In CVPR (2), pages 2011-2016, 2006.
[4] M. Everingham, A. Zisserman, C. Williams, and L. van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. /challenges/VOC/voc2006/results.pdf.
[5] J. D. R. Farquhar, D. R. Hardoon, H. Meng, J. Shawe-Taylor, and S. Szedmák. Two view learning: SVM-2K, theory and practice. In NIPS, 2005.
[6] L. Fei-Fei, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In ICCV, pages 1134-1141, 2003.
[7] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In ICCV, pages 1816-1823, 2005.
[8] R. Fergus, P. Perona, and A. Zisserman. A sparse object category model for efficient learning and exhaustive recognition. In CVPR, pages 443-461, 2005.
[9] V. Ferrari, T. Tuytelaars, and L. Van Gool. Integrating multiple model views for object recognition. In CVPR, volume 2, pages 105-112, 2004.
[10] S. M. Khan, P. Yan, and M. Shah. A homographic framework for the fusion of multi-view silhouettes. In ICCV, 2007.
[11] B. Leibe and B. Schiele. Scale-invariant object categorization using a scale-adaptive mean-shift search. In DAGM, pages 145-153, 2004.
[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, Nov. 2004.
[13] E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image classification. In ECCV (4), pages 490-503, 2006.
[14] J. Shotton, A. Blake, and R. Cipolla. Contour-based learning for object detection. In ICCV, pages 503-510, 2005.
[15] E. B. Sudderth, A. B. Torralba, W. T. Freeman, and A. S. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, pages 1331-1338, 2005.
[16] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, B. Schiele, and L. Van Gool. Towards multi-view object class detection. In CVPR, volume 2, pages 1589-1596, 2006.
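As a small, self-contained illustration of the overlap criterion in Eq. (5) above, the sketch below computes the intersection-over-union of two axis-aligned boxes. The (xmin, ymin, xmax, ymax) representation and the helper names are our own choices, not the authors' code.

```python
def box_overlap(bp, bgt):
    """PASCAL VOC overlap: area(Bp ∩ Bgt) / area(Bp ∪ Bgt) for (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(bp[2], bgt[2]) - max(bp[0], bgt[0]))
    iy = max(0.0, min(bp[3], bgt[3]) - max(bp[1], bgt[1]))
    inter = ix * iy
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_gt = (bgt[2] - bgt[0]) * (bgt[3] - bgt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(bp, bgt, threshold=0.5):
    """A detection is counted as correct if the overlap exceeds the threshold of Eq. (5)."""
    return box_overlap(bp, bgt) > threshold

# Example: overlap of about 0.68, counted as a correct detection
print(box_overlap((0, 0, 10, 10), (1, 1, 11, 11)))
```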
A Point Cloud Segmentation Algorithm for Cluttered Scenes Based on Convexity and Concavity
Huang Zhen; Han Huiyan; Han Xie

【Abstract】To deal with the problem of segmenting multiple objects in cluttered scenes, a segmentation method based on RGB-D point cloud data is proposed. The method first decomposes the scene point cloud, via supervoxel clustering, into an adjacency graph built on a voxel grid. It then classifies the edges of the adjacency graph to create a convexity graph, and merges patches connected by convex edges through region growing to obtain the unknown objects. In addition, a Euclidean-clustering refinement of the region growing is proposed, which is found to give better segmentation results for objects with interior concave surfaces such as bowls and cups. Experimental results on an object segmentation database and on manually captured scenes show that the method can segment objects of various shapes in cluttered desktop scenes.
【Journal】Science Technology and Engineering
【Year (Volume), Issue】2018, 18(14)
【Pages】5 pages (pp. 43-47)
【Keywords】computer vision; supervoxel clustering; convexity; Euclidean; point cloud segmentation
【Authors】Huang Zhen; Han Huiyan; Han Xie
【Affiliation】School of Big Data, North University of China, Taiyuan 030051, China
【Language of text】Chinese
【CLC number】TP391.41

Despite a long period of research, segmenting a scene into objects remains one of the more challenging problems in computer vision [1].
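A minimal NumPy sketch of the two core steps described above, classifying adjacency-graph edges as convex or concave and growing regions over convex edges, is given below. The supervoxel extraction itself (e.g., with PCL) is omitted; patch centroids, normals and the adjacency graph are assumed to be given, and the tolerance-softened form of the convexity test and its angle are our own illustrative choices rather than the paper's exact criterion.

```python
import numpy as np

def edge_is_convex(c1, n1, c2, n2, tol_deg=10.0):
    """
    Local convexity test between two adjacent patches: the connection is
    treated as convex when the normals 'open away' from each other along the
    line joining the centroids, i.e. (n1 - n2) . (c1 - c2) >= 0, softened here
    by a small tolerance angle (an illustrative choice).
    """
    d = c1 - c2
    dist = np.linalg.norm(d)
    if dist == 0:
        return True
    return np.dot(n1 - n2, d) >= -np.sin(np.radians(tol_deg)) * dist

def grow_convex_regions(centroids, normals, adjacency):
    """
    Merge patches connected by convex edges using simple BFS region growing.
    centroids, normals : dict patch_id -> 3-vector
    adjacency          : dict patch_id -> iterable of neighbouring patch ids
    Returns a dict patch_id -> region label.
    """
    labels, next_label = {}, 0
    for seed in centroids:
        if seed in labels:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:
            p = stack.pop()
            for q in adjacency.get(p, ()):
                if q in labels:
                    continue
                if edge_is_convex(centroids[p], normals[p],
                                  centroids[q], normals[q]):
                    labels[q] = next_label
                    stack.append(q)
        next_label += 1
    return labels
```

The Euclidean refinement mentioned in the abstract would then re-split each grown region by spatial distance, which helps separate touching objects with interior concavities such as bowls and cups.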
3D Object Recognition in Cluttered Sceneswith Local Surface Features:A SurveyYulan Guo,Mohammed Bennamoun,Ferdous Sohel,Min Lu,and Jianwei Wan Abstract—3D object recognition in cluttered scenes is a rapidly growing research area.Based on the used types of features,3D object recognition methods can broadly be divided into two categories—global or local feature based methods.Intensive research has been done on local surface feature based methods as they are more robust to occlusion and clutter which are frequently present in areal-world scene.This paper presents a comprehensive survey of existing local surface feature based3D object recognition methods.These methods generally comprise three phases:3D keypoint detection,local surface feature description,and surface matching.This paper covers an extensive literature survey of each phase of the process.It also enlists a number of popular and contemporarydatabases together with their relevant attributes.Index Terms—3D object recognition,keypoint detection,feature description,range image,local featureÇ1I NTRODUCTIONO BJECT recognition in cluttered scenes is a fundamental research area in computer vision.It has numerous applications,such as intelligent surveillance,automatic assembly,remote sensing,mobile manipulation,robotics, biometric analysis and medical treatment[1],[2],[3],[4],[5]. In the last few decades,2D object recognition has been extensively investigated and is currently a relatively mature research area[6].Compared to2D images,range images have shown to exhibit several advantages for object recogni-tion.For instance,(i)range images provide more geometri-cal(depth)information compared to2D images.Range images also encode surface metric dimensions unambigu-ously.(ii)Features extracted from range images are com-monly not affected by scale,rotation and illumination[7]. (iii)An estimated3D pose of an object from range images is more accurate compared to an estimated pose from2D images.Accordingly,range images have the potential to overcome many of the difficulties faced by2D images in the context of object recognition[8].These advantages make3D object recognition an active research topic[9].Moreover,the rapid technological development of low cost3D acquisition systems(e.g.,Microsoft Kinect)make range images more accessible[10],[11],[12].Furthermore,advances in comput-ing devices enable the processing of any computationally intensive3D object recognition algorithm to run in a fairly acceptable manner.All these combined factors have contrib-uted to theflourishment of research towards the develop-ment of3D object recognition systems.Existing3D object recognition methods can be divided into two broad categories:global feature based methods and local feature based methods[13],[14].The global fea-ture based methods process the object as a whole for recog-nition.They define a set of global features which effectively and concisely describe the entire3D object(or model)[15]. 
These methods have been widely used in the context of3D shape retrieval and classification[16],[17].Examples in this category include geometric3D moments[18],shape distri-bution[19],viewpoint feature histogram[20],and potential well space embedding[21].They,however,ignore the shape details and require a priori segmentation of the object from the scene[11].They are therefore not suitable for the recognition of a partially visible object from cluttered scenes [14].On the other hand,the local feature based methods extract only local surfaces around specific keypoints.They generally handle occlusion and clutter better compared to the global feature based methods[14].This type has also proven to perform notably better in the area of2D object recognition[22].This conclusion has also been extended to the area of3D object recognition[23].On that basis,the focus of this paper is on3D object recognition in cluttered scenes with local surface features.Several survey papers were published on3D object recog-nition and its relatedfields,such as3D shape matching and modeling.Among these,several survey papers on3D object recognition are also available.For instance,the articles in [24],[25],[26],[27],[28]and[29].Two reviews on3D model-ing and range image registration by[30]and[31]are also worth mentioning.However,none of these papers specifi-cally focuses on local feature based3D object recognition.Bronstein et al.[32]presented a review on keypoint detection and local surface feature description methods. They however,covered only four keypoint detectors and seven feature descriptors.An article on the evaluation ofY.Guo is with the College of Electronic Science and Engineering,NationalUniversity of Defense Technology,Changsha,Hunan410073,China,andwith the School of Computer Science and Software Engineering,The Uni-versity of Western Australia,Perth,WA6009,Australia.E-mail:yulan.guo@.M.Bennamoun and F.Sohel are with the School of Computer Science andSoftware Engineering,The University of Western Australia,Perth,WA6009,Australia.M.Lu and J.Wan are with the College of Electronic Science and Engineer-ing,National University of Defense Technology,Changsha,Hunan410073,China.Manuscript received6May2013;revised24Nov.2013;accepted3Apr.2014.Date of publication10Apr.2014;date of current version9Oct.2014.Recommended for acceptance by A.Fitzgibbon.For information on obtaining reprints of this article,please send e-mail to:reprints@,and reference the Digital Object Identifier below.Digital Object Identifier no.10.1109/TPAMI.2014.23168280162-8828ß2014IEEE.Personal use is permitted,but republication/redistribution requires IEEE permission.See /publications_standards/publications/rights/index.html for more information.keypoint detection methods has just been published [33].However,this article only covers eight keypoint detection methods.A large number of significant research contribu-tions on local feature based methods is available and there is no review article which comprehensively analyzes these methods.Compared with the existing literature,the main contribu-tions of this paper are as follows:i)To the best of our knowledge,this is the first survey paper in the literature that focuses on 3D object recog-nition based on local surface features.ii)As opposed to previous reviews,e.g.,in [32]and[33],we adequately cover the contemporary litera-ture of keypoint detection and local surface description methods.We present a comprehensive review of 29keypoint detection and 38feature description methods.iii)This paper also provides 
an insightful analysis on allaspects of surface matching,including feature matching,hypothesis generation and verification.iv)Compared to the earlier surveys,this papercoversthe most recent and advanced work.It therefore,pro-vides the reader with the state-of-the-art methods.v)A comparative summary of attributes is reported intabular forms (e.g.,Tables 3,4and 5).The rest of this paper is organized as follows.Section 2describes the background concepts and terminology of 3D object recognition based on local surface features.Sections 3and 4provide a comprehensive survey of the existing methods for 3D keypoint detection and local surface fea-ture description,respectively.Section 5presents a review of 3D object recognition methods.Section 6presents a brief discussion on potential future research directions.Finally,Section 7concludes this paper.2B ACKGROUND C ONCEPTS AND T ERMINOLOGY2.1Background ConceptsA range image can be represented in three types,namely a depth image,a point cloud or a polygonal mesh.Given a range image,the goal of 3D object recognition is to correctly identify objects present in the range image,and determine their poses (i.e.positions and orientations)[34].At a conceptual level,a typical local feature based 3D object recognition system consists of three main phases:3D keypoint detection,local surface feature description and surface matching.In the 3D keypoint detection phase,the 3D points with rich information content are identi-fied as keypoints.The inherent scale of each keypoint is also detected.Both the location and scale (i.e.,neighbor-hood size)of a keypoint define a local surface which is used in the subsequent feature description phase [35].In the local surface feature description phase,the geometric information of the neighborhood surface of the keypoint is encoded into a representative feature descriptor.Dur-ing the surface matching phase,the scene features are matched against all model features in the library,result-ing in a set of feature correspondences and hypotheses.These hypotheses are finally verified to infer the identity and pose of the object.2.2Databases and EvaluationMany databases have been built to test various algorithms.A set of popular 2.5D range image and 3D model databases are enlisted together with their major attributes in Tables 1and 2,respectively.The variations (var.),including occlu-sion (o),clutter (c)and deformation (d),in each database are also provided.The symbol ‘-’denotes that the corre-sponding item is not reported.In addition to range images/models,registered 2D color (usually RGB)images are also simultaneously provided in several databases (shown in Tables 1and 2).There are series of evaluation criteria which are fre-quently used to assess the performance of each phase of a 3D object recognition system [3],[36],[37],[38].Repeat-ability score is a frequently used criteria for 3D keypoint detector evaluation.It is computed as the ratio of the number of corresponding keypoints to the minimum number of keypoints in the two images [33],[39],[40],[41],[42].Recall vs 1-Precision is a frequently used crite-ria for local surface feature descriptor evaluation.It is generated by varying the thresholds for feature matching and calculating the feature recall and precision under each threshold [3],[43],[44],[45],[46].Recognition rate is commonly used to evaluate the overall performance of a recognition system [38],[44],[47],[48].Popular Range Image Databases33D K EYPOINT D ETECTIONKeypoint detection is thefirst major phase of a local surface 
feature based 3D object recognition system. The simplest keypoint detection methods are surface sparse sampling and mesh decimation [38], [63], [64]. However, these methods do not result in qualified keypoints in terms of repeatability and informativeness. That is because they give little or no consideration to the richness of discriminative information at the detected keypoints [4]. Therefore, it is necessary to detect keypoints according to their distinctiveness.

Based on whether the scale is predetermined or adaptively detected, keypoint detection methods can be classified into two categories: fixed-scale keypoint detection methods and adaptive-scale keypoint detection methods. We adopt the same classification as [33] in this paper.

3.1 Fixed-Scale Keypoint Detection

Fixed-scale keypoint detection methods define a point which is distinctive within a predetermined neighborhood as a keypoint. The neighborhood size is determined by the scale, which is an input parameter to the algorithm [35]. As described in the following subsections, distinctiveness measures can either be curvatures or other surface variation (OSV) measures.

3.1.1 Curvature Based Methods

These methods use different curvatures as distinctiveness measures to detect keypoints. Mokhtarian et al. [65] detected keypoints using the Gaussian and mean curvatures. They declared a point p as a keypoint if its curvature value was larger than the curvature values of its 1-ring neighbors (k-ring neighbors are defined as the points which are distant from p by k edges). Yamany and Farag [66] used simplex angles to detect keypoints. A simplex angle φ is related to the mean curvature. The keypoints are detected at the locations where the simplex angles satisfy the constraint |sin(φ)| ≥ t. Their threshold t is crucial for the performance of the keypoint detection, and choosing an appropriate threshold is still an unresolved issue [66]. Gal and Cohen-Or [67] proposed a saliency grade for keypoint detection. The saliency grade of a point p is a linear combination of two terms: the first term is the sum of the curvatures of its neighboring points, and the second term is the variance of the curvature values in the neighborhood. The points with high saliency grades are selected as keypoints. Chen and Bhanu [68] detected keypoints based on shape index values. That is, within a neighborhood, the point p is marked as a keypoint only if its shape index value is a local optimum (maximum/minimum). Experimental results showed that the detected keypoints were quite uniformly distributed over the surface [33]. However, this method is sensitive to noise [33].

3.1.2 Other Surface Variation Measure Based Methods

These methods use other surface variation measures, rather than just curvatures, to detect keypoints. Matei et al. [69] used the smallest eigenvalue λ3 of the covariance matrix of the neighboring points to measure the surface variation around a point p. The points were sorted according to their surface variations. Zhong [70] employed the ratio of two successive eigenvalues to prune the points.
Only the points which satisfy the constraints λ2/λ1 < t21 and λ3/λ2 < t32 are retained and further detected based on the smallest eigenvalue. Guo et al. [44] first decimated a range image and then chose the points which satisfied the constraint λ1/λ2 > t on the decimated image as keypoints. These methods achieve good results in terms of repeatability. Moreover, they are particularly computationally efficient [33].

Glomb [71] introduced four propositions to extend the popular Harris detector [72] from 2D images to 3D meshes. They found that the Harris detector which used the derivative of a fitted quadratic surface achieved the best results. Following this proposition, Sipiran and Bustos [40] proposed a "Harris 3D" detector. Given a point p, the neighboring points are first translated to the centroid and then rotated to align the normal at p with the z axis. Next, these transformed points are fitted to a quadratic surface, described mathematically by f(u, v) = a^T (u^2, uv, v^2, u, v, 1). A symmetric matrix E is then defined using the derivatives of this function. The Harris 3D operator value at the point p is calculated as V(p) = det(E) - α (tr(E))^2, where det(E) and tr(E) represent the determinant and trace of the matrix E, respectively, and α is a parameter which needs to be tuned experimentally. Finally, a fixed percentage of points with the largest values of V(p) are selected as keypoints. Experimental results showed that the detector is robust to several transformations, including noise, change of tessellation, local scaling, shot noise and the presence of holes. It also outperformed [73] (described in Section 3.2.4) and [15] (described in Section 3.2.1) in many aspects, especially in the cases of high levels of local scaling and shot noise.

Since only a fixed scale is used to find the keypoints, the implementation of these methods is straightforward. However, they have several major drawbacks. First, it is possible that they detect too few keypoints, particularly on the less curved parts of the 3D object [4]. This would be a problem for object recognition. Second, fixed-scale methods determine the scale empirically. They do not fully exploit the scale information encoded in the local geometric structures to detect the inherent scale of a keypoint. Therefore, the neighborhood size cannot be adaptively determined [48].

TABLE 2. Popular 3D Model Databases. (table not reproduced in this extraction)

3.2 Adaptive-Scale Keypoint Detection

Adaptive-scale keypoint detection methods first build a scale-space for a given range image. They then pick the points with extreme distinctiveness measures in both the spatial and scale neighborhoods as keypoints. As a result, both the locations and the scales of the keypoints are detected. According to the scale-space construction technique, these methods can be divided into four categories: coordinate smoothing (CS), geometric attribute smoothing (GAS), surface variation, and transform based methods.

3.2.1 Coordinate Smoothing Based Methods

These methods construct a scale-space by successively smoothing the 3D coordinates of a range image. They are broadly based on the 2D scale-space theory first introduced in [74]. Akagündüz and Ulusoy [75] first obtained the scale-space of a 3D surface by constructing a Gaussian pyramid of the surface. They then computed the mean and Gaussian curvature values at all points for all scales, and classified each point as one of the eight surface types based on its Gaussian and mean curvatures [24]. Within the classified scale-space, each connected volume that had the same surface type was detected. The center of the connected volume was chosen as the location of a keypoint. The
weighted average of the scale values within each connected volume was selected as the scale of that keypoint.This algorithm is invariant to scale and rotation.The results showed that for scale varying databases,the adaptive-scale features ensured a superior recognition performance to the fixed-scale fea-tures [13].Castellani et al.[15]resampled the mesh M at N O levels to yield a set of octave meshes M j ðj ¼1;2;...;N O Þ.They then applied N S Gaussian filters on each octave mesh M j to obtain a set of filtering maps F j i ði ¼1;2;...;N S Þ.Next,they projected F ji ðp Þto the normal of p to get a scalar map M j i ðp Þ,which was further normalized to an inhibitedsaliency map ^M j i.Finally,keypoints were detected as the local maxima within the inhibited saliency map ^M j i.Only the potential keypoints which appeared in at least three octaves were considered as validate keypoints.Experimen-tal results showed that this method detected only a limited number of keypoints,which were well-localized and often at the extremities of long protrusions of the surface [15],[33].Darom and Keller [76]followed the work of [15].They however used density-invariant Gaussian filters to obtain octave meshes,which was more robust to varying mesh res-olutions and nonuniform sampling compared to [15].Li and Guskov [77]projected the 3D points onto a series of increasingly smoothed versions of the shape to build a scale-space.They then computed the normal differences between neighboring scales at each point p .The points whose normal differences were larger (or smaller)than all of their spatial and scale neighbors were detected as key-points.Experimental results showed that the keypoints atcoarse scales were more repeatable compared to their coun-terparts at finer scales.Lo and Siebert [78]detected keypoints on a depth image using an enhanced version of the Scale Invariant Feature Transform (SIFT)algorithm [22],namely 2.5D SIFT.They created a discrete scale-space representation of the depth image by using Gaussian smoothing and Difference Of Gaussian (DOG)pyramid techniques.The signal maxima and minima were detected within the DOG scale-space.The keypoints were finally validated and located by comparing the ratio of the principal curva-tures (i.e.,k 1k 2)with a predefined threshold.The 2.5D SIFT achieved a superior matching performance compared to the 2D SIFT [78].Knopp et al.[23]first voxelized a mesh to a 3D voxel image.Next,they calculated the second-order derivatives Lðv ;s Þat each voxel v using box filters with increasing stan-dard deviations s .They defined a saliency measure s for each voxel v and scale s based on the Hessian matrix.Local extrema over the space s ðv ;s Þwere used to detect the “3D SURF”keypoints and their corresponding scales.Coordinate smoothing based methods directly apply the 2D scale-space theory to 3D geometric data by replacing pixel intensities with 3D point coordinates.Consequently,the extrinsic geometry and topology of a 3D shape are altered,and the causality property of a scale-space is vio-lated [79].The causality property is an important axiom for a scale-space representation.It implies that any structure (feature)in a coarse scale should be able to find its cause in the finer scales [80].3.2.2Geometric Attribute Smoothing Based Methods These methods construct a scale-space by successively smoothing the geometric attributes of a range image.Since the filtering is applied to the geometric attributes rather than the range image itself,no modification is made to the extrinsic 
geometry of the 3D shape.Therefore,the causality property of the scale-space is preserved.Novatnack and Nishino [81]represented a surface using its normals and parameterized the surface on a 2D plane to obtain a dense and regular 2D representation.They then constructed a scale-space of the normal field by successively convolving the 2D normal map with geo-desic Gaussian kernels of increasing standard deviations.Keypoints and their corresponding scales were detected by identifying the points in the scale-space where the corner responses were maxima along both the spatial and scale axes.This work was one of the first to consider a geometric scale space that can directly be used on range images,preceding several methods including [82]and [83].Experimental results showed that the keypoints could be detected and localized robustly,and the num-ber of detected keypoints was sufficiently large [48],[84].One major limitation of this method is that it requires accurate estimations of the surface normals to construct the 2D scale-space [41],[85].Flint et al.[86]convolved the 3D voxel image of a range image with a set of Gaussian kernels to build a density scale-space.The keypoints were detected over the scale-space using the determinant of the Hessian matrix.Tests ona number of scenes showed that,this THRIFT method can repeatably extract the same keypoints under a range of transformations.However,it is sensitive to the variations of point density[87].Moreover,regular resampling in a3D space throughout the data is very time-consuming[88].Hua et al.[89]first mapped a3D surface to a canonical 2D domain by using a non-linear optimization method. They then encoded the surface curvatures and conformal factors into the rectangular2D domain,resulting in a shape vector image.Next,they built a vector-valued scale-space on the shape vector image through a geodesic distance-weighted inhomogeneous linear diffusion and a Difference of Diffusion(DoD)computation.They detected keypoints as the points that had maxima/minima DoD values across the scales in both the curvature and conformal factor chan-nels.Experiments showed that this method achieved a very high repeatability.It was also superior to the regular anisotropic diffusion and isotropic diffusion methods[89]. However,it is difficult to apply this method to large and topologically complicated surfaces[83],[90]and certain high-genus surfaces(i.e.,surfaces which have a large num-ber of holes.)[89].Zou et al.[90]proposed a Geodesic Scale-Space(GSS) based on the convolution of a variable-scale geodesic Gauss-ian kernel with the surface geometric attributes.Keypoints were detected by searching the local extrema in both the spatial and scale domains in the GSS.These keypoints were further pruned based on their contrasts and anisotropies. 
Experiments demonstrated that this method was robust to noise and varying mesh resolutions.However,the cost of computing geodesic distances is extremely high when the scale increases[83].Zaharescu et al.[91]first defined a scalarfield(photomet-ric or geometric attribute)for each point p.They then con-volved the scalarfield with a set of geodesic Gaussian kernels and performed DOG calculations to obtain their scale-space.MeshDOG keypoints were claimed as the max-ima in the DOG scale-space.These keypoints werefinally pruned by a set of operations including non-maximum sup-pression,thresholding and corner detection.This method offers a canonical formula to detect both photometric and geometric keypoints on a mesh surface.It is capable to detect a sufficient number of repeatable keypoints.It is how-ever,sensitive to varying mesh resolutions[33],[83].Zou et al.[82]formalized an Intrinsic Geometric Scale-Space(IGSS)of a3D surface by gradually smoothing the Gaussian curvatures via shape diffusion.Keypoints were identified as extrema in the normalized Laplacian of the IGSS with respect to both the spatial and scale domains.The IGSS representation is invariant to conformal deformations(i.e., transformations which preserve both the size and the sign of angles).Experimental results showed that the detected key-points spread across various scales and were robust to noise.Hou and Qin[83]downsampled the surface as the scale increased,and built a scale-space of the scalarfield(e.g.,cur-vature and texture)on the surface by using a Gaussian ker-nel.Keypoints werefinally detected as local extrema in the scale-space.The capability of simultaneous sampling in both the spatial and scale domains makes this method efficient and stable.Experimental results showed that this method significantly reduced the processing time compared to GSS.It is also more stable under different mesh resolutions than MeshDOG.Another merit is its invariance to isometric defor-mations(i.e.,transformations which preserve distance).3.2.3Surface Variation Based MethodsThese methodsfirst calculate surface variations at a set of varying neighborhood sizes.They then detect keypoints by finding the maxima of surface variations in the local neigh-borhood with different neighborhood sizes.They are based on the assumption that the neighborhood size can be regarded as a discrete scale parameter,and increasing the local neighborhood size is similar to applying a smoothing filter[92].These methods avoid making direct changes to the3D surfaces,and they are straightforward to implement.Pauly et al.[92]measured the surface variation d by using the eigenvalues 1, 2and 3of the covariance matrix of the neighboring points,that is d¼ 3ð 1þ 2þ 3Þ.Local maxima in the surface variation space were determined as keypoints. 
Experiments demonstrated that the surface variation corresponded well to the surface smoothed with standard Gaussian filtering. Two major drawbacks of this method are that the surface variation is sensitive to noise, and that a heuristic pre-smoothing procedure is required to detect the maxima in the surface variation space [41], [88].

Ho and Gibbins [88] used the standard deviation of the shape index values of the neighboring points to measure the surface variation. The detected keypoints on the Dragon model are illustrated in Fig. 1a. Experimental results showed that this method was effective and robust to minor noise. It achieved high repeatability results even with noisy data. Later, Ho and Gibbins [85] estimated the curvedness at different scales, and picked the points that had extreme values in the scale-space as keypoints. Similarly, Ioannou et al. [93] proposed a multi-scale Difference of Normals (DoN) operator for the segmentation of large unorganized 3D point clouds. The DoN operator provides a substantial reduction in points, which reduces the computational cost of any subsequent processing phase applied to the segmented parts of the scene.

Unnikrishnan and Hebert [41] proposed an integral operator B(p, r) to capture the surface variation at a point p with a neighborhood size r. In effect, B(p, r) displaces the point p along its normal direction n, with the magnitude of displacement proportional to the mean curvature. They then defined the surface variation d(p, r) as:

d(p, r) = \frac{2\,\|p - B(p, r)\|}{r}\, \exp\!\left(-\frac{2\,\|p - B(p, r)\|}{r}\right) .   (1)

Fig. 1. Keypoints detected on the Dragon model. (a) Keypoints detected by [88]; different sizes of the spheres correspond to different scales of the keypoints. (b), (c), (d) Keypoints and their neighborhoods detected at three scales by [41]; each colored patch corresponds to the neighborhood of a keypoint, and the sizes of the blue spheres correspond to the scales.
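To make Eq. (1) above concrete, the sketch below evaluates d(p, r) over a set of radii for a single point and picks the radius that maximizes it as the point's intrinsic scale. The displaced points B(p, r) are assumed to be available; in this sketch they are faked by a hypothetical displacement, since computing B itself (a mean-curvature-proportional displacement along the normal) is outside the scope of the example.

```python
import numpy as np

def d_saliency(p, B_pr, r):
    """Eq. (1): d(p, r) = (2||p - B(p, r)|| / r) * exp(-2||p - B(p, r)|| / r)."""
    u = 2.0 * np.linalg.norm(p - B_pr) / r
    return u * np.exp(-u)

def intrinsic_scale(p, displaced_points, radii):
    """
    Pick the radius that maximizes d(p, r).
    displaced_points[i] is assumed to be B(p, radii[i]).
    """
    scores = [d_saliency(p, B, r) for B, r in zip(displaced_points, radii)]
    return radii[int(np.argmax(scores))], scores

# Toy example with made-up (hypothetical) displacements
p = np.zeros(3)
radii = np.array([0.5, 1.0, 2.0, 4.0])
displaced = [p + np.array([0.0, 0.0, 0.02 * r * r]) for r in radii]  # hypothetical B(p, r)
print(intrinsic_scale(p, displaced, radii))
```

Note that u * exp(-u) peaks at u = 1, so d(p, r) is maximal when the displacement reaches half the neighborhood radius, which is what makes the measure usable for scale selection.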