Journal of Harbin University of Science and Technology, Vol. 28, No. 5, October 2023.
DOI: 10.15938/j.jhust.2023.05.013. CLC number: TP391.41. Document code: A. Article ID: 1007-2683(2023)05-0103-07.

Semantic Segmentation of Unmanned Driving Scenes Based on Spatial-Channel Dual Attention

WANG Xiaoyu, LIN Peng
(School of Computer Science and Technology, Harbin University of Science and Technology, Harbin 150080, China)

Abstract: An important issue in the field of unmanned driving is how to run real-time, high-precision semantic segmentation models on low-power mobile electronic devices. Existing semantic segmentation algorithms have too many parameters and too large a memory footprint to meet the needs of real-world applications such as unmanned driving, and among the many factors that affect the accuracy and inference speed of a segmentation model, spatial information and contextual features are particularly important and are difficult to account for at the same time. To address this problem, a truncated ResNet18, a lightweight model with few parameters and a small memory footprint, is used as the backbone; a bilateral segmentation architecture is adopted, and a channel-spatial dual attention mechanism is added to both paths to capture more contextual and spatial information. In addition, an attention refinement module that refines contextual information and a fusion module that merges the outputs of the two paths are used; these modules have a negligible effect on parameter count and memory and can be inserted in a plug-and-play manner. Cityscapes and CamVid are used as datasets: the model reaches 77.3% mIoU on Cityscapes and 66.5% mIoU on CamVid, and the inference time is 37.9 ms at an input resolution of 1024x2048.

Keywords: unmanned driving; real-time semantic segmentation; deep learning; attention mechanism; depthwise separable convolution

Received 2022-04-04. Supported by the National Natural Science Foundation of China (61772160) and the Scientific Research Project of the Heilongjiang Provincial Department of Education (12541177). LIN Peng (1997-), male, M.Sc. candidate. Corresponding author: WANG Xiaoyu (1971-), female, professor, M.Sc. supervisor, E-mail: wangxiaoyu@.

0 Introduction
With the integration of artificial intelligence and road transport, autonomous driving has surged in popularity, and how to detect road conditions, road signs and similar information accurately and quickly has become a prominent research goal [1]. Many researchers have turned their attention to road-scene understanding, and one of its main directions is semantic segmentation of road scenes [2]. As a fundamental task in computer vision, deep-learning-based image semantic segmentation aims to estimate the class label of every pixel in a given input image and to present the result as masks of differently coloured regions. In 2014 the fully convolutional network (FCN) proposed in [2], regarded as the foundational work of deep convolutional segmentation, marked the start of a new period of development for the field. Its key difference from all earlier segmentation algorithms is that FCN replaces every fully connected layer of a classification model with convolutional layers and learns a pixel-to-pixel mapping; it also combines the outputs of different pooling layers during upsampling to refine the final prediction [2]. Many strong deep-learning segmentation algorithms today build on the FCN idea [3]. In 2015 the University of Cambridge achieved a further breakthrough on this basis with the SegNet model [3]. Since then more segmentation algorithms have been developed and accuracy has kept improving, for example the DeepLab family [4], the multi-path refinement network RefineNet [4] and PSPNet [5].

Deep learning has thus made great progress in semantic segmentation in recent years, with large application potential in fields such as autonomous driving. Most models, however, focus on raising segmentation accuracy; their computational cost and memory footprint are high and real-time operation cannot be guaranteed [6]. Many practical applications place strict real-time requirements on the model, and to meet this need the now widely used ENet and the MobileNet family were proposed [7]; real-time semantic segmentation has gradually split off into a field of its own. To increase inference speed, some real-time models shrink the input image while others prune feature-map channels, but both operations lose spatial information [7]. The reason is that the input image passes through many convolutions and pooling operations, so the resolution of the feature maps keeps shrinking once the image is loaded into the model. For segmentation, rich contextual and spatial information, high-resolution features and deep semantic features are what improve accuracy [8].

Among recent real-time methods, the bilateral segmentation network (BiSeNet) has achieved remarkable results [9]. Building on BiSeNet, this paper uses the lightweight ResNet18 as the backbone of the context path and introduces two spatial-channel dual attention mechanisms, CBAMT and CSSE. The CBAMT module added to the lightweight feature extractor of the context path decides what should be learned from both the spatial and the channel dimension [10]; attention refinement modules (ARM) then strengthen feature learning at the different stages of the extractor [11]. The CSSE module added to the spatial path captures more spatial features, and depthwise separable convolutions reduce the number of parameters. Finally, a feature fusion module (FFM) fuses the outputs of the two paths.

1 Proposed method

The structure of BiSeNet is shown in Fig. 1. The bilateral segmentation network has two branches: a spatial path and a context path. The spatial path compensates for the loss of spatial information; the context path addresses the small receptive field and gathers rich contextual information [12]. The two paths proceed as follows. In the spatial path, the input image is passed through three convolutional layers with large kernels, compressing it to a feature map of 1/8 the original size and thereby preserving rich spatial information; because these layers use small strides, they produce high-resolution features [13]. In the context path, global average pooling is added to the branch to obtain the largest possible receptive field, and attention mechanisms are added to guide feature learning.

Fig. 1 Structure of the original model

1.1 Spatial-channel dual attention unit

[3] proposed CBAM, a lightweight spatial-channel dual attention mechanism that applies attention along both the channel and the spatial dimension [14]. CBAM consists of two separate sub-modules, a channel attention module (CAM) and a spatial attention module (SAM); the former attends over channels, the latter over spatial positions. The advantage is that the number of parameters remains well controlled and the module can be added to existing model structures; in short, CBAM is a plug-and-play module. (Illustrative code sketches of the attention modules of this section are given after Section 1.3.)

1.1.1 CAM
The input feature map G (H x W x C) is reduced by global max pooling and global average pooling over the spatial dimensions, giving two 1 x 1 x C descriptors. They are then fed to a shared two-layer perceptron (MLP) [15]: the first layer has C/r neurons (r is the reduction ratio) with ReLU activation and the second has C neurons. The two MLP outputs are summed and activated with a sigmoid, producing the channel attention map M_c. Finally M_c is multiplied with the input feature map G, and the resulting feature map G' is the input required by the spatial attention module.

1.1.2 SAM
SAM takes G' as its input feature map. Channel-wise global max pooling and global average pooling are applied first, and the two H x W x 1 maps are concatenated along the channel dimension. A 7 x 7 convolution reduces them to a single channel (H x W x 1), and a sigmoid produces the feature map G''. Multiplying G'' with G' gives the final feature map.

1.2 Improved spatial path
A segmentation model can be made more accurate by combining low-level spatial features with large amounts of deep semantic information [15]. The spatial path proposed here consists of three convolutions: the first layer is a convolution with stride 2, and the remaining two layers are depthwise separable convolutions with stride 1 [15], followed by batch normalization (BN) and the rectified linear unit (ReLU) as activation. A channel-spatial module (CSSE) is also added to the spatial path. It works as follows: the H x W x C feature map is reduced by global average pooling to 1 x 1 x C and processed by two 1 x 1 convolutions to obtain a C-dimensional vector; a sigmoid normalization then gives the corresponding mask, which is multiplied with the channel groups to obtain the calibrated feature map M'. The sSE module is similar to SAM: a 1 x 1 convolution is applied directly to M' (H x W x C) to produce an H x W x 1 map, which is activated with a sigmoid to give the spatial feature map and applied directly to the original features to calibrate their spatial information. CSSE connects the cSE and sSE modules in series, and experiments show that the resulting CSSE also improves the segmentation quality of the model. The structure of CSSE is shown in Fig. 2.

Fig. 2 CSSE structure diagram

1.3 Improved context path
In the original model, BiSeNet designed the context path to obtain a larger receptive field and more semantic information [15], using Xception as the feature-extraction backbone [16]; Xception shrinks the feature maps quickly to obtain a large receptive field that encodes high-level semantic context [16]. The improved context path proposed here uses the lightweight ResNet18 as the feature-extraction backbone and adds a CBAMT module to the path. The backbone consists of four blocks, each containing two 3 x 3 convolutions with BN layers and ReLU. The proposed CBAMT module is based on the triplet attention method of [6], which uses a three-branch structure to capture cross-dimension interaction and compute attention weights, realizing interaction between the channel and spatial dimensions [16]. The improved CBAMT adopts this triplet (three-branch) idea: of the three parallel branches, two are mainly responsible for the interaction between dimension C and dimension H or W [17], while the last branch is similar to SAM and builds spatial awareness [17]; the outputs of all branches are finally aggregated by averaging. CBAMT takes the output feature map F' of the CAM module and passes it through the two parallel branches containing Z-pool layers, which are used for dimension interaction and exchange dimension C with dimension H or W; the two outputs are summed to give the feature map F'', which is then used as the input of SAM to obtain the final features. The Z-pool layer reduces the tensor of a dimension to two channels by concatenating the average-pooled and max-pooled features of that dimension, so that the layer keeps a rich representation of the original tensor while reducing its depth, which benefits subsequent computation [18]. Finally, the improved context path retains the global average pooling structure, which provides the model with global contextual information and further strengthens the segmentation. The structure of the CBAMT module is shown in Fig. 3, the improved overall network model in Fig. 4, and the attention computation applied after the Z-pool is

M_c(F) = \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big)    (1)

where F is the input feature map, \sigma is the sigmoid activation function, AvgPool and MaxPool denote global average pooling and global max pooling respectively, and f^{7\times 7} denotes a convolution with a 7 x 7 kernel.

Fig. 3 Structure of the spatial-channel attention module CBAMT
Fig. 4 Overall structure of the improved model
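As an illustration of Sections 1.1.1 and 1.1.2, the following is a minimal PyTorch-style sketch of the CAM and SAM sub-modules. The class names and the reduction ratio r = 16 are assumptions made for this example; the paper does not give an implementation.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):                 # CAM of Section 1.1.1
    def __init__(self, channels, reduction=16):   # reduction ratio r is assumed
        super().__init__()
        self.mlp = nn.Sequential(                  # shared two-layer perceptron
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))

    def forward(self, x):                          # x: (N, C, H, W)
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # global average pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # global max pooling
        m_c = torch.sigmoid(avg + mx)              # channel attention map M_c
        return x * m_c                             # G' = M_c * G

class SpatialAttention(nn.Module):                 # SAM of Section 1.1.2
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                          # x: (N, C, H, W)
        avg = torch.mean(x, dim=1, keepdim=True)   # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)  # channel-wise max pooling
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Eq. (1)
        return x * m_s                             # G'' applied to G'

class CBAM(nn.Module):                             # CAM followed by SAM
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention()

    def forward(self, x):
        return self.sam(self.cam(x))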
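A similar sketch can be given for the CSSE module of Section 1.2, with the cSE and sSE calibrations connected in series. The reduction ratio and the activation between the two 1 x 1 convolutions are assumptions, and all names are illustrative rather than the authors' code.

import torch.nn as nn

class CSSE(nn.Module):
    """Serial channel (cSE) and spatial (sSE) calibration described in Section 1.2."""
    def __init__(self, channels, reduction=2):     # reduction ratio assumed
        super().__init__()
        self.cse = nn.Sequential(                  # channel calibration branch
            nn.AdaptiveAvgPool2d(1),               # H x W x C -> 1 x 1 x C
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),                 # activation between the two 1x1 convs (assumed)
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())                          # channel mask
        self.sse = nn.Sequential(                  # spatial calibration branch (sSE)
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid())                          # H x W x 1 spatial mask

    def forward(self, x):
        x = x * self.cse(x)                        # channel-calibrated feature map M'
        return x * self.sse(x)                     # spatial calibration applied to M'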
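Finally, the Z-pool and the dimension-interaction branches of CBAMT (Section 1.3) can be sketched in the same style. The permutation used to let dimension C interact with H or W, and the averaging of the branch outputs in the trailing comment, are one plausible reading of the description above, not the authors' code.

import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Reduce one dimension to 2 channels: its max- and average-pooled features (Section 1.3)."""
    def forward(self, x, dim=1):
        return torch.cat([torch.amax(x, dim=dim, keepdim=True),
                          torch.mean(x, dim=dim, keepdim=True)], dim=dim)

class InteractionBranch(nn.Module):
    """One CBAMT branch: rotate so the chosen dimension plays the channel role,
    apply Z-pool, a 7x7 convolution and a sigmoid, then rotate back."""
    def __init__(self, perm):
        super().__init__()
        self.perm = perm                           # e.g. (0, 2, 1, 3) swaps C and H
        self.zpool = ZPool()
        self.conv = nn.Conv2d(2, 1, 7, padding=3, bias=False)

    def forward(self, x):                          # x: (N, C, H, W)
        y = x.permute(self.perm)                   # interact C with H or W
        att = torch.sigmoid(self.conv(self.zpool(y)))   # attention weights for that interaction
        return (y * att).permute(self.perm)        # the permutations used here are involutions

# The two interaction branches could then be combined, for example, as
#   f = 0.5 * (branch_ch(x) + branch_cw(x))
# with branch_ch = InteractionBranch((0, 2, 1, 3)) and branch_cw = InteractionBranch((0, 3, 2, 1)),
# before the SAM-style spatial branch is applied, as described in Section 1.3.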
1.4 Feature fusion module (FFM)
The function of the feature fusion module is to fuse the features coming from the spatial path with those coming from the context path [18]. FFM is needed to fuse the two because the former are low-level and the latter high-level features [18]. The concrete procedure is as follows: the features from the spatial path and the context path are concatenated to give a feature map H; global average pooling is applied to H to obtain a 1 x 1 x C vector; finally, as in SENet, H is re-weighted by multiplication with these channel weights, giving the final feature map H'. Fig. 5 shows the structure of this module.

Fig. 5 FFM structure diagram

1.5 Attention refinement module (ARM)
The original model also designed an ARM for the context path, shown in Fig. 6. Global average pooling is used first to capture the overall contextual information; it helps the model learn features and strengthens feature learning at the different stages of the feature-extraction network. It also integrates the overall context in a simple way, requires no upsampling, and its computational cost is negligible.

Fig. 6 ARM block diagram

1.6 Loss function
Two auxiliary loss functions are added to the context path to supervise its output better. Both the principal loss function and the auxiliary loss functions use the softmax function of Eq. (2) [19]. The auxiliary loss functions supervise the training of the model, while the principal loss function supervises the output of the whole BiSeNet (L_p); the two special auxiliary loss functions supervise the output of the context path (L_i), and the parameter \alpha balances the weights of the principal and auxiliary losses, as in Eq. (3):

\mathrm{loss} = \frac{1}{n}\sum_i l_i = -\frac{1}{n}\sum_i \log\frac{e^{p_i}}{\sum_j e^{p_j}}    (2)

L(X;W) = l_p(X;W) + \alpha \sum_{i=2}^{K} l_i(X_i;W)    (3)

where l_p is the principal loss, l_i are the auxiliary losses, X_i is the output feature of the i-th stage of ResNet, K = 3, and \alpha is 1. The auxiliary losses are used only during training.

2 Experimental results and analysis

2.1 Datasets
Two urban road-scene datasets are used, Cityscapes and CamVid, which are the datasets most commonly used to evaluate road-scene segmentation models [19]. The CamVid dataset has 11 classes. Cityscapes contains two parts, 5000 finely annotated images with high-quality pixel-level labels and 20000 additional images with coarse labels; only the 5000 finely annotated Cityscapes images are used in the experiments. The model is compared with the baseline in terms of speed (inference time) and accuracy to analyse its segmentation performance, and the segmentation quality is also shown through visualizations.

2.2 Parameter settings
The experimental environment is the Windows 10 operating system, an Nvidia RTX 1080Ti (6 GB), the Python 3.9 environment and the PyTorch 1.9 framework. The settings are batch size 8, momentum 0.9 and weight decay 5 x 10^-4. Stochastic gradient descent (SGD) is used for training together with the "poly" learning-rate policy with power = 0.9:

\eta = \eta_0 \left(1 - \frac{\mathrm{iter}}{\mathrm{max\_iter}}\right)^{\mathrm{power}}    (4)

where the initial learning rate \eta_0 is 2.5 x 10^-2, iter is the current iteration and max_iter is the total number of iterations [19], set to 1000 (i.e. 1000 epochs). The power is fixed at 0.9, and the balance parameter between the principal and auxiliary losses is set to 1.
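To make the training setup of Sections 1.6 and 2.2 concrete, the following is a minimal PyTorch-style sketch of the SGD optimizer, the poly learning-rate policy of Eq. (4) and the weighted sum of principal and auxiliary losses of Eq. (3). The interface of `model` (returning the main output plus the auxiliary outputs) and all other names are assumptions made for the example, not the authors' code.

from torch import nn, optim

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "Poly" learning-rate policy of Eq. (4)
    return base_lr * (1.0 - cur_iter / max_iter) ** power

criterion = nn.CrossEntropyLoss()              # pixel-wise softmax loss of Eq. (2)

def total_loss(main_out, aux_outs, target, alpha=1.0):
    # Principal loss plus weighted auxiliary losses, Eq. (3)
    loss = criterion(main_out, target)
    for aux in aux_outs:                       # auxiliary heads on the context path
        loss = loss + alpha * criterion(aux, target)
    return loss

# Assumed usage (model, loader and max_iter are placeholders):
# optimizer = optim.SGD(model.parameters(), lr=2.5e-2, momentum=0.9, weight_decay=5e-4)
# for it, (image, target) in enumerate(loader):
#     for group in optimizer.param_groups:
#         group["lr"] = poly_lr(2.5e-2, it, max_iter)
#     main_out, aux_outs = model(image)        # auxiliary outputs are used only during training
#     loss = total_loss(main_out, aux_outs, target)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()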
2.3 Ablation experiments
The contribution of the CBAMT and CSSE modules to model performance was verified under identical conditions; the results are shown in Table 1. Both modules improve segmentation accuracy, and the improvement from CBAMT is larger than that from CSSE.

Table 1 Validation of each module on the CamVid dataset (a check mark indicates the module is used)
CBAM   CSSE   CBAMT   FFM   ARM   | mIoU/%
 -      Y      Y       Y     Y    |  66.5
 -      -      Y       Y     Y    |  66.1
 Y      Y      -       Y     Y    |  65.9
 Y      -      -       Y     Y    |  65.7

2.4 Overall performance analysis and comparison
The baseline used in this paper is our own implementation of the ResNet18 version of BiSeNet.

2.4.1 Segmentation accuracy
Model performance is measured by the mean intersection over union (mIoU):

\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{p_{ii}}{\sum_{j=0}^{k}p_{ij}+\sum_{j=0}^{k}p_{ji}-p_{ii}}    (5)

Table 2 compares the segmentation results of the proposed algorithm with other algorithms. Compared with the original BiSeNet, the accuracy of the proposed model improves by 1.6% on Cityscapes and 1.1% on CamVid.

Table 2 Segmentation accuracy comparison (mIoU/%)
Model              Cityscapes   CamVid
SegNet                58.9        57.0
ENet                  65.7        52.9
DFANet                71.3        61.5
MobileNet1            77.8        64.7
BiSeNet (Res-18)      75.7        65.4
Ours                  77.3        66.5

2.4.2 Inference speed
In the speed tests, the inference time of the baseline model is 35.5 ms on Cityscapes and 21.5 ms on CamVid; the results are shown in Table 3.

Table 3 Inference time comparison (ms)
Model              Cityscapes   CamVid
SegNet                24.6        15.7
MobileNet1            32.3        10.5
BiSeNet (Res-18)      35.5        21.5
Ours                  37.9        24.5

The proposed model needs 37.9 ms on Cityscapes and 24.5 ms on CamVid, which fully satisfies the requirements of real-time semantic segmentation. Overall, considering both speed and accuracy, the proposed model achieves a better balance between inference speed and segmentation accuracy than BiSeNet (Res-18) on Cityscapes and CamVid. Compared with ENet, accuracy improves markedly; compared with the commonly used MobileNet1, the inference time is similar and the accuracy is somewhat higher. MobileNet1, however, uses grouped convolutions, does not take spatial information into account, still has a large number of layers, and places higher demands on hardware such as the GPU. Moreover, because of the grouped convolutions, repeated experiments occasionally produced very poor segmentation; according to the literature this may be because grouped convolutions can cause the model to fail to learn, and this will be investigated further in future work.

2.4.3 Visualization results
Fig. 7 shows the segmentation results of the proposed model on CamVid and the comparison with the baseline model. The first three columns are the input image, the label and the segmentation result of the proposed model; they show that the improved model segments well. The quality differs between object types: larger objects, such as trees, are segmented well and their classes are usually identified correctly, whereas very small objects cause problems, with some thin objects not being recognised. Like most current real-time models, the model also segments unlabelled objects rather chaotically. Comparing the actual segmentation of the proposed model with that of the baseline model (the last column) shows that the improved semantic segmentation model segments better than the baseline.

Fig. 7 Visualization results

3 Conclusion
This paper analyses the accuracy and real-time behaviour of semantic segmentation algorithms in depth and proposes a road-scene segmentation model with spatial-channel dual attention that maintains segmentation accuracy while remaining real-time. The CBAMT module in the context path captures more of the important contextual features, and the CSSE module in the spatial path captures richer spatial information. Experiments show that the proposed model balances accuracy and speed better than the original BiSeNet, and the constructed attention mechanisms and lightweight model may serve as a reference for other researchers. Because the algorithm was tested in depth only on road-scene datasets and lacks coverage of other categories, future work will design the model around specific segmentation targets to further improve its practical performance, and will study and test it on real targets.

References
[1] JIA Gengyun, ZHAO Haiying, LIU Feiduo, et al. Graph-Based Image Segmentation Algorithm Based on Superpixels[J]. Journal of Beijing University of Posts and Telecommunications, 2018, 41(3): 46.
[2] HUANG Furong. Semantic Segmentation Algorithm CBR-ENet for Real-time Road Scenes[J]. Journal of China Academy of Electronic Sciences, 2021, 16(3): 27. (in Chinese)
[3] CANAYAZ M. C+EffxNet: A Novel Hybrid Approach for COVID-19 Diagnosis on CT Images Based on CBAM and EfficientNet[J]. Chaos, Solitons & Fractals, 2021, 151: 111310.
[4] ZU Hongliang. Research on Image Segmentation Algorithms Based on Fuzzy Clustering[D]. Harbin: Harbin University of Science and Technology, 2020. (in Chinese)
[5] LYU Peiqing. Research on Automatic Segmentation of Liver CT Images Based on Improved U-Net[D]. Harbin: Harbin University of Science and Technology, 2022. (in Chinese)
[6] TANG X, TU W, LI K, et al. DFFNet: an IoT-perceptive Dual Feature Fusion Network for General Real-time Semantic Segmentation[J]. Information Sciences, 2021, 565: 326.
[7] ZHANG R X, ZHANG L M. Panoramic Visual Perception and Identification of Architectural Cityscape Elements in a Virtual-reality Environment[J]. Future Generation Computer Systems, 2021, 118: 107.
[8] A Method to Identify How Librarians Adopt a Technology Innovation, CBAM (Concern Based Adoption Model)[J]. Journal of the Korean Society for Library and Information Science, 2016, 50(3).
[9] ZHANG Liguo, CHENG Yao, JIN Mei, et al. Semantic Segmentation Method of Indoor Scenes Based on Improved BiSeNet[J]. Acta Metrologica Sinica, 2021, 42(4): 515. (in Chinese)
[10] GAO Xiang, LI Chungeng, AN Jubai. Real-time Semantic Segmentation of Images Based on Attention and Multi-label Classification[J]. Journal of Computer-Aided Design & Computer Graphics, 2021, 33(1): 59. (in Chinese)
[11] YIN J, GUO L, JIANG W, et al. ShuffleNet-inspired Lightweight Neural Network Design for Automatic Modulation Classification Methods in Ubiquitous IoT Cyber-physical Systems[J]. Computer Communications, 2021, 176: 249.
[12] RÜNZ M, AGAPITO L. Co-fusion: Real-time Segmentation, Tracking and Fusion of Multiple Objects[C]//2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017: 4471.
[13] CHEN Y C, LAI K T, LIU D, et al. TagNet: Triplet-attention Graph Networks for Hashtag Recommendation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 32(3): 1148.
[14] REN Tianci, HUANG Xiangsheng, DING Weili, et al. Semantic Segmentation Algorithm for Global Bilateral Networks[J]. Computer Science, 2020, 47(S1): 161. (in Chinese)
[15] LI J, LIN Y, LIU R, et al. RSCA: Real-time Segmentation-based Context-aware Scene Text Detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 2349.
[16] SAFAE El Houfi, AICHA Majda. Efficient Use of Recent Progresses for Real-time Semantic Segmentation[J]. Machine Vision and Applications, 2020, 31(6): 45.
[17] GRACE Martin F, PING Juliann. Driverless Technologies and Their Effects on Insurers and the State: An Initial Assessment[J]. Risk Management and Insurance Review, 2018, 21(3): 1.
[18] WEI W, ZHOU B, POŁAP D, et al. A Regional Adaptive Variational PDE Model for Computed Tomography Image Reconstruction[J]. Pattern Recognition, 2019, 92: 64.
[19] FAN Borui, WU Wei. Sufficient Context for Real-Time Semantic Segmentation[J]. Journal of Physics: Conference Series, 2021, 1754(1): 012230.
1 Real-Time Tracking via On-line Boosting Helmut Grabner,Michael Grabner,Horst BischofInstitute for Computer Graphics and VisionGraz University of Technology{hgrabner,mgrabner,bischof}@icg.tu-graz.ac.atAbstractVery recently tracking was approached using classification techniques suchas support vector machines.The object to be tracked is discriminated by aclassifier from the background.In a similar spirit we propose a novel on-lineAdaBoost feature selection algorithm for tracking.The distinct advantage ofour method is its capability of on-line training.This allows to adapt the clas-sifier while tracking the object.Therefore appearance changes of the object(e.g.out of plane rotations,illumination changes)are handled quite naturally.Moreover,depending on the background the algorithm selects the most dis-criminating features for tracking resulting in stable tracking results.By usingfast computable features(e.g.Haar-like wavelets,orientation histograms,lo-cal binary patterns)the algorithm runs in real-time.We demonstrate the per-formance of the algorithm on several(publically available)video sequences.1IntroductionThe efficient and robust tracking of objects in complex environments is important for a variety of applications including video surveillance[25],autonomous driving[1]or human-computer interaction[5,4].Thus it is a great challenge to design robust visual tracking methods which can cope with the inevitable variations that can occur in natural scenes such as changes in the illumination,changes of the shape,changes in the view-point,reflectance of the object or partial occlusion of the target.Moreover tracking suc-cess or failure may also depend on how distinguishable an object is from its background. Stated differently,if the object is very distinctive,a simple tracker may already fulfill the requirements.However,having objects similar to the background requires more sophisti-cated features.As a result there is the need for trackers which can handle on the one hand all possible variations of appearance changes of the target object and on the other hand are able to reliably cope with background clutter.Several approaches have been proposed to fulfill these two main requirements for tracking.To cope with appearance variations of the target object during tracking,existing tracking approaches(e.g.[3,8,10])are enhanced by adaptivness to be able to incre-mentally adjust to the changes in the specific tracking environment(e.g.[14,23,17,22, 13,2,15,26]).In other words,invariance against the different variations is obtained by adaptive methods or representations.Many classical algorithms have been modified in order to be able to adjust the tracking algorithm to the tracking environment.The classi-cal subspace tracking approach of Black et al.[3]was enhanced by incremental subspace updating in[14,22].In[14]it is proposed to express the general adaption problem as a2subspace adaption problem,where the visual appearance variations at a short time scale are represented as a linear subspace.In contrast,[26]suggests an on-line selection of local Haar-like features in order to handle possible variations in appearance.In addition,to the on-line adaption problem,recently many techniques have addressed the idea of using information about the background in order to increase the robustness of tracking[1,19,2,26,6].Especially the work of Collins and Liu[6]emphasizes the im-portance of the background appearance.They postulate that the feature space that best distinguishes between object and background is the 
best feature space to use for track-ing.The idea of considering the tracking problem as a classification problem between object and background has lead to further works[1,2,27]applying well known clas-sifiers to the tracking problem.In[1]a tracker is realized by using a Support Vector Machine which learns off-line to distinguish between the object and the background.The work most closely related to ours is that of[2].Again tracking is considered as a binary classification problem,where an ensemble of weak classifiers is combined into a strong classifier capturing the idea of AdaBoost for selecting discriminative features for tracking. To achieve robustness against appearance variations novel weak classifiers can be added to the ensemble.However,this is done in a batch processing mode using off-line boosting in a batch manner after new training examples have been collected.Moreover the features are quite restricted.The novelty of this paper is to present a real-time object tracking method which is based on a novel on-line version of the AdaBoost algorithm.Our algorithm performs on-line updating of the ensemble of features for the target object during tracking and thus is able to cope with appearance changes of the object.Furthermore,the on-line trained classifier uses the surrounding background as negative examples in the update and becomes therefore very robust against the drifting problem[18].In addition this negative update allows the algorithm to choose the most discriminative features between the object and the background.Therefore the method can deal with both appearance variations of the object and different backgrounds.The algorithm,which uses only grayvalue information (but can also be extended to color),is able to run in real-time since training is simply done by updating the model with the positive and negative examples of the current frame.The reminder of the paper is organized as follows.In Section2we introduce the tracking algorithm and the novel on-line version of AdaBoost for feature selection which forms the bases of the approach.In Section3we show experimental results illustrating the adaptivity,the robustness and the generality of the proposed tracker.2TrackingThe main idea is to formulate the tracking problem as a binary classification task and to achieve robustness by continuously updating the current classifier of the target object. 
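As a rough illustration of this tracking-as-classification idea, the following Python sketch shows one frame of such a loop: the current classifier is evaluated over sub-windows of a search region, the window is moved to the confidence maximum, and the classifier is then updated with the new positive sample and surrounding negative samples. The classifier interface, the search-region size and all names are assumptions made for the example; the components actually used in this paper are described in the following sections.

import numpy as np

def track_frame(frame, classifier, prev_box, search_margin=0.33):
    """One tracking step; `classifier` is assumed to expose confidence(patch) and
    update(patch, label); boxes are (x, y, w, h)."""
    x, y, w, h = prev_box
    mx, my = int(w * search_margin), int(h * search_margin)    # search region around old position
    best_conf, best_box = -np.inf, prev_box
    for yy in range(max(0, y - my), min(frame.shape[0] - h, y + my) + 1):
        for xx in range(max(0, x - mx), min(frame.shape[1] - w, x + mx) + 1):
            conf = classifier.confidence(frame[yy:yy + h, xx:xx + w])   # confidence map value
            if conf > best_conf:
                best_conf, best_box = conf, (xx, yy, w, h)
    bx, by, _, _ = best_box
    classifier.update(frame[by:by + h, bx:bx + w], +1)          # positive update at new position
    for nx, ny in ((bx - w, by), (bx + w, by), (bx, by - h), (bx, by + h)):   # surrounding background
        if 0 <= ny <= frame.shape[0] - h and 0 <= nx <= frame.shape[1] - w:
            classifier.update(frame[ny:ny + h, nx:nx + w], -1)  # negative updates
    return best_box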
The principle of the tracking approach is depicted in Figure1.Since we are interested in tracking,we assume that the target object has already been detected.This image region is assumed to be a positive image sample for the tracker.At the same time negative examples are extracted by taking regions of the same size as the target window from the surrounding background.These samples are used to make several iterations of the on-line boosting algorithm in order to obtain afirst model which is already stable.Note that these iterations are only necessary for initialization of the tracker.The(a)(b)(c)(d)Figure 1:The four main steps of tracking by a classisifer.Given an initial position of the object (a)in time t ,the classifier is evaluated at many possible positions in a surrounding search region in frame t +1.The achieved confidence map (c)is analyzed in order to estimate the most probable position and finally the tracker (classifier)is updated (d).tracking step is based on the classical approach of template tracking [12].We evaluate the current classifier at a region of interest and obtain for each position a confidence value.We analyze the confidence map and shift the target window to the new location of the maxima.For maximum detection also a mean shift procedure [7]can be ing a motion model for the target object would allow a reduction of the search window.Once the objects has been detected the classifier has to be updated in order to adjust to possible changes in appearance of the target object and to become discriminative to a different background.The current target region is used as a positive update of the classifier while again the surrounding regions represent the negative samples.As new frames arrive,the whole procedure is repeated and the classifier is therefore able to adapt to possible appearance changes and in addition becomes robust against background clutter.Note that the classifier focuses on the current target object while at the same time tries to distinguish the target from its surrounding.Apart from this,tracking of multiple objects is feasible by initializing a separate classifier for each target object.2.1On-line AdaBoostIn this section we briefly review the on-line boosting algorithm (for more details see[11])which allows to generate classifiers that can be efficiently updated by incrementally applying samples.For better understanding of this approach we define the following terms:Weak classifier:A weak classifier has only to perform slightly better than random guess-ing (i.e.,for a binary decision problem,the error rate must be less than 50%).The hypothesis h weak generated by a weak classifier corresponds to a feature and is ob-tained by applying a defined learning algorithm.Selector:Given a set of M weak classifiers with hypothesis H weak ={h weak 1,...,h weak M },a selector selects exactly one of those.h sel (x )=h weak m (x )(1)where m is chosen according to a optimization criterion.In fact we use the esti-mated error e i of each weak classifier h weak i∈H weak such that m =argmin i e i .Strong classifier:Given a set of N weak classifiers,a strong classifier is computed by alinear combination of selectors.Moreover,the value con f (·)(which is related to3Figure2:Principle of on-line boosting for feature selection.the margin)can be interpreted as a confidence measure of the strong classifier.hStrong(x)=sign(con f(x))(2)con f(x)=N∑n=1αn·h sel n(x)(3)The main idea of on-line boosting is the introduction of the so called selectors.They are randomly initialized and each 
of them holds a seperate feature pool of weak classifiers. When a new training sample arrives the weak classifiers of each selector are updated.The best weak classifier(having the lowest error)is selected by the selector where the error of the weak classifier is estimated from samples seen so far.The complexity is determined by the number of selectors.The part which requires most of the processing time is the updating of weak classifiers. In order to speed up this process,we propose as a modification(similar to[28])to use a single“global weak classifier”pool(see Figure2)which is shared by all selectors instead of single pools for each of them.The advantage of this modification is that now for each sample that arrives,all weak classifiers need to be updated only once.Then the selectors sequentially switch to the best weak classifier with respect to the current estimatedλand the importance weight is passed on to the next selector.This procedure is repeated until all selectors are updated.Finally,at each time step an updated strong classifier is available.In order to increase the diversity of the weak classifiers and to allow changes in the environment,the worst weak classifier of the shared feature pool is replaced with a new randomly chosen one.2.2FeaturesWe use three different types of features for generating weak hypotheses.Haar-like fea-tures like Viola and Jones[24],orientation histograms[16,21,9]and a simple1version of local binary patterns(LBPs)[20].Note,that the computation of all feature types can be 1Binary patterns using a four-neighborhood(i.e.24=16patterns)as a16bin histogram feature.45 done very efficiently using integral images and integral histograms as data structures[21]. This allows to do exhaustive template matching while tracking is still real-time.To obtain a hypothesis from these features we model the probability distribution for positive samples and negative samples.Probability density estimation is done by a Kalmanfiltering technique.For the classic Haar-like wavelets we use a simple threshold and a Bayesian decision criterion as learning algorithm.For the histogram based feature types(orientation histograms and LBPs),we use nearest neighbor learning.Of course, other types and other learning algorithms can be used to obtain a weak hypotheses.(For more details see[11])3Experiments and DiscussionThe experimental section is divided into two parts.First,we perform experiments demon-strating three specific properties of our tracking approach and second we present results on public available sequences for comparison to other tracking approaches.Each tracking task has been initialized by manually marking the target object in thefirst frame.Track-ing has been applied to sequences consisting of hundreds of frames.The performance (speed)depends on the size of the search region which we have defined by enlarging the target region by one third in each direction(for this region the integral representations are computed).In our experiments no motion model has been used.We achieve a frame rate of about20fps.The strong classifier consists of50selectors and the shared feature pool provides250weak classifiers.All experiments have been done on a standard1.6GHz PC with256MB RAM.3.1IllustrationsThe goal of this section is to illustrate three properties of the proposed tracker-adaptivity, robustness and generality.Therefore multiple challenging sequences have been captured with a static camera having a resolution of640×480.AdaptivityTo illustrate the adaptivity of the tracker and its 
capability to select the best features de-pending on the background,we process a scene where the target object is a small textured patch,see Figure3.The goal of the scenario is to show how the proposed on-line feature selection method can adapt its model to the current tracking problem.Since our approach looks for suitable features which can best discriminate the object from the background, we change the background from a homogeneous to a textured one(same texture as the patch).For evaluation we consider the distribution of the selected feature types(for this experiment we simply used Haar-like features and LBPs).As we can see in the second row of Figure3,the initial choice mainly uses Haar-like features(green)while LBPs(red) are rather rarely used for handling this tracking task.However,after putting texture to the background we can see from the plot in the second row,that the features for tracking the target immediately change.Note that the exchange of features occurs within a quite short period.As a result LBP features have become much more important for the tracking task because the Haar-like features are no longer distriminative.Furthermore the targetobject is still successful tracked even though the texture of the target is the same as in thebackground.(a)Frame7(b)Frame55(c)Frame121(d)Frame152(e)Frame245Figure 3:First row shows the tracked object which is marked with a yellow rectangle.On-line feature selection for tracking is analyzed by considering the ratio of two selected feature types -Haar-like features (dashed green)and local binary patterns (solid red).This experiment shows the importance of on-line learning because we cannot train for all different backgrounds prior tracking starts.In addition,it shows that there is the need of different types of features.RobustnessFor successful real-world tracking a tracker must be able to handle various appearance changes (i.e.:illumination changes,occlusions,out-of-plane rotations,movement)which can occur in natural scenes.Figure 4illustrates the behavior of our proposed method in case of such interferences of the target object.The sequence shows a glass which ini-tially gets occluded (more than 50%)by a paper,afterwards it is moved behind it with additional illumination changes which are caused by the occlusion and finally view-point changes of the target object.The continuous increase of the confidence maximum value in the initial phase (see row 3,image 1)implies the adapting of the tracker to the target object with respect to its background.However if the target object changes its appear-ance or the surrounding background of the target becomes different,the tracker needs to update his features which is reflected in oscillations of the confidence maximum (see row 3)and a flattened confidence map (see row 2).Movement of the tracking target is represented by a shifted peak in the confidence map.To summarize,the tracker is able to handle all kinds of appearance variations of the target object and always aims at finding the best discriminative features for discriminating the target object from the surrounding background.Therefore,the adaptivity is strongly related to the robustness of a tracker.GeneralityIn order to demonstrate the generality we use four different sequences with diverse tar-get objects,see Figure 5.The first sequence,which is depicted in row 1,illustrates the tracking of a tiny Webcam in a cluttered scene.Even though the target object contains few texture and changes in pose occur it is robustly tracked.The second sequence 
(row 2)67Figure4:Tracking results on a sequence(row1)containing a combination of appearance changes(i.e.illumination,occlusion,movement,out of plane rotation).The behaviour of the proposed tracker is analyzed considering the confidence map(row2)and the maxi-mum confidence value over time(row3).shows the results of tracking a face.As we can see from the depicted frames,the pro-posed approach can cope very well with occlusions which again is the effect of the large number of local features for tracking.Moreover even large pose variations of the head do not confuse the tracker showing that the tracker adapts to novel appearances of the target object.Row3illustrates that only little texture of the object is sufficient for tracking.A glass,having almost no texture,is tracked and again shows the reliability and adaptivity of the proposed tracker.Row4demonstrates the behavior in case of multiple very similar target objects.As can be seen,even though the objects significantly overlap the trackers get not confused demonstrating that the classifiers have really learned to distinguish the specific object from its background.Of course,an overlap over a long duration can cause adaption to the foreground object andfinally leading to failure to track the occluded object because of the bad updates.However,this can be prevented by using some higher level control logic.To summarize,the proposed tracker has the ability to adapt to the appearance of all kinds of objects by learning a good selection of features.Therefore the tracker’s property to adapt is useful for both the handling of appearance variations and for selecting features in order to adapt to any target object.3.2Evaluation on public available sequencesUnfortunately up to now there is no public available framework for comparing tracking techniques.Therefore we decided to process public available sequences,see Figure6and 7,which have already been used in other publications for illustrating tracking results([17, 15]).These sequences contain changes in brightness,view-point and further appearance variations.The results show that the proposed tracker can cope with all these variations and results are at least as good as those presented in the according publications.Figure 5:To illustrate the generality of the proposed method,sequences of four different objects have been captured.The tracking results show that even objects with almost no texture (see row 3)can be successfully tracked.Moreover the tracking algorithm can cope with multiple initialized objects even if they have similar appearance (see row4).(a)Frame30(b)Frame201(c)Frame403(d)Frame451(e)Frame682(f)Frame379(g)Frame451(h)Frame915(i)Frame981(j)Frame 1287Figure 6:These sequences have been provided by Lim and Ross ([17]).The first sequence shows a person moving from dark towards bright area while making changes in pose and and partial occlusion of the target region can be seen.In the second row,an animal doll is moving with large pose,light variations in a cluttered background.8(a)Frame1(b)Frame127(c)Frame269(d)Frame509(e)Frame 885Figure 7:Experimental results on the sequence provided by Jepson [15].Again the object of interest is a face which moves in front of cluttered background and contains variations in appearance.4ConclusionIn this paper we have proposed a robust and generic real-time tracking technique (about 20fps using a standard 1.6GHz PC with 512MB RAM)which considers the tracking problem as a binary classification problem between object and background.Most ex-isting approaches 
construct a representation of the target object before the tracking task starts and therefore utilize a fixed representation to handle appearance changes during tracking.However,our proposed method does both -adjusting to the variations in ap-pearance during tracking and selecting suitable features which can learn any object and can discriminate it from the surrounding background.The basis is an on-line AdaBoost algorithm which allows to update features of the classifier during tracking.Furthermore the efficient computation of the features allows to use this tracker within real-time appli-cations.Finally,since the tracker is based on a classifier approach now there are several new venues of research like how we can construct a more generic model (like a detector)of the target object during tracking.AcknowledgmentsThe project results have been developed in the MISTRAL Project which is financed by the Austrian Research Promotion Agency (www.ffg.at).In addition this work has been spon-sored in part by the Austrian Federal Ministry of Transport,Innovation and Technology under P-Nr.I2-2-26p VITUS2,by the Austrian Joint Research Project Cognitive Vision under projects S9103-N04and S9104-N04,and by the EU FP6-507752NoE MUSCLE IST project.References[1]S.Avidan.Support vector tracking.PAMI ,26:1064–1072,2004.[2]S.Avidan.Ensemble tracking.In Proc.CVPR ,volume 2,pages 494–501,2005.[3]M.J.Black and A.D.Jepson.Eigentracking:Robust matching and tracking of articulatedobjects using a view-based representation.IJCV ,26(1):63–84,1998.[4] A.Bobick,S.Intille,J.Davis,F.Baird,C.Pinhanez,L.Campbell,Y .Ivanov,A.Schutte,andA.Wilson.The munications of the ACM ,39(3&4):438–455,2000.[5]puter vision face tracking for use in a perceptual user interface.IntelTechnology Journal ,1998.910[6]R.T.Collins,Y.Liu,and M.Leordeanu.Online selection of discriminative tracking features.PAMI,27(10):1631–1643,Oct.2005.[7] aniciu and P.Meer.Mean shift analysis and applications.In Proc.ICCV,volume2,pages1197–1203,1999.[8] aniciu,V.Ramesh,and P.Meer.Real-time tracking of non-rigid objects using meanshift.In Proc.CVPR,volume2,pages142–149,2000.[9]N.Dalal and B.Triggs.Histograms of oriented gradients for human detection.In Proc.CVPR,volume1,pages886–893,2005.[10] A.Elgammal,R.Duraiswami,and L.S.Davis.Probabilistic tracking in joint feature-spatialspaces.In Proc.CVPR,volume1,pages781–788,2003.[11]H.Grabner and H.Bischof.On-line boosting and vision.In Proc.CVPR,volume1,pages260–267,2006.[12]G.D.Hager and P.N.Belhumeur.Efficient region tracking with parametric models of geom-etry and illumination.PAMI,20(10):1025–1039,1998.[13] B.Han and L.Davis.On-line density-based appearance modeling for object tracking.In Proc.ICCV,volume2,pages1492–1499,2005.[14]J.Ho,K.Lee,M.Yang,and D.Kriegman.Visual tracking using learned linear subspaces.InProc.CVPR,volume1,pages782–789,2004.[15] A.D.Jepson,D.J.Fleet,and T.F.El-Maraghi.Robust online appearance models for visualtracking.In Proc.CVPR,volume1,pages415–422,2001.[16]K.Levi and Y.Weiss.Learning object detection from a small number of examples:Theimportance of good features.In Proc.CVPR,pages53–60,2004.[17]J.Lim,D.Ross,R.Lin,and M.Yang.Incremental learning for visual tracking.In Advancesin Neural Information Processing Systems17,pages793–800.MIT Press,2005.[18]Iain Matthews,Takahiro Ishikawa,and Simon Baker.The template update problem.PAMI,26:810–815,2004.[19]H.T.Nguyen and A.Smeulders.Tracking aspects of the foreground against the background.In Proc.ECCV,volume2,pages446–456,2004.[20]T.Ojala,M.Pietik¨a 
inen,and T.M¨a enp¨a¨a.Multiresolution gray-scale and rotation invarianttexture classification with local binary patterns.PAMI,24(7):971–987,2002.[21] F.Porikli.Integral histogram:A fast way to extract histograms in cartesian spaces.In Proc.CVPR,volume1,pages829–836,2005.[22] D.Ross,J.Lim,and M.Yang.Adaptive proballistic visual tracking with incremental subspaceupdate.In Proc.ECCV,volume2,pages470–482,2004.[23]J.Vermaak,P.P´e rez,M.Gangnet,and A.Blake.Towards improved observation models forvisual tracking:Selective adaption.In Proc.ECCV,pages645–660,2002.[24]P.Viola and M.Jones.Rapid object detection using a boosted cascade of simple features.InProc.CVPR,volume1,pages511–518,2001.[25]P.Viola,M.J.Jones,and D.Snow.Detecting pedestrians using patterns of motion and appear-ance.In Proc.ICCV,volume2,pages734–741,2003.[26]J.Wang,X.Chen,and W.Gao.Online selecting discriminative tracking features using particlefilter.In Proc.CVPR,volume2,pages1037–1042,2005.[27]O.Williams,A.Blake,and R.Cipolla.Sparse bayesian learning for efficient visual tracking.PAMI,27:1292–1304,2005.[28]J.Wu,J.M.Rehg,and M.D.Mullin.Learning a rare event detection cascade by direct featureselection.In Advances in Neural Information Processing Systems16.MIT Press,2003.。
In recent years, research on and applications of intelligent video surveillance technology have attracted wide attention. As a basic processing stage of such systems, moving-object detection in surveillance video is a very active research direction and an important topic in computer vision, with broad application prospects in intelligent surveillance, video compression, autonomous navigation, human-computer interaction and virtual reality. With the development of computer hardware and software, the combination of computer technology and surveillance technology has become a new direction of applied research. The essential difference between such surveillance systems and traditional ones lies in their intelligence. The main purposes of video surveillance are monitoring intruders, measuring traffic flow, and security monitoring of people passing through gates. Most traditional surveillance systems require human operators and suffer from various problems, and semi-automatic detection methods such as infrared sensors raise false alarms for cats, dogs and other animals. Research on intelligent surveillance systems is therefore highly necessary. In short, the camera should not merely replace the human eye in simply acquiring images of the scene; computer technology should assist, or even replace, the human operator in carrying out the surveillance task, so that good surveillance results are obtained while the human effort involved is greatly reduced. Intelligent surveillance systems therefore have broad application prospects and potential market value. For a surveillance system to become intelligent, however, the computer must be able to extract the objects of interest from the video sequences captured by the surveillance cameras, classify and track them, and on this basis understand and describe their behaviour. Intelligent video surveillance is an emerging application direction and a much-studied frontier topic of computer vision.

The development of video surveillance technology has gone through roughly two stages. (1) First-generation systems: early video surveillance consisted mainly of analogue closed-circuit television systems, referred to as analogue video surveillance. Signals were usually transmitted over coaxial cable, but with analogue transmission it is very difficult to keep a wide-band signal at a high signal-to-noise ratio and low distortion, so first-generation systems had poor reliability and noise immunity and relatively simple functionality. (2) Second-generation systems: with the development of digital technology, improvements in image compression and coding techniques and standards, and the steadily falling cost of chips, digital video surveillance systems developed rapidly. The high-speed data-processing capability of computers is used for video capture and processing, which greatly improves image quality, strengthens surveillance capability, increases system reliability and extensibility, and makes the functionality increasingly specialised and diverse.
Face tracking with automatic model constructionJesus Nuevo,Luis M.Bergasa ⁎,David F.Llorca,Manuel OcañaDepartment of Electronics,Universidad de Alcala.Esc.Politecnica,Crta Madrid-Barcelona,Km 33,600.28871Alcala de Henares,Madrid,Spaina b s t r a c ta r t i c l e i n f o Article history:Received 12November 2009Received in revised form 26August 2010Accepted 15November 2010Available online xxxx Keywords:Face trackingAppearance modeling Incremental clustering Robust fittingDriver monitoringThis paper describes an active model with a robust texture model built on-line.The model uses one camera and it is able to operate without active illumination.The texture model is de fined by a series of clusters,which are built in a video sequence using previously encountered samples.This model is used to search for the corresponding element in the following frames.An on-line clustering method,named leaderP is described and evaluated on an application of face tracking.A 20-point shape model is used.This model is built of fline,and a robust fitting function is used to restrict the position of the points.Our proposal is to serve as one of the stages in a driver monitoring system.To test it,a new set of sequences of drivers recorded outdoors and in a realistic simulator has been compiled.Experimental results for typical outdoor driving scenarios,with frequent head movement,turns and occlusions are presented.Our approach is tested and compared with the Simultaneous Modeling and Tracking (SMAT)[1],and the recently presented Stacked Trimmed Active Shape Model (STASM)[2],and shows better results than SMAT and similar fitting error levels to STASM,with much faster execution times and improved robustness.©2010Elsevier B.V.All rights reserved.Contents 1.Introduction ...............................................................02.Background ...............................................................03.Robust simultaneous modeling and tracking ................................................03.1.Appearance modeling .......................................................03.2.Shape model ...........................................................04.Tests and results .............................................................04.1.Test set ..............................................................04.2.Performance evaluation ......................................................04.3.Results ..............................................................04.3.1.R-SMAT with automatic initialization...........................................04.4.Timings..............................................................05.Conclusions and future work .......................................................0Acknowledgments...............................................................0References ..................................................................1.IntroductionDriver inattention is a major cause of traf fic accidents,and it has been found to be involved in some form in 80%of the crashes and 65%of the near crashes within 3s of the event [3].Monitoring a driver to detect inattention is a complex problem that involves physiological and behavioural elements.Different works have been presented in recentyears,focused mainly in drowsiness,with a broad range of techniques.Physiological measurements such as electro-encephalography (EEG)[4]or electro-oculography (EOG),provide the best data for detection [4].The problem with these techniques is that they are intrusive to the subject.Moreover,medical equipment is always 
expensive.Lateral position of the vehicle inside the lane,steering wheel movements and time-to-line crossing are commonly used,and some commercial systems have been developed [5,6].These techniques are not invasive,and to date they obtain the most reliable results.However,the measurements they use may not re flect behaviours such as the so-called micro-sleeps [7].They also require a trainingImage and Vision Computing 29(2011)209–218⁎Corresponding author.Tel.:+34918856569.E-mail addresses:jnuevo@depeca.uah.es (J.Nuevo),bergasa@depeca.uah.es(L.M.Bergasa),llorca@depeca.uah.es (D.F.Llorca),mocana@depeca.uah.es (M.Ocaña).0262-8856/$–see front matter ©2010Elsevier B.V.All rights reserved.doi:10.1016/j.imavis.2010.11.004Contents lists available at ScienceDirectImage and Vision Computingj o u r na l ho m e p a g e :w w w.e l s ev i e r.c o m /l o c a t e /i m av i speriod for each person,and thus are not applicable to the occasional driver.Drivers in fatigue exhibit changes in the way their eyes perform some actions,like moving or blinking.These actions are known as visual behaviours,and are readily observable in drowsy and distracted drivers.Face pose[8]and gaze direction also contain information and have been used as another element of inattention detection systems [9].Computer vision has been the tool of choice for many researchers to be used to monitor visual behaviours,as it is non-intrusive.Most systems use one or two cameras to track the head and eyes of the subject[10–14].A few companies commercialize systems[15,16]as accessories for installation in vehicles.These systems require user-specific calibration,and some of them use near-IR lighting,which is known to produce eye fatigue.Reliability of these systems is still not high enough for car companies to take on the responsibility of its production and possible liability in case of malfunctioning.Face location and tracking are thefirst processing stages of most computer vision systems for driver monitoring.Some of the most successful systems to date use near-IR active illumination[17–19],to simplify the detection of the eyes thanks to the bright pupil effect. Near-IR illumination is not as useful during the day because sunlight also has a near-IR component.As mentioned above,near-IR can produce eye fatigue and thus limits the amount of time these systems can be used on a person.Given the complexity of the problem,it has been divided in parts and in this work only the problem of face tracking is addressed.This paper presents a new active model with the texture model built incrementally.We use it to characterize and track the face in video sequences.The tracker can operate without active illumination. 
The texture model of the face is created online,and thus specific for each person without requiring a training phase.A new online clustering algorithm is described,and its performance compared with the method proposed in[1].Two shape models,trained online and off-line,are compared.This paper also presents a new video sequence database,recorded in a car moving outdoors and in a simulator.The database is used to assess the performance of the proposed face tracking method in the challenging environment a driver monitoring application would meet.No evaluations of face pose estimation and driver inattention detection are performed.The rest of the paper is structured as follows.Section2presents a few remarkable works in face tracking in the literature that are related to our proposal.Section3describes our approach.Section4describes the video dataset used for performance evaluation,and experimental results.This paper closes with conclusions and future work.2.BackgroundHuman face tracking is a broadfield in computing research[20], and a myriad of techniques have been developed in the last decades.It is of the greatest interest,as vast amounts of information are contained in face features,movements and gestures,which are constantly used for human communication.Systems that work on such data often use face tracking[21,22].Non-rigid object tracking has been a major focus of research in later years,and general purpose template-based trackers have been used to track faces in the literature with success.Several efficient approaches have been presented[23–26].Statistical models have been used for face modeling and tracking. Active Shape Models[27](ASM)are similar to the active contours (snakes),but include constraints from a Point Distribution Model (PDM)[28]computed in advance from a training set.Advances in late years have increased their robustness and precision to remarkable levels(STASM,[2]).Extensions of ASM that include modeling of texture have been presented,of which Active Appearance Models (AAMs)[29]are arguably the best known.Active Appearance Models are global models in the sense that the minimization is performed over all pixels that fall inside the mesh defined by the mean of the PDM.All these models have an offline training phase,which require comprehensive training sets so they can generalize properly to unseen instances of the object.This is time consuming process,and there is still the risk that perfectly valid instances of the object would not be modeled correctly.Several methods that work without a priori models have been presented in the literature.Most of them focus on patch tracking on a video sequence.The classic approach is to use the image patch extracted on thefirst frame of the sequence to search for similar patches on the following frames.Lukas–Kanade method[30]was one of thefirst proposed solutions and it is still widely used.Jepson et al.[31]presented a system with appearance model based on three components:a stable component that is learned over a long period based on wavelets,a2-frame tracker and an outlier rejection process.Yin and Collins[32]build an adaptive view-dependent appearance model on-line.The model is made of patches selected around Harris corners.Model and target patches are matched using correlation,and the change in position, rotation and scale is obtained with the Procrustes algorithm.Another successful line of work in object tracking without a priori training is based on classification instead of modeling.Collins and Liu [33]presented a system based on 
background/foreground discrimi-nation.Avidan[34]presents one of the many systems that use machine learning to classify patches[35,36].Avidan uses weak classifiers trained every frame and AdaBoost to combine them.Pilet et al.[37]train keypoint classifiers using Random Trees that are able to recognize hundreds of keypoints in real-time.Simultaneous Modeling and Tracking(SMAT)[1]is in line with methods like Lucas–Kanade,relaying on matching to track patches. Lukas–Kanade extracts a template at the beginning of the sequence and uses it for tracking,and will fail if the appearance of the patch changes considerably.Matthews et al.[38]proposed an strategic update of the template,which keeps the template from thefirst frame to correct errors that appear in the localization.When the error is too high,the update is blocked.In[39],a solution is proposed withfixed template that adaptively detected and selected the window around the features.SMAT builds a more complex model based on incremental clustering.In this paper we combine concepts from active models with the incremental clustering proposed in SMAT.The texture model is created online,making the model adaptative,while the shape model is learnt offline.The clustering used by SMAT has some limitations,and we propose some modifications to obtain a more robust model and better tracking.We name the approach Robust SMAT for this reason.Evaluation of face tracking methods is performed in most works with images captured indoors.Some authors use freely available image sets,but most of them test on internal datasets created by them,which limits the validity of a comparison with other systems.Only a few authors[40,41]have used images recorded in a vehicle,but the number of samples is limited.To the best of our knowledge,there is no publicly available video dataset of people driving,either in a simulator or in a real road.We propose a new dataset that covers such scenarios.3.Robust simultaneous modeling and trackingThis section describes the Simultaneous Modeling and Tracking (SMAT)of Dowson and Bowden[1],and some modifications we propose to improve its performance.SMAT tries to build a model of appearance of features and how their positions are related(the structure model,or shape),from samples of texture and shape obtained in previous frames.The models of appearance and shape are independent.Fitting is performed in the same fashion of ASM:the features arefirst found separately using correlation,and then theirfinal positions are constrained by the shape model.If thefinal positions are found to be reliable and not caused byfitting errors,the appearance model is updated,otherwise it is left unchanged.Fig.1shows aflow chart of the algorithm.210J.Nuevo et al./Image and Vision Computing29(2011)209–2183.1.Appearance modelingEach one of the possible appearances of an object,or a feature of it,can be considered as a point in a feature space.Similar appearances will be close in this space,away from other points representing dissimilar appearances of the object.These groups of points,or clusters,form a mixture model that can be used to de fine the appearance of the object.SMAT builds a library of exemplars obtained from previous frames,image patches in this case.Dowson and Bowden de fined a series of clusters by their median patch,also known as representative ,and their variance.A new incoming patch is made part of the cluster if the distance between it and the median of the cluster is below a threshold that is a function of the variance.The median and variance of a 
cluster are recalculated every time a patch is added to it.Up to M exemplars per cluster are kept.If the size limit is reached,the most distant element from the representative is removed.Every time a cluster is updated,the weight of the clusters is recalculated as in Eq.(1):w t +1ðÞk=w t ðÞk +α11+αifk =k u w t ðÞk11+αotherwise8>><>>:ð1Þwhere α∈[0,1)is the learning rate,and k u is the index of the updatedcluster.The number of clusters is also limited to K .If K is reached,the cluster with the lowest weight is discarded.In a later work,Dowson et al.[42],introduced a different condition for membership,that compares the probability of the exemplar belonging to foreground (a cluster)or to the background p fg j d x ;μn ðÞ;σfg n p bg j d x ;μn ðÞ;σbg nð2Þwhere σfg n is obtained from the distances between the representative and the other exemplars in the cluster,and σbg n is obtained from the distances between the representative and the exemplars in the cluster offset by 1pixel.We have found that this clustering method can be improved in several ways.The adapting nature of the clusters could theoretically lead two or more clusters to overlap.However,in our tests we have observed that the opposite is much more frequent:the representative of thecluster rarely changes after the cluster has reached a certain number of elements.Outliers can be introduced in the model in the event of an occlusion of the face by a hand or other elements like a scarf.In most cases,these exemplars would be far away from the representative in the cluster.To remove them and reduce memory footprint,SMAT keeps up to M exemplars per cluster.If the size limit is reached,the most distant element from the representative is removed.When very similar patches are constantly introduced,one of them will be finally chosen as the median,and the variance will decrease,over fitting the cluster and discarding valuable exemplars.At a frame rate of 30fps,with M set to 50,the cluster will over fit in less than 2s.This would happen even if the exemplar to be removed is chosen randomly.This procedure will discard valuable information and future,subtle changes to the feature will lead to the creation of another cluster.We propose an alternative clustering method,named leaderP ,to partially solve these and other problems.The method is a modi fication of the leader algorithm [43,44],arguably the simplest and most frequently used incremental clustering method.In leader ,each cluster C i is de fined by only one exemplar,and a fixed membership threshold T .It starts by making the first exemplar the representative of a cluster.If an incoming exemplar ful fills being within the threshold T it is marked as member of that cluster,otherwise it becomes a cluster on its own.The pseudocode is shown in Algorithm 1.Algorithm 1.Leader clustering.1:Let C ={C 1,…,C n }be a set of n clusters,withweights {w 1t ,…,w n t}2:procedure leader (E ,C )cluster patch E 3:for all C i ∈C do 4:if d (C k ,E )b T thenCheck if patch E ∈C k 5:UpdateWeights (w 1t ,…,w n t)As in Eq.(1)6:return 7:End If 8:End For 9:Create new cluster C n +1,with E asrepresentative.10:Set w n +1t +1←0Weight of new cluster C n +111:C ←C ∪C n +1Add new cluster to the model12:if n +1N K thenRemove the cluster with lowest weight13:Find C k |w k ≤w i i =1,…,n 14:C ←C ∖C k 15:end if16:end procedureFig.1.SMAT block diagram.211J.Nuevo et al./Image and Vision Computing 29(2011)209–218On the other hand,leaderP keeps the first few exemplars added to the cluster are kept,up to P .The median of the cluster is 
chosen as the representative,as in the original clustering of Dowson and Bowden.When the number of exemplars in the cluster reaches P ,all exemplars but the representative are discarded,and it starts to work under the leader algorithm.P is chosen as a small number (we use P =10).The membership threshold is however flexible:the distances between the representative and each of the exemplars that are found to be members of the cluster is saved,and the variance of those distances is used to calculate the threshold.Because the representative is fixed and distance is a scalar,many values can be kept in memory without having a impact on the overall performance.Keeping more values reduces the risk of over fitting.The original proposal of SMAT used Mutual Information (MI)as a distance measure to compare the image patches,and found it to perform better that Sum of Squared Differences (SSD),and slightly better than correlation in some tests.Any de finition of distance could be used.We have also tested Zero-mean Normalized Cross-Correlation (ZNCC).Several types of warping were tested in [42]:translation,Euclidean,similarity and af fine.The results showed an increasing failure rate as the degrees of freedom of the warps increased.Based on this,we have chosen to use the simplest,and the patches are only translated depending on the point distribution model.3.2.Shape modelIn the original SMAT of Dowson and Bowden,the shape was also learned on-line.The same clustering algorithm was used,but themembership of a new shape to a cluster was calculated using Mahalanobis distance.Our method relies on the pre-learned shape model.The restric-tions on using a pre-learned model for shape are less than those for an appearance model,as it is of lower dimensionality and the deforma-tions are easier to model.It has been shown [45]that location and tracking errors are mainly due to appearance,and that a generic shape model for faces is easier to construct.We use the method of classic ASM [27],which applies PCA to a set of samples created by hand and extracts the mean s 0and an orthogonal vector basis (s 1,…,s N ).The shapes are first normalized and aligned using Generalized Procrustes Analysis [46].Let s =(x 0,y 0,…,x n −1,y n −1)be a shape.A shape can be generated from this base as s =s 0+∑mi =1p i ·s i :ð3ÞUsing L 2norm,the coef ficients p =p 1;…;p N ðÞcan be obtained for a given shape s as a projection of s on the vector basis p =S Ts −s 0ðÞ;p i =s −s 0ðÞ·s i ð4Þwhere S is a matrix with the eigenvectors s i as rows.The estimation of p with Eq.(4)is very sensitive to the presence of outlier points:a high error value from one point will severely in fluence the values of p .We use M-estimators [47]to solve this problem.This technique hasbeenFig.2.Trayectory of the vehicle (map from ).Fig.3.Samples of outdoor videos.212J.Nuevo et al./Image and Vision Computing 29(2011)209–218applied to ASM and AAM in previous works [48,49],so it is only brie fly presented here.Let s be a shape,obtained by fitting each feature independently.The function to minimize is arg min p∑2ni =1ρr i;θð5Þwhere ρ:R ×R þ→R þis an M-estimator,and θis obtained from the standard deviation of the residues [50].r i is the residue for coordinate i of the shaper i=x i−s i o+∑mj =1p j s i j!!ð6Þwhere x i are the points of the shape s ,and s i j is the i th element of the vector s j .Minimizing function 5is a case of re-weighted least squared.The weight decreases more rapidly than the square of the residue,and thus a point with error tending to in finite will 
The weight decreases more rapidly than the square of the residue, and thus a point with an error tending to infinity will have zero weight in the estimation. Several robust estimators have been tested: the Huber, Cauchy, Gaussian and Tukey functions [50]. A study was made in [19] that resulted in similar performance for all of them in a scenario similar to that of this paper, and the Huber function was chosen. The Huber function performs correctly with up to 50% of the points being outliers. We use the 20-point distribution of the BioID database [51]. Data from this database was used to train the model. This distribution places the points in some of the most salient locations of the face, and has been used in several other works [40].

4. Tests and results
This section presents the video sequences used to test different tracking algorithms in a driving scenario. The dataset contains most actions that appear in everyday driving situations. A comparison between our approach and SMAT is presented. Additionally, we compare R-SMAT results with the recently introduced Stacked Trimmed ASM (STASM).

4.1. Test set
Driving scenarios present a series of challenges for a face tracking algorithm. Drivers move constantly, rotate their head (self-occluding part of the face) or occlude their face with their hands (or other elements such as glasses). If other people are in the car, talking and gesturing are common. There are also constant background changes and, more importantly, frequent illumination changes, produced by shadows of trees or buildings, street lights, other vehicles, etc. A considerable amount of test data is needed to properly evaluate the performance of a system under all these situations.

A new video dataset has been created, with sequences of subjects driving outdoors and in a simulator. The RobeSafe Driver Monitoring Video (RS-DMV) dataset contains 10 sequences, 7 recorded outdoors (Type A) and 3 in a simulator (Type B). Outdoor sequences were recorded on RobeSafe's vehicle moving around the campus of the University of Alcala. Drivers were fully awake, talked frequently with other passengers in the vehicle and were asked to look regularly at the rear-view mirrors and operate the car sound system. The cameras are placed over the dashboard, to avoid occlusions caused by the wheel. All subjects drove the same streets, shown in Fig. 2. The length of the track is around 1.1 km. The weather conditions during the recordings were mostly sunny, which made noticeable shadows appear on the face. Fig. 3 shows a few samples from these video sequences.

Type B sequences were recorded in a realistic truck simulator. Drivers were fully awake, and were presented with a demanding driving environment where many other vehicles were present and potentially dangerous situations took place. These situations increase the probability of small periods of distraction leading to crashes or near-crashes. The sequences try to capture both distracted behaviour and the reaction to dangerous driving situations. A few images from Type B sequences can be seen in Fig. 4. The recording took place in a low-light scenario that approached nighttime conditions. This forced the camera to increase exposure time to a maximum, which led to motion blur being present during head movements. Low-power near-IR illumination was used in some of the sequences to increase the available light.

Fig. 4. Samples of sequences in simulator.
Fig. 5. Performance of different shape models, with leaderP clustering (cumulative error distribution vs. m_e17, panels (a) and (b)).
The outdoor sequences are around 2 min long, and the sequences in the simulator are close to 10 min in length. The algorithms in this paper were tested on images of approximately 320×240 pixels, but high-resolution images were acquired so they can be used in other research projects. The images are 960×480 pixels for the outdoor sequences and 1392×480 for the simulator sequences, and are stored without compression. The frame rate is 30 frames per second in both cases. The camera has a 2/3″ sensor and used 9 mm standard lenses. Images are grayscale. The recording software controlled the camera gain using the values of the pixels that fell directly on the face of the driver. The RS-DMV is publicly available, free of charge, for research purposes. Samples and information on how to obtain the database are available at the authors' webpage.1

4.2. Performance evaluation
Performance of the algorithms is evaluated as the error between the estimated position of the features and their actual position, as given by a human operator. Hand-marking is a time-consuming task, and thus not all frames in all videos have been marked. Approximately 1 in 30 frames (1 per second) has been marked in the sequences in RS-DMV. We call these frames keyframes. We used the metric m_e, introduced by Cristinacce and Cootes [40]. Let x_i be the points of the ground-truth shape s, and let x̂_i be the points of the estimated shape ŝ. Then,

m_e = (1 / (n s)) Σ_{i=1}^{n} d_i ,   d_i = sqrt( (x_i − x̂_i)^T (x_i − x̂_i) )        (7)

where n is the number of points and s is the inter-ocular distance. We also discard the point on the chin and those on the exterior of the eyes, because their location changes much from person to person. Moreover, the variance of their position when marked by human operators is greater than for the other points. Because only 17 points are used, we denote the metric m_e17. In the event of a tracking loss, or if the face cannot be found, the value of m_e17 for that frame is set to ∞. During head turns, the inter-eye distance reduces with the cosine of the angle. In these frames, s is not valid and is calculated from its value in previous frames. Hand-marked points and the software used to ease the marking process are distributed with the RS-DMV dataset.

4.3. Results
We tested the performance of the R-SMAT approach on the RS-DMV dataset, as well as that of SMAT. We compared these results with those obtained by STASM, using the implementation in [2]. One of the most remarkable problems of (R-)SMAT is that it needs to be properly initialized, and the first frames of the sequence are key to building a good model. We propose STASM to initialize (R-)SMAT in the first frame. STASM has been shown to be very accurate when the face is frontal. Nonetheless, a slightly incorrect initialization will make (R-)SMAT track the (slightly) erroneous points. To decouple this error from the evaluation of the accuracy of (R-)SMAT in the tests, the shape was initialized in the first frame with positions from the ground-truth data. At the end of this section, the performance of R-SMAT with automatic initialization is evaluated. First, a comparison of the shape models is presented. With the best shape model, the original clustering algorithm and the proposed alternative are evaluated. Results are presented for outdoor and simulator sequences separately, as each has specific characteristics of its own.

1 /personal/jnuevo.
Fig. 6. m_e17 error for a sequence (m_e17 vs. frame number).
Fig. 7. Samples of type A sequence #1. Outlier points are drawn in red.
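The m_e17 metric of Eq. (7) is simply a mean point-to-point distance normalized by the inter-ocular distance, restricted to 17 points. A short sketch of how one keyframe would be scored; the array shapes and the loss flag are illustrative conventions, not taken from the authors' code:

import numpy as np

def me17(ground_truth, estimate, interocular_distance):
    # Eq. (7): mean point-to-point error normalized by the inter-ocular distance.
    # ground_truth and estimate are (17, 2) arrays of hand-marked and estimated points.
    dists = np.sqrt(np.sum((ground_truth - estimate) ** 2, axis=1))   # d_i
    return float(np.mean(dists) / interocular_distance)

def me17_for_frame(ground_truth, estimate, interocular_distance, lost=False):
    # A frame where the face was lost is scored as infinity, as in the paper.
    return np.inf if lost else me17(ground_truth, estimate, interocular_distance)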
The incremental shape model of SMAT was found to produce much higher error than the pre-learned model. Fig. 5 shows the cumulative error distribution of the incremental shape model (on-line) and of the robust pre-learned model using the Huber function (robust). For comparison purposes, the figure also shows the performance of the pre-learned shape model fitted using an L2 norm (non-robust). All models use leaderP clustering and patches of 15×15 pixels. Clear improvements in performance are made by the change to a pre-learned model with robust fitting. The robust, pre-learned shape model is very important in the first frames, because it gives the model greater certainty that the patches being included correspond to correct positions. The robust shape model is used in the rest of the experiments in this paper. Fig. 6 shows the plot of the m_e17 distance of both models in a sequence. A clear example of the benefits of the robust model is depicted in Fig. 7. The online model diverges as soon as a few points are occluded by the hand, while the robust model keeps track of the face. The method is also able to keep track of the face during head rotations, although with increased fitting error. This is quite remarkable for a model that has only been trained with fully frontal faces.

Fig. 8 shows the performance of the original SMAT clustering compared with the proposed leaderP clustering algorithm, as implemented in R-SMAT. R-SMAT presents much better performance than the original SMAT clustering. This is especially clear in Fig. 8(b). We stated in Section 3.1 that the original clustering method could lead to overfitting, and type B sequences are especially prone to this: patches are usually dark and do not change much from frame to frame, and the subject does not move frequently. When a movement takes place, it leads to high error values, because the model has problems finding the features.

Fig. 8. Comparison of the performance of clustering algorithms (cumulative error distribution vs. m_e17, panels (a) and (b)).

Table 1. Track losses for different clustering methods.
                  Mean      Maximum             Minimum
R-SMAT   Type A   0.99%     2.08% (seq. #7)     0% (seq. #1, #2, #6)
         Type B   0.71%     1.96% (seq. #9)     0% (seq. #10)
SMAT     Type A   1.77%     5.03% (seq. #4)     0% (seq. #1, #2, #5)
         Type B   1.03%     2.45% (seq. #9)     0% (seq. #10)

Fig. 9. Comparison of the performance of STASM and SMAT (cumulative error distribution vs. m_e17, panels (a) and (b)).
The principle of the tracking-by-segmentation method
"Tracking by Segmentation" is a method used for object tracking in computer vision. Its principle is to segment the target out of each video frame and then to follow the motion of that target across consecutive frames. The basic steps of the method are the following:

Target segmentation: first, the image region containing the target object is segmented out of the video frame. This usually requires an image segmentation algorithm, for example background subtraction, thresholding, edge detection or semantic segmentation. The purpose of the segmentation is to separate the target from the background so that it can be tracked further.

Feature extraction: once the target has been segmented successfully, features are extracted from the target region to describe the target's appearance and shape. These features may include color histograms, texture features, shape descriptors and so on. They are used to match the target in subsequent frames.

Motion estimation: in the following video frames, the target's motion is estimated by comparing the target features in the current frame with the features from previous frames. This can be done in different ways, for example with optical flow estimation or appearance-model matching. Through motion estimation, the system can predict the target's position in the next frame.

Target matching and tracking: using the target's features and motion information, the target is matched and tracked from frame to frame. Target matching is a key step: it determines the target's position in the new frame and keeps the track continuous. Matching can be implemented with a variety of methods, including correlation filters, Kalman filters or particle filters.

Updating the target model: over time, the appearance of the target may change, for example because of illumination changes, occlusion or the target's own motion. The target model therefore needs to be updated regularly to keep the tracking accurate. This may involve online learning or model-adaptation techniques.

Termination: tracking can end when certain termination conditions are met, for example when the target is no longer visible, tracking fails, or the user stops the tracker. On termination, the system may output the tracking result or a summary of the target's trajectory.

The advantage of the tracking-by-segmentation method is that it can handle tracking of targets against complex backgrounds and is relatively robust to changes in the target's appearance and shape. It nevertheless faces challenges: problems such as occlusion, illumination changes and changes in the target's shape can cause tracking to fail.
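A minimal sketch of this pipeline, assuming OpenCV is available: the segmentation step uses a stock MOG2 background subtractor, the appearance model is a hue histogram, and matching uses back-projection with mean shift. These are stand-ins for whichever segmentation and matching techniques a real system would use; the file name and all parameter values are placeholders.

import cv2
import numpy as np

def track_by_segmentation(video_path="video.avi"):
    cap = cv2.VideoCapture(video_path)
    backsub = cv2.createBackgroundSubtractorMOG2(history=200, detectShadows=False)
    kernel = np.ones((5, 5), np.uint8)
    hist, window = None, None
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    while True:
        ok, frame = cap.read()
        if not ok:
            break                                   # termination: no more frames
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

        if hist is None:
            # 1) Segmentation: foreground mask from background subtraction,
            #    cleaned with a morphological opening.
            mask = backsub.apply(frame)
            mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
            contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                        cv2.CHAIN_APPROX_SIMPLE)[-2]
            if not contours:
                continue                            # nothing segmented yet
            target = max(contours, key=cv2.contourArea)
            x, y, w, h = cv2.boundingRect(target)
            window = (x, y, w, h)
            # 2) Feature extraction: hue histogram of the segmented region.
            roi_hsv = hsv[y:y + h, x:x + w]
            roi_mask = mask[y:y + h, x:x + w]
            hist = cv2.calcHist([roi_hsv], [0], roi_mask, [16], [0, 180])
            cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
        else:
            # 3)+4) Motion estimation and matching: back-project the model
            #    histogram and shift the search window onto the target.
            backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
            _, window = cv2.meanShift(backproj, window, term_crit)
            x, y, w, h = window
            # 5) Model update: blend in the histogram of the new window.
            roi_hsv = hsv[y:y + h, x:x + w]
            new_hist = cv2.calcHist([roi_hsv], [0], None, [16], [0, 180])
            cv2.normalize(new_hist, new_hist, 0, 255, cv2.NORM_MINMAX)
            hist = (0.9 * hist + 0.1 * new_hist).astype(np.float32)
        print("target window:", window)
    cap.release()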
The new generation in signal analysisReal-Time Spectrum AnalyzerMonitoring ReceiverRF Direction Finding andLocalization SystemMore and more devices have to share the available frequency spectrum as aresult of new technologies such as the Internet of things (IoT), machine tomachine (M2M) or car to car (C2C) communications, and the rapidly growing4G/5G mobile networks.It doesn’t matter whether you are making a wideband measurement of entirefrequency ranges, or searching for hidden signals, or needing to reliablydetect very short impulses, or localizing interference signals –SignalSharkgives you all the measurement solutions you need to cope with the increasinglycomplex radio frequency spectrum. Its design and excellent performance makeit ideal for on-site measurements as well as for fully-fledged laboratory use. SignalShark. Seven senses for signalsSignalShark –there’s a reason for the name. Just like its namesake, theSignalShark is an extremely efficient hunter, perfectly designed for its task.Its prey: interference signals. Its success rate: Exceptional. The real-timeanalyzer is a successful hunter, thanks to the interplay of its highly developedseven sensory functions. Seven senses that don’t miss a thing, and that makeit easy for you to identify and track down interferers in real-time./watch?v=pSZdR27j5LQ&t=14s• Frequency range: 8 kHz to 8 GHz• Weight: Approx. 4.1 kg / 9 lbs (with one battery)• Dimensions: 230 × 335 × 85 mm (9.06ʺ× 13.19ʺ× 3.35ʺ)Make it your deviceSignalShark is ready for the future, thanks toits many expansion facilities, and it can beoptimally adapted as needed to the widestvariety of applications.SignalShark – the 40 MHz real-timespectrum analyzerWhether you are in the lab or out in thefield, you will have the right analysis toolin hand with the SignalShark. You will beconvinced by its truly outstanding RF perfor-mance, as well as by its easily understood,application-oriented operating concept.The high real-time bandwidth with very highFFT overlapping ensures that you can reliablycapture even extremely brief and infrequentevents. The unusually fast scan rate results invery short measurement times even if youneed to cover wider frequency bands thanthe real-time bandwidth. Comprehensiveevaluation tools make sure that you canperform current and future measurementand analysis tasks up to laboratory instru-ment standards reliably, simply, and faster.SignalShark – the monitoring receiverThe extremely High Dynamic Range (HDR) ofthe SignalShark ensures that you can reliablydetect even the weakest signals in the pre-sence of very strong signals, and not confusethem with the artifacts of a normal receiver.This is a basic requirement for most tasksin the field of radio monitoring. Alongsidethe real-time spectrum analyzer, there is areceiver for audio demodulation, level mea-surement, and modulation analysis, whichcan be tuned to any frequency and channelbandwidth within the 40 MHz real-timebandwidth. And, if you need even more thanthe analysis tools of the SignalShark, you canprocess the I/Q data from the receiver exter-nally as a real-time stream and store themon internal or external data storage media.SignalShark – the direction findingand localization systemIt is often necessary to locate the positionof a signal transmitter once the signals havebeen detected and analyzed. SignalSharksupports the new Automatic Direction-Finding Antennas (ADFA) from Narda,allowing you to localize the source veryquickly and reliably. 
In fact, localization ischild’s play, thanks to the integrated mapsand localization firmware. Conveniently,homing-in using an ADFA mounted on amoving vehicle is also supported. Powerful,state of the art algorithms minimize theeffects of false bearings caused by reflectionsoff urban surroundings in real-time. Extre-mely light weight and easy to use manualdirection finding antennas are availablefor ”last mile“ localization.V I D E OVideo display port for external monitor or projector USB 2.0 for keyboard, mouse, printer, etc.fast, convenient measurementBuilt-in loudspeaker gives clear,loud sound reproduction, even in noisy environments/watch?v=0jqrwU_jPcsV I D E OSignalShark is a handy, portable, battery powered measuring device, yet it boasts performance that is otherwise only found in large, heavy laboratory grade equipment. It can be readily used instead of such expensive equipment because of its wide range of connection facilities and measurement functions.SignalShark –the real-time spectrum analyzer• HDR: extremely low noise and distortion, simultaneously • real-time bandwidth: 40 MHz – FFT overlap: 75 % (Fspan > 20 MHz)– FFT overlap: 87.5 % (Fspan ≤20 MHz, RBW ≤400 kHz))– FFT size: up to 16,384• Minimum signal duration for 100 % POI: 3.125 µs at full amplitude accuracy • Minimum detectable signal duration: < 5 ns • Persistence: up to 1.6 million spectrums per second • Spectrogram time resolution: down to 31.25 µs • Spectrogram detectors: up to three at the same time • RBW: 1 Hz - 800 kHz in real-time spectrum mode, 1 Hz - 6.25 MHz in scan spectrum mode• Filters conforming to CISPR and MIL for EMC measurements • Scan speed: Scan rate up to 50 GHz/s • Detectors: +Pk, RMS, Avg, -Pk, Sample• Markers: 8, additional noise power density and channel power function •Peak table: shows up to 50 highest spectral peaksReliable detection of extremely short and rare events in a 40 MHz real-time bandwidthA real-time analyzer calculates the spectrum by applying the FFT on overlapping time segments of the underlying I/Q data within its real-time bandwidth. The real-time band-width is only one of the key parameters for a real-time analyzer. The probability of inter-cept, POI, is easily just as important. This parameter describes the minimum time that the signal must be present for it to be always detected without any reduction in level. This time is affected by the maximum resolution bandwidth RBW and the FFT overlap. The SignalShark is a match for established laboratory analyzers with its minimum duration of 3.125 µsec for 100 %POI and full amplitude accuracy. The mini-mum detectable signal duration is < 5 nsec.SignalShark accomplishes this by a large signal immunity in combination with a very low intrinsic noise as well as a high FFT overlap and its large resolution bandwidth.That is outstanding for a hand-held analyzer. To accomplish this, SignalShark generally operates with an 87.5 % overlap, which is again outstanding for a hand-held analyzer.This means that even the shortest impulses are detected and the full signal to noise ratio is maintained for longer signals.Spectrogram shows more details than everWith SignalShark, you can use up to three detectors at the same time for the Spectrogram view. This makes it possible for you to easily visualize impulse inter-ference on broadcast signals and get much more information from the spectrogram. 
The extraordinarily fine time resolution of 31.25 µs means that you can completely reveal the time signatures of many signals.With the I/Q Analyzer option, you can resolve the spectrogram even more, to less than 200 ns.Persistence ViewA color display of the spectrum shows how often the displayed levels have occurred. This enables you to detect signals that would be masked by stronger signals in a normal spectrum view.=SignalShark is not just a very powerful real-time spectrum analyzer. It is also the ideal monitoring receiver, thanks to its near ITU-ideal spectrum monitoring dynamic capabilities, second receiver path and demodulators.SignalShark –the monitoring receiver• HDR: extremely low noise and distortion, simultaneously • CBW: 25 Hz - 40 MHz (Parks-McClellan, α= 0.16)• Filters for EMC measurements: CISPR, MIL • Detectors: +Pk, RMS, Avg, -Pk, Sample• EMC detectors: CPk, CRMS, CAvg (compliant with CISPR)• Level units: dBm, dB µV, dB(µV/m) …• Level uncertainty: < ±2dB • AFC• Audio demodulators: CW, AM, Pulse, FM, PM, LSB, USB, ISB, I/Q • AGC & squelch for audio demodulators • Modulation measurements: AM, FM, PM • I/Q streaming: Vita 49 (sample rate ≤25,6 MHz)• Remote control protocol: SCPIThe benefit of HDRThe extremely high dynamic range (HDR) of the SignalShark ensures that you can reliably detect even the weakest signals in the presence of very strong signals. The SignalShark’s pre-selector allows it to suppress frequencies that would other-wise interfere with the measurement. The excellent dynamic range of the SignalShark is the result of the ideal combination of the displayed averaged noise level (DANL)with the so-called large-signal immunity parameters, i.e. the second and third order intermodulation intercept points (IP 2and IP 3).It is important that these three factors are always specified for the same device setting (e.g. no attenuation, no pre-amplifier), as they vary considerably according to the setting.DDC 2, the additional receiver pathThe tuning frequency and the channel band-width of an additional receiver path, DDC 2,can be set independently from the real-time spectrum analyzer path, DDC 1, within the real-time bandwidth of the SignalShark. The I/Q data can be streamed to external devices in real-time, or they can be processed by the SignalShark itself for level measurements,audio demodulation, and modulation measurements. The very steep cutoffchannel filters capture 100 % of the signal in the selected channel without any degra-dation while completely suppressing the adjacent channels.CISPR compliant EMC detectors now also available for on-site applications The facility for selecting all the filters and detectors necessary for CISPR or MIL com-pliant EMC measurements is also available for the receiver as well as for the spectrum. If an interferer is detected, you can now decide on the spot whether or not the device needs to be taken out of service because of violating EMC regulations.EQDDC 1Overlap BufferFFT DetectorsPersist.Persistence StreamSpectrum StreamADC DataDDC 2DetectorsDetectorsI/Q BufferTrigger UnitDemodulatorsAGCLevel StreamDem. Det.StreamDem. Audio StreamAM & FM StreamI/Q StreamI 2+Q2I 2+Q2PATH 1PATH 2The block circuit diagram shows the two, independent digital down converters (DDC). These make it possible e.g. 
to observe the spectrum of the signal spectrum and demodulate it at the same time independently within the real-time bandwidth.Automatic Direction Finding Antenna ADFA 1 + 2Narda offers a large number of automatic and directional antennas for the SignalShark. Their unique characteristics combined with the SignalShark makes them unbeatable.Automatic Direction Finding Antenna ADFA 1The frequency range of ADFA 1 makes it particularly suitable for localizing interferers,e.g. in mobile communications networks:Frequency range: 200 MHz - 2.7 GHz Nine dipoles arranged on a 380 mm diameter circle for DFA central monopole is used as a reference element for DF or as an omnidirectional monitoring antennaBuilt-in phase shifter and switch matrix Direction finding method: correlative interferometerBearing uncertainty: 1° RMS (typ.)Built-in electronic compassBuilt-in GNSS receiver with antenna and PPS outputDiameter: 480 mmAutomatic Direction Finding Antenna ADFA 2 (available 2019)This ADFA is suitable for a wide range of localization tasks due to its wide frequency range:Frequency range: (500 kHz) 10 MHz -8 GHz Two crossed coils for DF at low frequencies Nine dipoles arranged on a 380 mm dia-meter circle for DF at medium frequencies Nine monopoles arranged on a 125 mm diameter circle for DF at high frequencies A central monopole is used as a reference element for DF or as an omnidirectional monitoring antennaBuilt-in phase shifter and switch matrix Direction finding method: Watson-Watt or correlative interferometerBearing uncertainty (10 MHz - 200 MHz): 2° RMS (typ.)Bearing uncertainty (200 MHz - 8 GHz): 1° RMS (typ.)Built-in electronic compassBuilt-in GNSS receiver with antenna and PPS output Diameter: 480 mm Automatic Direction Finding Antenna ADFA accessoriesConnecting cable, length 5 m or 15 m,low lossTripod including mounting accessories Mounting kit for magnetic attachment to a vehicle roofMounting kit for mast attachmentAfter you have localized the signal by SignalShark and ADFA using the car, you will need for last mile or to enter a building Narda’s handy, feather-light directional antennas and active antenna handle. They are the ideal choice in this situation. The antenna handle does more than just hold the antenna. Among other features, it has a built-in operating button that allows you to perform the main steps during manual direction finding, making the combination unbeatable.and take bearings on very weak or distant signals. 
The preamplifier gain is taken into account automatically when you make field strength or level measurements.The integrated operating button lets you make the main steps in the manual direction finding process.The following antennas to fit the antenna handle are available:• Loop Antenna: 9 kHz - 30 MHz• Directional Antenna 1: 20 MHz - 250 MHz • Directional Antenna 2: 200 MHz - 500 MHz • Directional Antenna 3: 400 MHz - 8 GHz A plug-in adapter with male N connector allows you to take advantage of the features of the handle even when you are using third-party antennas or external filters.Directional antenna 3400 MHz - 8 GHz350 g / 0.77 lbsDirectional antenna 1 20 MHz - 250 MHz 400 g / 0.88 lbs Loop antenna 9 kHz - 30 MHz 380 g / 0.84 lbs Directional antenna 2 200 MHz - 500 MHz 300 g / 0.66 lbs Active antenna handle with integrated compass and preamplifier 9 kHz - 8 GHz 470 g / 1.04 lbsAdapter,male N connectorN Antenna Elements0°90°180°270°Element SwitchReference Elementn1Quadrature Phase Shifter(Smart Antenna)+The Narda antenna handle and directional antennas are extremely light, making for fatigue-free signal searches.The convenient plug-in system allows you to change antennas very quickly.SignalShark recognizes the antenna and applies the appropriate antenna factors for field strength measurements automatically.SignalShark receives the azimuth,elevation and polarization of the antenna from the 3D electronic compass built into the handle, so manual direction finding could hardly be simpler.The preamplifier built into the handle is activated and deactivated bySignalShark, so you can further reduce SignalShark’s low noise figure to detectYou will often need to locate the position of a signal transmitter once thesignals have been detected or analyzed. SignalShark combined with Narda’snew automatic direction finding antennas (ADFA) and the very powerfulmap and localization firmware provides reliable bearings in the twinklingof an eye. The bearing results are processed by the SignalShark withoutneeding an external PC. Reliable localization of transmitters has not beenpossible before with so few hardware components.Transmitter localizationSignalShark simplifies transmitter localizationby autonomously evaluating all the availablebearing results and plotting them on a map,using a statistical distribution of bearinglines. The result is a so-called “heat map”,on which the possible location of the trans-mitter is plotted and color-coded accordingto probability. SignalShark also draws anellipse on the map centered on the estima-ted position of the transmitter and indicatingthe area where the transmitter has a 95 %probability of being located. The algorithmused by SignalShark to calculate the positionof an emitter is extremely powerful. It candetermine the position of the emitter bycontinuous direction finding when movingaround in a vehicle, even in a complexenvironment such as an inner-city area.The calculation is continuous inreal-time, so you can viewthe changing heat mapon the screen of theSignalShark andFast automatic direction findingSignalShark supports the new automaticdirection finding antennas (ADFA) fromNarda, which let you take a completebearing cycle in as little as 1.2 ms.The omnidirectional channel power and thespectrum are also measured during a bearingcycle, so you can monitor changes in thesignal level or spectrum concurrently withthe bearings. The AFDAs use differentantenna arrays, depending on the frequencyrange. 
At low frequencies, a pair of crossedcoils are used for the Watson-Watt methodof direction finding. At medium and highfrequencies, a circular array of nine dipolesor monopoles is used for the correlativeinterferometer direction finding method.SignalShark –The RF direction finding and localization system• Frequency range ADFA 1: 200 MHz - 2.7 GHz• Frequency range ADFA 2: 10 MHz - 8 GHz• Azimuth and elevation bearings• DF quality index• Complete bearing cycle: down to 1.2 ms• Omnidirectional level and spectrum during DF process• Uses OpenStreetMaps, other map formats can be imported• Easy to use, powerful map and localization software• The map and localization software runs on the handheldunit itselfThe SignalShark is a very powerful platform that Narda is continuously expanding. Options that will be available for delivery in 2019 are described below. Only the firmware of the SignalShark will be used to realize these options, which will be capable of on-site activation.High time resolution spectrogram HTRSalso available in the spectrum pathIn real-time spectrum mode, the ring buffer ofthe SignalShark records the I/Q data from thereal-time spectrum path rather than from thereceiver I/Q data. If you or a trigger eventhalts the real-time analyzer, the last up to200 million I/Q samples of the monitoredfrequency range are available. This correspondsto a timespan of at least 4 s, so you can zoomin on the spectrogram with a resolution ofbetter than 200 ns when the analyzer is halted.The FFT overlap can be up to 93.75 %, and nodetectors are needed that could reduce thetime resolution. You can even subsequentlyalter the RBW. The persistence view also adjustsso that it exactly summarizes the spectrumsin the time period covered by the zoomedsegment. This ensures that all the time orspectral details in the I/Q data can be madevisible. You can of course also save the I/Qdata of the zoomed segment.DF SpectrumThe SignalShark can find the directions ofseveral transmitters simultaneously in DFspectrum evaluation mode. This mode offersa persistence spectrum and a spectrogramof the azimuth in addition to the usual levelspectrum and spectrogram view. You canalso monitor frequency ranges that arewider than the real-time bandwidth of theSignalShark. You can distinguish betweendifferent transmitters much more easilythan before by means of DF spectrum mode,because the SignalShark shows you thedirection of incidence as well as the levelof each frequency bin.SignalShark I/Q analyzerSignalShark has a ring buffer for up to 200 million I/Q samples. The receiver I/Q data are normally written continuouslyto the ring buffer. The recording can be stopped by a trigger event. The recorded I/Q data are then transferred to the CPU of the SignalShark, where they are further processed.The following trigger sources are available: Frequency mask triggerReceiver levelExternal trigger sourceTimestampUser inputFree runThe following I/Q data views are available: I and Q versus timeMagnitude versus time (Zero-span) Vector diagramHigh time resolution spectrogram Persistence You can of course also save the I/Q data as adata set, and you can even stream the datadirectly to permanent storage media in orderto make very long recordings of the I/Q data.You can then replay such long-term recor-dings using the integrated I/Q analyzer, orprocess them externally.2 x 10 MHz LTE signal recorded in a HTRS. Time resolution1 µs. 
The extremely high time resolution renders the signaltransparent at low traffic levels (right), so you can spotpossible interference within the frame structure.More Information about technical details andaccessories like transport case and car chargerunit can be found in the SignalShark data sheet./en/signalsharkNarda is a leading supplier …N S T S 06/18 E 0333A T e c h n i c a l a d v an c e s , e r r o r s a n d o m i s s i o n s e x c l u d e d .© N a r d a S a f e t y T e s t S o l u t i o n s 2014. ® T h e n a m e a n d l o g o a r e t h e r e g i s t e r e d t r a d e m a r k s o f N a r d a S a f e t y T e s t S o l u t i o n s G m b H a n d L 3 C o m m u n i c a t i o n s H o l d i n g s , I n c .—T r a d e n a m e s a r e t h e t r a d e m a r k s o f t h e i r o w n e r s .r o e n e r -d e s i g n .d eNarda Safety Test Solutions 435 Moreland RoadHauppauge, NY11788, USA Phone +1 631 231-1700Fax +1 631 231-1711**************************… of measuring equipment in the RF test and measurement, EMF safety and EMC sectors. The RF test and measurement sector covers analyzers and instruments for measuring andidentifying radio sources. The EMF safety product spectrum includes wideband and frequency-selective measuring devices, and monitors for wide area coverage or which can be worn on the body for personal safety. The EMC sector offers instruments for determining the electro-magnetic compatibility of devices under the PMM brand. The range of services includes servicing, calibration, accredited calibration, and continuous training programs.Narda Safety Test Solutions GmbH Sandwiesenstraße 772793 Pfullingen, Germany Tel. +49 7121 97 32 0Fax +49 7121 97 32 790********************* /en/signalshark。
Mean-shift tracking based on simple linear iterative clustering (SLIC) superpixels
Shao Chenlin; Yang Weiping; Zhang Zhilong

Abstract: In order to enhance the robustness of target tracking algorithms under motion displacement, occlusion, deformation and disturbance from similar objects, this paper proposes to construct the target appearance model from superpixels, to match the appearance model against the candidate region in order to obtain the target superpixels within it, and to determine the target center point with the mean-shift algorithm. The simulation experiments use the videos Girl and FaceOcc1 from the Benchmark library, which are representative with respect to motion displacement, occlusion, deformation and interference from similar objects. On the video Girl, the tracking success rate and tracking precision of the proposed algorithm are 0.601 and 0.856, which are 0.059 and 0.084 higher than those of KCF, the best-performing of the classical algorithms used for comparison. On the video FaceOcc1, the tracking success rate and precision rank second only to KCF. The results indicate that the proposed tracker remains robust when the target is occluded or disturbed by similar objects.

Journal: Journal of Applied Optics (《应用光学》)
Year (volume), issue: 2017, 38(2)
Pages: 7 (pp. 193-199)
Keywords: superpixel; color histogram; mean-shift algorithm; appearance model; tracking
Authors: Shao Chenlin; Yang Weiping; Zhang Zhilong
Affiliation: College of Electronic Science and Engineering, National University of Defense Technology, Changsha, Hunan 410073, China
Language of the full text: Chinese
CLC classification: TN202; TP391

Object tracking is an important research topic in computer vision, widely applied in security surveillance, human-computer interaction, smart cities and the military domain.
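A rough sketch of the idea described in the abstract, assuming scikit-image and NumPy are available: superpixels overlapping the initial target region form the appearance model (one color histogram per superpixel), candidate superpixels in later frames are scored by histogram intersection against that model, and a few mean-shift iterations on the resulting confidence map give the new target center. This only illustrates the general approach, it is not the authors' implementation, and every parameter value is a placeholder.

import numpy as np
from skimage.segmentation import slic

def superpixel_histograms(frame, labels, selected):
    # One normalized RGB histogram per selected superpixel label.
    feats = []
    for lab in selected:
        pix = frame[labels == lab].reshape(-1, 3)
        h, _ = np.histogramdd(pix, bins=(8, 8, 8), range=[(0, 255)] * 3)
        h = h.ravel()
        feats.append(h / (h.sum() + 1e-12))
    return feats

def build_model(frame, target_mask, n_segments=300):
    # Appearance model: histograms of superpixels that lie mostly inside the target mask.
    labels = slic(frame, n_segments=n_segments, compactness=10)
    inside = [lab for lab in np.unique(labels)
              if (labels == lab)[target_mask].sum() > 0.5 * (labels == lab).sum()]
    return superpixel_histograms(frame, labels, inside)

def confidence_map(frame, model, n_segments=300):
    # Score every superpixel by its best histogram intersection with the model.
    labels = slic(frame, n_segments=n_segments, compactness=10)
    feats = superpixel_histograms(frame, labels, np.unique(labels))
    conf = np.zeros(frame.shape[:2])
    for lab, f in zip(np.unique(labels), feats):
        conf[labels == lab] = max(np.minimum(f, m).sum() for m in model)
    return conf

def mean_shift_center(conf, center, half_win=40, n_iters=10):
    # Move the window center to the confidence-weighted centroid of the window.
    cy, cx = float(center[0]), float(center[1])
    H, W = conf.shape
    for _ in range(n_iters):
        y0, y1 = max(int(cy) - half_win, 0), min(int(cy) + half_win, H)
        x0, x1 = max(int(cx) - half_win, 0), min(int(cx) + half_win, W)
        win = conf[y0:y1, x0:x1]
        if win.sum() <= 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        cy, cx = (ys * win).sum() / win.sum(), (xs * win).sum() / win.sum()
    return int(round(cy)), int(round(cx))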
The frame difference method (English term: Frame Difference Method)
The Frame Difference Method (also known as the Frame Difference Technique or Frame-to-Frame Difference Method) is a widely used image processing technique employed to detect moving objects or changes in video sequences. It works by comparing consecutive frames in a video stream and computing the pixel-wise differences between them. The resulting difference image is then thresholded to highlight the areas where significant motion or changes have occurred.

The Frame Difference Method is a straightforward and computationally efficient approach to detecting motion in videos. It has found applications in various domains, including surveillance systems, video compression, object tracking and human-computer interaction. The simplicity and effectiveness of this method make it an attractive choice in scenarios where real-time operation or low computational power is a concern.

The following steps provide a general outline of the Frame Difference Method:
1. Frame extraction: consecutive frames are extracted from a video sequence.
2. Frame difference: the extracted frames are compared to compute the pixel-wise differences between them. This can be done by subtracting the pixel values of two frames or applying other mathematical operations.
3. Thresholding: the resulting difference image is thresholded to create a binary mask. The threshold value determines the sensitivity of the motion detection. Pixels with differences above the threshold are considered moving objects or regions of interest.
4. Filtering: additional filtering techniques, such as morphological operations or temporal filtering, can be applied to remove noise or refine the detected motion regions.
5. Motion representation: the final output of the Frame Difference Method is typically a binary image or a sequence of binary images, where white pixels indicate motion or changes.

Some factors and considerations to take into account when using the Frame Difference Method include:
1. Sensitivity: the threshold value used for motion detection determines the sensitivity of the method. A lower threshold will detect more motion but may also produce more false positives, while a higher threshold may miss certain motions.
2. Noise: the resulting difference image may contain noise or small changes due to factors such as lighting variations or camera noise. Applying filtering techniques can help reduce these artifacts.
3. Background subtraction: the Frame Difference Method can be combined with background subtraction techniques to enhance motion detection against dynamic backgrounds.
4. Object occlusion: if objects in the video sequence occlude each other, the Frame Difference Method may struggle to detect and track them accurately.

Overall, the Frame Difference Method is a viable and efficient approach for motion detection in video sequences. By comparing consecutive frames and analyzing pixel-wise differences, it enables the identification of moving objects or changes, making it an essential tool in various applications and industries.
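A minimal sketch of steps 1-5 using OpenCV; the video path, the threshold of 25 and the 5×5 kernel are placeholder choices, and a real system would typically add the background subtraction and temporal filtering mentioned above.

import cv2
import numpy as np

def frame_difference_motion(video_path="video.avi", thresh=25):
    # 1) Frame extraction: read consecutive frames from the video.
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    kernel = np.ones((5, 5), np.uint8)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(gray, prev_gray)                            # 2) pixel-wise difference
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)  # 3) thresholding
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)          # 4) noise filtering
        moving_pixels = int(np.count_nonzero(mask))                    # 5) binary motion mask
        print("moving pixels:", moving_pixels)
        prev_gray = gray
    cap.release()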
Real-time affine region tracking and coplanar grouping∗Vittorio Ferrari1,Tinne Tuytelaars2,Luc Van Gool1,21Computer Vision Group(BIWI),ETH Zuerich,Switzerland2ESAT-PSI,University of Leuven,Belgium{ferrari,vangool}@vision.ee.ethz.ch,Tinne.Tuytelaars@esat.kuleuven.ac.beAbstractWe present a novel approach for tracking locally planar re-gions in an image sequence and their grouping into larger planar surfaces.The tracker recovers the affine transfor-mation of the region and therefore yields reliable point cor-respondences between frames.Both edges and texture in-formation are exploited in an integrated way,while not re-quiring the complete region’s contour.The tracker with-stands zoom,out-of-plane rotations,discontinuous motion and changes in illumination conditions while achieving real-time performance for a region.Multiple tracked re-gions are grouped into disjoint coplanarity classes.Wefirst define a coplanarity score between each pair of regions, based on motion and texture cues.The scores are then ana-lyzed by a clique-partitioning algorithm yielding the copla-narity classes that bestfit the data.The method works in the presence of perspective distortions,discontinuous planar surfaces and considerable amounts of measurement noise.1.IntroductionHistogram-based region trackers have proven to run in real-time[6,4].The results of these trackers are impres-sive,but for some applications they lack certain properties. Firstly,they don’t provide point correspondences under ro-tation.Secondly,they achieve high performance by con-sidering a low dimensional search space for the region de-formations during tracking(e.g.translation and size only), and are therefore unable to correctly handle the skew and anisotropic scaling effects caused by out-of-plane rotation. On the other hand,trackers that deal with the full set of affine motion parameters have been proposed[3,13],but they rely on heavier techniques which makes them slow. 
Moreover,they often need the complete region’s contour, what strongly limits the number of regions that can be tracked in a scene.We present a region tracker that combines these two ma-jor properties:it tracks a region under complete affine de-formation at real-time speed.Moreover,the tracker does ∗This research was supported by EC project CIMWOS.not need closed contours in the images themselves to de-fine the regions;it can deal with large displacements be-tween subsequent frames,and it can recover from a tem-porary loss of the region.These properties are based on a particular way of searching the affine transformation space which exploits the nature of the two types of regions we are considering.By recovering the complete affine transforma-tion,the tracker yields reliable point-wise correspondences as the region evolves.This is very useful for a number of applications.In the second part of the paper,we investigate one such application:the detection of planar structures from video sequences.Planes play an important part in3D reconstruc-tion(e.g.buildings and indoor scenes).Parallax-based scene analysis methods[9,7]and some special reconstruc-tion techniques[5]need a reference plane.Navigation sys-tems need tofind freefloor space,etc.For the detection of the planar structures,the set of all tracked regions is par-titioned into disjoint coplanarity classes.Each class cor-responds to a distinct plane in3D space.Neither the size nor the number of classes is known beforehand.The basic grouping unit is the pair:with the selected,affine region model providing three independent coplanar points per re-gion,a pair of regions is the smallest set sufficient for con-sidering general planar motions(homographies).We define a coplanarity score between each pair of regions,based on the combination of a planar motion compatibility cue and a novel texture cue.The cues are computed from the point correspondences generated by the tracker.The score con-veys information about the probability that the regions lie on the same plane.We exploit the intrinsic transitivity of the coplanarity property to resolve grouping ambiguities aris-ing from noisy scores by formulating the coplanar group-ing problem in Clique Partitioning[8]terms.This formula-tion offers an elegant approach for the treatment of general grouping problems.We introduce a simple,but effective, polynomial time heuristic for its solution.The structure of the paper is as follows.Section2de-scribes the regions that are the basis for the tracking.Sec-tion3describes the tracker.Section4discusses the copla-nar grouping of the tracked regions.Section5shows some experimental results.Section6concludes the paper.2.Affinely invariant regionsTuytelaars and Van Gool[12]have proposed a method for the automatic extraction and wide-baseline matching of small,planar regions.These regions are extracted around anchor points and are affinely invariant:given the same anchor point in two images of a scene,regions covering the same physical surface will be extracted,in spite of the changing viewpoint.We deal with two of their region types(figure1):parallelogram-shaped(anchored on corner points)and elliptical(anchored on local intensity extrema). 
The former are based on two straight edges intersecting in the proximity of the corner.Thisfixes a corner of the par-allelogram(call it c)and the orientation of its sides.The opposite corner(call it q)isfixed by computing an affinely invariant measure on the region’s texture.Elliptical regions are extracted based on the intensity profile along rays ema-nating from the intensity extremum,without needing edges.Because of the nature of their respective anchor points and extraction methods,these two types complement each other well and experiments show that hundreds of uniformly distributed regions can be extracted from images of typical indoor and outdoor scenes.One of the advantages of using these invariant regions is that tracked regions that have been lost due to large occlu-sions can be recovered by reverting to the matching tech-niques proposed by Tuytelaars and Van Gool(albeit not yet considered in thispaper).Figure1:A parallelogram-shaped and an elliptical region.3.Region trackingBoth the geometry-based and intensity-based regions are tracked using the same scheme.In the following we con-sider tracking a region R from a frame F i−1to its successor frame F i in the image sequence.First we compute a pre-dictionˆR i=A i−1R i−1of R i using the affine transforma-tion A i−1between the two previous frames.An estimateˆa i=A i−1a i−1of the region’s anchor point1,is computed, around which a circular search space S i is defined.The radius of S i is proportional to the current translational ve-locity of the region.The anchor points in S i are extracted.1Harris corners for geometry-based regions and intensity extrema for intensity-based regions These provide potentially better estimates for the region’slocation.We investigate the point closest toˆa i looking for the target region R i.The anchor point investigation al-gorithm differs for parallelogram-shaped and elliptical re-gions and will be explained in the two following subsec-tions.Since the anchor points are sparse in the image,theone closest to the predicted location is,in most cases,the correct one.If not,the anchor points are iteratively inves-tigated,from the closest(toˆa i)to the farthest,until R i isfound.Thisfirst pruning of the search space helps achiev-ing high speed while keeping the radius of S i wide enoughto ensure tolerance to large image displacements.In somecases it is possible that no correct R i is found around any anchor point in S i(e.g.:occlusion,sudden acceleration,failure of the anchor point extractor).When this happens the region’s location is set to the prediction(a i=ˆa i),and the tracking process proceeds to the next frame,with a largerS.In most cases this allows to recover the region in oneof the next few frames,while avoiding the computationallyexpensive process of searching F i further.3.1.Parallelogram-shaped regionsGiven a corner point h,the region predictionˆR i,and the region in the previous frame R i−1we want to test if R i is anchored to h and,in that case,extract it.The idea is to construct at h the region most similar to R i−1.The process follows two steps.Thefirst tracks two of the straight region sides exploiting the geometric information(edges)of the image,and already yields partial information about R i.The second step starts from the output of thefirst,and completes R i by exploiting intensity information(texture).In thefirst step a polyline snake with three-vertices re-covers two of the sides,but not yet their lengths.We ex-ploit the fact that translatingˆR i so thatˆc=h automatically provides an 
estimation of the sides.We initialize the center vertex v c of the snake at h and the other two vertices v1,v2 so that the line segments v c v1and v c v2have the orienta-tion of the predicted region sides(figure2).The three points are iteratively moved in order to maximize the total sum of gradient magnitudes along the two line segments:E S(v c,v1,v2)=p∈v c v1| I(p)|+p∈v c v2| I(p)|where| I(p)|is the image gradient magnitude at pixel p.The snake can deform only by hinging around v c and the length of the line segments is keptfixed(we are inter-ested in their orientation only).These constraints reduce the number of DOF to four,thereby reducing the search space and improving efficiency.The optimization process is efficiently implemented by a Dynamic Programming algorithm inspired by[1,14].The algorithm has a higher probability of being attracted towardv c h=12 Figure2:Left:Polyline snake initialization.Right:ˆq ini-tialization.contours than the traditional snake implementation[10], and it is ensured to converge[1].In practice h is often very close(a few pixels)to the intersection point c of the target region sides.Hence our initialization is often very good and this reduces the number of iterations and the risk of being attracted by nearby distractor edges.The tracked region sides lift four DOF:the two coordi-nates of c=v c and the orientations of the two sides.These correspond to the translation,rotation and skew components of the affine transformation A i mapping R i−1to R i.This is all the information we can extract from the geometric fea-tures of the image.The two remaining DOF correspond to the position of the point q(they arise from the scale com-ponents of A i)and are derived from the texture content of the region by the second step of the algorithm.An initial estimateˆq is obtained by aligningˆR i on the structure formed by v1,v c,v2,so thatˆc=v c and the sides are oriented like v c v1,v c v2(figure2).This estima-tion is refined by movingˆq so as to maximize the similarity between the resulting region R i(ˆq)and the region in the previous frame R i−1.As a similarity measure we use the normalized cross-correlation between R i−1and R i(ˆq)after aligning them via A(ˆq),the affine transformation mapping R i−1onto R i(ˆq).Therefore,q is obtained as the location ofˆq maximizing the objective function:E c=CrossCorr(A(ˆq)R i−1,R i(ˆq))(1) Notice that this similarity measure is invariant not only un-der geometric affine transformations,but also under linear transformations of the pixels intensities.This makes the tracking process relatively robust to changes in illumination conditions.The maximization process is implemented by Gradient Descent,initialized onˆq,where at each iteration ˆq is moved1pixel in the direction of maximal increase. 
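The second step just described, refining q̂ by maximizing the objective E_c of Eq. (1), amounts to a small hill-climbing search on a normalized cross-correlation score. The sketch below illustrates that search; the function that extracts and affinely aligns the candidate region for a given q̂ is abstracted away as a callback, and a 4-neighbourhood step is used as a simplification of moving one pixel in the direction of maximal increase.

import numpy as np

def ncc(a, b):
    # Normalized cross-correlation between two equally sized patches.
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def refine_q(q_init, extract_warped_patch, prev_patch, max_iters=10):
    # Greedy 1-pixel hill climbing on E_c. extract_warped_patch(q) must return the
    # candidate region R_i(q) warped into the frame of the previous region, so that
    # it can be compared directly with prev_patch.
    q = np.asarray(q_init, dtype=float)
    best = ncc(extract_warped_patch(q), prev_patch)
    steps = [np.array(s, dtype=float) for s in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
    for _ in range(max_iters):
        scores = [(ncc(extract_warped_patch(q + s), prev_patch), s) for s in steps]
        score, step = max(scores, key=lambda t: t[0])
        if score <= best:
            break                       # local maximum reached
        best, q = score, q + step       # move q one pixel in the best direction
    return q, best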
Typicallyˆq is initialized close to the absolute maximum, because most of the variability of the affine transformation is lifted by the sides tracking step.This strongly reduces the risk of converging toward a local maximum and keeps the number of iterations low.Extensive experiments con-firm this consideration and indicate that,in most cases,3 iterations are enough.At the end of the second step,the most similar region to R i−1anchored to h is constructed.This does not mean that it is the correct region though,as h could just be the wrong corner.Hence,asfinal verification we check if the maxi-mum cross-correlation value is above a predefined thresh-old(typically0.9),otherwise the algorithm proceeds to the next corner.3.2.Elliptical regionsLet us now focus on the elliptical regions.Given an inten-sity extremum i,the region predictionˆR i and the region in the previous frame R i−1,is R i anchored to i?Since the elliptical regions of[12]exploit only the raw intensity pat-tern of the image and do not rely on the presence of nearby edges,we can no longer devise a two-step search strategy like for the parallelogram-shaped case.A natural alterna-tive would be to look for the complete set of6parameters of A i simultaneously,by minimizing a cross-correlation based criteria similar to equation(1)(as proposed in[3]).The search process could be initialized from the affine trans-formation A i−1between the two previous frames(possibly translated so that c=i).The problem is that searching for an optimum in this six-dimensional space,starting from an imprecise initialization,would probably take too much computation power to be achieved in real-time.We exploit instead the property that R i can be extracted from F i independently,provided we are considering the correct intensity extremum.In afirst step,R i is extracted around i using an optimized implementation of the algo-rithm described in[12].The second step could consist of verifying that R i cross-correlates with R i−1above a prede-fined threshold.Unfortunately,since an ellipse has onlyfive DOF,it is not possible to directly compute A i from R i−1 and R i.The missing DOF corresponds to a free rotation in the ellipse plane around its center.We want to avoid the approach of[12],which consists of an exhaustive search for the rotation maximizing the cross-correlation,because of its inefficiency.We propose an alternative based on a photometric invariant version of the axis of inertia.First ˆR i is affinely mapped to a reference circular region O.The major and minor axes of inertia are then extracted(figure3) as the lines passing from the center of O with orientations θmax,θmin defined by the solutions of:tan2(θ)+m20−m02m11tanθ−1=0(2) with m pq the(p,q)order moment centered on the region’s geometric center.Equation(2)differs from the usual defi-nition of the axis of inertia by the use of these moments in-stead of moments centered on the center of gravity weighted with image intensity.This makes them invariant to affine changes of the intensities.These axes are invariant under rotation,in the sense that they will cover the same part of the region after a rotation.Mapping the axis back to the original elliptical region will now provide two affinely in-variant lines,and their intersection points with the ellipse. 
The mapping of the center of the ellipse and these intersections allow us to compute A_i; the cross-correlation test can follow and, if it fails, the tracker can proceed to the next intensity extremum.

Figure 3: From left to right: original elliptical region; mapped to circular region; axes of inertia in circular region; axes mapped back to elliptical region.

4. Coplanar grouping
In this section, the coplanarity cues and score are introduced (subsection 4.1), followed by an alternative, graph-theoretic approach (subsection 4.2) ensuring effective coplanar grouping. Note that from now on, perspective effects are fully taken into account. This is important, since the affine approximation is only valid on a local scale.

4.1. Coplanarity cues and score
Let R, S be two regions tracked in n frames. Consider three points r_i^1, r_i^2, r_i^3 characterizing R in frame F_i. If R is parallelogram-shaped, these are three corners (all but q). In the elliptical case, these are the center c and the two intersections of the axes of inertia with the ellipse. {r_i^p}_{p=1..3} completely define R in F_i, and the correspondences between {r_i^p} and {r_j^p} implicitly encode the affine transformation of R between frames F_i and F_j. We assume analogous definitions for S. We introduce two numerical coplanarity cues that will later be integrated in a single coplanarity score.

The first cue is purely based on the motion of R and S between the first and last frames (F_1, F_n). We compute by least-squares approximation the 2D homography H that best maps the 6 points in F_1 (the set {r_1^p}_{p=1..3} ∪ {s_1^p}_{p=1..3}) to their corresponding points in F_n (the set {r_n^p}_{p=1..3} ∪ {s_n^p}_{p=1..3}). If R, S are coplanar, H correctly describes the motion of both regions. We measure this via the following error:

c_m = (1/6) Σ_{i=1}^{3} ( d(H r_1^i, r_n^i) + d(H s_1^i, s_n^i) )        (3)

where d(p_1, p_2) is the Euclidean distance between points p_1 and p_2. Assuming noise-free data, if R, S are coplanar then c_m = 0; c_m is related to the difference between the position and orientation of the R plane and the S plane in 3D space. Hence we use expression (3) as a cue about the potential coplanarity of two regions: the smaller c_m, the higher the chances of R, S being coplanar.

While the motion cue is based completely on local information, the second cue takes a larger view and considers the image data between R and S. The idea is to check if R, S are coplanar and located on a continuous, unoccluded
Hence we use expression(3)as a cue about the potential coplanarity of two regions:the smaller c m,the higher the chances of R,S being coplanar.While the motion cue is based completely on local in-formation,the second cue takes a larger view and considers the image data between R and S.The idea is to check if R,S are coplanar and located on a continuous,unoccluded Figure4:Top:Non-coplanr pair of regions.Bottom:Other view.The central line segment is misaligned.physical planar surface in3D space.In order to take into account a small,but representative,sample of the surface between R and S we consider three lines connecting cor-responding characteristic points of the two regions.To keep the notation simple,we restrict the explanation to thefirst line.Consider the line l1connecting thefirst characteristic points R11,S11of the two regions in thefirst frame.Divide l1in s=d(R11,S11)msegments of equal length m,denote them{l j1}j=1..s.Let{l j n}j=1..s be a list of segments in F n, whose coordinates are obtained by projecting{l j1}j=1..s via H.We are interested in the similarity between correspond-ing segments in the two frames,and in particular in the least similar one:minj=1..sCrossCorr(l j1,l j n)(4)where CrossCorr(l j1,l j n)is the value of the normalized cross-correlation of the intensity profile on line segment l i1 with the one on l j n.Only if R,S are coplanar and located on a continuous,unoccluded planar surface,will all seg-ments score well.If R,S are not coplanar,segments close to the region may still score well:H describes the motion in that zone best,and probably the neighboring area is planar. Nevertheless,central segments will tend to be misaligned as H can not correctly describe their motion,and therefore have low scores(figure4).Taking the least scoring segment ensures the detection of exactly those significant cases.We define the second coplanarity cue c t as the average of expression(4)over the three lines connecting r11with s11,r21with s21and r31with s31.Coplanar pairs located on a discontinuous planar surface(e.g.:the surface is inter-rupted between the two regions)will tend to have a low c t; clearly,this should not be interpreted as an indication that two regions are not coplanar.Hence,we use c t only to in-crement the total coplanarity score.Nevertheless the role of this cue must not be underestimated,as we expect it to sub-stantially reinforce the total score of a significant portion of the coplanar pairs,hence helping the forthcoming grouping algorithm.From the above considerations,we define the copla-narity score of a pair:w=(h t−c m)+c t h t if c t>0.60otherwise(5)where h t>0is an homography error threshold,acting likea splitting point between positive and negative scores(w<0suggests that R,S are not coplanar,while w≥0suggests they are)and defining the maximal positive contribution ofeach cue.The range of w is]−∞,2h t].In practice though, for h t=2.0scores above1.0already indicate very probable coplanarity.4.2.Grouping algorithmTaken one by one,the coplanarity scores are unreliable be-cause they arise from very limited,noisy information.In practice it happens that a coplanar pair has w<0(false negative),and the contrary(false positive).Nevertheless, taken altogether,the scores clearly contain reliable infor-mation about the correct grouping.We want to be robust to misleading local information by exploiting the transitivity of coplanarity:if R,S are coplanar and S,T too,then R,T must be coplanar2.How can transitivity help us?Consider a scene with three regions.Let w ij be 
the score of the pair (i,j)composed of the ith and jth region.Given the scores w12=9,w13=7,w23=−3,and the transitivity property, the best choice is to group the three regions together(w23 is a false negative score).Next,we formulate the coplanar grouping problem so as to exploit transitivity to detect and avoid false scores.We propose to construct a complete graph G where each vertex represents a region and edges are weighted with the coplanarity scores.We partition G into completely con-nected disjoint subsets of vertices(cliques)so as to maxi-mize the total score on the remaining edges(Clique Parti-tioning,or CP).The transitivity property is ensured by the clique constraint:every two vertices in a clique are con-nected,and no two vertices from different cliques are con-nected.Hence,the generated cliques correspond to the best possible coplanar grouping(given the cues).The CP formu-lation of coplanar grouping is made possible by the presence of positive and negative weights:they naturally lead to the definition of a best solution without the need of knowing the number of cliques(planes)or introducing any artificial stop-ping criteria like in other graph-based approaches to group-ing based on strictly positive weights[11,2].On the other hand,our approach needs a parameter h t that determines the splitting point between positive and negative scores.But,in 2Coplanarity is reflexive,symmetric and transitive(an equivalence re-lation).134527−493−1−23134527−493−1−231345279−13Figure5:An example graph and two iterations of CP.Not displayed edges have zero weight.our context,this parameter is easily determined and exper-iments show the optimal solution of CP to be generated for a wide range of h t.CP can be solved by Linear Programming[8](LP).Let w ij be the weight of the edge connecting(i,j),and x ij∈{0,1}indicate whether the edge exists in the solution.The following LP can be established:maximize1≤i<j≤nw ij x ijsubject to x ij+x jk−x ik≤1,∀1≤i<j<k≤nx ij−x jk+x ik≤1,∀1≤i<j<k≤n−x ij+x jk+x ik≤1,∀1≤i<j<k≤nx ij∈{0,1},∀1≤i<j<k≤n(6) The inequalities express the clique constraints(transitivity), while the objective function to be maximized corresponds to the sum of the intra-clique edges.Unfortunately CP is an NP-hard problem[8]:LP(6)has worst case exponential complexity in the number n of vertices(regions),making it impractical for large n.The challenge is tofind a practical way out this com-plexity trap.The correct partitioning of the example infig-ure5is{{1,3},{2,4,5}}.A simple greedy strategy merg-ing two vertices(i,j)if w ij>0fails because it merges (1,2)as itsfirst move.Such an approach suffers from two problems:the generated solution depends on the order by which vertices are processed and it looks only at local infor-mation.We propose the following iterative heuristic.The algorithm starts with the partitionΦ={{i}}1≤i≤n com-posed of n singleton cliques each containing a different ver-tex.The function m(c1,c2)=i∈c1,j∈c2w ij defines the cost of merging cliques c1,c2.We consider the functions b(c)=max t∈Φm(c,t)and d(c)=arg max t∈Φm(c,t) representing,respectively,the score of the best merging choice for clique c and the clique with whom to merge.We merge cliques c i,c j if and only if d(c i)=c j and d(c j)=c i and b(c i)=b(c j)>0.In other words,two cliques are merged only if each one represents the best merging option for the other and if merging them increases the total score. 
The challenge is to find a practical way out of this complexity trap. The correct partitioning of the example in figure 5 is {{1, 3}, {2, 4, 5}}. A simple greedy strategy merging two vertices (i, j) if w_ij > 0 fails because it merges (1, 2) as its first move. Such an approach suffers from two problems: the generated solution depends on the order in which the vertices are processed, and it looks only at local information. We propose the following iterative heuristic. The algorithm starts with the partition Φ = {{i}}_{1≤i≤n} composed of n singleton cliques, each containing a different vertex. The function m(c_1, c_2) = Σ_{i∈c_1, j∈c_2} w_ij defines the cost of merging cliques c_1, c_2. We consider the functions b(c) = max_{t∈Φ} m(c, t) and d(c) = argmax_{t∈Φ} m(c, t), representing, respectively, the score of the best merging choice for clique c and the clique with which to merge. We merge cliques c_i, c_j if and only if d(c_i) = c_j, d(c_j) = c_i and b(c_i) = b(c_j) > 0. In other words, two cliques are merged only if each one represents the best merging option for the other and if merging them increases the total score. At each iteration the functions b(c), d(c) are computed, and all pairs of cliques fulfilling the criteria are merged. The algorithm runs until no two cliques can be merged.

Figure 5 shows an interesting case. In the first iteration {1} is merged with {3}, and {4} with {5}. Notice how {2} is, correctly, not merged with {1} even though m({1}, {2}) = 3 > 0. In the second iteration {2} is correctly merged with {4, 5}, resisting the (false) attraction of {1, 3} (b({1, 3}) = m({1, 3}, {2}) = 1, d({1, 3}) = {2}). The algorithm terminates after the third iteration because m({1, 3}, {2, 4, 5}) = −3 < 0. The second iteration shows the power of CP. Vertex 2 is connected to unreliable edges (w_12 is a false positive, w_25 a false negative). Given vertices {1, 2, 3} only, it is not possible to derive the correct partitioning {{1, 3}, {2}}; but, as we add vertices {4, 5}, the global information increases and CP manages to get the correct partitioning out of it.

The proposed heuristic is order independent, takes a more global view than a direct greedy strategy, and resolves several ambiguous situations while maintaining polynomial complexity (worst case O(n^3), but faster in practice). In the first iterations, being biased toward very positive weights, the algorithm risks taking wrong merging decisions. Nevertheless, our particular merging criterion ensures that this risk quickly diminishes with the size of the cliques in the correct solution (the number of regions in a plane) and at each iteration, as the cliques grow and increase their resistance against spurious weights. Moreover, in our application, very positive scores arise only when both cues score well and are therefore much more reliable than negative scores, which are often due to large homography errors caused by measurement noise. In summary, the algorithm uses reliable data as seeds, and then proceeds to the robust construction of the correct solution by filtering out spurious data.
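The merging heuristic is easy to express in code. The sketch below is a plain Python rendering of the description above, not the authors' implementation: cliques are represented as frozensets, merge_score implements m(c_1, c_2), each pass computes b(c) and d(c) for every clique, and the weight dictionary is assumed to hold a score for every region pair keyed by (i, j) with i < j (missing pairs default to 0).

```python
def merge_score(c1, c2, w):
    """m(c1, c2): total weight of the edges between cliques c1 and c2."""
    return sum(w.get((min(i, j), max(i, j)), 0.0) for i in c1 for j in c2)

def clique_partition_heuristic(vertices, w):
    """Iterative mutual-best-merge heuristic for clique partitioning."""
    cliques = [frozenset([v]) for v in vertices]          # start from singleton cliques
    while True:
        best = {}                                         # c -> (b(c), d(c))
        for c in cliques:
            others = [t for t in cliques if t != c]
            if not others:
                break
            d_c = max(others, key=lambda t: merge_score(c, t, w))
            best[c] = (merge_score(c, d_c, w), d_c)
        merged, used, new_cliques = False, set(), []
        for c in cliques:                                 # merge all mutually-best pairs
            if c in used:
                continue
            score, d_c = best.get(c, (0.0, None))
            if d_c is not None and d_c not in used and best[d_c][1] == c and score > 0:
                new_cliques.append(c | d_c)
                used.update((c, d_c))
                merged = True
            else:
                new_cliques.append(c)
                used.add(c)
        cliques = new_cliques
        if not merged:                                    # no pair fulfils the merging criterion
            return [set(c) for c in cliques]

# Three-region example from the beginning of this section.
weights = {(1, 2): 9.0, (1, 3): 7.0, (2, 3): -3.0}
print(clique_partition_heuristic([1, 2, 3], weights))     # -> [{1, 2, 3}]
```

The mutual-best condition is what makes the result independent of the order in which the cliques are visited: a clique is never merged merely because some neighbouring edge happens to be positive.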
5. Experiments

5.1. Tracking

We present two sequences demonstrating the tracker's qualities. In both sequences, the images of the tracked planar patches are put into complete correspondence along the frames. This allows us to derive three reliable point correspondences per region between any pair of frames.

The Book sequence (figure 6) features a parallelogram-shaped and an elliptical region undergoing simultaneous rotation and scaling. The physical planar patch covered by each region is accurately tracked along the sequence, as proven by the constant high cross-correlation scores (figure 6). The axes of inertia of the elliptical region reliably follow the rotation of the book. The computational performance meets the real-time expectations (figure 6); all experiments were performed on a Sun UltraSparc-IIi, 440 MHz. With an average cost per frame of 0.018 seconds, the parallelogram-shaped region is tracked particularly efficiently. The difference with the elliptical one (average 0.034 seconds per frame) is mostly due to the different (and currently slower) algorithm needed for the anchor point extraction.

Figure 6 (caption, partial): ...seconds (y axis) to track each frame (x axis) for the parallelogram-shaped region (thick) and the elliptical one (thin; the peak corresponds to a temporary loss). Bottom right: cross-correlation scores.

Control of out-of-plane rotation and robustness to discontinuous motion are exemplified in the Poster sequence (figure 7). The region rotates significantly around the vertical axis, causing skew and anisotropic scaling effects in the image. The tracker was able to handle this situation by correctly transforming the 2D region's shape: despite the very different viewpoints of frames 1 and 200, the region is covering the same physical surface. In our application, this is a required feature: the region deformation yields precious information about a plane orientation. As it was taken with a handheld camera, the sequence contains a certain amount of irregular motion: the region sometimes abruptly changes direction and velocity, making it hard to predict the next location accurately. Moreover, we increased the average velocity even further by subsampling the sequence to contain only every fourth frame. The total effect is a very discontinuous motion (irregular and fast), where the region moves fast between each frame and where the predicted location is often far from the correct one. The tracker successfully managed to find the region in every frame despite predic-