Vol. 50, No. 8, Aug. 2023
Journal of Hunan University (Natural Sciences)

Deep Supervised Hashing Image Retrieval Method Based on Swin Transformer

MIAO Zhuang, ZHAO Xinxin, LI Yang†, WANG Jiabao, ZHANG Rui
(Command and Control Engineering College, Army Engineering University of PLA, Nanjing 210007, China)

CLC number: TP391    Document code: A

Abstract: Feature extraction in deep supervised hashing for image retrieval has long been dominated by convolutional neural network architectures, but with the application of Transformer in the vision field, replacing convolutional architectures with Transformer has become possible. To address the limitations of existing Transformer-based hashing methods, such as the inability to generate hierarchical representations and high computational complexity, a deep supervised hashing image retrieval method based on Swin Transformer is proposed. The method takes the Swin Transformer network model as its basis and appends a hash layer at the end of the network to produce hash codes for images. By introducing locality and hierarchy into the model, it effectively solves the above problems.
Compared with 13 existing state-of-the-art methods, the proposed method greatly improves hash retrieval performance. Experiments are carried out on two commonly used retrieval datasets, CIFAR-10 and NUS-WIDE. The results show that the proposed method achieves a highest mean average precision (mAP) of 98.4% on the CIFAR-10 dataset, an average increase of 7.1% over the TransHash method and of 0.57% over the VTS16-CSQ method. On the NUS-WIDE dataset, it achieves a highest mAP of 93.6%, an average improvement of 18.61% over TransHash and an average increase of 8.6% in retrieval accuracy over VTS16-CSQ.

Received: 2022-06-20
Foundation items: National Natural Science Foundation of China (61806220); National Key Research and Development Program of China (2017YFC0821905)
Author biography: MIAO Zhuang (1976-), male, born in Liaoyang, Liaoning; professor and doctoral supervisor at Army Engineering University of PLA; Ph.D.
† Corresponding author, E-mail: **********************
Article ID: 1674-2974(2023)08-0062-10    DOI: 10.16339/ki.hdxbzkb.2023274
Key words: hash learning; deep learning; image retrieval; Swin Transformer

With the rapid development of Internet technology, high-dimensional data are growing explosively, and efficient large-scale information retrieval has become a serious challenge for representation learning. Among the methods proposed for this challenging task [1-4], hashing-based methods have achieved remarkable success [5]. Hashing methods aim to learn a hash function that maps images from the high-dimensional pixel space to a low-dimensional Hamming space while preserving their visual similarity in the original pixel space.

According to how features are extracted, existing hashing methods fall into two categories: traditional methods and deep-learning-based methods. Traditional methods [6-8] learn the hash function from hand-crafted visual descriptors (e.g., GIST [9]). However, such descriptors cannot accurately preserve the semantic similarity of the original image pairs, which degrades the subsequent hash-function learning. Compared with traditional hashing, deep-learning-based hashing [10-11] usually achieves significant performance improvements. It consists of two stages: the first learns a deep convolutional neural network (CNN), such as AlexNet, that produces image feature representations; the second maps the continuous features to binary hash codes through a deep hash function, using various loss functions [12-16] to preserve the similarity of the original pixel space. By learning feature representations and binary hash codes simultaneously, end to end, with CNNs [17-22], deep hashing methods achieve good retrieval performance. For example, [23] proposed a Bayesian learning framework with a pairwise loss; [13] further replaced the probability generation function over the network output logits with a Cauchy distribution, to penalize similar image pairs whose Hamming distance exceeds a threshold; [24] pioneered a similarity matrix for multi-label image retrieval.

Although the above methods achieve good retrieval performance, some problems remain. The binary hash code of an image is obtained by quantizing its continuous feature representation with the sign function; since sign is non-differentiable, the optimization becomes an NP-hard problem. Most previous work [25] adopts continuous relaxation (replacing sign with tanh or sigmoid) during training and then applies sign directly at test time to obtain the final binary codes. The first typical approach adds a penalty term to the loss function to control quantization; this penalty encourages sign(h) ≈ h. Variants of this approach exist: [23] proposed DHN, which controls the quantization error and optimizes the loss function simultaneously, and [26] proposed the DCH architecture, which jointly optimizes a Cauchy cross-entropy loss and a Cauchy quantization loss to generate compact, concentrated binary codes for efficient Hamming-space retrieval. The second approach decomposes the optimization into several subproblems solved iteratively by alternating minimization; backpropagation is used for one subproblem while the others are solved with other optimization methods. For example, [27] directly constrains the outputs of the last layer to be binary codes, preserving the discreteness of hash coding, and proposes an alternating optimization strategy for the objective. The third approach, exemplified by HashNet [28], learns hash codes directly through a convergent continuation method, learning accurate binary codes from continuous similarity data by increasing β so that tanh(βz) approaches sign(z). Beyond these three common approaches, others exist. For example, GreedyHash [29] uses a greedy algorithm to keep the discrete constraint on the network outputs in the forward pass and transmits the gradient of the hash layer entirely to the preceding layers in the backward pass, further addressing the discrete optimization problem and improving retrieval accuracy. Inspired by this, we design a greedy asymmetric loss function that alleviates vanishing gradients during optimization and makes full use of the training data, improving the accuracy of hash-code learning.

In recent years, Transformer [30] has achieved great success in natural language processing [31-32]. Following this success, many researchers have begun exploring Transformer as a standalone architecture for vision tasks, which poses two main challenges. First, Transformer operates on a set of tokens, but images contain no natural tokens comparable to words in natural language. Second, images have strong local structure, whereas the Transformer architecture treats all tokens identically and ignores locality. In 2020, Google proposed Vision Transformer (ViT) [33], which addressed the first challenge by simply splitting an image into non-overlapping patches and treating each patch as a visual token, effectively alleviating the problem. Subsequently, Swin [34] and Twins [35] addressed the second problem with local ViTs, computing self-attention within non-overlapping windows and adopting locally grouped self-attention. These mechanisms introduce locality, which not only improves the performance of the architecture but also makes full use of memory and improves computational efficiency.

Transformer has also been applied to hashing-based image retrieval. TransHash [36] proposed a two-stream feature-learning architecture based on a pure vision Transformer that simultaneously learns discriminative global and local image feature representations. However, TransHash fails to exploit Transformer models pretrained on large-scale datasets and recently studied objective functions. Therefore, [33] proposed the Vision Transformer hashing method (VTS) for image retrieval, which extracts continuous feature representations with a pretrained ViT model and appends a hashing module at the end of the network to learn binary codes, reaching the current state of the art in deep supervised hashing image retrieval. However, ViT keeps the same feature resolution throughout feature extraction and computes dependencies between all pixel pairs over the whole image, so it cannot extract multi-level image representations and has high computational complexity. We therefore explore a hashing backbone designed on top of Swin Transformer [34], introducing locality and a hierarchical structure into Transformer to address these problems and further improve hash retrieval performance.

This paper establishes a deep supervised hashing image retrieval method based on Swin Transformer, the first hash-learning method designed on top of Swin Transformer. An image passed through Swin Transformer yields a deep feature representation; since the learned representation is continuous, the test stage must convert it into a binary hash code via the sign function. Because sign is non-differentiable, the optimization becomes NP-hard, leading to suboptimal retrieval performance. We therefore design a greedy asymmetric loss to address this problem and obtain good results. In summary, our main contributions are:

1) A deep supervised hashing image retrieval method based on Swin Transformer. By introducing locality and hierarchy into Transformer, the method generates hierarchical image representations and obtains deep features that better preserve low-level image information.
2) A greedy asymmetric loss function that alleviates vanishing gradients during optimization, makes full use of the training data, and improves the accuracy of hash-code learning.
3) Experiments on two common retrieval datasets, CIFAR-10 and NUS-WIDE, show that our method outperforms the others. On CIFAR-10, with far less training data than TransHash, our mAP improves by 7.1% on average over TransHash and by 0.57% over VTS16-CSQ. On NUS-WIDE, our mAP improves by 18.61% on average over TransHash, and retrieval accuracy improves by 8.6% over VTS16-CSQ. These results fully validate the effectiveness of the proposed method.

1 Method

Fig. 1 shows the overall framework of our network, which comprises two stages: feature extraction and the loss function. The main idea is as follows: a hashing network designed on top of Swin Transformer extracts deep features from the input image; a deep hash function then generates the image's hash code; and retrieval is performed by ranking images according to the Hamming distance between their hash codes.

Fig. 1 Network architecture of our method

1.1 Feature extraction

The hashing network designed on top of Swin Transformer extracts deep image features. Like mainstream CNN models, Swin Transformer adopts a pyramid structure that divides the model into stages. It contains one patch-partition layer and four stages, each stage being a stack of Swin Transformer blocks, as shown in Fig. 2. During hash feature extraction, an H×W (224×224) RGB image is first fed into the hashing network, and the patch-partition layer splits it into non-overlapping patches of side p = 4 with feature dimension 48, so that the Transformer blocks can directly process the two-dimensional image structure. The patches are then sent into the stages for self-attention computation. In Stage 1, a linear-embedding layer first maps the feature dimension of each patch to C before the patches enter the Swin Transformer blocks; Stages 2, 3, and 4 are identical to one another, each using a patch-merging layer that merges adjacent 2×2 patches, changing the feature-map resolution at each stage and thereby generating hierarchical representations.

Fig. 2 Two successive Swin Transformer blocks

As shown in Fig. 2, a Swin Transformer block alternates a window multi-head self-attention mechanism (W-MSA) and a shifted-window multi-head self-attention mechanism (SW-MSA). W-MSA introduces a window operation that partitions the image patches uniformly without overlap, each window containing M×M adjacent patches. Self-attention is then computed independently within each window, which effectively reduces computation and thus resolves the high computational complexity of the VTS method mentioned above. However, because W-MSA lacks cross-window connections, it cannot model global features. To address this, Swin Transformer further proposes the SW-MSA mechanism: within the regular windows partitioned by the W-MSA module, the windows are displaced starting from the (⌊M/2⌋, ⌊M/2⌋) pixel. This partition introduces connections between adjacent non-overlapping windows of the previous module and improves the model's global modeling ability, but it produces many irregularly sized windows and increases computation. For efficiency, Swin Transformer computes with a cyclic shift toward the upper left, as shown in Fig. 3 (drawn only for an 8×8 feature map with M = 4). In the cyclically shifted feature map, one window may consist of several sub-windows that are not adjacent in the original feature map, so a mask mechanism restricts the self-attention computation to each sub-window. With the cyclic shift, the number of windows remains the same as in the regular partition, so Swin Transformer achieves information exchange between adjacent windows while keeping computation low. The computation of two successive Swin Transformer blocks is:

$\hat{z}^{l} = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$,
$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$,
$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^{l})) + z^{l}$,
$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$    (1)

where $\hat{z}^{l}$ and $z^{l}$ denote the output features of the (S)W-MSA module and the MLP module of block l, respectively.

Fig. 3 W-MSA and SW-MSA

In the self-attention computation, a relative position bias $B \in \mathbb{R}^{M^2 \times M^2}$ is added when computing the similarity for each head:

$\text{Attention}(Q, K, V) = \text{SoftMax}(QK^{\mathrm{T}}/\sqrt{d} + B)V$    (2)

where $Q, K, V \in \mathbb{R}^{M^2 \times d}$ are the query, key, and value matrices; d is the query/key dimension; and $M^2$ is the number of patches in a window.

To learn the hash features of an image, we build a hash feature-extraction network on top of the Swin Transformer model by appending a hash layer after Stage 4, which maps the output feature vector to hash codes of different bit lengths. For a 224×224 query image $x_i$, the backbone extracts features and the hash mapping yields the hash feature $h_i$:

$h_i = f(x_i, \theta)$    (3)

where f denotes the backbone function and θ the parameters of the Swin Transformer backbone.

1.2 Loss function

To learn better hash codes, we design a loss function L = L1 + λL2, where L1 is the greedy loss [37], L2 the asymmetric pairwise loss [38], and λ a hyperparameter. The greedy loss L1 alleviates vanishing gradients during optimization; the asymmetric pairwise loss L2 lets training make full use of the label information of the dataset and trains the network effectively. The greedy loss L1 is given by Eq. (4), where B denotes the set of image hash codes and H the set of deep image features:

$L_1 = \text{loss}(B) + \alpha \| H - \text{sign}(H) \|_p^p$    (4)

To obtain an image's hash code, a sign function placed after the hash layer usually binarizes the deep features. Since sign is non-differentiable, the optimization becomes NP-hard. Traditional methods relax with tanh or sigmoid; the network can then be trained, but suboptimal solutions result. To solve this problem better, our greedy loss L1 applies a greedy algorithm to the discrete optimization: the discrete point closest to the continuous optimum is taken as the desired discrete optimum. The procedure is shown in Table 1.

Table 1 The principle of the greedy algorithm

Input: training set X and neural network f(θ), where θ denotes the network parameters; H denotes the feature representations of the images and B their hash codes.
- H = f(X, θ).
- B = sign(H). [forward pass]
- Compute the greedy loss L1 = loss(B) + α‖H − sign(H)‖_p^p, where loss may be any loss; here it is the cross-entropy loss.
- Compute ∂L1/∂B = ∂loss/∂B.
- Set ∂loss/∂H = ∂loss/∂B. [backward pass]
- Compute ∂L1/∂H = ∂loss/∂H + α ∂‖H − sign(H)‖_p^p/∂H = ∂loss/∂B + αp|H − sign(H)|^{p−1}.
- Compute ∂L1/∂θ = (∂L1/∂H)(∂H/∂θ).
- Update the network parameters.

The asymmetric pairwise loss is given by Eq. (5). During training, an asymmetric strategy is used, which both makes full use of the supervision information in the dataset and trains the network efficiently. "Asymmetric" means that query images and database images are treated differently: a query image passes through the backbone for deep feature extraction and its hash code is generated by the deep hash function, whereas the hash codes of database images are learned directly.

$\min_{\theta, V} L_2(\theta, V) = \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ \tanh(f(x_i, \theta))^{\mathrm{T}} v_j - c\,S_{ij} \right]^2$    (5)

where $S_{ij}$ is the similarity matrix and V denotes the database images. For details of the loss function, see [37] and [38].

2 Experiments

2.1 Datasets

Experiments are conducted on CIFAR-10 [39] and NUS-WIDE [37], two commonly used image-retrieval datasets. CIFAR-10 is a single-label dataset with 10 classes, 6,000 samples per class, and 60,000 color images in total, each of size 32×32. Two CIFAR-10 images are regarded as a similar pair if they share the same label. NUS-WIDE is a multi-label dataset with 81 classes and 269,648 images in total; we select 195,834 images belonging to the 21 most frequent classes for the experiments. Two NUS-WIDE images are defined as a similar pair if they share at least one common label.

2.2 Experimental settings

All experiments are based on the PyTorch 1.7 deep-learning framework and run on two GeForce RTX 2080 Ti GPUs. For data processing, all images are first resized to 256×256. Training images then use standard augmentation, including random horizontal flips and random crops of size 224; test images use only a center crop of size 224. For the parameter settings, the batch size is 128, the Adam optimizer is used, the learning rate is 0.0001, the weight decay is 0.00001, and the number of training epochs is 150. For dataset splits, CIFAR-10 follows the TransHash [36] setting: 500 images per class (5,000 in total) are randomly sampled as the training set; 100 per class (1,000 in total) as the query set; and the remaining 59,000 images form the database. Owing to limited experimental resources, for NUS-WIDE all settings except the training and query sets match TransHash: 2,000 images are randomly sampled from the dataset as the training set, 1,000 as the query set, and all remaining images form the database. For evaluation, we use the most common image-retrieval metrics: mAP (mean average precision) and P-R (precision-recall).

2.3 Results and analysis

Experiments on CIFAR-10 and NUS-WIDE compare our method with 13 hashing methods: 4 traditional hash-learning methods, SH [8], ITQ [39], KSH [40], and BRE [41]; and 9 deep hash-learning methods, DSH [16], DHN [23], HashNet [28], DCH [13], IDHN [24], DPN [14], TransHash [36], VTS16-CSQ [33], and DPLAH [42]. Retrieval performance is shown in Table 2, where bold indicates the best value. Except for VTS16-CSQ, the results of the other 12 methods are taken from TransHash [36] and DPLAH [42]. TransHash is the first method to apply Transformer to deep supervised hashing image retrieval, and VTS16-CSQ has the highest retrieval accuracy among Transformer-based hashing methods. Although TransHash first applied Transformer to deep supervised hashing, it still lags far behind the most advanced CNN-based method [43]. Owing to limited experimental resources, on NUS-WIDE we train with only one fifth of TransHash's training data, while on CIFAR-10 the setting matches TransHash. Despite this considerable disadvantage in training-set size, the results show that our method still performs best at all bit lengths on both datasets. On CIFAR-10, mAP reaches up to 98.4%, on average 7.1% higher than TransHash and 0.57% higher than VTS16-CSQ; on NUS-WIDE, mAP reaches up to 93.6%, on average 3.8% higher than the most advanced CNN-based hashing method [43], and 18.6% and 8.6% higher than TransHash and VTS16-CSQ, respectively.

Table 2 Comparison of retrieval accuracy on CIFAR-10 and NUS-WIDE

| Method | CIFAR-10 (mAP@54000) 16 bits | 32 bits | 48 bits | 64 bits | NUS-WIDE (mAP@5000) 16 bits | 32 bits | 48 bits | 64 bits |
|---|---|---|---|---|---|---|---|---|
| SH [8] | — | — | — | — | 0.4058 | 0.4209 | 0.4211 | 0.4104 |
| ITQ [39] | — | — | — | — | 0.5086 | 0.5425 | 0.5580 | 0.5611 |
| KSH [40] | — | — | — | — | 0.3561 | 0.3327 | 0.3124 | 0.3368 |
| BRE [41] | — | — | — | — | 0.5027 | 0.5290 | 0.5475 | 0.5546 |
| DSH [16] | 0.6145 | 0.6815 | 0.6328 | 0.6910 | 0.6338 | 0.6507 | 0.6664 | 0.6856 |
| DHN [23] | 0.6544 | 0.6711 | 0.6921 | 0.6737 | 0.6471 | 0.6725 | 0.6981 | 0.7027 |
| HashNet [28] | 0.5105 | 0.6278 | 0.6631 | 0.6826 | 0.6821 | 0.6953 | 0.7193 | 0.7341 |
| DCH [13] | 0.6680 | 0.6936 | 0.6807 | 0.6775 | 0.7036 | 0.7178 | 0.7106 | 0.7056 |
| IDHN [24] | 0.5419 | 0.5695 | 0.5895 | 0.5972 | 0.6999 | 0.7149 | 0.7225 | 0.7256 |
| DPN [14] | 0.8250 | 0.8380 | 0.8300 | 0.8290 | — | — | — | — |
| DPLAH [42] | 0.9380 | — | 0.9600 | 0.9580 | 0.8850 | — | 0.9180 | 0.9230 |
| TransHash [36] | 0.9075 | 0.9108 | 0.9141 | 0.9166 | 0.7263 | 0.7393 | 0.7532 | 0.7488 |
| VTS16-CSQ [33] | 0.9790 | 0.9790 | — | 0.9760 | 0.8190 | 0.8460 | — | 0.8530 |
| Ours | **0.9840** | **0.9840** | **0.9830** | **0.9820** | **0.9130** | **0.9270** | **0.9360** | **0.9360** |

Traditional non-deep hashing methods use hand-crafted features and thus cannot achieve good retrieval performance; deep supervised hashing methods obtain more discriminative hash features through machine learning and reach higher retrieval results. Our method, however, outperforms existing deep supervised hashing methods: on NUS-WIDE, with a training set less than one fifth of that of VTS16-CSQ, it attains 93.6% retrieval accuracy, 8.6% above the currently most advanced method in this field (VTS16-CSQ) and 20.19% above the CNN-based method IDHN.

The P-R curves on CIFAR-10 and NUS-WIDE are shown in Fig. 4 and Fig. 5. The larger the area enclosed by a P-R curve and the coordinate axes, the better the method. The figures show that our P-R curve lies above those of the other methods throughout, again demonstrating the superiority of our retrieval performance.

Fig. 4 P-R curves of CIFAR-10
Fig. 5 P-R curves of NUS-WIDE

Our method achieves these excellent retrieval results for the following reasons. First, the Swin-Transformer-based hash feature backbone not only captures local image information but also relates the local regions to one another, enlarging the receptive field to capture long-range dependencies, and thus yields hash features that better preserve low-level image information. Second, the designed greedy asymmetric loss better solves the discrete optimization problem through the greedy principle, and trains the network asymmetrically, making full use of the supervision information in the dataset.

2.4 Ablation study

We design detailed ablation experiments to verify the effectiveness of each part of the proposed method. Six hashing methods (DSH [16], HashNet [28], GreedyHash [29], IDHN [24], CSQ [44], DPN [14]) are evaluated on five hashing backbones (AlexNet [18], ResNet50 [14], VTS32 [33], VTS16 [33], and the backbone built in this paper). Retrieval results at 16 and 64 bits on CIFAR-10 and NUS-WIDE are shown in Table 3 and Table 4, where bold indicates the best value of the same loss function across different backbones.

To verify the effectiveness of the proposed backbone, the same loss function is used on the five different backbones (the columns of Tables 3 and 4). The results show that each loss function performs better on our backbone than on the others, verifying its effectiveness. As Table 3 shows, on CIFAR-10 at 16 and 64 bits, mAP improves on average by 20.3% and 15.6% over AlexNet, and by 19% and 14.6% over ResNet50. As Table 4 shows, on NUS-WIDE at 16 and 64 bits, mAP improves on average by 7.81% and 4.8% over AlexNet, and by 6.3% and 5.55% over ResNet50.

Table 3 Comparison of retrieval accuracy on CIFAR-10 (mAP, 16/64 bits)

| Backbone | DSH [16] | HashNet [28] | GreedyHash [29] | IDHN [24] | CSQ [44] | DPN [14] | Our loss |
|---|---|---|---|---|---|---|---|
| AlexNet [18] | 0.735 / 0.785 | 0.651 / 0.798 | 0.761 / 0.816 | 0.759 / 0.767 | 0.759 / 0.783 | 0.735 / 0.779 | 0.937 / 0.947 |
| ResNet50 [14] | 0.691 / 0.524 | 0.555 / 0.880 | 0.815 / 0.863 | 0.756 / 0.865 | 0.836 / 0.837 | 0.820 / 0.820 | 0.963 / 0.962 |
| VTS32 [33] | 0.930 / 0.953 | 0.962 / 0.956 | 0.948 / 0.943 | 0.959 / 0.960 | 0.960 / 0.956 | 0.951 / 0.950 | — |
| VTS16 [33] | **0.980** / **0.974** | **0.978** / **0.983** | 0.970 / **0.980** | **0.976** / **0.984** | **0.979** / **0.976** | **0.974** / **0.975** | — |
| Our backbone | 0.949 / 0.952 | 0.919 / 0.930 | **0.971** / **0.980** | 0.934 / 0.949 | 0.920 / 0.924 | 0.922 / 0.932 | **0.983** / **0.982** |

Table 4 Comparison of retrieval accuracy on NUS-WIDE (mAP, 16/64 bits)

| Backbone | DSH [16] | HashNet [28] | GreedyHash [29] | IDHN [24] | CSQ [44] | DPN [14] | Our loss |
|---|---|---|---|---|---|---|---|
| AlexNet [18] | 0.775 / 0.808 | 0.752 / 0.845 | 0.748 / 0.800 | 0.791 / 0.813 | 0.780 / 0.834 | 0.750 / 0.835 | 0.882 / 0.925 |
| ResNet50 [14] | 0.771 / 0.791 | 0.768 / 0.839 | 0.756 / 0.788 | 0.809 / 0.794 | 0.793 / 0.838 | 0.789 / 0.840 | 0.908 / **0.936** |
| VTS32 [33] | 0.796 / 0.829 | 0.776 / 0.862 | 0.758 / 0.798 | 0.833 / 0.841 | 0.824 / 0.861 | 0.802 / 0.861 | — |
| VTS16 [33] | 0.798 / 0.829 | 0.792 / 0.873 | 0.762 / 0.786 | **0.841** / **0.851** | 0.819 / 0.853 | 0.797 / 0.846 | — |
| Our backbone | **0.804** / **0.831** | **0.846** / **0.874** | **0.908** / **0.932** | **0.841** / 0.842 | **0.845** / **0.875** | **0.821** / **0.869** | **0.913** / **0.936** |

Compared with the VTS backbones, our backbone shows excellent retrieval performance on NUS-WIDE, reaching the current state of the art in retrieval. On CIFAR-10, however, our method is second to the VTS-based hashing methods, for the following reasons. First, our hash backbone is built on a Swin Transformer model pretrained on ImageNet, and since CIFAR-10 images are similar in character to some ImageNet images, retraining on CIFAR-10 leads to overfitting. Second, the backbone of the VTS-based hashing methods is built on Vision Transformer, which differs greatly from the Swin Transformer used here in parameter count and computational cost.

To further verify the effectiveness of the proposed loss, the same backbone is used under seven different loss functions (the rows of Tables 3 and 4). As Table 3 shows, AlexNet achieves its second-best value under the GreedyHash loss and its best under the proposed greedy asymmetric loss; ResNet50 achieves second-best values under the HashNet and CSQ losses and its best under the proposed loss; and our backbone achieves its second-best value under the GreedyHash loss and its best under the proposed loss. The same pattern holds in Table 4, further verifying the effectiveness of the proposed greedy asymmetric loss.

2.5 Feature visualization

This section visualizes the feature maps produced by the four stages of the hash feature-extraction network. As shown in Fig. 6, visualizing the per-stage feature maps of two randomly selected images reveals that, early in feature extraction, the backbone tends to extract texture-level features of the image, whereas in the final stage it tends to extract semantic features, with the feature information focused more on the targets in the image. This further shows that Swin Transformer extracts deep image features well.

Fig. 6 Visualization of feature maps ((a) input; (b) Stage 1; (c) Stage 2; (d) Stage 3; (e) Stage 4)

3 Conclusion

Based on a pretrained Swin Transformer model, this work proposes a vision-Transformer-based hash feature-extraction network, together with a greedy asymmetric loss trained end to end to guide the learning of image hash codes. We note that, across different retrieval frameworks, the proposed backbone outperforms the AlexNet and ResNet backbones on two common datasets. Moreover, the model outperforms all current Transformer-based hashing methods, and its retrieval performance reaches the current state of the art in the field.

References
[1] FU C, XIANG C, WANG C X, et al. Fast approximate nearest neighbor search with the navigating spreading-out graph [J]. Proceedings of the VLDB Endowment, 2019, 12(5): 461-474.
[2] GE T Z, HE K M, KE Q F, et al. Optimized product quantization for approximate nearest neighbor search [C]// 2013 IEEE Conference on Computer Vision and Pattern Recognition. Portland, OR, USA: IEEE, 2013: 2946-2953.
[3] JÉGOU H, DOUZE M, SCHMID C. Product quantization for nearest neighbor search [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011, 33(1): 117-128.
[4] MALKOV Y A, YASHUNIN D A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(4): 824-836.
[5] LI Y, MIAO Z A, HE M, et al. Deep attention residual hashing [J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2018, E101-A(3): 654-657.
[6] CHARIKAR M S. Similarity estimation techniques from rounding algorithms [C]// Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing.
Montreal,Quebec,Canada,New York:ACM,2002:380-388.[7]INDYK P,MOTWANI R,RAGHAVAN P,et al.Locality-preserving hashing in multidimensional spaces[C]//Proceedings ofthe Twenty-ninth Annual ACM Symposium on Theory ofComputing.El Paso,Texas,USA,New York:ACM,1997:618-625.[8]WEISS Y,TORRALBA A,FERGUS R. Spectral hashing [C]// Proceedings of 22th Annual Conference on Neural InformationProcessing Systems. Red Hook: Curran Associates, 2008: 1753-1760.[9]OLIVA A,TORRALBA A.Modeling the shape of the scene:a holistic representation of the spatial envelope[J].InternationalJournal of Computer Vision,2001,42(3):145-175.[10]LIONG V E,LU J W,WANG G,et al.Deep hashing for compact binary codes learning[C]//2015 IEEE Conference on ComputerVision and Pattern Recognition (CVPR).Boston,MA:IEEE,2015:2475-2483.[11]XIA H K, PAN Y, LAI H J, et al. Supervised hashing for image retrieval via image representation learning [C]// Proceedings ofthe 28th AAAI Conference on Artificial Intelligence. Menlo Park:AAAI, 2014: 2156-2162.[12]CAKIR F,HE K,BARGAL S A,et al.Hashing with mutualinformation[J].IEEE Transactions on Pattern Analysis andMachine Intelligence,2019,41(10):2424-2437.[13]CAO Y,LONG M S,LIU B,et al.Deep cauchy hashing for hamming space retrieval[C]//2018 IEEE/CVF Conference onComputer Vision and Pattern Recognition.Salt Lake City,UT,USA:IEEE,2018:1229-1237.[14]FAN L X,NG K W,JU C,et al.Deep polarized network for supervised learning of accurate binary hashing codes[C]//Proceedings of the Twenty-Ninth International Joint Conferenceon Artificial Intelligence.Yokohama,Japan,California:International Joint Conferences on Artificial IntelligenceOrganization,2020:825-831.[15]HE K,CAKIR F,BARGAL S A,et al.Hashing as tie-aware learning to rank[C]//2018 IEEE/CVF Conference on ComputerVision and Pattern Recognition. 
Salt Lake City,UT,USA:IEEE,2018:4023-4032.[16]LIU H M,WANG R P,SHAN S G,et al.Deep supervised hashing for fast image retrieval[C]//2016 IEEE Conference onComputer Vision and Pattern Recognition (CVPR). Las Vegas,NV,USA:IEEE,2016:2064-2072.[17]LECUN Y,BOTTOU L,BENGIO Y,et al.Gradient-based learning applied to document recognition[J].Proceedings of theIEEE,1998,86(11):2278-2324.[18]KRIZHEVSKY A,SUTSKEVERU I,HINTON G E. ImageNet classification with deep convolutional neural networks [C]//Proceedings of Advances in Neural Information Processing System.Red Hook: Curran Associates, 2012: 1106-1114.[19]SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[J].eprint arXiv:1409.1556[20]SZEGEDY C,LIU W,JIA Y Q,et al.Going deeper with convolutions[C]//2015 IEEE Conference on Computer Vision andPattern Recognition (CVPR).Boston,MA,USA:IEEE,2015:1-9.[21]HE K M,ZHANG X Y,REN S Q,et al.Deep residual learning for image recognition[C]//2016 IEEE Conference on ComputerVision and Pattern Recognition (CVPR).Las Vegas,NV,USA:IEEE,2016:770-778.[22]TAN M X,LE Q V.EfficientNet:rethinking model scaling for convolutional neural networks[EB/OL].2019:arXiv:1905.11946.https://arxiv.org/abs/1905.11946.[23]ZHU H,LONG M S,WANG J M,et al.Deep hashing network for efficient similarity retrieval[C]// Proceedings of the ThirtiethAAAI Conference on Artificial Intelligence. Menlo Park:AAAI,2016: 2415-2421.[24]ZHANG Z,ZOU Q,LIN Y W,et al.Improved deep hashing with soft pairwise similarity for multi-label image retrieval[J].IEEETransactions on Multimedia,2020,22(2):540-553.[25]LI Y,MIAO Z,WANG J B,et al.Deep discriminative supervised hashing via Siamese network[J]. IEICE Transactions onInformation and Systems,2017,E100,D(12):3036-3040.[26]CAO Y,LIU B,LONG M S,et al.HashGAN:deep learning to70。
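上文实验均以 mAP (mean Average Precision) 作为主要评价指标。作为示意,下面给出一个纯 Python 的 mAP@k 计算草图(其中哈希码、标签均为假设的示例数据,相似性按"标签相同即相似"的单标签规则判定,并非原文实现):

```python
def hamming(a, b):
    # 汉明距离:两条二值哈希码中取值不同的位数
    return sum(x != y for x, y in zip(a, b))

def average_precision(query_code, query_label, db_codes, db_labels, k):
    # 按与查询码的汉明距离升序排列数据库,取前 k 个计算 AP
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))[:k]
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if db_labels[i] == query_label:   # 单标签:标签相同视为相似对
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(query_codes, query_labels, db_codes, db_labels, k):
    # 对所有查询的 AP 取平均,即 mAP@k
    aps = [average_precision(q, l, db_codes, db_labels, k)
           for q, l in zip(query_codes, query_labels)]
    return sum(aps) / len(aps)
```

对 NUS-WIDE 这类多标签数据集,只需把相似性判定改为"至少共享一个公共标签"。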
代号: 10701　分类号: TP37　学号: 1102121253　密级: 公开
题(中、英文)目: 基于GPU/多核CPU平台下并行计算的实时超分辨和立体视图生成
Real-time Super-resolution and Stereoscopic View Generation with GPU/Multicore CPU Based Parallel Computing
作者姓名: 孙增增
指导教师姓名、职务: 郑喆坤 教授
学科门类: 工学
学科、专业: 模式识别与智能系统
提交论文日期: 二〇一四年三月
西安电子科技大学学位论文独创性(或创新性)声明
秉承学校严谨的学风和优良的科学道德,本人声明所呈交的论文是我个人在导师指导下进行的研究工作及取得的研究成果。
尽我所知,除了文中特别加以标注和致谢中所罗列的内容以外,论文中不包含其他人已经发表或撰写过的研究成果;也不包含为获得西安电子科技大学或其它教育机构的学位或证书而使用过的材料。
与我一同工作的同志对本研究所做的任何贡献均已在论文中做了明确的说明并表示了谢意。
申请学位论文与资料若有不实之处,本人承担一切的法律责任。
本人签名:　　　　日期:
西安电子科技大学关于论文使用授权的说明
本人完全了解西安电子科技大学有关保留和使用学位论文的规定,即:研究生在校攻读学位期间论文工作的知识产权单位属西安电子科技大学。
学校有权保留送交论文的复印件,允许查阅和借阅论文;学校可以公布论文的全部或部分内容,可以允许采用影印、缩印或其它复制手段保存论文。
同时本人保证,毕业后结合学位论文研究课题再撰写的文章一律署名单位为西安电子科技大学。
(保密的论文在解密后遵守此规定)
本人授权西安电子科技大学图书馆保存学位论文,本学位论文属于(保密级别),在　年解密后适用本授权书,并同意将论文在互联网上发布。
本人签名:日期:导师签名:日期:摘要近些年来,许多因素导致了计算产业转向了并行化发展的方向。
在这过程中,受市场对实时、高清晰3维图形绘制的需求驱使,可编程的图形处理单元(GPU)逐渐发展进化成为了具有强大计算能力、非常高内存带宽的高度并行、多线程的众核处理器。
前言:最近由于工作的关系,接触到了很多篇以前都没有听说过的经典文章,在感叹这些文章伟大的同时,也顿感自己视野的狭小。
想在网上找找计算机视觉界的经典文章汇总,一直没有找到。
失望之余,我决定自己总结一篇,希望对 CV领域的童鞋们有所帮助。
由于自己的视野比较狭窄,肯定也有很多疏漏,权当抛砖引玉了。
1990年之前
1990年
1991年
1992年
1993年
1994年
1995年
1996年
1997年
1998年
1998年是图像处理和计算机视觉经典文章井喷的一年。
大概从这一年开始,出现了新的趋势。
由于竞争的加剧,一些好的算法都先发在会议上了,先占个坑,等过一两年之后再扩展到期刊上。
1999年
2000年
世纪之交,各种综述都出来了。
2001年
2002年
2003年
2004年
2005年
2006年
2007年
2008年
2009年
2010年
2011年
2012年
hierarchical partitioning analysis 层次分区分析(Hierarchical Partitioning Analysis)是一种用于解决多因素影响下的变量重要性排名问题的方法。
它是通过对变量的影响因素进行层次分解,从而确定每个因素对变量的影响程度。
本文将对层次分区分析的原理、应用以及优缺点进行介绍。
层次分区分析的原理:
层次分区分析是一种基于因果关系的分析方法。
它将变量的影响因素分为不同的层次,每个层次的因素对变量的影响程度不同。
通过对这些因素的层次分解,可以确定每个因素对变量的影响程度,从而实现变量重要性排名。
层次分区分析的应用:
层次分区分析广泛应用于环境学、生态学、生物学等领域。
例如,它可以用于确定某个生物物种的适宜栖息地,分析不同因素对生物物种的影响程度。
同时,它也可以用于评估不同环境因素对生物多样性的影响,以及确定保护生物多样性的最佳方案。
层次分区分析的优缺点:
层次分区分析的优点是能够较好地解决变量重要性排名问题,并能处理复杂的多因素影响。
同时,它可以提供关于每个因素的相对重要性,有助于制定有效的管理和保护措施。
然而,层次分区分析也存在一些缺点。
例如,它需要大量的数据和时间来进行计算,同时也需要对数据进行较为复杂的处理。
此外,它仅仅提供了相对重要性,而无法提供变量的具体影响程度。
综上所述,层次分区分析是一种用于解决多因素影响下的变量重要性排名问题的方法。
它在环境学、生态学、生物学等领域得到了广泛的应用。
虽然它存在一些缺点,但其优点远远超过了缺点,因此它仍然是一种非常有用的分析方法。
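层次分区分析的核心计算,是对每个因素在所有可能进入顺序下带来的拟合优度增量取平均,作为其"独立贡献"。下面给出一个纯 Python 草图示意这一分解过程(各变量子集的 R² 取值为假设示例,实际应用中需由回归模型对每个子集拟合得到):

```python
from itertools import permutations
from math import factorial

def independent_contributions(variables, r2):
    # r2: 字典,键为变量子集(frozenset),值为该子集联合解释的拟合优度 R^2
    # 对每种进入顺序计算 R^2 增量并取平均,得到各变量的独立贡献
    contrib = {v: 0.0 for v in variables}
    for order in permutations(variables):
        seen = frozenset()
        for v in order:
            contrib[v] += r2[seen | {v}] - r2[seen]
            seen = seen | {v}
    n_perm = factorial(len(variables))
    return {v: c / n_perm for v, c in contrib.items()}
```

各变量独立贡献之和恰好等于全模型的 R²,因此该分解给出的是各因素的相对重要性排名,与正文所述一致。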
第33卷 第2期 计算机辅助设计与图形学学报Vol.33 No.2 2021年2月Journal of Computer-Aided Design & Computer GraphicsFeb. 2021收稿日期: 2020-05-22; 修回日期: 2020-11-13. 基金项目: 国家自然科学基金(61573168). 陈珺莹(1996—), 女, 硕士研究生, 主要研究方向为计算机视觉; 陈莹(1976—), 女, 博士, 教授, 博士生导师, CCF 会员, 论文通讯作者, 主要研究方向为计算机视觉、信息融合.基于显著增强分层双线性池化网络的细粒度图像分类陈珺莹, 陈莹*(江南大学轻工过程先进控制教育部重点实验室 无锡 214122) (*********************.cn)摘 要: 分层双线性池化网络考虑了中间卷积层的特征交互, 对细粒度图像起到了良好的分类效果, 但它对一幅图像包括无关背景在内的所有区域激活都进行了特征交互, 会影响分类性能. 针对该问题, 提出一种显著增强的分层双线性池化方法. 该方法在分层双线性池化网络的基础上, 结合显著性检测网络生成注意力图, 使用注意力图与特征提取网络进行交互实现对显著区域的信息增强, 减少了背景等无关信息的影响, 提高了分类性能. 在3个常用的细粒度图像数据集CUB-200-2011, Stanford Cars 和FGVC-Aircraft 上均进行了实验, 分类准确率分别为86.5%, 92.9%和90.8%, 与当前其他主流方法相比, 取得了良好的分类效果.关键词: 细粒度图像分类; 显著性检测; 区域信息增强; 分层双线性池化 中图法分类号: TP391.41 DOI: 10.3724/SP.J.1089.2021.18399Saliency Enhanced Hierarchical Bilinear Pooling for Fine-Grained ClassificationChen Junying and Chen Ying *(Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), Jiangnan University, Wuxi 214122)Abstract: Hierarchical bilinear pooling network considers the feature interaction in middle convolutional lay-ers and works well in the classification of fine-grained images. However, it carries out feature interaction on all activated image regions including irrelevant background, which affects the classification performance. To ad-dress this problem, a saliency enhanced hierarchical bilinear pooling method is proposed, which combines with the saliency detection network to generate an attention map, and uses the attention map to interact with the feature extraction network to enhance the information of the salient regions. As the result, it can reduce the impact of background and other irrelevant information, and improve the classification performance. 
The classi-fication accuracy on three commonly used fine-grained image datasets CUB-200-2011, Stanford Cars and FGVC-Aircraft is 86.5%, 92.9% and 90.8%, respectively, which is excellent compared with other mainstream methods.Key words: fine-grained classification; saliency detection; regional information enhancement; hierarchical bi-linear pooling细粒度图像分类是近年来计算机视觉领域的一个研究热点, 其目的是对粗粒度的大类进行更加精细的子类划分. 随着深度学习的发展, 许多计算机视觉任务取得了不错的进展, 但是细粒度图像分类仍然面临较大的困难. 其难点有2点: (1) 同一个子类中的物体之间存在个体的差异, 并且还受到遮挡、尺度、姿态等的影响, 造成图像中的外观差异可能会很大; (2) 同一个大类下不同的子242 计算机辅助设计与图形学学报第33卷类物体可能具有相似的外观和组成结构, 只存在细微的差异, 且细节特征不易捕捉.针对细粒度图像分类的难点, 往往需要定位到图像中起区分作用的微小区域. 很多方法使用额外人工标注来对这些微小区域做信息增强, 即基于强监督信息的分类方法. 然而, 由于人工标注信息的获取代价十分昂贵, 并且人工标定的特定的标注框或标注点并非最适合模型分类的区域, 因此基于弱监督信息的方法逐渐成为主流. Liu等[1]提出过滤蒸馏学习注意力模型, 用于增强细粒度视觉分类中的区域注意力. Zheng等[2]提出三线注意力采样网络, 利用三线注意力模块模型化通道间的关系产生注意力图, 基于注意力的采样器将关注区域通过高分辨率显示. Zhang等[3]提出一种基于空间显著性提取的细粒度图像分类方法, 由图像的显著性信息获得物体的位置, 将物体裁剪出来, 忽略掉背景的影响, 提高了分类性能. 除此之外, 对高阶特征编码的方式也取得了良好的分类效果. Lin等[4]提出双线性卷积神经网络(bilinear convolutional neural networks, BCNN), 通过对最后一个卷积层的输出特征进行外积操作, 以平移不变的方式, 对局部对级特征交互进行建模, 取得了优秀的性能. Yu等[5]在BCNN的基础上, 提出分层双线性池化(hierarchical bilinear pooling, HBP), HBP网络对中间卷积层的特征也进行了交互, 提高了分类性能.虽然HBP有效地提高了分类准确率, 但是它对一幅图像的所有区域都进行了特征交互, 包括了一些无关的背景或不相干的物体区域, 这些区域会破坏特征交互, 从而影响分类性能. 所以, 需要对这些无关的背景区域进行过滤, 增强对感兴趣区域的关注.针对上述问题, 本文提出了显著增强分层双线性池化(saliency enhanced HBP, SE-HBP)网络. 显著性检测可以获取一幅图像中的感兴趣区域, 与文献[3]不同, 本文并不对显著区域进行裁剪, 而是通过显著性检测得到一幅图像的显著性特征, 在此基础上得到注意力图; 再将注意力图与特征提取网络中的特征进行乘积融合实现显著区域的信息增强, 然后进行后续的特征提取及增强特征HBP操作. 本文分别在3个常用的细粒度图像数据集CUB-200-2011[6], Stanford Cars[7]和FGVC-Aircraft[8]上进行了实验, 实验结果表明, 本文方法取得了较好的分类性能, 证明了显著性增强的有效性. 1 本文方法本文基于显著性增强改进HBP, 具体而言, 先利用显著性检测网络得到一幅图像的显著性特征, 再由显著性特征生成注意力图, 通过注意力图实现显著区域的信息增强, 以此增强对显著区域即判别区域的关注, 从而减少对背景等无关信息的关注. 
下面分别介绍HBP的问题所在以及SE-HBP.1.1HBP及其问题分析BCNN[4]实现了端到端的细粒度图像分类; Yu 等[5]指出, BCNN只考虑了最后一个卷积层的特征交互, 忽视了卷积层之间的特征交互与细粒度特征的学习是可以相互加强的, 所以提出HBP来捕获层间的特征交互, 并集成了多个分层双线性特征, 提高了特征的表示能力.本文使用ResNet-34构建HBP网络, 对ResNet-34最后一组(即第4组)块(block)的3个特征进行HBP操作, 这3个特征分别表示为41-X, 42-X和43-X, 如图1所示选取2幅图像, 对这3个特征激活进行了可视化. 图1中, 第1列是从数据集中选取的图像, 其他3列是ResNet-34最后一组block的3个特征激活的热力图, 红色区域表示网络更加关注的区域, HBP模型对这2幅图像分类错误, 最后一列是错误类别中的示例图像.图1 最后一组block的3个特征激活的热力图及错误分类的类别图1a选取的是CUB-200-2011数据集中的一幅类别为鱼鸭的鸟类图像, 被错误分类成宽尾拟八哥. 从特征激活的热力图可以看出, 网络只关注到了鱼鸭的头部和飞羽部位区域, 这些区域与宽尾拟八哥没有明显的不同; 鱼鸭的背部通常是纯黑色的, 而宽尾拟八哥的背部呈现亮蓝色或亮绿色, 但是HBP并没有关注到这部分区域, 导致分类错误. 图1b选取的是Stanford Cars数据集中的一幅车辆图像, 是阿斯顿马丁敞篷车的车身尾部, 被错误分类为宝马敞篷车. 从阿斯顿马丁敞篷车第2期陈珺莹, 等: 基于显著增强分层双线性池化网络的细粒度图像分类 243的尾部来看, 其与宝马敞篷车的最大不同在于车尾灯区域, 而HBP 没有关注到这部分区域, 反而关注到了无关的背景(车旁的人), 导致分类错误.所以, 对于HBP 网络, 一些感兴趣区域(即判别区域)没有得到关注, 而一些无关的背景或不相关的区域在特征激活中有较高的响应, HBP 网络对这些区域进行了特征交互, 影响了分类性能. 需要考虑如何减少无关背景的影响、如何更好地关注感兴趣区域, 从而实现更好的特征交互. 显著性检测网络可以获取到一幅图像中的感兴趣区域, 针对上述问题, 本文结合显著性检测网络来增强对感兴趣区域即判别区域的关注, 减少背景等无关信息的影响.1.2 SE-HBP为了更好地关注感兴趣区域, 文献[9-11]均通过物体标注框或部位标注点等额外人工标注对物体整体或局部区域进行定位, 实现区域信息的增强; 但是这些方法依赖的人工标注成本较高, 在实际应用中有局限性. 由于显著性检测可以模拟人的视觉注意力机制, 提取图像中的感兴趣区域, 本文结合显著性检测网络实现感兴趣区域的信息增强. 即通过显著性检测网络得到的显著性特征, 以注意力图的形式来增强感兴趣区域的信息, 同时减少背景等无关区域的信息.整体的网络结构如图2所示, 分为显著性检测、注意力生成以及增强特征HBP 3个部分. 首先经过显著性检测网络获得显著性特征, 再由此得到注意力图; 该注意力图的取值范围为0~1. 对于显著区域, 即感兴趣区域, 注意力图中相应区域的值接近1; 对于背景等无关区域, 注意力图中相应区域的值很小. 采用ResNet-34构建SE-HBP 网络, 将注意力图与ResNet-34第3组block 最后输出的特征进行乘积融合, 再进行后续的特征提取以及增强特征HBP 操作.图2 本文网络结构1.2.1 显著性检测显著性检测旨在提取图像中的显著区域, 即感兴趣区域. 本文使用的显著性检测网络是循环残差优化网络(recurrent residual refinement net-works, R 3Net)[12], 其网络结构如图2的橙色区域所示, 它包含很多个残差优化块(residual refinement block, RRB). RRB 可以交替利用全卷积网络的低层集成特征和高层集成特征来学习中间显著性预测和真实值之间的残差, 低层集成特征可以捕获更多的显著性细节, 高层集成特征可以减少中间预测的非显著区域.首先, 通过特征提取网络(这里用ResNeXt101)生成一组特征图, 其中包含了不同尺度的低层细节和高级语义信息. 然后, 浅层特征通过集成生成低层集成特征, 深层特征聚合成高层集成特征. 提取特征后, 从高层集成特征生成一个初始的显著图, 然后交替利用低层集成特征和高层集成特征生成RRB, 逐步完善中间显著图. RRB 执行的具体操作为: 将前一个循环得到的显著性预测图交替地与低层或高层集成特征相连接, 经过卷积得到当前循环的残差, 再将残差与前一个循环的显著性预测图相加, 得到当前循环的显著性预测图. 
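RRB"上一轮预测与本轮残差相加"的迭代过程可用如下草图示意(residual_fn 代表由卷积学习得到的残差映射,此处仅以假设的函数占位,并非 R3Net 的真实实现):

```python
def refine(initial_pred, low_feat, high_feat, residual_fn, n_blocks=6):
    # 交替利用低层/高层集成特征,逐步完善显著性预测
    pred = initial_pred
    for t in range(n_blocks):
        feat = low_feat if t % 2 == 0 else high_feat  # 交替选择集成特征
        pred = pred + residual_fn(feat, pred)          # 残差与上一轮预测相加
    return pred
```

正文使用 6 个 RRB,对应此处默认的 n_blocks=6。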
1.2.2 注意力生成在R 3Net 的训练过程中, 高层集成特征首先在244计算机辅助设计与图形学学报 第33卷真实值的监督下生成一个初始的显著图, 然后用一系列的RRB 对其进行完善. 所以, 使用R 3Net [12]模型, 采用其中的高层集成特征作为显著性特征. 对于低层集成特征, 其在训练时不直接接受真实值的监督, 所以低层集成特征包含的显著性信息不如高层集成特征的好. 本文在实验部分比较了分别使用高层和低层集成特征作为显著性特征得到的分类准确率, 验证了高层集成特征比低层集成特征更有效.通过显著性检测网络R 3Net 得到显著性特征111sal h w c ⨯⨯∈X , 其中, 1h 为高, 1w 为宽, 1c 为通道数; 再经过降维以及Sigmoid 函数生成1通道的注意力图111att h w ⨯⨯∈X .1.2.3 增强特征HBPHBP 网络(网络结构如图2中的蓝色区域所示)是在因式分解双线性池化(factorized bilinear pool-ing, FBP)[13]网络基础上改进的. 假设一幅图像经过卷积神经网络进行特征提取后得到的输出特征为h w c ⨯⨯∈X . 其中, h 为高, w 为宽, c 为通道数; 输出特征可以看做由c 个大小为h w ⨯的二维特征组成, 用:,:,k h w ⨯∈ X 表示第k 个通道的特征图; 也可以看做由h w ⨯个c 维描述符组成, 用,,:i j c ∈ X 表示一个特定位置(,)i j 上的描述符; 其中, {}1,,i h ∈ , {}1,,j w ∈ . FBP 模型定义为每个空间位置的低秩外积操作,即T T T FBP ()=οP U x V x . 其中, ,,:i j =x X 为特定位置(,)i j 的c 维描述符; d d '⨯∈P 为分类矩阵; d 为特征向量的维数; d '为图像的类别数;c d ⨯∈U 和c d ⨯∈V 为投影矩阵; 表示对应元素相乘; FBP d'∈ο 为输出向量.与文献[14]的基础网络一致, 本文使用ResNet-34构建SE-HBP 网络. 假设图像经过ResNet-34第3组block 后得到的特征为1123h w c ⨯⨯∈X ; 其中, 1h 为高, 1w 为宽, 2c 为通道数. 因为经过ResNet-34第3组block 后的特征与经过R 3Net 得到的显著性特征仅通道数不同, 特征的高与宽是相同的, 所以不需要对注意力图的高和宽进行处理. 将注意力图与ResNet-34第3组block 得到的特征相乘, 得到33att '= X X X . 对3'X 进行后续的特征提取, 经过ResNet-34最后一组block 得到的3个特征分别为41-∈X h w c ⨯⨯ ,42h w c ⨯⨯-∈X 和43h w c ⨯⨯-∈X , 它们都融入了显著性特征, 实现了显著增强. 这3个显著增强后的卷积特征通过层叠多个HBP 模块进行合并, 利用了层间的特征交互关系, 增强了特征的描述能力. 对于每个空间位置, SE-HBP 模型的输出为T T T T SE-HBP 414241TTT434243=concat(, ,).------οP U x V x U x S x V x S x其中, ,,:4141=i j --x X , ,,:4242=i j --x X , ,,:4343=i j --x X 分别为3个卷积特征的特征描述符; c d ⨯∈U ,c d ⨯∈V , c d ⨯∈S 分别为3个投影矩阵;3d d '⨯∈P 为分类矩阵, 在实际中分类采用全连接层(fully connected layer, fc)来实现, d '为图像的类别数; concat( )为通道级联操作; SE-HBP d '∈ο 为输出向量.得到输出向量之后, 使用Softmax 函数将分类预测的输出标准化, 得到输入图像归属于各个类别的概率分布; 然后使用交叉熵作为损失函数计算预测分类与真实结果之间的相似度. 在实验部分, 本文还比较了将注意力图与41-X 以及42-X 相乘的分类准确率, 验证当注意力图与第3组block的输出特征3X 相乘时能获得最优的识别效果. 2 实验及结果分析2.1 数据集在3个常用的细粒度图像数据集上进行了实验, 分别是CUB-200-2011, Stanford Cars 和FGVC-Aircraft. CUB-200-2011是加利福尼亚理工学院创建的鸟类数据集, 包括了11 788幅鸟类图像, 共有200个子类; Stanford Cars 数据集包含了196种车型, 共有16 185幅图像; FGVC-Aircraft 数据集包含10 000幅飞机图像, 共100个子类. 
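上文 FBP 的因式分解形式 o = Pᵀ(Uᵀx ∘ Vᵀx),即先对描述符 x 做两路线性投影、再对应元素相乘、最后经分类矩阵输出,可用如下纯 Python 草图示意(矩阵均为假设的小规模示例,并非原文网络参数):

```python
def proj(M, x):
    # 计算 M^T x:M 为 c×d 矩阵(按行存储),x 为 c 维向量,结果为 d 维
    c, d = len(M), len(M[0])
    return [sum(M[i][j] * x[i] for i in range(c)) for j in range(d)]

def fbp(x, U, V, P):
    # o = P^T (U^T x ∘ V^T x):两路投影后对应元素相乘,再经分类矩阵投影
    u = proj(U, x)
    v = proj(V, x)
    h = [a * b for a, b in zip(u, v)]  # 对应元素相乘(低秩外积的因式分解)
    return proj(P, h)
```

SE-HBP 则是把三个显著增强后的卷积特征两两做上述操作后按通道级联。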
每个数据集的详细信息如表1所示.表1 数据集信息数据集类别数训练集/幅测试集/幅CUB-200-2011 200 5 994 5 794 Stanford Cars1968 144 8 041 FGVC-Aircraft 1006 667 3 3332.2 参数设置实验采用了开源Linux 内核的Ubuntu14.04桌面应用系统, PyTorch 深度学习框架, Python 编程语言.在训练R 3Net 时, 使用了在ImageNet 数据集[15]上预训练的ResNeXt101初始化特征提取网络; 然后将ResNeXt101的前3层特征聚合构成低层集成特征, 后2层特征聚合构成高层集成特征, 权衡时第2期陈珺莹, 等: 基于显著增强分层双线性池化网络的细粒度图像分类 245间性能以及检测精度, 使用了6个RRB. 在MSRA10K 数据集[16]上进行训练, 每个循环中的显著性预测都与真实值计算交叉熵损失, 这些交叉熵损失的和构成总的损失函数. 训练过程使用了随机梯度下降(stochastic gradient descent, SGD), 权重衰减(weight decay)为4510-⨯, 动量(momentum)为0.9, 学习率初始化为0.001, 使用了poly 策略[17]调整学习率, 在6 000次迭代后训练完成.与文献[14]一致, 使用在ImageNet 数据集上预训练的ResNet-34构建SE-HBP 网络, 去除ResNet-34最后的全连接层. 训练过程分为2步:Step1. 只训练在原始ResNet-34基础上新增加的层. Step2. 微调整个网络. 显著性检测网络R 3Net 的权重始终固定.对于图进行预处理, 数据集CUB-200-2011, Stanford Cars 和FGVC-Aircraft 的图像像素大小分别固定为600600⨯, 500500⨯以及500480⨯, 将图像裁剪到448448⨯, 在训练过程中使用了随机裁剪, 测试过程中使用的是中心裁剪. 对于投影层的维数, 文献[5]验证了投影到8 192维效果最好, 文献[14]也是投影到8 192维, 所以, 本文将ResNet-34最后一组block 得到的3个特征都由512维投影到8 192维. 使用的优化器为SGD, weight decay 为5110-⨯, momentum 为0.9. 第1步训练中学习率初始化为1, 第2步训练中学习率初始化为0.01, 每隔40个周期学习率下降90%.2.3 显著增强的效果从上述3个细粒度图像数据集中选取一些图像, 比较SE-HBP 以及HBP 网络的特征, 可视化结果如图3所示. 图3前4列含义与图1中前4列含义相同, 最后3列是进行了显著增强的特征激活的热力图.图3a 选取的是CUB-200-2011中的2幅图像, 上下2行分别是丽色凤头燕鸥和鱼鸭的热力图, HBP 网络关注到丽色凤头燕鸥图像中的树干、海水等无关背景, 而显著增强后, 对无关背景的关注减弱. 对于鱼鸭, HBP 网络将其错误分类成了宽尾拟八哥, 是因为HBP 网络没有关注到鱼鸭和宽尾拟八哥起区分作用的背部, 而显著增强后的网络可以关注到背部. 图3b 是本田雅阁轿车和阿斯顿图3 热力图对比246计算机辅助设计与图形学学报 第33卷马丁敞篷车的热力图, HBP 网络关注到了本田雅阁轿车旁的无关背景, 而SE-HBP 可以关注到更多的车身部分且过滤掉了无关背景, HBP 网络将阿斯顿马丁敞篷车误判成宝马敞篷车, 改进后的网络关注到了阿斯顿马丁敞篷车的车尾灯区域, 这是与宝马敞篷车尾部最大的不同, 并且对旁边的行人等无关背景的关注减弱. 图3c 是机型分别为波音737-400和波音767-200的2架飞机, SE-HBP 关注到波音737-400更多的机身部分, 忽略了对天空、地面等无关背景的关注; 对于波音767-200, SE-HBP 也关注到其引擎以及机顶等起区分作用的关键区域. HBP 网络会关注到一些无关的背景区域, 对这些区域进行特征交互会影响分类性能, 且其关注到的局部区域可能不是最具判别性的区域, 而SE-HBP 网络对这几幅图像均可以分类正确, 其可以关注到HBP 网络没有关注到的起区分作用的关键区域, 并且可以减弱对背景或不相关区域的关注, 从而获得更好的分类性能.2.4 分类性能 2.4.1 卷积层选择本文首先通过显著性特征得到注意力图, 然后将注意力图与ResNet-34第3组block 后的特征进行点乘加权增强处理, 经过ResNet-34第3组block 后的特征表示为3X , 再将相乘后的特征送入ResNet-34最后一组block 进行特征提取. 
Res-Net-34最后一组block 共有3个残差块, 经过这3个残差块得到的特征分别表示为41-X , 42-X 和43-X . 本文在CUB-200-2011, Stanford Cars,FGVC-Aircraft 这3个数据集上, 将注意力图与不同卷积层的特征相乘, 对比其分类准确率, 结果如表2所示.表2 与不同卷积层相乘的分类准确率对比 %数据集3X41-X42-XCUB-200-2011 86.5 86.0 85.9 Stanford Cars 92.992.592.2FGVC-Aircraft 90.890.6 90.3注意力图与特征41-X 相乘时, 将相乘后的特征送入ResNet-34最后一组block 的后2个残差块继续进行特征提取, 再将相乘前的特征与后2个残差块的输出特征进行双线性池化操作. 同理, 注意力图与特征42-X 相乘时, 先经过了最后一组block 的前2个残差块得到41-X 和42-X , 再将注意力图与42-X 相乘, 然后相乘后的特征送入最后一个残差块, 最后将相乘前的特征与前后2个残差块的输出进行双线性池化操作. 注意力图与41-X , 42-X 相乘时, 高和宽为41-X , 42-X 的2倍, 所以需要将注意力图的高和宽调整为原来的1/2.由表2可知, 在3个数据集上, 当注意力图与ResNet-34第3组block 最后输出的特征3X 相乘时, 均能得到最高的分类准确率. 这是因为当注意力图与3X 相乘时, 相乘后的特征送入ResNet-34最后一组block, 得到的41-X , 42-X , 43-X 都融入了显著性特征; 而与41-X 相乘时, 只有42-X 和43-X 融入了显著性特征, 与42-X 相乘时, 只有43-X 进行了显著增强. 以下实验都建立在注意力图与3X 相乘的基础上.2.4.2 显著性特征的选择在R 3Net 中, 首先是高层集成特征在真实值的监督下生成一个初始的显著图, 然后RRB 交替地利用高层和低层集成特征对其进行完善. 在CUB-200-2011, Stanford Cars, FGVC-Aircraft 这3个数据集上, 比较了使用高层和低层集成特征分别作为显著性特征生成注意力图的分类准确率, 如表3所示.表3 高低层集成特征作为显著性特征的分类准确率对比 %数据集高层集成特征低层集成特征CUB-200-2011 86.5 86.3 Stanford Cars92.992.5FGVC-Aircraft 90.890.4表3中, 显著性特征生成的注意力图均是与ResNet-34第3组block 最后输出的特征3X 相乘. 高层集成特征生成的注意力图的高和宽与3X 的高和宽一致, 所以不需要作额外的处理; 而低层集成特征生成的注意力图的高和宽是3X 的4倍, 需要对其作下采样处理. 实验结果显示, 在3个数据集上, 高层集成特征作为显著性特征得到的分类准确率均比低层集成特征作为显著性特征高.为进一步验证高层集成特征的分类优势, 利用Python 的Matplotlib 库分别对高层和低层集成特征生成的注意力图进行可视化, 结果如图4所示. 图4中第1列是选自3个数据集的图像, 第2列和第3列分别是高层和低层集成特征生成的注意力图, 显示了不同颜色所对应值.由图4可知, 高层集成特征生成的注意力图能够更好地过滤掉背景等无关信息, 更好地保留了第2期陈珺莹, 等: 基于显著增强分层双线性池化网络的细粒度图像分类 247图4 注意力图比较感兴趣区域, 以下实验的显著性特征均采用高层集成特征.2.4.3 与基线方法比较本文的基线方法是使用ResNet-34重新构建的HBP网络(HBP-RNet)[14], 本文在CUB-200-2011, Stanford Cars和FGVC-Aircraft这3个数据集上都进行了实验, 并比较了SE-HBP网络和基线网络的准确率, 结果如图5所示.由图5可知, SE-HBP在CUB-200-2011, Stan-ford Cars和FGVC-Aircraft这3个数据集上的分类准确率分别为86.5%, 92.9%和90.8%, 基线方法的分类准确率分别为85.8%, 92.2%和90.2%, 可以看出, 本文方法在这3个数据集上的分类准确率都优于基线方法.图5 本文方法与基线方法准确率对比2.4.4 与其他方法比较表4列出了在3个细粒度图像数据集上的分类实验结果, 并与相关方法做了比较. 
本文方法只使用了图像的类别标签, 没有使用额外人工标注.表4本文方法与相关方法的准确率对比% 分类方法额外人工标注 CUB-200-2011StanfordCars FGVC-AircraftSPDA-CNN[18]√ 85.1基于部件且使用额外的人工标注PC-CNN[19]√ 84.1MA-CNN[20]86.5 92.8 89.9RP-CNN[21]84.5 93.0 89.9基于部件只使用图像类别标签UPM[22]85.492.390.0BCNN[4]84.191.384.1CBP[23]84.0LRBP[24]84.290.187.3KP[25]86.292.486.9GP[26]85.892.889.8HBP-RNet[14]85.8 92.2 90.2Dai等[27]85.392.4基于双线性网络及其改进SE-HBP 86.5 92.9 90.8注. 粗体表示精度最高.表4中, 基于部件且使用额外的人工标注的方法, 分别是联合语义部件检测和提取的卷积神经网络(unifying semantic part detection and abstraction convolutional neural networks, SPDA-CNN)[18], 以及基于深度协同卷积网络的细粒度分类(part-colla- boration convolutional neural networks, PC- CNN)[19], 后者在每个手工标注的部件上训练定制的子网络.248 计算机辅助设计与图形学学报第33卷基于部件只使用图像类别标签的方法分别有多注意力卷积神经网络(multi-attention convolutional neural networks, MA-CNN)[20], 其利用通道分组子网络生成多个部件; 随机部件定位模型(random part localization convolutional neural networks, RP-CNN)[21]使用显著图提取前景物体; 部件挖掘(unsupervised part mining, UPM)[22]方法利用模式挖掘算法来获取判别区域.还有一些基于双线性网络以及在其基础上改进的方法, 如紧凑的双线性池化方法(compact bi-linear pooling, CBP)[23]、低秩双线性池化(low-rank bilinear pooling, LRBP)[24]、通用的池化框架(kernel pooling, KP)[25]、格拉斯曼池化(Grassmann pooling, GP)[26], HBP[5]和基于子类别相似性度量的双线性池化[27], 都是在BCNN[4]的基础上改进的.由表4结果可知, 本文方法(SE-HBP)在3个数据集上都取得了较好的分类性能, 都优于基线方法(HBP-RNet). 注意, 本文的基线方法为使用ResNet-34重新构建的HBP网络(HBP-RNet)[14], 因为ResNet-34是一个轻量级的网络, 与构建HBP 网络[5]的VGG-Net[28]相比, ResNet-34的计算效率更高. 由表4可知, 本文方法比双线性网络以及在其基础上改进方法的分类准确率都要高. 由此可见, SE-HBP可以实现对显著区域即感兴趣区域的信息增强, 减少背景等无关信息的影响, 使网络关注到更具判别性的区域, 从而获得表述能力更好的特征; 本文方法比使用了额外人工标注信息的局部信息增强方法的准确率高, 因为人工标注的区域并非是最适合模型分类的区域, 证明了本文使用显著性检测网络可以获取到比人工标注更为合理的感兴趣区域, 从而获得良好的分类性能, 同时避免了使用成本较高的人工标注; 本文方法比大多数不使用额外人工标注的局部信息增强的方法的分类效果好, 证明了本文使用显著增强来获取感兴趣区域的做法可以与一些复杂的部件检测方法达到类似的效果.3结语本文提出了适用于细粒度图像分类的SE-HBP 网络, 使用显著性检测网络得到图像的显著性特征, 并将显著性特征以注意力图的形式与原来的特征相结合, 以此增强显著区域的信息, 减少背景等无关信息的影响, 实现网络的显著性增强. 本文在3个常用的细粒度图像数据集上都进行了实验, 在CUB-200-2011, Stanford Cars和FGVC-Aircraft 上分别取得了86.5%, 92.9%和90.8%的分类准确率, 均优于基线方法, 与其他对比方法相比, 取得了具有竞争力的分类准确率, 得到了良好的分类性能, 验证了本文方法的有效性.参考文献(References):[1] Liu C B, Xie H T, Zha Z J, et al. 
Filtration and distillation: en-hancing region attention for fine-grained visual categoriza-tion[C] //Proceedings of the AAAI Conference on Artificial In-telligence. Palo Alto: AAAI Press, 2020: 11555-11562[2] Zheng H L, Fu J L, Zha Z J, et al. Looking for the devil in thedetails: learning trilinear attention sampling network for fine-grained image recognition[C] //Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. LosAlamitos: IEEE Computer Society Press, 2019: 5007-5016 [3] Zhang J T, Sun F W, Song J, et al. Fine-grained image classifi-cation via spatial saliency extraction[C] //Proceedings of the 17th IEEE International Conference on Machine Learning andApplications. Los Alamitos: IEEE Computer Society Press, 2018: 249-255[4] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNN models forfine-grained visual recognition[C] //Proceedings of the IEEE International Conference on Computer Vision. Los Alamitos: IEEE Computer Society Press, 2015: 1449-1457[5] Yu C J, Zhao X Y, Zheng Q, et al. Hierarchical bilinear poolingfor fine-grained visual recognition[C] //Proceedings of the Eu-ropean Conference on Computer Vision. Heidelberg: Springer,2018: 574-589[6] Wah C, Branson S, Welinder P, et al. The Caltech-UCSDbirds-200-2011 dataset[R]. California: California Institute of Technology. Computer Neural System, 2011[7] Krause J, Stark M, Deng J, et al. 3D object representations forfine-grained categorization[C] //Proceedings of the IEEE In-ternational Conference on Computer Vision Workshops. Los Alamitos: IEEE Computer Society Press, 2013: 554-561[8] Maji S, Rahtu E, Kannala J, et al. Fine-grained visual classifi-cation of aircraft[OL]. [2020-05-22]. https:///abs/1306.5151[9] Zhang N, Donahue J, Girshick R, et al. Part-based R-CNNs forfine-grained category detection[C] //Proceedings of the Euro-pean Conference on Computer Vision. Heidelberg: Springer, 2014: 834-849[10] Wei X S, Xie C W, Wu J X, et al. 
Mask-CNN: localizing partsand selecting descriptors for fine-grained bird species categori-zation[J]. Pattern Recognition, 2018, 76: 704-714[11] Zhao Yili, Xu Dan. Joint semantic parts for fine-grained birdimages recognition[J]. Journal of Computer-Aided Design & Computer Graphics, 2018, 30(8): 1522-1529(in Chinese)(赵毅力, 徐丹. 联合语义部件的鸟类图像细粒度识别[J].计算机辅助设计与图形学学报, 2018, 30(8): 1522-1529) [12] Deng Z J, Hu X W, Zhu L, et al. R3Net: recurrent residual re-finement network for saliency detection[C] //Proceedings of the。
Rapid Segmentation of Complex Curved-Surface Parts Based on Open CASCADE
Xu Qiang; Ma Jianwei; Jia Zhenyuan

Abstract: Based on the Open CASCADE kernel and VC programming techniques, a rapid segmentation system for complex curved-surface part models is established.
The imported IGES surface model is discretized, the boundary information of the discretized surface patches is extracted, a characteristic cuboid is constructed for each patch, and from it a set of characteristic dimensions of the patch is obtained.
After the surface patches are sorted and grouped by their principal characteristic dimension, rapid segmentation of the complex surface model is completed.
Rapid segmentation of an impeller model verifies the feasibility and effectiveness of the segmentation system for general complex curved-surface parts with local features.
Journal: Manufacturing Automation. Year (volume), issue: 2015 (000) 008. Pages: 3 (pp. 26-27, 32). Keywords: complex surface model; characteristic dimension set; rapid segmentation; OCC-VC programming. Authors: Xu Qiang; Ma Jianwei; Jia Zhenyuan. Affiliation (all authors): School of Mechanical Engineering, Dalian University of Technology, Dalian 116024. Language: Chinese. CLC number: TP391.

0 Introduction

Complex curved-surface parts, as key components of high-end equipment, are widely used in aerospace, energy and power, automotive, and other industries.
However, as the performance demands on high-end equipment keep rising, the structures of these key parts become increasingly complex, and some local geometric features require ever higher machining accuracy.
Surface parts carrying such local features suffer from low machining efficiency because of their special structures and the use of difficult-to-machine materials; a single set of machining parameters applied to the whole part can hardly keep pace with the rapid development of high-end equipment.
Taking an impeller with splitter blades as a representative example, the overall process parameters during NC machining are constrained by the local splitter-blade features.
However, the splitter blades occupy only a small fraction of the impeller's total machining area, so overall machining parameters determined by them are overly conservative for the remaining machining regions, severely limiting the machining efficiency of the impeller as a whole.
This motivates studying rapid segmentation of complex surface models: the splitter blades (local features) are separated from the impeller as a whole (global model), optimal machining parameters are then selected separately for the separated surfaces and for the remaining machining regions, machining code is generated as independent files, and after integration efficient machining on NC equipment is achieved.
At present, surface segmentation techniques are mainly applied in fields such as biomedical stereo image recognition [1] and reverse engineering [2,3]; surface segmentation aimed at improving NC machining performance has received little attention.
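In outline, the pipeline described in the abstract (discretize the imported surface, build a characteristic cuboid per patch, then sort and group patches by their principal characteristic dimension) could be sketched as follows. This is a rough illustrative sketch, not the authors' system: the point-list patch representation, the function names, and the size threshold are all assumptions.

```python
def characteristic_dims(patch_points):
    """Side lengths of the axis-aligned bounding cuboid of one discretized
    surface patch, sorted so the principal dimension comes first."""
    xs, ys, zs = zip(*patch_points)
    extents = [max(c) - min(c) for c in (xs, ys, zs)]
    return tuple(sorted(extents, reverse=True))

def split_by_principal_dim(patches, threshold):
    """Sort patches by principal characteristic dimension and split them into
    a 'global' group and a 'local feature' group (e.g. splitter blades)."""
    ranked = sorted(patches, key=lambda p: characteristic_dims(p)[0], reverse=True)
    large = [p for p in ranked if characteristic_dims(p)[0] >= threshold]
    small = [p for p in ranked if characteristic_dims(p)[0] < threshold]
    return large, small
```

With an impeller, the splitter-blade patches would fall into the small group once the threshold is chosen between the blade and hub dimensions.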
DOI 10.1007/s11263-013-0620-5

Selective Search for Object Recognition

J. R. R. Uijlings · K. E. A. van de Sande · T. Gevers · A. W. M. Smeulders

J. R. R. Uijlings (B), University of Trento, Trento, Italy; e-mail: jrr.uijlings@; jrr@disi.unitn.it
K. E. A. van de Sande · T. Gevers · A. W. M. Smeulders, University of Amsterdam, Amsterdam, The Netherlands

Received: 5 May 2012 / Accepted: 11 March 2013
© Springer Science+Business Media New York 2013

Abstract  This paper addresses the problem of generating possible object locations for use in object recognition. We introduce selective search which combines the strength of both an exhaustive search and segmentation. Like segmentation, we use the image structure to guide our sampling process. Like exhaustive search, we aim to capture all possible object locations. Instead of a single technique to generate possible object locations, we diversify our search and use a variety of complementary image partitionings to deal with as many image conditions as possible. Our selective search results in a small set of data-driven, class-independent, high quality locations, yielding 99% recall and a Mean Average Best Overlap of 0.879 at 10,097 locations. The reduced number of locations compared to an exhaustive search enables the use of stronger machine learning techniques and stronger appearance models for object recognition. In this paper we show that our selective search enables the use of the powerful Bag-of-Words model for recognition. The selective search software is made publicly available (Software: http://disi.unitn.it/~uijlings/SelectiveSearch.html).

1 Introduction

For a long time, objects were sought to be delineated before their identification. This gave rise to segmentation, which aims for a unique partitioning of the image through a generic algorithm, where there is one part for all object silhouettes in the image. Research on this topic has yielded tremendous progress over the past years (Arbeláez et al. 2011; Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004; Shi and Malik 2000). But images are intrinsically hierarchical: in Fig. 1a the salad and spoons are inside the salad bowl, which in turn stands on the table. Furthermore, depending on the context the term table in this picture can refer to only the wood or include everything on the table. Therefore both the nature of images and the different uses of an object category are hierarchical. This prohibits the unique partitioning of objects for all but the most specific purposes. Hence for most tasks multiple scales in a segmentation are a necessity. This is most naturally addressed by using a hierarchical partitioning, as done for example by Arbeláez et al. (2011).

Besides that a segmentation should be hierarchical, a generic solution for segmentation using a single strategy may not exist at all. There are many conflicting reasons why a region should be grouped together: in Fig. 1b the cats can be separated using colour, but their texture is the same. Conversely, in Fig. 1c the chameleon is similar to its surrounding leaves in terms of colour, yet its texture differs. Finally, in Fig. 1d, the wheels are wildly different from the car in terms of both colour and texture, yet are enclosed by the car. Individual visual features therefore cannot resolve the ambiguity of segmentation. And, finally, there is a more fundamental problem.
Regions with very different characteristics, such as a face over a sweater, can only be combined into one object after it has been established that the object at hand is a human. Hence without prior recognition it is hard to decide that a face and a sweater are part of one object (Tu et al. 2005).

This has led to the opposite of the traditional approach: to do localisation through the identification of an object. This recent approach in object recognition has made enormous progress in less than a decade (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2001). With an appearance model learned from examples, an exhaustive search is performed where every location within the image is examined as to not miss any potential object location (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2001).

Fig. 1  There is a high variety of reasons that an image region forms an object. In (b) the cats can be distinguished by colour, not texture. In (c) the chameleon can be distinguished from the surrounding leaves by texture, not colour. In (d) the wheels can be part of the car because they are enclosed, not because they are similar in texture or colour. Therefore, to find objects in a structured way it is necessary to use a variety of diverse strategies. Furthermore, an image is intrinsically hierarchical as there is no single scale for which the complete table, salad bowl, and salad spoon can be found in (a)

However, the exhaustive search itself has several drawbacks. Searching every possible location is computationally infeasible. The search space has to be reduced by using a regular grid, fixed scales, and fixed aspect ratios. In most cases the number of locations to visit remains huge, so much that alternative restrictions need to be imposed. The classifier is simplified and the appearance model needs to be fast. Furthermore, a uniform sampling yields many boxes for which it is immediately clear that they are not supportive of an object.
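To see why the search space explodes, a small sketch of the regular-grid enumeration a sliding-window search performs; the image size, scales, aspect ratios, and stride below are illustrative values, not taken from any cited method:

```python
def sliding_windows(img_w, img_h, scales, aspect_ratios, stride):
    """Enumerate the (x, y, w, h) boxes of a regular-grid exhaustive search."""
    boxes = []
    for s in scales:                 # box side length at aspect ratio 1
        for ar in aspect_ratios:     # ar = width / height
            w, h = int(s * ar ** 0.5), int(s / ar ** 0.5)
            for y in range(0, img_h - h + 1, stride):
                for x in range(0, img_w - w + 1, stride):
                    boxes.append((x, y, w, h))
    return boxes

# Even a coarse grid on a 640x480 image yields tens of thousands of boxes.
n = len(sliding_windows(640, 480, scales=[64, 128, 256],
                        aspect_ratios=[0.5, 1.0, 2.0], stride=8))
```

Even this coarse grid produces on the order of tens of thousands of candidate boxes per image, which is why exhaustive methods must keep the per-window classifier cheap.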
Rather than sampling locations blindly using an exhaustive search, a key question is: can we steer the sampling by a data-driven analysis?

In this paper, we aim to combine the best of the intuitions of segmentation and exhaustive search and propose a data-driven selective search. Inspired by bottom-up segmentation, we aim to exploit the structure of the image to generate object locations. Inspired by exhaustive search, we aim to capture all possible object locations. Therefore, instead of using a single sampling technique, we aim to diversify the sampling techniques to account for as many image conditions as possible. Specifically, we use a data-driven grouping-based strategy where we increase diversity by using a variety of complementary grouping criteria and a variety of complementary colour spaces with different invariance properties. The set of locations is obtained by combining the locations of these complementary partitionings. Our goal is to generate a class-independent, data-driven, selective search strategy that generates a small set of high-quality object locations.

Our application domain of selective search is object recognition. We therefore evaluate on the most commonly used dataset for this purpose, the Pascal VOC detection challenge which consists of 20 object classes. The size of this dataset yields computational constraints for our selective search. Furthermore, the use of this dataset means that the quality of locations is mainly evaluated in terms of bounding boxes. However, our selective search applies to regions as well and is also applicable to concepts such as "grass".

In this paper we propose selective search for object recognition. Our main research questions are: (1) What are good diversification strategies for adapting segmentation as a selective search strategy? (2) How effective is selective search in creating a small set of high-quality locations within an image? (3) Can we use selective search to employ more powerful classifiers and appearance models for object
recognition?

2 Related Work

We confine the related work to the domain of object recognition and divide it into three categories: exhaustive search, segmentation, and other sampling strategies that do not fall in either category.

2.1 Exhaustive Search

As an object can be located at any position and scale in the image, it is natural to search everywhere (Dalal and Triggs 2005; Harzallah et al. 2009; Viola and Jones 2004). However, the visual search space is huge, making an exhaustive search computationally expensive. This imposes constraints on the evaluation cost per location and/or the number of locations considered. Hence most of these sliding window techniques use a coarse search grid and fixed aspect ratios, using weak classifiers and economic image features such as HOG (Dalal and Triggs 2005; Harzallah et al. 2009; Viola and Jones 2004). This method is often used as a preselection step in a cascade of classifiers (Harzallah et al. 2009; Viola and Jones 2004).

Related to the sliding window technique is the highly successful part-based object localisation method of Felzenszwalb et al. (2010). Their method also performs an exhaustive search using a linear SVM and HOG features. However, they search for objects and object parts, whose combination results in an impressive object detection performance.

Lampert et al. (2009) proposed using the appearance model to guide the search. This both alleviates the constraints of using a regular grid, fixed scales, and fixed aspect ratio, while at the same time reduces the number of locations visited. This is done by directly searching for the optimal window within the image using a branch and bound technique.
While they obtain impressive results for linear classifiers, Alexe et al. (2010) found that for non-linear classifiers the method in practice still visits over 100,000 windows per image.

Instead of a blind exhaustive search or a branch and bound search, we propose selective search. We use the underlying image structure to generate object locations. In contrast to the discussed methods, this yields a completely class-independent set of locations. Furthermore, because we do not use a fixed aspect ratio, our method is not limited to objects but should be able to find stuff like "grass" and "sand" as well (this also holds for Lampert et al. (2009)). Finally, we hope to generate fewer locations, which should make the problem easier as the variability of samples becomes lower. And more importantly, it frees up computational power which can be used for stronger machine learning techniques and more powerful appearance models.

2.2 Segmentation

Both Carreira and Sminchisescu (2010) and Endres and Hoiem (2010) propose to generate a set of class independent object hypotheses using segmentation. Both methods generate multiple foreground/background segmentations, learn to predict the likelihood that a foreground segment is a complete object, and use this to rank the segments. Both algorithms show a promising ability to accurately delineate objects within images, confirmed by Li et al. (2010) who achieve state-of-the-art results on pixel-wise image classification using Carreira and Sminchisescu (2010). As common in segmentation, both methods rely on a single strong algorithm for identifying good regions. They obtain a variety of locations by using many randomly initialised foreground and background seeds. In contrast, we explicitly deal with a variety of image conditions by using different grouping criteria and different representations. This means a lower computational investment as we do not have to invest in the single best segmentation strategy, such as using the excellent yet expensive contour detector of
Arbeláez et al. (2011). Furthermore, as we deal with different image conditions separately, we expect our locations to have a more consistent quality. Finally, our selective search paradigm dictates that the most interesting question is not how our regions compare to Carreira and Sminchisescu (2010), Endres and Hoiem (2010), but rather how they can complement each other.

Gu et al. (2009) address the problem of carefully segmenting and recognizing objects based on their parts. They first generate a set of part hypotheses using a grouping method based on Arbeláez et al. (2011). Each part hypothesis is described by both appearance and shape features. Then, an object is recognized and carefully delineated by using its parts, achieving good results for shape recognition. In their work, the segmentation is hierarchical and yields segments at all scales. However, they use a single grouping strategy whose power of discovering parts or objects is left unevaluated. In this work, we use multiple complementary strategies to deal with as many image conditions as possible. We include the locations generated using Arbeláez et al. (2011) in our evaluation.

2.3 Other Sampling Strategies

Alexe et al. (2012) address the problem of the large sampling space of an exhaustive search by proposing to search for any object, independent of its class. In their method they train a classifier on the object windows of those objects which have a well-defined shape (as opposed to stuff like "grass" and "sand"). Then instead of a full exhaustive search they randomly sample boxes to which they apply their classifier. The boxes with the highest "objectness" measure serve as a set of object hypotheses. This set is then used to greatly reduce the number of windows evaluated by class-specific object detectors. We compare our method with their work.

Another strategy is to use visual words of the Bag-of-Words model to predict the object location. Vedaldi (2009) use jumping windows (Chum and Zisserman 2007), in which the relation between
individual visual words and the object location is learned to predict the object location in new images. Maji and Malik (2009) combine multiple of these relations to predict the object location using a Hough transform, after which they randomly sample windows close to the Hough maximum. In contrast to learning, we use the image structure to sample a set of class-independent object hypotheses.

To summarize, our novelty is as follows. Instead of an exhaustive search (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Harzallah et al. 2009; Viola and Jones 2004) we use segmentation as selective search yielding a small set of class independent object locations. In contrast to the segmentation of Carreira and Sminchisescu (2010), Endres and Hoiem (2010), instead of focusing on the best segmentation algorithm (Arbeláez et al. 2011), we use a variety of strategies to deal with as many image conditions as possible, thereby severely reducing computational costs while potentially capturing more objects accurately. Instead of learning an objectness measure on randomly sampled boxes (Alexe et al. 2012), we use a bottom-up grouping procedure to generate good object locations.

Fig. 2  Two examples of our selective search showing the necessity of different scales. On the left we find many objects at different scales. On the right we necessarily find the objects at different scales as the girl is contained by the tv

3 Selective Search

In this section we detail our selective search algorithm for object recognition and present a variety of diversification strategies to deal with as many image conditions as possible.

A selective search algorithm is subject to the following design considerations:

Capture All Scales. Objects can occur at any scale within the image. Furthermore, some objects have less clear boundaries than other objects. Therefore, in selective search all object scales have to be taken into account, as illustrated in Fig. 2. This is most naturally achieved by using a hierarchical algorithm.

Diversification. There is no
single optimal strategy to group regions together. As observed earlier in Fig. 1, regions may form an object because of only colour, only texture, or because parts are enclosed. Furthermore, lighting conditions such as shading and the colour of the light may influence how regions form an object. Therefore instead of a single strategy which works well in most cases, we want to have a diverse set of strategies to deal with all cases.

Fast to Compute. The goal of selective search is to yield a set of possible object locations for use in a practical object recognition framework. The creation of this set should not become a computational bottleneck, hence our algorithm should be reasonably fast.

3.1 Selective Search by Hierarchical Grouping

We take a hierarchical grouping algorithm to form the basis of our selective search. Bottom-up grouping is a popular approach to segmentation (Comaniciu and Meer 2002; Felzenszwalb and Huttenlocher 2004), hence we adapt it for selective search. Because the process of grouping itself is hierarchical, we can naturally generate locations at all scales by continuing the grouping process until the whole image becomes a single region. This satisfies the condition of capturing all scales.

As regions can yield richer information than pixels, we want to use region-based features whenever possible. To get a set of small starting regions which ideally do not span multiple objects, we use the fast method of Felzenszwalb and Huttenlocher (2004), which Arbeláez et al. (2011) found well-suited for such purpose.

Our grouping procedure now works as follows. We first use Felzenszwalb and Huttenlocher (2004) to create initial regions. Then we use a greedy algorithm to iteratively group regions together: first the similarities between all neighbouring regions are calculated. The two most similar regions are grouped together, and new similarities are calculated between the resulting region and its neighbours. The process of grouping the most similar regions is repeated until the whole image
becomes a single region. The general method is detailed in Algorithm 1.

Algorithm 1: Hierarchical Grouping Algorithm
    Input: (colour) image
    Output: set of object location hypotheses L
    Obtain initial regions R = {r_1, ..., r_n} using Felzenszwalb and Huttenlocher (2004)
    Initialise similarity set S = ∅
    foreach neighbouring region pair (r_i, r_j) do
        Calculate similarity s(r_i, r_j)
        S = S ∪ s(r_i, r_j)
    while S ≠ ∅ do
        Get highest similarity s(r_i, r_j) = max(S)
        Merge corresponding regions r_t = r_i ∪ r_j
        Remove similarities regarding r_i: S = S \ s(r_i, r_*)
        Remove similarities regarding r_j: S = S \ s(r_*, r_j)
        Calculate similarity set S_t between r_t and its neighbours
        S = S ∪ S_t
        R = R ∪ r_t
    Extract object location boxes L from all regions in R

For the similarity s(r_i, r_j) between region r_i and r_j we want a variety of complementary measures under the constraint that they are fast to compute. In effect, this means that the similarities should be based on features that can be propagated through the hierarchy, i.e. when merging region r_i and r_j into r_t, the features of region r_t need to be calculated from the features of r_i and r_j without accessing the image pixels.

3.2 Diversification Strategies

The second design criterion for selective search is to diversify the sampling and create a set of complementary strategies whose locations are combined afterwards. We diversify our selective search (1) by using a variety of colour spaces with different invariance properties, (2) by using different similarity measures s_ij, and (3) by varying our starting regions.

Complementary Colour Spaces. We want to account for different scene and lighting conditions. Therefore we perform our hierarchical grouping algorithm in a variety of colour spaces with a range of invariance properties. Specifically, we use the following colour spaces with an increasing degree of invariance: (1) RGB, (2) the intensity (grey-scale image) I, (3) Lab, (4) the rg channels of normalized RGB plus intensity denoted as rgI, (5) HSV, (6) normalized RGB denoted as rgb, (7) C
Geusebroek et al. (2001), which is an opponent colour space where intensity is divided out, and finally (8) the Hue channel H from HSV. The specific invariance properties are listed in Table 1.

Of course, for images that are black and white a change of colour space has little impact on the final outcome of the algorithm. For these images we rely on the other diversification methods for ensuring good object locations.

In this paper we always use a single colour space throughout the algorithm, meaning that both the initial grouping algorithm of Felzenszwalb and Huttenlocher (2004) and our subsequent grouping algorithm are performed in this colour space.

Complementary Similarity Measures. We define four complementary, fast-to-compute similarity measures. These measures are all in range [0, 1] which facilitates combinations of these measures.

s_colour(r_i, r_j) measures colour similarity. Specifically, for each region we obtain one-dimensional colour histograms for each colour channel using 25 bins, which we found to work well. This leads to a colour histogram C_i = {c_i^1, ..., c_i^n} for each region r_i with dimensionality n = 75 when three colour channels are used. The colour histograms are normalised using the L1 norm. Similarity is measured using the histogram intersection:

    s_colour(r_i, r_j) = sum_{k=1..n} min(c_i^k, c_j^k).    (1)

The colour histograms can be efficiently propagated through the hierarchy by

    C_t = (size(r_i) × C_i + size(r_j) × C_j) / (size(r_i) + size(r_j)).    (2)

The size of a resulting region is simply the sum of its constituents: size(r_t) = size(r_i) + size(r_j).

s_texture(r_i, r_j) measures texture similarity. We represent texture using fast SIFT-like measurements as SIFT itself works well for material recognition (Liu et al. 2010). We take Gaussian derivatives in eight orientations using σ = 1 for each colour channel. For each orientation and each colour channel we extract a histogram using a bin size of 10. This leads to a texture histogram T_i = {t_i^1, ..., t_i^n} for each region r_i with dimensionality n = 240 when three colour channels are
used. Texture histograms are normalised using the L1 norm. Similarity is measured using histogram intersection:

    s_texture(r_i, r_j) = sum_{k=1..n} min(t_i^k, t_j^k).    (3)

Texture histograms are efficiently propagated through the hierarchy in the same way as the colour histograms.

s_size(r_i, r_j) encourages small regions to merge early. This forces regions in S, i.e. regions which have not yet been merged, to be of similar sizes throughout the algorithm. This is desirable because it ensures that object locations at all scales are created at all parts of the image. For example, it prevents a single region from gobbling up all other regions one by one, yielding all scales only at the location of this growing region and nowhere else. s_size(r_i, r_j) is defined as the fraction of the image that r_i and r_j jointly occupy:

    s_size(r_i, r_j) = 1 − (size(r_i) + size(r_j)) / size(im),    (4)

where size(im) denotes the size of the image in pixels.

s_fill(r_i, r_j) measures how well region r_i and r_j fit into each other. The idea is to fill gaps: if r_i is contained in r_j it is logical to merge these first in order to avoid any holes.
On the other hand, if r_i and r_j are hardly touching each other they will likely form a strange region and should not be merged. To keep the measure fast, we use only the size of the regions and of the containing boxes. Specifically, we define BB_ij to be the tight bounding box around r_i and r_j. Now s_fill(r_i, r_j) is the fraction of the image contained in BB_ij which is not covered by the regions of r_i and r_j:

    s_fill(r_i, r_j) = 1 − (size(BB_ij) − size(r_i) − size(r_j)) / size(im).    (5)

We divide by size(im) for consistency with Eq. 4. Note that this measure can be efficiently calculated by keeping track of the bounding boxes around each region, as the bounding box around two regions can be easily derived from these.

Table 1  The invariance properties of both the individual colour channels and the colour spaces used in this paper, sorted by degree of invariance

    Colour channels   R  G  B  I  V  L  a    b    S  r  g  C    H
    Light intensity   −  −  −  −  −  −  +/−  +/−  +  +  +  +    +
    Shadows/shading   −  −  −  −  −  −  +/−  +/−  +  +  +  +    +
    Highlights        −  −  −  −  −  −  −    −    −  −  −  +/−  +

    Colour spaces     RGB  I  Lab  rgI  HSV  rgb  C    H
    Light intensity   −    −  +/−  2/3  2/3  +    +    +
    Shadows/shading   −    −  +/−  2/3  2/3  +    +    +
    Highlights        −    −  −    −    1/3  −    +/−  +

A "+/−" means partial invariance. A fraction 1/3 means that one of the three colour channels is invariant to said property.

In this paper, our final similarity measure is a combination of the above four:

    s(r_i, r_j) = a_1 s_colour(r_i, r_j) + a_2 s_texture(r_i, r_j) + a_3 s_size(r_i, r_j) + a_4 s_fill(r_i, r_j),    (6)

where a_i ∈ {0, 1} denotes if the similarity measure is used or not. As we aim to diversify our strategies, we do not consider any weighted similarities.

Complementary Starting Regions. A third diversification strategy is varying the complementary starting regions. To the best of our knowledge, the method of Felzenszwalb and Huttenlocher (2004) is the fastest, publicly available algorithm that yields high quality starting locations. We could not find any other algorithm with similar computational efficiency so we use only this oversegmentation in this paper.
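To make the grouping concrete, here is a toy sketch of the greedy loop of Algorithm 1 using the colour similarity (Eqs. 1-2) and the size term (Eq. 4). The texture and fill terms are omitted and region histograms, sizes, and adjacency are passed in directly, so this illustrates the loop's mechanics rather than the released implementation:

```python
def intersect(h1, h2):
    # Histogram intersection of two L1-normalised histograms, Eqs. (1)/(3).
    return sum(min(a, b) for a, b in zip(h1, h2))

def merge_hist(h_i, h_j, n_i, n_j):
    # Size-weighted propagation of histograms through the hierarchy, Eq. (2).
    return [(n_i * a + n_j * b) / (n_i + n_j) for a, b in zip(h_i, h_j)]

def hierarchical_grouping(hists, sizes, neighbours, im_size):
    """Greedily merge the most similar neighbouring regions until one
    region remains; returns the sequence of merges (i, j, new_id)."""
    regions = {i: (h, s) for i, (h, s) in enumerate(zip(hists, sizes))}
    nbrs = {i: set(ns) for i, ns in neighbours.items()}
    merges, next_id = [], len(regions)

    def sim(a, b):  # colour intersection plus size term of Eq. (4)
        (ha, sa), (hb, sb) = regions[a], regions[b]
        return intersect(ha, hb) + (1 - (sa + sb) / im_size)

    while len(regions) > 1:
        # pick the highest-similarity neighbouring pair
        a, b = max(((i, j) for i in regions for j in nbrs[i] if j in regions),
                   key=lambda p: sim(*p))
        (ha, sa), (hb, sb) = regions[a], regions[b]
        regions[next_id] = (merge_hist(ha, hb, sa, sb), sa + sb)
        nbrs[next_id] = (nbrs[a] | nbrs[b]) - {a, b}
        for n in nbrs[next_id]:
            nbrs[n] = (nbrs[n] - {a, b}) | {next_id}
        del regions[a], regions[b]
        merges.append((a, b, next_id))
        next_id += 1
    return merges
```

Every intermediate region created by this loop becomes one object-location hypothesis, which is how the hierarchy yields locations at all scales.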
But note that different starting regions are (already) obtained by varying the colour spaces, each of which has different invariance properties. Additionally, we vary the threshold parameter k in Felzenszwalb and Huttenlocher (2004).

3.3 Combining Locations

In this paper, we combine the object hypotheses of several variations of our hierarchical grouping algorithm. Ideally, we want to order the object hypotheses in such a way that the locations which are most likely to be an object come first. This enables one to find a good trade-off between the quality and quantity of the resulting object hypothesis set, depending on the computational efficiency of the subsequent feature extraction and classification method.

We choose to order the combined object hypotheses set based on the order in which the hypotheses were generated in each individual grouping strategy. However, as we combine results from up to 80 different strategies, such an order would too heavily emphasize large regions. To prevent this, we include some randomness as follows. Given a grouping strategy j, let r_i^j be the region which is created at position i in the hierarchy, where i = 1 represents the top of the hierarchy (whose corresponding region covers the complete image). We now calculate the position value v_i^j as RND × i, where RND is a random number in range [0, 1]. The final ranking is obtained by ordering the regions using v_i^j.

When we use locations in terms of bounding boxes, we first rank all the locations as detailed above. Only afterwards we filter out lower ranked duplicates. This ensures that duplicate boxes have a better chance of obtaining a high rank. This is desirable because if multiple grouping strategies suggest the same box location, it is likely to come from a visually coherent part of the image.

4 Object Recognition Using Selective Search

This paper uses the locations generated by our selective search for object recognition. This section details our framework for object recognition.

Two types of features are dominant in
object recognition: histograms of oriented gradients (HOG) (Dalal and Triggs 2005) and bag-of-words (Csurka et al. 2004; Sivic and Zisserman 2003). HOG has been shown to be successful in combination with the part-based model by Felzenszwalb et al. (2010). However, as they use an exhaustive search, HOG features in combination with a linear classifier is the only feasible choice from a computational perspective. In contrast, our selective search enables the use of more expensive and potentially more powerful features. Therefore we use bag-of-words for object recognition (Harzallah et al. 2009; Lampert et al. 2009; Vedaldi 2009). However, we use a more powerful (and expensive) implementation than (Harzallah et al. 2009; Lampert et al. 2009; Vedaldi 2009) by employing a variety of colour-SIFT descriptors (van de Sande et al. 2010) and a finer spatial pyramid division (Lazebnik et al. 2006).

Fig. 3  The training procedure of our object recognition pipeline. As positive learning examples we use the ground truth. As negatives we use examples that have a 20–50% overlap with the positive examples. We iteratively add hard negatives using a retraining phase

Specifically we sample descriptors at each pixel on a single scale (σ = 1.2). Using software from van de Sande et al. (2010), we extract SIFT (Lowe 2004) and two colour SIFTs which were found to be the most sensitive for detecting image structures, Extended OpponentSIFT (van de Sande et al. 2012) and RGB-SIFT (van de Sande et al. 2010). We use a visual codebook of size 4,000 and a spatial pyramid with 4 levels using a 1×1, 2×2, 3×3 and 4×4 division. This gives a total feature vector length of 360,000. In image classification, features of this size are already used (Perronnin et al.
2010; Zhou et al. 2010). Because a spatial pyramid results in a coarser spatial subdivision than the cells which make up a HOG descriptor, our features contain less information about the specific spatial layout of the object. Therefore, HOG is better suited for rigid objects and our features are better suited for deformable object types.

As classifier we employ a Support Vector Machine with a histogram intersection kernel using the Shogun Toolbox (Sonnenburg et al. 2010). To apply the trained classifier, we use the fast, approximate classification strategy of Maji et al. (2008), which was shown to work well for Bag-of-Words in Uijlings et al. (2010).

Our training procedure is illustrated in Fig. 3. The initial positive examples consist of all ground truth object windows. As initial negative examples we select, from all object locations generated by our selective search, those that have an overlap of 20–50% with a positive example. To avoid near-duplicate negative examples, a negative example is excluded if it has more than 70% overlap with another negative. To keep the number of initial negatives per class below 20,000, we randomly drop half of the negatives for the classes car, cat, dog and person. Intuitively, this set of examples can be seen as difficult negatives which are close to the positive examples.
This means they are close to the decision boundary and are therefore likely to become support vectors even when the complete set of negatives would be considered.Indeed,we found that this selection of training examples gives reason-ably good initial classification models.Then we enter a retraining phase to iteratively add hard negative examples(e.g.Felzenszwalb et al.(2010)):We apply the learned models to the training set using the locations generated by our selective search.For each negative image we add the highest scoring location.As our initial training set already yields good models,our models converge in only two iterations.For the test set,thefinal model is applied to all locations generated by our selective search.The windows are sorted by classifier score while windows which have more than30% overlap with a higher scoring window are considered near-duplicates and are removed.5EvaluationIn this section we evaluate the quality of our selective search. We divide our experiments in four parts,each spanning a separate subsection:Diversification Strategies.We experiment with a variety of colour spaces,similarity measures,and thresholds of the initial regions,all which were detailed in Sect.3.2.We seek a trade-off between the number of generated object hypotheses,computation time,and the quality of object locations.We do this in terms of bounding boxes.This results in a selection of complementary techniques which together serve as ourfinal selective search method.Quality of Locations.We test the quality of the object location hypotheses resulting from the selective search.Object Recognition.We use the locations of our selective search in the Object Recognition framework detailed in Sect.4.We evaluate performance on the Pascal VOC detection challenge.An upper bound of location quality.We investigate how well our object recognition framework performs when using an object hypothesis set of“perfect”quality.How does this compare to the locations that our selective 
search generates?
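The retraining phase described above can be sketched as a simple loop. The sketch below is an illustrative toy only: a tiny perceptron stands in for the histogram-intersection SVM, and the two-dimensional "window features", images and thresholds are made up for the example.

```python
def train_linear(pos, neg, epochs=200):
    """Tiny perceptron standing in for the SVM (illustrative only)."""
    w = [0.0, 0.0]
    b = 0.0
    data = [(x, 1) for x in pos] + [(x, -1) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # misclassified: update
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
    return w, b

def score(model, x):
    w, b = model
    return w[0] * x[0] + w[1] * x[1] + b

# Toy "images": each holds candidate windows (2-D feature vectors), all negative.
images = [[(0.9, 0.1), (0.2, 0.1)],
          [(0.8, 0.3), (0.1, 0.2)]]
pos = [(1.0, 1.0), (0.9, 0.8)]   # ground-truth object windows
neg = [(0.0, 0.1), (0.1, 0.0)]   # initial 20-50%-overlap negatives

for _ in range(2):               # the paper reports convergence in two iterations
    model = train_linear(pos, neg)
    for windows in images:       # add the highest-scoring (hardest) negative per image
        hardest = max(windows, key=lambda x: score(model, x))
        if hardest not in neg:
            neg.append(hardest)

model = train_linear(pos, neg)   # final model after hard-negative mining
```

The design point illustrated is that only the highest-scoring negative per image is added each round, so the negative set grows slowly and stays concentrated near the decision boundary.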
Technical Report PRIP-TR-81, July 16, 2003
Pattern Recognition and Image Processing Group
Institute of Computer Aided Automation
Vienna University of Technology
Favoritenstr. 9/183-2, A-1040 Vienna, Austria
Phone: +43 (1) 58801-18351, Fax: +43 (1) 58801-18392
E-mail: {yll,krw}@prip.tuwien.ac.at
URL: http://www.prip.tuwien.ac.at/

Hierarchical Image Partitioning with Dual Graph Contraction

Yll Haxhimusa and Walter Kropatsch

Abstract

We present a hierarchical partitioning of images using a pairwise similarity function on a graph-based representation of an image. This function measures the difference along the boundary of two components relative to a measure of the components' internal differences. This definition tries to encapsulate the intuitive notion of contrast. Two components are merged if there is a low-cost connection between them. Each component's internal difference is represented by the maximum edge weight of its minimum spanning tree. The external difference between components is the smallest weight of the edges connecting them. We use this idea for building a minimum spanning tree to find region borders quickly and effortlessly in a bottom-up way, based on local differences in a specific feature.

Keywords. Hierarchical graph-based image partitioning, irregular graph pyramids, minimum weight spanning tree, topology preserving contraction.

Contents

1 Introduction
  1.1 Related Work
2 Minimum Weight Spanning Tree
  2.1 Borůvka's Algorithm
3 Dual Graph Contraction (DGC)
  3.1 Contraction Kernels
  3.2 Equivalent contraction kernels
4 Minimum Spanning Tree with DGC
5 Hierarchy of Partitions
  5.1 Building a Hierarchy of Partitions
  5.2 Construct Hierarchy of Partitions
6 Experiments on Image Graphs
7 Conclusion and Outlook

1 Introduction

Wertheimer [54] has formulated the importance of wholes (Ganzen) and not of their individual elements as: "Es gibt Zusammenhänge, bei denen nicht, was im Ganzen geschieht, sich daraus herleitet, wie die einzelnen Stücke sind und sich zusammensetzen, sondern umgekehrt, wo - im
prägnanten Fall - sich das, was an einem Teil dieses Ganzen geschieht, bestimmt von inneren Strukturgesetzen dieses seines Ganzen" ("there are wholes, the behaviour of which is not determined by that of their individual elements, but where the part-processes are themselves determined by the intrinsic nature of the whole"), and introduced the importance of perceptual grouping and organization in visual perception.

Image segmentation from low-level cues cannot and should not produce a complete final "good" segmentation. The low-level coherence of brightness, color, texture or motion attributes should be used to sequentially come up with hierarchical partitions [50]. Mid- and high-level knowledge can be used to either confirm these groups or select some for further attention. A wide range of computational vision problems could make use of segmented images, where such segmentations rely on efficient computation. For instance, motion estimation requires an appropriate region of support for finding correspondences. Higher-level problems such as recognition and image indexing can also make use of segmentation results in the problem of matching. It is important that a grouping method has the following properties [11]:

• it captures perceptually important groupings or regions, which reflect global aspects of the image;
• it is highly efficient, running in time linear in the number of image pixels;
• it creates hierarchical partitions [50].

In a regular image pyramid (for an overview see [32]) the number of pixels at any level l is r times higher than the number of pixels at the next reduced level l+1. The so-called reduction factor r is greater than one and is the same for all levels l. If s denotes the number of pixels in an image I, the number of new levels on top of I amounts to log_r(s). Thus, the regular image pyramid may be an efficient structure for fast grouping and access to image objects in top-down and bottom-up processes. However, regular image pyramids are confined to globally defined sampling grids and lack shift invariance [2]. Bister et al. [2] conclude that regular image pyramids have to be rejected as general-purpose segmentation algorithms. In [41,23] it was shown how these drawbacks can be avoided by
irregular image pyramids, the so-called adaptive pyramids, where the hierarchical structure (vertical network) of the pyramid is not known "a priori" but is recursively built based on the data. Moreover, in [37,39,3,7,18] it was shown that irregular pyramids can be used for segmentation and feature detection. Each level represents a partition of the pixel set into cells, i.e. connected subsets of pixels. The construction of an irregular image pyramid is iteratively local [38,22,1,20]:

• the cells have no information about their global position;
• the cells are connected only to (direct) neighbors;
• the cells cannot distinguish the spatial positions of the neighbors.

This means that we use only local properties to build the hierarchy of the pyramid. On the base level (level 0) of an irregular image pyramid the cells represent single pixels and the neighborhood of the cells is defined by the 4 (8)-connectivity of the pixels. A cell on level l+1 (parent) is a union of neighboring cells on level l (children). This union is controlled by so-called contraction kernels (decimation parameters, see Section 3). Every parent computes its values independently of the other cells on the same level. This implies that an image pyramid is built in O[log(image diameter)] parallel steps. Each level l is represented by a pair (G_l, Ḡ_l) of plane graphs: G_l, whose vertices represent the cells and whose edges represent the neighborhood relations of the cells, and its dual (plane) graph Ḡ_l, whose edges represent the borders of the cells on level l, depicted with solid lines in Figure 1(b), possibly including so-called pseudo edges needed to represent the neighborhood relation to a cell completely surrounded by another cell. Finally, the vertices of Ḡ_l represent the meeting points of the cell borders. The sequence (G_l, Ḡ_l), 0 ≤ l ≤ h, is called a (dual) graph pyramid. Moreover, the graphs are attributed, G(V, E, attr_v, attr_e), where attr_v: V → R+ and attr_e: E → R+.
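The base level of such an attributed graph can be sketched as follows. Grayscale input and attr_e taken as the absolute intensity difference of the two end points are assumptions for illustration; the report introduces its actual edge attributes later (Section 6).

```python
def base_level_graph(image):
    """Build the level-0 graph: one vertex per pixel, 4-connectivity,
    edge attribute = absolute intensity difference of the end points."""
    h, w = len(image), len(image[0])
    vertices = [(y, x) for y in range(h) for x in range(w)]
    edges = {}  # (pixel, pixel) -> weight, i.e. attr_e
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):      # right and down neighbours
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    edges[((y, x), (ny, nx))] = abs(image[y][x] - image[ny][nx])
    return vertices, edges

img = [[0, 0, 9],
       [0, 1, 9]]
V, E = base_level_graph(img)   # 6 vertices, 7 edges for this 2x3 image
```

Only the neighborhood graph G_0 is built here; the dual face graph Ḡ_0 needed for topology-preserving contraction is omitted from the sketch.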
We use a weight for attr_e measuring the difference between the two end points. The aim of this technical report is to build the minimum weight spanning tree (MST) of an image by combining the advantage of regular pyramids (logarithmic tapering) with the advantages of irregular graph pyramids (their purely local construction and shift invariance). The aim is reached by using the selection method for contraction kernels proposed in [20] to achieve logarithmic tapering, local construction and shift invariance. Borůvka's algorithm [4], combined with the dual graph contraction algorithm [30], is used to build a minimum weight spanning tree (of the region) in a hierarchical way and at the same time to preserve the topology. Topological relations seem to play an even more important role for vision tasks in natural systems than precise geometrical position. We use the idea of building a minimum weight spanning tree to find region borders quickly and effortlessly for segmentation and feature detection.

Figure 1: (a) Partition of the pixel set into cells. (b) Representation of the cells and their neighborhood relations by a dual pair (G, Ḡ).

1.1 Related Work

The clustering community [27] has produced agglomerative and divisive algorithms; in image segmentation, region-based merging and splitting algorithms exist. Early graph-based methods [59] use fixed thresholds and local measures in computing a segmentation, i.e. the minimum spanning tree (MST) is computed. The segmentation criterion is to break the MST edges with the largest weight. The idea behind this is that edges in the MST reflect the low-cost connections between two elements. The work of Urquhart [52] attempts to overcome the problem of fixed thresholds by normalizing the weight of an edge using the smallest weight incident on the vertices touching that edge. The methods in [11,13,18] use an adaptive criterion that depends on local properties rather than global ones. Recently, the works [11,40,13,18] have the minimum spanning tree as the base algorithm. Gestalt grouping factors, such as
proximity, similarity, continuity and symmetry, are encoded and combined in pairwise feature similarity measures [56,50,46,44,57,53,49,58,13]. Another class of segmentation methods splits and merges regions depending on how well the regions fulfill some criterion. Such methods [9,43] use a measure of the uniformity of a region. The authors in [11,13,18] use, in contrast, a graph-based pairwise region comparison rather than applying a uniformity criterion to each individual region. It has been demonstrated that complex grouping phenomena can emerge from simple computations on these local cues [19,35].

Markov Random Fields (MRF) have been used for image restoration and segmentation [17]. However, the use of MRFs for image segmentation usually leads to NP-hard problems. The graph-based approximation methods for MRF problems [5] yield practical solutions if the number of labels for the pixels is small, which limits the use of these methods in segmentation.

The methods based on minimum cuts in graphs are designed to minimize the similarity between the pixels that are being split [56,50,16]. Wu and Leahy [56] define a cut criterion, but it was biased toward finding small components. Shi and Malik [50] developed the normalized cut criterion to address this bias, which takes into consideration the self-similarity of regions. These cut-criterion methods capture the non-local properties of the image, in contrast with the simple graph-based methods such as breaking edges in the MST. They also produce a divisive hierarchical tree, the dendrogram. However, they provide only a characterization of such cuts rather than a final segmentation, as is provided by Felzenszwalb [11]. Shi and Malik [50] developed an approximation method for computing the minimum normalized cut; the error in this approximation is not well understood (it is closely related to spectral graph methods, e.g. [12]) and the method is computationally expensive.

The minimal spanning tree and the minimum cut are explicitly defined on edge-weighted graphs, whereas the
concept of a maximal clique is defined on unweighted graphs. As a consequence, maximal-clique-based clustering algorithms work on unweighted graphs derived from the edge-weighted graphs by means of some thresholding [27]. Pavan and Pelillo [42] generalized the concept of a maximal clique to weighted graphs. The maximal cliques are found using discrete replicator dynamics, which turns out to be an instance of a relaxation labeling algorithm [48].

Our method is related to the work in [11,18] in the sense of pairwise comparison of region similarity. We create a hierarchy of attributed graphs. At each level of the pyramid a region adjacency graph (RAG) is created in an agglomerative way (by contraction) with the proper topology. A vertex of the RAG is a representative of a region in the base level (its receptive field), and it is created by taking into consideration the self-similarity of the region.

2 Minimum Weight Spanning Tree

Let G = (V, E) be an undirected connected plane graph consisting of the finite set of vertices V and the finite set of edges E. Each edge e ∈ E is identified with a pair of vertices v_i, v_j ∈ V. Let each edge e ∈ E be associated with a nonnegative unique real weight w(e) = w(v_i, v_j). Note that parallel edges, e.g. e_1 = (v_1, v_2) and e_2 = (v_1, v_2) with e_1 ≠ e_2, have different weights. First let us give some basic graph-theoretical notions, taken from [51].

Definition 1 [walk]. A walk in a graph G is a finite alternating sequence of vertices and edges v_0, e_1, v_1, ..., v_{k-1}, e_k, v_k beginning and ending with vertices such that v_{i-1} and v_i are the end vertices of the edge e_i, 1 ≤ i ≤ k.

Definition 2 [trail]. A walk is a trail if all its edges are distinct. A trail is closed if its end vertices are the same.

Definition 3 [circuit]. A closed trail is a circuit if all its vertices except the end vertices are distinct.

Definition 4 [acyclic graph]. A graph G is acyclic if it has no circuits.

Definition 5 [tree]. A tree of graph G is a connected acyclic subgraph of G.

Definition 6 [spanning tree]. A spanning tree of graph G is a tree of G
containing all the vertices of G.

The problem is formulated as the construction of a minimum weight spanning tree of G. Deterministic solutions were proposed by Borůvka [4], Kruskal [33], and Prim [45] as the widely known greedy algorithms in the literature. Better solutions were proposed (depending on the assumptions made) by Karger et al. [28], with a randomized algorithm which solves the problem in linear expected time, and by Fredman et al. [14], with a linear worst-case time solution under the assumption that the weights are small integers. The solution proposed by Gabow et al. [15] solves the problem with the exact bound O(m log β(m,n)) on a graph with n vertices and m edges, where β(m,n) = min{i | log^(i)(n) ≤ m/n} and log^(i)(n) is the i-times iterated logarithm log(log(...(log(n))...)).

Here we give two theorems that provide the basis of minimum weight spanning tree algorithms, taken from [51], which will help us in proving the correctness of the MST algorithm. The weight of a subgraph T ⊆ G is the sum of its edge weights:

w(T) = Σ_{e∈T} w(e).   (1)

Theorem 1 Consider a vertex v in a weighted connected graph G. Among all the edges incident on v, let e be one of minimum weight. Then G has a minimum weight spanning tree that contains e.

Proof: Let T_min be a minimum weight spanning tree of G. If T_min does not contain e, then consider the fundamental circuit C of T_min with respect to e, i.e. created by adding e to T_min. Let e′ be the edge of C that is adjacent to e and incident to the same vertex v. Clearly e′ ∈ T_min. Also T′ = T_min − e′ + e is a spanning tree of G. Since e and e′ are both incident on v, we have w(e) ≤ w(e′) by assumption. But from the fact that T_min is an MST, w(T′) = w(T_min) − w(e′) + w(e) ≥ w(T_min), and hence w(e) ≥ w(e′). So w(e) = w(e′) and w(T′) = w(T_min). Thus T′ is also a minimum weight spanning tree. The theorem follows since T′ contains e. □

Theorem 2 Let T be an acyclic subgraph of a
weighted connected graph G such that there exists a minimum weight spanning tree containing T. If G′ denotes the graph obtained by contracting all the edges of T, and T′_min is a minimum weight spanning tree of G′, then T′_min ∪ T is a minimum weight spanning tree of G.

Proof: Let T_min be a minimum weight spanning tree of G containing T. If T_min = T ∪ T′ (Figure 2(a)), then clearly T′ is a spanning tree of G′ (Figure 2(b)). Therefore,

w(T′) ≥ w(T′_min).   (2)

Since T is an acyclic subgraph of G and T′_min is a minimum spanning tree of G′, it is easy to see that T′_min ∪ T is also a spanning tree of G. So,

w(T′_min ∪ T) ≥ w(T_min) = w(T) + w(T′).   (3)

From this we get

w(T′_min) ≥ w(T′).   (4)

Figure 2: (a) T_min = T ∪ T′, where T = T_1 ∪ T_2 ∪ T_3; (b) T′ after contraction of all edges in T.

Combining Equations (2) and (4) we get w(T′_min) = w(T′), and so w(T′_min ∪ T) = w(T′ ∪ T) = w(T_min). Thus T′_min ∪ T is a minimum spanning tree of G. □

We intend to solve the MST problem in parallel by using Borůvka's algorithm in conjunction with dual graph contraction [30]. We use Borůvka's algorithm since it can be used to build the MST in parallel [8]. In the next Section 2.1 we give the parallel version of this algorithm.

2.1 Borůvka's Algorithm

The idea of Borůvka's algorithm [4] is to perform steps like those in Prim's algorithm, in parallel over the whole graph at the same time. The algorithm (Algorithm 1) constructs a spanning tree in iterations composed of the following steps:

Input: Graph G(V, E).
1: MST := empty edge list.
2: all vertices v ∈ V make a list of trees L.
3: while there is more than one tree in L do
4:   each tree T ∈ L finds the edge e with the minimum weight which connects T to G\T and adds edge e to MST.
5:   using edge e, merge pairs of trees in L.
6: end while
Output: Minimum weight spanning tree MST.

Figure 3: The steps of building the MST with Borůvka's algorithm. (a) Starting graph; (b) edges with minimum weight (solid lines); (c) MST (solid lines).

Each tree T finds the edge e with the smallest weight which connects T to G\T; the trees T are then connected to G\T with the edge e. In this way the
number of trees in L is reduced until there is only one: the minimum weight spanning tree. A simple example of the steps of Borůvka's algorithm is given in Figure 3: (a) the simple attributed graph as input, where each vertex is a tree in L; (b) the edges with the smallest weight (solid lines) incident on the vertices connect the trees (vertices in this case) to each other, creating the trees T_1, T_2 and T_3; (c) each tree in (b) finds the edge with the smallest weight to connect to the other tree. The output shown in Figure 3(c) is the MST of the input graph. The proof that this algorithm builds the minimum weight spanning tree is based on the proof of Kruskal's MST algorithm given in [51].

Theorem 3 Borůvka's algorithm constructs a minimum weight spanning tree of a weighted connected graph G.

Proof: Let G be the given nontrivial weighted connected graph. When Borůvka's algorithm terminates, the selected tree T_min is a spanning tree. Thus we have to show that T_min is indeed a minimum weight spanning tree of G by proving that every T_i constructed in the course of Borůvka's algorithm is contained in a minimum weight spanning tree of G. Our proof is by induction on i. The subgraph T_{i+1} is constructed from T_i by adding an edge of minimum weight with exactly one end vertex in T_i. This construction ensures that all T_i's are connected. As inductive hypothesis, assume that T_i is contained in a minimum spanning tree of G. If G′ denotes the graph obtained by contracting the edges of T_i, and v′ denotes the vertex of G′ which corresponds to the vertex set of T_i, then e_{i+1} is in fact a minimum weight edge incident on v′ in G′. Clearly, by Theorem 1 the edge e_{i+1} is contained in a minimum weight spanning tree T′_min of G′. By Theorem 2, T_i ∪ T′_min is a minimum weight spanning tree of G. More specifically, T_{i+1} = T_i ∪ {e_{i+1}} is contained in a minimum weight spanning tree of G, and the correctness of Borůvka's algorithm follows. □
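The parallel rounds of Algorithm 1 can be simulated sequentially. The sketch below assumes distinct edge weights (the condition discussed next in the text) and uses a union-find structure to stand in for the list of trees L; the representation of edges as (weight, u, v) triples is an assumption of the example.

```python
def boruvka_mst(n, edges):
    """edges: list of (weight, u, v) with distinct weights; vertices 0..n-1.
    Returns the set of MST edges of a connected graph."""
    parent = list(range(n))

    def find(v):                         # union-find root with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst = set()
    while len(mst) < n - 1:
        # one round: each current tree picks its minimum-weight outgoing edge
        cheapest = {}
        for e in edges:
            w, u, v = e
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in cheapest or w < cheapest[r][0]:
                        cheapest[r] = e
        for w, u, v in cheapest.values():  # merge the trees along those edges
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.add((w, u, v))
    return mst

edges = [(1, 0, 1), (4, 1, 2), (3, 0, 2), (2, 2, 3)]
tree = boruvka_mst(4, edges)  # MST edges with weights 1, 2 and 3
```

On this 4-vertex example the first round already merges vertices 0,1 and 2,3, and the second round joins the two remaining trees, mirroring Figure 3.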
It can easily be shown that Algorithm 1 may fail to build the MST if the edge weights are not distinct, since the set of selected edges may then contain cycles. This problem can be avoided by making the weights distinct, e.g. by breaking ties through a fixed total order on the edges.

3 Dual Graph Contraction (DGC)

Dual graph contraction [29,30] reduces the pair of a neighborhood graph G_k(V_k, E_k) and its dual face graph Ḡ_k to the pair on the next level: the edges of a contraction kernel are contracted in G_k (dual edge contraction), and redundant self-loops and parallel edges are removed by eliminating the corresponding dual vertices of low degree (dual face contraction), so that the topology is preserved. We write² (G_{k+1}, Ḡ_{k+1}) = C[(G_k, Ḡ_k), ⟨S_k, N_{k,k+1}⟩]. The base of the pyramid consists of the pair of dual image graphs (G_0, Ḡ_0).

²Neglected level indices refer to contraction from level k to level k+1.

Figures 5 and 6: Dual edge contraction.

3.1 Contraction Kernels

Contraction is controlled by the decimation parameters ⟨S_k, N_{k,k+1}⟩: S_k ⊂ V_k is the set of surviving vertices and N_{k,k+1} ⊂ E_k the set of edges to be contracted. Each non-surviving vertex v ∈ V_k \ S_k is connected to exactly one surviving vertex by edges of N_{k,k+1}, so the contraction edges form a spanning forest of G_k; each of its trees, rooted at a survivor, is called a contraction kernel.

3.2 Equivalent contraction kernels

The combination of two (and more) successive reductions into an equivalent weighting function allowed Burt [6] to calculate any level of the pyramid directly from the base. Similarly, we combine two (and more) dual graph contractions (see Figure 7) of graph G_k with decimation parameters ⟨S_k, N_{k,k+1}⟩ and ⟨S_{k+1}, N_{k+1,k+2}⟩ into one single equivalent contraction kernel N_{k,k+2} = N_{k,k+1} ◦ N_{k+1,k+2} (for simplicity, G_k stands for the pair (G_k, Ḡ_k)).

Figure 7: Equivalent contraction kernel.

Definition 9 Function bridge: E_{k+1} → E_k assigns to each edge e_{k+1} = (v_{k+1}, w_{k+1}) ∈ E_{k+1} one of the bridges e_k ∈ E_k of the connecting paths CP_k(v_{k+1}, w_{k+1}):

bridge(e_{k+1}) := e_k.   (7)

Two disjoint tree structures connected by a single edge become a new tree structure. The result of connecting all contraction kernels T_k by bridges fulfills the requirements of a contraction kernel:

N_{k,k+2} := N_{k,k+1} ∪ ⋃_{e_{k+1} ∈ N_{k+1,k+2}} bridge(e_{k+1}).   (8)

The above process can be repeated on the remaining contraction kernels until the base level 0 contracts in one step into the apex V_h = {v_h}, the top of the pyramid.

4 Minimum Spanning Tree with DGC

The dual graph contraction algorithm [29,30] is used to contract edges and create "super" vertices, i.e. to create father-son relations between vertices in subsequent levels (vertical relation), and at the
same time to preserve the topology, whereas Borůvka's algorithm is used to create son-son relations between vertices on the same level (horizontal relation). In Section 5 we will refine the son-son relation to simulate the pop-out phenomena [25,26] and to find region borders quickly and effortlessly in a bottom-up, "stimulus-driven" way based on local differences in a specific feature (color).

Here we expand Borůvka's algorithm with steps that contract edges and remove parallel edges and self-loops (if the topology is not changed); see Algorithm 2.

Theorem 4 The equivalent contraction kernel of the apex of the graph pyramid is the minimum spanning tree of the base level.

Proof: The top h of the pyramid consists of an isolated vertex (the apex). The set of edges at the base level (the input graph) which are contracted to arrive at the apex is the equivalent contraction kernel N_{0,h} (see Section 3.2). Since Algorithm 2 builds a minimum spanning tree (see Theorem 3), this implies that N_{0,h} = MST. □

Algorithm 2:
Input: Attributed graph G_0(V, E).
1: k = 0
2: repeat
3:   for each vertex v ∈ G_k, find the minimum-weight edge e ∈ G_k incident to v and mark the edges e to be contracted.
4:   determine the connected components CC_i^k of the marked edges e.
5:   contract each connected component CC_i^k into a single vertex, eliminate the parallel edges (except the one with the minimum weight) and self-loops, and create the graph G_{k+1} = C[G_k, CC_i^k].
6:   k = k + 1
7: until all connected components of G are contracted into one single vertex.
Output: A graph pyramid with an apex.

Figure 8: Building the MST with DGC.

5 Hierarchy of Partitions

Hierarchies are a significant tool for image partitioning as they naturally mix with homogeneity criteria. Horowitz and Pavlidis [21] define a consistent homogeneity criterion over a set V as a boolean predicate P over its parts Φ(V) that verifies the consistency property:

∀(x, y) ∈ Φ(V): x ⊂ y ⇒ (P(y) ⇒ P(x)).   (9)

In image analysis, Equation (9) states that the subregions of a homogeneous region are also homogeneous. It follows that if Pyr is a hierarchy and P a consistent homogeneity criterion on V, then the set of maximal elements of Pyr that fulfill P defines a unique partition of V. Thus the joint use of a hierarchy and a homogeneity criterion allows one to define a partitioning in a natural way.

Let G(V, E) be a given attributed graph with vertex set V and edge set E (we discuss the attributes below). The goal is to find partitions P = {CC_1, CC_2, ..., CC_n} such that the elements satisfy certain properties. Moreover, P is a partition of V, and

∀i ≠ j: CC_i ∩ CC_j = ∅  ∧  ⋃_{i=1,...,n} CC_i = V.   (10)

We use the pairwise comparison of neighboring vertices, i.e. partitions, to check for similarities [11,13,18]. Felzenszwalb in [11] defines a pairwise group comparison function Comp(CC_i, CC_j) that judges whether or not there is evidence for a boundary between two partitions CC_i, CC_j ∈ P. Note that Comp(CC_i, CC_j) is a boolean comparison function for pairs of partitions; its concrete definition depends on the application. Comp(CC_i, CC_j) is true if there is evidence for a boundary between CC_i and CC_j, and false when there is no boundary.

5.1 Building a Hierarchy of Partitions

In this section we present the pairwise comparison function Comp(·,·), which tells us if there is a boundary between a pair of regions (groups). This function measures the difference along the boundary of two components relative to a measure of the components' internal differences. This definition tries to encapsulate the intuitive notion
of contrast: a contrasted zone is a region containing two connected components whose inner differences (internal contrast) are less than the differences with its context (external contrast). We define an external contrast measure between two components and an internal contrast measure of each component. These measures are defined analogously in [11,13,18].

Let G(V, E, attr_v, attr_e) be a given attributed graph with vertex set V and edge set E on the base level (level 0). Vertices v ∈ V and edges e ∈ E are attributed, i.e. attr_v: V → R+ and attr_e: E → R+. One possible way to attribute the edges is given in Section 6. The graph on level k of the pyramid is denoted by G_k. Every vertex u ∈ G_k is a representative of a component CC_i of the partition P_k. The equivalent contraction kernel of a vertex u ∈ G_k, N_{0,k}(u), is the set of edges e ∈ E of the base level that are contracted; i.e. applying the equivalent contraction kernel to the base level contracts the subgraph G′ ⊆ G onto the vertex u (see Section 3.2).

The internal contrast measure of CC_i ∈ P_k is the largest dissimilarity measure inside component CC_i, i.e. the largest edge weight of N_{0,k}(u) of vertex u ∈ G_k:

Int(CC_i) = max{attr_e(e), e ∈ N_{0,k}(u)}.   (11)

Let u_i, u_j ∈ V_k be the end vertices of an edge e ∈ E_k. The external contrast measure between two components CC_i, CC_j ∈ P_k is the smallest dissimilarity measure between components CC_i and CC_j, i.e. the smallest weight of the edges connecting N_{0,k}(u_i) and N_{0,k}(u_j) of vertices u_i ∈ CC_i and u_j ∈ CC_j:

Ext(CC_i, CC_j) = min{attr_e(e), e = (v, w): v ∈ N_{0,k}(u_i) ∧ w ∈ N_{0,k}(u_j)}.   (12)

In Figure 9 a simple example of Int(CC_i) and Ext(CC_i, CC_j) is given. The Int(CC_i) (Int(CC_j)) of the component CC_i (CC_j) is the maximum of the weights of the solid-line edges, whereas Ext(CC_i, CC_j) is the minimum of the weights of the dashed-line edges (bridges) connecting components CC_i and CC_j on the base level G_0. Vertices u_i and u_j are representatives of the components CC_i and CC_j. By contracting the edges N_{0,k}(u_i) one arrives at the vertex u_i, and analogously N_{0,k}(u
_j) for the vertex u_j (see Section 3.2).

The pairwise comparison function Comp(·,·) between two connected components CC_i and CC_j can now be defined as:

Comp(CC_i, CC_j) = True if Ext(CC_i, CC_j) > PInt(CC_i, CC_j), False otherwise,   (13)

Figure 9: Internal and external contrast.

where PInt(CC_i, CC_j) is the minimum internal contrast difference between the two components:

PInt(CC_i, CC_j) = min(Int(CC_i) + τ(CC_i), Int(CC_j) + τ(CC_j)).   (14)

For the function Comp(CC_i, CC_j) to be true, i.e. for the border to exist, the external contrast difference must be greater than the internal contrast differences. The reason for using a threshold function τ(CC) in Equation (14) is that for small components CC, Int(CC) is not a good estimate of the local characteristics of the data; in the extreme case, when |CC| = 1, Int(CC) = 0. Any non-negative function of a single component CC can be used for τ(CC) [11]. One can define τ to be a function of the size of CC:

τ(CC) = α/|CC|,   (15)

where |CC| denotes the size of the component CC and α is a constant. A more complex definition of τ(CC), which is large for certain shapes and small otherwise, would produce a partitioning which prefers certain shapes; e.g. using a ratio of perimeter to area would prefer components that are not long and thin.

5.2 Construct Hierarchy of Partitions

Algorithm 2 of Section 4, in conjunction with the comparison function Comp(·,·) defined in Section 5.1, is used as the basis to build the hierarchy of partitions. We proved that Algorithm 2 builds a minimum spanning tree of an attributed graph, so the definitions of internal and external contrast are recalled for clarity:

Int(CC^k) = max{attr_e(e), e ∈ MST(u_k)}.
Ext(CC_i^k, CC_j^k) = min{attr_e(e), e = (v, w): v ∈ MST(u_{k,i}) ∧ w ∈ MST(u_{k,j})}.
PInt(CC_i^k, CC_j^k) = min(Int(CC_i^k) + τ(CC_i^k), Int(CC_j^k) + τ(CC_j^k)).
τ(CC^k) = f(CC^k),   (16)

where f(CC^k) is a non-negative function, not defined yet. Let P_k = {CC_i^k, CC_j^k, ..., CC_n^k} be the
partitions on level k of the pyramid, i.e. P_k is the graph G_k(V_k, E_k). The algorithm to build the hierarchy of partitions is as follows:

Input: Attributed graph G_0.
1: k = 0
2: repeat
3:   for all vertices u ∈ G_k do
4:     E_min(u) = argmin{attr_e(e) | e = (u, v) ∈ E_k or e = (v, u) ∈ E_k}
5:   end for
6:   for all e = (u_{k,i}, u_{k,j}) ∈ E_min with Ext(CC_i^k, CC_j^k) ≤ PInt(CC_i^k, CC_j^k) do
7:     include e in the contraction edges N_{k,k+1}
8:   end for
9:   contract graph G_k with contraction kernels N_{k,k+1}: G_{k+1} = C[G_k, N_{k,k+1}].
10:  for all e_{k+1} ∈ G_{k+1} do
11:    set edge attributes attr_e(e_{k+1}) = min{attr_e(e_k) | e_{k+1} = C[e_k, N_{k,k+1}]}
12:  end for
13:  k = k + 1
14: until G_k = G_{k-1}
Output: A region adjacency graph (RAG) at each level of the pyramid.
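The merge test in line 6 can be illustrated in isolation. The sketch below evaluates Equations (13)-(15) for two components given the edge weights of their MSTs, with τ(CC) = α/|CC| from Equation (15); the value of α, the toy weights and the function names are assumptions of the example.

```python
def internal(mst_weights):
    """Int(CC): largest edge weight of the component's MST (Eq. 11)."""
    return max(mst_weights) if mst_weights else 0.0

def merge_allowed(mst_i, size_i, mst_j, size_j, ext, alpha=3.0):
    """True if Ext <= PInt, i.e. there is no boundary evidence, so line 6
    of the algorithm would contract the connecting edge (Eqs. 13-15)."""
    tau_i = alpha / size_i               # tau(CC) = alpha / |CC|, Eq. (15)
    tau_j = alpha / size_j
    pint = min(internal(mst_i) + tau_i,  # PInt, Eq. (14)
               internal(mst_j) + tau_j)
    return ext <= pint

# two components with small internal contrast and a weakly contrasted border
print(merge_allowed([1.0, 2.0], 3, [1.5], 2, ext=2.5))   # prints True
# a strongly contrasted border keeps them separate
print(merge_allowed([1.0, 2.0], 3, [1.5], 2, ext=9.0))   # prints False
```

Note how the τ term lets small components merge even when their Int(CC) is small or zero, which is exactly why the threshold function is introduced in Equation (14).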