Neural network identification: foreign-language literature on neural network system identification (3)
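Techniques for System Identification with Matlab

I. Introduction
System identification is an important method for studying the dynamic behavior of a system, and it is widely used in control systems, signal processing, communications, and related fields.
Matlab supports fast model building and parameter estimation for system identification.
This article introduces the system identification techniques commonly used in the Matlab environment and their applications.

II. Basic concepts of system identification
System identification infers the structure and parameters of a system by observing and analyzing its input and output signals.
In general it comprises three steps: building a mathematical model, estimating the system parameters, and validating the model.

1. Building a mathematical model
Building a mathematical model is the first step of system identification; the model is a mathematical expression that describes the behavior of the system.
Common model classes include linear models, nonlinear models, and time-varying models.

2. Estimating the system parameters
Once the model structure has been chosen, the system parameters are estimated by analyzing experimental data.
Parameter estimation can be carried out with least squares, maximum likelihood estimation, and similar methods.

3. Model validation
Model validation determines whether the estimated system model is accurate.
Common approaches include empirical validation, residual analysis, and statistical model tests.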
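III. Common system identification techniques

1. Linear parametric models
Linear parametric models are among the most commonly used tools in system identification.
They assume that the system is linear and describe it by estimating the parameters of a linear model.
In Matlab, the "arx" function of the System Identification Toolbox can be used to identify a linear parametric model.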
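As a rough illustration of what least-squares estimation of a linear parametric model involves, the following Python sketch fits a first-order ARX model y(k) + a*y(k-1) = b*u(k-1) to noise-free simulated data. It is not the toolbox's "arx" implementation; the model order, the coefficients -0.8 and 0.5, and all function names are assumptions chosen for the example.

```python
import random


def simulate_arx(a, b, u, noise=0.0, seed=0):
    """Simulate y(k) = -a*y(k-1) + b*u(k-1) + noise*e(k) with y(0) = 0."""
    rng = random.Random(seed)
    y = [0.0]
    for k in range(1, len(u)):
        y.append(-a * y[k - 1] + b * u[k - 1] + noise * rng.gauss(0.0, 1.0))
    return y


def fit_arx_1_1(u, y):
    """Least-squares estimate of (a, b) from the 2x2 normal equations."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for k in range(1, len(y)):
        p1, p2 = -y[k - 1], u[k - 1]  # regressor [-y(k-1), u(k-1)]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        r1 += p1 * y[k]
        r2 += p2 * y[k]
    det = s11 * s22 - s12 * s12
    a_hat = (r1 * s22 - r2 * s12) / det
    b_hat = (s11 * r2 - s12 * r1) / det
    return a_hat, b_hat


# Hypothetical experiment: random binary input, assumed true parameters.
rng = random.Random(1)
u = [rng.choice([-1.0, 1.0]) for _ in range(200)]
y = simulate_arx(-0.8, 0.5, u)
a_hat, b_hat = fit_arx_1_1(u, y)
print(a_hat, b_hat)
```

With noise-free data the estimates reproduce the assumed coefficients up to rounding; setting `noise` above zero shows how measurement noise perturbs them.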
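2. Neural network models
A neural network model is a nonlinear model that describes system behavior through the connection weights of artificial neurons.
In Matlab, the "nlarx" function estimates nonlinear ARX models, which can use a neural network as the nonlinearity estimator.

3. System identification toolboxes
Matlab offers toolboxes that support system identification, including the System Identification Toolbox and the Neural Network Toolbox.
These toolboxes provide a range of methods and functions for system identification analysis.

IV. Application cases of system identification with Matlab

1. System identification in control systems
System identification is widely applied in control systems, for example in unmanned aerial vehicle control and robot control.
Identifying the system yields a mathematical model that can then be used for controller design and system optimization.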
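Discuss several basic methods of neural network system modeling.
Use a BP network to identify the following nonlinear system.
Nonlinear system:
$$y(k+1) = \frac{y(k)\,\bigl(2y(k-1)+1\bigr)}{1+y^2(k)+y^2(k-1)} + u(k)$$
1) First use u(k)=sin(2*pi*k/3)+1/3*sin(2*pi*k/6) to generate 500 sample points, feed them to the system above, and produce y(k) for training the BP network.
2) Network test: use u(k)=sin(2*pi*k/4)+1/5*sin(2*pi*k/7) to generate 200 test points, feed them to the system above, produce y(k), and check the modeling performance of the BP/RBF network.
3) Using the model reference adaptive method, design a neural-network MRAC (NNMRAC) controller and simulate closed-loop tracking of a square-wave reference with period 50 and amplitude +/- (value missing in the source) to check the control performance (required overshoot < 5%).
Provide the source program, a diagram of the neural network structure, the computed results (weight matrices), and plots of the simulated dynamic response.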
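1. System identification
The nonlinear system in the problem can be written as:
$$y(k+1) = f(\cdot) + u(k); \qquad f(\cdot) = \frac{y(k)\,\bigl(2y(k-1)+1\bigr)}{1+y^2(k)+y^2(k-1)}$$
A BP network is used to identify the nonlinear part $f(\cdot)$. The network structure is shown in the figure; the layers contain 2-8-1 neurons, the input data are y(k-1) and y(k-2), and the output data are y(k).
Figure: Structure of the BP network used to identify the nonlinear system
The network is trained with 500 samples until the preset error goal is reached; the training process is shown in the figure.
Figure: Network training process
The network is then tested with 200 new test points; the network output on the test data and the corresponding error are shown in the figures below.
The figures show that the identification error on the test data is slightly larger than on the training data but stays within ±0.06, so the fit is reasonably good.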
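To make the 2-8-1 structure concrete, the following Python sketch trains a network with two inputs, eight tanh hidden units, and one linear output on samples of $f(\cdot)$. It uses plain stochastic gradient descent on a regular grid of inputs, which is a simplification and not the training function or the trajectory data used by the MATLAB program; the class name, learning rate, grid, and epoch count are all assumptions.

```python
import math
import random


def target(y1, y2):
    """Nonlinear function to be identified."""
    return y1 * (2.0 * y2 + 1.0) / (1.0 + y1 ** 2 + y2 ** 2)


class BPNet:
    """2-8-1 feedforward network: tanh hidden layer, linear output."""

    def __init__(self, n_in=2, n_hid=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hid)]
        self.b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
        self.b2 = 0.0

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        out = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, out

    def train_step(self, x, t, lr):
        """One backpropagation update for the squared error 0.5*(out - t)^2."""
        h, out = self.forward(x)
        err = out - t
        for j, hj in enumerate(h):
            # Hidden-layer delta uses the output weight before it is updated.
            delta = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * delta
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta * xi
        self.b2 -= lr * err


def mse(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Assumed training set: a 9x9 grid over [-2, 2] x [-2, 2].
grid = [-2.0 + 0.5 * i for i in range(9)]
data = [((a, b), target(a, b)) for a in grid for b in grid]

net = BPNet()
mse_before = mse(net, data)
rng = random.Random(1)
for epoch in range(300):
    rng.shuffle(data)
    for x, t in data:
        net.train_step(x, t, lr=0.02)
mse_after = mse(net, data)
print(mse_before, mse_after)
```

Comparing `mse_before` with `mse_after` gives a quick check that training reduced the fitting error; a held-out set, as in step 2) of the assignment, is still needed to judge generalization.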
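Figure: Test results of identification with the BP network
Figure: Test error of identification with the BP network

clear all; close all;
%% Generate training data and test data
U=0; Y=0; T=0;
u_1(1)=0; y_1(1)=0; y_2(1)=0;
for k=1:1:500   % 500 sample points for training
    U(k) = sin(2*pi/3*k) + 1/3*sin(2*pi/6*k);
    T(k) = y_1(k) * (2*y_2(k) + 1) / (1 + y_1(k)^2 + y_2(k)^2);   % corresponding target value
    Y(k) = u_1(k) + T(k);   % nonlinear system output, used to update y_1
    if k<500
        u_1(k+1) = U(k); y_2(k+1) = y_1(k); y_1(k+1) = Y(k);
    end
end
% A very small number is added to avoid zero vectors after stacking.
% The numeric values on the next line were lost in extraction and are left as found.
y_1(1)=; y_1(2)=0; y_2(1)=0; y_2(2)=; y_2(3)=0;
X=[y_1;y_2];
save('traindata','X','T');
clearvars -except X T ;   % clear all other variables

U=0; Y=0; Tc=0;
u_1(1)=0; y_1(1)=0; y_2(1)=0;
for k=1:1:200   % 200 sample points for testing
    U(k) = sin(2*pi/4*k) + 1/5*sin(2*pi/7*k);   % new test input
    Y(k) = u_1(k) + y_1(k) * (2*y_2(k) + 1) / (1 + y_1(k)^2 + y_2(k)^2);
    if k<200
        u_1(k+1) = U(k); y_2(k+1) = y_1(k); y_1(k+1) = Y(k);
    end
end
Tc=Y; Uc=u_1;
% Same small offset as above; values lost in extraction and left as found.
y_1(1)=; y_1(2)=0; y_2(1)=0; y_2(2)=; y_2(3)=0;
Xc=[y_1;y_2];
save('testdata','Xc','Tc','Uc');   % save the test data
clearvars -except Xc Tc Uc ;       % clear all other variables
load traindata; load testdata;     % load training and test data

%% Build and train the network
[R,Q] = size(X); [S,~] = size(T); [Sc,Qc] = size(Tc);
Hid_num = 8;                    % 8 hidden neurons work well here
val_iw = rands(Hid_num,R);      % initial weights of the hidden layer
val_b1 = rands(Hid_num,1);      % initial biases of the hidden layer
val_lw = rands(S,Hid_num);      % initial weights of the output layer
val_b2 = rands(S,1);            % initial biases of the output layer
net = newff(X,T,Hid_num);       % create the BP network with default parameters
% The left-hand sides of the next statements were lost in extraction and are
% left as found; they presumably set fields of net.trainParam and the
% network's weight and bias properties.
 = 50;              % set the number of training epochs
                    % set the mean-square-error goal (statement lost)
                    % set the learning rate (statement lost)
{1,1} = val_iw;     % initial weights and biases
{2,1} = val_lw;
{1} = val_b1;
{2} = val_b2;
[net,tr] = train(net,X,T);      % train the network
save('aaa', 'net');             % save the trained network

%% Test the network
A = sim(net,X);                 % network output on the training data
E = T - A;                      % training error
error = sumsqr(E)/(S*Q)         % MSE on the training data
A1 = sim(net,Xc);               % network output on the test data
Yc = A1 + Uc;
E1 = Tc - Yc;                   % test error
error_c = sumsqr(E1)/(Sc*Qc)    % MSE on the test data
figure(1); plot(Tc,'r'); hold on; plot(Yc,'b');
legend('exp','act'); xlabel('test sample'); ylabel('output')
figure(2); plot(E1); xlabel('test sample'); ylabel('error')

2. MRAC controller
The controlled plant is the nonlinear system:
$$y(k+1) = f(\cdot) + u(k); \qquad f(\cdot) = \frac{y(k)\,\bigl(2y(k-1)+1\bigr)}{1+y^2(k)+y^2(k-1)}$$
From the identification of $f(\cdot)$ in Part 1, the identified model of this nonlinear system is:
$$y_p(k+1) = N_I[y(k),\,y(k-1)] + u(k)$$
It follows that u(k) can be expressed as a function of $y_p(k+1)$ and of $y(k)$, $y(k-1)$, so the controller can be designed from the inverse model of the system.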
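The inverse-model idea can be sketched in Python by solving the identified model for the control input, u(k) = r(k+1) - N_I[y(k), y(k-1)], where r is the reference. This is only the one-step inverse law, not a full model reference adaptive controller with a reference model and online adaptation. The true $f(\cdot)$ stands in for the trained network, and the square-wave amplitude of 1.0 is a placeholder, since the value is missing from the source.

```python
def f(y_k, y_km1):
    """True nonlinear part of the plant."""
    return y_k * (2.0 * y_km1 + 1.0) / (1.0 + y_k ** 2 + y_km1 ** 2)


def square_wave(k, period=50, amplitude=1.0):
    """Square-wave reference; the amplitude is an assumed placeholder."""
    return amplitude if (k % period) < period // 2 else -amplitude


def run_closed_loop(model, n_steps=200):
    """Apply u(k) = r(k+1) - model(y(k), y(k-1)) to the plant."""
    y = [0.0]       # y(0), assumed zero
    r = [0.0]       # r(0), unused placeholder so indices line up
    y_prev = 0.0    # y(-1), assumed zero
    for k in range(n_steps):
        r_next = square_wave(k + 1)
        u = r_next - model(y[k], y_prev)   # inverse-model control law
        y_next = f(y[k], y_prev) + u       # plant response
        y_prev = y[k]
        y.append(y_next)
        r.append(r_next)
    return y, r


# With a perfect model (model == f) the output follows the reference exactly.
y_out, r_out = run_closed_loop(model=f)
max_err = max(abs(a - b) for a, b in zip(y_out[1:], r_out[1:]))
print(max_err)
```

Replacing `model=f` with an imperfect approximation shows how identification error turns directly into tracking error, which is why an adaptive scheme is of interest.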
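Techniques for Nonlinear System Identification and Control with Matlab

In control engineering, nonlinear systems have long been both a focus and a difficulty of research.
Unlike linear systems, nonlinear systems exhibit complex dynamics and response behavior, which makes modeling, identification, and control challenging.
With the rapid development of computer technology, however, powerful software tools such as Matlab can now be used to study nonlinear system identification and control.
This article shares some techniques for nonlinear system identification and control with Matlab, in the hope that they are useful to researchers in the field.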
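I. Nonlinear system identification
Nonlinear system identification determines a mathematical model of a system from experimental data in order to describe its dynamic behavior.
The most common approach identifies the model from the measured system response.
It usually involves the following steps:

1. Data acquisition and preprocessing: First, experimental data are collected for identification.
During acquisition, the influence of noise should be kept as small as possible and the reliability of the data ensured.
The collected data are then preprocessed, for example by filtering and resampling, to remove noise and disturbances.

2. Model structure selection: A model structure suited to the system's dynamics must be chosen.
Common structures include the nonlinear autoregressive moving average (NARMA) model and the generalized regression neural network (GRNN).
Choosing a suitable structure is essential for describing the system's nonlinear behavior accurately.

3. Parameter estimation: Given the chosen structure, the model parameters are estimated by least squares or another estimation algorithm.
MATLAB provides several estimation algorithms and toolboxes, such as the System Identification Toolbox, that make parameter estimation convenient.

4. Model validation and evaluation: After parameter estimation, the identified model should be validated and evaluated.
A common approach is to compute the root mean square error (RMSE) and the coefficient of determination (R-squared) of the model; poor values indicate that the model structure or the estimated parameters should be revised.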
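The two validation metrics named in step 4 can be computed as in the following Python sketch. The sample values are made up for illustration and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted outputs."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up measured outputs and model predictions.
y_meas = [1.0, 2.0, 3.0, 4.0]
y_model = [1.1, 1.9, 3.2, 3.8]
print(rmse(y_meas, y_model), r_squared(y_meas, y_model))
```

For a meaningful assessment both metrics should be computed on data that were not used for parameter estimation.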
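II. Nonlinear system control
Nonlinear system control designs a control strategy that stabilizes a nonlinear system and meets its performance requirements.
Like nonlinear system identification, nonlinear control can be studied and designed with Matlab.
Some commonly used techniques are:

1. Feedback linearization control: Feedback linearization uses nonlinear feedback to cancel the system's nonlinearities so that the closed loop behaves like a linear system; this differs from approximating the system by a linear model around an operating point.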
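An Introduction to Common Automated Modeling Methods in MATLAB

As technology advances, automated modeling is becoming increasingly important in many fields.
MATLAB, a powerful tool for mathematical modeling and simulation, offers researchers and engineers many automated modeling methods.
This article introduces several of them: system identification, machine learning, and optimization.

I. System identification
System identification estimates a system model from observations of the system's input and output data when the model cannot be obtained directly.
MATLAB provides several functions and toolboxes for system identification, the most commonly used being the System Identification Toolbox.
The System Identification Toolbox supports parameter estimation, model structure selection, and model validation.
Identifying a model with the toolbox generally involves the following steps: collecting input and output data, choosing a suitable model structure, estimating the parameters, and validating the model.
Following these steps yields a model that describes the dynamic behavior of the system.

II. Machine learning
Machine learning lets a computer learn from data and then make predictions or decisions on new data.
MATLAB offers many machine learning algorithms, including support vector machines (SVM), artificial neural networks (ANN), and decision trees.
The support vector machine is a binary classifier grounded in statistical learning theory; its main idea is to classify data by finding an optimal separating hyperplane in a high-dimensional feature space.
In MATLAB, functions for training and applying support vector machine models are provided by the Statistics and Machine Learning Toolbox.
An artificial neural network is an algorithm modeled on networks of neurons in the human brain; by learning from sample data it can perform classification, regression, clustering, and other tasks.
MATLAB's Neural Network Toolbox provides functions and tools for building, training, and applying neural networks.
A decision tree classifies data by partitioning it.
The model assigns data to classes through a sequence of decision conditions.
In MATLAB, decision tree models can be built and trained with the Classification Learner app, and random forest models can be built and trained with the TreeBagger function.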
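LPV Model of a Hydraulic Excavator Identified with a Neural Network
SHAO Hui (邵辉); HU Yanli (胡艳丽); HONG Xuemei (洪雪梅); WANG Fei (王飞)

[Abstract] A linear parameter varying (LPV) model based on neural network identification is proposed for modeling the boom of a hydraulic excavator. The model of the joint angle is obtained from the first-order plus dead time model of the joint velocity at each working point. Depending on the characteristics of the scheduling variables, the LPV model parameters are identified with a neural network, and the global LPV model of the excavator boom over the workspace is designed. Simulations and experiments indicate the accuracy of the model and the validity of the method.

[Journal] Journal of Huaqiao University (Natural Science)
[Year (volume), issue] 2016(000)001
[Pages] 5 (P43-47)
[Keywords] hydraulic excavator; boom joint; neural network; linear parameter varying; identification
[Author affiliation] College of Information Science and Engineering, Huaqiao University, Xiamen, Fujian 361021 (all four authors)
[Language] Chinese
[CLC number] TP273

The hydraulic excavator is one of the most structurally complex and most widely used construction machines. At present most hydraulic excavators are operated manually, which is slow and inefficient and cannot cope with relatively dangerous environments [1]. Automatic control of the excavator is therefore a necessary route to higher efficiency and safety. For complex nonlinear systems, linear modeling methods [2] are unsuitable when the nonlinear factors vary widely, while existing nonlinear modeling methods such as mechanistic models [3], Volterra series [4], nonlinear ARMAX [5], and Wiener models [6] also have many shortcomings. The greatest problems are the complexity of the nonlinear process and the high cost of identification. A better, low-cost nonlinear modeling method is therefore needed.

The linear parameter varying (LPV) framework was first introduced by Shamma et al. [7] in their study of gain-scheduled control. For most industrial processes with nonlinear characteristics, the system does not move randomly and without order through the whole operating domain; instead there is a scheduling variable related to the system dynamics. The dynamics of an LPV system depend on external parameters that can be measured in real time, and its scheduling parameters reflect the nonlinear or time-varying characteristics of the system, so an LPV model built from the scheduling variables can meet the requirements of subsequent control design. There are two ways to obtain an LPV model from a nonlinear system: the analytical method based on the system's dynamic equations [8] and the experimental method based on the system's inputs and outputs [9]; the experimental method is commonly used to identify black-box LPV models [10]. To identify the system model it must further be parameterized. References [11-13] studied this extensively, mostly expressing the process model parameters as nonlinear functions of the scheduling variables and estimating them by recursive least squares to obtain the LPV model. Because the scheduling variables are highly interdependent and the functional relationships among them are unclear, and because neural networks can identify highly nonlinear multi-input multi-output systems quickly and effectively, this paper adopts the experimental method and proposes a neural-network-based nonlinear identification method for LPV models.

The control system of a hydraulic excavator controls the power system comprising the engine, hydraulic pumps, multi-way directional valves, and actuators (hydraulic cylinders and hydraulic motors) [2], as shown in Fig. 1. Accurate control of the excavator requires an accurate model. The boom joint of the excavator is selected and modeled. At a single working point, under the usual engineering simplification, the relation between valve opening and boom joint velocity is a first-order plus dead time system, so the transfer function from valve opening to joint angle can be expressed as Eq. (1) (equation not preserved in the source).

Eq. (1) contains three parameters of the parameter-dependent system: the steady-state gain K, the time constant T of the second-order system, and the delay τ. The whole system is a typical nonlinear time-varying system; K, T, and τ are functions of the scheduling variables and are all affected by the valve opening, the boom position, and its direction of motion. Through the valve opening, hydraulic oil delivered by the pump is supplied to each component so that the excavator performs its tasks, while timely feedback of the boom's direction of motion and position drives the excavator's diverter valve so that the excavator works on the desired working surface. The valve opening, the direction of boom motion, and the boom angle are chosen as the scheduling variables of the system. The system can then be expressed as Eq. (2) (equation not preserved in the source).

In Eq. (2), w is the working point of the system; η is the valve opening, with range [ηmin, ηmax]; θ is the boom angle, with range [θmin, θmax]; the boom velocity (symbol and range lost in the source) is used to represent the direction of motion; and T(w), K(w), τ(w) are the time constant, steady-state gain, and delay of the system, which together form the parameter set to be identified.

A two-layer feedforward neural network is used. The scheduling variables (valve opening η, angle θ, and direction of motion) are the inputs of the network; the parameter set of the mathematical model (steady-state gain K(w), time constant T(w), and delay τ(w)) is the predicted output of the network, as shown in Fig. 2. Let Ui be the output of input-layer node i, Uk the output of output-layer node k, and Uj the output of hidden-layer node j. The input of the j-th hidden node is given by Eq. (3) and its output by Eq. (4) (equations not preserved in the source), where f(Up,j) is the node activation function. The output Uj of the j-th node is propagated forward to the k-th node through the weighting coefficient j,k, and the total input of the k-th output node is given by Eq. (5) (not preserved in the source), where q is the number of hidden nodes. The actual network output of the k-th output node is given by Eq. (6) (not preserved in the source).

To predict with the neural network, the network must first be trained so that it acquires associative memory and predictive ability. Training consists of the following six steps.
Step 1: Initialize the network.
Step 2: Provide the training set. Working-point test data are selected according to the scheduling variables. The input vector is (symbols lost in the source) and the desired output vector is D=(K,T,τ).
Step 3: Compute the actual output.
Step 4: Compute the prediction error from the network's predicted output and the desired output.
Step 5: Update the weights.
Step 6: Check whether the algorithm has finished; if not, return to Step 2.
After repeated training on the experimental data, the neural network reached the desired performance.

Real data from a Hitachi ZAXIS-120 excavator (12 t, digging capacity 0.52 m3) are used. A simulation model of the excavator boom is built in MATLAB/Simulink. Several typical working points are selected within the operating range, that is, for η∈[0%,100%], θ∈[-70.5°,44.75°] (relative to the excavator base coordinates), and direction of motion 1 or -1 (1 for upward, -1 for downward; the symbol is lost in the source). Part of the data extracted from the excavator is listed in Table 1, where n is the number of experiments and τ, K, T are approximate values. In Table 1 the measured pose is the angle of the boom joint: pose No.1 is the boom joint at its upper limit of 70.5°, and pose No.2 is the boom joint at the horizontal, 10.5°.

Data are collected at the typical working points and the neural network is built; after repeated training the results are as shown in Fig. 3. Fig. 3 shows that the neural network fits the data well. With the trained network, a simulation model of the excavator is built in Matlab/Simulink. Step and sinusoidal response simulations are run for the boom moving from pose No.1 to pose No.2 at a valve opening of 100%, and from pose No.2 to pose No.1 at openings of 43.75% and 31.25%; the results are shown in Fig. 4. Comparison in Fig. 4 shows that the LPV model approximates the real process well; because the scheduling variables are global, the LPV model can reproduce the system's behavior over the whole working range, which reduces the modeling effort.

To simplify the model and reduce cost, an LPV identification method based on a neural network is proposed. The method builds on a mathematical model of simple structure, takes into account the nonlinear time-varying characteristics of the system, and uses a neural network to relate the scheduling variables to the parameters to be identified. The simulation experiments lead to two conclusions.
1) Introducing the neural network to identify the dynamic model parameters allows the system model to be built simply and quickly, and the model is valid over the whole range.
2) The neural-network-based LPV model has a simple structure, which keeps the subsequent controller design simple and feasible.

[References]
[1] LI Bo, YAN Jun, GUO Gang, et al. High performance control of hydraulic excavator based on fuzzy-PI soft-switch controller[C]∥IEEE International Conference on Computer Science and Automation Engineering. Shanghai: IEEE Press, 2011: 676-679.
[2] LU Guangming, SUN Lining, XU Yuan. BP network control over the track of working device of hydraulic excavator[J]. Chinese Journal of Mechanical Engineering, 2005, 41(5): 199-122.
[3] XIANG Qiangzhong, LI Dongliang. Mechanical-hydraulic coupling simulation for hydraulic excavator working mechanism[C]∥2nd International Conference on Advanced Engineering Materials and Technology. Zhuhai: Advanced Materials Research, 2012: 494-497.
[4] BOUILLOC T, FAVIER G. Nonlinear channel modeling and identification using baseband Volterra Parafac models[J]. Signal Processing, 2012, 92(6): 1492-1498.
[5] SHARDT Y, HUANG Biao. Closed-loop identification condition for ARMAX models using routine operating data[J]. Automatica, 2011, 47(7): 1534-1537.
[6] BIAGIOLA S I, FIGUEROA J L. Identification of uncertain MIMO Wiener and Hammerstein models[J]. Computers and Chemical Engineering, 2011, 35(12): 2867-2875.
[7] SHAMMA J, ATHANS M. Guaranteed properties of gain scheduled control for linear parameter-varying plants[J]. Automatica, 1991, 27(3): 559-564.
[8] YUE Ting, WANG Lixin, AI Junqiang. Gain self-scheduled H1 control for morphing aircraft in the wing transition process based on an LPV model[J]. Chinese Journal of Aeronautics, 2013, 26(4): 909-917.
[9] CASELLA F, LOVERA M. LPV/LFT modeling and identification: Overview synergies and a case study[C]∥IEEE Conference on Computer Aided Control System Design. San Antonio: IEEE Press, 2008: 852-857.
[10] SALAH C P, EL-DINE, MAHDI S, et al. Black-box versus grey-box LPV identification to control a mechanical system[C]∥IEEE 51st Annual Conference on Decision and Control. Maui: IEEE Press, 2012: 5152-5157.
[11] KNOBLACH A, SAUPE F. LPV gray box identification of industrial robots for control[C]∥IEEE International Conference on Control Applications. Dubrovnik: IEEE Press, 2012: 831-836.
[12] BAMIEH B, GIARRE L. Identification for linear parameter varying models[J]. International Journal of Robust and Nonlinear Control, 2002, 12(9): 841-853.
[13] 邵辉, 野波健藏. LPV modeling of Peltier thermoelectric devices and multiple-reference-model IPD adaptive control[J]. Journal of Nanjing University of Science and Technology, 2011, 35(Suppl. 1): 85-90.
[14] 邵辉, 胡伟石, 罗继亮. Motion planning for an automatic excavator[J]. Control Engineering of China, 2012, 19(4): 594-597.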
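The single-working-point joint model described in the excavator paper above can be illustrated with a short Python sketch. Eq. (1) is not preserved in the source, so the sketch assumes the usual form implied by the text: the joint velocity follows a first-order plus dead time response to the valve opening, and the joint angle is the integral of that velocity. The parameter values, the Euler discretization, and the function name are all assumptions; in the LPV model, K, T, and τ would be supplied by the neural network for each working point.

```python
def simulate_joint_angle(K, T, tau, valve, dt=0.01, t_end=5.0):
    """Step response of the assumed joint model at one working point.

    Velocity: T * dv/dt + v = K * u(t - tau)   (first-order plus dead time)
    Angle:    d(angle)/dt = v                  (integrator)
    `valve` is a constant valve opening applied at t = 0.
    """
    n_steps = int(round(t_end / dt))
    delay_steps = int(round(tau / dt))
    vel, ang = 0.0, 0.0
    angles = []
    for k in range(n_steps):
        u = valve if k >= delay_steps else 0.0  # input reaches the joint after the delay
        vel += dt * (K * u - vel) / T           # forward-Euler step of the first-order lag
        ang += dt * vel                         # integrate velocity to angle
        angles.append(ang)
    return angles


# Assumed parameters for one hypothetical working point.
angles = simulate_joint_angle(K=2.0, T=0.3, tau=0.2, valve=0.5)
final_slope = (angles[-1] - angles[-2]) / 0.01
print(angles[-1], final_slope)
```

After the dead time the angle starts to move, and once the lag has settled the angle grows at the constant rate K times the valve opening, which is what the final slope shows.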
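A Survey of System Identification with Artificial Neural Networks

Abstract: System identification technology has gradually matured, and identification methods based on artificial neural networks are being applied more and more widely across many fields.
This survey first compares neural network identification with classical identification methods to show its advantages, then explains it concretely through improved algorithms, and finally looks ahead to the directions in which neural network system identification may develop.
Keywords: neural network; system identification; system modeling

0 Introduction
As society progresses, more and more real systems have become complex systems with uncertainty. Applied to such systems, classical identification methods show the following shortcomings:
(1) In some dynamic systems the input often cannot be guaranteed, whereas least-squares identification generally requires the input signal to be known and sufficiently rich.
(2) Traditional identification methods work better on linear systems than on nonlinear systems.
(3) Traditional identification methods share two common weaknesses: they cannot determine the structure and the parameters of the system at the same time, and they often fail to find the global optimum.
Compared with traditional methods, identification based on neural networks has the following characteristics. First, the step of modeling the system structure can be omitted; no identification format for the real system needs to be established. Second, the convergence speed of identification depends only on the neural network itself and the learning algorithm it uses, so inherently nonlinear systems can be identified. Finally, the network output can be made to approximate the system output by adjusting the connection weights; as an identification model of the real system, the neural network can also be used for online control.

1 Neural network system identification

1.1 Neural networks
Artificial neural networks developed rapidly at the end of the 20th century and are widely applied in many fields, especially pattern recognition, signal processing, engineering, expert systems, combinatorial optimization, and robot control.
As neural network theory and the related theories and technologies continue to develop, the applications of neural networks will certainly deepen.
Neural networks, including feedforward networks and recurrent dynamic networks, turn the problem of determining a nonlinear mapping into an optimization problem; one improved identification method carries out this optimization by adjusting the weight matrices of the network.

1.2 Principle of identification
The essence of using a neural network for system identification is to choose a suitable neural network model to approximate the real system.
Identification has three key elements: the model, the data, and the error criterion.
System identification is in fact an optimization problem; its optimization criterion is determined by factors such as the purpose of identification and the complexity of the identification algorithm.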
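Beijing Technology and Business University
"System Identification" course survey report
Topic category: classification of system modeling; modern identification methods
Report title: A survey of identification methods based on neural networks and fuzzy control

Contents
Chapter 1 Overview of system identification theory
1.1 Basic principles of system identification
1.2 Classical methods of system identification
1.3 Overview of neural network system identification
1.3.2 Neural networks in nonlinear system identification
1.4 Overview of fuzzy system identification
1.4.1 Structure identification of fuzzy systems
1.4.2 Parameter optimization methods
1.4.3 Simplification of the fuzzy rule base
1.5 Summary
Chapter 2 Fuzzy model identification methods
2.1 Workflow of fuzzy model identification
2.2 Structure identification of fuzzy models
2.3 Parameter identification of fuzzy models
2.4 Other issues in fuzzy system identification
2.4.1 Criteria for judging nonlinear modeling methods
2.4.2 Issues in applying fuzzy identification algorithms to real systems
2.4.3 Quality indices of fuzzy models
2.5 Summary
Chapter 3 Identification of a bicycle robot system based on two models
3.1 Identification of the bicycle robot system with an ARX model
3.2 Identification of the bicycle robot system with an ANFIS fuzzy neural network
3.3 Outlook

Chapter 1 Overview of system identification theory

1.1 Basic principles of system identification
According to L. A. Zadeh's definition of system identification (1962), system identification is the determination, on the basis of input and output data, of a model within a given class of models that is equivalent to the system under test. System identification has three key elements:
(1) Data: the observable input and output data of the system to be identified, which are the basis of identification.
(2) Model class: the range of models to be searched, that is, the model structures under consideration.
(3) Equivalence criterion: the optimization objective of identification, which measures how closely the model approximates the real system.

1.2 Classical methods of system identification
1. Identification by the step response method;
2. Identification by the frequency response method;
3. Identification by correlation analysis;
4. Other common identification methods.

1.3 Overview of neural network system identification

1.3.1 Neural networks in linear system identification
The adaptive linear (Adaline/Madaline) neural network, an early neural network model that is the counterpart of the perceptron, takes continuous linear analog quantities as its input patterns; it is a continuous-time linear neural network whose topology closely resembles that of the perceptron network.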
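As a small illustration of how an adaptive linear element can identify a linear system, the following Python sketch trains a single linear neuron with the least mean squares (LMS, or Widrow-Hoff) rule to recover the coefficients of a two-tap FIR system. The system coefficients, step size, and input sequence are assumptions chosen for the example, and the sketch works in discrete time.

```python
import random

# Assumed "unknown" system: d(k) = 0.7*u(k) - 0.3*u(k-1).
TRUE_W = [0.7, -0.3]

rng = random.Random(0)
u = [rng.choice([-1.0, 1.0]) for _ in range(2000)]  # random binary input

w = [0.0, 0.0]  # Adaline weights, one per input tap
mu = 0.05       # LMS step size (assumed)

for k in range(1, len(u)):
    x = [u[k], u[k - 1]]                              # current input vector
    d = sum(tw * xi for tw, xi in zip(TRUE_W, x))     # measured system output
    y = sum(wi * xi for wi, xi in zip(w, x))          # Adaline output
    e = d - y                                         # output error
    w = [wi + mu * e * xi for wi, xi in zip(w, x)]    # Widrow-Hoff update

print(w)
```

Because the data are noise-free and the input excites both taps, the weights settle on the assumed coefficients; with measurement noise they would fluctuate around them with a spread that grows with the step size.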
Script identification in the wild via discriminative convolutional neural network
Baoguang Shi, Xiang Bai*, Cong Yao
School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China
Pattern Recognition 52 (2016) 448–458. 10.1016/j.patcog.2015.11.005. 0031-3203/© 2015 Elsevier Ltd. All rights reserved.
* Corresponding author. E-mail addresses: shibaoguang@ (B. Shi), xbai@ (X. Bai), yaocong2010@ (C. Yao).

Article info
Article history: Received 25 May 2015; received in revised form 23 October 2015; accepted 10 November 2015; available online 1 December 2015.
Keywords: Script identification; Convolutional neural network; Mid-level representation; Discriminative clustering; Dataset

Abstract
Script identification facilitates many important applications in document/video analysis. This paper investigates a relatively new problem: identifying scripts in natural images. The basic idea is combining deep features and mid-level representations into a globally trainable deep model. Specifically, a set of deep feature maps is firstly extracted by a pre-trained CNN model from the input images, where the local deep features are densely collected. Then, discriminative clustering is performed to learn a set of discriminative patterns based on such local features. A mid-level representation is obtained by encoding the local features based on the learned discriminative patterns (codebook). Finally, the mid-level representations and the deep features are jointly optimized in a deep network. Benefiting from such a fine-grained classification strategy, the optimized deep model, termed Discriminative Convolutional Neural Network (DisCNN), is capable of effectively revealing the subtle differences among the scripts difficult to be distinguished, e.g. Chinese and Japanese. In addition, a large scale dataset containing 16,291 in-the-wild text images in 13 scripts, namely SIW-13, is created for evaluation. Our method is not limited to identifying text images, and performs effectively on video and document scripts as well, not requiring any preprocess like binarization, segmentation or hand-crafted features. The experimental comparisons on the datasets including SIW-13, CVSI-2015 and Multi-Script consistently demonstrate DisCNN a state-of-the-art approach for script identification.

1. Introduction
Script identification is one of the key components in Optical Character Recognition (OCR), which has received much attention from the document analysis community, especially when the data being processed is in multi-script or multi-language form. Due to the rapidly increasing amount of multimedia data, especially those captured and stored by mobile terminals, how to recognize text content in natural scenes has become an active and important task in the fields of pattern recognition, computer vision and multimedia [15,16,25,26,47,49,59,23,57,38,8,41]. Different from the previous approaches which have been mainly designed for document images [48,19,20] or videos [40,58], this work focuses on identifying the language/script types of texts in natural images (in the wild), at word or text line level. This problem has seldom been fully studied before. As texts in natural scenes often carry rich, high level semantics, there exist many efforts in scene text localization and recognition [37,9,59,54–56,44]. Script identification in the wild is an inevitable preprocessing of a scene text understanding system under multi-lingual scenarios [5,6,22], potentially useful in many applications such as scene understanding [32], product image search [17], mobile phone navigation, film caption recognition [11], and machine translation [4,50].

Given an input text image, the task of script identification is to classify it into one of the pre-defined script categories (English, Chinese, Greek, etc.). Naturally, this problem can be cast as an image classification problem, which has been extensively studied. However, script identification in scene text images remains a challenging task, and has its characteristics that are quite different from document/video script identification, or general image classification, mainly due to the following reasons:
1. In natural scenes, texts exhibit larger variations than they do in documents or videos. They are often written/printed on outdoor signs and advertisements, in some artistic styles. Often, there exist large variations in their fonts, colors, and layout shapes.
2. The quality of text images will affect the identification accuracy. As scene texts are often captured under uncontrolled environments, the difficulties in identification may be caused by several factors such as low resolutions, noises, and illumination changes. Document/video analysis techniques such as binarization and component analysis tend to be unreliable.
3. Some scripts/languages have relatively minor differences, e.g. Greek, English and Russian. As illustrated in Fig. 1, these scripts share a subset of characters that have exactly the same shapes. Distinguishing them relies on special characters or character components, and is a fine-grained classification problem.
4. Text images have arbitrary aspect ratios, since text strings have arbitrary lengths, ruling out some image classification methods that only operate on fixed-size inputs.

Recently, CNN has achieved great success in image classification tasks [27], due to its strong capacity and invariance to translation and distortions. To handle the complex foregrounds and backgrounds in scene text images, we choose to adopt deep features learned by CNN as the basic representation. In our method, a deep feature hierarchy, which is a set of feature maps, is extracted from the input images through a pretrained CNN [29]. The hierarchy carries rich and multi-scale representations of the images. The differences among some certain scripts are subtle or even tiny, thus a holistic representation would not work well. Typical image classification algorithms, such as the conventional CNN [27] and the Single-Layer Networks (SLN) [10], usually describe images in a holistic style without explicit emphasis on discriminative patches that play an important role in distinguishing some script categories (e.g. English and Greek). Therefore, to explicitly capture fine-grained features of scripts, we extracted a set of common patterns, termed as discriminative patterns (the image patches containing the representative strokes or components) from script images via discriminative clustering [45]. Such common patterns represented by deep features can be treated as a codebook for encoding the dense deep features into a feature vector, providing a mid-level representation of an input script image. A pooling strategy inspired by the Spatial Pyramid Pooling [18,28], called horizontal pooling, is adopted in the mid-level representation process. This strategy enables our method to capture topological structure of texts, and naturally handles input images of arbitrary aspect ratios. To maximize the discriminatory power of the mid-level representations, we put the above two modules, namely the convolutional layers for extracting deep feature hierarchy and the discriminative encoder for extracting the mid-level representations, into a single deep network for joint optimization with the back-propagation algorithm [30]. The global fine-tuning process optimizes both the deep features and the mid-level representation, effectively integrating the global features (deep feature maps) and fine-grained features (discriminative patterns) for script identification.

This paper is a continuation and extension of our previous work [43]. In [43] we have proposed Multi-stage Spatially sensitive Pooling Network (MSPN) and a 10-class dataset called SIW-10 for in-the-wild script identification. Compared with [43], this paper describes text images via discriminative mid-level representation, instead of the global horizontal pooling on convolutional feature maps. Discriminative patches corresponding to special characters or components are explicitly discovered and used for building the mid-level representation. In addition, this paper proposes a larger and more challenging dataset with 13 script classes.

In summary, the contributions of the paper are as follows: (1) A discriminative mid-level representation built on deep features is presented for script identification tasks, in contrast to other methods that rely on texture, edge or connected component analysis. (2) We show that the mid-level representation and the deep feature extraction can be incorporated in a deep model, and get jointly optimized. (3) The proposed method is not limited to script identification in the wild, applicable to video and document script identification as well. The highly competitive performances are consistently achieved on such three kinds of script benchmarks. (4) Compared to the previously collected SIW-10, a larger and more challenging dataset SIW-13 is created and released.

The remainder of this paper is organized as follows: In Section 2 related work in script identification and image classification is reviewed and compared. In Section 3 the proposed method is described in detail. In Section 4 we introduce the SIW-13 dataset.
The experimental evaluation, comparisons with other methods, and some discussions are presented in Section 5. We conclude our paper in Section 6.

2. Related work

2.1. Script identification
Previous works on script identification mainly focus on texts in documents [48,20,7,24] and videos [40,58]. Script identification can be done at document page level, paragraph or text-block level, text-line level or word/character level. An extensive and detailed survey has been made by Ghosh et al. in [14].

Text images can be classified by their textures. Some previous works conduct texture analysis to extract some kind of holistic appearance descriptors of the input image. Tan [48] proposes to extract rotation invariant texture features for identifying document scripts. In [7], several texture features, including gray-level co-occurrence matrix features, Gabor energy features, and wavelet energy features, are tested. Joshi et al. [24] present a generalized framework to identify scripts at paragraph and text-block levels. Their method is based on texture analysis and a two-level hierarchical classification scheme. Phan et al. [40] propose to identify text-line level video scripts using edge-based features. The features are extracted from the smoothness and cursiveness of the upper and lower lines in each of the five equally sized horizontal zones of the text lines. In [58], Zhao et al. present features that are based on Spatial Gradient-Features at text block level, building features from horizontal and vertical gradients. Manthalkar et al. [34] propose rotation and scale invariant texture features, using discrete wavelet packet transform. Texture analysis, although widely adopted, may be insufficient to identify scripts, especially when distinguishing scripts that share common characters. Instead of texture analysis, our approach uses a discriminative mid-level representation, which tends to be more effective in distinguishing between scripts that have subtle appearance difference.

Fig. 1. Illustration of the script identification task and its challenges (panels: English, Greek, Russian): Both foregrounds and backgrounds exhibit large variations and high level of noise. Meanwhile, characters "A", "B" and "E" appear in all the three scripts. Identifying them relies on special patterns that are unique to certain scripts.

Some other approaches analyze texts via their shapes and structures. In [46], a method based on structure analysis is introduced. Different topological and structural features, including number of loops, water reservoir concept features, headline features and profile features, are combined. Hochberg et al. [20] discover a set of templates by clustering "textual symbols", which are connected components extracted from training scripts. Test scripts are then compared with these templates to find their best match. Component-based methods, however, are usually limited to binarized document scripts, since in video or natural scenes images it is hard to achieve ideal binarization. Our approach does not rely on any binarization or segmentation techniques. It is applicable to not only documents, but also a much wider range of scenarios including scene texts and video texts.

2.2. Image classification
Naturally, script identification can be cast as an image classification problem. The Bag-of-Words (BoW) framework [31] is a technique that is widely adopted in image classification problems.
In BoW,local descriptors such as SIFT[33],HOG[12]or simply raw pixel patches[10]are extracted from images,and encoded by some coding methods such as the locality-constrained linear coding(LLC[52])or the triangle activation[10].Recent research on image classification and other visual tasks has seen a leap forward, thanks to the wide application of deep convolutional neural net-works(CNNs[29]).CNN is deep neural network equipped with convolutional layers.It learns the feature representation from raw pixels,and can be trained in an end-to-end manner by the back-propagation algorithm[30].CNN,however,is not specially designed for the script identification task.It cannot handle images with arbitrary aspect ratios,and it does not put emphasis on dis-criminative local patches,which may be crucial for distinguishing scripts that have subtle differences.3.Methodology3.1.OverviewGiven a cropped text image I,which may contain a word or sentence written horizontally,we predict its script class c A f1;…;C g.As illustrated in Fig.2,the training process is divided into two stages.In thefirst stage,we build a discriminative mid-level representation,from the deep feature hierarchies (Section 3.2)extracted by a pretrained CNN,using the dis-criminative clustering method(Section3.3).The result is a dis-criminative codebook that contains a set of linear classifiers.We use the codebook to build the mid-level representation(Section 3.4).In the second stage(Section 3.5),we model the feature extraction,mid-level representation andfinal classification into one neural network.The network is initialized by transferring parameters(weights)learned in thefirst stage.We train the net-work using back-propagation.Consequently,parameters of all modules getfine-tuned together.3.2.Deep feature Hierarchy extractionThe input image isfirstly represented by a convolutional fea-ture hierarchy f h l g L l¼1,where each h l is a set of feature maps with the same size,and L is the number of levels in the 
hierarchy.The feature hierarchy is extracted by a pretrained CNN,which is dis-cussed in Section5.1.The input text image I isfirstly resized to a fixed height(32pixels throughout our experiments),keeping their aspect ratios.Thefirst level of the feature hierarchy h1is extracted by performing convolution and max-pooling with convolutional filters f k1i;j gi;jand biases f b1jgj,resulting in feature mapsh1¼f h1j gj¼1;…;n1.Each feature map h j1is computed by:h1j¼mpσX n0i¼1I i n k1i;jþb1j!!:ð1ÞHere,I i represents the i-th channel of the input image.n0is the number of image channels.The star operator n indicates the2-D convolution operation.σðÁÞis the squashing function which is an element-wise non-linearity.In our implementation,we use the element-wise thresholding function maxð0;xÞ,also known as the ReLU[36].mp is the max-pooling function,which downsamples feature maps by taking the maximum values on downsampling subregions.The remaining levels of the feature hierarchy are extracted recursively,by performing convolution and max-pooling on the feature maps from the preceding hierarchy level:h lj¼mpσXn lÀ1i¼1h i n k l i;jþb l j!!:ð2ÞHere,l is the level index in the hierarchy.Since image down-sampling is applied,the sizes of the feature maps decrease with l.The extracted feature hierarchy f h l g L l¼1provides dense local descriptors on the input image.At each level l,the feature mapsh l¼f h l k g n lk¼1are extracted by applying convolutional kernels densely on either the input image I or the feature maps h lÀ1fromFig.2.Illustration of the training process of the proposed approach.B.Shi et al./Pattern Recognition52(2016)448–458450the preceding level.A pixel on the feature map h lk ½i ;j ,where i and j are the row and column indices,respectively,is determined by a corresponding subregion on the input image,also known as the receptive field [21].As illustrated in Fig.3,the concatenation of thepixel values across all feature maps x l ½i ;j ¼½h l 1½i ;j ;…;h ln l ½i ;j T is 
taken as the local descriptor of that subregion.Since down-sampling is applied,the size of the subregion increases with level l .Therefore,the extracted feature hierarchy provides dense local descriptors at several different scales.The feature hierarchy is rich and invariant to various image distortions,making the representation robust to various distor-tions and variations in natural scenes.In addition,convolutional features are learned from data,thus domain-speci fic and poten-tially stronger than general hand-crafted features such as SIFT [33]and HOG [12].3.3.Discriminative patch discoveryAs we have discussed in Section 1,one challenging aspect of the script identi fication task is that some scripts share a subset of characters that have the same visual shapes,making it dif ficult to distinguish them via some holistic representations,such as texture features.The visual differences between these scripts can be observed only via a few local regions,or discriminative patches [45],which may correspond to special characters or special char-acter components.These patches are observed in certain scripts,and are strong evidence for identifying the script type.For example,characters “Λ”and “Σ”are distinctive to Greek.If the input image contains any of them,it is likely to be Greek.In our approach,we discover these discriminative patches from local patches extracted from the training images.The patches are described by deep features.As we have described in Section 3.2,the feature hierarchies provide dense local descriptors.Therefore we simply compute all feature hierarchies and extract dense local descriptors from them.To discover the discriminative visual pat-terns from the set of local patches,we adopt the method proposed by Singh et al.in [45],which is a discriminative clustering method for discovering patches that are both representative and discriminative.Given the set of local patches described by deep features,the discriminative clustering algorithm outputs a 
discriminative codebook,which contains a set of linear classi fiers.The clustering is performed separately on each class c and on each feature level l .For each class c ,a set of local descriptors X l c is extracted from the feature hierarchies,taken as the discovery set [45].Another set,the natural set ,contains local descriptors from the remaining classes.The discriminative clustering algorithm is performed on the twosets,resulting in a multi-class linear classi fier f w l c ;b lc g .The finaldiscriminative codebook is built by concatenating the classi fierweights from all classes,i.e.K l ¼ðW l ;b lÞ.Detailed descriptions are listed in Algorithm 1.Algorithm 1.The discriminative clustering process.1:Input:Local descriptors f x l i g i ;l ¼1…L2:Output:Discriminative clusters f K l g l ¼1…L 3:for feature level l ¼1to L do 4:for class c ¼1to C do 5:Discovery set D l c ¼f x l :x l A X l c g 6:Natural set N l c ¼f x l :x l =2X l c g7:w l c ;b lc ¼discriminative_clustering ðD l c ;N l c Þ8:end for9:W l ;b l¼concat c ðf W l c gÞ;concat c ðf b lc g10:K l ¼ðW l ;b lÞ11:end for12:Output f K l g l ¼1…LFig.4shows some examples of the discriminative patches discovered from feature level 4(the last convolutional layer).Among the patches,we can observe special characters or text components that are distinctive to certain scripts.The dis-criminative clustering algorithm automatically chooses the num-ber of clusters.In our experiments,it results in a codebook with about 1500classi fiers on each feature level.3.4.Mid-level representationTo obtain the mid-level representation,we firstly encode the feature maps in the hierarchy with the learned discriminative codebook,then horizontally pool the encoding results into a fixed-length vector.Assuming that the feature maps have the shape n Âw Âh ,as mentioned,from the maps we can densely extract w Âh local descriptors,each of n dimensions.Each local descriptor,say x ½i ;j where i ,j are the location on the map,is encoded with the 
discriminative codebook that has $k$ entries (i.e. $k$ linear classifiers), resulting in a $k$-dimensional vector $z[i, j]$:

$$z^l[i, j] = \max(0,\; W^l x^l[i, j] + b^l). \tag{3}$$

Here, the encoded vector is the non-negative response of all the classifiers in the codebook. $W^l x^l[i, j] + b^l$ gives the responses of all $k$ classifiers. A positive response indicates the presence of a certain discriminative pattern and is kept, while negative responses are suppressed by setting them to zero.

To describe the whole image from the encoding results, we adopt a horizontal pooling scheme, inspired by spatial pyramid pooling (SPP [18]). Text in the real world is mostly written horizontally. The horizontal positions of individual characters are less meaningful for identifying the script type. The vertical positions of text components such as strokes, on the other hand, are useful since they capture the structure of the characters. To make the representation invariant to the horizontal positions of local descriptors, while maintaining the topological structure in the vertical direction, we propose to take the maximum response along each row of the feature maps, i.e. take $\max_j z^l[i, j]$. The maximum responses are concatenated into a long vector, which captures the topological structure of characters and is invariant to the character positions or orderings. We call the module for extracting this mid-level representation the discriminative encoder. It is parameterized by the codebook weights, i.e. $W^l$ and $b^l$.

Fig. 3. Locations on the feature maps and their corresponding receptive fields on the input image. The concatenation of the values at a location across all feature maps forms the descriptor of the receptive field; consequently, dense local descriptors at different scales can be extracted from the feature hierarchy.

B. Shi et al. / Pattern Recognition 52 (2016) 448–458

3.5. Global fine-tuning

Fine-tuning is the process of jointly optimizing the parameters of several algorithm components. Usually fine-tuning is carried out in a
neural network structure, where gradients on layer parameters are calculated with the back-propagation algorithm [30].

In global fine-tuning, we aim to optimize the parameters (weights) of all components involved, including the convolutional feature extractor, the discriminative codebook, and the final classifier. To achieve this, we model the components as network layers, forming an end-to-end network that maps the input image to the final predicted labels, and apply the back-propagation algorithm to optimize it.

Discriminative encoding layer: We model the discriminative encoding process as a network layer. According to Eq. (3), the linear transform $Wx + b$ is first applied to all locations on the feature maps, which is equivalent to a linear transform on the map level. Then, a threshold function $\max(0, x)$ is applied, which is equivalent to the ReLU nonlinearity. Therefore, we model the layer as the sequential combination of a linear layer that is parameterized by the codebook weights $W$, $b$, and a ReLU layer. We call this layer the discriminative encoding (DE) layer.

Horizontal pooling layer: The horizontal pooling process can be readily modeled as the horizontal pooling layer, which is inserted after each DE layer.

Multi-level encoding and pooling: The feature maps on different hierarchy levels describe the input image on different scales and abstraction levels. We believe that they are complementary to each other for classification. Therefore, we construct a network topology that utilizes multiple feature hierarchy levels. The topology is illustrated in Fig. 5. We insert discriminative encoding + horizontal pooling layers after multiple convolutional layers, and concatenate their outputs into a long vector, which is fed to the final classification layers.

The resulting network is initialized by the weights learned in the previous procedures. Specifically, the convolutional layers are initialized by the weights in the convolutional feature extractor. The weights in the discriminative layers are transferred from the
discriminative codebook. The weights in the classification layers are randomly initialized. The network is fine-tuned with the back-propagation algorithm.

4. The SIW-13 dataset

There exist several public datasets that consist of texts in the wild, for instance, ICDAR 2011 [42], SVT [53] and IIIT 5K-Word [35]. However, these datasets are primarily used for scene text detection and recognition tasks. Besides, these datasets are dominated by English or other Latin-based scripts. Other scripts, such as Arabic, Cambodian and Tibetan, are rarely seen in these datasets. In the area of script identification, there exist several datasets [20,40,58]. However, the datasets proposed in these works mainly focus on texts extracted from documents or videos.

Fig. 4. Examples of discriminative patches discovered from the training data. Each row shows a discovered cluster, which corresponds to a special character that is unique to a certain script, e.g. row 1 for Greek, row 6 for Japanese and row 8 for Korean.

In this paper, we propose a dataset¹ for script identification in wild scenes. The dataset contains a large number of cropped text images taken from natural scene images. As illustrated in Fig. 6, the dataset contains text images from 13 different scripts: Arabic, Cambodian, Chinese, English, Greek, Hebrew, Japanese, Kannada, Korean, Mongolian, Russian, Thai and Tibetan. We call this dataset the Script Identification in the Wild 13 Classes (SIW-13) dataset.

For collecting the dataset, we first harvest a collection of street view images from Google Street View [1] and manually label the bounding boxes of text regions, as shown in Fig. 7. Text images are then cropped out, and rectified by being rotated to the horizontal orientation. For each script, about 600–1000 street view images are collected, and about 1000–2000 text images are cropped out. In total, the dataset contains 16,291 text images.

For benchmarking script identification algorithms, we split the dataset into the training and
testing sets. The testing set contains 6500 samples altogether, with 500 samples for each class. The remaining 9791 samples are used for training. Table 1 lists the detailed statistics of the dataset.

Some examples of the collected dataset are shown in Fig. 6. Since the images are collected from natural scenes, texts in the images exhibit large variations in fonts, color, layout and writing styles. The backgrounds are sometimes cluttered and complex. In some cases, text images are blurred or affected by lighting conditions or camera poses. These factors make our dataset realistic, and much more challenging than datasets that are collected from documents or videos.

The SIW-13 dataset is extended from our previously proposed SIW-10 [43]. Three new scripts, namely Cambodian, Kannada and Mongolian, are added. Also, we revise the remaining script classes by removing images that are either too noisy or corrupted, and by adding some new images to these classes.

5. Experiments

In this section, we evaluate the performance of the proposed DiscCNN on three tasks, namely script identification in the wild, in videos and in documents, and compare it with other widely used image classification or script identification methods, including the conventional CNN, the SLN [10] and the LBP.

5.1. Implementation details

We use the same network structure throughout our experiments, with the exception of the discriminative encoding layers, whose structures and initial parameters are determined automatically by the discriminative patch discovery process. As illustrated in Figs. 2 and 5, we use feature levels 2, 3, and 4 for patch discovery and discriminative encoding. In the discovery step, the number of local descriptors can be large, especially when the feature maps are large. For this reason, we use only a part of the extracted local descriptors for patch discovery on feature levels 2 and 3.

For the first stage in our approach, the convolutional layers are pretrained by a conventional CNN whose structure is specified in Table 2. The
network is jointly optimized by stochastic gradient descent (SGD) with the learning rate set to $10^{-3}$, the momentum set to 0.9 and the batch size set to 128. The network uses the dropout strategy in the last hidden layer with a dropout rate of 0.5 during training. The learning rate is multiplied by 0.1 when the validation error stops decreasing for a sufficient number of iterations. The network optimization process terminates when the learning rate reaches $10^{-6}$.

The proposed approach is implemented using C++ and Python. On a machine with an Intel Core i5-2320 CPU (3.00 GHz), 8 GB RAM and an NVIDIA GTX 660 GPU, the feature hierarchy extraction and discriminative clustering take about 4 h. The GPU-accelerated fine-tuning process takes about 8 h to reach convergence. Running on a GPU device, the testing process takes less than 20 ms for each image, and consumes less than 50 MB RAM.

Fig. 6. Examples of cropped text images in the SIW-13 dataset. Panels: Chinese, Cambodian, Arabic, English, Greek, Hebrew, Japanese, Kannada, Korean, Mongolian, Russian, Thai, Tibetan.

Fig. 5. The structure and parameters of the deep network model. The network consists of four convolutional layers (conv1 to conv4), three discriminative encoding layers (DE-1, DE-2 and DE-3) and two fully connected layers (fc1 and fc2). Discriminative encoding layers are inserted after convolutional layers conv2, conv3 and conv4. Their outputs are concatenated as a long vector, and passed to the fully connected layers. (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)

¹ The dataset can be downloaded at /$xbai/mspnProjectPage/. It is available for academic use only.
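As an illustration of the discriminative encoding of Eq. (3) and the horizontal pooling described in Section 3.4, the following is a minimal NumPy sketch. It is not the authors' C++/Python implementation: the function name `discriminative_encode_and_pool`, the array layout (channels, rows, columns) and the random codebook used in the demonstration are assumptions made only for this example.

```python
import numpy as np


def discriminative_encode_and_pool(feature_maps, W, b):
    """Encode local descriptors with a linear codebook, then pool horizontally.

    feature_maps: array of shape (n, h, w), i.e. n maps with h rows and
                  w columns (an assumed layout for this sketch).
    W: codebook weights, shape (k, n); b: codebook biases, shape (k,).
    Returns a vector of length k * h.
    """
    # The descriptor at location [i, j] is the n-vector feature_maps[:, i, j].
    # responses[c, i, j] = W[c] . x[i, j] + b[c] for every classifier c.
    responses = np.einsum("kn,nhw->khw", W, feature_maps) + b[:, None, None]
    # Eq. (3): keep positive responses, set negative responses to zero.
    z = np.maximum(0.0, responses)
    # Horizontal pooling: maximum over the column index j for each row i.
    pooled = z.max(axis=2)  # shape (k, h)
    return pooled.reshape(-1)


# Demonstration with a hypothetical random codebook (k = 5 classifiers).
rng = np.random.default_rng(0)
maps = rng.standard_normal((16, 4, 10))  # n = 16, h = 4 rows, w = 10 columns
W = rng.standard_normal((5, 16))
b = rng.standard_normal(5)
v = discriminative_encode_and_pool(maps, W, b)
print(v.shape)  # (20,)
```

Because the maximum is taken over columns only, permuting the columns of the feature maps leaves the output unchanged, while the row index is preserved in the output vector. This mirrors the stated goal of invariance to horizontal position while keeping the vertical structure.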