Script identification in the wild via discriminative convolutional neural network

Baoguang Shi, Xiang Bai*, Cong Yao

School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China

Article info

Article history:
Received 25 May 2015
Received in revised form 23 October 2015
Accepted 10 November 2015
Available online 1 December 2015

Keywords:
Script identification
Convolutional neural network
Mid-level representation
Discriminative clustering
Dataset

Abstract

Script identification facilitates many important applications in document/video analysis. This paper investigates a relatively new problem: identifying scripts in natural images. The basic idea is combining deep features and mid-level representations into a globally trainable deep model. Specifically, a set of deep feature maps is firstly extracted by a pre-trained CNN model from the input images, where the local deep features are densely collected. Then, discriminative clustering is performed to learn a set of discriminative patterns based on such local features. A mid-level representation is obtained by encoding the local features based on the learned discriminative patterns (codebook). Finally, the mid-level representations and the deep features are jointly optimized in a deep network. Benefiting from such a fine-grained classification strategy, the optimized deep model, termed Discriminative Convolutional Neural Network (DisCNN), is capable of effectively revealing the subtle differences among scripts that are difficult to distinguish, e.g. Chinese and Japanese. In addition, a large-scale dataset containing 16,291 in-the-wild text images in 13 scripts, namely SIW-13, is created for evaluation. Our method is not limited to identifying text images, and performs effectively on video and document scripts as well, not requiring any preprocessing such as binarization, segmentation or hand-crafted features. The experimental comparisons on the SIW-13, CVSI-2015 and Multi-Script datasets consistently demonstrate that DisCNN is a state-of-the-art approach for script identification.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Script identification is one of the key components in Optical Character Recognition (OCR), and has received much attention from the document analysis community, especially when the data being processed is in multi-script or multi-language form. Due to the rapidly increasing amount of multimedia data, especially data captured and stored by mobile terminals, how to recognize text content in natural scenes has become an active and important task in the fields of pattern recognition, computer vision and multimedia [15,16,25,26,47,49,59,23,57,38,8,41]. Different from previous approaches, which have been mainly designed for document images [48,19,20] or videos [40,58], this work focuses on identifying the language/script types of texts in natural images (in the wild), at word or text line level. This problem has seldom been fully studied before. As texts in natural scenes often carry rich, high-level semantics, there exist many efforts in scene text localization and recognition [37,9,59,54–56,44]. Script identification in the wild is an inevitable preprocessing step for a scene text understanding system under multi-lingual scenarios [5,6,22], and is potentially useful in many applications such as scene understanding [32], product image search [17], mobile phone navigation, film caption recognition [11], and machine translation [4,50].

Given an input text image, the task of script identification is to classify it into one of the pre-defined script categories (English, Chinese, Greek, etc.). Naturally, this problem can be cast as an image classification problem, which has been extensively studied. However, script identification in scene text images remains a challenging task, and has characteristics quite different from document/video script identification, or general image classification, mainly due to the following reasons:

1. In natural scenes, texts exhibit larger variations than they do in documents or videos. They are often written/printed on outdoor signs and advertisements, in some artistic styles. Often, there exist large variations in their fonts, colors, and layout shapes.
2. The quality of text images affects the identification accuracy. As scene texts are often captured under uncontrolled environments, the difficulties in identification may be caused by several factors such as low resolution, noise, and illumination changes. Document/video analysis techniques such as binarization and component analysis tend to be unreliable.


http://dx.doi.org/10.1016/j.patcog.2015.11.005
0031-3203/© 2015 Elsevier Ltd. All rights reserved.

* Corresponding author.
E-mail addresses: shibaoguang@gmail.com (B. Shi), xbai@hust.edu.cn (X. Bai), yaocong2010@gmail.com (C. Yao).

Pattern Recognition 52 (2016) 448–458

3. Some scripts/languages have relatively minor differences, e.g. Greek, English and Russian. As illustrated in Fig. 1, these scripts share a subset of characters that have exactly the same shapes. Distinguishing them relies on special characters or character components, and is a fine-grained classification problem.
4. Text images have arbitrary aspect ratios, since text strings have arbitrary lengths, ruling out some image classification methods that only operate on fixed-size inputs.

Recently, CNN has achieved great success in image classification tasks [27], due to its strong capacity and its invariance to translation and distortions. To handle the complex foregrounds and backgrounds in scene text images, we choose to adopt deep features learned by a CNN as the basic representation. In our method, a deep feature hierarchy, which is a set of feature maps, is extracted from the input images through a pretrained CNN [29]. The hierarchy carries rich and multi-scale representations of the images.

The differences among certain scripts are subtle or even tiny, thus a holistic representation would not work well. Typical image classification algorithms, such as the conventional CNN [27] and the Single-Layer Networks (SLN) [10], usually describe images in a holistic style without explicit emphasis on the discriminative patches that play an important role in distinguishing some script categories (e.g. English and Greek). Therefore, to explicitly capture fine-grained features of scripts, we extract a set of common patterns, termed discriminative patterns (image patches containing representative strokes or components), from script images via discriminative clustering [45]. Such common patterns, represented by deep features, can be treated as a codebook for encoding the dense deep features into a feature vector, providing a mid-level representation of an input script image. A pooling strategy inspired by Spatial Pyramid Pooling [18,28], called horizontal pooling, is adopted in the mid-level representation process. This strategy enables our method to capture the topological structure of texts, and naturally handles input images of arbitrary aspect ratios.

To maximize the discriminatory power of the mid-level representations, we put the above two modules, namely the convolutional layers for extracting the deep feature hierarchy and the discriminative encoder for extracting the mid-level representations, into a single deep network for joint optimization with the back-propagation algorithm [30]. The global fine-tuning process optimizes both the deep features and the mid-level representation, effectively integrating the global features (deep feature maps) and fine-grained features (discriminative patterns) for script identification.

This paper is a continuation and extension of our previous work [43]. In [43] we proposed the Multi-stage Spatially sensitive Pooling Network (MSPN) and a 10-class dataset called SIW-10 for the in-the-wild script identification task. Compared with [43], this paper describes text images via a discriminative mid-level representation, instead of the global horizontal pooling on convolutional feature maps. Discriminative patches corresponding to special characters or components are explicitly discovered and used for building the mid-level representation. In addition, this paper proposes a larger and more challenging dataset with 13 script classes.

In summary, the contributions of the paper are as follows: (1) A discriminative mid-level representation built on deep features is presented for script identification tasks, in contrast to other methods that rely on texture, edge or connected component analysis. (2) We show that the mid-level representation and the deep feature extraction can be incorporated into a deep model and jointly optimized. (3) The proposed method is not limited to script identification in the wild, and is applicable to video and document script identification as well; highly competitive performances are consistently achieved on all three kinds of script benchmarks. (4) Compared to the previously collected SIW-10, a larger and more challenging dataset, SIW-13, is created and released.

The remainder of this paper is organized as follows: In Section 2, related work in script identification and image classification is reviewed and compared. In Section 3, the proposed method is described in detail. In Section 4, we introduce the SIW-13 dataset. The experimental evaluation, comparisons with other methods, and some discussions are presented in Section 5. We conclude the paper in Section 6.

2. Related work

2.1. Script identification

Previous works on script identification mainly focus on texts in documents [48,20,7,24] and videos [40,58]. Script identification can be done at document page level, paragraph or text-block level, text-line level, or word/character level. An extensive and detailed survey has been made by Ghosh et al. in [14].

Text images can be classified by their textures. Some previous works conduct texture analysis to extract holistic appearance descriptors of the input image. Tan [48] proposes to extract rotation-invariant texture features for identifying document scripts. In [7], several texture features, including gray-level co-occurrence matrix features, Gabor energy features, and wavelet energy features, are tested. Joshi et al. [24] present a generalized framework to identify scripts at paragraph and text-block levels. Their method is based on texture analysis and a two-level hierarchical classification scheme. Phan et al. [40] propose to identify text-line level video scripts using edge-based features. The features are extracted from the smoothness and cursiveness of the upper and lower lines in each of the five equally sized horizontal zones of the text lines. In [58], Zhao et al. present features based on spatial gradient features at text-block level, building features from horizontal and vertical gradients. Manthalkar et al. [34] propose rotation and scale invariant texture features, using the discrete wavelet packet transform.

Texture analysis, although widely adopted, may be insufficient to identify scripts, especially when distinguishing scripts that share common characters. Instead of texture analysis, our approach uses a discriminative mid-level representation, which tends to be more effective in distinguishing between scripts that have subtle appearance differences.

Fig. 1. Illustration of the script identification task and its challenges (examples in English, Greek and Russian): both foregrounds and backgrounds exhibit large variations and a high level of noise. Meanwhile, characters "A", "B" and "E" appear in all three scripts. Identifying them relies on special patterns that are unique to certain scripts.

Some other approaches analyze texts via their shapes and structures. In [46], a method based on structure analysis is introduced. Different topological and structural features, including the number of loops, water reservoir concept features, headline features and profile features, are combined. Hochberg et al. [20] discover a set of templates by clustering "textual symbols", which are connected components extracted from training scripts. Test scripts are then compared with these templates to find their best matching script. Component-based methods, however, are usually limited to binarized document scripts, since in video or natural scene images it is hard to achieve ideal binarization. Our approach does not rely on any binarization or segmentation techniques. It is applicable not only to documents, but also to a much wider range of scenarios including scene texts and video texts.

2.2. Image classification

Naturally, script identification can be cast as an image classification problem. The Bag-of-Words (BoW) framework [31] is a technique widely adopted in image classification. In BoW, local descriptors such as SIFT [33], HOG [12] or simply raw pixel patches [10] are extracted from images, and encoded by coding methods such as locality-constrained linear coding (LLC [52]) or the triangle activation [10]. Recent research on image classification and other visual tasks has seen a leap forward, thanks to the wide application of deep convolutional neural networks (CNNs [29]). A CNN is a deep neural network equipped with convolutional layers. It learns the feature representation from raw pixels, and can be trained in an end-to-end manner by the back-propagation algorithm [30]. CNN, however, is not specially designed for the script identification task. It cannot handle images with arbitrary aspect ratios, and it does not put emphasis on discriminative local patches, which may be crucial for distinguishing scripts that have subtle differences.

3. Methodology

3.1. Overview

Given a cropped text image $I$, which may contain a word or a sentence written horizontally, we predict its script class $c \in \{1, \dots, C\}$. As illustrated in Fig. 2, the training process is divided into two stages. In the first stage, we build a discriminative mid-level representation from the deep feature hierarchies (Section 3.2) extracted by a pretrained CNN, using the discriminative clustering method (Section 3.3). The result is a discriminative codebook that contains a set of linear classifiers. We use the codebook to build the mid-level representation (Section 3.4). In the second stage (Section 3.5), we model the feature extraction, mid-level representation and final classification as one neural network. The network is initialized by transferring the parameters (weights) learned in the first stage. We train the network using back-propagation. Consequently, the parameters of all modules get fine-tuned together.

3.2. Deep feature hierarchy extraction

The input image is firstly represented by a convolutional feature hierarchy $\{h^l\}_{l=1}^{L}$, where each $h^l$ is a set of feature maps with the same size, and $L$ is the number of levels in the hierarchy. The feature hierarchy is extracted by a pretrained CNN, which is discussed in Section 5.1. The input text image $I$ is firstly resized to a fixed height (32 pixels throughout our experiments), keeping its aspect ratio. The first level of the feature hierarchy $h^1$ is extracted by performing convolution and max-pooling with convolutional filters $\{k_{i,j}^1\}_{i,j}$ and biases $\{b_j^1\}_j$, resulting in feature maps $h^1 = \{h_j^1\}_{j=1,\dots,n_1}$. Each feature map $h_j^1$ is computed by:

$$h_j^1 = \mathrm{mp}\left(\sigma\left(\sum_{i=1}^{n_0} I_i * k_{i,j}^1 + b_j^1\right)\right). \tag{1}$$

Here, $I_i$ represents the $i$-th channel of the input image, and $n_0$ is the number of image channels. The star operator $*$ indicates the 2-D convolution operation. $\sigma(\cdot)$ is the squashing function, an element-wise non-linearity. In our implementation, we use the element-wise thresholding function $\max(0, x)$, also known as the ReLU [36]. $\mathrm{mp}$ is the max-pooling function, which downsamples feature maps by taking the maximum values over downsampling subregions.

The remaining levels of the feature hierarchy are extracted recursively, by performing convolution and max-pooling on the feature maps from the preceding hierarchy level:

$$h_j^l = \mathrm{mp}\left(\sigma\left(\sum_{i=1}^{n_{l-1}} h_i^{l-1} * k_{i,j}^l + b_j^l\right)\right). \tag{2}$$

Here, $l$ is the level index in the hierarchy. Since image downsampling is applied, the sizes of the feature maps decrease with $l$.

The extracted feature hierarchy $\{h^l\}_{l=1}^{L}$ provides dense local descriptors on the input image. At each level $l$, the feature maps $h^l = \{h_k^l\}_{k=1}^{n_l}$ are extracted by applying convolutional kernels densely on either the input image $I$ or the feature maps $h^{l-1}$ from the preceding level. A pixel on the feature map, $h_k^l[i,j]$, where $i$ and $j$ are the row and column indices, respectively, is determined by a corresponding subregion on the input image, also known as the receptive field [21]. As illustrated in Fig. 3, the concatenation of the pixel values across all feature maps, $x^l[i,j] = [h_1^l[i,j], \dots, h_{n_l}^l[i,j]]^{T}$, is taken as the local descriptor of that subregion. Since downsampling is applied, the size of the subregion increases with level $l$. Therefore, the extracted feature hierarchy provides dense local descriptors at several different scales.

Fig. 2. Illustration of the training process of the proposed approach.

The feature hierarchy is rich and invariant to various image distortions, making the representation robust to the variations found in natural scenes. In addition, convolutional features are learned from data, and are thus domain-specific and potentially stronger than general hand-crafted features such as SIFT [33] and HOG [12].
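To make the extraction procedure concrete, the following is a minimal sketch of the hierarchy computation of Eqs. (1) and (2). The authors implement the system in C++ and Python; the use of PyTorch here is our assumption for illustration, with layer sizes taken from Table 2.

import torch
import torch.nn as nn

class FeatureHierarchy(nn.Module):
    """Conv/ReLU/max-pool levels implementing Eqs. (1)-(2); sizes follow Table 2."""
    def __init__(self):
        super().__init__()
        # (in_ch, out_ch, kernel, padding, pool?) for conv1-conv4 / mp1-mp3
        cfg = [(3, 96, 5, 2, True), (96, 256, 3, 2, True),
               (256, 384, 3, 1, True), (384, 512, 3, 0, False)]
        self.levels = nn.ModuleList()
        for cin, cout, k, p, pool in cfg:
            layers = [nn.Conv2d(cin, cout, k, stride=1, padding=p),
                      nn.ReLU(inplace=True)]              # sigma(x) = max(0, x)
            if pool:
                layers.append(nn.MaxPool2d(3, stride=2))  # mp(.)
            self.levels.append(nn.Sequential(*layers))

    def forward(self, image):
        # image: 1 x 3 x 32 x W, height resized to 32, aspect ratio kept
        maps, h = [], image
        for level in self.levels:
            h = level(h)
            maps.append(h)    # h^l, shape 1 x n_l x H_l x W_l
        return maps

def local_descriptors(h):
    """Dense local descriptors x^l[i, j]: one n_l-dim vector per location."""
    return h.permute(0, 2, 3, 1).reshape(-1, h.shape[1])

Because the convolutions run on images of arbitrary width, the number of local descriptors per level grows with the width of the input while their dimensionality stays fixed.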

3.3. Discriminative patch discovery

As we have discussed in Section 1, one challenging aspect of the script identification task is that some scripts share a subset of characters that have the same visual shapes, making it difficult to distinguish them via holistic representations, such as texture features. The visual differences between these scripts can be observed only via a few local regions, or discriminative patches [45], which may correspond to special characters or special character components. These patches are observed in certain scripts, and are strong evidence for identifying the script type. For example, the characters "Λ" and "Σ" are distinctive to Greek. If the input image contains any of them, it is likely to be Greek.

In our approach, we discover these discriminative patches from local patches extracted from the training images. The patches are described by deep features. As described in Section 3.2, the feature hierarchies provide dense local descriptors, so we simply compute all feature hierarchies and extract dense local descriptors from them. To discover the discriminative visual patterns from the set of local patches, we adopt the method proposed by Singh et al. in [45], which is a discriminative clustering method for discovering patches that are both representative and discriminative.

Given the set of local patches described by deep features, the discriminative clustering algorithm outputs a discriminative codebook, which contains a set of linear classifiers. The clustering is performed separately on each class $c$ and on each feature level $l$. For each class $c$, a set of local descriptors $X_c^l$ is extracted from the feature hierarchies, taken as the discovery set [45]. Another set, the natural set, contains local descriptors from the remaining classes. The discriminative clustering algorithm is performed on the two sets, resulting in a multi-class linear classifier $\{W_c^l, b_c^l\}$. The final discriminative codebook is built by concatenating the classifier weights from all classes, i.e. $K^l = (W^l, b^l)$. Detailed descriptions are listed in Algorithm 1.

Algorithm 1. The discriminative clustering process.

1: Input: local descriptors $\{x_i^l\}_i$, $l = 1, \dots, L$
2: Output: discriminative codebooks $\{K^l\}_{l=1,\dots,L}$
3: for feature level $l = 1$ to $L$ do
4:   for class $c = 1$ to $C$ do
5:     Discovery set $D_c^l = \{x^l : x^l \in X_c^l\}$
6:     Natural set $N_c^l = \{x^l : x^l \notin X_c^l\}$
7:     $W_c^l, b_c^l \leftarrow \mathrm{discriminative\_clustering}(D_c^l, N_c^l)$
8:   end for
9:   $W^l \leftarrow \mathrm{concat}_c(\{W_c^l\})$; $b^l \leftarrow \mathrm{concat}_c(\{b_c^l\})$
10:  $K^l = (W^l, b^l)$
11: end for
12: Output $\{K^l\}_{l=1,\dots,L}$
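As a rough illustration of lines 5–7 of Algorithm 1, the sketch below builds the discovery and natural sets for one class and one feature level, and trains one linear classifier per cluster. Note this is a simplified stand-in: it uses k-means plus a linear SVM from scikit-learn, whereas the actual procedure of Singh et al. [45] alternates clustering and SVM training with cross-validation; the function name, the number of clusters and the SVM hyperparameter are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def discover_patterns(descriptors, labels, target_class, n_clusters=50):
    """Return (W, b) of per-cluster linear classifiers for one class/level.

    descriptors: (N, n_l) local deep features x^l; labels: (N,) script classes.
    """
    discovery = descriptors[labels == target_class]   # D_c^l
    natural = descriptors[labels != target_class]     # N_c^l
    # Initial clusters on the discovery set (stand-in for the iterative
    # refinement procedure of Singh et al. [45]).
    km = KMeans(n_clusters=n_clusters, n_init=4).fit(discovery)
    W, b = [], []
    for c in range(n_clusters):
        members = discovery[km.labels_ == c]
        if len(members) < 3:            # drop tiny, non-representative clusters
            continue
        # Cluster members vs. the natural set: one classifier per pattern.
        X = np.vstack([members, natural])
        y = np.hstack([np.ones(len(members)), np.zeros(len(natural))])
        svm = LinearSVC(C=0.1).fit(X, y)
        W.append(svm.coef_[0])
        b.append(svm.intercept_[0])
    return np.array(W), np.array(b)     # rows become codebook entries of K^l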

Fig. 4 shows some examples of the discriminative patches discovered from feature level 4 (the last convolutional layer). Among the patches, we can observe special characters or text components that are distinctive to certain scripts. The discriminative clustering algorithm automatically chooses the number of clusters. In our experiments, it results in a codebook with about 1500 classifiers on each feature level.

3.4. Mid-level representation

To obtain the mid-level representation, we firstly encode the feature maps in the hierarchy with the learned discriminative codebook, then horizontally pool the encoding results into a fixed-length vector. Assuming that the feature maps have the shape $n \times w \times h$, as mentioned, from the maps we can densely extract $w \times h$ local descriptors, each of $n$ dimensions. Each local descriptor, say $x[i,j]$ where $i$, $j$ are the location on the map, is encoded with the discriminative codebook that has $k$ entries (i.e. $k$ linear classifiers), resulting in a $k$-dimensional vector $z[i,j]$:

$$z^l[i,j] = \max\left(0,\, W^l x^l[i,j] + b^l\right). \tag{3}$$

Here, the encoded vector is the non-negative response of all the classifiers in the codebook. $W^l x^l[i,j] + b^l$ is the vector of responses of all $k$ classifiers. A positive response indicates the presence of a certain discriminative pattern, and is kept, while negative responses are suppressed by setting them to zero.

To describe the whole image from the encoding results, we adopt a horizontal pooling scheme, inspired by the spatial pyramid pooling (SPP [18]). Texts in the real world are mostly horizontally written. The horizontal positions of individual characters are less meaningful for identifying the script type. The vertical positions of text components such as strokes, on the other hand, are useful since they capture the structure of the characters. To make the representation invariant to the horizontal positions of local descriptors, while maintaining the topological structure in the vertical direction, we propose to take the maximum response along each row of the feature maps, i.e. take $\max_j z^l[i,j]$. The maximum responses are concatenated into a long vector, which captures the topological structure of characters, and is invariant to character positions or orderings. We call the module for extracting this mid-level representation the discriminative encoder. It is parameterized by the codebook weights, i.e. $W^l$ and $b^l$.

Fig. 3. Locations on the feature maps and their corresponding receptive fields on the input image. The concatenation of the values at a location across all feature maps forms the descriptor of the receptive field; consequently, dense local descriptors at different scales can be extracted from the feature hierarchy.


3.5. Global fine-tuning

Fine-tuning is the process of optimizing the parameters of several algorithm components in a joint manner. Usually fine-tuning is carried out in a neural network structure, where gradients on layer parameters are calculated with the back-propagation algorithm [30]. In global fine-tuning, we aim to optimize the parameters (weights) of all components involved, including the convolutional feature extractor, the discriminative codebook, and the final classifier. To achieve this, we model the components as network layers, forming an end-to-end network that maps the input image to the final predicted labels, and apply the back-propagation algorithm to optimize it.

Discriminative encoding layer: We model the discriminative encoding process as a network layer. According to Eq. (3), the linear transform $Wx + b$ is firstly applied to all locations on the feature maps, equivalent to a linear transform at the map level. Then, a thresholding function $\max(0, x)$ is applied, equivalent to the ReLU nonlinearity. Therefore, we model the layer as the sequential combination of a linear layer that is parameterized by the codebook weights $W$, $b$, and a ReLU layer. We call this layer the discriminative encoding (DE) layer.

Horizontal pooling layer: The horizontal pooling process can be readily modeled as a horizontal pooling layer, which is inserted after each DE layer.

Multi-level encoding and pooling: The feature maps at different hierarchy levels describe the input image at different scales and abstraction levels. We believe that they are complementary to each other for classification. Therefore, we construct a network topology that utilizes multiple feature hierarchy levels. The topology is illustrated in Fig. 5. We insert discriminative encoding + horizontal pooling layers after multiple convolutional layers, and concatenate their outputs into a long vector, which is fed to the final classification layers.

The resulting network is initialized by the weights learned in the previous procedures. Specifically, the convolutional layers are initialized by the weights of the convolutional feature extractor. The weights of the discriminative encoding layers are transferred from the discriminative codebook. The weights of the classification layers are randomly initialized. The network is fine-tuned with the back-propagation algorithm.

4. The SIW-13 dataset

There exist several public datasets that consist of texts in the wild, for instance, ICDAR 2011 [42], SVT [53] and IIIT 5K-Word [35]. However, these datasets are primarily used for scene text detection and recognition tasks. Besides, these datasets are dominated by English and other Latin-based scripts. Other scripts, such as Arabic, Cambodian and Tibetan, are rarely seen in these datasets. In the area of script identification, there exist several datasets [20,40,58]. However, the datasets proposed in these works mainly focus on texts extracted from documents or videos.

Fig. 4. Examples of discriminative patches discovered from the training data. Each row shows a discovered cluster, which corresponds to a special character that is unique to a certain script, e.g. row 1 for Greek, row 6 for Japanese and row 8 for Korean.


In this paper, we propose a dataset¹ for script identification in wild scenes. The dataset contains a large number of cropped text images taken from natural scene images. As illustrated in Fig. 6, the dataset contains text images from 13 different scripts: Arabic, Cambodian, Chinese, English, Greek, Hebrew, Japanese, Kannada, Korean, Mongolian, Russian, Thai and Tibetan. We call this dataset the Script Identification in the Wild 13 Classes (SIW-13) dataset.

For collecting the dataset, we first harvest a collection of street view images from Google Street View [1] and manually label the bounding boxes of text regions, as shown in Fig. 7. Text images are then cropped out, and rectified by being rotated to the horizontal orientation. For each script, about 600–1000 street view images are collected, and about 1000–2000 text images are cropped out. In total, the dataset contains 16,291 text images. For benchmarking script identification algorithms, we split the dataset into training and testing sets. The testing set contains altogether 6500 samples, with 500 samples for each class. The remaining 9791 samples are used for training. Table 1 lists the detailed statistics of the dataset.

Some examples of the collected dataset are shown in Fig. 6. Since the images are collected in natural scenes, texts in the images exhibit large variations in fonts, colors, layouts and writing styles. The backgrounds are sometimes cluttered and complex. In some cases, text images are blurred or affected by lighting conditions or camera poses. These factors make our dataset realistic, and much more challenging than datasets collected from documents or videos. The SIW-13 dataset is extended from our previously proposed SIW-10 [43]. Three new scripts, namely Cambodian, Kannada and Mongolian, are added. Also, we revise the remaining script classes by removing images that are either too noisy or corrupted, and by adding some new images to these classes.

5. Experiments

In this section, we evaluate the performance of the proposed DiscCNN on three tasks, namely script identification in the wild, in videos and in documents, and compare it with other widely used image classification or script identification methods, including the conventional CNN, the SLN [10] and LBP.

5.1. Implementation details

We use the same network structure throughout our experiments, with the exception of the discriminative encoding layers, whose structures and initial parameters are determined automatically by the discriminative patch discovery process. As illustrated in Figs. 2 and 5, we use feature levels 2, 3 and 4 for patch discovery and discriminative encoding. In the discovery step, the number of local descriptors can be large, especially when the feature maps are large. For this reason, we use only a part of the extracted local descriptors for patch discovery at feature levels 2 and 3. For the first stage of our approach, the convolutional layers are pretrained by a conventional CNN whose structure is specified in Table 2. The network is jointly optimized by stochastic gradient descent (SGD) with the learning rate set to $10^{-3}$, the momentum set to 0.9 and the batch size set to 128. The network uses the dropout strategy in the last hidden layer, with a dropout rate of 0.5 during training. The learning rate is multiplied by 0.1 when the validation error stops decreasing for a sufficient number of iterations. The network optimization process terminates when the learning rate reaches $10^{-6}$.
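For reference, this optimization recipe corresponds roughly to the following settings (a hedged sketch using torch.optim; `codebooks`, `train_loader` and `validation_error` are assumed placeholders for the first-stage outputs, the SIW-13 batch loader and a held-out error estimate, and the plateau patience is an assumption):

import torch

model = DiscCNN(codebooks)          # initialized from the first-stage weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1)   # lr <- 0.1 * lr on validation plateau
criterion = torch.nn.CrossEntropyLoss()

while optimizer.param_groups[0]['lr'] >= 1e-6:   # stop once lr reaches 1e-6
    for images, labels in train_loader:          # batches of 128 samples
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                          # back-propagation [30]
        optimizer.step()
    scheduler.step(validation_error(model))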

The proposed approach is implemented using C++ and Python. On a machine with an Intel Core i5-2320 CPU (3.00 GHz), 8 GB of RAM and an NVIDIA GTX 660 GPU, the feature hierarchy extraction and discriminative clustering take about 4 h. The GPU-accelerated fine-tuning process takes about 8 h to reach convergence. Running on a GPU device, the testing process takes less than 20 ms per image, and consumes less than 50 MB of RAM.

Fig. 6. Examples of cropped text images in the SIW-13 dataset (Arabic, Cambodian, Chinese, English, Greek, Hebrew, Japanese, Kannada, Korean, Mongolian, Russian, Thai and Tibetan).

Fig. 5. The structure and parameters of the deep network model. The network consists of four convolutional layers (conv1 to conv4), three discriminative encoding layers (DE-1, DE-2 and DE-3) and two fully connected layers (fc1 and fc2). Discriminative encoding layers are inserted after convolutional layers conv2, conv3 and conv4. Their outputs are concatenated into a long vector, and passed to the fully connected layers.

¹ The dataset can be downloaded at the authors' project page (~xbai/mspnProjectPage/). It is available for academic use only.


5.2. Methods for comparisons

Local Binary Patterns (LBP): LBP [3] is a widely adopted texture analysis technique. To extract the features, we use the vl_lbp function in the VLFeat library [51]. Images are scaled to a fixed size of 100×32. Cell sizes are set to 8×8. The resulting feature vector has 2784 dimensions. A linear SVM is adopted for classification.

Basic CNN (CNN): A conventional CNN structure, called CNN-Basic in the following, is set up for comparison. A conventional CNN only accepts images of fixed width and height, due to the existence of fully connected layers. We adopt a simple workaround by creating samples of size 100×32 by cropping or padding the original images. The network structure is specified in Table 2. It has the same convolutional layers as DiscCNN. Similar to DiscCNN, CNN-Basic is trained with SGD, with the batch size set to 128. The initial learning rate is set to $10^{-2}$, and the momentum is set to 0.9.

Single-Layer Networks (SLN): In [10], Coates et al. propose SLN, showing that, with simple unsupervised feature learning via K-means clustering, one could achieve state-of-the-art performance at that time on image classification tasks. We use the K-means feature learning code released by the authors and make several changes, including pooling over the upper and lower half regions instead of the quadrants, since distinguishing between left and right regions is not meaningful for script identification. A linear SVM is adopted for classification.

MSPN: The Multi-stage Spatially sensitive Pooling Network is proposed in our previous work [43]. It is a CNN variant that contains multi-stage horizontal pooling in its architecture. We use the same architecture as in [43] for comparison.

5.3. Script identification in the wild

We train and evaluate our DiscCNN on the SIW-13 dataset. The dataset contains altogether 13 scripts. We also construct two subsets from the full set: Alphabetic consists of alphabetic scripts including English, Greek, Arabic, Mongolian and Russian; Logographic consists of three logographic scripts including Chinese, Japanese and Korean. To test the performance, one model is trained on the full set and tested on all the subsets and the full set. Recognition accuracies are listed in Table 3. For comparison, we test three other methods, namely the Local Binary Patterns (LBP), the Single-Layer Networks (SLN) and the conventional CNN (CNN-Basic).

Fig. 7. Images we harvested from Google Street View. Yellow boxes are manually labeled text regions.

Table 1
Statistics of the SIW-13 dataset.

Script      #Images  #Train  #Test
Arabic      1002     502     500
Cambodian   1083     583     500
Chinese     1298     798     500
English     1221     721     500
Greek       1018     518     500
Hebrew      1242     742     500
Japanese    1215     715     500
Kannada     1029     529     500
Korean      1561     1061    500
Mongolian   1192     692     500
Russian     1031     531     500
Thai        2222     1722    500
Tibetan     1177     677     500
Total       16,291   9791    6500

Table 2
Configuration of CNN-Basic. For brevity, 'n' stands for the number of output maps, 'k' stands for kernel size, 's' stands for stride, 'p' stands for padding size, and 'h' stands for the number of hidden units.

Layer  Type             Parameters               Activation
conv1  Convolutional    n=96, k=5, s=1, p=2      ReLU
mp1    Max-pooling      k=3, s=2                 –
conv2  Convolutional    n=256, k=3, s=1, p=2     ReLU
mp2    Max-pooling      k=3, s=2                 –
conv3  Convolutional    n=384, k=3, s=1, p=1     ReLU
mp3    Max-pooling      k=3, s=2                 –
conv4  Convolutional    n=512, k=3, s=1, p=0     ReLU
fc1    Fully connected  h=512                    ReLU
fc2    Fully connected  h=13                     –
sm     Soft-max         –                        –


As can be seen from the comparisons in Table 3, the proposed method consistently outperforms the other methods. The texture analysis approach LBP performs well on logographic scripts, but significantly worse on alphabetic scripts. The reason is that logographic scripts have larger appearance differences, making them more easily distinguishable via texture features. Alphabetic scripts, on the other hand, sometimes share a subset of characters that have the same appearance. Both SLN and CNN-Basic perform better than LBP. On the Logographic subset, SLN and CNN-Basic achieve accuracies comparable with DiscCNN, but on the other subsets they perform much worse. Without explicitly utilizing the special characters, LBP, SLN and CNN-Basic cannot well distinguish scripts that share characters, e.g. English and Greek. We have also tested the widely adopted Locality-constrained Linear Coding [52] (LLC), which is similar to SLN except that it uses hand-crafted HOG features. On the three subsets, LLC achieves average accuracies of 0.83, 0.91 and 0.85, generally lower than SLN. This indicates that the deep-like features learned by SLN are superior to the hand-crafted HOG.

Comparing the accuracies of different scripts and the confusion matrix shown in Fig. 8, we can see that the accuracies on scripts like Thai and Arabic are significantly higher than those on other scripts. The reason is that these scripts have unique writing styles and can be easily distinguished from other scripts. Other scripts, especially Latin-based ones, are relatively harder to identify. On these scripts, lower accuracies are observed for all methods. One reason is that these scripts share a common subset of the alphabet, so that they have similar holistic appearances. This makes it much harder to identify these scripts.

Table 4 compares the proposed DiscCNN with the MSPN proposed in our previous work [43]. DiscCNN outperforms MSPN, especially on the alphabetic scripts. Since DiscCNN describes text images by their discriminative characters/components, it better distinguishes between alphabetic scripts, which tend to have small or tiny appearance differences with each other.

5.3.1. Script identification in video texts

Our method can be adopted for video script identification without any modification. CVSI-2015 is the dataset used in the ICDAR 2015 Competition on Video Script Identification [2]. The dataset contains text images extracted from television videos, such as news and advertisements, and covers 10 scripts: English, Hindi, Bengali, Oriya, Gujarati, Punjabi, Kannada, Tamil, Telugu and Arabic. Four tasks are proposed in the CVSI-2015 competition. Task 1 requires identifying scripts from 8 different script triplets based on their use in the Indian sub-continent; all triplets include English and Hindi. Task 2 requires identifying the combination of scripts used in north India, involving seven scripts, namely English, Hindi, Bengali, Oriya, Gujarati, Punjabi and Arabic. Task 3 requires identifying the combination of scripts used in south India, involving five scripts, namely English, Hindi, Kannada, Tamil and Telugu. Task 4 requires identifying the combination of all ten scripts in the dataset.

Table 5 lists the results of the proposed method and comparisons with other methods. For Task 1 we report the average accuracy over all 8 sub-tasks. It can be observed that our approach consistently achieves the best performance among the methods we test.

Table 3
Recognition accuracies on the SIW-13 benchmark and comparisons with other baseline methods.

         Alphabetic                Logographic               Full
Script   LBP   SLN   CNN   Ours    LBP   SLN   CNN   Ours    LBP   SLN   CNN   Ours
Ara      0.80  0.91  0.94  0.96    –     –     –     –       0.64  0.87  0.90  0.94
Cam      –     –     –     –       –     –     –     –       0.46  0.76  0.83  0.88
Chi      –     –     –     –       0.82  0.90  0.86  0.91    0.66  0.87  0.85  0.88
Eng      0.63  0.77  0.72  0.83    –     –     –     –       0.31  0.64  0.58  0.71
Gre      0.70  0.79  0.74  0.86    –     –     –     –       0.57  0.75  0.70  0.81
Heb      –     –     –     –       –     –     –     –       0.61  0.91  0.89  0.91
Jap      –     –     –     –       0.85  0.93  0.88  0.93    0.58  0.88  0.75  0.90
Kan      –     –     –     –       –     –     –     –       0.56  0.88  0.82  0.91
Kor      –     –     –     –       0.87  0.94  0.93  0.96    0.69  0.93  0.90  0.95
Mon      0.88  0.97  0.94  0.98    –     –     –     –       0.77  0.95  0.96  0.96
Rus      0.62  0.78  0.71  0.82    –     –     –     –       0.44  0.70  0.66  0.79
Tha      –     –     –     –       –     –     –     –       0.61  0.91  0.79  0.94
Tib      –     –     –     –       0.96  0.97  0.99  0.98    0.88  0.97  0.97  0.97
Avg.     0.73  0.84  0.81  0.89    0.88  0.93  0.92  0.94    0.60  0.85  0.82  0.89

Fig. 8. Confusion matrix on the SIW-13 dataset. Y-labels are ground-truth labels and X-labels are predicted labels.

Table 4
Accuracy comparison between DiscCNN and MSPN on the SIW-13 dataset. Experiments were carried out 3 times with different model initializations. Student's t-test is carried out and the p < 0.05 criterion is satisfied.

Method    Alphabetic     Logographic    Full
MSPN      0.870±0.005    0.930±0.019    0.866±0.014
DiscCNN   0.892±0.008    0.942±0.007    0.887±0.007



5.3.2. Script identification in documents

We also apply our method to document texts. In [39], Pati and Ramakrishnan have proposed a large-scale word image dataset for the script identification task. The dataset contains 220,000 scanned word images from 11 different scripts, namely Bengali, Devanagari, English, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu and Urdu.

Table 6 shows the quantitative results. Our approach significantly outperforms [39] on all scripts, reaching saturated accuracies on some scripts. Compared to the Gabor or DCT features adopted in [39], our representation captures information about the special characters/components of scripts. Besides, being based on a deep architecture, our method benefits from the relatively large training set.

5.4. Discussions

5.4.1. Impact of fine-tuning

The joint optimization simultaneously updates all parameters in the network, thus optimizing the parameters of all modules used in our method. To evaluate the effectiveness of the joint optimization scheme, DiscCNN is evaluated on the SIW-13 Full dataset, with and without joint optimization. The resulting accuracies are compared in Fig. 9. The fine-tuning process achieves a 10–20% error reduction.

5.4.2. Choices on feature levels

In our implementation, we use three feature hierarchy levels for pattern discovery and discriminative encoding. To test the effectiveness of each of them, we modify the network structure, resulting in several network variants, and test them on the SIW-13 dataset. Table 7 lists the configurations of the network variants and their recognition accuracies on the SIW-13 dataset.

Comparing the performances of Variant-1, Variant-2 and Variant-3, we can see that Variant-3 achieves the best result: feature level 4 alone contributes the most to the recognition accuracy. Variant-4 and Variant-5 use multiple feature levels, and significant performance gains can be observed. Finally, the proposed DiscCNN uses all three levels, and achieves the highest accuracy. This indicates that combining the outputs of multiple convolutional layers can improve the performance. The feature hierarchy extracted by the convolutional layers describes the input image at different grain-levels, and their combination brings a performance gain. Note that we do not use features from level 1, since the feature maps at level 1 are too large, and would require too much memory and computation time.

5.4.3. Impact of imperfect text cropping

In the SIW-13 dataset, texts are manually labeled, and all bounding boxes are tight.

Table 5
Recognition accuracies on the CVSI-2015 benchmark and comparisons with other methods.

Method  Task 1  Task 2  Task 3  Task 4
CNN     0.899   0.853   0.926   0.874
SLN     0.950   0.921   0.936   0.930
Ours    0.961   0.938   0.967   0.943

Table 6
Recognition accuracies on all 11 scripts included in the Multi-Script dataset proposed in [39].

Script  Gabor+SVM  DCT+SVM  Ours
BE      0.962      0.926    0.998
EN      0.982      0.966    0.996
GU      0.955      0.954    0.991
HI      0.933      0.943    0.990
KA      0.933      0.903    0.984
MA      0.936      0.844    0.995
OD      0.940      0.943    0.997
PU      0.938      0.921    0.993
TA      0.952      0.933    0.993
TE      0.923      0.910    0.992
UR      0.979      0.984    0.998
Avg.    0.948      0.938    0.993

Fig. 9. Impact of the joint optimization, evaluated on the SIW-13 Full dataset. Top: accuracies for all classes, and the average accuracy. Bottom: relative error reduction.

Table 7
Network configurations for DiscCNN and its variants. "DE-3" indicates that the network variant only uses Discriminative Encoding layer 3 (DE-3) in Fig. 5; "DE-2+DE-3" indicates that the network variant uses both DE-2 and DE-3.

Variant    Configuration     Accuracy (%)
Variant-1  DE-2              74.7
Variant-2  DE-3              84.2
Variant-3  DE-4^a            85.5
Variant-4  DE-3+DE-4         88.3
Variant-5  DE-2+DE-3         85.2
DiscCNN    DE-2+DE-3+DE-4    88.6

^a In Variant-1 the number of hidden nodes in fc2 is set to 512.


For a practical system, however, input texts are usually detected by a text detector, which may sometimes fail to give perfect bounding boxes. To test the impact of imperfect text cropping, we distort the test data by randomly cropping sub-images from the original text images, and evaluate the performance of a trained DiscCNN model. In Fig. 10, we plot the average accuracy as a function of the Intersection over Union (IoU), which measures how far a bounding box deviates from the ground truth. The recognition accuracy falls as the IoU decreases. At IoU = 0.7 and IoU = 0.5, which are thresholds used in PASCAL VOC 2007 [13], the recognition accuracies are, respectively, 0.74 and 0.49. The results indicate that for minor mis-cropping (IoU above 0.7) the results are still acceptable, but when the IoU drops below 0.7, performance falls quickly. In addition, we evaluate the impact of imperfect cropping on the other methods. For SLN, CNN and DiscCNN, the accuracy decreases follow similar tendencies. LBP seems to be more robust to imperfect cropping. One reason is that LBP describes texture, which is often insensitive to cropping.
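For completeness, the IoU between an axis-aligned crop and its ground-truth box, as used in this experiment and in PASCAL VOC [13], can be computed with a small helper like the one below (the (x1, y1, x2, y2) box format is an assumption):

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# e.g. a crop covering the left 70% of a 100x32 ground-truth box:
# iou((0, 0, 70, 32), (0, 0, 100, 32)) == 0.7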

5.5. Limitations of the proposed approach

The proposed DiscCNN works well on the majority of the datasets we have evaluated. In Fig. 11a we show some hard samples that are correctly recognized. Nevertheless, the proposed method fails in some cases. Fig. 11b demonstrates some failure cases. Misclassification sometimes happens when the input image is blurred or of low resolution. Scripts with unusual text layouts (e.g. curved text) can be hard to identify. In addition, scripts such as Chinese and Japanese are sometimes very close in appearance, and are hard to distinguish without semantic understanding.

As another drawback, the proposed method is based on the convolutional neural network, and therefore requires a relatively long training time. GPU acceleration is required to make the training time reasonable. The method also requires a larger training set (typically more than 4k examples) in order to avoid overfitting.

6. Conclusion

In this paper, we have presented DiscCNN, a novel deep learning based method for script identification. The method combines deep features with discriminative mid-level representations. We have successfully applied the method to script identification in natural scene images, in documents and in videos. It is worth noting that, with some modifications, the proposed network can accept images of arbitrary sizes without scaling them. In future work, we will try to extend our approach to general image classification problems, especially fine-grained classification problems. In addition, directly identifying the script types in whole street images is another direction worthy of exploration.

Conflict of interest

None declared.

Acknowledgements

This work was primarily supported by the National Natural Science Foundation of China (NSFC) (No. 61222308), and in part by the Program for New Century Excellent Talents in University (No. NCET-12-0217).

Fig. 10. Average recognition accuracies on SIW-13 as a function of the Intersection over Union (IoU).

Fig. 11. (a) Difficult samples that are successfully recognized. (b) Some failure cases. Captions are in the format "(true script) → (predicted script)".


References

[1] Google Street View.
[2] ICDAR 2015 Competition on Video Script Identification (CVSI-2015), ⟨http://www.ict.griffith.edu.au/cvsi2015/⟩.
[3] T. Ahonen, A. Hadid, M. Pietikäinen, Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 28 (12) (2006) 2037–2041.
[4] V. Alabau, A. Sanchis, F. Casacuberta, Improving on-line handwritten recognition in interactive machine translation, Pattern Recognit. 47 (3) (2014) 1217–1228.
[5] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, D.K. Basu, A novel framework for automatic sorting of postal documents with multi-script address blocks, Pattern Recognit. 43 (10) (2010) 3507–3521.
[6] I. Bazzi, R.M. Schwartz, J. Makhoul, An omnifont open-vocabulary OCR system for English and Arabic, IEEE Trans. Pattern Anal. Mach. Intell. 21 (6) (1999) 495–504.
[7] A. Busch, W.W. Boles, S. Sridharan, Texture for script identification, IEEE Trans. Pattern Anal. Mach. Intell. 27 (11) (2005) 1720–1732.
[8] P.R. Cavalin, R. Sabourin, C.Y. Suen, A. de Souza Britto Jr., Evaluation of incremental learning algorithms for HMM in the recognition of alphanumeric characters, Pattern Recognit. 42 (12) (2009) 3241–3253.
[9] D. Chen, J. Odobez, H. Bourlard, Text detection and recognition in images and video frames, Pattern Recognit. 37 (3) (2004) 595–608.
[10] A. Coates, A.Y. Ng, H. Lee, An analysis of single-layer networks in unsupervised feature learning, in: Proceedings of AISTATS, 2011, pp. 215–223.
[11] T. Cour, C. Jordan, E. Miltsakaki, B. Taskar, Movie/script: alignment and parsing of video and text transcription, in: Proceedings of ECCV, 2008.
[12] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of CVPR, 2005.
[13] M. Everingham, L.J.V. Gool, C.K.I. Williams, J.M. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[14] D. Ghosh, T. Dube, A.P. Shivaprasad, Script recognition — a review, IEEE Trans. Pattern Anal. Mach. Intell. 32 (12) (2010) 2142–2161.
[15] M.M. Haji, T.D. Bui, C.Y. Suen, Removal of noise patterns in handwritten images using expectation maximization and fuzzy inference systems, Pattern Recognit. 45 (12) (2012) 4237–4249.
[16] H. Hase, T. Shinokawa, M. Yoneda, C.Y. Suen, Character string extraction from color documents, Pattern Recognit. 34 (7) (2001) 1349–1365.
[17] J. He, J. Feng, X. Liu, T. Cheng, T. Lin, H. Chung, S. Chang, Mobile product search with bag of hash bits and boundary reranking, in: Proceedings of CVPR, 2012.
[18] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: Proceedings of ECCV, 2014.
[19] P.S. Hiremath, S. Shivashankar, Wavelet based co-occurrence histogram features for texture classification with an application to script identification in a document image, Pattern Recognit. Lett. 29 (9) (2008) 1182–1189.
[20] J. Hochberg, P. Kelly, T. Thomas, L. Kerns, Automatic script identification from document images using cluster-based templates, IEEE Trans. Pattern Anal. Mach. Intell. 19 (2) (1997) 176–181.
[21] D.H. Hubel, T.N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, J. Physiol. 160 (1) (1962) 106–154.
[22] L.G. i Bigorda, D. Karatzas, Multi-script text extraction from natural scenes, in: Proceedings of ICDAR, 2013.
[23] A.K. Jain, B. Yu, Automatic text location in images and video frames, Pattern Recognit. 31 (12) (1998) 2055–2076.
[24] G.D. Joshi, S. Garg, J. Sivaswamy, A generalised framework for script identification, Int. J. Doc. Anal. Recognit. 10 (2) (2007) 55–68.
[25] K. Jung, K.I. Kim, A.K. Jain, Text information extraction in images and video: a survey, Pattern Recognit. 37 (5) (2004) 977–997.
[26] M. Khayyat, L. Lam, C.Y. Suen, Learning-based word spotting system for Arabic handwritten documents, Pattern Recognit. 47 (3) (2014) 1021–1030.
[27] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of NIPS, 2012.
[28] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Proceedings of CVPR, 2006.
[29] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[30] Y. LeCun, L. Bottou, G.B. Orr, K. Müller, Efficient backprop, in: Neural Networks: Tricks of the Trade, 2nd edition, 2012, pp. 9–48.
[31] F. Li, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: Proceedings of CVPR, 2005.
[32] L. Li, R. Socher, F. Li, Towards total scene understanding: classification, annotation and segmentation in an automatic framework, in: Proceedings of CVPR, 2009.
[33] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[34] R. Manthalkar, P.K. Biswas, B.N. Chatterji, Rotation and scale invariant texture features using discrete wavelet packet transform, Pattern Recognit. Lett. 24 (14) (2003) 2455–2462.
[35] A. Mishra, K. Alahari, C.V. Jawahar, Scene text recognition using higher order language priors, in: Proceedings of BMVC, 2012.
[36] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of ICML, 2010.
[37] L. Neumann, J. Matas, A method for text localization and recognition in real-world images, in: Proceedings of ACCV, 2010.
[38] X. Niu, C.Y. Suen, A novel hybrid CNN-SVM classifier for recognizing handwritten digits, Pattern Recognit. 45 (4) (2012) 1318–1325.
[39] P.B. Pati, A.G. Ramakrishnan, Word level multi-script identification, Pattern Recognit. Lett. 29 (9) (2008) 1218–1229.
[40] T.Q. Phan, P. Shivakumara, Z. Ding, S. Lu, C.L. Tan, Video script identification based on text lines, in: Proceedings of ICDAR, 2011.
[41] J. Schenk, J. Lenz, G. Rigoll, Novel script line identification method for script normalization and feature extraction in on-line handwritten whiteboard note recognition, Pattern Recognit. 42 (12) (2009) 3383–3393.
[42] A. Shahab, F. Shafait, A. Dengel, ICDAR 2011 robust reading competition challenge 2: reading text in scene images, in: Proceedings of ICDAR, 2011.
[43] B. Shi, C. Yao, C. Zhang, X. Guo, F. Huang, X. Bai, Automatic script identification in the wild, in: Proceedings of ICDAR, 2015.
[44] C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, Scene text detection using graph model built upon maximally stable extremal regions, Pattern Recognit. Lett. 34 (2) (2013) 107–116.
[45] S. Singh, A. Gupta, A.A. Efros, Unsupervised discovery of mid-level discriminative patches, in: Proceedings of ECCV, 2012.
[46] S. Sinha, U. Pal, B.B. Chaudhuri, Word-wise script identification from Indian documents, in: Proceedings of Workshop on DAS, 2004.
[47] C. Strouthopoulos, N. Papamarkos, A. Atsalakis, Text extraction in complex color documents, Pattern Recognit. 35 (8) (2002) 1743–1758.
[48] T.N. Tan, Rotation invariant texture features and their use in automatic script identification, IEEE Trans. Pattern Anal. Mach. Intell. 20 (7) (1998) 751–756.
[49] Y.Y. Tang, S. Lee, C.Y. Suen, Automatic document processing: a survey, Pattern Recognit. 29 (12) (1996) 1931–1952.
[50] A.H. Toselli, V. Romero, M.P. i Gadea, E. Vidal, Multimodal interactive transcription of text images, Pattern Recognit. 43 (5) (2010) 1814–1825.
[51] A. Vedaldi, B. Fulkerson, VLFeat: an open and portable library of computer vision algorithms, ⟨http://www.vlfeat.org/⟩, 2008.
[52] J. Wang, J. Yang, K. Yu, F. Lv, T.S. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Proceedings of CVPR, 2010.
[53] K. Wang, S. Belongie, Word spotting in the wild, in: Proceedings of ECCV, 2010.
[54] V. Wu, R. Manmatha, E.M. Riseman, TextFinder: an automatic system to detect and recognize text in images, IEEE Trans. Pattern Anal. Mach. Intell. 21 (11) (1999) 1224–1229.
[55] C. Yao, X. Bai, W. Liu, A unified framework for multi-oriented text detection and recognition, IEEE Trans. Image Process. 23 (11) (2014) 4737–4749.
[56] C. Yao, X. Bai, B. Shi, W. Liu, Strokelets: a learned multi-scale representation for scene text recognition, in: Proceedings of CVPR, 2014.
[57] C. Zhang, C. Yao, B. Shi, X. Bai, Automatic discrimination of text and non-text natural images, in: Proceedings of ICDAR, 2015.
[58] D. Zhao, P. Shivakumara, S. Lu, C.L. Tan, New spatial-gradient-features for video script identification, in: Proceedings of Workshop on DAS, 2012.
[59] Y. Zhong, K. Karu, A.K. Jain, Locating text in complex color images, Pattern Recognit. 28 (10) (1995) 1523–1535.

Baoguang Shi received the B.S. degree in Electronics and Information Engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2012, where he is currently working toward the Ph.D. degree at the School of Electronic Information and Communications. His research interests include scene text detection and recognition, script identification and face alignment.

Xiang Bai received the B.S., M.S., and Ph.D. degrees from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2003, 2005, and 2009, respectively, all in electronics and information engineering. He is currently a Professor with the School of Electronic Information and Communications, HUST. He is also the Vice-director of the National Center of Anti-Counterfeiting Technology, HUST. His research interests include object recognition, shape analysis, scene text recognition and intelligent systems.

Cong Yao received the B.S. and Ph.D. degrees, both in Electronics and Information Engineering, from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2008 and 2014, respectively. His research has focused on computer vision and machine learning, particularly in the area of text detection and recognition in natural images.


自动化工程案例分析

Summary Report for the Course "Case Studies in Automation Engineering"

Time has flown: in the blink of an eye we are in the fourth year of university. A brief look back stirs up ripples, and I find that this case-analysis course has stayed with me and left a lasting memory. The devoted teaching of the four instructors leaves our university years not only with vivid scenes but also with a lingering warmth.

The four instructors explained many detailed real-world cases in an accessible way. These examples corroborated the knowledge we had learned and deepened our understanding of the specialty; they also gave us a more direct picture of the directions we may work in after graduation, leaving us braver and more confident about the coming job or graduate study.

Mr. Ye demonstrated the metering system of a Sinopec oil depot. He began with the background: Sinopec bears a heavy responsibility for national energy security, with an annual crude-processing volume on the order of a hundred million tons, part of which relies on imports, so lowering the cost of purchasing and transporting crude has become a key issue for operating efficiency. Moving crude in larger shipments, or by pipeline, has become Sinopec's main means of cutting transport costs. Oil-depot management abroad has already introduced advanced industrial control, networking and database technology for comprehensive management of daily product receipt and dispatch, storage, and depot monitoring, while China's depot automation still lags the international state of the art: metering instruments have lower accuracy and poorer stability, control systems have lower control precision, and information management is not yet sound. Depot automation and management in China went through a long period of development, with systems that differ widely in operating style and quality; many still rely on primitive dispatch practices such as manual ticketing, manual valve opening and manually controlled pumps. On the one hand, such systems are unreliable, which hurts the depot's economic performance; on the other, they make no use of modern information technology to let staff follow real-time site conditions and historical production data conveniently and promptly, so they cannot provide a reliable data basis for production-dispatch decisions and do not help raise the level of scientific management across the enterprise.

Automation projects surveyed: a depot monitoring automation system; a crude-blending automation system; a mineral-processing automation system. Embedded projects surveyed: an intelligent anti-rolling system; a customs logistics monitoring system for oil, gas and liquid chemical products. Taking the project requirements as a whole and analyzing the complete system, we need: project survey (automation/embedded); bidding and proposals; monitoring-system design; monitoring-system commissioning; monitoring-system acceptance; project management.

An oil depot is a specialized warehouse for storing and supplying petroleum products, the link that coordinates crude production and refining with finished-product transport and supply. For a long time, many operations in depot data collection in China were done manually: work efficiency was low and errors from human factors crept in easily, and staff could not promptly follow real-time site conditions, which hindered standardized enterprise management. As automation ...

A Brief Discussion of Neural Networks

Let us start from the regression problem. Many people say that to achieve strong AI, machines must learn to observe and summarize patterns: to see what is round and what is square, to tell colors and shapes apart, and then classify or predict things from those features. That is exactly a regression problem.

How do we solve it? When we look at something, we grasp its basic features at a glance. A computer, however, sees only a pile of numbers, so making a machine find patterns in the features of things is really the problem of finding patterns in numbers.

Example: given a sequence whose first six terms are 1, 3, 5, 7, 9, 11, what is the seventh? You can see at once that it is 13: the terms are consecutive odd numbers in order. Now try this one: the first six terms are 0.14, 0.57, 1.29, 2.29, 3.57, 5.14. The seventh is not so obvious. Plot the points on a coordinate axis (figure omitted in the source), connect them with a curve, and by following the curve's trend you can extrapolate the seventh number: 7. Regression, then, is really a curve fitting problem.

How should the fitting be done? A machine cannot sketch a curve by feel the way you would; it needs an algorithm. Suppose we have a set of sample points distributed according to some rule, and take fitting a straight line as an example. The idea is simple: draw an arbitrary line, then keep rotating it. After each small rotation, compute the distance (error) between each sample point and the corresponding point on the line, and sum the errors over all points. Keep rotating until the total error reaches its minimum, then stop. To be more complete, while rotating you also keep translating the line, adjusting continuously until the error is smallest. This is the famous gradient descent method. Why "gradient" descent? Because as the error shrinks, the amount of rotation or translation shrinks with it; once the error drops below some tiny number, say 0.0001, we can stop (converge). Note that turning at random, overshooting and turning back, is not gradient descent.

A line has the equation y = kx + b, where k is the slope and b the offset (the intercept on the y axis): k controls the line's rotation and b controls its translation. The essence of gradient descent is to keep modifying the two parameters k and b until the final error is minimal. When measuring the error we accumulate (line point - sample point)^2, which works better than accumulating the raw differences (line point - sample point). Solving regression by minimizing the sum of squared errors in this way is the least squares method.

The problem seems solved at this point, but we need a method that adapts to fitting all kinds of curves, so we must dig deeper. If we plot the fitting error against the rotation angle (slope) of the fitted line, we obtain a function curve (figure omitted in the source).
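As a concrete illustration of the procedure just described, here is a minimal MATLAB sketch that fits y = kx + b by gradient descent on the squared error; the sample points are synthetic, and the learning rate and stopping threshold are illustrative choices, not values from the original text.

% Fit y = k*x + b by gradient descent on the mean squared error.
x = linspace(0, 10, 50);                  % sample inputs
y = 2*x + 1 + 0.5*randn(size(x));         % noisy samples around y = 2x + 1
k = 0; b = 0; lr = 0.01;                  % initial guess and learning rate
for epoch = 1:1000
    err = (k*x + b) - y;                  % residual at each sample point
    gk = 2*mean(err .* x);                % d(mean squared error)/dk
    gb = 2*mean(err);                     % d(mean squared error)/db
    k = k - lr*gk;  b = b - lr*gb;        % move against the gradient
    if gk^2 + gb^2 < 1e-8, break; end     % converged: stop adjusting
end
fprintf('k = %.3f, b = %.3f\n', k, b);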

An Analysis of Neural Networks and Complex Networks

Abstract: Complex networks are everywhere in real life, and biological networks are one class of them; the neural network is one of the most important biological networks. Neural networks can be used to study other areas such as network security and artificial intelligence; and since neural networks are themselves complex networks, they can in turn be studied using properties of complex networks, for example the small-world effect. This paper introduces and briefly analyzes several studies (five papers) that apply complex networks.

Keywords: complex network, neural network

Time-Series Prediction with Wavelet Neural Networks (Complete Version)

Short-Term Traffic Flow Prediction Based on a Wavelet Neural Network

Abstract: The time-series prediction theory of wavelet neural networks is applied to short-term traffic flow forecasting. Wavelet decomposition and reconstruction separate the traffic flow data into a low-frequency approximation part and a high-frequency random part; then, after weighing the strengths and weaknesses of various models, the more effective models (or combinations of models) are selected to build the traffic flow prediction model. Finally, simulation with measured traffic flow data shows that the model effectively improves the accuracy of short-term traffic flow prediction.

Keywords: wavelet transform, traffic flow prediction, neural network

1. Background. As is well known, a road traffic system is a time-varying, complex, nonlinear large-scale system with human participants, and one of its salient features is a high degree of uncertainty (from both human and natural influences). This uncertainty makes short-term traffic flow prediction extremely difficult, and it is why short-term prediction is more complicated than medium- and long-term prediction. Wavelet analysis is not an entirely unfamiliar tool in traffic flow prediction, but it is still at an exploratory stage of application. In fact the method is widely used for predicting computer-network traffic, and since vehicle flow, like network traffic, exhibits complex behavior, its application extends naturally by analogy to traffic flow prediction. Wavelet analysis has an inborn ability to handle non-stationary time series, so it is often used on its own to resolve problems that defeat conventional time-series models.

2. Wavelet theory. Wavelet analysis was developed to remedy a shortcoming of the Fourier transform. The Fourier transform is the most widely used analysis tool in signal processing, but it has a serious deficiency: the transform discards time information, so its result cannot tell when a given event in the signal occurred. A wavelet is a waveform of finite length and zero mean, characterized by: (1) compact (or nearly compact) support in the time domain; (2) a zero DC component. The wavelet transform shifts a basic wavelet function ψ(t) by b and then takes its inner product with the signal x(t) under analysis at each scale a:

    WT_x(a, b) = (1/√a) ∫ x(t) ψ*((t − b)/a) dt                      (2-1)

The equivalent frequency-domain expression is

    WT_x(a, b) = (√a / 2π) ∫ X(ω) Ψ*(aω) e^{jωb} dω,   a > 0        (2-2)

3. The wavelet neural network. A wavelet neural network is the product of combining wavelet analysis with neural network theory: the wavelet basis function serves as the transfer function of the hidden-layer nodes, and while the signal propagates forward, the error propagates backward. In Figure 1 (omitted in the source), x1, x2, ..., xk are the inputs of the wavelet neural network and y1, y2, ..., ym are its predicted outputs.
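To make the decomposition step concrete, the following MATLAB sketch splits a series into its low-frequency approximation and high-frequency detail with the Mallat algorithm. It assumes the Wavelet Toolbox is available; the flow series, the db4 wavelet and the 3 levels are stand-in choices, not values from the study.

% Split a traffic-flow series into low-frequency trend and
% high-frequency fluctuation (Mallat fast algorithm).
flow = cumsum(randn(1, 256)) + 50;        % synthetic stand-in for measured flow
[C, L] = wavedec(flow, 3, 'db4');         % 3-level orthogonal decomposition
approx = wrcoef('a', C, L, 'db4', 3);     % low-frequency approximation part
detail = flow - approx;                   % high-frequency random part
% `approx` and `detail` would each be fed to a separate predictor,
% and the two forecasts summed to give the combined prediction.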

Applications of Neural Network Analysis

A Prediction Method Based on a Dynamic BP Neural Network and Its Application. Source: China Paper Download Center [2008-05-05]. Authors: Zhu Haiyan, Zhu Xiaolian, Huang Di. Editor: studa0714

Abstract: The artificial neural network is a new way of building mathematical models; it can learn to approximate arbitrary nonlinear mappings. This paper proposes a prediction method based on a dynamic BP neural network, explains its basic principle, and verifies it with a typical example.

Keywords: neural network, BP model, prediction

1. Introduction. In system modeling, identification and prediction of linear systems, the transfer-function matrix expresses the black-box input/output model well in the frequency domain, while in the time domain the Box-Jenkins method, regression analysis, ARMA models and the like provide descriptions through various parameter-estimation methods. For nonlinear time-series prediction, bilinear models, threshold autoregressive models and ARCH models all must assume a form for the relations among the series while knowing little about the data's internal regularities. Traditional nonlinear system prediction therefore faces great difficulty in both theory and practice. By contrast, a neural network can build a nonlinear model without first understanding the relations between input and output variables [4,6]. Neurons and neural networks are nonlinear, non-local, non-stationary, non-convex and even chaotic; combined organically with various prediction methods they have excellent prospects and open new directions and breakthroughs for prediction systems. The stability and dynamics of modeling algorithms and prediction systems have become hot research topics. The network most used today in modeling and prediction is the static multilayer feedforward network, mainly because it can learn to approximate arbitrary nonlinear mappings. Building a system's input/output model with such a network is, in essence, using the network's approximation power to learn the nonlinear function in the system's difference equation. In practice, however, most systems to be modeled and predicted are nonlinear and dynamic, and a static multilayer feedforward network requires the model order to be given in advance, that is, the system model must be predetermined, which is very hard to do. Recent research on modeling and prediction with dynamic networks represents the new direction of the field.

2. The BP neural network model. A BP network is a multilayer network that uses the Widrow-Hoff learning rule and nonlinear differentiable transfer functions. The classic BP algorithm uses gradient descent, i.e., the Widrow-Hoff algorithm; many other basic optimization algorithms exist, such as variable-metric and Newton methods. As shown in Figure 1 (omitted in the source), a BP neural network comprises:
(1) processing units (neurons, drawn as circles in the figure), the basic building blocks of the network: input-layer units simply pass the input values into the adjacent connection weights, while hidden- and output-layer units sum their inputs and compute outputs through a transfer function;
(2) connection weights (V, W in the figure), which link the processing units of the network and whose values vary with the strength of each connection;
(3) layers: a network generally has an input layer x, a hidden layer y and an output layer o;
(4) thresholds, fixed or variable, which let the network capture the target functional relation more freely;
(5) the transfer function F, usually nonlinear, which converts a unit's input into its output.
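As a minimal illustration of fitting a static multilayer feedforward network to input/output data, the following sketch uses MATLAB's feedforwardnet; it assumes the Neural Network (Deep Learning) Toolbox, and the data are synthetic rather than from the paper's example.

% Fit a one-hidden-layer feedforward network to noisy samples.
x = linspace(-1, 1, 100);                 % inputs
t = sin(2*pi*x) + 0.1*randn(size(x));     % noisy targets
net = feedforwardnet(8);                  % hidden layer of 8 neurons
net = train(net, x, t);                   % Levenberg-Marquardt by default
y = net(x);                               % network predictions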

A Wavelet Neural Network Program

This is a wavelet neural network program, by judyever.

% Reference: Journal of Ocean University of Qingdao, 2001, No. 1,
% "A wavelet neural network trained by the BP algorithm"
% Step 1 -- network initialization -------------------------------------
clc; clear all;
err_goal = 0.001;             % target minimum error
max_epoch = 50;               % maximum number of training epochs
lr = 0.7;                     % learning rate for weight updates (0.01-0.7)
epoch = 0;
x = 0:0.01:0.3;               % input time series
d = sin(8*pi*x) + sin(16*pi*x);   % target output series
M = size(x, 2);               % number of input nodes
N = M;                        % number of output nodes
n = 10;                       % number of hidden nodes
% (this could be improved: in practice the hidden-node count can be
%  determined from a time-frequency analysis of the wavelet)
Wjk = randn(n, M);            % input-to-hidden weights
Wij = randn(N, n);            % hidden-to-output weights
a = 1:1:n;                    % wavelet scale parameters
b = randn(1, n);              % wavelet shift parameters
y = zeros(1, N);              % initialize output nodes
net = zeros(1, n);            % initialize hidden-node activations
net_ab = zeros(1, n);         % initialize scaled/shifted activations
% Step 2 -- train the network ------------------------------------------
for i = 1:1:N
    for j = 1:1:n
        for k = 1:1:M
            net(j) = net(j) + Wjk(j,k) * x(k);
            net_ab(j) = (net(j) - b(j)) / a(j);
        end
        y(i) = y(i) + Wij(i,j) * mymorlet(net_ab(j));
        % mymorlet is the wavelet function written by judyever; this could
        % later be extended to accept different wavelet names
    end
end
% (the excerpt ends here; the error backpropagation and weight-update
%  steps of Step 2 are truncated in the source)
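The helper mymorlet is not included in the excerpt. A common real-valued Morlet mother-wavelet definition that would fit the call above is the following; this is an assumed implementation saved as mymorlet.m, not the author's original:

function y = mymorlet(t)
% Morlet mother wavelet, common real form: cos(1.75 t) * exp(-t^2 / 2).
y = cos(1.75 * t) .* exp(-t.^2 / 2);
end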

A MATLAB Handwritten Digit Recognition System Based on an Artificial Neural Network

1. The function MouseDraw builds the GUI of the handwriting-recognition system and implements mouse-drawn input. (Save it as MouseDraw.m when using.)

function MouseDraw(action)
% MouseDraw: shows how to set mouse-event callbacks with Handle Graphics.
% The program does not produce "broken strokes" even when the mouse
% moves very fast.
% (global variables cannot pass matrices)
global InitialX InitialY FigHandle hb2 hb3 hb4 count hb5 hb6 hb7
count = 'E:\im.jpg';
imSize = 50;
if nargin == 0, action = 'start'; end
switch(action)
    % open the figure window
    case 'start',
        FigHandle = figure('WindowButtonDownFcn', 'MouseDraw down', ...
                           'DeleteFcn', 'save bpnet');
        axis([1 imSize 1 imSize]);    % set the axis range
        axis off; grid off; box on;   % frame the axes
        title('Handwriting input window');
        try
            evalin('base', 'load bpnet')
        catch
            evalin('base', 'bpgdtrain');
        end
        % the mouse-down callback is set to "MouseDraw down" above
        hb1 = uicontrol('Parent', FigHandle, 'Units', 'Normalized', ...
            'Position', [.3 .01 .13 .07], 'String', 'Save', ...
            'Callback', ['exa=rgb2gray(frame2im(getframe(gca)));', ...
                         'imwrite(exa,''E:\im.jpg'')']);
        hb2 = uicontrol('Parent', FigHandle, 'Style', 'popupmenu', ...
            'Position', [50 50 50 30], ...
            'String', {'26','24','22','20','18','16','14','12','10'});
        hb3 = uicontrol('Parent', FigHandle, 'Style', 'text', ...
        % (the excerpt breaks off here in the source)
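The callbacks above reference a saved network bpnet and a training script bpgdtrain, neither of which is included in the excerpt. A minimal stand-in consistent with those references might look like this; the sample file name and network size are assumptions, not the author's code:

% bpgdtrain.m -- hypothetical stand-in for the referenced training script.
% Assumes a .mat file providing inputs P (features x samples) and targets T.
load digitSamples                         % hypothetical training data file
bpnet = feedforwardnet(25, 'traingd');    % small BP net, gradient descent
bpnet = train(bpnet, P, T);
save bpnet                                % saved so MouseDraw can reload it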

The Neural Network Analysis Method

The neural network analysis method grew out of the findings of neuropsychology and cognitive science and was developed with mathematical methods; it is a parallel, distributed processing approach with strong parallel computation, self-learning and fault tolerance. Neural network techniques have shown exceptional strength in pattern recognition and classification, filtering, automatic control and forecasting. A network's structure consists of an input layer, several hidden layers and an output layer. Through continual learning, the method can discover regularities in large volumes of complex data of unknown pattern. It overcomes the complexity of traditional analysis and the difficulty of choosing an appropriate model function: it is a natural nonlinear modeling process that does not require specifying what kind of nonlinearity is present, which greatly simplifies modeling and analysis.

The method in risk assessment. The advantages of neural network analysis in credit-risk assessment are that it imposes no strict distributional assumptions and can handle nonlinear problems. It can effectively treat non-normal, nonlinear credit-evaluation problems, and its outputs lie between 0 and 1, which in credit-risk terms can be read as a default probability. Its chief drawback is the strong randomness of the procedure: obtaining a good network structure requires manual tuning, which costs much labor and time, and this has limited the model's application. Altman, Marco and Varetto (1994) applied neural network analysis to predicting the financial distress of Italian companies; Coats and Fant (1993) and Trippi used it to predict distress of American companies and banks respectively, with good results. Even so, because finding a good structure requires extensive manual trial and error, and because the method's conclusions lack a statistical-theory foundation and are hard to interpret, its application remains much restricted.

The method in finance. When used to study corporate financial condition, the method exploits the network's mapping ability and, above all, its generalization ability: after training on a certain number of noisy samples, the network extracts the feature relations implicit in the samples and can interpolate and extrapolate on data from new situations to infer their properties. Although neural network theory can be traced back to the 1940s, its application in credit-risk analysis began only in the 1990s. Researchers abroad such as Altman, Marco and Varetto (1995) applied neural network analysis to predicting the financial distress of Italian companies, and Coats and Pant (1993) applied the neural network analysis method ...
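The kind of credit-scoring network described here can be sketched in MATLAB as follows; the Neural Network Toolbox is assumed, and the five financial ratios and default labels are synthetic stand-ins rather than data from the cited studies.

% A two-class credit-risk classifier whose outputs lie in [0, 1].
X  = [randn(5,100), randn(5,100) + 1.5];  % 5 ratios for 200 firms
t1 = [ones(1,100), zeros(1,100)];         % 1 = default, 0 = healthy
T  = [t1; 1 - t1];                        % one-hot targets for patternnet
net = patternnet(6);                      % one hidden layer of 6 neurons
net = train(net, X, T);
p = net(X);                               % p(1,:) reads as default probability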

Applications of Neural Networks in Data Mining


Abstract: This paper surveys the state of research on data mining methods and, by analyzing the limitations of some current methods, introduces a data mining method based on relational databases: the neural network method. At present the network most commonly used in data mining is the BP network. The paper closes with some open problems of neural network methods in data mining.

Keywords: BP algorithm; neural network; data mining

1. Introduction. In a network age of "exploding data but scarce knowledge," people hope for higher-level analysis so as to use the data better, and data mining technology has arisen in response, showing great vitality. Unlike traditional data analysis, data mining extracts information and discovers knowledge without explicit prior hypotheses, and the information obtained has three characteristics: it is previously unknown, valid, and practical. Data mining seeks regularities in large volumes of data in three steps: data preparation, regularity finding and regularity presentation. Data preparation selects and integrates the data for mining from various sources; regularity finding extracts the rules in the data by some method; regularity presentation expresses the rules found in a form as close as possible to the user's habits (e.g., visualization). In its development, data mining has absorbed many techniques from mathematical statistics, databases and artificial intelligence. As a young data-processing technology, its main goal is to help decision makers find potential relations, patterns and trends among data and spot overlooked factors, which is very useful for forecasting and decision making.

Data mining was applied early in commerce and has become a key technology of electronic commerce; owing to its strength in exploiting information resources, it has gradually spread to insurance, health care, manufacturing, telecommunications and other industries.

Data mining (DM) is the core of knowledge discovery in databases and has formed a whole new field of application. It is the non-trivial process of identifying valid, novel, potentially useful and fully understandable patterns from large, noisy, random data, thereby aiding scientific research, business decisions and enterprise management. It is a high-level process that identifies knowledge, expressed as patterns, from data sets. Its core techniques are artificial intelligence, machine learning and statistics, but a DM system is not a simple combination of several techniques; it is an integrated whole that needs the support of further auxiliary technology to complete the chain of high-level processing: data collection, preprocessing, data analysis and result presentation. "High-level process" means a multi-step procedure whose steps influence one another and are adjusted repeatedly, forming a spiral ascent, with the analysis results finally presented to the user. By function, a whole DM system can be divided roughly into a three-level structure.

A neural network is adaptive and capable of learning: the network continually checks whether its predictions match reality, treats the input/output pairs that do not match as new samples, learns from them dynamically, and dynamically adjusts its structure and parameters. The network thus adapts to changes in the environment or in the structure and parameters of the object being predicted, making the prediction model more adaptive and yielding knowledge and rules closer to reality, which helps decision makers decide better. In the ANN ...

A Fuzzy Neural Network Prediction Algorithm Applied to Water Quality Assessment of the Jialing River (2)

A Fuzzy Neural Network Prediction Algorithm: Water Quality Evaluation of the Jialing River

1. Case background

1.1 A brief introduction to fuzzy mathematics. Fuzzy mathematics describes, studies and handles the fuzzy characteristics of things: "fuzzy" names its object of study and "mathematics" its research method. Its most basic concepts are the membership degree and the fuzzy membership function. The membership degree is the degree to which an element u belongs to a fuzzy subset f, written μf(u), a number in [0, 1]: the closer μf(u) is to 0, the less u belongs to f; the closer it is to 1, the more u belongs to f. A fuzzy membership function computes membership degrees quantitatively; common forms include triangular, trapezoidal and normal (Gaussian) functions.

1.2 The T-S fuzzy model. The T-S fuzzy system is highly adaptive: the model can update automatically and keeps correcting the membership functions of its fuzzy subsets. It is defined by "if-then" rules; for rule R^i the fuzzy inference is

    R^i:  if x1 is A_1^i, x2 is A_2^i, ..., xk is A_k^i
          then y_i = p_0^i + p_1^i x1 + ... + p_k^i xk

where A_j^i are the fuzzy sets of the system, p_j^i (j = 1, 2, ..., k) are the fuzzy parameters, and y_i is the output obtained from the fuzzy rule. The antecedent (if part) is fuzzy while the consequent (then part) is crisp: the inference expresses the output as a linear combination of the inputs.

Suppose the input is x = [x1, x2, ..., xk]. First compute the membership of each input variable xj according to the fuzzy rules:

    μ_A_j^i = exp( −(xj − c_j^i)² / b_j^i ),   j = 1, 2, ..., k;  i = 1, 2, ..., n

where c_j^i and b_j^i are the center and width of the membership function, k is the number of input parameters and n the number of fuzzy subsets. The memberships are then combined by fuzzy computation, using the product operator:

    ω^i = μ_A_1^i(x1) · μ_A_2^i(x2) · ... · μ_A_k^i(xk),   i = 1, 2, ..., n

and the model output y is computed from the results:

    y = Σ_{i=1}^{n} ω^i ( p_0^i + p_1^i x1 + ... + p_k^i xk ) / Σ_{i=1}^{n} ω^i

1.3 The T-S fuzzy neural network model. The T-S fuzzy neural network has four layers: input, fuzzification, fuzzy-rule computation and output. The input layer connects to the input vector x, with as many nodes as the input has dimensions; the fuzzification layer applies the membership functions to the input values ...
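A single forward pass of this inference scheme can be sketched in MATLAB as follows; the dimensions, centers, widths and consequent parameters are all randomly generated stand-ins, not values from the Jialing River study.

% One forward pass of T-S fuzzy inference with Gaussian memberships.
k = 4;  n = 6;                              % inputs and fuzzy rules
x  = rand(k, 1);                            % one input sample
c  = randn(k, n);  bw = rand(k, n) + 0.5;   % membership centers and widths
p  = randn(n, k + 1);                       % consequents [p0 p1 ... pk] per rule
mu = exp(-((repmat(x, 1, n) - c).^2) ./ bw);   % k-by-n membership degrees
w  = prod(mu, 1);                           % rule firing strengths (product operator)
yr = p * [1; x];                            % per-rule linear consequent outputs
y  = (w * yr) / sum(w);                     % weighted-average crisp output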

Wavelet Neural Networks and Their Applications

Wavelet Neural Networks and Their Applications
1014202032 Lu Yuying

Abstract: The wavelet neural network combines wavelet theory with neural network theory. It avoids the blind structure design and local-optimum problems of BP networks and other nonlinear optimization, greatly simplifies training, and offers strong function-learning and generalization ability with broad application prospects. This note first explains the wavelet transform and multiresolution analysis, then introduces the mathematical model of wavelet neural networks and surveys their applications.

1. Background and significance. Artificial neural networks are models built on the study of biological nervous systems. They process and store all kinds of image information in a massively parallel, distributed way, with strong fault tolerance, association and memory, so they are widely applied in fault diagnosis, pattern recognition, associative memory, complex optimization, image processing and computing. Yet several issues await further study and improvement: the physical interpretation of the models, the global activation functions the networks use, guarantees of convergence, and the empirical choice of node counts.

Since Morlet proposed it, wavelet theory has penetrated many fields thanks to the good localization of wavelet functions. The wavelet transform is a time-frequency localization method whose window size is fixed but whose shape, time window and frequency window can all change: it offers high frequency resolution and low time resolution in the low-frequency range, and high time resolution and low frequency resolution in the high-frequency range, earning it the name "mathematical microscope." Exactly this property makes the wavelet transform adaptive to signals. The wavelet transform based on multiresolution analysis has become an effective signal-processing tool because of its time-frequency localization. In practice the fast Mallat algorithm is often used to decompose a signal onto different scales with an orthogonal wavelet basis. The process is like repeatedly passing the signal through a pair of high-pass and low-pass filters to split it into frequency bands: the high-pass filter produces the signal's high-frequency detail component and the low-pass filter its low-frequency approximation component. Each decomposition halves the signal's sampling frequency, and the approximation component can be decomposed further by the same high-pass/low-pass filtering to obtain the two components at the next level.

The wavelet neural network (WNN) is an artificial neural network proposed in recent years on the basis of breakthroughs in wavelet analysis. It is a layered, multiresolution network model built from wavelet analysis theory and the wavelet transform: the usual nonlinear sigmoid functions are replaced by nonlinear wavelet bases, and the signal is represented by a linear superposition of the selected wavelet bases. Early work began around 1992, chiefly by Zhang Q., Harold H. S. and Jiao Licheng; in his representative work "Applications and Implementation of Neural Networks," Jiao discusses wavelet neural networks in some theoretical detail, and in recent years much further theoretical and applied work has followed.

Wavelet neural networks have the following features. First, the wavelet elements and the whole network structure rest on reliable theory, avoiding the blindness of BP-style structure design. Second, the linear distribution of the network weights and the convexity of the learning objective mean the training process fundamentally avoids local optima and other nonlinear optimization problems. Third, they have strong function-learning and generalization ability.

2. Mathematical model and wavelet tools
2.1 Wavelet transform and multiresolution analysis. In the function space L²(R) (or, more broadly, a Hilbert space), choose a mother wavelet function (also called a basic wavelet) ψ(x) satisfying the admissibility condition

    C_ψ = ∫ |Ψ(ω)|² / |ω| dω < +∞

where Ψ(ω) is the Fourier transform of ψ(x).

Image Recognition Based on Artificial Neural Networks

This paper first analyzes image recognition technology and the BP neural network algorithm, then sets out artificial-neural-network image recognition technology in detail.

[Keywords] artificial neural network; BP network; image recognition; recognition technology

Generally speaking, image processing and recognition means converting and transforming an actual image so as to recognize it. Images carry an enormous amount of information, so processing them involves dimension reduction, digitization, filtering and similar steps. People used to recognize images with projection methods, invariant moments and the like; with the rapid development of computer technology, neural network image recognition is gradually replacing the traditional methods and finding ever wider application.

1. Overview of neural-network image recognition. In recent years the theory surrounding artificial intelligence has grown richer, and image recognition based on artificial neural networks has been applied very widely. Combining image recognition with neural network technology brings marked advantages, for example:
(1) because the network can self-learn, the system can adapt to uncertainty in the image information and to continual change in the recognition environment;
(2) in general the network's information is stored in its connection structure and connection weights, so image information is represented in a uniform form, which makes building and managing the knowledge base simple;
(3) the network's parallel processing mechanism allows images to be handled quickly, so the real-time requirements of image recognition can be met;
(4) the network increases the fault tolerance of image-information processing: the recognition system still works normally and outputs fairly accurate information when the image suffers interference.

2. Image recognition technology. Broadly, image technology is the umbrella term for all techniques related to images. By research method and degree of abstraction it divides into three levels: image processing, image analysis and image understanding. The field intersects computer vision, pattern recognition and computer graphics, and borrows from biology, mathematics, physics, electronics and computer science; moreover, as computing develops, further study of image technology cannot do without neural networks, artificial intelligence and related theory.

2.2 Relations among image processing, recognition and understanding. Image processing includes image compression, coding, segmentation and so on; its purpose is to judge whether an image contains the needed information, filter out noise, and pin that information down. Common methods are grayscale conversion, binarization, sharpening and denoising. Image recognition matches the processed image and determines its category name; building on segmentation, it can select the features to extract, extract them, and finally recognize from the measurement results. Image understanding, on the basis of processing and recognition, performs structural and syntactic analysis according to the classification, describing and interpreting the image. Image understanding therefore comprises image processing, image recognition and structural analysis: its input is an image, its output a description and interpretation of the image.

3. Artificial neural network structure and algorithm. In the 1980s, McClelland and Rumelhart proposed an artificial neural network, and by now the BP neural network has developed into one of the most widely used. It is a multilayer feedforward network with an input layer, an output layer and hidden layers between them; Figure 1 (omitted in the source) shows a typical BP network structure. A BP network repeatedly iterates weight updates so that the actual input/output relation approaches the desired one, computing the error backward from the output layer toward the input layer and continually correcting the weights of each layer by gradient descent.

The BP network algorithm proceeds as follows:
(1) initialize the variables and parameters: weight matrices, learning rate, maximum number of learning iterations, thresholds, and so on;
(2) input the samples at the black nodes (of Figure 1);
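The preprocessing chain mentioned above (grayscale conversion, denoising, binarization) can be sketched as follows; it assumes the Image Processing Toolbox in a newer MATLAB release (older code used im2bw instead of imbinarize), and the input file name and target size are hypothetical.

% Typical preprocessing before feature extraction and recognition.
im = imread('digit.png');                     % hypothetical input image
if size(im, 3) == 3, im = rgb2gray(im); end   % grayscale conversion
im = medfilt2(im, [3 3]);                     % denoise with a median filter
bw = imbinarize(im);                          % binarize (Otsu threshold)
bw = imresize(bw, [28 28]);                   % normalize the size
x  = double(bw(:));                           % feature vector for the network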

An Analysis of Artificial Neural Networks

Artificial Neural Networks: An Analysis

Class:
Student ID:
Name:
Advisor:
Date:

Abstract: The artificial neural network, also simply called the neural network, is an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks and performs distributed, parallel information processing. Such a network relies on the complexity of the system and processes information by adjusting the interconnections among a large number of internal nodes. Research on artificial neural networks began once it was recognized that the brain computes in a way completely different from the traditional computer. Over more than half a century, neural networks have passed through five stages: germination, a first boom, a low tide of reflection, a second boom, and the current period of renewed understanding and applied research. In recent years, several outstanding advantages of artificial neural networks have again attracted great attention, because they provide a comparatively simple and effective way to solve problems of great complexity. Today, neural networks have become an emerging interdisciplinary frontier spanning computer science, artificial intelligence, brain and neural science, information science, intelligent control and other disciplines and fields.

Keywords:

Advanced Engineering Case Analysis Training with Workbench

Example 1: dynamic contact analysis of gears. This example works step by step through model import and contact setup, explains how the gear rotation is driven and why, and demonstrates a comparison of the computed stresses in the gear structure under no-load and loaded conditions.

(Figure 1: finite-element model of the helical-gear contact)
(Figure 2: equivalent-stress contours at one instant of the dynamic contact, no load)
(Figure 3: equivalent-stress contours at one instant of the dynamic contact, 200 N·m load)

Example 2: analysis of an interference-fit assembly. This example presents the stress-analysis method for interference-fit structures, compares and discusses results computed for different interference values, discusses a reasonable way of setting the interference, and examines how the friction coefficient and the rotational speed affect the assembly stresses.

(Figure 4: equivalent stress for an interference of 0.00005 m, speed = 0)
(Figure 5: contact stress for an interference of 0.00005 m, speed = 0)
(Figure 6: equivalent stress for an interference of 0.00005 m, speed = 4000)
(Figure 7: contact stress for an interference of 0.00005 m, speed = 4000)

Example 3: analysis of a hydraulic valve structure. This example covers applying pressure loads that vary in space and systematically analyzes how the contact settings influence the solution, showing how to choose contact parameters sensibly for a reasonably accurate result.

(Figure 8: distribution of the varying pressure load)
(Figure 9: contact-pressure contours; friction coefficient = 0.1, augmented Lagrangian algorithm)

Example 4: multibody dynamics analysis of an engine piston mechanism. This example explains how to apply driving forces to the bodies and constrain the motion relations among them, covering flexible-body multibody dynamics as well as rigid-flexible coupled multibody analysis.

(Figure 10: equivalent-stress contours at t = 0.12 s, flexible bodies)
(Figure 11: equivalent-stress contours at t = 1.17 s, flexible bodies)
(Figure 12: equivalent-stress contours at t = 0.12 s, rigid-flexible coupling)
(Figure 13: equivalent-stress contours at t = 1.17 s, rigid-flexible coupling)

Example 5: nonlinear buckling analysis of a thin-walled structure. This example explains how to complete a nonlinear buckling analysis of a thin-walled structure in the Workbench environment and obtain the nonlinear buckling load, studying how different initial imperfections and elastoplasticity affect that load.

On Neural Networks Based on Wavelet Analysis

Abstract: Neural networks based on wavelet analysis play an important role in everyday production, above all in fault detection, where they let us inspect tiny components inside machines more effectively. To a degree they avoid the heavy workload and limited accuracy of manual inspection, lower inspection costs, reduce the losses caused by damaged parts, and greatly assist industrial production.

Keywords: wavelet analysis, neural network, fault diagnosis

With scientific progress and the development of the times, neural networks are gradually entering our daily life and production. Since the concept of the artificial neural network was first proposed in 1943, neural networks have been combined with more and more other techniques: with the chaotic properties of neurons to form chaotic neural networks applied to combinatorial optimization; with rough set theory, applied to data classification; with fractal theory, applied to pattern recognition, image coding and image compression; and with wavelet analysis, applied to fault detection in mechanical equipment. Below is my view of neural networks based on wavelet analysis.

1. Overview. Wavelet analysis, i.e., the wavelet transform, was first proposed by Morlet in 1981 and has since developed into a discipline of its own; it resolves low-frequency signals well in the frequency domain and high-frequency signals well in the time domain. Neural networks, for their part, have a distinctive capacity for adaptive nonlinear information processing. Combined, the two play an important role in signal processing for high-voltage power grids, mechanical fault detection, and similar tasks.

2. The wavelet neural network algorithm. The general idea of the algorithm is this: at the core of the network, the hidden-layer neurons use a wavelet basis function (Morlet) as their activation to perform the nonlinear mapping; the signal travels only forward, and while the signal to be classified propagates forward, the error signal propagates backward. The transfer function of the output layer is the sigmoid function. The network topology is input layer, hidden layer, output layer (topology figure omitted in the source).

The correction formulas of the wavelet network are

    ω(k+1) = ω(k) − η ∂E/∂ω + mc Δω(k)        (1)
    a(k+1) = a(k) − η ∂E/∂a + mc Δa(k)        (2)
    b(k+1) = b(k) − η ∂E/∂b + mc Δb(k)        (3)

where η is the learning rate and mc the momentum constant. The error function is

    E = (1/2) Σ_{n=1}^{N} Σ_{m=1}^{M} ( y_m^n − yt_m^n )²        (4)

where y and yt are the network output and the target output, N the number of samples and M the number of output nodes.
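In code, one update of Eq. (1) for a weight matrix looks like the following MATLAB sketch. The gradient here is a random placeholder standing in for the value backpropagation would supply, and the matrix size, rate and momentum values are illustrative assumptions:

% One momentum update of the weights, following Eq. (1).
eta = 0.7;  mc = 0.9;               % learning rate and momentum constant
W  = randn(10, 5);                  % hidden-layer weight matrix (example size)
dW = zeros(size(W));                % previous step, Delta-omega(k)
gradE = randn(size(W));             % placeholder for dE/dW from backprop
dW = -eta * gradE + mc * dW;        % new step: gradient term plus momentum term
W  = W + dW;                        % omega(k+1) = omega(k) + step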

A BP Network for Catalyst Formulation Modeling: a MATLAB Example

This example comes from the chapter on BP network application and design in "Artificial Neural Networks: Theory, Design and Applications" (2nd edition), now simulated in MATLAB.

Introduction: it has been proved in theory that a three-layer feedforward neural network can approximate any continuous function to arbitrary accuracy. Here a BP neural network learns the experimental data of a fatty-alcohol catalyst formulation, and the trained network serves as the mathematical model mapping the complex nonlinear relation between the formulation and the optimization indices, achieving high accuracy. The network design method and the modeling results are as follows.

(1) Network structure design and training. First, experiments arranged with an orthogonal table yield a batch of accurate experimental data as the network's learning samples. The network structure is designed from the number of formulation factors and the number of optimization indices, and the network is then trained with the experimental data. After training, the input and output of the multilayer feedforward network are connected by a relation that maps the intrinsic link between formulation and indices, so it can serve as the mathematical model for simulation experiments. Figure 3.28 (omitted in the source) shows the three-layer feedforward network built from the experimental data of a five-factor, three-index formulation: the five-dimensional input vector corresponds to the formulation's constituent factors, and the three-dimensional output vector to the three indices to be optimized (fatty-acid methyl-ester conversion TR (%), fatty-alcohol yield Y_OH (%) and fatty-alcohol selectivity S_OH (%)). The number of hidden nodes was fixed at 4 by trial. The orthogonal table arranged 18 experiments, giving 18 training pairs. Training used the improved BP algorithm

    ΔW(t) = η δ X + α ΔW(t−1)

(2) Comparison of the BP model with the regression equation. Table 3.3 compares the simulation results of the BP network formulation model with those of the regression-equation model, where the regression equation is the optimal one determined by quadratic multiple stepwise regression under an F test at a given confidence level. The table shows that the multilayer feedforward network trained with the BP algorithm simulates with higher accuracy.
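A training setup matching the described 5-4-3 network and the momentum-corrected update rule could look like this in current MATLAB; the Neural Network Toolbox is assumed, and P and T are placeholders for the factor and index columns of Table 3.3, not the real data.

% Train a 5-input, 4-hidden-node, 3-output BP network with momentum.
P = rand(5, 18);  T = rand(3, 18);        % stand-ins for the Table 3.3 data
net = feedforwardnet(4, 'traingdm');      % 4 hidden nodes, momentum BP
net.divideFcn = 'dividetrain';            % use all 18 samples for training
net.trainParam.lr = 0.1;                  % eta in  dW(t) = eta*delta*X + alpha*dW(t-1)
net.trainParam.mc = 0.9;                  % alpha, the momentum constant
net = train(net, P, T);
Y = net(P);                               % simulated TR, Y_OH and S_OH values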

Table 3.3. (Note: subscript 1 denotes the measured result, subscript 2 the neural-network output, and subscript 3 the regression equation.)

No. | A/Cu  | Zn/Cu | B/Cu | C/Cu | Mn/Cu | TR1/% | TR2/% | TR3/% | YOH1/% | YOH2/% | YOH3/% | SOH1/% | SOH2/% | SOH3/%
1   | 0.05  | 0.13  | 0.08 | 0.14 | 0.04  | 94.5  | 94.62 | 83.83 | 96.3   | 96.56  | 95.98  | 97.8   | 97.24  | 102.83
2   | 0.065 | 0.07  | 0.12 | 0.16 | 0.02  | 88.05 | 88.05 | 92.43 | 75.5   | 75.97  | 76.5   | 86.5   | 86.68  | 79.65
3   | 0.08  | 0.19  | 0.08 | 0.06 | 0.06  | 0.25  | 60.43 | 82.03 | 40.21  | 41.43  | 44.87  | 96.25  | 95.36  | 81.92
4   | 0.095 | 0.11  | 0.06 | 0.16 | 0.04  | 93.05 | 93.11 | 94.31 | 97.31  | 96.29  | 105.43 | 99.3   | 99.39  | 103.08
5   | 0.11  | 0.05  | 0.02 | 0.06 | 0.02  | 94.65 | 94.72 | 85.79 | 88.55  | 88.06  | 77.89  | 95.2   | 97.49  | 87.12
6   | 0.125 | 0.17  | 0.0  | 0.14 | 0.0   | 96.05 | 95.96 | 97.08 | 95.5   | 96.69  | 105.43 | 99.5   | 99.52  | 104.71
7   | 0.14  | 0.09  | 0.16 | 0.04 | 0.04  | 61.00 | 61.13 | 65.39 | 59.72  | 58.9   | 54.76  | 67.35  | 69.1   | 73.52
8   | 0.155 | 0.03  | 0.12 | 0.14 | 0.02  | 70.40 | 70.39 | 80.44 | 37.5   | 41.83  | 46.36  | 52.25  | 51.38  | 71.45
9   | 0.17  | 0.15  | 0.1  | 0.04 | 0.0   | 83.3  | 83.32 | 70.22 | 82.85  | 80.46  | 59.5   | 99.2   | 96.53  | 74.3
10  | 0.05  | 0.07  | 0.06 | 0.12 | 0.05  | 84.5  | 85.27 | 70.22 | 90.9   | 90.46  | 91.51  | 95.9   | 97.87  | 92.75
11  | 0.065 | 0.19  | 0.04 | 0.02 | 0.03  | 69.5  | 69.45 | 80.77 | 61.8   | 65.03  | 55.22  | 88.2   | 92.41  | 98.44
12  | 0.08  | 0.13  | 0.0  | 0.12 | 0.01  | 94.55 | 95.6  | 94.75 | 97.6   | 95.74  | 92.44  | 99.6   | 97.93  | 101.65
13  | 0.095 | 0.05  | 0.16 | 0.02 | 0.05  | 70.95 | 69.51 | 92.88 | 62.54  | 60.4   | 52.5   | 60.1   | 62.63  | 68.12
14  | 0.11  | 0.17  | 0.14 | 0.1  | 0.03  | 87.2  | 87.16 | 78.64 | 91.0   | 89.19  | 76.92  | 99.8   | 99.36  | 92.22
15  | 0.125 | 0.11  | 0.1  | 0.0  | 0.01  | 64.2  | 64.08 | 69.59 | 58.3   | 59.12  | 54.02  | 58.9   | 60.22  | 72.5
16  | 0.14  | 0.03  | 0.08 | 0.1  | 0.05  | 86.15 | 86.15 | 82.4  | 75.65  | 61.43  | 29.93  | 86.5   | 78.07  | 79.28
17  | 0.155 | 0.15  | 0.04 | 0.0  | 0.03  | 77.15 | 77.17 | 75.23 | 71.9   | 71.72  | 83.94  | 91.8   | 91.74  | 94.23
18  | 0.17  | 0.09  | 0.02 | 0.08 | 0.01  | 96.05 | 96    | 87.05 | 94.60  | 94.62  | 94.61  | 98.00  | 99.12  | 90.35

The concrete operations are as follows:
