32.2 Exploiting K-Distance Signature for Boolean Matching and G-Symmetry Detection (Abstract)
Chinese Journal of Liquid Crystals and Displays, Vol. 38, No. 11, Nov. 2023

An Object Detection Algorithm Based on an Adaptive Focal CRIoU Loss
肖振久 (1), 赵昊泽 (2), 张莉莉 (2), 夏羽 (3), 郭杰龙 (4*), 俞辉 (4), 李成龙 (2), 王俐文 (2)
(1. Software College, Liaoning Technical University, Huludao, Liaoning 125000, China; 2. Aviation Ammunition Research Institute Co., Ltd. of China North Industries Group, Harbin, Heilongjiang 150000, China; 3. Shanghai Institute of Aerospace System Engineering, Shanghai 201100, China; 4. Quanzhou Institute of Equipment Manufacturing, Haixi Institutes, Chinese Academy of Sciences, Quanzhou, Fujian 362000, China)

Abstract: In object detection tasks, the quantity regressed by traditional bounding-box regression loss functions is uncorrelated with the evaluation metric IoU (Intersection over Union), and some regression attributes are handled unreasonably, leaving the regression incomplete; this lowers detection accuracy and convergence speed and can even obstruct the regression.
The regression task also suffers from sample imbalance: a large number of low-quality samples hampers loss convergence.
To improve detection accuracy and regression convergence speed, a new bounding-box regression loss function is proposed.
First, the design rationale is established and a paradigm for the IoU family of loss functions is formulated. Then, on top of the IoU loss, the ratio of the perimeter of the rectangle spanned by the two box centers to the perimeter of the minimal enclosing rectangle of the two boxes is introduced as a center-distance penalty term, and the improved IoU loss is also applied to Non-Maximum Suppression (NMS).
Next, the width-height errors of the two boxes, normalized by the squared width and height of the minimal enclosing box, are introduced as a width-height penalty term, yielding the CRIoU (Complete Relativity IoU) loss function.
Finally, an adaptive weighting factor is added on top of CRIoU to up-weight the regression loss of high-quality samples, defining the adaptive focal CRIoU (AF-CRIoU) loss.
Experimental results show that detection accuracy with the AF-CRIoU loss improves by up to 8.52% relative to traditional non-IoU losses and by up to 2.69% relative to the CIoU family of losses, and that the A-CRIoU-NMS (Around CRIoU NMS) method improves detection accuracy by 0.14% over the original NMS.
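To make the construction above concrete, the sketch below renders the described loss in Python/NumPy. The abstract does not give closed-form definitions, so the exact normalisations, the focusing exponent gamma, and the function name af_criou_loss are illustrative assumptions, not the authors' code.

    import numpy as np

    def af_criou_loss(box_p, box_g, gamma=0.5):
        # Boxes are (x1, y1, x2, y2); gamma is an assumed focusing exponent.
        ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
        ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
        area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
        iou = inter / (area_p + area_g - inter + 1e-9)
        # Centre-distance penalty: perimeter of the rectangle spanned by the
        # two centre points over the perimeter of the minimal enclosing
        # rectangle of both boxes, as described in the abstract.
        cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
        cgx, cgy = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
        peri_centres = 2 * (abs(cpx - cgx) + abs(cpy - cgy))
        ex1, ey1 = min(box_p[0], box_g[0]), min(box_p[1], box_g[1])
        ex2, ey2 = max(box_p[2], box_g[2]), max(box_p[3], box_g[3])
        W, H = ex2 - ex1, ey2 - ey1
        center_pen = peri_centres / (2 * (W + H) + 1e-9)
        # Width-height penalty: squared w/h errors normalised by the squared
        # w/h of the enclosing box (a plausible reconstruction).
        wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
        wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
        wh_pen = (wp - wg) ** 2 / (W ** 2 + 1e-9) + (hp - hg) ** 2 / (H ** 2 + 1e-9)
        criou = 1.0 - iou + center_pen + wh_pen
        # Adaptive focal weighting: up-weight high-quality (high-IoU) samples.
        return iou ** gamma * criou

The key design point is that, unlike a plain L1/L2 box loss, every term here is expressed relative to the enclosing box, so the loss stays correlated with IoU across scales.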
Abstract: This paper presents SIFT, a method that extracts keypoints from images and uses them for reliable image matching.
SIFT is a local-feature extraction algorithm that detects keypoints in scale space.
Each keypoint carries a position, a scale, and an orientation.
Compared with global features, SIFT descriptors are highly distinctive and information-rich, and are strongly invariant to rotation, scaling, viewpoint change, noise, and most other transformations or disturbances, making them excellent local feature descriptors.
This paper describes the keypoint detection stage of SIFT in detail and implements the detection process in code.
The implementation covers building the Gaussian pyramid, building the DoG pyramid, generating the scale space, extracting keypoints, and generating descriptors.
Finally, experiments validate the parameters that play a key role in detection, with satisfactory results.
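As a reference point for the implementation described above, here is a minimal Gaussian/DoG pyramid and scale-space extremum test in Python. It follows the standard SIFT construction (Lowe, 2004) in outline only; SciPy, the parameter values, and the helper names are assumptions, and details such as initial upsampling, contrast/edge rejection, and orientation assignment are omitted.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def dog_pyramid(image, num_octaves=4, scales_per_octave=3, sigma0=1.6):
        # Build s+3 Gaussian images per octave so that s usable DoG layers exist.
        k = 2 ** (1.0 / scales_per_octave)
        octaves, base = [], image.astype(np.float64)
        for _ in range(num_octaves):
            gaussians = [gaussian_filter(base, sigma0 * k ** i)
                         for i in range(scales_per_octave + 3)]
            dogs = [g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])]
            octaves.append(dogs)
            base = gaussians[scales_per_octave][::2, ::2]  # downsample by 2
        return octaves

    def is_extremum(dogs, s, y, x):
        # Keypoint candidate: max or min among the 26 neighbours in the
        # 3x3x3 scale-space cube around (s, y, x).
        cube = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in dogs[s - 1:s + 2]])
        v = cube[1, 1, 1]
        return v == cube.max() or v == cube.min()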
Keywords: keypoints; scale space; descriptors

ABSTRACT
This paper presents SIFT, a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. SIFT is an algorithm that detects local features as extrema in scale space. The method extracts the location, scale, and orientation of each keypoint. SIFT features are highly distinctive and show extraordinary robustness against most disturbances, such as scaling, rotation, occlusion, perspective and illumination changes, and constitute an excellent kind of local descriptor. The keypoint detection stage of the SIFT algorithm is described in detail in this paper, and the detection process is implemented. The process includes building the Gaussian and DoG pyramids, setting up the scale space, extracting keypoints, and generating descriptors. Finally, we run experiments testing the parameters that play an important role in keypoint detection and obtain satisfactory results.
Keywords: keypoints; scale space; descriptors
Full-Dimension MIMO (FD-MIMO) for Next Generation Cellular Technology

Abstract: This article considers a practical implementation of a massive MIMO system. Although the best performance can be achieved when a large number of active antennas are placed in the horizontal domain only, BS form-factor limitations often make such horizontal array placement infeasible. To cope with this limitation, this article introduces the full-dimension MIMO (FD-MIMO) cellular wireless communication system, where active antennas are placed in a 2D grid at the BSs.
For the analysis of FD-MIMO systems, a 3D spatial channel model is introduced, on which system-level simulations are conducted.
The simulation results show that the proposed FD-MIMO system with 32 antenna ports achieves 2-3.6 times the cell-average throughput gain and 1.5-5 times the cell-edge throughput gain compared to a 4G LTE system with two antenna ports at the BS.
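The 2D antenna grid is what distinguishes FD-MIMO from a purely horizontal array. A common way to model the response of such a uniform planar array, sketched below, is as the Kronecker product of vertical and horizontal steering vectors; this is a generic textbook model under far-field assumptions, not the article's 3GPP-style 3D spatial channel model, and all names and parameters are illustrative.

    import numpy as np

    def steering_vector_2d(n_h, n_v, d_h, d_v, azimuth, elevation, wavelength):
        # Phase progression along the horizontal (azimuth) and vertical
        # (elevation) axes of an n_h x n_v uniform planar array.
        k = 2 * np.pi / wavelength
        a_h = np.exp(1j * k * d_h * np.arange(n_h)
                     * np.sin(azimuth) * np.cos(elevation))
        a_v = np.exp(1j * k * d_v * np.arange(n_v) * np.sin(elevation))
        return np.kron(a_v, a_h)  # length n_h * n_v array response

The Kronecker structure is what lets a 2D grid steer beams in both azimuth and elevation, which a horizontal-only array cannot do.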
The Role of the Clustering Algorithm in Network Intrusion Detection
Author: ZHANG Pei-shuai. Source: Computer Knowledge and Technology (Academic Exchange), 2008, Issue 32.
Abstract: This paper discusses the role of clustering, an unsupervised anomaly detection technique, in network intrusion detection. By analyzing the strengths and weaknesses of the K-means algorithm and of iterative optimization, the two are combined into a new classification algorithm.
CLC number: TP393; Document code: A; Article ID: 1009-3044(2008)32-1194-03
The Role of the Clustering Algorithm in the Network Intrusion Detection Technology
ZHANG Pei-shuai (The PLA 71977 Unit, Rizhao 276800, China)
Abstract: The paper studies the role of the clustering algorithm in network intrusion detection technology, and creates a new algorithm by analyzing the pros and cons of the K-means algorithm and iterative optimization algorithms.
Key words: pattern recognition; unsupervised anomaly detection; clustering algorithm; K-means algorithm; iterative optimization algorithms

1 Introduction
As we know, in a multi-factor problem it is difficult to find a direct link between the result (the objective) and the individual factors (the indicators), and such problems are hard to solve by purely theoretical means.
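For reference, a minimal K-means (Lloyd-style iteration) in Python is sketched below; the anomaly-detection use and the combination with iterative optimization described in the abstract would be built on top of this. Function and variable names are illustrative, not the paper's.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        # Assign each point to its nearest centroid, then move each centroid
        # to the mean of its cluster; repeat until the centroids stop moving.
        # In the intrusion-detection setting, X holds feature vectors of
        # network connections and unusually small clusters are anomaly
        # candidates.
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return labels, centroids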
Draft: Deep Learning in Neural Networks: An Overview
Technical Report IDSIA-03-14 / arXiv:1404.7828 (v1.5) [cs.NE]
Jürgen Schmidhuber
The Swiss AI Lab IDSIA
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale
University of Lugano & SUPSI
Galleria 2, 6928 Manno-Lugano, Switzerland
15 May 2014

Abstract
In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit assignment paths, which are chains of possibly learnable, causal links between actions and effects. I review deep supervised learning (also recapitulating the history of backpropagation), unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

PDF of earlier draft (v1): http://www.idsia.ch/~juergen/DeepLearning30April2014.pdf
LATEX source: http://www.idsia.ch/~juergen/DeepLearning30April2014.tex
Complete BIBTEX file: http://www.idsia.ch/~juergen/bib.bib

Preface
This is the draft of an invited Deep Learning (DL) overview. One of its goals is to assign credit to those who contributed to the present state of the art. I acknowledge the limitations of attempting to achieve this goal. The DL research community itself may be viewed as a continually evolving, deep network of scientists who have influenced each other in complex ways. Starting from recent DL results, I tried to trace back the origins of relevant ideas through the past half century and beyond, sometimes using "local search" to follow citations of citations backwards in time. Since not all DL publications properly acknowledge earlier relevant work, additional global search strategies were employed, aided by consulting numerous neural network experts. As a result, the present draft mostly consists of references (about 800 entries so far). Nevertheless, through an expert selection bias I may have missed important work. A related bias was surely introduced by my special familiarity with the work of my own DL research group in the past quarter-century. For these reasons, the present draft should be viewed as merely a snapshot of an ongoing credit assignment process. To help improve it, please do not hesitate to send corrections and suggestions to juergen@idsia.ch.

Contents
1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs
3 Depth of Credit Assignment Paths (CAPs) and of Problems
4 Recurring Themes of Deep Learning
4.1 Dynamic Programming (DP) for DL
4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL
4.3 Occam's Razor: Compression and Minimum Description Length (MDL)
4.4 Learning Hierarchical Representations Through Deep SL, UL, RL
4.5 Fast Graphics Processing Units (GPUs) for DL in NNs
5 Supervised NNs, Some Helped by Unsupervised NNs
5.1 1940s and Earlier
5.2 Around 1960: More Neurobiological Inspiration for DL
5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)
5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)
5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
5.5.1 BP for Weight-Sharing Feedforward NNs (FNNs) and Recurrent NNs (RNNs)
5.6 Late 1980s-2000: Numerous Improvements of NNs
5.6.1 Ideas for Dealing with Long Time Lags and Deep CAPs
5.6.2 Better BP Through Advanced Gradient Descent
5.6.3 Discovering Low-Complexity, Problem-Solving NNs
5.6.4 Potential Benefits of UL for SL
5.7 1987: UL Through Autoencoder (AE) Hierarchies
5.8 1989: BP for Convolutional NNs (CNNs)
5.9 1991: Fundamental Deep Learning Problem of Gradient Descent
5.10 1991: UL-Based History Compression Through a Deep Hierarchy of RNNs
5.11 1992: Max-Pooling (MP): Towards MPCNNs
5.12 1994: Contest-Winning Not So Deep NNs
5.13 1995: Supervised Recurrent Very Deep Learner (LSTM RNN)
5.14 2003: More Contest-Winning/Record-Setting, Often Not So Deep NNs
5.15 2006/7: Deep Belief Networks (DBNs) & AE Stacks Fine-Tuned by BP
5.16 2006/7: Improved CNNs/GPU-CNNs/BP-Trained MPCNNs
5.17 2009: First Official Competitions Won by RNNs, and with MPCNNs
5.18 2010: Plain Backprop (+ Distortions) on GPU Yields Excellent Results
5.19 2011: MPCNNs on GPU Achieve Superhuman Vision Performance
5.20 2011: Hessian-Free Optimization for RNNs
5.21 2012: First Contests Won on ImageNet & Object Detection & Segmentation
5.22 2013-: More Contests and Benchmark Records
5.22.1 Currently Successful Supervised Techniques: LSTM RNNs / GPU-MPCNNs
5.23 Recent Tricks for Improving SL Deep NNs (Compare Sec. 5.6.2, 5.6.3)
5.24 Consequences for Neuroscience
5.25 DL with Spiking Neurons?
6 DL in FNNs and RNNs for Reinforcement Learning (RL)
6.1 RL Through NN World Models Yields RNNs With Deep CAPs
6.2 Deep FNNs for Traditional RL and Markov Decision Processes (MDPs)
6.3 Deep RL RNNs for Partially Observable MDPs (POMDPs)
6.4 RL Facilitated by Deep UL in FNNs and RNNs
6.5 Deep Hierarchical RL (HRL) and Subgoal Learning with FNNs and RNNs
6.6 Deep RL by Direct NN Search / Policy Gradients / Evolution
6.7 Deep RL by Indirect Policy Search / Compressed NN Search
6.8 Universal RL
7 Conclusion

1 Introduction to Deep Learning (DL) in Neural Networks (NNs)
Which modifiable components of a learning system are responsible for its success or failure? What changes to them improve performance? This has been called the fundamental credit assignment problem (Minsky, 1963). There are general credit assignment methods for universal problem solvers that are time-optimal in various theoretical senses (Sec. 6.8). The present survey, however, will focus on the narrower, but now commercially important, subfield of Deep Learning (DL) in Artificial Neural Networks (NNs). We are interested in accurate credit assignment across possibly many, often nonlinear, computational stages of NNs. Shallow NN-like models have been around for many decades if not centuries (Sec. 5.1). Models with several successive nonlinear layers of neurons date back at least to the 1960s (Sec. 5.3) and 1970s (Sec. 5.5).
An efficient gradient descent method for teacher-based Supervised Learning (SL) in discrete, differentiable networks of arbitrary depth called backpropagation (BP) was developed in the 1960s and 1970s, and applied to NNs in 1981 (Sec. 5.5). BP-based training of deep NNs with many layers, however, had been found to be difficult in practice by the late 1980s (Sec. 5.6), and had become an explicit research subject by the early 1990s (Sec. 5.9). DL became practically feasible to some extent through the help of Unsupervised Learning (UL) (e.g., Sec. 5.10, 5.15). The 1990s and 2000s also saw many improvements of purely supervised DL (Sec. 5). In the new millennium, deep NNs have finally attracted wide-spread attention, mainly by outperforming alternative machine learning methods such as kernel machines (Vapnik, 1995; Schölkopf et al., 1998) in numerous important applications. In fact, supervised deep NNs have won numerous official international pattern recognition competitions (e.g., Sec. 5.17, 5.19, 5.21, 5.22), achieving the first superhuman visual pattern recognition results in limited domains (Sec. 5.19). Deep NNs also have become relevant for the more general field of Reinforcement Learning (RL) where there is no supervising teacher (Sec. 6).

Both feedforward (acyclic) NNs (FNNs) and recurrent (cyclic) NNs (RNNs) have won contests (Sec. 5.12, 5.14, 5.17, 5.19, 5.21, 5.22). In a sense, RNNs are the deepest of all NNs (Sec. 3); they are general computers more powerful than FNNs, and can in principle create and process memories of arbitrary sequences of input patterns (e.g., Siegelmann and Sontag, 1991; Schmidhuber, 1990a). Unlike traditional methods for automatic sequential program synthesis (e.g., Waldinger and Lee, 1969; Balzer, 1985; Soloway, 1986; Deville and Lau, 1994), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years.

The rest of this paper is structured as follows. Sec. 2 introduces a compact, event-oriented notation that is simple yet general enough to accommodate both FNNs and RNNs. Sec. 3 introduces the concept of Credit Assignment Paths (CAPs) to measure whether learning in a given NN application is of the deep or shallow type. Sec. 4 lists recurring themes of DL in SL, UL, and RL. Sec. 5 focuses on SL and UL, and on how UL can facilitate SL, although pure SL has become dominant in recent competitions (Sec. 5.17-5.22).
Sec. 5 is arranged in a historical timeline format with subsections on important inspirations and technical contributions. Sec. 6 on deep RL discusses traditional Dynamic Programming (DP)-based RL combined with gradient-based search techniques for SL or UL in deep NNs, as well as general methods for direct and indirect search in the weight space of deep FNNs and RNNs, including successful policy gradient and evolutionary methods.

2 Event-Oriented Notation for Activation Spreading in FNNs/RNNs
Throughout this paper, let i, j, k, t, p, q, r denote positive integer variables assuming ranges implicit in the given contexts. Let n, m, T denote positive integer constants.

An NN's topology may change over time (e.g., Fahlman, 1991; Ring, 1991; Weng et al., 1992; Fritzke, 1994). At any given moment, it can be described as a finite subset of units (or nodes or neurons) N = {u_1, u_2, ...} and a finite set H ⊆ N × N of directed edges or connections between nodes. FNNs are acyclic graphs, RNNs cyclic. The first (input) layer is the set of input units, a subset of N. In FNNs, the k-th layer (k > 1) is the set of all nodes u ∈ N such that there is an edge path of length k−1 (but no longer path) between some input unit and u. There may be shortcut connections between distant layers. The NN's behavior or program is determined by a set of real-valued, possibly modifiable, parameters or weights w_i (i = 1, ..., n). We now focus on a single finite episode or epoch of information processing and activation spreading, without learning through weight changes. The following slightly unconventional notation is designed to compactly describe what is happening during the runtime of the system.

During an episode, there is a partially causal sequence x_t (t = 1, ..., T) of real values that I call events. Each x_t is either an input set by the environment, or the activation of a unit that may directly depend on other x_k (k < t) through a current NN topology-dependent set in_t of indices k representing incoming causal connections or links. Let the function v encode topology information and map such event index pairs (k, t) to weight indices. For example, in the non-input case we may have x_t = f_t(net_t) with real-valued net_t = Σ_{k ∈ in_t} x_k w_{v(k,t)} (additive case) or net_t = Π_{k ∈ in_t} x_k w_{v(k,t)} (multiplicative case), where f_t is a typically nonlinear real-valued activation function such as tanh. In many recent competition-winning NNs (Sec. 5.19, 5.21, 5.22) there also are events of the type x_t = max_{k ∈ in_t}(x_k); some network types may also use complex polynomial activation functions (Sec. 5.3). x_t may directly affect certain x_k (k > t) through outgoing connections or links represented through a current set out_t of indices k with t ∈ in_k. Some non-input events are called output events.

Note that many of the x_t may refer to different, time-varying activations of the same unit in sequence-processing RNNs (e.g., Williams, 1989, "unfolding in time"), or also in FNNs sequentially exposed to time-varying input patterns of a large training set encoded as input events. During an episode, the same weight may get reused over and over again in topology-dependent ways, e.g., in RNNs, or in convolutional NNs (Sec. 5.4, 5.8). I call this weight sharing across space and/or time. Weight sharing may greatly reduce the NN's descriptive complexity, which is the number of bits of information required to describe the NN (Sec. 4.3).

In Supervised Learning (SL), certain NN output events x_t may be associated with teacher-given, real-valued labels or targets d_t yielding errors e_t, e.g., e_t = 1/2 (x_t − d_t)^2. A typical goal of supervised NN training is to find weights that yield episodes with small total error E, the sum of all such e_t.
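A toy rendering of this notation in Python may help; the additive case is shown, and the mapping of in_sets and weights to in_t and w_v(k,t) is my illustration, not the paper's code.

    import numpy as np

    def run_episode(T, inputs, in_sets, weights, f=np.tanh):
        # inputs maps input-event indices t to environment-set values;
        # in_sets[t] lists the indices k < t feeding event t;
        # weights[(k, t)] plays the role of w_v(k, t).
        x = np.zeros(T + 1)  # events are 1-indexed as in the paper
        for t in range(1, T + 1):
            if t in inputs:          # input event set by the environment
                x[t] = inputs[t]
            else:                    # additive case: net_t = sum_k x_k w_v(k,t)
                net = sum(x[k] * weights[(k, t)] for k in in_sets[t])
                x[t] = f(net)
        return x

    # Tiny FNN unrolled into events: x1, x2 inputs; x3 hidden; x4 output.
    x = run_episode(
        T=4, inputs={1: 0.5, 2: -1.0},
        in_sets={3: [1, 2], 4: [3]},
        weights={(1, 3): 0.7, (2, 3): -0.2, (3, 4): 1.5})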
The hope is that the NN will generalize well in later episodes, causing only small errors on previously unseen sequences of input events. Many alternative error functions for SL and UL are possible.

SL assumes that input events are independent of earlier output events (which may affect the environment through actions causing subsequent perceptions). This assumption does not hold in the broader fields of Sequential Decision Making and Reinforcement Learning (RL) (Kaelbling et al., 1996; Sutton and Barto, 1998; Hutter, 2005) (Sec. 6). In RL, some of the input events may encode real-valued reward signals given by the environment, and a typical goal is to find weights that yield episodes with a high sum of reward signals, through sequences of appropriate output actions.

Sec. 5.5 will use the notation above to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. (FNNs may be viewed as RNNs with certain fixed zero weights.) Sec. 6 will address the more general RL case.

3 Depth of Credit Assignment Paths (CAPs) and of Problems
To measure whether credit assignment in a given NN application is of the deep or shallow type, I introduce the concept of Credit Assignment Paths or CAPs, which are chains of possibly causal links between events.

Let us first focus on SL. Consider two events x_p and x_q (1 ≤ p < q ≤ T). Depending on the application, they may have a Potential Direct Causal Connection (PDCC) expressed by the Boolean predicate pdcc(p, q), which is true if and only if p ∈ in_q. Then the 2-element list (p, q) is defined to be a CAP from p to q (a minimal one). A learning algorithm may be allowed to change w_{v(p,q)} to improve performance in future episodes.

More general, possibly indirect, Potential Causal Connections (PCC) are expressed by the recursively defined Boolean predicate pcc(p, q), which in the SL case is true only if pdcc(p, q), or if pcc(p, k) for some k and pdcc(k, q). In the latter case, appending q to any CAP from p to k yields a CAP from p to q (this is a recursive definition, too). The set of such CAPs may be large but is finite. Note that the same weight may affect many different PDCCs between successive events listed by a given CAP, e.g., in the case of RNNs, or weight-sharing FNNs.

Suppose a CAP has the form (..., k, t, ..., q), where k and t (possibly t = q) are the first successive elements with modifiable w_{v(k,t)}. Then the length of the suffix list (t, ..., q) is called the CAP's depth (which is 0 if there are no modifiable links at all). This depth limits how far backwards credit assignment can move down the causal chain to find a modifiable weight.[1]

Suppose an episode and its event sequence x_1, ..., x_T satisfy a computable criterion used to decide whether a given problem has been solved (e.g., total error E below some threshold). Then the set of used weights is called a solution to the problem, and the depth of the deepest CAP within the sequence is called the solution's depth. There may be other solutions (yielding different event sequences) with different depths. Given some fixed NN topology, the smallest depth of any solution is called the problem's depth.

Sometimes we also speak of the depth of an architecture: SL FNNs with fixed topology imply a problem-independent maximal problem depth bounded by the number of non-input layers. Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable.
In general, however, RNNs may learn to solve problems of potentially unlimited depth.

Note that the definitions above are solely based on the depths of causal chains, and agnostic of the temporal distance between events. For example, shallow FNNs perceiving large "time windows" of input events may correctly classify long input sequences through appropriate output events, and thus solve shallow problems involving long time lags between relevant events.

At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions with DL experts have not yet yielded a conclusive response to this question. Instead of committing myself to a precise answer, let me just define for the purposes of this overview: problems of depth > 10 require Very Deep Learning.

The difficulty of a problem may have little to do with its depth. Some NNs can quickly learn to solve certain deep problems, e.g., through random weight guessing (Sec. 5.9) or other types of direct search (Sec. 6.6) or indirect search (Sec. 6.7) in weight space, or through training an NN first on shallow problems whose solutions may then generalize to deep problems, or through collapsing sequences of (non)linear operations into a single (non)linear operation; but see an analysis of non-trivial aspects of deep linear networks (Baldi and Hornik, 1994, Section B). In general, however, finding an NN that precisely models a given training set is an NP-complete problem (Judd, 1990; Blum and Rivest, 1992), also in the case of deep NNs (Síma, 1994; de Souto et al., 1999; Windisch, 2005); compare a survey of negative results (Síma, 2002, Section 1).

Above we have focused on SL. In the more general case of RL in unknown environments, pcc(p, q) is also true if x_p is an output event and x_q any later input event; any action may affect the environment and thus any later perception. (In the real world, the environment may even influence non-input events computed on a physical hardware entangled with the entire universe, but this is ignored here.) It is possible to model and replace such unmodifiable environmental PCCs through a part of the NN that has already learned to predict (through some of its units) input events (including reward signals) from former input events and actions (Sec. 6.1). Its weights are frozen, but can help to assign credit to other, still modifiable weights used to compute actions (Sec. 6.1). This approach may lead to very deep CAPs though.

[1] An alternative would be to count only modifiable links when measuring depth. In many typical NN applications this would not make a difference, but in some it would, e.g., Sec. 6.1.
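For the special case in which every link is modifiable, the depth of the deepest CAP reduces to the longest chain of PDCCs, which a few lines of Python can compute on a toy event graph; this is my illustration of the definition, not code from the paper.

    from functools import lru_cache

    def deepest_cap(in_sets, T):
        # in_sets[q] lists the predecessors p with pdcc(p, q); with all links
        # modifiable, a CAP's depth equals the number of links in the chain.
        @lru_cache(maxsize=None)
        def depth(q):
            preds = in_sets.get(q, ())
            return 0 if not preds else 1 + max(depth(p) for p in preds)
        return max(depth(t) for t in range(1, T + 1))

    # Same toy network as above: 1,2 -> 3 -> 4 gives a deepest CAP of depth 2.
    print(deepest_cap({3: (1, 2), 4: (3,)}, T=4))  # -> 2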
Some DL research is about automatically rephrasing problems such that their depth is reduced (Sec. 4). In particular, sometimes UL is used to make SL problems less deep, e.g., Sec. 5.10. Often Dynamic Programming (Sec. 4.1) is used to facilitate certain traditional RL problems, e.g., Sec. 6.2. Sec. 5 focuses on CAPs for SL, Sec. 6 on the more complex case of RL.

4 Recurring Themes of Deep Learning

4.1 Dynamic Programming (DP) for DL
One recurring theme of DL is Dynamic Programming (DP) (Bellman, 1957), which can help to facilitate credit assignment under certain assumptions. For example, in SL NNs, backpropagation itself can be viewed as a DP-derived method (Sec. 5.5). In traditional RL based on strong Markovian assumptions, DP-derived methods can help to greatly reduce problem depth (Sec. 6.2). DP algorithms are also essential for systems that combine concepts of NNs and graphical models, such as Hidden Markov Models (HMMs) (Stratonovich, 1960; Baum and Petrie, 1966) and Expectation Maximization (EM) (Dempster et al., 1977), e.g., (Bottou, 1991; Bengio, 1991; Bourlard and Morgan, 1994; Baldi and Chauvin, 1996; Jordan and Sejnowski, 2001; Bishop, 2006; Poon and Domingos, 2011; Dahl et al., 2012; Hinton et al., 2012a).

4.2 Unsupervised Learning (UL) Facilitating Supervised Learning (SL) and RL
Another recurring theme is how UL can facilitate both SL (Sec. 5) and RL (Sec. 6). UL (Sec. 5.6.4) is normally used to encode raw incoming data such as video or speech streams in a form that is more convenient for subsequent goal-directed learning. In particular, codes that describe the original data in a less redundant or more compact way can be fed into SL (Sec. 5.10, 5.15) or RL machines (Sec. 6.4), whose search spaces may thus become smaller (and whose CAPs shallower) than those necessary for dealing with the raw data. UL is closely connected to the topics of regularization and compression (Sec. 4.3, 5.6.3).
4.3 Occam's Razor: Compression and Minimum Description Length (MDL)
Occam's razor favors simple solutions over complex ones. Given some programming language, the principle of Minimum Description Length (MDL) can be used to measure the complexity of a solution candidate by the length of the shortest program that computes it (e.g., Solomonoff, 1964; Kolmogorov, 1965b; Chaitin, 1966; Wallace and Boulton, 1968; Levin, 1973a; Rissanen, 1986; Blumer et al., 1987; Li and Vitányi, 1997; Grünwald et al., 2005). Some methods explicitly take into account program runtime (Allender, 1992; Watanabe, 1992; Schmidhuber, 2002, 1995); many consider only programs with constant runtime, written in non-universal programming languages (e.g., Rissanen, 1986; Hinton and van Camp, 1993). In the NN case, the MDL principle suggests that low NN weight complexity corresponds to high NN probability in the Bayesian view (e.g., MacKay, 1992; Buntine and Weigend, 1991; De Freitas, 2003), and to high generalization performance (e.g., Baum and Haussler, 1989), without overfitting the training data. Many methods have been proposed for regularizing NNs, that is, searching for solution-computing, low-complexity SL NNs (Sec. 5.6.3) and RL NNs (Sec. 6.7). This is closely related to certain UL methods (Sec. 4.2, 5.6.4).

4.4 Learning Hierarchical Representations Through Deep SL, UL, RL
Many methods of Good Old-Fashioned Artificial Intelligence (GOFAI) (Nilsson, 1980) as well as more recent approaches to AI (Russell et al., 1995) and Machine Learning (Mitchell, 1997) learn hierarchies of more and more abstract data representations. For example, certain methods of syntactic pattern recognition (Fu, 1977) such as grammar induction discover hierarchies of formal rules to model observations. The partially (un)supervised Automated Mathematician / EURISKO (Lenat, 1983; Lenat and Brown, 1984) continually learns concepts by combining previously learnt concepts. Such hierarchical representation learning (Ring, 1994; Bengio et al., 2013; Deng and Yu, 2014) is also a recurring theme of DL NNs for SL (Sec. 5), UL-aided SL (Sec. 5.7, 5.10, 5.15), and hierarchical RL (Sec. 6.5). Often, abstract hierarchical representations are natural by-products of data compression (Sec. 4.3), e.g., Sec. 5.10.

4.5 Fast Graphics Processing Units (GPUs) for DL in NNs
While the previous millennium saw several attempts at creating fast NN-specific hardware (e.g., Jackel et al., 1990; Faggin, 1992; Ramacher et al., 1993; Widrow et al., 1994; Heemskerk, 1995; Korkin et al., 1997; Urlbe, 1999), and at exploiting standard hardware (e.g., Anguita et al., 1994; Muller et al., 1995; Anguita and Gomes, 1996), the new millennium brought a DL breakthrough in form of cheap, multi-processor graphics cards or GPUs. GPUs are widely used for video games, a huge and competitive market that has driven down hardware prices. GPUs excel at fast matrix and vector multiplications required not only for convincing virtual realities but also for NN training, where they can speed up learning by a factor of 50 and more. Some of the GPU-based FNN implementations (Sec. 5.16-5.19) have greatly contributed to recent successes in contests for pattern recognition (Sec. 5.19-5.22), image segmentation (Sec. 5.21), and object detection (Sec. 5.21-5.22).

5 Supervised NNs, Some Helped by Unsupervised NNs
The main focus of current practical applications is on Supervised Learning (SL), which has dominated recent pattern recognition contests (Sec. 5.17-5.22). Several methods, however, use additional Unsupervised Learning (UL) to facilitate SL (Sec. 5.7, 5.10, 5.15). It does make sense to treat SL and UL in the same section: often gradient-based methods, such as
BP (Sec. 5.5.1), are used to optimize objective functions of both UL and SL, and the boundary between SL and UL may blur, for example, when it comes to time series prediction and sequence classification, e.g., Sec. 5.10, 5.12.

A historical timeline format will help to arrange subsections on important inspirations and technical contributions (although such a subsection may span a time interval of many years). Sec. 5.1 briefly mentions early, shallow NN models since the 1940s, Sec. 5.2 additional early neurobiological inspiration relevant for modern Deep Learning (DL). Sec. 5.3 is about GMDH networks (since 1965), perhaps the first (feedforward) DL systems. Sec. 5.4 is about the relatively deep Neocognitron NN (1979) which is similar to certain modern deep FNN architectures, as it combines convolutional NNs (CNNs), weight pattern replication, and winner-take-all (WTA) mechanisms. Sec. 5.5 uses the notation of Sec. 2 to compactly describe a central algorithm of DL, namely, backpropagation (BP) for supervised weight-sharing FNNs and RNNs. It also summarizes the history of BP 1960-1981 and beyond. Sec. 5.6 describes problems encountered in the late 1980s with BP for deep NNs, and mentions several ideas from the previous millennium to overcome them. Sec. 5.7 discusses a first hierarchical stack of coupled UL-based Autoencoders (AEs); this concept resurfaced in the new millennium (Sec. 5.15). Sec. 5.8 is about applying BP to CNNs, which is important for today's DL applications. Sec. 5.9 explains BP's Fundamental DL Problem (of vanishing/exploding gradients) discovered in 1991. Sec. 5.10 explains how a deep RNN stack of 1991 (the History Compressor) pre-trained by UL helped to solve previously unlearnable DL benchmarks requiring Credit Assignment Paths (CAPs, Sec. 3) of depth 1000 and more. Sec. 5.11 discusses a particular WTA method called Max-Pooling (MP) important in today's DL FNNs. Sec. 5.12 mentions a first important contest won by SL NNs in 1994. Sec. 5.13 describes a purely supervised DL RNN (Long Short-Term Memory, LSTM) for problems of depth 1000 and more. Sec. 5.14 mentions an early contest of 2003 won by an ensemble of shallow NNs, as well as good pattern recognition results with CNNs and LSTM RNNs (2003). Sec. 5.15 is mostly about Deep Belief Networks (DBNs, 2006) and related stacks of Autoencoders (AEs, Sec. 5.7) pre-trained by UL to facilitate BP-based SL. Sec. 5.16 mentions the first BP-trained MPCNNs (2007) and GPU-CNNs (2006). Sec. 5.17-5.22 focus on official competitions with secret test sets won by (mostly purely supervised) DL NNs since 2009, in sequence recognition, image classification, image segmentation, and object detection.
Many RNN results depended on LSTM (Sec. 5.13); many FNN results depended on GPU-based FNN code developed since 2004 (Sec. 5.16, 5.17, 5.18, 5.19), in particular, GPU-MPCNNs (Sec. 5.19).

5.1 1940s and Earlier
NN research started in the 1940s (e.g., McCulloch and Pitts, 1943; Hebb, 1949); compare also later work on learning NNs (Rosenblatt, 1958, 1962; Widrow and Hoff, 1962; Grossberg, 1969; Kohonen, 1972; von der Malsburg, 1973; Narendra and Thathatchar, 1974; Willshaw and von der Malsburg, 1976; Palm, 1980; Hopfield, 1982). In a sense NNs have been around even longer, since early supervised NNs were essentially variants of linear regression methods going back at least to the early 1800s (e.g., Legendre, 1805; Gauss, 1809, 1821). Early NNs had a maximal CAP depth of 1 (Sec. 3).

5.2 Around 1960: More Neurobiological Inspiration for DL
Simple cells and complex cells were found in the cat's visual cortex (e.g., Hubel and Wiesel, 1962; Wiesel and Hubel, 1959). These cells fire in response to certain properties of visual sensory inputs, such as the orientation of edges. Complex cells exhibit more spatial invariance than simple cells. This inspired later deep NN architectures (Sec. 5.4) used in certain modern award-winning Deep Learners (Sec. 5.19-5.22).

5.3 1965: Deep Networks Based on the Group Method of Data Handling (GMDH)
Networks trained by the Group Method of Data Handling (GMDH) (Ivakhnenko and Lapa, 1965; Ivakhnenko et al., 1967; Ivakhnenko, 1968, 1971) were perhaps the first DL systems of the Feedforward Multilayer Perceptron type. The units of GMDH nets may have polynomial activation functions implementing Kolmogorov-Gabor polynomials (more general than traditional NN activation functions). Given a training set, layers are incrementally grown and trained by regression analysis, then pruned with the help of a separate validation set (using today's terminology), where Decision Regularisation is used to weed out superfluous units. The numbers of layers and units per layer can be learned in problem-dependent fashion. This is a good example of hierarchical representation learning (Sec. 4.4).
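A single GMDH unit is easy to sketch: fit a quadratic Kolmogorov-Gabor polynomial of two inputs by least squares. The snippet below shows only this building block, with illustrative names; the layer growing and validation-based pruning described above are omitted.

    import numpy as np

    def gmdh_unit(x1, x2, y):
        # Regression analysis for one unit: y ~ a0 + a1*x1 + a2*x2
        #                                       + a3*x1*x2 + a4*x1^2 + a5*x2^2
        A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef

In the full method, many such units are fitted per layer on pairs of inputs, the best ones (on a validation set) are kept, and their outputs feed the next layer.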
There have been numerous applications of GMDH-style networks, e.g., (Ikeda et al., 1976; Farlow, 1984; Madala and Ivakhnenko, 1994; Ivakhnenko, 1995; Kondo, 1998; Kordík et al., 2003; Witczak et al., 2006; Kondo and Ueno, 2008).

5.4 1979: Convolution + Weight Replication + Winner-Take-All (WTA)
Apart from deep GMDH networks (Sec. 5.3), the Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial NN that deserved the attribute deep, and the first to incorporate the neurophysiological insights of Sec. 5.2. It introduced convolutional NNs (today often called CNNs or convnets), where the (typically rectangular) receptive field of a convolutional unit with given weight vector is shifted step by step across a 2-dimensional array of input values, such as the pixels of an image. The resulting 2D array of subsequent activation events of this unit can then provide inputs to higher-level units, and so on. Due to massive weight replication (Sec. 2), relatively few parameters may be necessary to describe the behavior of such a convolutional layer.

Competition layers have WTA subsets whose maximally active units are the only ones to adopt non-zero activation values. They essentially "down-sample" the competition layer's input. This helps to create units whose responses are insensitive to small image shifts (compare Sec. 5.2).

The Neocognitron is very similar to the architecture of modern, contest-winning, purely supervised, feedforward, gradient-based Deep Learners with alternating convolutional and competition layers (e.g., Sec. 5.19-5.22). Fukushima, however, did not set the weights by supervised backpropagation (Sec. 5.5, 5.8), but by local unsupervised learning rules (e.g., Fukushima, 2013b), or by pre-wiring. In that sense he did not care for the DL problem (Sec. 5.9), although his architecture was comparatively deep indeed. He also used Spatial Averaging (Fukushima, 1980, 2011) instead of Max-Pooling (MP, Sec. 5.11), currently a particularly convenient and popular WTA mechanism. Today's CNN-based DL machines profit a lot from later CNN work (e.g., LeCun et al., 1989; Ranzato et al., 2007) (Sec. 5.8, 5.16, 5.19).

5.5 1960-1981 and Beyond: Development of Backpropagation (BP) for NNs
The minimisation of errors through gradient descent (Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969; Griewank, 2012), initially within the framework of Euler-Lagrange equations in the Calculus of Variations (e.g., Euler, 1744). Steepest descent in such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the ancient chain rule (Leibniz, 1676; L'Hôpital, 1696) in Dynamic Programming (DP) style (Bellman, 1957). A simplified derivation of the method uses the chain rule only (Dreyfus, 1962).

The methods of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, explicitly addressing neither direct links across several layers nor potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
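The iterated chain rule of this section is compact in code. The sketch below backpropagates the error of a two-layer network, Jacobian by Jacobian, reusing each layer's partial result in DP style; it is a minimal modern rendering of the idea, not any of the historical formulations cited above.

    import numpy as np

    def forward_backward(x, target, W1, W2):
        # Forward pass
        h = np.tanh(W1 @ x)          # hidden activations
        y = W2 @ h                   # linear output
        # Backward pass for E = 1/2 ||y - target||^2
        e = y - target               # dE/dy
        dW2 = np.outer(e, h)         # dE/dW2 = dE/dy * dy/dW2
        dh = W2.T @ e                # propagate through the linear layer
        dW1 = np.outer(dh * (1 - h**2), x)  # through tanh: f' = 1 - tanh^2
        return dW1, dW2              # gradients for a steepest-descent step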
Artificial Intelligence: A Java Implementation of the K-Nearest-Neighbor Classifier (CSDN blog by 天天向上)

    // Flower and DistanceAndType (ordered by distance) are defined elsewhere
    // in the original post.
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.PriorityQueue;
    import java.util.Queue;
    import java.util.Scanner;

    // Read num samples from a file
    public static void Input(String path, int num, Flower[] arr) throws FileNotFoundException {
        File file = new File(path);
        Scanner scanner = new Scanner(file);
        for (int i = 0; i < num; i++) {
            Double calyxLength, calyxWidth, petalLength, petalWidth;
            Integer type;
            calyxLength = scanner.nextDouble();
            calyxWidth = scanner.nextDouble();
            petalLength = scanner.nextDouble();
            petalWidth = scanner.nextDouble();
            type = scanner.nextInt();
            arr[i] = new Flower(calyxLength, calyxWidth, petalLength, petalWidth, type);
        }
    }

    // Search for the k with the best recognition rate
    private static int calculateK(Flower[] dataTesting, int numberTesting,
                                  Flower[] dataTraining, int numberTraining) {
        int bestK = 1;          // best k so far
        double bestValue = 0;   // best recognition rate so far
        for (int k = 1; k < numberTraining; k++) {
            double tem = KNN(dataTesting, numberTesting, dataTraining, numberTraining, k);
            System.out.println("k=" + k + " recognition rate=" + tem);
            if (tem > bestValue - 0.000001) {
                bestValue = tem;
                bestK = k;
            }
        }
        return bestK;
    }

    // KNN: classify each test sample by majority vote of its k nearest training samples
    private static Double KNN(Flower[] dataTesting, int numberTesting,
                              Flower[] dataTraining, int numberTraining, int k) {
        double cnt = 0;
        for (int i = 0; i < numberTesting; i++) {
            int[] count = new int[5];              // votes per class (classes 1..3)
            for (int j = 1; j < 4; j++) count[j] = 0;
            // Priority queue sorted by distance: pops the nearest neighbours first
            Queue<DistanceAndType> priorityQueue = new PriorityQueue<DistanceAndType>();
            for (int j = 0; j < numberTraining; j++) {
                priorityQueue.add(new DistanceAndType(
                        dataTesting[i].distance(dataTraining[j]),
                        dataTraining[j].getType()));
            }
            for (int j = 0; j < k; j++) {
                DistanceAndType dat = priorityQueue.poll();
                count[dat.getType()]++;
            }
            priorityQueue.clear();
            int bestCount = 0, bestType = -1;
            for (int j = 1; j < 4; j++) {
                if (count[j] > bestCount) {
                    bestCount = count[j];
                    bestType = j;
                }
            }
            if (bestType == dataTesting[i].getType()) cnt = cnt + 1;
        }
        return cnt / numberTesting;
    }

    public static void main(String[] args) throws FileNotFoundException {
        Scanner scanner = new Scanner(System.in);
        Flower[] dataTesting = new Flower[35];
        Flower[] dataTraining = new Flower[125];
        int numberTesting = 31, numberTraining = 120, k;
        char option;
        // Load the data (paths as in the original post)
        Input("C:\\Users\\Administrator\\Desktop\\人工智能\\1033180327_孙创昱_AI_project6\\AI-实验6-KNN\\iris-data-testing.txt",
              numberTesting, dataTesting);
        Input("C:\\Users\\Administrator\\Desktop\\人工智能\\1033180327_孙创昱_AI_project6\\AI-实验6-KNN\\iris-data-training.txt",
              numberTraining, dataTraining);
        // Sanity check: print the loaded data
        for (int i = 0; i < 31; i++) System.out.println(dataTesting[i]);
        System.out.println("--------------------------------------------------------");
        for (int i = 0; i < 120; i++) System.out.println(dataTraining[i]);
        // Choose k automatically or read it from the console
        System.out.println("Search for K? (Y/N)");
        option = scanner.next().charAt(0);
        if (option == 'Y' || option == 'y') {
            k = calculateK(dataTesting, numberTesting, dataTraining, numberTraining);
        } else {
            System.out.println("Enter K:");
            k = scanner.nextInt();
        }
        System.out.println("K=" + k);
        // Recognition rate on the test set
        Double res = KNN(dataTesting, numberTesting, dataTraining, numberTraining, k);
        System.out.println("Recognition rate: " + res);
    }

From the run it can be seen that a good value of k is 45, at which point the recognition rate is 0.9677.
Master's Thesis, Huazhong University of Science and Technology
Abstract
With the rise of artificial intelligence, computer vision has developed rapidly in recent years.
As one of the core research topics of computer vision, visual object tracking has been widely applied in many fields.
Because its applications are so broad, its application scenarios have also become more complex.
Scale change, occlusion, rotation, cluttered backgrounds, low resolution and other factors pose serious challenges to visual object tracking.
The KCF (Kernelized Correlation Filters) tracking algorithm has attracted wide attention since it was proposed.
Building on the KCF tracking algorithm, this thesis proposes several improvements that address its shortcomings.
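For orientation, the core of KCF can be written in a few lines: kernel ridge regression over all cyclic shifts of a patch, solved element-wise in the Fourier domain (Henriques et al.). The Python sketch below uses a Gaussian kernel on single-channel patches and ignores cosine windows, padding and multi-channel features; signs and normalisations follow the usual conventions and may differ in detail from the thesis' implementation.

    import numpy as np

    def gaussian_correlation(x, z, sigma=0.5):
        # Kernel correlation k^xz for all cyclic shifts at once, via the FFT.
        c = np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))).real
        d = (x**2).sum() + (z**2).sum() - 2 * c
        return np.exp(-np.maximum(d, 0) / (sigma**2 * x.size))

    def kcf_train(x, y, lam=1e-4):
        # Closed-form kernel ridge regression: alpha_hat = y_hat / (k_hat^xx + lambda)
        kxx = gaussian_correlation(x, x)
        return np.fft.fft2(y) / (np.fft.fft2(kxx) + lam)

    def kcf_detect(alpha_hat, x_model, z):
        # Response over all translations; its argmax gives the new position.
        kxz = gaussian_correlation(x_model, z)
        return np.fft.ifft2(np.fft.fft2(kxz) * alpha_hat).real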
To address KCF's inability to adapt to changes in target scale, this thesis proposes the SMIKCF (Scale-Adaptive Multiple-Feature Improved-Template-Update KCF) tracking algorithm.
On top of KCF, SMIKCF adds a scale-estimation filter; it extracts HOG (Histogram of Oriented Gradient) and CN (Color Name) features from the samples and fuses the two; and it uses the APCE criterion to improve the update scheme of the position-estimation filter model.
To address KCF's tendency to lose the target under occlusion, this thesis proposes the AOKCF (Anti-Occlusion KCF) tracking algorithm.
On top of KCF, AOKCF uses APCE to assess tracking reliability, and adds a detection module: when a tracking result is judged unreliable, the detection module is activated to detect the target, and the position filter is used for target identification. When the target is identified, the target position and the position filter are updated; otherwise the tracker proceeds directly to the next frame and continues detecting with the detection module.
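The APCE criterion used here is commonly defined as the squared peak-to-minimum range of the response map divided by the mean squared deviation from the minimum; the sketch below follows that common definition, with the gating thresholds left open since the thesis' values are not given in this abstract.

    import numpy as np

    def apce(response):
        # Average Peak-to-Correlation Energy of a correlation response map.
        # A sharp single peak gives high APCE; occlusion or drift flattens the
        # map and APCE drops, which gates model updates and triggers
        # re-detection.
        fmax, fmin = response.max(), response.min()
        return (fmax - fmin) ** 2 / (np.mean((response - fmin) ** 2) + 1e-12)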
The experimental data come from the Visual Tracker Benchmark.
Using the OPE (One-Pass Evaluation) protocol, experiments are run on the Benchmark's 50 test video sequences to evaluate the performance of the algorithms.
The experiments show that the proposed methods effectively solve the problems above and improve tracking performance.
Keywords: object tracking; kernelized correlation filters; scale adaptation; occlusion

ABSTRACT
With the growth of artificial intelligence, computer vision has developed rapidly in recent years. Visual target tracking is one of the most important parts of computer vision research and is widely used in many fields. The application scenarios are becoming more and more complicated due to the wide range of application fields. Scale changes, occlusion, rotation, complex backgrounds, low resolution and other factors bring serious challenges to visual target tracking. The KCF (Kernelized Correlation Filters) tracking algorithm has been widely studied since it was proposed. Based on the KCF tracking algorithm, this thesis proposes improved methods for the shortcomings of KCF.

Firstly, the SMIKCF (Scale-Adaptive Multiple-Feature Improved-Template-Update KCF) tracking algorithm is proposed to solve the problem that the KCF tracking algorithm cannot adapt to the target scale. On the basis of the KCF algorithm, SMIKCF adds a scale-estimation filter, combines HOG and CN features, and uses the APCE criterion to improve the updating method of the position-estimation filter model.

Secondly, the AOKCF (Anti-Occlusion KCF) tracking algorithm is proposed to solve the problem of occlusion. AOKCF is based on the KCF tracking algorithm; the APCE criterion is used to check the reliability of the tracking results. When a tracking result is unreliable, a detection module is activated to detect the target, and the position filter is then used for target recognition. If the target is recognized, the target position and the position filter are updated; otherwise, the tracker goes directly to the next frame.

The experimental data are from the Visual Tracker Benchmark. Experiments were performed on the Benchmark's 50 test video sequences using the OPE (One-Pass Evaluation) method to evaluate the performance of the algorithms. Experimental results show that the proposed methods can effectively solve the related problems and improve the performance of the tracking algorithm.

Keywords: Object tracking; Kernelized correlation filters; Scale adaptive; Occlusion

Contents
Abstract
1 Introduction
1.1 Research Background and Significance
1.2 Research Overview at Home and Abroad
1.3 Main Research Content
1.4 Organization
2 The KCF Tracking Algorithm
2.1 Introduction
2.2 Correlation Filters
2.3 Kernelized Correlation Filters
2.4 Summary
3 An Improved Scale-Adaptive KCF Tracking Algorithm
3.1 Introduction
3.2 Scale Estimation
3.3 Multi-Feature Fusion
3.4 Improved Model Update Strategy
3.5 Overall Framework
3.6 Experimental Results and Analysis
3.7 Summary
4 An Anti-Occlusion KCF Tracking Algorithm
4.1 Introduction
4.2 Tracking Reliability Detection
4.3 Re-Detection
4.4 Overall Framework
4.5 Experimental Results and Analysis
4.6 Summary
5 Conclusions and Outlook
5.1 Summary
5.2 Outlook
Acknowledgements
References
Appendix 1: Publications and Projects During the Degree

1 Introduction
1.1 Research Background and Significance
The rapid development of artificial intelligence has drawn much attention to computer vision. As one of the core research topics of computer vision, visual object tracking has been widely applied in many fields [1].
I.J. Mathematical Sciences and Computing, 2016, 2, 24-33
Published Online April 2016 in MECS. DOI: 10.5815/ijmsc.2016.02.03
Available online at /ijmsc

RMSD Protein Tertiary Structure Prediction with Soft Computing
Mohammad Saber Iraji (a), Hakimeh Ameri (b)
a Faculty Member, Department of Computer Engineering and Information Technology, Payame Noor University, I.R. of Iran
b Teacher, Department of Computer Engineering and Information Technology, Payame Noor University, I.R. of Iran
* Corresponding author. E-mail address: iraji.ms@, ha.amery@

Abstract
Root-mean-square deviation (RMSD) is an indicator used by protein-structure-prediction algorithms (PSPAs), whose goal is to reach 0 Å RMSD from the native protein structure; predicting the RMSD of a candidate structure is therefore essential. In 2013, RMSD was estimated from nine features, with the best results obtained using D2N (distance to the native). This paper proposes an approach to reduce the error of the predicted RMSD relative to its actual value, measured by the mean absolute error (MAE), using a feed-forward neural network and an adaptive neuro-fuzzy method. ANFIS achieves better and more accurate results.

Index Terms: Root-mean-square deviation (RMSD), protein, native structure, neural network, fuzzy.

© 2016 Published by MECS Publisher. Selection and/or peer review under responsibility of the Research Association of Modern Education and Computer Science.

1. Introduction
Finding an accurate algorithm to predict protein structure has proved an extremely difficult problem. There are two fundamentally different approaches. One is ab initio prediction, which takes no help from previously known protein structures and employs computer-based applications to minimize very large functions corresponding to the free energy in the molecules. The other is the knowledge-based approach, which uses amino acid sequences as a source of knowledge; this knowledge is extracted and stored to be used to predict any new amino acid sequence (protein) with known sequence but unknown structure. Knowledge-based methods fall into two groups: statistics based and neural network based [1].

During the past four decades, several protein secondary-structure class-prediction algorithms based on the sequence of protein amino acids have been studied and developed. Techniques such as neural networks, fuzzy neural networks, Support Vector Machines (SVM), the component-coupled algorithm, and the nearest neighbor classifier with a complexity-based distance measure (NN-CDM) are some of them. Among all these works, the neural network based methods have shown the most efficiency. Neural networks are an artificial intelligence technique and a family of statistical learning algorithms that usually involve a large number of processors operating in parallel, each with its own small amount of knowledge, to create a secondary-structure model with optimal parameters. Neurons in a neural network can be arranged to encode the amino acid sequences into a usable numerical form. The classification performance of a neural network prediction algorithm can differ considerably depending on the derivation of a suitable parametric model from the training data set [2]. A fuzzy neural network or neuro-fuzzy system is a learning machine that finds the parameters of a fuzzy system by exploiting approximation techniques from neural networks. This technique is used to calculate the degree of confidence for each prediction result and to overcome the uncertainty problem in prediction [3].
Another technique that can be used to predict secondary structure, based on features extracted from predicted burial information of amino acid residues, is the SVM.

2. Related Work
Sepideh Babaei et al. (2010) studied supervised learning of recurrent neural networks, which is well suited to the prediction of protein secondary structure from the underlying amino acid sequence. They devised a modular prediction system based on the interactive integration of the MRR-NN and MBR-NN structures. To model the strong correlations between adjacent secondary-structure elements, a modular mutual recurrent neural network (MRR-NN) was proposed, while the MBR-NN is a multilayer bidirectional recurrent neural network that can capture the long-range intra-molecular interactions between amino acids in the formation of secondary structure. The proposed model engages the neighboring effects of the secondary-structure types concurrently with memorizing the sequential dependencies of amino acids along the protein chain. They applied their model to the PSIPRED data set under three-fold cross-validation. The reported accuracy was 79.36%, and the segment overlap (SOV) was boosted up to 70.09% [4]. Zhun Zhou et al. (2010) proposed a model named SAC based on an association-rule classifier, with the rules obtained by the KDD* model mining the secondary structure for information on mutual impact. The KDD* model focuses on high-confidence, low-support rules, called "knowledge in shortage". The proposed model was based on the CMAR (Classification based on Multiple Association Rules) algorithm. The accuracies of the proposed model on the RS126 and CB513 data sets are reported as 83.6% and 80.49%, respectively [5].

Wu Qu et al. (2011) proposed a multi-modal back-propagation neural network (MMBP) method to predict protein secondary structure. The model is a compound pyramid model (CPM) composed of three layers of intelligent interface that integrate a multi-modal back-propagation neural network (MMBP), a mixed-modal SVM (MMS), a modified Knowledge Discovery in Databases (KDD) process, and so on. The claimed accuracy of the proposed model on a non-redundant test data set of 256 proteins from RCASP256 was 86.13% [6]. Rohayanti Hassan et al. (2011) recommended a derivative feature vector, DPS, that utilizes the optimal length of the local protein structure to predict secondary structure. The DPS denotes the unification of an amino acid propensity score and a dihedral angle score. A secondary-structure assignment method (SSAM) and a secondary-structure prediction method (SSPM) generate class labels for DPS feature vectors, and a Support Vector Machine (SVM) is used for prediction. The accuracy of the proposed model on the RS126 sequences was 84.4% [7]. Sepideh Babaei et al. (2012) used a multilayer perceptron of recurrent neural network (RNN) type, pruned to optimize the size of the hidden layers, to enhance secondary-structure prediction. A type of reciprocal recurrent neural network (MRR-NN) and a multilayer bidirectional recurrent neural network (MBR-NN), which captures the long-range interactions between amino acids in the formation of secondary structure, were consecutively applied to capture the predominant long-term dependencies.
Finally, a modular prediction system (the MRR-NN and MBR-NN) was applied to the trusted PSIPRED data set, reporting a percentage accuracy (Q3) of up to 76.91% and augmenting the segment overlap (SOV) up to 68.13% [8].

Mohammad Hossein Zangooei et al. (2012) proposed a method based on Support Vector Regression (SVR) classification. They applied the Non-dominated Sorting Genetic Algorithm II (NSGAII) to find mapping points (MPs) for rounding a real value to an integer one. To enhance the performance of the proposed model, they also used NSGAII to find and tune the SVR kernel parameters optimally. To improve the prediction result, a Dynamic Weighted Kernel Fusion (DWKF) method for fusing three SVR kernels was used. The classification accuracies of the proposed model on the RS126 and CB513 data sets are reported as 85.79% and 84.94%, respectively [9].

Maqsood Hayat et al. (2014) proposed a model employing a hybrid descriptor space along with an optimized evidence-theoretic K-nearest-neighbor algorithm. The hybrid space is a composition of two descriptor spaces, multi-profile Bayes and bi-gram probability. The most discriminative descriptors of the hybrid space were selected with a well-known evolutionary feature-selection technique, particle swarm optimization; these features are extracted to enhance the generalization power of the classifier. They used the jackknife test on three low-similarity benchmark databases, including PDB, 1189, and 640, to evaluate the performance of their proposed model. The success rates of their model are 87.0%, 86.6%, and 88.4% on the three benchmark data sets, respectively [10]. Maulika S. Patel et al. (2014) provided a novel hybrid algorithm, KB-PROSSP-NN, which combines a knowledge-based method (KB-PROSSP) with neural-network (NN) modelling of the exceptions in the knowledge base for protein secondary structure prediction (PSSP). The two methods are used in cascade to optimize the results. They used two popular data sets, RS126 and CB396, to evaluate the accuracy of the proposed model. Q3 accuracies of 90.16% and 82.28% were achieved on the RS126 and CB396 test sets, respectively [11]. Yong Tat Tan et al. (2015) observed that the nearest neighbor complexity-distance-measure (NN-CDM) algorithm using the Lempel-Ziv (LZ) complexity-based distance measure is slow and ineffective in handling uncertainties. To solve this problem, they proposed the fuzzy NN-CDM (FKNN-CDM) algorithm, which incorporates the confidence level of the prediction results and accelerates the prediction process with a dedicated hardware architecture. The average prediction accuracies for the Z277 and PDB data sets using the proposed hybrid algorithm are 84.12% and 47.81%, respectively [3].

Avinash Mishra et al. (2014) used random-forest machine learning to predict protein structure quality. Two types of data sets were chosen: (i) public decoys from Decoys 'R' Us, and (ii) server predictions of CASP experiments CASP5 to CASP9. These contain 409 systems with their decoys, covering a root-mean-square-deviation (RMSD) range of 0-30 Å; 278 systems belong to the CASP decoys, while 131 systems are from the public decoy data set. Three training models based on RMSDs are designed.
The first model, "Model-I", was trained on the complete training set of 64,827 structures covering the whole RMSD range (0-30 Å). The second, "Model-II", was trained on 13,793 structures covering the RMSD range 0-10 Å, and the last, "Model-III", was trained on 13,793 structures covering the RMSD range 0-5 Å. The three models were combined, as three different layers, into the final predictor used to estimate the most accurate distance of any given structure. A 10-fold cross-validation was performed, and correlation, R2 and accuracy were used for evaluation. The best reported results are as follows: the native can be predicted as a less-than-1.5 Å RMSD structure in ~89.3% of the cases, and high-quality structures (0-3 Å) can be predicted to within ±1.5 Å error in ~98.5% of the cases; that is, a 2 Å structure may be reported as anywhere between a 1 Å and a 3 Å structure with ~98% accuracy [4]. Yuehui Chen et al. (2012) proposed a method based on an integrated RBF neural network to predict protein interaction sites. Features named sequence profiles, entropy, relative entropy, conservation weight, accessible surface area, and sequence variability were extracted in the first step. Six sliding windows over these features, containing 1, 3, 5, 7, 9, and 11 amino acid residues respectively, were used as the input layers of six radial basis function (RBF) neural networks, which were trained to predict the protein-protein interaction sites. Decision fusion (DF) and Genetic-Algorithm-based Selective Ensembles (GASEN) were used to obtain the final results; the reported accuracy of their method is 0.8032 [12].

In this paper we propose a model for predicting the distance of an estimated structure from the native, based on physicochemical properties of the protein tertiary structure. The paper is organized in five sections. After the introduction in Section 1, Section 2 reviews related work on protein structure estimation and Section 3 presents the mathematical model. Sections 4 and 5 present the results and the conclusions of the research. The paper ends with a list of references.

3. Mathematical Model
Soft computing includes different types of neural networks, fuzzy systems, genetic algorithms, etc., which are used in information-retrieval applications. Fuzzy theory, developed by Zadeh, is an intelligent method stated to solve diverse problems more efficiently than classical computation.

3.1. Neural Network
Developing a neural-net solution means teaching the net a desired behavior. This is called the learning phase. Either sample data sets or a "teacher" can be used in this step. A teacher is either a numerical function or a person that rates the quality of the neural net's performance. Since neural nets are mostly used for complex applications where no adequate mathematical models exist, and rating the performance of a neural net is difficult in most applications, most are trained with sample data (Figure 1).

Fig. 1. Training and Working Phase for Supervised Learning

3.1.1. Neuron Model
An elementary neuron with R inputs is shown in Figure 2. Each input is weighted with an appropriate weight w. The sum of the weighted inputs and the bias forms the input to the transfer function f. Neurons may use any differentiable transfer function f to generate their output.

Fig. 2. Structure of a Neuron
Feed-forward networks often have one or more hidden layers of sigmoid neurons followed by an output layer of linear neurons. Multiple layers of neurons with nonlinear transfer functions allow the network to learn nonlinear and linear relationships between input and output vectors (Figure 3). The linear output layer lets the network produce values outside the range -1 to +1. On the other hand, to constrain the outputs of a network (for example between 0 and 1), the output layer should use a sigmoid transfer function (such as logsig). Such a network can be used as a general function approximator: it can approximate any function with a finite number of discontinuities arbitrarily well, given sufficient neurons in the hidden layer [14].

3.1.2. Neural Network Steps for Estimating RMSD
A back-propagation feed-forward neural network is used to predict the RMSD from the native structure. The network is constructed with six inputs and one output, the RMSD.

Fig. 3. Feed-Forward Neural Network with Two Layers

The input nodes represent the distinguishing parameters of the protein, and the output node represents the RMSD. The network is constructed in MATLAB. In our experiment, the training function is traingda, the adaptation learning function is learngdm, and the performance function is the Mean Square Error (MSE). The data considered for this research are taken from the UCI repository: a data set of physicochemical properties of protein tertiary structures, drawn from CASP 5-9. There are 45,730 decoys, with sizes varying from 0 to 21 Angstrom [13]. The following features were extracted from the data set; the input nodes represent these physicochemical properties of the proteins.

F1: Total surface area.
F2: Non-polar exposed area.
F3: Fractional area of exposed non-polar residues.
F4: Fractional area of exposed non-polar parts of residues.
F5: Average deviation from the standard exposed area of residues.
F6: Spatial distribution constraints (N, K value).

3.2. ANFIS (Adaptive Neuro-Fuzzy Inference System)
ANFIS (Adaptive Neuro-Fuzzy Inference System) is Sugeno-based (Jang, Sun & Mizutani, 1997; Jang & Sun, 1995). A generic rule in a Sugeno fuzzy model has the form: if Input 1 = x and Input 2 = y, then output z = ax + by + c. Figure 4 shows the ANFIS neural network [14]. In Figure 4, the first layer holds the degrees of membership of the linguistic variables; the second layer is the rule layer; after the linear composition of the rules in the third layer, the fourth layer specifies the degree of belonging to a particular class via a sigmoid function. ANFIS is a type of fuzzy neural network with a learning algorithm based on a set of training data for tuning an available rule base, permitting the rule base to reconcile the training data [14].

Fig. 4. Adaptive Neuro-Fuzzy Network (ANFIS) for RMSD

We applied the features F1-F6 and the given training data to ANFIS, set up the related rules, and obtained a more accurate RMSD output (Figure 4).
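As a stand-in for the MATLAB experiment described next, the Python sketch below trains a small feed-forward regressor on the six features and scores it with the MAE of Eqs. (1)-(2); scikit-learn and the hyper-parameters are assumptions (only the single hidden layer of seven units mirrors the paper), not the authors' toolchain.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def evaluate_rmsd_model(X_train, y_train, X_test, y_test):
        # X_* hold the six features F1..F6 per structure; y_* hold RMSD.
        net = MLPRegressor(hidden_layer_sizes=(7,), max_iter=500, random_state=0)
        net.fit(X_train, y_train)
        pred = net.predict(X_test)
        mae = np.mean(np.abs(y_test - pred))           # Eq. (2)
        rmse = np.sqrt(np.mean((y_test - pred) ** 2))  # RMSE, as reported below
        return mae, rmse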
4. Experimental Results
We implemented our proposed system in MATLAB version 7.12 on a laptop with a 1.7 GHz CPU. For the proposed ANFIS system, a database of CASP 5-9 with 45730 records was considered in order to train and test the fuzzy neural network. After calculating the six features described above (total surface area, non-polar exposed area, fractional area of exposed non-polar residue, fractional area of exposed non-polar part of residue, average deviation from standard exposed area of residue, and spatial distribution constraints (N, K value)), 38110 records were used to train the system and 7620 records were allocated to test it.

Fig. 5. Adaptive Neuro-Fuzzy Network (ANFIS) for RMSD

The neural network was set up with six inputs, seven hidden neurons and one output, the RMSD (Figure 5), and the number of epochs was 5. After training the neural network, the RMSE (root mean squared error) obtained was 5.1044 for the training data and 5.1008 for the test data (Figures 6, 7). The test-data RMSE for six and twelve hidden neurons was 5.1347 and 5.1226, respectively. For the adaptive neuro-fuzzy network, after setting the network parameters (generate FIS = grid partition, optimization method = hybrid, linear, training epochs = 5), the RMSE obtained for the training data was 4.7386 (Figure 8).

Fig. 6. Sqrt(MSE) = 5.1044 for Train Data
Fig. 7. Error for Test Data, Sqrt(MSE) = 5.1008

After completing the training process of the adaptive neuro-fuzzy network, the fuzzy input variables were calibrated (Figure 9) and the 7620 records were used for testing.

Fig. 8. Calculated RMSE for Train Data with the Proposed ANFIS System
Fig. 9. Fuzzy Membership of the Input Variables for the Proposed ANFIS System Before and After Calibration

Table 1. Comparison of Actual RMSD and Predicted RMSD for the Test Data

Several methods exist to compare estimation models; each method has its advantages and disadvantages. In this work, the mean absolute error (MAE) of the RMSD is used. The absolute error AE for each observation i can be obtained as

AE_i = |actual_i - predicted_i|    (1)

and the MAE is the average of the absolute errors over N observations:

MAE = (1/N) * Σ_{i=1}^{N} AE_i    (2)

After training the adaptive fuzzy neural network on the UCI data, we applied the proposed neuro-fuzzy system to the 7620 test records. For example, for row 2 the proposed system predicted an RMSD of 3.250838 when F1 = 7511.48, F2 = 1975.7, F3 = 0.26302, F4 = 85.9864, F5 = 114.437 and F6 = 37.2338.

5. Conclusions
Our purpose in this research was to predict the exact value of the root-mean-square deviation (RMSD) of an estimated protein structure from the native protein structure, and to create an adaptive neuro-fuzzy model for this purpose. The total surface area, non-polar exposed area, fractional area of exposed non-polar residue, fractional area of exposed non-polar part of residue, average deviation from standard exposed area of residue, and spatial distribution constraints (N, K value) features were considered for RMSD estimation. Using the proposed adaptive neuro-fuzzy combinatorial system, the mean absolute error (MAE) indicator is 3.845204 for ANFIS and 4.202547 for the neural network. The ANFIS model is better; the calculations on the experimental results in Table 1 support this claim. As future work, the proposed model for the distance from the native protein structure can be developed with more project data, and other neural methods can be used to determine the exact RMSD in industrial environments and on other data sets. Other protein features can also be considered for protein structure prediction. Better results can probably be achieved by changing the number of linguistic variables or the type of membership function.

References
[1] Saha, S. (2008). Protein Secondary Structure Prediction by Fuzzy Min Max Neural Network with Compensatory Neurons (Doctoral dissertation, Indian Institute of Technology, Kharagpur).
[2] Sakk, E., & Alexander, A. (2013). On the variability of neural network classification measures in the protein secondary structure prediction problem. Applied Computational Intelligence and Soft Computing, 2013, 1-9. Available online (/journals/acisc/2013/794350/), accessed 4/16/2015.
[3] Tan, Y. T., & Rosdi, B. A. (2015). FPGA-based hardware accelerator for the prediction of protein secondary class via fuzzy K-nearest neighbors with Lempel-Ziv complexity based distance measure. Neurocomputing, 148, 409-419.
[4] Babaei, S., Geranmayeh, A., & Seyyedsalehi, S. A. (2010). Protein secondary structure prediction using modular reciprocal bidirectional recurrent neural networks. Computer Methods and Programs in Biomedicine, 100(3), 237-247.
[5] Zhou, Z., Yang, B., & Hou, W. (2010). Association classification algorithm based on structure sequence in protein secondary structure prediction. Expert Systems with Applications, 37(9), 6381-6389.
[6] Qu, W., Sui, H., Yang, B., & Qian, W. (2011). Improving protein secondary structure prediction using a multi-modal BP method. Computers in Biology and Medicine, 41(10), 946-959.
[7] Hassan, R., Othman, R. M., Saad, P., & Kasim, S. (2011). A compact hybrid feature vector for an accurate secondary structure prediction. Information Sciences, 181(23), 5267-5277.
[8] Babaei, S., Geranmayeh, A., & Seyyedsalehi, S. A. (2012). Towards designing modular recurrent neural networks in learning protein secondary structures. Expert Systems with Applications, 39(6), 6263-6274.
[9] Zangooei, M. H., & Jalili, S. (2012). Protein secondary structure prediction using DWKF based on SVR-NSGAII. Neurocomputing, 94, 87-101.
[10] Hayat, M., Tahir, M., & Khan, S. A. (2014). Prediction of protein structure classes using hybrid space of multi-profile Bayes and bi-gram probability feature spaces. Journal of Theoretical Biology, 346, 8-15.
[11] Patel, M. S., & Mazumdar, H. S. (2014). Knowledge base and neural network approach for protein secondary structure prediction. Journal of Theoretical Biology, 361, 182-189.
[12] Chen, Y., Xu, J., Yang, B., Zhao, Y., & He, W. (2012). A novel method for prediction of protein interaction sites based on integrated RBF neural networks. Computers in Biology and Medicine, 42(4), 402-407.
[13] Mishra, A., Rana, P. S., Mittal, A., & Jayaram, B. (2014). D2N: Distance to the native. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1844(10), 1798-1807.
[14] Iraji, M. S., & Motameni, H. (2012). Object Oriented Software Effort Estimate with Adaptive Neuro Fuzzy use Case Size Point (ANFUSP). International Journal of Intelligent Systems and Applications (IJISA), 4(6), 14.

Authors' Profiles
Mohammad Saber Iraji received a B.Sc. in Computer Software Engineering from Shomal University, Amol, Iran; an M.Sc. in Industrial Engineering (system management and productivity) from Tehran, Iran; and an M.Sc. in Computer Science. Currently, he is engaged in research and teaching on computer graphics, image processing, fuzzy and artificial intelligence, data mining and software engineering, and he is a faculty member of the Department of Computer Engineering and Information Technology, Payame Noor University, I.R. of Iran.
Hakimeh Ameri graduated with an M.S. in Information Technology from K.N. Toosi University of Science and Technology. Her main research interests are bioinformatics, data analysis and big data. She has 7 published papers in this field.
She now teaches Artificial Intelligence, data mining, information technology, programming languages and data structures at university.

How to cite this paper: Mohammad Saber Iraji, Hakimeh Ameri, "RMSD Protein Tertiary Structure Prediction with Soft Computing", International Journal of Mathematical Sciences and Computing (IJMSC), Vol.2, No.2, pp.24-33, 2016. DOI: 10.5815/ijmsc.2016.02.03
Data Min Knowl Disc (2010) 21:345-370
DOI 10.1007/s10618-009-0157-y

Density-based semi-supervised clustering
Carlos Ruiz · Myra Spiliopoulou · Ernestina Menasalvas
Received: 2 February 2008 / Accepted: 31 October 2009 / Published online: 21 November 2009
© The Author(s) 2009

Abstract  Semi-supervised clustering methods guide the data partitioning and grouping process by exploiting background knowledge, among else in the form of constraints. In this study, we propose a semi-supervised density-based clustering method. Density-based algorithms are traditionally used in applications, where the anticipated groups are expected to assume non-spherical shapes and/or differ in cardinality or density. Many such applications, among else those on GIS, lend themselves to constraint-based clustering, because there is a priori knowledge on the group membership of some records. In fact, constraints might be the only way to prevent the formation of clusters that do not conform to the applications' semantics. For example, geographical objects, e.g. houses, separated by a borderline or a river may not be assigned to the same cluster, independently of their physical proximity. We first provide an overview of constraint-based clustering for different families of clustering algorithms. Then, we concentrate on the density-based algorithms' family and select the algorithm DBSCAN, which we enhance with Must-Link and Cannot-Link constraints. Our enhancement is seamless: we allow DBSCAN to build temporary clusters, which we then split or merge according to the constraints. Our experiments on synthetic and real datasets show that our approach improves the performance of the algorithm.

Keywords  Constraint-based clustering · Semi-supervised clustering · Density-based clustering · Instance level constraints · Constraints

Responsible editor: Eamonn Keogh.
Part of Ernestina Menasalvas' work was funded by the Spanish ministry under project grant TIN2008-05924.
C. Ruiz · E. Menasalvas: Facultad de Informática, Universidad Politecnica de Madrid, Madrid, Spain. e-mail: cruiz@cettico.fi.upm.es; emenasalvas@fi.upm.es
M. Spiliopoulou: Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, Magdeburg, Germany. e-mail: myra@iti.cs.uni-magdeburg.de

1 Introduction
Clustering with instance-level constraints has received a lot of attention in recent years.
It is a subcategory of semi-supervised clustering, which allows the human expert to incorporate domain knowledge into the data mining process and thus guide it to better results (Anand et al. 2004; Kopanas et al. 2002). The use of instance-level constraints for this task is motivated by a specific form of background knowledge that can be found in many applications: while the majority of the records/instances to be clustered are unlabeled, there is a priori knowledge for some instances, such as their label or their relationship to other instances. The so-called "Must-Link" and "Cannot-Link" constraints capture relationships among data instances, e.g. that two houses located at different sides of a border should not be assigned to the same village/cluster, even if they are physically close to each other. As another example, Wagstaff et al. (2001) use constraints among vehicle instances to identify the road lanes (clusters) from GPS data. Beyond contributing to better quality of the clustering results, constraints also have the potential to enhance computation performance (Davidson and Ravi 2005) by speeding up convergence. Furthermore, as reported in Bennett et al. (2000), constraints can also be used to prevent the formation of empty clusters.

Semi-supervised clustering with constraints can be designed for any type of clustering algorithm. Most studies have concentrated on partitioning algorithms like K-means, but there are also works on hierarchical constraint-based clustering (Davidson and Ravi 2005). In this study, we propose the use of instance-level constraints in density-based clustering and present C-DBSCAN, a constraint-based variation of the clustering algorithm DBSCAN proposed by Ester et al. (1996) for the grouping of noisy data in (possibly non-spherical) clusters.

Density-based clustering lends itself to constraint enforcement: differently from partitioning algorithms, which strive for a globally optimal partitioning of the data space, density-based algorithms like DBSCAN build solutions that are only locally optimal. Hence, they can exploit both Must-Link and Cannot-Link constraints between proximal instances. This is an inherent advantage over constraint-based partitioning algorithms, which may fail to converge in the presence of Cannot-Link constraints (Davidson and Ravi 2005).

A further advantage of density-based clustering with constraints lies in the nature of the applications. Algorithms like DBSCAN are useful, among others, for geographic applications, where knowledge of some geographical formations may be a priori available. In Figs. 1 and 2, we depict clusters on real and artificial data that exhibit remarkable constructs: they contain "bridges", locations of higher/lower density, diffuse borders or specific shapes. Such constructs are not unusual in geographical applications.

Fig. 1  Clusters that overlap (a), have different densities (b), have a bridge (c)
Fig. 2  Datasets where DBSCAN performs poorly: (a) DS1, (b) DS2, (c) DS3, (d) DS4

These figures depict datasets for which DBSCAN is known to perform poorly. Hence, it is reasonable to attempt improvement through constraint exploitation.
As we will show, our constraint-based variation of DBSCAN, C-DBSCAN, achieves indeed remarkable improvements for the above datasets. Earlier versions of our work have appeared in Ruiz et al. (2007a, b).

The paper is organized as follows: In Sect. 2, we provide an overview of constraint-based clustering methods for different families of algorithms. We elaborate on the different ways of incorporating the constraint enforcement procedure into an algorithm, on the different types of constraints being used and on methods for building the set of constraints to be enforced. Section 3 starts with a brief introduction to density-based clustering and a concise description of DBSCAN. We then present our constraint-based algorithm C-DBSCAN. In Sect. 4, we describe the framework of our experimental evaluation, including the evaluation criteria and the synthetic and real datasets we have used. Section 5 contains our experiments for different datasets and different sets of constraints, juxtaposing among else the performance of C-DBSCAN when only Must-Link or only Cannot-Link constraints are available. The last section concludes our study with a summary of our findings and a discussion on open issues.

2 Clustering with constraints
Han et al. (1999) point out the need to organize data mining processes in an effective way, separating between computer labour and activities to be performed by human users. They argue that computers should undertake tasks like database search, counting and algorithm execution, while users should specify the focus of data mining and guide the process. Constraint-based mining is an appropriate method for guiding a mining algorithm through the solution space.

This section contains a concise overview of constraint-based clustering. First, we discuss constraint types, including those proposed in the early work of Han et al. (1999). Next, we group the constraint-based clustering algorithms depending on how they enforce constraints (Sect. 2.2). Then we elaborate on methods that build an optimal set of constraints for the mining problem (Sect. 2.3).

2.1 Different types of constraints
Han et al. (1999) distinguish among constraints on the knowledge to be extracted from the data, on the selection of data that are relevant to the mining task (data constraints), on the feature space over these data and on the interestingness of the mining results. Data constraints, also termed instance-level constraints, have been the subject of much attention and research in recent years.

In the specific domain of spatial databases, physical constraints appear by nature, so there are many methods that model and exploit them (Zaïane and Lee 2002a, b). In particular, constraints are used to capture formations like rivers, woods or villages that can be described with help of two-dimensional polygons. Such methods assume a grid over the data space and look for geometrical shapes in it.

In Ruiz et al. (2006), we have proposed different types of constraints for constraint-based clustering over stream data, distinguishing among constraints on pattern (i.e. on the whole of a cluster), constraints on attribute (i.e. on the feature space) and co-appearance and separation constraints (i.e. constraints on sets of data instances).
Hereafter, we concentrate on instance-level constraints for static data.

Another categorization of constraints can be made on the basis of the "object(s)" referenced by the constraints: instance-level constraints refer to individual data records. Traditionally, one distinguishes between Must-Link constraints on instances that must appear together and Cannot-Link constraints on instances that are not allowed to be in the same cluster (Wagstaff et al. 2001). The τn-constraint on cluster cardinality (Bennett et al. 2000) refers to a whole cluster instead. The ε-constraint on the distance between points inside a cluster states that any two points must be closer than a distance threshold ε, while the δ-constraint on the distance between clusters requires that points in different clusters must be at a distance of at least δ (Davidson and Ravi 2005); both constraint types refer to whole clusters.

Clustering with instance-level constraints has recently received a lot of attention (Davidson and Basu 2005, 2006), also under the label semi-supervised clustering (Gunopulos et al. 2006). The core idea is that background knowledge on cluster membership may exist for a small number of data records. This knowledge can be effectively exploited to guide the clustering algorithm: while a traditional clustering algorithm treats all records equally, a constraint points out that two given instances must be assigned to the same cluster or, conversely, that they must be assigned to different clusters.

Example 1  Consider the queries Q1, Q2, Q3, where Q1 contains the keyword "Basel", Q2 is on "risk management in banks" and Q3 mentions "quality control" and "process". Although risk management and the city Basel in Switzerland are not related in general, one might know (or figure out by studying the query results) that both users were interested in the Basel regulations on bank services, especially with respect to risk management. Basel regulations include among else quality control and certification, so that Q3 seems also related to Q1. However, inspection may reveal that those who launched Q3 were interested in quality control for manufacturing processes. Thus we conclude that Q1 and Q2 must belong to the same cluster, while Q3 must belong to a different cluster than Q1. This results in a Must-Link constraint on Q1, Q2 and in a Cannot-Link constraint on Q1, Q3.

The identification of the constraints is achieved by manual inspection. This cannot be afforded for all data. However, research on constraint-based clustering (Wagstaff et al. 2001; Davidson and Ravi 2005; Davidson and Basu 2006) shows that the identification and exploitation of a small number of constraints is adequate to enhance the clustering of the whole dataset.
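As a minimal illustration (mine, not the paper's code) of how such instance-level constraints are typically represented: Must-Link pairs form an equivalence relation whose transitive closure union-find maintains, while Cannot-Link pairs are stored as forbidden pairs and checked against the Must-Link groups.

class Constraints:
    # Must-Link pairs are kept transitively closed via union-find;
    # Cannot-Link pairs are forbidden pairs checked against the groups.
    def __init__(self, n):
        self.parent = list(range(n))
        self.cannot = set()

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def must_link(self, i, j):
        self.parent[self.find(i)] = self.find(j)

    def cannot_link(self, i, j):
        self.cannot.add(frozenset((i, j)))

    def allowed_together(self, i, j):
        ri, rj = self.find(i), self.find(j)
        return not any({self.find(a), self.find(b)} == {ri, rj}
                       for a, b in (tuple(p) for p in self.cannot))

# Example 1 with Q1=0, Q2=1, Q3=2:
cons = Constraints(3)
cons.must_link(0, 1)      # Must-Link(Q1, Q2)
cons.cannot_link(0, 2)    # Cannot-Link(Q1, Q3)
print(cons.allowed_together(1, 2))  # False: Q2 inherits Q1's Cannot-Link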
2.2 Different ways of instance-level constraint enforcement
Clustering with enforcement of instance-level constraints encompasses methods that adhere to different principles. We first discuss methods that embed constraints into the optimization criterion of the clustering algorithm, distinguishing among algorithms that optimize a global objective function (Wagstaff et al. 2001; Basu et al. 2002; Davidson and Ravi 2005) and those that optimize locally (Demiriz et al. 1999; Wagstaff and Cardie 2000; Davidson and Ravi 2005). We then turn to algorithms that use the constraints to "learn" a new distance metric for the clustering algorithm (Basu et al. 2004b; Bilenko et al. 2004; Halkidi et al. 2005). Under this approach, which is often termed distance-based, clustering is done in a conventional way, using the new distance function. Finally, we discuss algorithms that combine both approaches: the incorporation of constraints into the optimization criterion and the learning of the distance function.

Our density-based algorithm C-DBSCAN (Ruiz et al. 2007a, b) belongs to the first category above. It embeds instance-level constraints into the local optimization criterion of the density-based clustering algorithm DBSCAN.

2.2.1 Embedding constraints into the optimization criterion
Methods of this category enforce constraints by modifying the optimization criterion of the clustering algorithm. Wagstaff et al. have embedded constraints into the incremental COBWEB algorithm (Wagstaff and Cardie 2000). The new algorithm COP-COBWEB returns a clustering that satisfies all constraints; if no such clustering can be built, no solution is returned. Similarly, COP-K-means extends the partitioning algorithm K-means into a variation that enforces all constraints; if it is not possible to satisfy all constraints, no clustering is built (Wagstaff et al. 2001). The experiments of the authors on UCI datasets (Newman et al. 1998) show that the new algorithms are superior to their conventional variants in terms of both computational time and convergence speed.

Basu et al. (2002) investigate how constraints can be used to set the initial seeds for K-means. Instead of selecting the seeds randomly, they build the transitive closure of the instance-level constraints, compute clusters upon it and then use the centers of these clusters as seeds. They derive two methods, one that ignores the constraints after the initialization step ("Seeded-K-Means") and one that enforces them through all subsequent steps until convergence. Since datasets are prone to noise, Seeded-K-means is more appropriate for noisy datasets where the constraints may be less trustworthy.

Constraint-based variations of K-means are algorithms that try to build a globally optimal solution. This is not always convenient, though, because there is no guarantee that the constraint-based algorithm will converge at all. Davidson and Ravi (2005) show that the satisfaction of all Must-Link and Cannot-Link constraints by K-means is an NP-complete problem, and that Cannot-Link constraints may prevent the algorithm from converging.

The convergence problem is trivially avoided by embedding constraints into algorithms that operate locally instead of building globally optimal solutions. Davidson and Ravi (2005) propose constraint enforcement with a hierarchical clustering algorithm and show that constraint satisfaction becomes a P-complete problem. Our C-DBSCAN is based on the same observation: C-DBSCAN operates locally upon the data and, as we show, builds clusters that satisfy all constraints. It also exhibits the inherent advantage of the base algorithm DBSCAN in being robust towards clusters of arbitrary shapes (Ester et al. 1996).

An early algorithm that embeds constraints indirectly into the optimization criterion is worth mentioning here. Demiriz et al. (1999) embed the constraints into the selection and cross-over of a genetic algorithm: they combine an unsupervised learner which minimizes within-cluster variance with a supervised technique that uses the records with known class label to minimize cluster impurity.
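The COP-K-means behaviour just described, assigning each point to the nearest centroid whose cluster violates no constraint and declaring failure when none exists, can be sketched as follows (an illustration of the published idea, not the original code):

def cop_assign(p, centroid_order, clusters, must, cannot):
    # centroid_order: cluster indices sorted by distance from point p;
    # must / cannot: sets of frozenset point-id pairs (the constraints).
    for c in centroid_order:
        violates = any(frozenset((p, q)) in cannot for q in clusters[c])
        # Is a Must-Link partner of p already placed in another cluster?
        stranded = any(frozenset((p, q)) in must
                       for d, members in enumerate(clusters) if d != c
                       for q in members)
        if not violates and not stranded:
            clusters[c].append(p)
            return c
    return None  # no feasible cluster: COP-K-means returns no clustering

clusters = [[0], [1]]
must, cannot = {frozenset((2, 0))}, {frozenset((2, 1))}
print(cop_assign(2, [1, 0], clusters, must, cannot))  # 0: nearer cluster 1 is forbidden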
2.2.2 Embedding constraints into the distance function
Some clustering algorithms use the constraints to learn a new, adjusted distance/similarity function. These algorithms are often called semi-supervised learning algorithms. In the new function, instances associated with a Must-Link constraint are "pushed closer" to each other while instances associated with a Cannot-Link constraint are "pulled away" from each other. The function is typically implemented in a look-up conversion matrix, thus allowing for individual distance values for any pair of records. The use of an adjusted function for constraint enforcement implies that constraints may be violated, e.g. by separating instances that should be linked, because they are still too far away from each other. Constraint violation may also be associated with some penalty value.

Different distance functions can be used as basis for this group of algorithms. Xing et al. (2003) model the learning of the distance function as a convex optimization problem; they use the Mahalanobis distance as basis. Klein et al. (2002) use the Euclidean distance as a basis and learn a distance function that computes the shortest path between points. It allows them to further generalize instance-level constraints into a global relationship among points in a Euclidean space. They show that this generalization is superior to earlier approaches. The asymmetric Kullback-Leibler divergence is used by Cohn et al. (2003). Their method is designed for document clustering with user feedback, so the learned distance function reflects the different perspectives of users upon documents.

Bilenko and Mooney (2003) use semi-supervised learning to detect duplicate records in a database. For this task, they model the records as vectors of tokens, but recognize that some tokens are better indicators of similarity than others. To capture this in the model, they learn a new edit distance function that assigns more weight to those tokens. They incorporate this learnable edit distance into an Expectation-Maximization clustering algorithm. They have also developed a novel similarity measure that is learned with an SVM. Their results show that the semi-supervised variation is superior to conventional EM, while the semi-supervised SVM exhibits irregular performance (Bilenko and Mooney 2003).

In a later work, Bilenko et al. (2004) propose MPCK-means: this method incorporates the learning of the distance function on the labeled data and on the data affected by the constraints in each iteration of the clustering algorithm. Hence, this algorithm learns a different distance function for each cluster.
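A toy version (my illustration, with arbitrary scaling factors) of the "push closer / pull away" adjustment on a pairwise distance look-up matrix:

import numpy as np

def adjust_distances(D, must, cannot, shrink=0.1, grow=10.0):
    # D: symmetric pairwise distance matrix (the look-up table);
    # Must-Link pairs are pushed closer, Cannot-Link pairs pulled apart.
    D = D.copy()
    for i, j in must:
        D[i, j] = D[j, i] = D[i, j] * shrink
    for i, j in cannot:
        D[i, j] = D[j, i] = D[i, j] * grow
    return D

D = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.5],
              [2.0, 1.5, 0.0]])
print(adjust_distances(D, must=[(0, 1)], cannot=[(0, 2)]))

Note that such a look-up table may no longer satisfy the triangle inequality, which is one reason Klein et al. propagate the adjusted values through shortest paths between points.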
2.2.3 Hybrid methods
Algorithms that embed constraints into the distance function can penalize constraint violation but do not prohibit it per se. On the other hand, algorithms that embed constraints in the objective function may deliver no solution at all. Hybrid algorithms attempt to avoid both pitfalls.

Basu et al. (2004b) propose semi-supervised probabilistic clustering with Hidden Markov random fields (HMRFs). The authors propose a framework for the learning of the distance function, which allows the use of different metrics like cosine, Bregman divergence, etc. The learned function is used in a variation of K-means for clustering. The new algorithm, HMRF-K-means, minimizes an objective function that is derived from the joint probability of the model together with the penalty values for constraint violations. Bilenko et al. (2004) minimize the objective function of MPCK-means that uses learned distance functions.

A further method of the same category appears in Halkidi et al. (2005): constraint violation penalties are incorporated into the distance metric and then combined with a modified objective function. The authors use a hill-climbing algorithm to find an optimal solution. Optimality includes satisfying a maximal number of constraints. The objective function uses the S_Dbw measure proposed in Vazirgiannis et al. (2003); this measure takes into account both the cohesion and the separation of the clusters.

2.2.4 Constraints on streams
All of the above methods enforce constraints in a static dataset. In recent years, there has been much research on clustering data that are not static but rather arrive as a stream. Stream clustering algorithms lend themselves quite naturally to constraint enforcement, because knowledge on already seen data can be exploited to adapt the clusters on data that arrive later on. In Ruiz et al. (2009) we propose C-DENSTREAM, a density-based stream clustering algorithm that exploits so-called "micro-constraints". These are instance-level constraints that involve instances seen at different time points.

2.2.5 Closing remarks
A major challenge for constraint-based clustering is the interplay between achieving a feasible solution (i.e. ensuring convergence) and satisfying all constraints. The convergence issue concerns only algorithms like K-means, which build a globally optimal solution. For K-means, Davidson and Ravi (2005) have shown that the satisfaction of both Must-Link and Cannot-Link constraints is an NP-complete problem and that it is the presence of Cannot-Link constraints which may prevent convergence. Clustering algorithms that embed the constraints into the distance function avoid the convergence problem but allow for solutions that may violate some constraints. The same holds for algorithms that embed the constraints into both the objective function and the distance function.

Clustering algorithms that build local solutions do not face the convergence problem (by nature) and are thus very appropriate for constraint enforcement. We follow this line of thinking here by proposing a constraint-based variation of the density-based clustering algorithm DBSCAN. As we show in the next sections, C-DBSCAN satisfies all input constraints while demonstrating performance improvements over DBSCAN, especially in the cases where DBSCAN is known to perform poorly.

2.3 Determining the set of constraints
Research performed in the last years on the exploitation of domain information in the form of instance-level constraints has shown that improvements in results highly depend on the selected set of constraints (Wagstaff 2002; Davidson et al. 2006). Therefore, most constraint-based methods calculate the average performance achievements of the results using random sets of constraints (Wagstaff and Cardie 2000; Wagstaff et al. 2001; Halkidi et al. 2005; Davidson and Ravi 2005; Ruiz et al. 2007a). Instead of using a randomly-generated set of constraints, some methods build the constraint set in interaction with the user (Cohn et al. 2003; Davidson et al. 2007).
Cohn et al. (2003) describe an interactive method that derives relationships among documents and document-group memberships and then returns them to the user for feedback. With help of user feedback, the method can then adjust the constraints and achieve a better, user-tailored clustering. The authors show that the human interaction leads to the discovery of more intuitive relationships. Obviously, the method has the disadvantage of requiring cluster recomputation any time the set of constraints is adjusted. Davidson et al. (2007) incorporate user feedback for constraint adjustment in an incremental clustering method, thus reducing the overhead of re-clustering.

Davidson et al. (2006) propose two quantitative measures that calculate the expected benefit of each constraint on the clustering results. The information measure computes the additional amount of information that the algorithm obtains through the set of constraints, i.e. the amount of information that the algorithm would not be able to acquire without the constraints. The coherence measure computes the number of times the constraints agree with (i.e. are compliant with) the objective function. The authors show that the higher the coherence and the amount of information contributed by the set of constraints, the higher is also the performance improvement.

3 Semi-supervised clustering with C-DBSCAN
We perform semi-supervised clustering with constraint enforcement on the basis of a density-based algorithm. We first provide a short overview of density-based clustering methods and then describe our own algorithm C-DBSCAN.

3.1 A retrospective to density-based clustering
A number of successful density-based clustering algorithms were proposed at the end of the previous decade, including DBSCAN (Ester et al. 1996), DENCLUE (Hinneburg and Keim 1998) and OPTICS (Ankerst et al. 1999). DBSCAN has been designed for the clustering of large noisy datasets with some emphasis on spatial data (Ester et al. 1996). DBSCAN introduced the concept of "neighbourhood" as a region of given radius (i.e. a sphere) containing a minimum number of data points.
Connected neighbourhoods form clusters, thus departing from the notion of a spherical cluster. DENCLUE (Hinneburg and Keim 1998) is also based on neighbourhoods, concentrating on high-dimensional multimedia databases. Within the multidimensional data space, DENCLUE computes the impact of a data point upon its neighbourhood and uses it to assign data points to clusters. OPTICS (Ankerst et al. 1999) is not a clustering algorithm in the strict sense; it rather contributes in identifying the clustering structure by ordering points and reachability distances in a way that can be exploited by a density-based algorithm. Like DBSCAN and DENCLUE, it observes density as a regional/local phenomenon.

Dense neighbourhoods have been considered in STING (Wang et al. 1997), WAVECLUSTER (Sheikholeslami et al. 1998), CLIQUE (Agrawal et al. 1998) and DESCRY (Angiulli et al. 2004). STING (Wang et al. 1997) uses a hierarchical statistical information grid to reduce the computational cost of data querying. First, it recursively divides the data space into cells that contain statistics over sets of objects, such as cardinality, mean, deviation and other information on the distribution. Then, each cell points to a set of smaller cells at the next lowest level of the tree. WAVECLUSTER (Sheikholeslami et al. 1998) applies a wavelet transform to discover relative distances between objects at different levels of resolution. It divides the space using a grid structure and creates an n-dimensional feature space where the wavelet transform is performed multiple times. CLIQUE (Agrawal et al. 1998) combines a density-based and a grid-based approach to find clusters embedded in subspaces of a high-dimensional data space: it partitions the data space into rectangular units that are considered dense if they contain a minimum number of points. Connected dense units form clusters.

The more recently proposed DESCRY (Angiulli et al. 2004) returns back to the idea of a neighbourhood as a spherical region, as it was used in DBSCAN (Ester et al. 1996).
DESCRY builds clusters in four steps: first, the data are sampled; then they are partitioned using a KD-Tree (Bentley 1975), whereupon the centers of the leaf nodes (the so-called "meta-points") are computed. Clustering is performed as a last step, using a hierarchical agglomerative algorithm. Similarly to DESCRY, we also use a KD-Tree to build the initial regions, but thereafter we use DBSCAN and a set of constraints to connect regions into clusters.

3.2 Introducing C-DBSCAN
The original DBSCAN discovers and connects dense neighbourhoods of data points (Ester et al. 1996). In particular, it identifies core points, i.e. those having at least MinPts data points as neighbours within a radius Eps. It then connects overlapping neighbourhoods into clusters. Data points within the same neighbourhood are termed density-reachable from the core point; those in overlapping neighbourhoods are density-connected to each other.

Our algorithm C-DBSCAN (cf. Algorithm 1) extends DBSCAN in three steps. We first partition the data space into denser subspaces with help of a KD-Tree (Bentley 1975). From them we build a set of initial local clusters; these are groups of points within the leaf nodes of the KD-tree, split so finely that they already satisfy those Cannot-Link constraints that are associated with their contents. Then, we merge density-connected local clusters and enforce the Must-Link constraints. Finally, we merge adjacent neighbourhoods in a bottom-up fashion and enforce the remaining Cannot-Link constraints.

In the next subsection, we explain how C-DBSCAN partitions the data space. In Sect. 3.4, we discuss how the algorithm enforces constraints during cluster construction. Both subsections refer to the pseudo-code of Algorithm 1. We use as an example the data depicted in Fig. 3a, together with some constraints on data instances, as shown in Fig. 3b.

3.3 Partitioning the data space
For the partitioning step, we use the KD-Tree proposed in Bentley (1975). The motivation is to deal effectively with subareas that have different densities. The KD-Tree construction algorithm divides the data space iteratively into cubic structures by splitting planes that are perpendicular to the axes. Each cube becomes a node and is further partitioned as long as each new node contains a minimum number of data points, specified as an input parameter. The result is an unbalanced tree: small leaf nodes capture locally dense subareas while large leaf nodes cover the less dense subareas. In C-DBSCAN, we set the threshold value MinPts to the same value as this parameter, since the goals of the two are very similar: to define and select dense areas.

Since there are many possible ways to choose axis-aligned splitting planes, there are many different ways to construct KD-trees. In C-DBSCAN, we perform a depth-first tree traversal; at each level, we split first across the X-axis as long as the new nodes have a minimum number of points, choosing the median as cut-off for the plane perpendicular to the axis.
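A compact sketch of the median-split recursion just described (my simplification: 2-D points, alternating axes, stopping when a child would drop below min_pts points):

import numpy as np

def kd_partition(points, min_pts, axis=0):
    # Split at the median of the current axis while both halves keep
    # at least min_pts points; leaves become the initial local regions.
    if len(points) < 2 * min_pts:
        return [points]                      # leaf node: one dense region
    order = points[:, axis].argsort()
    mid = len(points) // 2
    left, right = points[order[:mid]], points[order[mid:]]
    nxt = (axis + 1) % points.shape[1]       # alternate the splitting axis
    return kd_partition(left, min_pts, nxt) + kd_partition(right, min_pts, nxt)

pts = np.random.rand(200, 2)
leaves = kd_partition(pts, min_pts=10)
print(len(leaves), [len(l) for l in leaves[:5]])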
An Improved Randomness Fuzzy Vault Scheme
Bulletin of Science and Technology, Vol. 27, No. 2, Mar. 2011
Liu Yantao, You Lin
(Institute of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310018, China)

Abstract: As a popular kind of fingerprint-key technology, the fuzzy vault scheme is receiving more and more attention. Based on the existing fuzzy vault algorithm, this paper proposes a new fuzzy vault algorithm. In the algorithm: first, a group of random numbers is used to establish a linear function during the locking process, so that the length of the function is no longer related to the length of the key information; second, a pseudorandom sequence generator is introduced, which takes fingerprint eigenvalues as generating elements and produces the random arrays used to construct the fuzzy vault; in addition, this paper also proposes the concept of key separation, whereby the key information is divided into two parts stored separately in the two stages of the locking process, so that the key information and the user identity are bound together more firmly, gaining higher randomness and security.
Keywords: fingerprint key; fuzzy vault; pseudorandom sequence; security; randomness
CLC number: TP393.08  Document code: A  Article ID: 1001-7119(2011)02-0288-05
Received: 2010-11-01. Supported by the National Natural Science Foundation of China (No. 60763009). About the author: Liu Yantao (1984-), male, from Nehe, Heilongjiang Province, graduate student, information security.

0 Introduction
With the continuous development of science and technology, people's lives are becoming more and more networked and electronic, producing a large amount of information exchange and transmission. Ensuring that the important information of individuals or groups is not stolen during transmission and storage places higher requirements on the corresponding security systems. Traditional cryptosystems, although they ensure information security to a large extent, still have drawbacks in some respects [1], such as the limited length of passwords, the inconvenience of storage and management, and the difficulty of memorization. Therefore, a new encryption technology, biometric encryption, has been studied and proposed. Biometric encryption combines a biometric template with a key in some way to generate a biometric key, replacing the traditional way of protecting keys with passwords or smart cards. Biometric encryption binds the user identity and the key information well together and overcomes the shortcomings of traditional cryptosystems, but it also faces some problems, for example: how to protect the key with biometric features while guaranteeing that none of the user's biometric information is leaked, and how to better combine the precision of cryptosystems with the fuzziness of biometric features.

Juels and Sudan proposed the fuzzy vault scheme [3], which solves the above problems well to a certain extent. The fuzzy vault algorithm binds the key information to the fingerprint feature data through a degree-d polynomial to construct a data set, and then adds a large number of noise points to the set; this both encrypts the information, binding the key information to the biometric features, and hides the biometric information.

Based on the original fuzzy vault algorithm, this paper proposes an improved randomness fuzzy vault scheme. In this scheme, first, a group of random numbers is used during locking to build a linear function, so that the length of the function is no longer related to the key information; second, a pseudorandom sequence generator is introduced, taking fingerprint eigenvalues as generating elements to produce the random arrays used to construct the fuzzy vault [4]; furthermore, the paper proposes the idea of key separation, storing the key information separately in two parts of the locking process, so that the key information is bound more firmly to the user identity, obtaining higher randomness and security.

1 The Randomness Fuzzy Vault Algorithm
As one of the classic algorithms in biometric encryption, the fuzzy vault algorithm uses its inherent fuzziness and disorder to combine the fuzziness of biometric features with the precision of cryptosystems, realizing biometric encryption. On the basis of the original fuzzy vault theory, this scheme introduces pseudorandom sequences to improve the original algorithm, which increases the randomness of the system, improves its security, and makes attacks more challenging.

Pseudorandom sequence generators can be divided into linear feedback shift registers and nonlinear feedback shift registers [5]. This scheme uses modified sequences, a kind of nonlinear feedback shift register whose sequence period is (2^n - 1), satisfying the security requirements. The basic structure of this type of feedback shift register is shown in Fig. 1, where x_i denotes the state of a register stage (x_i takes the value 0 or 1) and f(x_1, x_2, ..., x_n) is the feedback function of the shift register; the output of the register is therefore uniquely determined by its initial state and the feedback function.

In this scheme, we use the binarized fingerprint eigenvalues as generating elements, generate the corresponding random arrays with the pseudorandom sequence generator, and then use them to construct the linear function equation. The specific algorithm is given below.

Fig. 1. Diagram of an n-stage nonlinear feedback shift register
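As a minimal illustration of the register structure in Fig. 1 (a toy feedback function of my choosing, not the paper's modified-sequence construction), seeded from a binarized feature value:

def nlfsr_stream(state, nbits):
    # state: list of 0/1 register stages x1..xn (the initial fill)
    # Toy nonlinear feedback f(x1..xn) = x1 XOR (x2 AND x3); the whole
    # output stream is fixed by the initial state and f, as in the text.
    out = []
    s = list(state)
    for _ in range(nbits):
        out.append(s[-1])              # shift the last stage out
        fb = s[0] ^ (s[1] & s[2])      # nonlinear feedback function
        s = [fb] + s[:-1]              # shift and feed back
    return out

# Seed the register with bits of a (binarized) fingerprint feature value
print(nlfsr_stream([1, 0, 1, 1], 16))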
1.1 Locking Algorithm
All operations of the algorithm are carried out in the finite field F = GF(65537); this field is large enough to meet the security requirements of the vault system, and computation within the field also facilitates the construction of the vault. First, fingerprint recognition techniques are used to extract the minutiae of the fingerprint template, and the high-quality minutiae are taken to form a feature set of the form {(x_i, y_i) | i = 1, ..., n_T}, where (x, y) are the coordinates of a minutia. x and y are each linearly mapped into the range [0, 256] to form 8-bit strings and then concatenated into a 16-bit binary string (x|y), yielding the set M_T = {g_1, g_2, ..., g_T} with g_i = (x_i|y_i). The formal description of the algorithm is as follows:

(1) Using the values of the feature set M_T as generating elements, generate the random arrays [u_1, ..., u_T] through the pseudorandom sequence generator. Obtain the random key information S and decompose it, and generate the linear function P(u_i(g_i)) = (a_d u_i^d + ... + a_1 u_i + a_0) mod 65537.
(2) Encrypt the key information: c ← E_k(S').
(3) Generate the locked data set V: for i = 1, ..., n_T, V ← V ∪ {(g_i, P(u_i(g_i)))}; add noise (chaff) points with c_i ∉ M_T and d_i ≠ P(u_i(c_i)), forming C = {(c_i, d_i) | i = 1, ..., m, m >> n_T}; V ← V ∪ C.
(4) Output (V, c).

In the algorithm above, among the coefficients of P(u_i(g_i)), a_d is taken from a specified 16-bit segment of the key information S, the coefficients a_{d-1}, ..., a_1 are randomly generated 16-bit numbers following no discernible pattern, and a_0 is a cyclic check code generated with the CRC-16 encoding algorithm, used during unlocking to verify whether unlocking has succeeded. E is the Rijndael encryption algorithm, which encrypts the information S' with the private key k (S' is the true key information with its high 16 bits removed).

1.2 Unlocking Algorithm
To obtain the key information S, the user must provide a verification fingerprint and convert it into M_Q (of the same form as M_T). The elements of M_Q are then used to select matching points from V, namely points whose coordinate values x, y lie within a certain distance threshold, forming the set Q, and the corresponding array set u(.) is obtained through the pseudorandom sequence generator. Since solving the linear function P(u) = a_d u^d + ... + a_1 u + a_0 requires at least d + 1 pairs of values [8], the size of the set Q must also be at least d + 1. The linear function P(u) can then be reconstructed by solving, and it is checked whether a_0 equals CRC16(a_d‖...‖a_1); if they are equal, the verification fingerprint is considered legitimate. Then k = a_d‖...‖a_1 can be obtained and S' ← D_k(c), where D is the decryption algorithm corresponding to E; on successful unlocking the key information S = (a_d‖S') is recovered.

2 Experiments and Result Analysis
The experiments were carried out in MATLAB, based on the standard fingerprint database FVC2002-DB1, which contains 800 images of 100 fingerprints with an image size of 388x374 pixels. In the locking process, the image with the largest number of minutiae among each fingerprint's images was selected as the template, the 20 best-quality minutiae were used for encryption, the eigenvalues were binarized, and the pseudorandom sequence generator was then used to produce the random arrays used by the linear function to build the vault, while the random key information S was generated. In the unlocking and key-recovery process, the eigenvalues of the verification fingerprint were first obtained and the user name entered; the vault was then unlocked with them, the matching feature information was found, this information was used for decoding to obtain the coefficients of the linear function P(u), and the key information S was then decrypted and obtained. In the experiment, different lengths d of the linear function P(u) (d = 8, ..., 12) were taken to test and compare the performance of the algorithm; the results are shown in Table 1.

Fig. 2. Pseudorandom number generation experiment (fingerprint preprocessing, minutiae extraction, generated bit stream)
Fig. 3. Key information generation experiment
Fig. 4. Unlocking and key-recovery experiment
Table 1. Length of the linear function versus algorithm security: test results

From Table 1 it can be seen that as d increases, the FAR (false accept rate) gradually decreases and the FRR (false reject rate) gradually increases, showing that the algorithm achieves a high level of security. In this experiment, the system could be successfully decrypted … times; for a brute-force attack, the probability of success is …/C.
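Before the analysis, a toy sketch of the locking/unlocking core of Sections 1.1-1.2 over GF(65537) (illustrative names and parameters of my choosing; the real scheme derives the evaluation points from minutiae via the pseudorandom generator and splits the key as described above):

import random

P = 65537  # the prime field GF(65537) used by the scheme

def poly_eval(coeffs, u):
    # P(u) = a_d*u^d + ... + a_1*u + a_0 (coeffs given low-to-high), mod P
    return sum(a * pow(u, k, P) for k, a in enumerate(coeffs)) % P

def lock(features, coeffs, n_chaff=200):
    vault = [(u, poly_eval(coeffs, u)) for u in features]  # genuine points
    used = set(features)
    while len(vault) < len(features) + n_chaff:
        u, v = random.randrange(P), random.randrange(P)
        if u not in used and v != poly_eval(coeffs, u):  # chaff off the curve
            vault.append((u, v)); used.add(u)
    random.shuffle(vault)
    return vault

def interpolate_at_zero(points):
    # Lagrange interpolation mod P with d+1 matched points recovers
    # P(0) = a_0 (the CRC check word); the remaining coefficients follow
    # the same way by expanding the Lagrange basis polynomials.
    acc = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        acc = (acc + yi * num * pow(den, P - 2, P)) % P
    return acc

coeffs = [12345, 7, 99, 3]               # a_0..a_3: toy degree-3 polynomial
genuine = random.sample(range(1, P), 8)  # stand-ins for feature values u_i
vault = lock(genuine, coeffs)
pts = [(u, v) for u, v in vault if u in set(genuine)][:4]  # d+1 = 4 matches
print(interpolate_at_zero(pts) == coeffs[0])  # True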
InProc.AVBPA,RyeBrook,NY,2005,pp:160—170. [7]UludagU,PankantiS,andJainAK.FuzzyV aultfor Fingerprints[C]//ProeAudioandVideo—basedBiomet—ricPersonAuthentication2005,NYUSA,July20—22.2005:310—319.[8]PankantiS,PrabhakarS,andJainAK.OnTheIndi- viduMiwofFingerprints[J].IEEETrans.OilPAMI,2o02,24(8):1010—1025.。
Intelligent Computer and Applications, Vol. 13, No. 4, Apr. 2023
Article ID: 2095-2163(2023)04-0091-06  CLC number: TP391  Document code: A

Case Fact Tendency Category Prediction Based on AlBert-Tiny-DPCNN
Shi Junke
(School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China)

Abstract: In recent years, with the advancement of intelligent justice, judicial documents in China, as an important research object, have given rise to many tasks. However, most research on judicial documents is based on criminal cases, and research in the field of civil cases is lacking. Combining pre-trained word vectors, text classification models and other techniques, this paper studies the prediction of case-fact labels in the subdivision field of private lending, so as to provide, for a given case fact, judgment documents of the same category with reference value and to reduce the time relevant workers spend searching through large amounts of data. This paper proposes a classification model based on Albert-Tiny-DPCNN, adopts an attention mechanism and label smoothing regularization to improve the accuracy of the model, and verifies the validity of the model on the experimental data set.

Keywords: deep learning; judicial documents; text classification; pre-trained word vectors

About the author: Shi Junke (1997-), male, master's student; main research direction: text classification. Corresponding author: Shi Junke, Email: 158****0053@163.com. Received: 2022-05-20

0 Introduction
In recent years, judicial documents combined with NLP have given rise to many tasks, such as similar-case retrieval and information extraction, but most research is based on criminal cases and work on civil cases is lacking. Moreover, since the judgment of a civil case does not contain explicit phrases such as the plaintiff "winning" or "losing" the suit, but only wording such as "part of the plaintiff's claims is supported" or "the plaintiff's other claims are rejected", only a tendency classification of the case facts in judgment documents can be performed. In addition, private-lending cases involve different loan amounts and complicated lending relationships, the problem of "different judgments for the same type of case" may exist [1], and practitioners cannot easily find reference documents of the same category among massive documents. In view of this situation, based on a data set of first-instance private-lending judgments, this paper takes case facts as input and studies tendency-category prediction of case facts with deep learning techniques. The main work of this paper is as follows:
(1) A case-fact label classification model based on Albert-Tiny-DPCNN is proposed. The model uses Albert-Tiny for word embedding, uses the feature-extraction structure of DPCNN to capture text semantics, and trains with a class-weighted focal loss. Experimental results show that, compared with the baseline models, the model achieves higher accuracy and weighted F1, with accuracy reaching 79.65%.
(2) On the basis of the above model, an attention mechanism is adopted to strengthen important features and weaken useless features in the output vector of the feature-extraction layer, and label smoothing regularization is used to improve the generalization ability of the model; the accuracy rises to 81.22%, an increase of 1.97%.

1 Research Status
This paper reviews the recent status of classification models combining judicial documents with deep learning. Wang Wenguang et al. [2] proposed the legal judgment prediction model HAC based on HAN and DPCNN, which performed well in judgment prediction tasks such as prison-term prediction. Wang Yepai et al. [3] explored the effect of LSTM models of different depths in tendency classification of judgment results and found that a 3-layer LSTM achieved higher accuracy in their experiments, but LSTM alone can hardly highlight the important information of the text. Wang Ning et al. [4] used an attention-based BiGRU model to predict the tendency of judgment results; experiments showed that the model can make effective predictions to a certain extent, but there is still room to improve its accuracy. Since the end of 2018, pre-trained models have evolved from the initial Word2Vec and GloVe to the BERT series, ERNIE, XLNet, MPNet and other models [5-6]. At present, pre-trained language models are widely used in text classification tasks. Wang Limei et al. [7] compared the TextCNN, TextRNN, Transformer and BERT models and then optimized the embedding of the best model. Meng Lingci [8] studied result classification of criminal cases based on a Bert-LSTM model, obtaining category information by encoding sequences and fusing features.

2 Model Architecture
The main architecture of the judgment-result tendency classification model in this paper is shown in Fig. 1 and consists of three parts: the embedding layer converts the input text into a word-vector matrix, and this paper compares the effects of Word2Vec [9] and Albert-Tiny [10] on the task; the feature-extraction layer extracts semantic features from the word-vector matrix, and this paper compares the influence of three network structures, TextCNN [11], TextRCNN [12] and DPCNN [13], on the model; the classification layer compresses the number of input neurons to the number of classes. In addition, this paper compares the effects of the cross-entropy loss and the focal loss on the model.

Fig. 1. Basic structure of the model (Input → Embedding Layer (Word2Vec / Albert-Tiny) → Feature Extraction Layer (TextCNN / TextRCNN / DPCNN) → Dropout Layer → Classification Layer → Output)
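A minimal PyTorch-style skeleton of the Fig. 1 pipeline (hypothetical module and parameter names; the feature extractor is pluggable among the three compared variants):

import torch.nn as nn

class CaseFactClassifier(nn.Module):
    # Embedding -> feature extraction -> dropout -> classifier (Fig. 1)
    def __init__(self, embedder, extractor, feat_dim, num_classes=7):
        super().__init__()
        self.embedder = embedder      # Word2Vec lookup or Albert-Tiny encoder
        self.extractor = extractor    # TextCNN / TextRCNN / DPCNN module
        self.dropout = nn.Dropout(0.5)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, token_ids):
        x = self.embedder(token_ids)              # [batch, seq_len, emb_dim]
        x = self.extractor(x)                     # [batch, feat_dim]
        return self.classifier(self.dropout(x))  # [batch, num_classes] logits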
2.1 Data Collection and Processing
This paper uses the selenium tool to crawl first-instance private-lending judgments from the official China Judgements Online website (https://wenshu.court.gov.cn/) and builds the data set from them. Judgment documents follow a fairly standardized text format; by analyzing their structure, regular expressions are used to extract four parts from each document: the case facts (fact), the reasons for the judgment (reason), the legal basis (legal) and the judgment result (verdict). Then a label set of judgment-result tendency categories is constructed, the data are annotated by combining the contents of reason, legal and verdict, and the verdict is converted to a numeric label. The label set is given in Table 1.

Table 1. Collection of category tags
Judgment-result tendency category / Label
The court rejects the plaintiff's claims; the case acceptance fee is borne by the plaintiff. / 0
The court supports all of the plaintiff's claims; the case acceptance fee is borne by the plaintiff. / 1
The court supports all of the plaintiff's claims; the case acceptance fee is borne by the defendant. / 2
The court supports all of the plaintiff's claims; the case acceptance fee is shared by the plaintiff and the defendant. / 3
The court supports part of the plaintiff's claims; the case acceptance fee is borne by the plaintiff. / 4
The court supports part of the plaintiff's claims; the case acceptance fee is borne by the defendant. / 5
The court supports part of the plaintiff's claims; the case acceptance fee is shared by the plaintiff and the defendant. / 6

2.2 Embedding Layer
This paper uses Word2Vec word vectors and Albert word vectors as word embeddings in the experiments. Since text cannot be used directly as input, the text must be serialized before embedding: the word sequence is represented as a series of numbers that express the text semantics, each number being the index of a word in the vocabulary. A text sequence W of shape [L] is then converted by the embedding layer into a word-embedding matrix E of shape [L, D], where L is the sequence length, D is the word-vector dimension, w_i is the vocabulary index of the i-th word of the input sequence, and e(w_i) is the word vector for index w_i, as in Eq. (1):

W = [w_1, w_2, w_3, ..., w_L],  E = [e(w_1), e(w_2), e(w_3), ..., e(w_L)]    (1)

On this basis, the two embeddings are described as follows.
(1) Word2Vec. Word2Vec [9] projects all words into a K-dimensional vector space, reducing the processing of text content to vector operations in that space. This paper uses the jieba tool to segment the case facts and removes irrelevant characters and common stop words. The segmented, stop-word-free case facts form the corpus, and the skip-gram algorithm of the gensim open-source Word2Vec tool is used to train the W2V word-vector table. Then the segmented case-fact word sequences are serialized [4] with a unified text length of 150. Finally, static word embedding is realized with the text sequences and the W2V word-vector table.
(2) Albert-Tiny. This paper uses Albert-Tiny for downstream fine-tuning; to keep the input text identical, the model input is the segmented case facts with stop words removed. First, a Tokenizer inheriting from the transformers BertTokenizer performs text serialization. The serialized length is unified to 300: data longer than 298 tokens are truncated, the class symbol [CLS] is added at the head of the sequence, the end symbol [SEP] is added at the tail, and shorter sequences are padded. Finally, the transformers AlbertModel is fine-tuned on the downstream task to obtain the required Albert-Tiny word vectors.

2.3 Feature Extraction Layer
At present there are many and varied text classification models; this paper mainly explores the performance of three feature-extraction structures, TextCNN [11], TextRCNN [12] and DPCNN [13], in the model. TextCNN uses convolution kernels of different sizes to capture feature information between adjacent words in the word-vector matrix of the input text. Because kernel sizes are usually not large, TextCNN cannot capture long-distance information; it suits short and medium texts rather than long texts. TextRCNN uses a bidirectional long short-term memory network to obtain the context of the input text and selects the most important features through max pooling, so it can capture long-distance dependencies. DPCNN alleviates the long-distance dependency problem by enlarging the receptive field: it mainly uses region convolution to increase the input length the model can encode, and uses residual structures to deepen the network while alleviating gradient vanishing.

2.4 Classification Layer
The classification layer uses a fully connected network structure to extract relationships among features. Its input is the text semantic vector output by the feature-extraction layer, and its output is the probability of each class, as shown in Fig. 2. In Fig. 2, the first fully connected layer has shape 1*n, the second has shape 1*(n/2), and the output layer (the layer producing the final result) outputs a classification result of shape 1*C, where C is the number of class labels; the class with the highest probability after this output vector passes the softmax classifier is the predicted class. The two fully connected layers other than the output layer use the LeakyReLU activation function. In addition, whether to apply batch normalization to the fully connected outputs can be decided according to the training effect.

Fig. 2. Classification layer structure

2.5 Loss Functions
The label classification task in this paper is a multi-class problem, so the last layer of the model should be normalized with the softmax function, and when computing the loss between the true labels and the predictions, the class with the highest softmax probability is taken as the predicted label. This paper compares the cross-entropy loss (CrossEntropyLoss) of the PyTorch framework with a custom focal loss (FocalLoss) during model training, and adds class weights w during training to alleviate the class-imbalance problem. Since the above loss functions internally apply the softmax classifier, the last layer of the model in this section needs no further softmax normalization. The class-weighted cross-entropy loss is given in Eq. (2):

L_CE = (1/N) Σ_{n=1}^{N} L_{CE,n},
L_{CE,n}(X_n, Y_n) = - Σ_{i=1}^{C} w_i · y_{n,i} · log( exp(x_{n,i}) / Σ_{c=1}^{C} exp(x_{n,c}) )    (2)

where N is the batch size; X_n is the prediction for the n-th sample in the batch and x_{n,i} is the predicted score of class i for the n-th sample; Y_n is the true label of the n-th sample and y_{n,i} is the true-label value for class i (1 for exactly one class, 0 for the rest); C is the number of classes; L_CE is the mean cross-entropy loss over the N samples; w_i is the weight of class i. Next, the focal loss [14] is given in Eq. (3):

L_F = (1/N) Σ_{n=1}^{N} L_{F,n},  L_{F,n}(p_n) = - α_t (1 - p_n)^γ log p_n,
p_n(X_n, Y_n) = y_{n,i} · exp(x_{n,i}) / Σ_{c=1}^{C} exp(x_{n,c}) + (1 - y_{n,i}) · (1 - exp(x_{n,i}) / Σ_{c=1}^{C} exp(x_{n,c}))    (3)

where N, X_n, x_{n,i}, Y_n, y_{n,i} and C have the same meanings as in Eq. (2); L_F is the mean focal loss over the N samples; α_t is the weighting factor; γ ∈ [0, 5] is the focusing parameter. In this paper, the hyperparameter α_t is replaced by the class-weight array to alleviate class imbalance.
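A standard class-weighted focal-loss sketch in PyTorch, matching the spirit of Eq. (3) (a common formulation; the paper's exact indexing may differ):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, class_weights, gamma=2.0):
    # logits: [N, C]; targets: [N] integer class labels; class_weights: [C]
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    alpha_t = class_weights[targets]   # per-sample alpha taken from class weights
    return (-alpha_t * (1.0 - pt) ** gamma * log_pt).mean()

logits = torch.randn(4, 7)
targets = torch.tensor([2, 5, 5, 0])
print(focal_loss(logits, targets, torch.ones(7)))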
,wC]㊀wi=1/log(1.02+ti)(4)其中,wi为第i个类别的权重值;C为类别数;ti为第i个类别样本数占总样本数的比例㊂为保持对比实验的参数一致性,本文在模型训练时的参数配置详见表2㊂表2㊀对比模型实验中的参数表Tab.2㊀Parametersinthecomparativemodelexperiment参数名参数值参数名参数值词向量尺寸(embeddingsize)256优化器(optimizer)AdamW单次批量(batchsize)512失活系数(dropout)0.5标签平滑参数(labelsmoothing)0学习率(learningrate)1e-43.3㊀评价指标本文采用准确率(accuracy)㊁加权F1值(Weighted-F1)作为结果倾向性预测模型的评价指标㊂准确率计算方法公式具体如下:accuracy=Nright/(Nright+Nwrong)(5)㊀㊀其中,Nright,Nwrong分别表示预测正确㊁错误的数量㊂加权F1值可以用作多分类问题的评价指标之一,可以减轻数据不平衡带来的影响㊂加权F1值考虑了样本不均衡的原因,在计算查准率(Precisionweighted)和召回率(Recallweighted)时,需要各个类别的查准率(precision)和召回率(recall)乘以该类在总样本中的占比来求和㊂而加权F1值为Precisionweighted和Recallweighted的调和平均数,详见式(6) 式(10):Precisioni=TPiTPi+FPi(6)Precisionweighted=ðLi=1Precisioni∗wi()L(7)Recalli=TPiTPi+FNi(8)Recallweighted=ðLi=1Recalli∗wi()L(9)F1weighted=2∗Precisionweighted∗RecallweightedPrecisionweighted+Recallweighted(10)㊀㊀其中,TPi表示真实类别为第i类㊁且被模型判定为第i类的样本数;FPi表示真实类别非第i类㊁且被模型判定为第i类的样本数;FNi表示真实类别为第i类㊁且被模型判定为非第i类的样本数;L表示类别数量;wi表示第i类别在总样本中的占比㊂3.4㊀对照实验本文比较了不同的词嵌入方式㊁特征提取方式㊁损失函数相互组合的模型在训练时的表现效果,结果见表3㊂表3中,W2V表示Word2Vec嵌入方式,TextCNN㊁TextRCNN㊁DPCNN为特征提取方式,CEL表示交叉熵损失函数㊁FL表示焦点损失函数㊂从表3的实验数据可知,与交叉熵损失函数相比,焦点损失函数能够在一定程度上提高模型准确率㊂相比于Word2Vec静态词向量,AlBert-Tiny词向量的嵌入效果更好,能更好地学习词之间的关系,使得特征提取层输出的特征向量包含更多的文本语义信息㊂特征提取层使用的网络结构中,DPCNN的效果相对较好,能更准确地捕获文本特征㊂49智㊀能㊀计㊀算㊀机㊀与㊀应㊀用㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第13卷㊀表3㊀对比模型的准确率与加权F1值Tab.3㊀Comparingtheaccuracyandtheweighted-F1valueofthemodel模型CELaccuracyweighted-F1FLaccuracyweighted-F1W2V+TextCNN(baseline)0.71430.71660.74950.7307W2V+TextRCNN0.63550.64510.69780.6405W2V+DPCNN0.74190.74530.77540.7575AlBert-Tiny+TextCNN0.73580.70150.79200.7683AlBert-Tiny+TextRCNN0.70420.67650.72610.6934AlBert-Tiny+DPCNN0.78670.75350.79650.76073.5㊀模块与参数实验上述实验中,模型AlBert-Tiny+DPCNN在对比实验中的准确率最高,基于该模型,本文进行了变体测试与参数实验㊂变体测试主要包括2个方向:对特征提取层输出的文本特征向量进行注意力加权㊁标签平滑归一化㊂接下来,可做阐释表述如下㊂(1)Albert-Tiny+DPCNN:对比实验中表现最优的模型㊂(2)Albert-Tiny+DPCNN+LSR:LSR表示标签平滑归一化[15],该技术通过平滑归一化方法扰动真实标签来解决过度拟合的问题㊂对标签的具体处理详见式(11):yLSk=yk(1-α)+α/K(11)㊀㊀其中,yLSk为第k类标签平滑处理后的值;yk为第k类标签的one-hot值;K为类别数;α为超参数㊂(3)Albert-Tiny+DPCNN+Att:注意力原理不依赖于任何框架,本文搭建了2层全连接层的Att注意力模块,通过该模块来捕获重要特征,减小无用特征,提升特征提取层的文本语义表征能力㊂具体如图3所示㊂分类层随机失活层s o f t m a x图3㊀特征提取层使用的注意力模块Fig.3㊀Attentionmodule㊀㊀(4)Albert-Tiny+DPCNN+Att+LSR:结合了(2)㊁(3)中的Att与LSR两种方式㊂㊀㊀表4为变体模型的对比实验结果,从中可以得到以下结论:(1)在特征提取层中使用注意力机制在一定程度上可以提高模型准确率,增强特征提取层的文本语义提取效果㊂(2)标签平滑正则化在一定程度上增强了模型泛化能力,加快了模型的收敛㊂图4(a)㊁图4(b)分别为模型训练过程中准确率与加权F1的变化曲线,明显可见使用Albert-Tiny进行词嵌入时,在epoch等于4或5时能达到较好效果㊂表4㊀Albert-Tiny+DPCNN及其变体的准确率与加权F1值Tab.4㊀Accuracyandweighted-F1valueofAlbert-Tiny+DPCNNanditsvariants模型accuracyweighted-F1Albert-Tiny+DPCNN0.79650.7607Albert-Tiny+DPCNN+Att0.79940.7643Albert-Tiny+DPCNN+LSR0.80420.7657Albert-Tiny+DPCNN+Att+LSR0.81220.77340.800.750.700.650.600.55246810A l b e r t -T i n y +D P C N N A lB e r t -T i n y +D PC N N +A t t A l B e r t -T i n y +D P C N N +L S R A l B e r t -T i n y +D P C N N +A t t +L S R e p o c ha c c u r a c y(a)accuracy0.750.700.650.600.550.500.450.40246810A l b e r t -T i n y +D P C N N A lB e r t -T i n y +D PC N N +A t t A l B e r t -T i n y +D P C N N +L S R A l B e r t -T i n y +D P C N N +A t t +L S R e p o c hw e i g h t e d -F 1(b)weighted-F1图4㊀准确率与加权F1值变化曲线Fig.4㊀Changecurveofaccuracyandweighted-F1value㊀㊀图5(a)㊁图5(b)分别记录了Albert-Tiny+59第4期施君可:基于AlBert-Tiny-DPCNN的案件事实倾向性类别预测DPCNN+Att+LSR的准确率Acc与加权F1值(Weighted-F1)随标签平滑归一化参数alpha㊁随机失活层的失活系数beta的变化过程㊂标签平滑归一化参数alpha能够用来提高模型的泛化能力,从图5(a)可见,当alpha在0.2附近时,模型的表现效果较好㊂随机失活系数beta是防止过拟合的优化参数,并从图5(b)中可知,当alpha=0.2㊁beta值在0.5附近时,模型的表现效果较好㊂0.810.800.790.780.770.760.10.20.30.40.5αya c c u r a c y w e i g h t e d -F 
1(a)alphaβ0.800.780.760.740.720.700.680.660.20.40.60.8ya c c u r a c y w e i g h t e d -F 1(b)beta图5㊀准确率与加权F1值随alpha、beta变化的曲线Fig.5㊀Thecurveofaccuracyandweighted-F1valueschangingwithalphaandbeta4㊀结束语本文提出了基于Albert-Tiny-DPCNN的分类模型,并将其应用于案件事实标签预测任务㊂该模型以案件事实为输入,以判决结果的倾向性类别为真实标签,通过对案件事实进行单标签多分类来实现类别预测㊂基于民间借贷一审判决书数据集的实验结果证明,本文提出的模型准确率能够达到79.65%,高于其他对照组㊂此外,本文使用注意力机制加强了重要特征信息,减小了噪声与无用特征的影响,并用标签平滑归一化提升模型泛化能力,模型的准确率达到了81.22%,提升了1.97%㊂虽然,本文提出的模型在实验数据集上拥有较好的表现,但若想在实际应用场景中使用依旧存在以下改进之处㊂第一,深度学习中的网络模型需要大量数据的支撑,本文数据集的样本数量亟待扩充,模型泛化能力有待加强;第二,目前,NLP领域中除Albert外还有ELMO㊁MPNet等预训练语言模型,因此下一步工作可以尝试深入探讨其他模型,另外采用多特征融合或结合传统机器学习的方法可能取得不错的效果㊂参考文献[1]吴叶乾,韩青松.对 同案同判 的全面理解及其实现[J].浙江理工大学学报(社会科学版),2020,44(01):75-80.[2]王文广,陈运文,蔡华,等.基于混合深度神经网络模型的司法文书智能化处理[J].清华大学学报(自然科学版),2019,59(07):505-511.[3]王业沛,宋梦姣,王譞,赵志宏.基于深度学习的判决结果倾向性分析[J].计算机应用研究,2019,36(02):335-338.[4]王宁,李世林,刘堂亮,赵伟.基于注意力机制的BiGRU判决结果倾向性分析[J].计算机系统应用,2019,28(03):191-195.[5]CAMACHO-COLLADOSJ,PILEHVARMT.Fromwordtosenseembeddings:Asurveyonvectorrepresentationsofmeaning[J].JournalofArtificialIntelligenceResearch,2018,63:743-788.[6]LIQian,PENGHao,LIJianxin,etal.Asurveyontextclassification:Fromshallowtodeeplearning[J].arXivpreprintarXiv:2008.00364,2020.[7]王立梅,朱旭光,汪德嘉,等.基于深度学习的民事案件判决结果分类方法研究[J].计算机科学,2021,48(08):80-85.[8]孟令慈.基于Bert-LSTM模型的裁判文书分类的研究[D].华东交通大学,2021.[9]MIKOLOVT,CHENKai,CORRADOG,etal.Efficientestimationofwordrepresentationsinvectorspace[C]//InternationalConferenceofLearningRepresentation.Scottsdale,Arizona:IEEE,2013:1-12.[10]LANZhenzhong,CHENMingda,GOODMANS,etal.Albert:Alitebertforself-supervisedlearningoflanguagerepresentations[J].arXivpreprintarXiv:1909.11942,2019.[11]KIMY.ConvolutionalNeuralNetworksforsentenceclassification[J].arXivpreprintarXiv:1408.5882,2014.[12]LAISiwei,XULiheng,LIUKang,etal.Recurrentconvolutionalneuralnetworksfortextclassification[C]//Twenty-ninthAAAIConferenceonArtificialIntelligence.Austin,Texas,USA:AAAI,2015:2267-2273.[13]JOHNSONR,ZHANGTong.Deeppyramidconvolutionalneuralnetworksfortextcategorization[C]//Proceedingsofthe55thAnnualMeetingoftheAssociationforComputationalLinguistics(Volume1:LongPapers).Vancouver:ACL,2017:562-570.[14]LINTY,GOYALP,GIRSHICKR,etal.Focallossfordenseobjectdetection[C]//ProceedingsoftheIEEEInternationalConferenceonComputerVision.Venice:IEEE,2017:2980-2988.[15]MÜLLERR,KORNBLITHS,HINTONGE.Whendoeslabelsmoothinghelp?[J].arXivpreprintarXiv:1906.02629,2019.69智㊀能㊀计㊀算㊀机㊀与㊀应㊀用㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀㊀第13卷㊀。
华中科技大学硕士学位论文 (Master's Thesis, Huazhong University of Science and Technology)

Abstract
In recent years, malware, including worms, Trojans and botnets, has constantly threatened Internet security. With the growing popularity of Web 2.0 and cloud computing, more and more applications provide web-based services, and there has been a trend towards the browser as an operating system. Exploiting browser and plug-in vulnerabilities has replaced exploiting vulnerabilities in operating systems and applications; web malicious code has become the main vehicle for the attack and spread of malware and an important part of the underground economy. A malicious web page is a page that contains malicious content which spreads viruses, Trojans, etc. The included malicious content, commonly called a web Trojan, is essentially not a Trojan: it is malicious code spread through web pages, generally written in JavaScript, VBScript or another scripting language, and usually obfuscated in various ways to escape detection. By exploiting vulnerabilities in browsers or plug-ins, webpage malicious code can download and run malware such as adware, Trojans and viruses. Users can be attacked even when they visit a seemingly benign website, since a benign web page may have been injected with malicious code. Various tactics, for example encryption and polymorphism, are used in order to evade detection by AV scanners, and traditional detection systems have a high false-negative rate. Therefore, more and more attackers utilize the Internet to spread malware. Detection techniques are usually classified into static detection (based on page content or URL), dynamic detection (based on browsing behavior), and a combination of both. The traditional static detection method is simple, but it struggles to deal with code obfuscation, which leads to high false-negative and false-positive rates. Therefore, many existing systems use the dynamic detection approach, that is, they run the scripts of a web page in a real browser in a virtual-machine environment and monitor the execution for malicious activity. While such a system is quite accurate, the process is costly, requiring seconds for a single page without optimization, and thus cannot be performed on a large set of web pages.

In this paper, a light-weight detection system is proposed. The system analyzes pages, extracts features, and automatically derives detection models using machine-learning techniques. In addition, we use a JavaScript virtual machine to further analyze obfuscated code, as a complement to the static analysis, which detects on the source code of a web page. Since most of the analysis only uses the source of the page, without the need for execution, it consumes few resources and can be applied to large-scale web-page detection, for example, integrated with a search engine. We analyze the characteristics of malicious web pages systematically and present important features for machine learning. In the end, we describe the system's design and implementation and demonstrate the effectiveness of the system with experimental results.

Keywords: web malicious code, drive-by download, static detection, dynamic detection, machine learning.

独创性声明 (Declaration of Originality): I declare that the thesis submitted is the product of my own research work carried out under the guidance of my supervisor, together with the research results thus obtained.
Exploiting K-Distance Signature for Boolean Matching and G-Symmetry Detection*

Kuo-Hua Wang
Dept. of Computer Science and Information Engineering, Fu Jen Catholic University
Hsinchuang City, Taipei County 24205, Taiwan, R.O.C.
khwang@.tw

* This work was supported by the National Science Council of Taiwan under Grant NSC94-2215-E-030-006.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC 2006, July 24-28, 2006, San Francisco, California, USA.
Copyright 2006 ACM 1-59593-381-6/06/0007...$5.00.

ABSTRACT
In this paper, we present the concept of the k-distance signature, which generalizes many existing signatures. Exploiting the k-distance signature, we propose an Algorithm for Input Differentiation (AID) that is very powerful for distinguishing the inputs of Boolean functions. Moreover, based on AID, we propose a heuristic method to detect the G-symmetry of Boolean functions and a Boolean matching algorithm. The experimental results show that our methods are not only effective but also very efficient for Boolean matching and G-symmetry detection.

Categories and Subject Descriptors: B.6.3 [Hardware] Logic Design: Design Aids - Automatic Synthesis.

General Terms: Algorithms, Design, Verification.

1. INTRODUCTION
Boolean matching is a technique to check whether two functions are equivalent under input permutation and input/output phase assignment (the so-called NPN-class). It can be applied in technology mapping and logic verification. In the early 90's, the work in [1] opened the door to research on Boolean matching for technology mapping. Over the past decade, many studies have been devoted to Boolean matching, and various Boolean matching methods have been proposed [1]-[6]. Among these methods, exploiting signatures and computing canonical forms of Boolean functions are the most successful approaches. The idea behind using signatures for Boolean matching is simple: various signatures can be defined to characterize the input variables of a Boolean function, so that variables with different signatures can be distinguished from each other and many infeasible permutations and phase assignments can be pruned quickly. The major issue of the signature-based approach is the aliasing problem, i.e., two or more input variables have the same signature and therefore cannot be distinguished. In such a case, an exhaustive search is still required to find the input mapping.
The paper [8] showed that signatures have inherent limitations in distinguishing all input variables of Boolean functions with G-symmetry. To the best of our knowledge, little research has been devoted to detecting the G-symmetry of Boolean functions [9][10].

Yet another group of research is based on computing canonical forms of Boolean functions [5][6]. In this approach, each NPN-class of functions is represented by a unique canonical function; if two functions can be transformed to the same canonical function, they are matched. In a recent paper [6], a new approach combining signatures and canonical forms of Boolean functions was proposed. That method is based on generalized signatures (signatures of one or more variables) and a canonicity-producing (CP) transformation for Boolean matching under input permutation and input phase assignment.

In this paper, we consider the permutation-independent Boolean matching problem. A new k-distance signature is presented and correlated with many previously proposed signatures. Based on the k-distance signature, we propose a very effective Algorithm for Input Differentiation (AID) to distinguish the inputs of Boolean functions. With respect to (w.r.t.) the aliasing problem of signatures, we also present a heuristic method to detect the G-symmetry of Boolean functions. Combining AID, G-symmetry detection, and transformations of Ordered Binary Decision Diagrams (OBDDs), we then propose a Boolean matching algorithm to match two Boolean functions.

The remainder of this paper is organized as follows. Section 2 briefly reviews Boolean matching and some previously proposed signatures. We then introduce the k-distance signature and propose a heuristic to detect the G-symmetry of Boolean functions in Section 3 and Section 4, respectively.
Our AID and Boolean matching algorithms are described in Section 5. Finally, the experimental results and the conclusion are given in Section 6 and Section 7, respectively.

2. PRELIMINARIES
2.1 Boolean Matching and Symmetry
Boolean matching checks the equivalence of two functions under input permutation and input/output phase assignment. Let B^n denote the multi-dimensional space spanned by n binary-valued Boolean variables. A minterm is a point in the Boolean space B^n. A literal is a variable x_i (positive literal) or its complemented form ¯x_i (negative literal). The weight of a minterm m, denoted W(m), is the number of positive literals x_i involved in m. The cofactor of f(x_1,...,x_i,...,x_n) w.r.t. the literal x_i (¯x_i) is f_{x_i} = f(x_1,...,1,...,x_n) (f_{¯x_i} = f(x_1,...,0,...,x_n)). The satisfy count of f, denoted |f|, is the number of minterms for which the function value is 1.
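As a concrete illustration of these definitions, the following Python sketch encodes a Boolean function as a set of minterm integers and computes the weight, cofactor, and satisfy count. This is an illustrative toy only, not the paper's implementation (which operates on OBDDs); the encoding convention, with variable x_{i+1} stored in bit i of a minterm, is our own assumption.

    # Hypothetical truth-table encoding: a function f over n variables is a
    # set of n-bit minterm integers; bit i holds the value of x_{i+1}.

    def weight(minterm: int) -> int:
        """W(m): the number of positive literals in minterm m."""
        return bin(minterm).count("1")

    def cofactor(f: set, var: int, phase: int) -> set:
        """f restricted to x_{var+1} = phase, re-encoded over the remaining
        n-1 variables by dropping bit position `var`."""
        low = (1 << var) - 1                      # mask of bits below `var`
        return {(m & low) | ((m >> (var + 1)) << var)
                for m in f if (m >> var) & 1 == phase}

    def satisfy_count(f: set) -> int:
        """|f|: the number of minterms on which f evaluates to 1."""
        return len(f)

    # f(x1, x2) = x1 x2 + ~x1 ~x2 has minterms {00, 11}.
    f = {0b00, 0b11}
    assert weight(0b11) == 2
    assert satisfy_count(cofactor(f, 0, 1)) == 1  # |f_{x1}| = 1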
).Definition 2.6.Consider a Boolean function f (X ).Let X i =[x i 1,x i 2,···,x i k ]and X j =[x j 1,x j 2,···,x j k ]be twodisjoint ordered subsets of X with k ≥1.f is group sym-metric (g-symmetric )w.r.t.to X i and X j if and only if f is invariant under the swapping of x i m and x j m for m =1,2,···,k .We call (X i ,X j )is a g-symmetric pair of f .Example 2.1.Let f =x 1(x 2+¯x 3)+x 4(x 5+¯x 6).It is obvious that f is g-symmetric w.r.t.ordered sets [x 1,x 2,x 3]and [x 4,x 5,x 6].By Def.2.6,the traditional symmetric pair can be viewed as a special case of g-symmetry pair with k = 1.It is easy to prove that g-symmetric pair is an equivalence re-lation.That is,a larger g-symmetric sets can be derived from g-symmetric pairs directly.In addition,h-symmetry is a special case of g-symmetry,i.e.all swapping groups are maximal symmetric sets.Definition 2.7.Let X i and X j be two maximal sym-metric sets of f (X ).f is h-symmetric w.r.t.X i and X j if and only if f is g-symmetric w.r.t.X i and X j .Example 2.2.Consider the function f =x 1x 2x 3+x 4x 5x 6+x 7x 8x 9.There are three maximal symmetric sets X 1={x 1,x 2,x 3},X 2={x 4,x 5,x 6},and X 3={x 7,x 8,x 9}.It is clear that f is h-symmetric w.r.t.the permutation of X 1,X 2,and X 3.Definition 2.8.Consider a function f (X )and an or-dered subset X i of X .Without loose of generality,let X i =[x 1,x 2,···,x k ]which is not a symmetric set of f for k ≥3.f is rotational symmetric (r-symmetric )w.r.t.X i if and only if f is invariant under the permutations ψ=(x m ,x m +1,···,x k ,x 1,···,x m −1)for m =1,2,···,k .Example 2.3.Let f =x 1x 2+x 2x 3+x 3x 4+x 4x 5+x 5x 1.f is invariant under rotating the variables of f w.r.t.the ordered set [x 2,x 3,x 4,x 5,x 1].3.K-DISTANCE SIGNATURES3.1Five types of K-Distance SignatureLet p =(m a ,m b )be a pair of minterms w.r.t.the same input set.The mark of p ,denoted as Mark (p ),is the product by anding the literals involved in m a which are different from the literals involved in m b .For example,consider the minterm pair p =(m a ,m b )w.r.t.the in-put set X ={x 1,x 2,x 3,x 4,x 5},where m a =01100and m b =10000.Then Mark (p )=¯x 1x 2x 3.Given two functions f and g w.r.t.the same input set,consider any minterm pair p =(m a ,m b )with mark M i ,where m a and m b are from f and g ,respectively.Assume the number of literals in the mark M i is k ,denoted as |M i |.Let S k M i and |S kM i|=w i denote the set involving all minterm pairs with mark M i and the number of minterm pairs in S k M i,respectively.Then the tuple (w i ,M i )can characterize a relation between f and g .If we further classify (w i ,M i )’s in terms of the number of literals involved in M i ,then each group can be viewed as a signature showing the k-disjoint situations between f and g .In the following,we define the k-distance signature and give a simple example.Definition 3.1.Consider two functions f (X )and g (X ).Given an integer k >0,the k-distance signature of f w.r.t.to g ,denoted as ksig (f,g,k ),can be formally defined asksig (f,g,k )={(w i ,M i )||S kM i|=w i ,|M i |=k }(1),where S kM i={p =(m a ,m b )|Mark (p )=M i ,∀m a ∈f,∀m b ∈g }.Example 3.1.Consider two functions f =x 1¯x 2x 3=P m (5)and g =(¯x 1+x 2)x 3=Pm (1,3,7).There are two minterm pairs (m 5,m 1)and (m 5,m 7)with mark x 1and ¯x2,respectively.So ksig (f,g,1)={(1,x 1),(1,¯x 2)}.For k =2,ksig (f,g,k =2)={(1,x 1¯x 2)}because of one minterm pair (m 5,m 3)with mark x 1¯x 2.Instead of computing Equation (1)directly,given an input x i ,ksig (f,g,k )can be computed in a recursive way as 
be-low.ksig (f,g,k )=ksig (f ¯x i ,g ¯x i ,k )Sksig (f x i ,g x i ,k )S ¯x i ·ksig (f ¯x i ,g x i ,k −1)S x i ·ksig (f x i ,g ¯x i ,k −1)(2)By Def.3.1,ksig (f,g,k )can be viewed as a 2k×C n kinteger matrix M with rows and columns corresponding to different input phase assignments and input combinations of k inputs from X ,respectively.Example 3.2.Consider f (x 1,x 2,x 3,x 4)=P(0,1,3,6,8,13,14,15).ksig (f,f,k =1)is a 2×C 41integer matrix shown below:x 1x 2x 3x 401»20222022–W.r.t.k =2,it is a 22×C 42integer matrix shown below:x 1x 2x 1x 3x 1x 4x 2x 3x 2x 4x 3x400011011264201211000001000001201211375The following definition shows how to extract the signa-ture from ksig (f,g,k )w.r.t.an input x i ∈X .Definition 3.2.Consider a target function f (X )and an anchor function g (X ).Let x i be an input of X .Given an integer k >0,the k-distance signature of f w.r.t.x i and g isan integer vector [v 1,···,v j ,···,v s ]with s =2k −1×C n −1k −1,where v j =|f x i m a |and m a ∈Bk −1which is expanded by any (k −1)-input subset of X without involving x i .Example 3.3.Consider the same function f (X )shown in Example 3.2and an input x 1∈X .Given k =2,the k-distance signature of f w.r.t.x 1and anchor function f is a six-integer vector shown below:[|f x 1¯x 2|,f x 1¯x 3|,f x 1¯x 4|,|f x 1x 2|,f x 1x 3|,f x 1x 4|]=[0,0,0,2,0,1]According to Equation (2),the time complexity of comput-ing k-distance signature is O (p k +1×q k +1),where p and q are the sizes of OBDD’s representing f and g ,respec-tively.It is time consuming to compute k-distance signa-ture w.r.t.a large number k .Therefore,we propose five k-distance signatures w.r.t.k =1and use ksig (f,g )to de-note ksig (f,g,k =1)for short.It is clear that ksig (f,g )is a 2n -integer vector,where each tuple corresponds to the number of pairs of minterms which only involve one different literal x i or ¯x i .•Type 0:g =1In fact,type-0k-distance signature is equivalent to the cofactor satisfy-count signature.However,it can com-pute the cofactor-satisfy-count signatures w.r.t.all in-puts simultaneously.•Type 1:g 1=f and g 2=¯f•Type 2:g =S ni for i =0,1,···,nS ni is a totally symmetric function involving all minterms with weight i .•Type 3:g =x i for i =1,···,n01111000x 1x 2x 3x 4(a) Karnaugh Map(b) BDDx 21x 3x 4x 100011110011110000001110B A ABA101110Figure 1:Karnaugh map and BDD of f =x 1x 2+x 3x 4.Below,we will describe a new type of k-distance signa-ture based on communication complexity.Suppose that X bis the set of inputs (or referred as bounded set )which had been distinguished by some signatures.The communication complexity of f w.r.t.X b is defined as the number of dis-tinct cofactors of f w.r.t.m i ,where m i is a minterm in the Boolean space spanned by X b .For example,consider f =x 1x 2+x 3x 4and bound set X b ={x 1,x 2}.Fig.1(a)and 1(b)show the Karnaugh map of f w.r.t.the partition X b /X −X b and OBDD of f with order x 1<x 2<x 3<x 4,respectively.W.r.t.the partition X b /X −X b ,there are two distinct cofactors A and B corresponding to row patterns 0010and 1111,respectively.So the communication com-plexity is 2.Take a look at this OBDD.Under the dotted cut-line,it is easy to identify two sub-BDD’s which corre-spond to cofactors A and B ,respectively.W.r.t.such an input partition,f can be decomposed as f =b 1·u 1+b 2·u 2,where b 1=¯x 1+¯x 2,b 2=x 1x 2,u 1=x 3x 4and u 2=1.Given a Boolean function f (X )and a bounded set X b including all distinguished inputs,let m be the communica-tion complexity of f w.r.t.X b .Then f can be decomposed as f =P mi =1b i (X b 
)·u i (X −X b ),where f m j =u i for each minterm m j ∈b i .Type 4k-distance signature is thus de-fined as below.•Type 4:g =b i for i =1,···,mTheorem 3.1.Type 0k-distance signature is equivalent to the cofactor-satisfy-count signature .<Proof :>As we state for k-distance signature w.r.t.k =1,it is an integer vector of 2n tuples corresponding to literals x i ’s or ¯x i ’s.Consider any minterm m a ∈f .Let m a =Q ni =1l i ,where l i is x i or ¯xi .For each literal l i ∈m a ,since g is a tautology,it is clear that there exists a minterm m b ∈g ,where Mark (m a ,m b )=l i .So the tuple on l i in type 0k-distance signature will increase accordingly.After checking all minterms in f in such a way,then each tuple is equivalent to the cofactor-satisfy-count signature w.r.t.literal x i or ¯x i .It is proved.Theorem 3.2.Type 1k-distance signature is equivalent to the partner signature.<Proof :>By Def.2.2,the partner signature is a 4-tuple integer vector [p 0,p 1,p 2,p 3],where p 0,p 1,p 2and p 3are |¯f ¯x i ·¯f x i |,|¯f ¯x i ·f x i |,|f ¯x i ·¯f x i |,and |f ¯x i ·f x i |,respectively.Consider any input x i in X .W.r.t.g =f ,The tuple of ksig (f,f )corresponding to x i is equal to p 3.The same ob-servation can be applied to g =¯f.The tuples in ksig (f,¯f )corresponding to x i and ¯x i are equal to p 1and p 2,respec-tively.Moreover,p 0can be computed as 2n −1−(p 1+p 2+p 3).It is proved.Theorem 3.3.Type 2k-distance signature is equivalent to the cofactor-breakup signature.<Proof :>By Def.2.3,the cofactor-breakup signature of f w.r.t.x i shows the distribution of minterms involving the literal x i in terms of minterm weight.It is an integer vector [b 1,···,b j ,···,b n ],where b j is the number of minterms withweight j .Consider g =S nj −1.Consider the domain x i =1.For each minterm m a ∈f with weight j ,there must exist a minterm m b ∈g and Mark (m a ,m b )=x i .So b j can in-crease 1because of this minterm pair.W.r.t.different S n j’s,all b j ’s can be computed in the same way.It is proved.Theorem 3.4.Type 3k-distance signature is equivalent to the cross signature.<Proof :>It can be proved directly by their definitions.Definition 3.3.Consider a Boolean function f (X )and a signature type s .The distinct factor of f w.r.t.s ,denoted as d f (f,s ),is defined as the ratio of the number of inputs distinguished by s to the number of inputs in X .Let us consider a Boolean function f (X )with symmet-ric inputs first.Without loose of generality,let MSS ={x 1,x 2,···,x m }be a maximal symmetric set of f .In our AID algorithm,x 1will be used to represent MSS .While computing the distinct factor of f w.r.t.any signature type,the remaining m −1inputs can be viewed as the distin-guished inputs of f .Example 3.4.Let f =x 1x 2+x 3x 4+x 5.Type 0k-distance signature s 0of f is shown below:x 1x 2x 3x 4x 501»1010101071313131316–It is clear that {x 1,x 2}and {x 3,x 4}are symmetric sets of f .Consequently,x 2and x 4are included in the distinguished input set DI .x 5is also added into DI because of |f x 5|=16=13.So d f (f,s 0)=3/5=0.6.4.DETECTION OF G -SYMMETRY4.1Check of Group SymmetryThere are two phases in our heuristic to detect g-symmetry of Boolean functions.Phase 1:Search of Candidate Swapping PairsSuppose f is g-symmetric to X 1=[x 11,x 12,···,x 1k ]and X 2=[x 21,x 22,···,x 2k ].Let A =[a 1,a 2,···,a i ,···,a k ]and B =[b 1,b 2,···,b i ,···,b k ]be constructed by the fol-lowing rule:if a i =x 1i (or x 2i ),then b i =x 2i (or x 1i ).It is easy to prove that f is also g-symmetric w.r.t.A and B .Therefore,it doesn’t 
matter to find any specific ordered sets A and B .We only need to search the candidate pairs of variables for swapping.In our method,we consider the swapping of two indistinguishable inputs x i and x j and use type 4k-distance signature to find their mutual dependency which can help search the other candidate swapping pairs.Consider the function f =x 1(x 2+¯x 3)+x 4(x 5+¯x 6).It has three indistinguishable input groups X 1={x 1,x 4},X 2={x 2,x 5},and X 3={x 3,x 6}.The following matrix shows the positive phase of type 4signatures w.r.t.all in-put variables.Each entry corresponds to the cofactor count of f x i x j .X 1X 2X 3z }|{x 1x 4z }|{x 2x 5z }|{x 3x 6X 1:x 1x 4X 2:x 2x 5X 3:x 3x 62666664014161410131401416131016140121110141612010111013111307131010117103777775x 1x 2x 3x 4x 5x 1x 2x 3x 4x 52666408778808777808777808877837775(a )M 1x 1x 2x 4x 5x 3x 1x 2x 4x 5x 326664087878077877088878077887037775(b )M 2Figure 2:Type 4Signatures with Different Input Orders.Consider the input group X 1w.r.t.the above matrix.If we swap the inputs x 1and x 4,two pairs (x 2,x 5)and (x 3,x 6)must be also swapped to make this matrix invariant.The same situation can happen to swapping the inputs in X 2and X 3.With such an observation,we can obtain three swapping candidate pairs.Then f can be checked if it is invariant under the swapping of these candidate pairs.Our heuristic chooses the inputs from the smallest group as the initial swapping candidate pair.Phase 2:Check of Group Symmetry with Transpo-sitional OperatorConsider a function f (X )and two k -input ordered subsets X 1and X 2of X .In the article [9],the authors proposed a straightforward method using 22k −1−2k −1cofactors’equiv-alence checking which is not feasible for large number k .Instead,we check G -symmetry of f w.r.t.X 1and X 2by transpositional operator which can swap two inputs of f with ITE operator [11].The time complexity of transpo-sitional operator is O (p 2),where p is the size of OBDD representing f .Let f k be the function obtained from f by applying k transpositional operators to f .If f k =f ,then f is G -symmetric w.r.t.X 1and X 2.Let f 0=f and f i be the function after applying the i ’th transpositional operator.Thetime complexity of computing f k from f is O (P k −1i =0p i 2),where p i is the size of OBDD representing f i .4.2Check of Rotational SymmetryConsider the function f =x 1x 2+x 2x 3+x 3x 4+x 4x 5+x 5x 1which is invariant under rotating the variables of f with the ordered set [x 5,x 1,x 2,x 3,x 4].Fig.2(a)and (b)show the matrices of type 4k-distance signature of f with input orders [x 1,x 2,x 3,x 4,x 5]and [x 1,x 2,x 4,x 5,x 3],respectively.In the first matrix,each left-to-right diagonal has the same value,while the second matrix doesn’t own this property.According to this important observation,given an initial type 4k-distance signature matrix,we have developed a prune-and-search algorithm to find all reordered matrices with all diagonals having the same value.If there exists such a matrix,we can further check if f is r-symmetric w.r.t.this input order.The detail is however omitted because of space limitation.We have implemented the matrix reordering algorithm mentioned above and performed an experiment to check r-symmetry of f =x 1x 2+x 2x 3+···+x n −1x n +x n x 1.The experimental result is shown in Table 1.The rows la-beled n ,|BDD |,and CPU Time show the input size,the OBDD size,and the run time in second,respectively.It shows our matrix reordering algorithm is very promising to check r-symmetry of Boolean functions.Table 
1:Results of Detecting R-Symmetryn 58163264128256|BDD |1020521162445001012CPU Time0.000.000.010.040.332.4421.02Algorithm AID (f,X )Input:A Boolean function f w.r.t.the input set X ;Output:Ψis an ordered set of distinguished inputs;BeginMSS =Maximal -Symmetric -Set (f,X );Generate the initial Ψbased on MSS ;if Ψ=X return Ψ;Type-0signature ksig (f,1)is used to update Ψ;if Ψ=X return Ψ;Type-1signature ksig (f,f )is used to update Ψ;if Ψ=X return Ψ;for i =0to n do /*optional */Type-2signature ksig (f,S ni)is used to update Ψ;if Ψ=X return Ψ;endforType-3signature ksig (f,x i )is used to update Ψ;if Ψ=X return Ψ;Generate the bounded set X b by MSS and Ψ;Reorder the OBDD of f with transpositional operator so the order of inputs in X b is before the inputs in X −X b ;Let f =P m i =1(b i (X b )·u i (X −X b ))by Type-4signature;for i =1to m doΨ=Ψ+AID (u i ,X −X b );if Ψ=X return Ψ;endforreturn (Ψ);EndFigure 3:The AID Algorithm.5.ALGORITHM FOR INPUT DIFFERENTIATION (AID)5.1AIDOur AID algorithm starts by detecting all maximal sym-metric sets MSS of f (X )using the algorithm we proposed in [12].Then we classify the MSS in terms of set size.For each maximal symmetric set P i ∈MSS ,if P i is unique in size |P i |,then we throw the inputs of P i into the ini-tial ordered set Ψ;otherwise,we take an input x i of P i to represent P i and throw the remaining inputs into Ψ.Take f =x 1x 2x 3+x 4x 5+x 6x 7for example.It is clear P 1={x 1,x 2,x 3},P 2={x 4,x 5}and P 3={x 6,x 7}are maximal symmetric sets.The inputs of P 1are added into Ψdue to P 1is unique on set size 3.Since P 2and P 3have the same set size,x 4and x 6are used to represent P 1and P 2,respectively.Consequently,x 5and x 7are thrown into Ψ.After generating the initial Ψ,we apply type 0,1,2,and 3signatures in turn to add more inputs into Ψ.Once all inputs are distinguished,the algorithm returns Ψimmediately.If there still exist some inputs that can’t be distinguished,the bounded set X b is constructed based on Ψand MSS .One thing we need to address is that the bounded set is not al-ways equal to Ψ.Take the function f =x 1x 2x 3+x 4x 5+x 6x 7for example.After applying all k-distance signatures except for type 4,we get Ψ=[x 1,x 2,x 3,x 5,x 7]and two indis-tinguishable inputs x 4,x 6.So,x 5and x 7must be removed from Ψto generate the bounded set X b ={x 1,x 2,x 3}.Then w.r.t.X b ,we reorder the OBDD of f using transpositional operator [11]and decompose f as P m i =1(b i (X b )·u i (X −X b ))based on type 4k-distance signature.Then AID procedure is recursively applied to subfunctions u i ’s to further distin-guish the inputs in X −X b .By the above description,the AID algorithm is shown in Fig 3.5.2Boolean Matching AlgorithmOur Boolean matching method is mainly based on AID ,G -symmetry detection,and structural transformation of OBDD’s.To check if two functions f (X )and g (Y )areAlgorithm Boolean-Matching (f (X ),g (Y ))Input:f (X )and g (Y )Output:πis a feasible assignment;Beginif |f |=|g |return (∅);Ψ1=AID (f,X );Ψ2=AID (g,Y );if Ψ1=Ψ2return (∅);G 1=G -Symmetry of f (X )utilizing Ψ1;G 2=G -Symmetry of g (Y )utilizing Ψ2;Let ψbe an assignment w.r.t.(Ψ1,G 1)and (Ψ2,G 2);Reorder f w.r.t.ψ;if f ≡g return (ψ);/*≡:structural equivalent */return (∅);EndFigure 4:The Boolean Matching Algorithm.equivalent under input permutation,our algorithm starts by applying AID algorithm to f and g to find their distin-guishable inputs as many as possible.If all the inputs of f and g can be distinguished and have the same signature for each ordered input pair,then the OBDD 
representing f can be reordered to check if the transformed OBDD is struc-tural equivalent [11]to the OBDD representing g .For the situation that some inputs can’t be distinguished by AID ,we may apply our G -symmetry detection method to find G -symmetry of f and g ,bining the result by AID and detected G -symmetry,an assignment ψmap-ping X to Y can be identifistly,the same reordering and structural equivalence checking scheme can be applied to check whether f and g are matched or not.The above Boolean matching algorithm is shown in Fig.4.6.EXPERIMENTAL RESULTSThe proposed AID algorithm and G -symmetry detection method have been implemented in C language on SUN-Blade 1000station.128circuits from MCNC and ISCAS benchmarking sets have been tested.For each circuit,each output is tested individually.Overall,there are 17,28,63,71,and 78out of 128circuits with all inputs that can be distinguished by type 0,1,2,3k-distance signatures,and AID ,respectively.The ratios are 13%,22%,49%,55%,and 61%,respectively.Table 2shows the partial experimental results.The columns labeled #in ,#out ,and |BDD |are the numbers of inputs,the number of outputs,and the size of OBDD,respectively.The columns with labels 0,1,2,3and AID show the average distinct factors of type 0,1,2,3k-distance signatures,and our AID algorithm,respectively.The CPU Time column shows the running time in second.The rows with labels average and ratio show the average distinct factor and runtime normalization,respectively.The average distinct factors of type 0,1,2,3,and AID are 0.60,0.67,0.78,0.76,and 0.81,respectively.It shows that AID is the most effective signature for being used in Boolean matching.Table 2also shows the result of testing our G -symmetry de-tection method.The column #O show the number of out-puts with indistinguishable inputs after running AID .The +GSD column shows the result of G -symmetry detection.The columns labeled #HS and #GS show the number of outputs with h-symmetry and with g-symmetry ,respec-tively.The result of r-symmetry is skipped since no r-symmetry exists for all circuits.The experimental result shows that the G -symmetry of all tested circuits except for C1355can be detected by our heuristic with little run time overhead (7%).In summary,our heuristic is very effective and efficient to check G -symmetry of Boolean functions.。