Robust Face Recognition via Sparse Representation
A Q&A about recent advances in face recognition and how to protect your facial identity
Allen Y. Yang (yang@)
Department of EECS, UC Berkeley
July 21, 2008

Q: What is this technique all about?

A: The technique, called robust face recognition via sparse representation, provides a new way for a computer program to classify human identity from frontal facial images, i.e., the well-known problem of face recognition.

Face recognition has been one of the most extensively studied problems in artificial intelligence and computer vision. Its applications include human-computer interaction, multimedia data compression, and security, to name a few. The significance of face recognition is also highlighted by the contrast between humans' high accuracy in recognizing face images under various conditions and computers' historically poor accuracy.

This technique proposes a highly accurate recognition framework. Extensive experiments have shown that, for the first time, the method can achieve recognition accuracy similar to that of human vision. In some cases, the method has even outperformed human vision in face recognition.

Q: Who are the authors of this technique?

A: The technique was developed in 2007 by Mr. John Wright, Dr. Allen Y. Yang, Dr. S. Shankar Sastry, and Dr. Yi Ma. The technique is jointly owned by the University of Illinois and the University of California, Berkeley. A provisional US patent was filed in 2008.
The technique is also being published in the IEEE Transactions on Pattern Analysis and Machine Intelligence [Wright 2008].

Q: Why is face recognition difficult for computers?

A: Several issues have historically hindered progress on face recognition in computer science.

1. High dimensionality: the data size of face images is large.

When we take a picture of a face, the face image under certain color metrics is stored as an image file on a computer, e.g., the image shown in Figure 1. Because the human brain is a massively parallel processor, it can quickly process a 2-D image and match it against the images learned in the past. Modern computer algorithms, however, process a 2-D image sequentially, pixel by pixel. Hence, although the image file usually takes less than 100 KB to store, if we treat each image as a sample point, it sits in a space of 10,000-100,000 dimensions (each pixel owns an individual dimension). Any pattern recognition problem in such a high-dimensional space (>100 dimensions) is known to be difficult in the literature.

Fig. 1. A frontal face image from the AR database [Martinez 1998]. The size of a JPEG file for this image is typically about 60 KB.

2. The number of identities to classify is high.

To make the situation worse, an adult human being can learn to recognize thousands, if not tens of thousands, of different human faces over the span of his/her life. For a computer to match this ability, it has to first store tens of thousands of learned face images, called the training images in the literature. Then, using whatever algorithm, the computer has to process this massive data set and quickly identify the correct person from a new face image, called the test image.

Fig. 2. An ensemble of 28 individuals in the Yale B database [Lee 2005]. A typical face recognition system needs to recognize 10-100 times more individuals.
Arguably, an adult can recognize thousands of times more individuals in daily life. Combining the above two problems, we are solving a pattern recognition problem that carefully partitions a high-dimensional data space into thousands of domains, where each domain represents the possible appearance of one individual's face images.

3. Face recognition has to be performed under various real-world conditions.

When you walk into a drug store to take a passport photo, you are usually asked to pose with a frontal, neutral expression in order to qualify for a good passport photo. The store associate will also control the photo resolution, background, and lighting by using a uniform color screen and a flash. In the real world, however, a computer program is asked to identify humans without any of these constraints. Although past solutions achieve recognition under very limited relaxations of the constraints, to this day no algorithm can answer all the possible challenges, including the technique we present here.

To further motivate the issue, human vision can accurately recognize learned human faces under different expressions, backgrounds, poses, and resolutions [Sinha 2006]. With professional training, humans can also identify face images with facial disguise. Figure 3 demonstrates this ability using images of Abraham Lincoln.

Fig. 3. Images of Abraham Lincoln under various conditions (available online). Arguably, humans can recognize the identity of Lincoln from each of these images.

A natural question arises: do we simply ask too much of a computer algorithm? For some applications, such as security checkpoints, we can mandate that individuals pose with a frontal, neutral face in order to be identified. However, in most other applications this requirement is simply not practical.
For example, we may want to search our photo albums for all the images that contain our best friends under normal indoor/outdoor conditions, or we may need to identify a criminal suspect, who would naturally try to disguise his identity, from a murky, low-resolution hidden camera. Therefore, the study of recognizing human faces under real-world conditions is motivated not only by scientific rigor, but also by urgent demands from practical applications.

Q: What is the novelty of this technique? Why is the method related to sparse representation?

A: The method is built on a novel pattern recognition framework that relies on a scientific concept called sparse representation. In fact, sparse representation is not a new topic in many scientific areas. In human perception in particular, scientists have discovered that accurate low-level and mid-level visual perception results from sparse representations of visual patterns using highly redundant visual neurons [Olshausen 1997, Serre 2006].

Without diving into technical detail, let us consider an analogy. Assume that a normal individual, Tom, is very good at identifying different types of fruit juice, such as orange juice, apple juice, lemon juice, and grape juice. Now he is asked to identify the ingredients of a fruit punch, which contains an unknown mixture of drinks. Tom discovers that when the punch is highly concentrated on a single type of juice (e.g., 95% orange juice), he has no difficulty identifying the dominant ingredient. On the other hand, when the punch is a largely even mixture of multiple drinks (e.g., 33% orange, 33% apple, and 33% grape), he has the most difficulty identifying the individual ingredients. In this example, a fruit punch can be represented as a sum of the amounts of the individual fruit drinks. We say such a representation is sparse if the majority of the juice comes from a single fruit type; conversely, we say the representation is not sparse.
Clearly, in this example, sparse representation leads to easier and more accurate recognition than non-sparse representation.

The human brain turns out to be an excellent machine for computing sparse representations from biological sensors. In face recognition, when a new image is presented to the eyes, the visual cortex immediately calculates a representation of the face image based on all the prior face images it remembers from the past. Such a representation is believed to be sparse in the human visual cortex. For example, although Tom remembers thousands of individuals, when he is given a photo of his friend Jerry, he simply asserts that the photo is an image of Jerry; his perception does not attempt to calculate the similarity of Jerry's photo to the images of all other individuals. On the other hand, with image-editing software such as Photoshop, an engineer can now seamlessly combine facial features from multiple individuals into a single new image. In this case, a typical human would assert that he/she cannot recognize the new image, rather than analytically calculating percentages of similarity to multiple individuals (e.g., 33% Tom, 33% Jerry, 33% Tyke) [Sinha 2006].

Q: What are the conditions that the technique applies to?

A: Currently, the technique has been successfully demonstrated to classify frontal face images under different expressions, lighting conditions, and resolutions, and under severe facial disguise and image distortion. We believe it is one of the most comprehensive solutions in face recognition, and definitely one of the most accurate. Further study is required to establish a relation, if any, between sparse representation and face images with pose variations.

Q: More technically, how does the algorithm estimate a sparse representation using face images?
Why do the other methods fail in this respect?

A: This technique is the first solution in the literature to explicitly calculate sparse representations for the purpose of image-based pattern recognition. It is hard to say that the other extant methods have failed in this respect; previously, investigators simply did not realize the importance of sparse representation in human vision and computer vision for classification.

For example, a well-known solution to face recognition is the nearest-neighbor method. It compares the similarity between a test image and each individual training image separately; Figure 4 illustrates the similarity measurement. The nearest-neighbor method then identifies the test image with the training image most similar to it, hence the name. We can easily observe that the representation estimated this way is not sparse: a single face image can be similar to many images in terms of its RGB pixel values. Therefore, accurate classification based on this type of metric is known to be difficult.

Fig. 4. A similarity metric (the y-axis) between a test face image and about 1200 training images. The smaller the metric value, the more similar the two images.

Our technique abandons the conventional wisdom of comparing the similarity between the test image and individual training images or individual training classes. Rather, the algorithm calculates a representation of the input image with respect to all available training images as a whole. Furthermore, the method imposes the extra constraint that the optimal representation should use the smallest number of training images. Hence, the majority of the coefficients in the representation should be zero, and the representation is sparse (as shown in Figure 5).

Fig. 5. An estimation of sparse representation w.r.t. a test image and about 1200 training images.
The dominant coefficients in the representation correspond to the training images with the same identity as the input image. In this example, the recognition is based on downgraded 12-by-10 low-resolution images; yet the algorithm correctly identifies the input image as Subject 1.

Q: How does the technique handle severe facial disguise in the image?

A: Facial disguise and image distortion pose some of the biggest challenges to the accuracy of face recognition. The types of distortion that can be applied to face images are manifold; Figure 6 shows some examples.

Fig. 6. Examples of image distortion on face images. Some of the cases are beyond humans' ability to perform reliable recognition.

One notable advantage of the sparse representation framework is that compensating for image distortion can be rigorously reformulated within the same framework as recognition. In this case, a distorted face image presents two types of sparsity: one representing the locations of the distorted pixels in the image, and the other representing the identity of the subject, as before. Our technique has been shown to handle and eliminate all the types of image distortion in Figure 6 while maintaining high accuracy. In the following, we present an example that illustrates a simplified solution for one type of distortion; for more detail, please refer to our paper [Wright 2008].

Figure 7 demonstrates the process of recognizing a face image with severe facial disguise by sunglasses. The algorithm first partitions the left test image into eight local regions and individually recovers a sparse representation per region. Notice that, with the sunglasses occluding the eye regions, the corresponding representations from those regions do not provide correct classification.
However, when we look at the overall classification result over all regions, the non-occluded regions provide a high consensus for the image to be classified as Subject 1 (as shown in red circles in the figure). Therefore, the algorithm simultaneously recovers the subject identity and the facial regions that are being disguised.

Fig. 7. Solving for part-based sparse representation using local face regions. Left: Test image. Right: Estimation of sparse representation and the corresponding classification in the titles. The red circle identifies the correct classification.

Q: What is the quantitative performance of this technique?

A: Most of the representative results from our extensive experiments are documented in our paper [Wright 2008]. The experiments were based on two established face recognition databases, namely the Extended Yale B database [Lee 2005] and the AR database [Martinez 1998]. In the following, we highlight some of the notable results. On the Extended Yale B database, the algorithm achieved 92.1% accuracy using 12-by-10 resolution images, 93.7% using single-eye-region images, and 98.3% using mouth-region images. On the AR database, the algorithm achieved 97.5% accuracy on face images with sunglasses disguise, and 93.5% with scarf disguise.

Q: Does the estimation of sparse representation cost more computation and time compared to other methods?

A: The complexity and speed of an algorithm matter to the extent that they do not hinder its application to real-world problems. Our technique uses some of the best-studied numerical routines in the literature, namely L1-minimization. These routines belong to a family of optimization algorithms called convex optimization, which are known to be solvable very efficiently on a computer.
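The pipeline described above (jointly code a test image against all training images with a sparsity penalty, then assign the class whose training images best explain it) can be sketched in a few lines. This is a toy illustration on synthetic 20-pixel "images", not the authors' implementation; the three subjects, the value of lam, and the ISTA solver (a simple proximal-gradient stand-in for the L1-minimization routine) are all illustrative assumptions.

```python
# Sketch of sparse-representation-based classification (SRC) on synthetic data.
# All data and parameters here are illustrative assumptions, not the paper's.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(A, y, lam=0.01, iters=500):
    """Minimize 0.5*||A x - y||^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

def src_classify(A, labels, y, lam=0.01):
    """Code y against ALL training images jointly, then pick the class
    whose training images explain y with the smallest residual."""
    x = ista_lasso(A, y, lam)
    residuals = {}
    for c in set(labels):
        mask = np.array([l == c for l in labels])
        xc = np.where(mask, x, 0.0)        # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - A @ xc)
    return min(residuals, key=residuals.get)

rng = np.random.default_rng(0)
# 3 "subjects", 5 training images each, 20-pixel images: samples of one
# subject cluster around a common template.
templates = rng.normal(size=(3, 20))
A_cols, labels = [], []
for c in range(3):
    for _ in range(5):
        v = templates[c] + 0.05 * rng.normal(size=20)
        A_cols.append(v / np.linalg.norm(v))   # normalize dictionary columns
        labels.append(c)
A = np.array(A_cols).T                          # 20 x 15 training dictionary

y = templates[1] + 0.05 * rng.normal(size=20)   # a new image of subject 1
y = y / np.linalg.norm(y)
print(src_classify(A, labels, y))
```

The sparse code concentrates on the columns of the correct subject, so the class-wise residual test recovers the identity even though no pairwise image-to-image comparison is ever made.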
In addition, considering today's rapid advances in microprocessors, we do not believe there is any significant obstacle to implementing a real-time commercial system based on this technique.

Q: With this type of highly accurate face recognition algorithm available, is it becoming more and more difficult to protect biometric information and personal privacy in urban environments and on the Internet?

A: Believe it or not, a government agency, a company, or even a total stranger can capture and permanently log your biometric identity, including your facial identity, much more easily than you might imagine. According to a Time magazine report [Grose 2008], a resident living or working in London is likely to be captured on camera 300 times per day! Residents of other Western metropolitan cities presumably enjoy similar "free services." If you prefer to stay indoors and blog on the Internet, your public photo albums can be easily accessed on unprotected websites, and have probably been permanently logged by search engines such as Google and Yahoo!.

With today's ubiquitous camera technologies, completely preventing your facial identity from being obtained by others is difficult, unless you never step into a downtown area of a big city and never apply for a driver's license. However, there are ways to prevent illegal and involuntary access to your facial identity, especially on the Internet. One simple step everyone can take to stop a third party from exploiting your face images online is to prevent those images from being linked to your identity. Any classification system needs a set of training images to learn the possible appearance of your face. If you put your personal photos on a public website and frequently give away the names of the people in the photos, over time a search engine will be able to link the identities of those people with the face images in the photos.
Therefore, to prevent an unauthorized party from "crawling" your website and sifting through valuable private information, you should place these photo websites under password protection. Do not make a large number of personal images available online without consent while at the same time providing the names of the people on the same website.

Previously, we mentioned many notable applications that involve face recognition. The technology, if properly utilized, can also revolutionize the IT industry to better protect personal privacy. For example, an assembly factory can install a network of cameras to improve the safety of the assembly line while blurring out the facial images of the workers in the surveillance videos. A cellphone user who is teleconferencing can activate a face recognition function that tracks only his/her facial movements and excludes other people in the background from being transmitted to the other party.

All in all, face recognition is a rigorous scientific study. Its sole purpose is to hypothesize, model, and reproduce the image-based recognition process with accuracy comparable or even superior to human perception. The scope of its eventual extension and impact on our society will rest on the shoulders of the government, the industry, and each individual end user.

References

[Grose 2008] T. Grose. When surveillance cameras talk. Time (online), Feb. 11, 2008.
[Lee 2005] K. Lee et al. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, 2005.
[Martinez 1998] A. Martinez and R. Benavente. The AR face database. CVC Tech Report No. 24, 1998.
[Olshausen 1997] B. Olshausen and D. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, vol. 37, 1997.
[Serre 2006] T. Serre. Learning a dictionary of shape-components in visual cortex: Comparison with neurons, humans and machines.
PhD dissertation, MIT, 2006.
[Sinha 2006] P. Sinha et al. Face recognition by humans: Nineteen results all computer vision researchers should know about. Proceedings of the IEEE, vol. 94, no. 11, November 2006.
[Wright 2008] J. Wright et al. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008 (in press).
Abstract

With the continuous progress of science and technology, we have entered the information age. Digital images are an important medium of information transmission, and low resolution, as well as turbid media in the scene, can prevent people from extracting the information an image contains. In the real world, many factors reduce image resolution, such as shutter speed, shot noise, camera shake, the diffraction limit, the sensor, focus, and color aliasing. Many turbid media also appear in imaging, such as water droplets, particles, smoke, and haze. These factors and media lower image resolution and cause part of the information in an image to be lost; it is therefore particularly important to raise image resolution and remove haze from images.

When neither the imaging device nor the imaging environment is ideal, mathematical algorithms can be used to improve image quality; in particular, sparse representation methods can perform joint dehazing and super-resolution. The advantage of this approach is that it is not limited by hardware or environmental conditions and it reduces cost, so it has broad application prospects. The application of sparse representation theory to image processing has attracted wide attention, for example in image denoising, face recognition, and image super-resolution reconstruction. By training a dictionary, image patches can be sparsely represented with the fewest possible atoms, accurately capturing the texture information of the image.

The novelty of this thesis is that sparse-representation-based super-resolution is combined with image dehazing, and the resulting images are better than those produced by dehazing or super-resolution alone. The main research contents are as follows:

1. Image patches are represented as sparse linear combinations of appropriately chosen elements of an overcomplete dictionary. For each input low-resolution patch, the corresponding sparse coefficients are found, and the high-resolution output is generated from these coefficients.

2. Joint training of high- and low-resolution image patches strengthens the similarity between the sparse representations of corresponding high-/low-resolution patch pairs. Consequently, the sparse representation of a low-resolution patch can be applied with the high-resolution patch dictionary to generate a high-resolution patch.

3. Sparse-representation-based super-resolution reconstruction is an ill-posed inverse problem; to further improve reconstruction quality, a regularization constraint is introduced into the reconstruction process.

4. After a broad survey of existing image dehazing methods, the dark channel prior dehazing algorithm is combined with sparse-representation-based dehazing. This method uses the dark channel prior model to remove haze from the image and uses sparse representation to remove fine-scale noise and raise the resolution. Compared with the original dark channel prior model, this method achieves slightly better dehazing and slightly better fidelity.
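As an illustration of the dark channel prior step described in item 4, the sketch below dehazes a synthetic image. It is not the thesis code: the patch size, omega = 0.95, the t0 floor, and the synthetic scene are standard choices assumed here, and the sparse-representation denoising/super-resolution stage is omitted.

```python
# Sketch of dark channel prior dehazing on a synthetic hazy image.
# Parameter values (patch 15, omega 0.95, t0 0.1) are assumed defaults.
import numpy as np

def dark_channel(img, patch=15):
    """Per-pixel minimum over color channels, then a min-filter over a patch."""
    mins = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(mins, pad, mode='edge')
    h, w = mins.shape
    out = np.empty_like(mins)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def dehaze(I, patch=15, omega=0.95, t0=0.1):
    dc = dark_channel(I, patch)
    # Atmospheric light A: mean color of the brightest 0.1% dark-channel pixels.
    n = max(1, dc.size // 1000)
    idx = np.unravel_index(np.argsort(dc, axis=None)[-n:], dc.shape)
    A = I[idx].mean(axis=0)
    # Transmission estimate t(x) = 1 - omega * dark_channel(I / A).
    t = np.maximum(1.0 - omega * dark_channel(I / A, patch), t0)
    # Invert the haze model I = J*t + A*(1 - t).
    J = (I - A) / t[..., None] + A
    return np.clip(J, 0.0, 1.0)

# Synthetic test: haze a clean scene with known transmission and airlight.
rng = np.random.default_rng(0)
J_true = rng.uniform(0.0, 0.6, size=(60, 60, 3))   # scene with dark pixels
A_true = np.array([0.9, 0.9, 0.9])
I_hazy = J_true * 0.5 + A_true * 0.5               # uniform haze, t = 0.5
I_hazy[:30, :30] = A_true                          # a pure-airlight "sky" patch
J_rec = dehaze(I_hazy)
print(np.abs(J_rec - J_true).mean(), np.abs(I_hazy - J_true).mean())
```

On this synthetic input the recovered image is substantially closer to the clean scene than the hazy input is; real images additionally need a transmission-refinement step (e.g., soft matting or guided filtering), which is left out of this sketch.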
Journal of Terahertz Science and Electronic Information Technology, Vol. 21, No. 7, July 2023. doi: 10.11805/TKYDA2020710

Two-dimensional DOA estimation of double parallel arrays based on sparse representation

SU Long, LI Xue, LUO Deling (Air Force Aviation Maintenance Technical College, Changsha, Hunan 410124, China)

Abstract: In order to effectively estimate the azimuth and pitch angles of an airborne target, a two-dimensional Direction Of Arrival (DOA) estimation method based on sparse representation is presented. Firstly, two spatial compound angles containing the azimuth and pitch information of the target are constructed. Then, one of the spatial compound angles is solved by sparse representation; taking it as a premise, the other spatial compound angle can be decoupled into a one-dimensional DOA estimation problem and solved by matrix operations. Finally, the azimuth and pitch angles are paired and solved according to the two recovered spatial compound angles. Compared with existing algorithms, the proposed method is less affected by the number of snapshots and performs well under high signal-to-noise ratio and large angular separation.

Keywords: sparse representation; double parallel line array; Direction Of Arrival; spatial compound angle

DOA estimation has long been a hot topic in array signal processing and has attracted the attention of many researchers [1-6], with numerous results in radar, communications, and sonar.
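To illustrate the sparse-representation step on which such methods rest, here is a minimal single-snapshot, one-dimensional sketch: a dictionary of steering vectors on an angle grid, with a greedy orthogonal matching pursuit (OMP) solver standing in for the paper's sparse recovery. The array geometry (a single 8-sensor uniform linear array) and all parameters are illustrative assumptions, not the paper's double parallel array or algorithm.

```python
# Sketch of grid-based sparse DOA estimation for a uniform linear array,
# single snapshot, using OMP. All scenario parameters are assumptions.
import numpy as np

M = 8                                   # sensors, half-wavelength spacing
grid = np.arange(-90, 91)               # candidate angles in degrees
# Steering dictionary: one column per grid angle.
m = np.arange(M)[:, None]
A = np.exp(-1j * np.pi * m * np.sin(np.deg2rad(grid)))  # M x 181

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily select k dictionary columns."""
    residual, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.conj().T @ residual))))
        Asub = A[:, support]
        coef, *_ = np.linalg.lstsq(Asub, y, rcond=None)
        residual = y - Asub @ coef
    return sorted(support)

rng = np.random.default_rng(1)
true_angles = [-30, 20]                 # well-separated on-grid sources
y = sum(A[:, ang + 90] for ang in true_angles)
y = y + 0.01 * (rng.normal(size=M) + 1j * rng.normal(size=M))

est = [grid[i] for i in omp(A, y, 2)]
print(est)                              # angles close to [-30, 20]
```

The recovered support of the sparse coefficient vector directly gives the angle estimates; in the paper's two-dimensional setting the same idea is applied to a spatial compound angle, with the second angle then decoupled into a one-dimensional problem.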
Compressive sampling

Emmanuel J. Candès*

Abstract. Conventional wisdom and common practice in the acquisition and reconstruction of images from frequency data follow the basic principle of the Nyquist density sampling theory. This principle states that to reconstruct an image, the number of Fourier samples we need to acquire must match the desired resolution of the image, i.e., the number of pixels in the image. This paper surveys an emerging theory which goes by the name of "compressive sampling" or "compressed sensing," and which says that this conventional wisdom is inaccurate. Perhaps surprisingly, it is possible to reconstruct images or signals of scientific interest accurately and sometimes even exactly from a number of samples which is far smaller than the desired resolution of the image/signal, e.g., the number of pixels in the image.

It is believed that compressive sampling has far-reaching implications. For example, it suggests the possibility of new data acquisition protocols that translate analog information into digital form with fewer sensors than what was considered necessary. This new sampling theory may come to underlie procedures for sampling and compressing data simultaneously. In this short survey, we provide some of the key mathematical insights underlying this new theory, and explain some of the interactions between compressive sampling and other fields such as statistics, information theory, coding theory, and theoretical computer science.

Mathematics Subject Classification (2000). Primary 00A69, 41-02, 68P30; Secondary 62C65.

Keywords. Compressive sampling, sparsity, uniform uncertainty principle, underdetermined systems of linear equations, L1-minimization, linear programming, signal recovery, error correction.

1. Introduction

One of the central tenets of signal processing is the Nyquist/Shannon sampling theory: the number of samples needed to reconstruct a signal without error is dictated by its bandwidth, the length of the shortest interval which contains the support of the spectrum of the signal under study. In the last two years or so, an alternative theory of "compressive sampling" has emerged which shows that super-resolved signals and images can be reconstructed from far fewer data/measurements than what is usually considered necessary. The purpose of this paper is to survey and provide some of the key mathematical insights underlying this new theory. An enchanting aspect of compressive sampling is that it has significant interactions and bearings on some fields in the applied sciences and engineering such as statistics, information theory, coding

*The author is partially supported by an NSF grant CCF-515362. Proceedings of the International Congress of Mathematicians, Madrid, Spain, 2006. © 2006 European Mathematical Society

theory, theoretical computer science, and others as well. We will try to explain these connections via a few selected examples.

From a general viewpoint, sparsity and, more generally, compressibility has played and continues to play a fundamental role in many fields of science. Sparsity leads to efficient estimations; for example, the quality of estimation by thresholding or shrinkage algorithms depends on the sparsity of the signal we wish to estimate. Sparsity leads to efficient compression; for example, the precision of a transform coder depends on the sparsity of the signal we wish to encode [24]. Sparsity leads to dimensionality reduction and efficient modeling. The novelty here is that sparsity has bearings
on the data acquisition process itself, and leads to efficient data acquisition protocols. In fact, compressive sampling suggests ways to economically translate analog data into already compressed digital form [20], [7]. The key word here is "economically." Everybody knows that because typical signals have some structure, they can be compressed efficiently without much perceptual loss. For instance, modern transform coders such as JPEG2000 exploit the fact that many signals have a sparse representation in a fixed basis, meaning that one can store or transmit only a small number of adaptively chosen transform coefficients rather than all the signal samples. The way this typically works is that one acquires the full signal, computes the complete set of transform coefficients, encodes the largest coefficients, and discards all the others. This process of massive data acquisition followed by compression is extremely wasteful (one can think of a digital camera which has millions of imaging sensors, the pixels, but eventually encodes the picture in a few hundred kilobytes). This raises a fundamental question: because most signals are compressible, why spend so much effort acquiring all the data when we know that most of it will be discarded? Wouldn't it be possible to acquire the data in already compressed form so that one does not need to throw away anything? "Compressive sampling," also known as "compressed sensing" [20], shows that this is indeed possible.

This paper is by no means an exhaustive survey of the literature on compressive sampling. Rather, this is merely an account of the author's own work and thinking in this area, which also includes a fairly large number of references to other people's work and occasionally discusses connections with these works. We have done our best to organize the ideas into a logical progression starting with the early papers which launched this subject. Before we begin, we would like to invite the interested reader to also check the article [17] by Ronald DeVore (also in these proceedings) for a complementary survey of the field (Section 5).

2. Undersampled measurements

Consider the general problem of reconstructing a vector x ∈ R^N from linear measurements y about x of the form

y_k = ⟨x, φ_k⟩,  k = 1, ..., K,   or   y = Φx.   (2.1)

That is, we acquire information about the unknown signal by sensing x against K vectors φ_k ∈ R^N. We are interested in the "underdetermined" case K ≪ N, where we have many fewer measurements than unknown signal values. Problems of this type arise in a countless number of applications. In radiology and biomedical imaging, for instance, one is typically able to collect far fewer measurements about an image of interest than the number of unknown pixels. In wideband radio frequency signal analysis, one may only be able to acquire a signal at a rate which is far lower than the Nyquist rate because of current limitations in analog-to-digital converter technology. Finally, gene expression studies also provide examples of this kind. Here, one would like to infer the gene expression levels of thousands of genes (that is, the dimension N of the vector x is in the thousands) from a low number of observations, typically in the tens.

At first glance, solving the underdetermined system of equations appears hopeless, as it is easy to make up examples for which it clearly cannot be done. But suppose now that the signal x is compressible, meaning that it essentially depends on a number of degrees of freedom which is smaller than N. For instance, suppose our signal is sparse in the sense that it can be written either exactly or accurately as a superposition of a small number of vectors in some fixed basis. Then this premise radically changes the problem, making the search for solutions feasible. In fact, accurate and sometimes exact recovery is possible by solving a simple convex optimization problem.

2.1. A nonlinear sampling theorem. It might be best to consider a concrete example first. Suppose here that one collects an incomplete set of frequency
samples of a discrete signal x of length N. (To ease the exposition, we consider a model problem in one dimension. The theory extends easily to higher dimensions. For instance, we could be equally interested in the reconstruction of 2- or 3-dimensional objects from undersampled Fourier data.) The goal is to reconstruct the full signal given only K samples in the Fourier domain

y_k = (1/√N) Σ_{t=0}^{N−1} x_t e^{−i2πω_k t/N},   (2.2)

where the "visible" frequencies ω_k are a subset Ω (of size K) of the set of all frequencies {0, ..., N−1}. Sensing an object by measuring selected frequency coefficients is the principle underlying Magnetic Resonance Imaging, and is common in many fields of science, including astrophysics. In the language of the general problem (2.1), the sensing matrix Φ is obtained by sampling K rows of the N by N discrete Fourier transform matrix.

We will say that a vector x is S-sparse if its support {i : x_i ≠ 0} is of cardinality less than or equal to S. Then Candès, Romberg and Tao [6] showed that one can almost always recover the signal x exactly by solving the convex program¹ (with ‖x̃‖_1 := Σ_{i=1}^{N} |x̃_i|)

(P1)   min_{x̃ ∈ R^N} ‖x̃‖_1   subject to   Φx̃ = y.   (2.3)

¹(P1) can even be recast as a linear program [3], [15].

Theorem 2.1 ([6]). Assume that x is S-sparse and that we are given K Fourier coefficients with frequencies selected uniformly at random. Suppose that the number of observations obeys

K ≥ C · S · log N.   (2.4)

Then minimizing ℓ1 reconstructs x exactly with overwhelming probability. In detail, if the constant C is of the form 22(δ + 1) in (2.4), then the probability of success exceeds 1 − O(N^{−δ}).

The first conclusion is that one suffers no information loss by measuring just about any set of K frequency coefficients. The second is that the signal x can be exactly recovered by minimizing a convex functional which does not assume any knowledge about the number of nonzero coordinates of x, their locations, or their amplitudes, which we assume are all completely unknown a priori.

While this seems to be a great feat, one could still ask whether this is optimal, or whether one could do with even fewer samples. The answer is that in general, we cannot reconstruct S-sparse signals with fewer samples. There are examples for which the minimum number of samples needed for exact reconstruction by any method, no matter how intractable, must be about S log N. Hence, the theorem is tight and ℓ1-minimization succeeds nearly as soon as there is any hope of succeeding by any algorithm.

The reader is certainly familiar with the Nyquist/Shannon sampling theory, and one can reformulate our result to establish simple connections. By reversing the roles of time and frequency in the above example, we can recast Theorem 2.1 as a new nonlinear sampling theorem. Suppose that a signal x has support Ω in the frequency domain with B = |Ω|. If Ω is a connected set, we can think of B as the bandwidth of x. If in addition the set Ω is known, then the classical Nyquist/Shannon sampling theorem states that x can be reconstructed perfectly from B equally spaced samples in the time domain². The reconstruction is simply a linear interpolation with a "sinc" kernel.

Now suppose that the set Ω, still of size B, is unknown and not necessarily connected. In this situation, the Nyquist/Shannon theory is unhelpful; we can only assume that the connected frequency support is the entire domain, suggesting that all N time-domain samples are needed for exact reconstruction. However, Theorem 2.1 asserts that far fewer samples are necessary. Solving (P1) will recover x perfectly from about B log N time samples. What is more, these samples do not have to be carefully chosen; almost any sample set of this size will work. Thus we have a nonlinear analog (described as such since the reconstruction procedure (P1) is nonlinear) to Nyquist/Shannon: we can reconstruct a signal with arbitrary and unknown frequency support of size B from about B log N arbitrarily chosen samples in the time domain.

Finally, we would like to emphasize that our Fourier sampling theorem is only a special instance of much more general statements. As a matter of
fact, the results extend to a variety of other setups and higher dimensions. For instance, [6] shows how one can reconstruct a piecewise constant (one- or two-dimensional) object from incomplete frequency samples, provided that the number of jumps (discontinuities) obeys the condition above, by minimizing other convex functionals such as the total variation. (In the sampling theorem above we assumed, for the sake of convenience, that the bandwidth B divides the signal length N evenly.)

2.2. Background. Now for some background. In the mid-eighties, Santosa and Symes [44] had suggested the minimization of $\ell_1$-norms to recover sparse spike trains; see also [25], [22] for early results. In the last four years or so, a series of papers [26], [27], [28], [29], [33], [30] explained why $\ell_1$ could recover sparse signals in some special setups. We note, though, that the results in this body of work are very different from the sampling theorem we just introduced. Finally, we would like to point out important connections with the literature of theoretical computer science. Inspired by [37], Gilbert and her colleagues have shown that one could recover an S-sparse signal with probability exceeding $1 - \delta$ from $S \cdot \mathrm{poly}(\log N, \log \delta)$ frequency samples placed on special equispaced grids [32]. The algorithms they use are not based on optimization but rather on ideas from the theory of computer science, such as isolation and group testing. Other points of connection include situations in which the set of spikes is spread out in a somewhat even manner in the time domain [22], [51].

2.3. Undersampling structured signals. The previous example showed that the structural content of the signal allows a drastic "undersampling" of the Fourier transform while still retaining enough information for exact recovery. In other words, if one wanted to sense a sparse object by taking as few measurements as possible, then one would be well advised to measure randomly selected frequency coefficients. In truth, this observation triggered a massive literature. To
what extent can we recover a compressible signal from just a few measurements? What are good sensing mechanisms? Does all this extend to objects that are perhaps not sparse but well approximated by sparse signals? In the remainder of this paper, we will provide some answers to these fundamental questions.

3. The Mathematics of compressive sampling

3.1. Sparsity and incoherence. In all that follows, we will adopt an abstract and general point of view when discussing the recovery of a vector $x \in \mathbb{R}^N$. In practical instances, the vector x may be the coefficients of a signal $f \in \mathbb{R}^N$ in an orthonormal basis
$$f(t) = \sum_{i=1}^{N} x_i \psi_i(t), \qquad t = 1, \ldots, N. \qquad (3.1)$$
For example, we might choose to expand the signal as a superposition of spikes (the canonical basis of $\mathbb{R}^N$), sinusoids, B-splines, wavelets [36], and so on. As a side note, it is not important to restrict attention to orthogonal expansions, as the theory and practice of compressive sampling accommodate other types of expansions. For example, x might be the coefficients of a digital image in a tight frame of curvelets [5].
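As noted earlier, the $\ell_1$ program (2.3) can be recast as a linear program by splitting the unknown into its positive and negative parts. A minimal numerical sketch (assuming NumPy and SciPy are available; the sizes N, K, S and the Gaussian sensing matrix are illustrative stand-ins for the partial Fourier ensemble discussed in the text):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, K, S = 128, 40, 5                      # illustrative sizes

# An S-sparse signal and a random sensing matrix
x = np.zeros(N)
x[rng.choice(N, size=S, replace=False)] = rng.standard_normal(S)
Phi = rng.standard_normal((K, N)) / np.sqrt(K)
y = Phi @ x

# (P1): min ||x~||_1 s.t. Phi x~ = y.  Split x~ = u - v with u, v >= 0,
# so that ||x~||_1 = sum(u) + sum(v) and the problem becomes an LP.
c = np.ones(2 * N)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:N] - res.x[N:]
print("max reconstruction error:", np.max(np.abs(x_hat - x)))
```

With K on the order of $S \log N$ measurements, the recovered `x_hat` typically matches x to solver precision, in line with Theorem 2.1.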
To keep on using convenient matrix notation, one can write the decomposition (3.1) as $x = \Psi f$, where $\Psi$ is the N by N matrix with the waveforms $\psi_i$ as rows or, equivalently, $f = \Psi^* x$. We will say that a signal f is sparse in the $\Psi$-domain if the coefficient sequence is supported on a small set, and compressible if the sequence is concentrated near a small set. Suppose we have available undersampled data about f of the same form as before, $y = \Phi f$. Expressed in a different way, we collect partial information about x via $y = \Theta x$, where $\Theta = \Phi \Psi^*$. In this setup, one would recover f by finding, among all coefficient sequences consistent with the data, the decomposition with minimum $\ell_1$-norm:
$$\min \|\tilde{x}\|_1 \quad \text{such that} \quad \Theta \tilde{x} = y.$$
Of course, this is the same problem as (2.3), which justifies our abstract and general treatment. With this in mind, the key concept underlying the theory of compressive sampling is a kind of uncertainty relation, which we explain next.

3.2. Recovery of sparse signals. In [7], Candès and Tao introduced the notion of the uniform uncertainty principle (UUP), which they refined in [8]. The UUP essentially states that the $K \times N$ sensing matrix $\Phi$ obeys a "restricted isometry hypothesis." Let $\Phi_T$, $T \subset \{1, \ldots, N\}$, be the $K \times |T|$ submatrix obtained by extracting the columns of $\Phi$ corresponding to the indices in T; then [8] defines the S-restricted isometry constant $\delta_S$ of $\Phi$, which is the smallest quantity such that
$$(1 - \delta_S) \|c\|_2^2 \le \|\Phi_T c\|_2^2 \le (1 + \delta_S) \|c\|_2^2 \qquad (3.2)$$
for all subsets T with $|T| \le S$ and coefficient sequences $(c_j)_{j \in T}$. This property essentially requires that every set of columns with cardinality less than S approximately behaves like an orthonormal system. An important result is that if the columns of the sensing matrix $\Phi$ are approximately orthogonal, then the exact recovery phenomenon occurs.

Theorem 3.1 ([8]). Assume that x is S-sparse and suppose that $\delta_{2S} + \delta_{3S} < 1$ or, better, $\delta_{2S} + \theta_{S,2S} < 1$. Then the solution $x^\star$ to (2.3) is exact, i.e., $x^\star = x$.

In short, if the UUP holds at about the level S, the minimum $\ell_1$-norm reconstruction is provably exact. The first thing one should notice when comparing
this result with the Fourier sampling theorem is that it is deterministic in the sense that it does not involve any probabilities. It is also universal in that all sufficiently sparse vectors are exactly reconstructed from $\Phi x$. In Section 3.4, we shall give concrete examples of sensing matrices obeying the exact reconstruction property for large values of the sparsity level, e.g., for $S = O(K / \log(N/K))$.

Before we do so, however, we would like to comment on the slightly better version $\delta_{2S} + \theta_{S,2S} < 1$, which is established in [10]. The number $\theta_{S,S'}$ for $S + S' \le N$ is called the $(S, S')$-restricted orthogonality constant and is the smallest quantity such that
$$|\langle \Phi_T c, \Phi_{T'} c' \rangle| \le \theta_{S,S'} \cdot \|c\|_2 \|c'\|_2 \qquad (3.3)$$
holds for all disjoint sets $T, T' \subseteq \{1, \ldots, N\}$ of cardinality $|T| \le S$ and $|T'| \le S'$. Thus $\theta_{S,S'}$ is the cosine of the smallest angle between the two subspaces spanned by the columns in T and T'. Small values of the restricted orthogonality constants indicate that disjoint subsets of covariates span nearly orthogonal subspaces. The condition $\delta_{2S} + \theta_{S,2S} < 1$ is better than $\delta_{2S} + \delta_{3S} < 1$ since it is not hard to see that $\delta_{S+S'} - \delta_{S'} \le \theta_{S,S'} \le \delta_{S+S'}$ for $S' \ge S$ [8, Lemma 1.1].

Finally, now that we have introduced all the quantities needed to state our recovery theorem, we would like to elaborate on the condition $\delta_{2S} + \theta_{S,2S} < 1$. Suppose that $\delta_{2S} = 1$, which may indicate that there is a matrix $\Phi_{T_1 \cup T_2}$ with 2S columns ($|T_1| = S$, $|T_2| = S$) that is rank-deficient. If this is the case, then there is a pair $(x_1, x_2)$ of nonvanishing vectors with $x_1$ supported on $T_1$ and $x_2$ supported on $T_2$ obeying
$$\Phi(x_1 - x_2) = 0 \quad \Longleftrightarrow \quad \Phi x_1 = \Phi x_2.$$
In other words, we have two very distinct S-sparse vectors which are indistinguishable.
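For small problem sizes, the restricted isometry constant of (3.2) can be computed by brute force, which gives a concrete feel for these quantities. A toy sketch (the sizes are illustrative, and this is of course no substitute for the probabilistic constructions discussed later):

```python
import itertools
import numpy as np

def restricted_isometry_constant(Phi, S):
    """Brute-force delta_S: smallest delta with
    (1 - delta)||c||^2 <= ||Phi_T c||^2 <= (1 + delta)||c||^2
    over all column subsets T with |T| <= S.  By eigenvalue
    interlacing it suffices to check subsets of size exactly S."""
    N = Phi.shape[1]
    delta = 0.0
    for T in itertools.combinations(range(N), S):
        cols = Phi[:, list(T)]
        eig = np.linalg.eigvalsh(cols.T @ cols)   # Gram matrix spectrum
        delta = max(delta, abs(eig[0] - 1.0), abs(eig[-1] - 1.0))
    return delta

rng = np.random.default_rng(0)
K, N = 20, 30
Phi = rng.standard_normal((K, N)) / np.sqrt(K)   # Gaussian ensemble, variance 1/K
print("empirical delta_2:", restricted_isometry_constant(Phi, 2))
```

Note that $\delta_S$ is nondecreasing in S, and that an orthonormal set of columns gives $\delta_S = 0$, matching the "approximately behaves like an orthonormal system" reading of (3.2).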
This is why any method whatsoever needs $\delta_{2S} < 1$: for otherwise the model is not identifiable, to use a terminology borrowed from the statistics literature. With this in mind, one can see that the condition $\delta_{2S} + \theta_{S,2S} < 1$ is only slightly stronger than this identifiability condition.

3.3. Recovery of compressible signals. In general, signals of practical interest may not be supported in space or in a transform domain on a set of relatively small size. Instead, they may only be concentrated near a sparse set. For example, a commonly discussed model in mathematical image or signal processing assumes that the coefficients of elements taken from a signal class decay rapidly, typically like a power law. Smooth signals, piecewise smooth signals, images with bounded variations or bounded Besov norms are all of this type [24].

A natural question is how well one can recover a signal that is just nearly sparse. For an arbitrary vector x in $\mathbb{R}^N$, denote by $x_S$ its best S-sparse approximation; that is, $x_S$ is the approximation obtained by keeping the S largest entries of x and setting the others to zero. It turns out that if the sensing matrix obeys the uniform uncertainty principle at level S, then the recovery error is not much worse than $\|x - x_S\|_2$.

Theorem 3.2 ([9]). Suppose that $\delta_{3S} + 3\delta_{4S} < 2$. Then the solution $x^\star$ to (2.3) obeys
$$\|x^\star - x\|_2 \le C \cdot \frac{\|x - x_S\|_1}{\sqrt{S}}. \qquad (3.4)$$
For reasonable values of $\delta_{4S}$, the constant in (3.4) is well behaved; e.g., $C \le 8.77$ for $\delta_{4S} = 1/5$. Suppose further that $\delta_S + 2\theta_{S,S} + \theta_{2S,S} < 1$; then we also have
$$\|x^\star - x\|_1 \le C \|x - x_S\|_1 \qquad (3.5)$$
for some positive constant C. Again, the constant in (3.5) is well behaved.

Roughly speaking, the theorem says that minimizing $\ell_1$ recovers the S largest entries of an N-dimensional unknown vector x from K measurements only. As a side remark, the $\ell_2$-stability result (3.4) appears explicitly in [9], while the '$\ell_1$ instance optimality' (3.5) is implicit in [7] although it is not stated explicitly. For example, it follows from Lemma 2.1 (whose hypothesis holds because of Lemma 2.2 in [8]) in that
paper. Indeed, let T be the set where x takes on its S largest values. Then Lemma 2.1 in [7] gives $\|x^\star \cdot 1_{T^c}\|_1 \le 4 \|x - x_S\|_1$ and, therefore, $\|(x^\star - x) \cdot 1_{T^c}\|_1 \le 5 \|x - x_S\|_1$. We conclude by observing that on T we have
$$\|(x^\star - x) \cdot 1_T\|_1 \le \sqrt{S}\, \|(x^\star - x) \cdot 1_T\|_2 \le C \|x - x_S\|_1,$$
where the last inequality follows from (3.4). For information, a more direct argument yields better constants.

To appreciate the content of Theorem 3.2, suppose that x belongs to a weak-$\ell_p$ ball of radius R. This says that if we rearrange the entries of x in decreasing order of magnitude, $|x|_{(1)} \ge |x|_{(2)} \ge \cdots \ge |x|_{(N)}$, the i-th largest entry obeys
$$|x|_{(i)} \le R \cdot i^{-1/p}, \qquad 1 \le i \le N. \qquad (3.6)$$
More prosaically, the coefficient sequence decays like a power law, and the parameter p controls the speed of the decay: the smaller p, the faster the decay. Classical calculations then show that the best S-term approximation of an object $x \in w\ell_p(R)$ obeys
$$\|x - x_S\|_2 \le C_2 \cdot R \cdot S^{1/2 - 1/p} \qquad (3.7)$$
in the $\ell_2$ norm (for some positive constant $C_2$), and
$$\|x - x_S\|_1 \le C_1 \cdot R \cdot S^{1 - 1/p}$$
in the $\ell_1$-norm. For generic elements obeying (3.6), there are no fundamentally better estimates available. Hence, Theorem 3.2 shows that with K measurements only, we can achieve an approximation error which is as good as the one we would obtain by knowing everything about the signal and selecting its S largest entries.

3.4. Random matrices. Presumably all of this would be interesting if one could design a sensing matrix which would allow us to recover as many entries of x as possible with as few as K measurements. In the language of Theorem 3.1, we would like the condition $\delta_{2S} + \theta_{S,2S} < 1$ to hold for large values of S, ideally of the order of K. This poses a design problem. How should one design a matrix $\Phi$ (that is to say, a collection of N vectors in K dimensions) so that any subset of columns of size about S is approximately orthogonal? And for what values of S is this possible? While it might be difficult to exhibit a matrix which provably obeys the UUP for very large values of S, we know that trivial randomized constructions will do so with overwhelming
probability. We give an example. Sample N vectors on the unit sphere of $\mathbb{R}^K$ independently and uniformly at random. Then the conditions of Theorems 3.1 and 3.2 hold for $S = O(K / \log(N/K))$ with probability $1 - \pi_N$, where $\pi_N = O(e^{-\gamma N})$ for some $\gamma > 0$. The reason why this holds may be explained by some sort of "blessing of high-dimensionality": because the high-dimensional sphere is mostly empty, it is possible to pack many vectors while maintaining approximate orthogonality.

• Gaussian measurements. Here we assume that the entries of the K by N sensing matrix $\Phi$ are independently sampled from the normal distribution with mean zero and variance 1/K. Then if
$$S \le C \cdot K / \log(N/K), \qquad (3.8)$$
$\Phi$ obeys the conditions of Theorems 3.1 and 3.2 with probability $1 - O(e^{-\gamma N})$ for some $\gamma > 0$. The proof uses known concentration results about the singular values of Gaussian matrices [16], [45].

• Binary measurements. Suppose that the entries of the K by N sensing matrix $\Phi$ are independently sampled from the symmetric Bernoulli distribution $P(\Phi_{ki} = \pm 1/\sqrt{K}) = 1/2$. Then it is conjectured that the conditions of Theorems 3.1 and 3.2 are satisfied with probability $1 - O(e^{-\gamma N})$ for some $\gamma > 0$, provided that S obeys (3.8). The proof of this fact would probably follow from new concentration results about the smallest singular value of a subgaussian matrix [38]. Note that the exact reconstruction property for S-sparse signals and (3.7) with S obeying (3.8) are known to hold for binary measurements [7].

• Fourier measurements. Suppose now that $\Phi$ is a partial Fourier matrix obtained by selecting K rows uniformly at random as before, and renormalizing the columns so that they are unit-normed. Then Candès and Tao [7] showed that Theorem 3.1 holds with overwhelming probability if $S \le C \cdot K / (\log N)^6$. Recently, Rudelson and Vershynin [43] improved this result and established $S \le C \cdot K / (\log N)^4$. This result is nontrivial and uses sophisticated techniques from geometric functional analysis and probability in Banach spaces. It is conjectured that $S \le C \cdot K / \log N$ holds.
• Incoherent measurements. Suppose now that $\Phi$ is obtained by selecting K rows uniformly at random from an N by N orthonormal matrix U and renormalizing the columns so that they are unit-normed. As before, we could think of U as the matrix $\Phi \Psi^*$ which maps the object from the $\Psi$-domain to the $\Phi$-domain. Then the arguments used in [7], [43] to prove that the UUP holds for incomplete Fourier matrices extend to this more general situation. In particular, Theorem 3.1 holds with overwhelming probability provided that
$$S \le C \cdot \frac{1}{\mu^2} \cdot \frac{K}{(\log N)^4}, \qquad (3.9)$$
where $\mu := \sqrt{N} \max_{i,j} |U_{i,j}|$ (observe that for the Fourier matrix $\mu = 1$, which gives the result in the special case of the Fourier ensemble above). With $U = \Phi \Psi^*$,
$$\mu := \sqrt{N} \max_{i,j} |\langle \varphi_i, \psi_j \rangle|, \qquad (3.10)$$
which is referred to as the mutual coherence between the measurement basis $\Phi$ and the sparsity basis $\Psi$ [27], [28]. The greater the incoherence of the measurement/sparsity pair $(\Phi, \Psi)$, the smaller the number of measurements needed.

In short, one can establish the UUP for a few interesting random ensembles, and we expect that in the future many more results of this type will become available.

3.5. Optimality. Before concluding this section, it is interesting to specialize our recovery theorems to selected measurement ensembles now that we have established the UUP for concrete values of S. Consider the Gaussian measurement ensemble in which the entries of $\Phi$ are i.i.d. $N(0, 1/K)$. Our results say that one can recover any S-sparse vector from a random projection of dimension about $O(S \cdot \log(N/S))$; see also [18]. Next, suppose that x is taken from a weak-$\ell_p$ ball of radius R for some $0 < p < 1$, or from the $\ell_1$-ball of radius R for $p = 1$. Then we have shown that for all $x \in w\ell_p(R)$,
$$\|x^\star - x\|_2 \le C \cdot R \cdot (K / \log(N/K))^{-r}, \qquad r = 1/p - 1/2, \qquad (3.11)$$
which has also been proven in [20]. An important question is whether this is optimal. In other words, can we find a possibly adaptive set of measurements and a reconstruction algorithm that would yield a better bound than (3.11)? By adaptive, we mean that one could use a sequential measurement procedure where, at each stage, one would
have the option to decide which linear functional to use next based on the data collected up to that stage. It proves to be the case that one cannot improve on (3.11), and we have thus identified the optimal performance.

Fix a class of objects F and let $E_K(F)$ be the best reconstruction error from K linear measurements:
$$E_K(F) = \inf \sup_{f \in F} \|f - D(y)\|_2, \qquad y = \Phi f, \qquad (3.12)$$
where the infimum is taken over all sets of K linear functionals and all reconstruction algorithms D. Then it turns out that $E_K(F)$ nearly equals the Gelfand numbers of the class F, defined as
$$d^K(F) = \inf_V \Big\{ \sup_{f \in F} \|P_V f\| : \operatorname{codim}(V) < K \Big\}, \qquad (3.13)$$
where $P_V$ is the orthonormal projection on the subspace V. Gelfand numbers play an important role in approximation theory; see [40] for more information. If $F = -F$ and $F + F \subseteq c_F F$, then $d^K(F) \le E_K(F) \le c_F\, d^K(F)$. Note that $c_F = 2^{1/p}$ in the case where F is a weak-$\ell_p$ ball.

The point is that we know the approximate values of the Gelfand numbers for many classes of interest. Suppose for example that F is the $\ell_1$-ball of radius R. A seminal result of Kashin [35], improved by Garnaev and Gluskin [31], shows that for this ball the Gelfand numbers obey
$$C_1 \cdot R \cdot \sqrt{\frac{\log(N/K) + 1}{K}} \;\le\; d^K(F) \;\le\; C_2 \cdot R \cdot \sqrt{\frac{\log(N/K) + 1}{K}}, \qquad (3.14)$$
where $C_1, C_2$ are universal constants. Gelfand numbers are also approximately known for weak-$\ell_p$ balls; the only difference is that $((\log(N/K) + 1)/K)^r$ substitutes for $((\log(N/K) + 1)/K)^{1/2}$. Hence, Kashin, Garnaev and Gluskin assert that with K measurements, the minimal reconstruction error (3.12) one can hope for is bounded below by a constant times $(K / \log(N/K))^{-r}$. Kashin's arguments [35] also used probabilistic functionals which establish the existence of recovery procedures for which the reconstruction error is bounded above by the right-hand side of (3.14). Similar types of recovery have also been known to be possible in the literature of theoretical computer science, at least in principle, for certain types of random measurements [1].

In this sense, our results, specialized to Gaussian measurements, are optimal for weak-$\ell_p$ norms. The novelty is that the information
about the object can be retrieved from random coefficients by minimizing a simple linear program (2.3), and that the decoding algorithm adapts automatically to the weak-$\ell_p$ signal class, without knowledge thereof. Minimizing the $\ell_1$-norm is adaptive and nearly gives the best possible reconstruction error simultaneously over a wide range of sparse classes of signals; no information about p and the radius R is required.

4. Robust compressive sampling

In any realistic application, we cannot expect to measure x without any error, and we now turn our attention to the robustness of compressive sampling vis-à-vis measurement errors. This is a very important issue because any real-world sensor is subject to at least a small amount of noise. And one thus immediately understands that, to be widely applicable, the methodology needs to be stable. Small perturbations in the
Problem A: Image Restoration

With the continuing development of society, imaging devices have come into wide use in many fields. In real life, because of the camera's imaging hardware and the surrounding shooting conditions, captured images often contain a certain amount of noise and blur. This blur and noise destroy much of the useful information in an image and make subsequent image analysis considerably harder. The image degradation process can be described by a simple mathematical model: B = I*k + n, where B is the blurred image, I is the sharp image, k is the blur kernel, n is noise, and * denotes the convolution operator.
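The degradation model B = I*k + n can be simulated in a few lines. A minimal sketch that assumes periodic (circular) boundary conditions, so the 2-D convolution can be carried out with the FFT; the 32×32 random image and the 5×5 box kernel are illustrative choices:

```python
import numpy as np

def degrade(I, k, sigma, rng):
    """Simulate B = I * k + n: circular 2-D convolution plus Gaussian noise."""
    # embed the small kernel in an image-sized array, centred at the origin
    Kpad = np.zeros_like(I)
    kh, kw = k.shape
    Kpad[:kh, :kw] = k
    Kpad = np.roll(Kpad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    B = np.real(np.fft.ifft2(np.fft.fft2(I) * np.fft.fft2(Kpad)))
    return B + sigma * rng.standard_normal(I.shape)

rng = np.random.default_rng(0)
I = rng.random((32, 32))            # stand-in for a sharp image
box = np.ones((5, 5)) / 25.0        # 5x5 box blur kernel
B = degrade(I, box, sigma=0.01, rng=rng)
```

When k is a Dirac delta (a single 1), the convolution step is the identity and only the noise term remains, which is exactly the denoising special case.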
When the blur kernel is a Dirac delta function, the problem reduces to plain image denoising. The goal of image restoration is to recover a sharp image from a given blurred or noisy image. Depending on the blur kernel, the problem divides into non-blind deconvolution and blind deconvolution: if the blur kernel is known, the problem is called non-blind deconvolution. Early Wiener filtering techniques could remove blur to some extent, but the restored images often contain ringing artifacts [1,2].
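The Wiener-style frequency-domain deconvolution mentioned above can be sketched as follows; this is a simplified textbook form assuming a known kernel, circular boundary conditions, and a scalar regularization weight `lam` standing in for the noise-to-signal power ratio:

```python
import numpy as np

def wiener_deconv(B, k, lam=1e-3):
    """Restore I from B ~= I * k (circular convolution) via
    I_hat = conj(K) * B_hat / (|K|^2 + lam) in the Fourier domain."""
    Kpad = np.zeros_like(B)
    kh, kw = k.shape
    Kpad[:kh, :kw] = k
    Kpad = np.roll(Kpad, (-(kh // 2), -(kw // 2)), axis=(0, 1))  # centre at origin
    Kf = np.fft.fft2(Kpad)
    If = np.conj(Kf) * np.fft.fft2(B) / (np.abs(Kf) ** 2 + lam)
    return np.real(np.fft.ifft2(If))

# blur a random image with a mild kernel, then restore it
rng = np.random.default_rng(0)
I = rng.random((32, 32))
delta = np.zeros((3, 3)); delta[1, 1] = 1.0
k = 0.6 * delta + 0.4 * np.ones((3, 3)) / 9.0   # mild blur, nonvanishing spectrum
Kpad = np.zeros_like(I); Kpad[:3, :3] = k
Kpad = np.roll(Kpad, (-1, -1), axis=(0, 1))
B = np.real(np.fft.ifft2(np.fft.fft2(I) * np.fft.fft2(Kpad)))
I_hat = wiener_deconv(B, k, lam=1e-8)
```

With no noise and a tiny `lam`, the restoration is essentially exact; with noise, `lam` trades residual blur against noise amplification, and ringing appears at frequencies where |K| is small.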
Researchers subsequently proposed regularization-based methods, most famously total-variation (TV) image denoising/deblurring [3,4], followed by denoising and deblurring methods based on image gradient statistics [5,6].

In practical applications, the blur kernel is usually unknown; the restoration problem is then called blind deconvolution. Since only the blurred image B in the degradation equation is known, while both the sharp image I and the blur kernel k are unknown, recovering a sharp image from the degraded one is an ill-posed problem. To reduce the difficulty of the problem, some researchers proposed deblurring methods based on two or more images [7,8]. These methods mainly use information from a sharp image to guide the deblurring. Although deblurring from two images lowers the difficulty of the problem, these methods require the structural content of the sharp and blurred images to match well, and acquiring such data calls for special hardware; in practice it is hard to obtain a blurred image together with a corresponding sharp image from a camera or mobile phone. To solve the single-image deblurring problem, a large number of methods have been proposed [9,10,11,12]. Most of these methods exploit properties of the blur kernel and of the image, introducing prior knowledge about the kernel and the sharp image to solve the restoration problem.
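The prior-based approach can be caricatured as alternating updates of the image and the kernel, each step a regularized least-squares problem. This deliberately simplified 1-D sketch uses quadratic (Tikhonov) penalties as a stand-in for the sparse gradient priors of the cited methods; all sizes and weights are illustrative:

```python
import numpy as np

def conv_matrix(v, n):
    """Matrix C with C @ u == np.convolve(v, u) for any u of length n."""
    C = np.zeros((len(v) + n - 1, n))
    for j in range(n):
        C[j:j + len(v), j] = v
    return C

def blind_deconv(B, n_I, n_k, iters=30, lam_I=1e-3, lam_k=1e-3):
    """Alternating ridge-regression estimates of image I and kernel k
    from B = I * k (1-D, noiseless toy model)."""
    k = np.ones(n_k) / n_k                      # start from a flat kernel
    for _ in range(iters):
        A = conv_matrix(k, n_I)                 # I-step: fix k, solve for I
        I = np.linalg.solve(A.T @ A + lam_I * np.eye(n_I), A.T @ B)
        A = conv_matrix(I, n_k)                 # k-step: fix I, solve for k
        k = np.linalg.solve(A.T @ A + lam_k * np.eye(n_k), A.T @ B)
        k = np.maximum(k, 1e-12)                # kernels are nonnegative...
        k /= k.sum()                            # ...and sum to one
    A = conv_matrix(k, n_I)                     # final image for the final kernel
    I = np.linalg.solve(A.T @ A + lam_I * np.eye(n_I), A.T @ B)
    return I, k

rng = np.random.default_rng(0)
I_true = rng.random(40)
k_true = np.array([0.2, 0.6, 0.2])
B = np.convolve(k_true, I_true)
I_hat, k_hat = blind_deconv(B, n_I=40, n_k=3)
```

On this toy problem the alternation drives the data residual down; real single-image methods replace the quadratic penalties with the sparse gradient priors discussed above and typically estimate the kernel in a coarse-to-fine manner.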
Proceedings of the Seventh European Conference on Computer Vision, 2002.

Learning a Sparse Representation for Object Detection

Shivani Agarwal and Dan Roth
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
sagarwal,danr@

Abstract. We present an approach for learning to detect objects in still gray images, that is based on a sparse, part-based representation of objects. A vocabulary of information-rich object parts is automatically constructed from a set of sample images of the object class of interest. Images are then represented using parts from this vocabulary, along with spatial relations observed among them. Based on this representation, a feature-efficient learning algorithm is used to learn to detect instances of the object class. The framework developed can be applied to any object with distinguishable parts in a relatively fixed spatial configuration. We report experiments on images of side views of cars. Our experiments show that the method achieves high detection accuracy on a difficult test set of real-world images, and is highly robust to partial occlusion and background variation. In addition, we discuss and offer solutions to several methodological issues that are significant for the research community to be able to evaluate object detection approaches.

1 Introduction

This paper describes an approach for learning to detect instances of object classes in images. The development of reliable object detection systems is important for a wide variety of problems, such as image classification, content-based image retrieval, tracking and surveillance. Much research has been done in this direction. However, the problem remains a largely unsolved one.

The approach presented is based on the belief that the key to finding a solution to this problem lies in finding the right representation. Specifically, we suggest that in order to extract high-level, conceptual information such as the presence of an object in an image, it is essential to transform the
raw, low-level input (in this case, the pixel grayscale values) to a higher-level, more "meaningful" representation that can support the detection process.

One theory of biological vision explains object detection on the basis of decomposition of objects into constituent parts [1–3]. According to this theory, the representation used by humans for identifying an object consists of the parts that constitute the object, together with structural relations over these parts that define the global geometry of the object. Such a representation also forms the basis of some computational theories of object detection [4].

In this paper, we describe an approach that uses an automatically acquired, sparse, part-based representation of objects to learn a classifier that can be used to accurately detect occurrences of a category of objects in natural scenes. As shown in our experiments, the method is highly robust to cluttered backgrounds and partial occlusion.

1.1 Related Work

Most work that has used learning in object detection has been done at the pixel level (e.g., [5–7]). Here we mention a small number of recent studies that have deviated from this, and relate them to our work.

Several approaches represent objects using low-level features. For example, in [8, 9], the object parts are local groupings of common primitive features such as edge fragments, while in [10], rectangle features are used to capture the presence of edges, bars and other simple structures. We believe that the expressivity of a part-based representation can be enhanced by considering distinctive, higher-level parts that are rich in information content and more specific to the object class of interest. Such an approach has been followed in [11], in which separate classifiers are used to detect heads, arms and legs of people in an image, and a final classifier is then used to decide whether or not a person is present. However, the work in [11] requires the object parts to be manually defined and separated for training the individual part
classifiers. In order to build a system that is easily extensible to deal with different objects, it is important that the part selection procedure be automated.

Our method for automatically selecting information-rich parts builds on a technique described in [12], in which interest points are used to collect distinctive parts. In [12], a generative probabilistic model is learned over these parts. Our method, on the other hand, does not need to assume any probabilistic model; we simply collect parts that could be of interest and then directly learn a classifier over them. In addition, the model learned in [12] relies on a very small number of fixed parts, making it potentially sensitive to large variations across images. By learning over a large feature space, we are able to learn a more expressive model that is robust to such variations.

We use a feature-efficient learning algorithm that has been used in a similar task in [13]. However, [13] uses a pixel-based representation, whereas in our approach, images are first transformed (automatically) and represented using a higher-level and more sparse representation. This has implications both in terms of detection accuracy and robustness, and in terms of computational efficiency: the sparse representation of the image allows us to perform operations (e.g., computing relations) that would be prohibitive in a pixel-based representation.

1.2 Problem Specification

We assume some object class of interest. Our goal is to develop a system which, given an image, can detect instances of this object class in the image, and return their locations.
In the process of finding the locations of these object instances, we also require the system to be able to output the number of object instances it finds in the image. This may appear to be a trivial requirement for an object detection system, but as we discuss later in Section 2.4, it imposes a non-trivial condition that the system must satisfy.

1.3 Overview of the Approach

This section outlines our conceptual approach, which consists broadly of four stages.

1. Vocabulary Construction
The first stage consists of building a "vocabulary" of parts that can be used to represent objects in the target class. This is done automatically by using an interest operator to extract information-rich parts from sample images of the object of interest. Similar parts thus obtained are grouped together and treated as a single part.

2. Image Representation
Each input image is transformed and represented in terms of parts from the vocabulary obtained in the first stage. This requires determining which parts from the vocabulary are present in the image; a similarity-based measure is used for this purpose. Each image is then represented as a binary feature vector based on the vocabulary parts present in it and the spatial relations among them.

3. Learning a Classifier
Given a set of training images labeled as positive (object) or negative (non-object), each image is re-represented as a binary feature vector as described above. These feature vectors are then fed as input to a supervised learning algorithm that learns to classify an image as a member or non-member of the object class, with some associated confidence. As shown in our experiments, the part-based representation captured by the feature vectors enables a relatively simple learning algorithm to learn a good classifier.

4. Detection Hypothesis using the Learned Classifier
The final stage consists of using the learned classifier to form a reliable detector. We introduce the notion of a classifier activation map, which is generated by applying the classifier to
various windows in a test image. We then present a method for producing a good detection hypothesis using this activation map.

This framework can be applied to any object that consists of distinguishable parts arranged in a relatively fixed spatial configuration. Our experiments are performed on images of side views of cars; therefore, we will use this object class as a running example throughout the paper to illustrate the ideas and techniques we introduce. We describe each stage of our approach in detail in Section 2. Section 3 presents an evaluation of our method. In this section, we first discuss several methodological issues, including evaluation and performance measurement techniques, that are important for the research community to be able to evaluate object detection approaches. We then present our experimental results. Section 4 concludes with a summary and possible future directions.

2 Approach

This section describes each stage of our approach in detail.

Fig. 1. Left: A sample object image used in the vocabulary construction stage. Center: Interest points detected by the Förstner operator. Crosses denote intersection points; circles denote centers of circular patterns. Right: Patches extracted around the interest points.

2.1 Vocabulary Construction

To obtain an expressive representation for the object class of interest, we require distinctive parts that are specific to the object class but can also capture the variation across different instances of the object class. Our method for automatically selecting such parts is based on the extraction of interest points from a set of representative images of the target object. A similar method has been used in [12].

Interest points in an image are points that have high information content in terms of the local change in signal. They have been used in a variety of problems in computer vision, including stereo matching [14], object recognition and image retrieval [15]. Interest points have typically been designed and used for properties such as rotation
and viewpoint invariance, which are useful in recognizing different views of the same object, and not for the "perceptual" or "conceptual" quality that is required for reliably detecting different instances of an object class. However, by using interest points in conjunction with a redundant representation that is described below, we are able to capture a degree of conceptual invariance that turns out to be sufficient for this task.

We apply the Förstner interest operator [16, 12] to a set of representative images of the object class. This detects intersection points of lines and centers of circular patterns. A vocabulary of representative parts is then constructed by extracting small image patches around the interest points obtained. The goal of extracting a large vocabulary from different instances of the object class is to be able to "cover" new object instances, i.e. to be able to represent new instances using a subset of this vocabulary.

In our experiments, the Förstner operator was applied to a set of 50 representative images of cars, each pixels in size. Figure 1 shows an example of this process.
Patches of size pixels were extracted around each such interest point, producing a vocabulary of 400 parts from the 50 images. This vocabulary is shown in Figure 2.

As seen in Figure 2, several of the parts extracted by this procedure are visually very similar to each other. To facilitate learning, it is important to abstract over these parts by mapping similar parts to the same feature id (and distinct parts to different feature ids). This is achieved via a bottom-up clustering procedure. Initially, each part is assigned to a separate cluster. Similar clusters are then successively merged together until no similar clusters remain. In merging clusters, the similarity between two clusters $C_1$ and $C_2$ is measured by the average similarity between their respective parts:
$$\text{similarity}(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{p_1 \in C_1} \sum_{p_2 \in C_2} \text{similarity}(p_1, p_2),$$
where the similarity between two parts is measured by normalized correlation, allowing for small shifts. Using this technique, the 400 parts were grouped into 270 clusters. While several clusters contained just one element, parts with high similarity were grouped together. Figure 3 shows some of the larger clusters that were formed.

Fig. 2. The vocabulary of 400 parts extracted by the Förstner interest operator.
Fig. 3. Examples of the clusters formed after grouping similar parts together.
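The bottom-up grouping described above can be sketched as follows; here normalized correlation is plain correlation of mean-centred patches, and the merge threshold and patch sizes are illustrative choices (the small-shift tolerance used in the paper is omitted for brevity):

```python
import numpy as np

def ncc(p, q):
    """Normalized correlation between two equally sized patches."""
    p = p.ravel() - p.mean()
    q = q.ravel() - q.mean()
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))

def cluster_parts(parts, threshold=0.8):
    """Average-link agglomerative clustering: start with singleton
    clusters and repeatedly merge the most similar pair whose average
    inter-cluster similarity exceeds the threshold."""
    clusters = [[p] for p in parts]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = np.mean([ncc(a, b)
                               for a in clusters[i] for b in clusters[j]])
                if sim > best:
                    best, pair = sim, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i] += clusters.pop(j)

# two families of near-duplicate patches should form two clusters
rng = np.random.default_rng(0)
base1, base2 = rng.random((8, 8)), rng.random((8, 8))
parts = ([base1 + 0.01 * rng.standard_normal((8, 8)) for _ in range(3)]
         + [base2 + 0.01 * rng.standard_normal((8, 8)) for _ in range(3)])
clusters = cluster_parts(parts)
```

Each resulting cluster would then receive a single feature id, implementing the abstraction over visually similar parts described in the text.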
Parts belonging to the same cluster are treated as a single "conceptual" part by giving them the same feature id; in this way, by using a deliberately redundant representation that uses several similar parts to represent a single conceptual part, we are able to extract a higher-level, conceptual representation from the interest points. Experiments described in Section 3.3 show the role of the clustering procedure.

2.2 Image Representation

Having constructed the part vocabulary above, images are now represented using this vocabulary. This is done by determining which of the vocabulary parts are present in an image, and representing the image as a binary feature vector based on these detected parts and the spatial relations that are observed among them.

Part Detection. Searching a whole image for the presence of vocabulary parts is computationally expensive. To focus our attention on the interesting regions in the image, the Förstner operator is first applied to the image, and patches around the interest points found in the image are highlighted. For each highlighted patch, we perform a similarity-based indexing into the part vocabulary. Each patch is compared to the vocabulary parts using the same similarity measure used earlier for clustering similar parts when constructing the vocabulary. If a sufficiently similar vocabulary part is found, the patch in the image is represented by the feature id corresponding to that vocabulary part. Figure 4 shows examples of the part detection process.

Relations over Detected Parts. Spatial relations among the parts detected in an image are defined in terms of the distance and direction between each pair of parts. Several earlier approaches to recognition have relied on geometric constraints based on the distances and angles between object elements. For example, [17] models objects as polyhedra and, given input data from which object patches can be inferred, performs

Fig. 4. Examples of the part detection process applied to a positive and a negative image
during training.Center images show the patches highlighted by the interest operator;notice how this successfully captures the interesting regions in the image.These highlighted interest patches are then matched with vocabulary parts.In the right images,the highlighted patches are replaced by an arbitrary member of the part cluster(if any)matched by this detection process.These parts, together with the spatial relations among them,form our representation of the image.recognition by searching for interpretations of the surface patches whose inter-patch distances and angles are consistent with those between corresponding model faces.In our approach,we do not assume a known model for the target object class,but attempt to use the object parts observed in object examples,together with the relations observed among them,to learn a model or representation for the object class.The distances and directions between parts in our approach are discretized into bins:in our implementa-tion,the distances are defined relative to the part size and are discretized into5bins, while the directionss are discretized into8different ranges,each covering an angle of .However,by considering the parts in afixed order across the image,the number of direction bins that need to be represented is reduced to4.This gives20possible relations(i.e.distance-direction combinations)between any two parts.The training images(and later,windows in test images)that are converted to feature vectors have a very small number of parts actually present in them: on average,a positive window contains6-8parts,while a negative one contains around 2-4.Therefore,the cost of computing relations between all pairs of detected parts is negligible once the parts have been detected.Feature Vector Each training image(and later,each window in the test images)is represented as a feature vector containing feature elements of two types: (i),denoting the th occurrence of a part of type in the image(inour experiments;each corresponds 
to a particular part cluster); (ii) relation features¹, denoting the n-th occurrence of a particular relation between a part of one type and a part of another type in the image (20 relation types in our implementation; each corresponds to a particular distance-direction combination). These are binary features (each indicating whether or not a part or relation occurs in the image), each represented by a unique identifier. The re-representation of the image is a list of the identifiers corresponding to the features that are active (present) in the image.

¹ In the implementation, a part feature is represented by a unique feature id, which is an integer determined as a function of the part type and the occurrence number. Similarly, a relation feature is assigned a unique feature id that is a function of the relation type, the occurrence number, and the types of the parts involved.

2.3 Learning a Classifier

Using the above feature vector representation, a classifier is trained to classify an image window as car or non-car. We used a training set of 1000 labeled images (500 positive and 500 negative), all of the same size in pixels. The images were acquired from different sources: partly from the World Wide Web, partly with a camera, and partly by taking video sequences of cars in motion and processing the digitized video with a frame grabber. After cropping and scaling to the required size, histogram equalization was performed on all images to reduce sensitivity to changes in illumination conditions.
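One possible encoding of these feature ids can be sketched as follows. The numbering scheme here is hypothetical (the text only states that each id is a function of the types and the occurrence number), but the bin structure follows the 5-distance x 4-direction discretization described in Section 2.2:

```python
import math

N_PART_TYPES = 270  # part clusters in the vocabulary

def part_feature_id(part_type, occurrence):
    """Unique integer id for the n-th occurrence (0-based) of a part type."""
    return occurrence * N_PART_TYPES + part_type

def relation_bin(p1, p2, part_size):
    """Discretize the relation between two detected parts, given as (x, y)
    centers, into one of the 20 distance-direction bins: 5 distance bins
    relative to the part size, and 8 direction ranges of 45 degrees folded
    to 4 by taking the two parts in a fixed order across the image."""
    (x1, y1), (x2, y2) = sorted([p1, p2])  # fixed order across the image
    dist = math.hypot(x2 - x1, y2 - y1) / part_size
    dist_bin = min(int(dist), 4)                      # 5 relative-distance bins
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    dir_bin = min(int(angle // 45.0), 3)              # 4 direction bins after folding
    return dist_bin * 4 + dir_bin                     # 0..19
```

A window's feature vector is then simply the list of such ids that are active, matching the sparse list-of-identifiers representation described above.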
The positive examples contain images of different kinds of cars against a variety of backgrounds, and include images of partially occluded cars. The negative examples include images of natural scenes, buildings and road views. Note that our training set is relatively small and all images in our data set are natural; no synthetic training images are used to simplify the learning problem, as has been done, for example, in [13, 18]. Each of these training images is converted into a feature vector as described in Section 2.2. Note that the potential number of features in any vector is very large, since there are 270 different types of parts that may be present, 20 possible relations between each possible pair of parts, and several of the parts and relations may potentially be repeated. However, in any single image, only a very small number of these possible features is actually active. Taking advantage of this sparseness property, we train our classifier using the Sparse Network of Winnows (SNoW) learning architecture [19, 20], which is especially well-suited for such sparse feature representations.² SNoW learns a linear function over the feature space using a feature-efficient variation of the Winnow learning algorithm; it allows input vectors to specify only active features, and its sample complexity grows linearly with the number of relevant features and only logarithmically with the total number of potential features. A separate function is learnt over this common feature space for each target class in the classification task. In our task, feature vectors obtained from object training images are taken as positive examples for the object class and negative examples for the non-object class, and vice-versa. Given a new input vector, the learned function corresponding to each class outputs an activation value, which is the dot product of the input vector with the learned weight vector, passed through a sigmoid function to lie between 0 and 1. Classification then takes place via a winner-take-all decision
based on these activations (i.e. the class with the highest activation wins). The activation levels have also been shown to provide a robust measure of confidence; we use this property in the final stage as described in Section 2.4 below. Using this learning algorithm, the representation learned for an object class is a linear threshold function over the feature space, i.e. over the part and relation features.

2.4 Detection Hypothesis Using the Learned Classifier

Having learned a classifier³ that can classify images as positive or negative, cars can be detected in an image by moving a window over the image and classifying each such window as positive or negative. However there is one issue that needs to be resolved in doing this.

² Software for SNoW is freely available at /˜cogcomp/
³ The SNoW parameters we used to train the classifier were 1.25, 0.8, 4.0 and 1.0 respectively for the promotion and demotion rates, the threshold and the default weight.

It is clear that in the vicinity of an object in the image, several windows will be classified as positive, giving rise to multiple detections corresponding to a single object in the scene. The question that arises is how the system should be evaluated in the presence of these multiple detections. In much previous work in object detection, multiple detections output by the system are all considered to be correct detections (provided they satisfy the criteria for a correct detection; this is discussed later in Section 3.1).
However, going back to the requirements specified in Section 1.2, such a system fails not only to locate the objects in the image, but also to form a correct hypothesis about the number of object instances present in the image. Therefore in using a classifier to perform detection, it is necessary to have another processing step, above the level of the classifier output, to produce a coherent detection hypothesis.

A few studies have attempted to develop such a processing step. [10] uses a simple strategy: detected windows are partitioned into disjoint (non-overlapping) groups, and each group gives a single detection. While this may be suitable for the face detection database used there, in general, imposing a zero-overlap constraint on detected windows may be too strong a condition.

As a more general solution to this problem, we introduce the notion of a classifier activation map (CAM), which can be generated from any classifier that can produce an activation or confidence value in addition to a binary classification output. Given a test image, a window is moved over the image and the learned classifier is applied to each such window (represented as a feature vector) in the image. However, instead of using the binary classification output of the classifier, we take advantage of the activation values it produces. Negatively classified windows are mapped to a zero activation value. Windows classified as positive are mapped to the activation value produced by the classifier. This produces a map with high activation values at points where the classifier has a high confidence in its positive classification. Our algorithm analyzes this map and finds the peaks at which the activation is highest in some neighbourhood, giving the desired locations of the target object. Different neighbourhoods can be used for this purpose. Our algorithm searches for activation peaks in a rectangular neighbourhood, defined by two parameters hNbhd and wNbhd, so that a point (i, j) is considered to be an object location if (i) activation(i, j) ≥ activation(i', j') for all points (i', j') in the hNbhd × wNbhd rectangular neighbourhood centred at (i, j), and (ii) no other point in this neighbourhood has already been declared an object location (the map is analyzed in row-wise order).

The neighbourhood size determines the extent of overlap that is allowed between detected windows, and can be chosen appropriately depending on the object class and window size. In our experiments, we used hNbhd = 40 pixels, wNbhd = 35 pixels. There is a trade-off between the number of correct detections and the number of false detections. An activation threshold is introduced in the algorithm to determine where to lie on this trade-off curve. All activations in the CAM that fall below the threshold are set to zero. Lowering the threshold increases the correct detections but also increases the false positives; raising the threshold has the opposite effect. Figure 5 shows a thresholded CAM generated from a sample test image, and the associated detection result.

Fig. 5. The center image shows the classifier activation map (CAM) generated from the test image on the left using an activation threshold of 0.85: all activations below 0.85 have been set to zero. The activations in the map have been scaled by 255 to produce the image; the bright white peak corresponds to the highest activation, producing the detection result shown in the right image. The method prevents the system from producing multiple detections for a single object.

3 Evaluation

To evaluate the performance of the system, we collected a set of 170 test images containing 200 cars in all. These are all distinct from the training set. The images were acquired in the same way as the training images. They are of different resolutions and include instances of partially occluded cars, cars that have low contrast with the background, and images with highly textured backgrounds. Before presenting our results, we discuss some important methodological issues related to evaluation and performance measurement methods in object detection.

3.1 Evaluation Scheme

Past work on object detection has often emphasized the need
for standardized data sets to allow a fair comparison of different methods. Although several studies have reported results on common data sets, it is often not clear how the performance of the different methods has been evaluated on these data sets. Problems such as image classification have a naturally defined evaluation criterion associated with them. However, in object detection, there is no such natural criterion: correct detections and false detections can be defined in different ways which may give rise to different results. To ensure that the comparison between different methods is truly fair, it is essential that the same evaluation scheme be used. Therefore in addition to standard data sets for object detection, we also need appropriate standardized evaluation schemes to be associated with them. We describe in detail here the scheme used to evaluate the accuracy of our system.⁴

For each car in the test images, we manually determined the location (i_true, j_true) of the best window containing the car. For a location (i_det, j_det) output by the detector to be considered a correct detection, we require three conditions to be satisfied: (i) |i_true − i_det| ≤ δ_height, (ii) |j_true − j_det| ≤ δ_width, and (iii) the windows at the detected and true locations have an overlap of at least θ_area. The parameters δ_height, δ_width and θ_area are chosen appropriately depending on the target object class and window size, such that these conditions together ensure that no false detection is counted as a correct detection. In our experiments, fixed values were used for δ_height and δ_width (in pixels) and for θ_area. In addition, if two or more detected windows satisfy these conditions for the same object location, only one is considered a correct detection; the others are counted as false positives. (See Section 2.4 for discussion.)

⁴ Both the data set we have used and the evaluation routine are available at /˜cogcomp/.

3.2 Measuring Performance

In measuring the accuracy of a detection system, the two quantities of interest are clearly the correct detections which we wish
to maximize, and the false detections that we wish to minimize. Different methods for reporting and interpreting results present this trade-off in different ways, and again it is necessary to identify a suitable method that captures the trade-off correctly in the context of the object detection problem.

One method for expressing this trade-off is the Receiver Operating Characteristics (ROC) curve. This plots the positive detection rate vs. the false detection rate, where

Positive Detection Rate = Number of correct positives / Total number of positives in data set   (1)

False Detection Rate = Number of false positives / Total number of negatives in data set   (2)

However, note that in the problem of object detection, the number of negatives in the data set (required in (2) above) is not well-defined. The number of negative windows evaluated by the detection system has commonly been used for this purpose. However, there are two problems with this approach. The first is that this measures the accuracy of the system as a classifier, not as a detector. Since the number of negative windows is typically very large compared to the number of positive windows, this allows a large one-sided error: a large absolute number of false detections appears to be small under this measure. The second and more fundamental problem is that the number of negative windows evaluated is not a property of the input to the problem, but rather a property internal to the implementation of the detection system.

When a detection system is put into practice, we are interested in knowing how many of the objects it detects, and how often the detections it makes are false. This trade-off is captured more accurately by a variation of the recall-precision curve, where

Recall = Number of correct positives / Total number of positives in data set = Positive Detection Rate above   (3)

Precision = Number of correct positives / (Number of correct positives + Number of false positives)   (4)

The first quantity of interest, namely the proportion of objects that are detected, is given by
the recall. The second quantity of interest, namely the number of false detections relative to the total number of detections made by the system, is given by

1 − Precision = Number of false positives / (Number of correct positives + Number of false positives)   (5)

Plotting recall vs. (1 − precision) therefore expresses the desired trade-off.
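As a concrete illustration, the correctness test of Section 3.1 and the measures in Equations (3)-(5) can be sketched as follows. The window size and the tolerance defaults (delta_h, delta_w, theta_area) below are illustrative placeholders, since the exact settings are not recoverable from this text:

```python
def is_correct_detection(true_loc, det_loc, win_h, win_w,
                         delta_h=14, delta_w=28, theta_area=0.5):
    """Check the three conditions of Section 3.1 for a detected window.
    Locations are (row, col) of the window's top-left corner; the tolerance
    defaults are illustrative values only, not the paper's settings."""
    (ti, tj), (di, dj) = true_loc, det_loc
    if abs(ti - di) > delta_h or abs(tj - dj) > delta_w:
        return False
    # Overlap of two equal-size axis-aligned windows, relative to window area.
    overlap = max(0, win_h - abs(ti - di)) * max(0, win_w - abs(tj - dj))
    return overlap / (win_h * win_w) >= theta_area

def detection_scores(n_correct, n_false, n_objects):
    """Recall, precision and false-detection rate (1 - precision), as in
    Equations (3)-(5)."""
    recall = n_correct / n_objects
    precision = n_correct / (n_correct + n_false)
    return recall, precision, 1.0 - precision
```

For example, a detector that finds 150 of the 200 test-set cars with 50 false alarms has recall 0.75 and precision 0.75, so a quarter of its detections are false.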
Sparse Discriminant Analysis

Abstract: Manifold-embedding dimensionality reduction methods suffer from two problems: the neighborhood graph is constructed in the original high-dimensional space, where it works poorly for subsequent processing, and it is difficult to assign appropriate values to the neighborhood size and the heat-kernel parameter. To address these problems, a sparse discriminant analysis algorithm (SEDA) is proposed. First, a sparse graph is constructed by sparse representation to preserve the global information and geometric structure of the data, overcoming the shortcomings of manifold-embedding methods. Second, using sparsity preservation as a regularization term together with the Fisher discriminant criterion, the optimal projection is obtained. Experimental results on a set of high-dimensional data sets show that SEDA is a highly effective semi-supervised dimensionality reduction method.
Keywords: discriminant analysis; sparse representation; neighborhood graph; sparse graph

Sparse Discriminant Analysis
CHEN Xiao-dong¹, LIN Huan-xiang²
1. School of Information and Engineering, Zhejiang Radio and Television University, Hangzhou Zhejiang 310030, China;
2. School of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou Zhejiang 310023, China

Abstract: Methods for manifold embedding have the following issues: on one hand, the neighborhood graph is constructed in the high-dimensional original space, where it tends to work poorly; on the other hand, appropriate values for the neighborhood size and heat-kernel parameter involved in graph construction are generally difficult to assign. To address these problems, a novel semi-supervised dimensionality reduction algorithm called sparse discriminant analysis (SEDA) is proposed. Firstly, SEDA sets up a sparse graph to preserve the global information and geometric structure of the data based on sparse representation. Secondly, it applies both the sparse graph and the Fisher criterion to seek the optimal projection. Experiments on a broad range of data sets show that SEDA is superior to many popular dimensionality reduction methods.

Key words: discriminant analysis; sparse representation; neighborhood graph; sparse graph

0 Introduction

In applications such as information retrieval, text classification, image processing and biological computing, the data encountered are all high-dimensional.
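The first step of SEDA, building a sparse graph in which each sample is reconstructed from all the others by an l1-regularized least-squares fit, might be sketched as follows. The abstract does not specify a solver, so plain ISTA is used here, and `lam` is an illustrative regularization weight:

```python
import numpy as np

def sparse_graph(X, lam=0.1, n_iter=500):
    """Sparse-representation graph: for each sample x_i, solve
    min_w 0.5*||x_i - D w||^2 + lam*||w||_1, where D holds all other
    samples as columns, via ISTA.  X is (n_samples, n_features); the
    returned W holds the reconstruction weights, with W[i, i] = 0."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        D = np.delete(X, i, axis=0).T           # dictionary of the other samples
        x = X[i]
        L = np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
        w = np.zeros(n - 1)
        for _ in range(n_iter):
            g = D.T @ (D @ w - x)               # gradient of the quadratic term
            w = w - g / L
            w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft threshold
        W[i, np.arange(n) != i] = w
    return W
```

The nonzero pattern of W then plays the role of the neighborhood graph, with no neighborhood-size or heat-kernel parameter to tune, which is exactly the motivation stated in the abstract.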
Computer Simulation, Vol. 37, No. 12, December 2020. Article ID: 1006-9348(2020)12-0153-04

Defect Recognition Method of Weld Visual Image Based on Sparse Representation
WANG Fei, ZHANG Su-lan
(College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 048000, China)

Abstract: Interference pixels in the defect information of weld visual images reduce the reliability of welded structural parts. To address this, a defect recognition method for weld visual images based on sparse representation is proposed. The defect image is divided into three parts: target, background and noise; the defect target is transformed into an over-complete representation, and a sparse representation model is established to determine the weld location. Median filtering is used to remove the overall noise of the defect image, and the image is binarized and sparsified to remove interference pixels outside the weld visual area, after which global and local histogram features of the weld image are extracted. The weld visual image features are fed into a network with random Dropout and ReLU activation functions, which outputs the welding defect type, ensuring the validity of the defect recognition results. Simulation results show that the method achieves high defect recognition accuracy and relatively clear recognition results.
Keywords: sparse representation; weld visual image; defect recognition; binary processing
CLC number: TP327    Document code: B

Defect Recognition Method of Weld Visual Image Based on Sparse Representation
WANG Fei, ZHANG Su-lan
(College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan Shanxi 048000, China)

ABSTRACT: Owing to the interference pixels of weld visual image defect information, the reliability of welded structural parts is reduced. In this regard, a defect recognition method of weld visual image based on sparse representation is put forward. The defect image was divided into target, background and noise. The defect target was transformed into a super-complete one, and the sparse representation model was established to determine the weld location. The overall noise of the defect image was eliminated by utilizing a median filter. The image was binarized and sparsified to eliminate the interference pixels in the non-weld visual area, thus obtaining the global and local histogram features of the weld image. The weld visual image features were input into random Dropout and ReLU activation functions to output the type of welding defects, ensuring the validity of defect recognition results. The simulation results show that this method has high accuracy and a clear recognition effect.

KEYWORDS: Sparse representation; Weld visual image; Defect recognition; Binary processing

1 Introduction

Visual sensing based on industrial CCD cameras has been widely applied in the welding field, with breakthrough progress achieved in weld seam tracking control in particular; weld visual images have emerged from this development [1].
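A minimal sketch of the preprocessing steps named in the abstract (median filtering, binarization, and a global histogram feature), using plain NumPy. The 3x3 filter size, mean-based threshold and 16-bin histogram are illustrative choices, not the paper's settings:

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter (edge pixels kept as-is) to suppress isolated
    noise pixels, as in the abstract's denoising step."""
    out = img.copy()
    h, w = img.shape
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i, j] = np.median(img[i-1:i+2, j-1:j+2])
    return out

def binarize(img, thresh=None):
    """Binarize the denoised image; with no threshold given, use the mean
    intensity (a simple stand-in for whatever threshold the paper uses)."""
    if thresh is None:
        thresh = img.mean()
    return (img > thresh).astype(np.uint8)

def global_histogram(img, bins=16):
    """Normalized global gray-level histogram feature vector for an 8-bit image."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    return hist / hist.sum()
```

The resulting histogram vector (optionally concatenated with per-block local histograms) would then serve as the feature input to the classification stage described in the abstract.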