Support Vector Machines for uncertainty region detection
USING SUPPORT VECTOR MACHINES FOR LANE-CHANGE DETECTIONHiren M. Mandalia Dario D. SalvucciDrexelUniversityDrexel UniversityPhiladelphia, PA Philadelphia, PADriving is a complex task that requires constant attention, and intelligent transportation systems thatsupport drivers in this task must continually infer driver intentions to produce reasonable, safe responses.In this paper we describe a technique for inferring driver intentions, specifically the intention to changelanes, using support vector machines (SVMs). The technique was applied to experimental data from aninstrumented vehicle that included both behavioral data and environmental data. Comparing these resultsto recent results using a novel “mind-tracking” technique, we found that SVMs outperformed earlieralgorithms and proved especially effective in early detection of driver lane changes.INTRODUCTIONIntelligent transportation systems (ITS) have been evolving to support drivers perform the vast array of typical driving tasks. Even with the many types of ITS systems in existence and under development, one commonality among these systems is the need for inferring driver intentions — that is, detecting what a driver is trying to do and helping them achieve that goal more safely and easily. One particular driver intention is that of changing lanes, which occurs ubiquitously in common driving environments — for instance, in highway driving, which accounts for approximately 70% of vehicle miles on American roadways (Federal Highway Administration, 1998). Much of previous work has focused on the decision-making and gap acceptance aspects of lane changing (e.g., Ahmed et al, 1996; Gipps, 1986), while other work has focused on development of real-world lane-change warning systems (e.g., Talmadge, Chu & Riney, 2000).Several recent studies have examined the behavioral nature of lane changes. Suzanne Lee et al. (2004) evaluated lane changes to provide a valuable insight into the severity of lane changes under naturalistic driving conditions, while Tijerina (1999) discussed operational and behavioral issues in evaluating lane change crash avoidance systems (CAS). Other studies (e.g., Tijerina, 2005) focused in particular on eye-glance behaviors during lane changes. Very little work has been done on recognizing driving maneuvers, especially critical ones like lane changing.The few studies on recognizing, or detecting, lane-change maneuvers include work from Pentland and Liu (1999), Kuge et al. (2000), Olsen (2003), and Salvucci (2004). The first two approaches were based on the concept that human behavior is made up of a sequence of internal ‘mental’ states that was not directly observable. These approaches used a technique called hidden Markov models (common in speech recognition), probabilistic models powered by robust expectation-maximization methods. Pentland and Liu (1999) reported that their recognition system could successfully recognize lane changes 1.5 seconds into the maneuver. Kuge et al. (2000) reported results with a “continuous” (point-by-point) recognition system; however, their system only uses steering-based features and has no knowledge of the surrounding environment. Olsen (2003) performed comprehensive analysis of lane changes in which a slow lead vehicle was present and proposed logistic regression models to predict lane changes. 
However his study only provides descriptive results and an insight into the variables that influence lane changes.Salvucci (2004) proposed a mind-tracking approach that uses a model of driver behavior to infer driver intentions. The mind tracking system essentially isolates a temporal window of driver data and extracts its similarity to several virtual drivers that are created probabilistically using a cognitive model. All of these approaches left something to be desired for purposes of a point-by-point detection system that could generate accurate detection with each data point.In this paper we describe a method of detecting lane change intentions using a technique known as support vector machines (SVMs). The paper begins with an overview of SVMs including details of how they are applied to the problem of detecting lane changes. The paper continues with an application study in which the SVM technique was applied to data collected from an instrumented vehicle in a real-world driving task, demonstrating that SVMs can successfully detect lane changes with high accuracy and low false-alarm rates, and that the technique performs very well in comparison with previously developed techniques for this problem.LANE-CHANGE DETECTIONUSING SUPPORT VECTOR MACHINES Support vector machines represent learning algorithms that can perform binary classification (pattern recognition) and real valued function approximation (regression estimation) tasks (see, e.g., Cortes & Vapnik, 1995). SVMs have been widely used for isolated handwritten digit recognition, object recognition, speaker identification, and face detection in images and text categorization. This section reviews the basic functioning of SVMs, motivation for using SVMs for lane change detection, and training of lane changes. The next section reports results in terms of the prediction accuracy (true positive rates and false positive rates) and other measures.Brief Overview of Support Vector MachinesSupport vector machines are based on statistical learning theory that uses supervised learning. In supervised learning, amachine is trained instead of programmed using a number of training examples of input-output pairs. The objective of training is to learn a function which best describes the relation between the inputs and the outputs. In general, any learning problem in statistical learning theory will lead to a solution of the typeEq (1) where the x i, i = 1,…, l are the input examples, K a certain symmetric positive definite function named kernel, and c i a set of parameters to be determined from the examples. For details on the functioning of SVM, readers are encouraged to refer to (Cortes & Vapnik, 1995). In short, the working of SVM can be described as statistical learning machines that map points of different categories in n-dimensional space into a higher dimensional space where the two categories are more separable. The paradigm tries to find an optimal hyperplane in that high dimensional space that best separates the two categories of points. Figure 1 shows an example of two categories of points separated by a hyperplane. Essentially, the hyperplane is learned by the points that are located closest to the hyperplane which are called ‘support vectors’. There can be more than one support vector on each side of the plane.Figure 1: Points separated by a hyperplane.(Source: )Motivation for using SVMsAssessing driver state is a substantial task, complicated by the various nuances and idiosyncrasies that characterize human behavior. 
Measurable components indicative of driver state may often reside in some high dimensional feature space (see, e.g., Wipf & Rao, 2003). Researchers have found that SVM have been particularly useful for binary classification problems. SVMs offer a robust and efficient classification approach for the problem of lane change detection because they map the driving data to high dimensional feature space where linear hyperplanes are sufficient to separate two categories of data points. A correct choice of kernel and data representation can lead to good solutions.Issues Relevant to Lane-Change DetectionKernel selection. A key issue in using the learning techniques is the choice of the kernel K in Eq (1). The kernel K(x i, x j) defines a dot product between projections of the two inputs x i and x j, in the feature space, the features been {Φ1(x), Φ2(x),…, ΦN(x} with N the dimensionality of the Reproducing Kernel Hilbert Space (see, e.g., Cortes & Vapnik, 1995). Therefore the choice is closely related to the choice of the “effective” representation of the data, e.g. the image representation in a vision application. The problem of choosing the kernel for the SVM, and more generally the issue of finding appropriate data representations for learning, is an important one (e.g., Evgeniou, Pontil, & Poggio, 2000). The theory does not provide a general method for finding “good” data representations, but suggests representations that lead to simple solutions. Although there is no general solution to this problem, several experimental and theoretical works provides insights for specific applications (see, e.g., Evgeniou, Pontil, & Poggio, 2000; Vapnik, 1998). However recent work from researchers (e.g., Lanckriet et al., 2004) has shown that for a given data representation there is a systematic method for kernel estimation using semi-definite programming. Estimating the right kind of kernel remains an important segment of the future work with this kind of work. At this point of time the available data was tested against different types of kernel functions to know the performance of each of them experimentally. Some of the kernel-types tested were Linear, Polynomial, Exponential and Gaussian. However, it was observed that all the kernels performed as good as or worse than Linear kernel. All the final results with SVM classification are therefore analyzed with Linear kernel.Choosing a time window. One issue with using SVM for lane change detection is that lane changes do not have fixed time length. Thus, longer lane changes see a smooth transition in feature values like steering angle, lane position, acceleration, etc. whereas shorter ones have a relatively abrupt transition. Moreover, one feature may be a function of one or many other features. The exact inter-dependency between features or within features themselves is often unclear. In the domain of detecting driving maneuvers like lane changes, the change in the features with respect to time and their inter-dependency is more critical than the individual values of the features. For example, studies have shown that during a lane change drivers exhibit an expected sine-wave steering pattern except for a longer and flatter second peak as they straightened the vehicle (e.g., Salvucci & Liu, 2002). The figure below shows such a pattern of steering against time.Drivers first steer toward the destination and then back to center to level the vehicle on a steady path to the destination lane. 
Then, in a continual smooth movement, drivers steer in the other direction and back to center to straighten the vehicle.Figure 2: Steering angle displays a sine-wave like pattern(Source: Salvucci & Liu, 2002)Such patterns can only be observed (by humans) or learned (by machines) when a reasonable sized window of samples is observed. Thus for all practical purposes when using the SVM for a lane change or lane keeping, an entire window of samples is input instead of a single sample. A ‘sample’ refers to the set of values of features at that instance of time. The data stream is broken down into fixed-size smaller windows. The size (time length) of the window that adequately captures the patterns within the features is a free parameter and therefore left to experimentation. Various window sizes were analyzed between 1 second to 5 seconds. The results with different window sizes are reported in the results. About 2/3rd of driving data was used for training SVMs and remaining 1/3rd for testing purposes.Another issue in training is assigning the correct label (LC or LK) to each window. The lane change definition used to classify training data was that a lane change starts when the vehicle moves towards the other lane boundary and never returns i.e. the final shift of the driver towards the other lane. Any reversal cancels the lane change. Using the definition of lane change, each sample within a window can be labeled positive (LC) or negative (LK) but not the entire window. The last sample within the window is used to label the entire window, since the last sample offers the latest information about the driving state. Also in order to predict the driving state at any time only the preceding samples can be used.Figure 3: Moving window of constant sizeAs shown in Figure 3 a single window of size N is defined by N samples at times {t 0, t 1,…, t N-1}. The label of sample at t N-1 is assigned to the entire window. A moving window is used as shown in the figure i.e. whenever a new sample is obtained, it is added to the moving window and the last sample is dropped, maintaining a constant size window.Choosing a data representation . The general problem of finding an optimal, or even reasonable, data representation is an open one. One approach to data representation is to input the entire training window to the SVM with the actual values of thefeatures (e.g., Wipf & Rao, 2003) and feed them to Relevance Vector Machines (RVM). In such an approach, an input vector corresponding to a single window of size N samples looks like[steerangle(t 0),…,steerangle(t N-1),speed(t 0),…,speed(t N-1),…]which in general is equivalent to[F 1(t 0),…,F 1(t N-1), F 2(t 0),…,F 2(t N-1),…,F M (t 0),…,F M (t N-1)]where F x (t i ) represents the value of feature F x at time t i . Such a vector is used to train Relevance Vector Machines (RVM). RVM is a probabilistic sparse kernel model identical in functional form to the SVM (e.g., Cortes & Vapnik, 1995).Embedded in this formulation is the fact that temporal variations in maneuver execution are handled implicitly by RVMs. However, inherent functionalities of RVMs or SVMs would fail to observe any dependencies or relationship between values of a feature over a period of time which could be critical. Also this formulation results in abnormally long sized input vector leading to additional computational complexity.An alternative approach is suggested to explicitly include some form of dependency/relationship measure between feature values rather than the original values. 
As argued previously it is the change pattern in the feature values which is more critical than the values themselves. Variance of a feature over all the samples in the block was used to replace the original values of that feature. Variance of a feature is given byEq (2)where N is the window size (number of samples), µx is the mean of the feature F x within the window and x i is the feature value of the i th sample. Thus variance effectively captures the change in the feature values which is very critical to learn specific patterns. Variance of features is particularly useful in reducing the effects of noise in the data. Another reason that encouraged the use of variance is the reduced size of input vectors that were used for final training as explained in the following section.Figure 4 explains the two data representations that were experimented using variance. A single window of size N is shown on the left hand side of the two representations where each feature F x has N values. In Data Representation I (non-overlapping), a single window is divided into two equal halves and the variance of features from each half is used. Thus for every N values of a feature only two values of variances from the first and second half of the window contributes to the input vector. The rationale for splitting a single window in two halves is the need to capture multiple change patterns within a window. For example, features like lane position or steer angle might change multiple times within a window but a single variance across the window will reflect only the overall change.Data Representation II (overlapping) shown in the figure uses a similar structure with the difference that the two halvesFigure 4: Data Representationoverlap with each other. A window of size N is divided into three equal parts say a, b, c. The first half will consist of the first two parts (a, b) and the second half will consist of last two parts. Overlapping structure was tested to account for the fact that the division of a lane change may not be equal and the changes may also happen in the middle of the window. Experiments were performed with each representation.Choosing a feature set. While training the SVMs it is very important that only the most effective set of features, rather than set of all possible features, is used. Features that display significant differences during a lane change against normal driving are the critical ones. Features that do not show enough predictability and vary randomly irrespective of a lane change should be avoided as they degrade the discrimination power of SVM.With no prior preference to any feature, initial experiments included all features. Later, only selected combinations were employed to choose the minimal feature set that would produce the best classification. Various combinations of features were tested. However, only few selected combinations generated good results. Best classification results were obtained with only four features with lane positions at different distances. Such an outcome was expected since lane position demonstrated the most consistent pattern among all the other features. One can argue that steering angle should also be a strong candidate. However, steering angle displays similar patterns both during lane change and while driving through a curved road which led to high number of false positives.Time to collision (TTC) (see, e.g., Olsen, 2003) could be an important feature however; the available data did not have speed values of the lead vehicles. 
Another limitation was the lack of eye movement data, which could prove critical for lane-change detection (see, e.g., Olsen, 2003). The current method of feature selection is based purely on experiments; however, the use of more systematic statistical techniques such as t-tests, recursive feature elimination, and maximum-likelihood tests is planned as future work.
The feature sets in Table 1 were selected to include different combinations of features that would generate significant patterns in their variances. Set 1 contains the most basic relevant features. Set 2 measures the effect of lead-car distance. Set 3 includes longitudinal and lateral information about the car, while Set 4 measures the significance of steering angle. Set 5 contains only lane-position values, since experiments indicated that these were the most significant. Note that the lane-position features are indexed by longitudinal distance ahead of the driver's vehicle in meters; e.g., "Lane position 30" represents the lane position at the point 30 m directly ahead of the vehicle.
Generating continuous recognition. To simulate a continuous recognition scheme we use a moving window of size N (block size) as shown in Figure 3. The window is slid one sample at a time across the data, and the classification label predicted by the SVM for each window is used as the label for the last sample in the window. Consistency among classification scores is one important advantage of this scheme: if the previous and next few samples are classified positive, the probability of the current sample being classified negative is very low.
Table 1: Feature sets
Set 1: Acceleration, Lane position 0, Lane position 30, Heading
Set 2: Acceleration, Lane position 0, Lane position 30, Heading, Lead-car distance
Set 3: Acceleration, Lane position 0, Lane position 20, Lane position 30, Heading, Longitudinal acceleration, Lateral acceleration
Set 4: Acceleration, Lane position 0, Lane position 30, Heading, Steering angle
Set 5: Lane position 0, Lane position 10, Lane position 20, Lane position 30
EVALUATION STUDY
We performed an evaluation study applying the SVM technique to lane-change detection while, at the same time, exploring a subset of the space of possible data representations and feature sets. For this study, we used the real-world data set collected by T. Yamamura and N. Kuge at Nissan Research Center in Oppama, Japan (see Salvucci, Mandalia, Kuge, & Yamamura, in preparation). Four driver subjects each drove for one hour in a Japanese multi-lane highway environment through dense and smooth traffic. The drivers were given no specific goals or instructions and were allowed to drive as they normally would.
The results for the various combinations of window size, feature set, and non-overlapping vs. overlapping representation are shown in Tables 2 and 3. The average true positive rate at a 5% false alarm rate [1] is very high across all feature sets, and results generally improve as the window size decreases. The overlapping representation with all the lane-position features (Set 5) generates the best recognition result, with 97.9% accuracy [2]. The system generates a continuous recognition, marking each sample with a positive (LC) or negative (LK) label: out of every 100 actual lane-changing samples about 98 are detected correctly, whereas out of every 100 lane-keeping samples only 5 are incorrectly predicted.
The recognition system was also analyzed for accuracy with respect to time, to calculate how much time elapses from the start of a lane change until the point of detection.
The system was able to detect about 87% of all true positives within the first 0.3 seconds from the start of the maneuver.
DISCUSSION
The SVM approach to detecting lane changes worked well with real-world vehicle data. A richer data set with features such as lead-car velocity and eye movements should lead to even better accuracy. Comparing the results to previous algorithms, Pentland and Liu (1999) achieved 95% accuracy, but only after 1.5 seconds into the maneuver, whereas our system classifies all data points from the start of the maneuver with 97.9% accuracy. Kuge et al. (2000) achieved 98% accuracy in recognizing entire lane changes (as opposed to point-by-point accuracy). Salvucci's (2004) results are more directly comparable: his mind-tracking algorithm achieved approximately 87% accuracy at a 5% false alarm rate. Thus, the SVM approach outperforms previous approaches and offers great promise as a lane-change detection algorithm for intelligent transportation systems.
[1] False alarm rate = number of false positives × 100 / total number of negatives (LK samples).
[2] Accuracy = number of true positives × 100 / total number of positives (LC samples).
Table 2: Accuracy (%) by window size (non-overlapping)
Window  Set 1  Set 2  Set 3  Set 4  Set 5
5 s     83.5   90.0   91.2   90.0   91.1
4 s     88.1   91.3   92.5   91.5   92.2
2 s     89.3   93.0   97.7   94.0   97.4
1.5 s   85.2   93.7   96.3   93.2   97.7
1.2 s   96.8   96.0   96.0   96.0   96.7
0.8 s   86.3   91.8   90.0   86.6   94.6
Table 3: Accuracy (%) by window size (overlapping)
Window  Set 1  Set 2  Set 3  Set 4  Set 5
5 s     87.1   86.9   89.0   88.1   87.8
4 s     89.9   90.7   91.5   89.5   91.1
2 s     96.2   96.2   97.8   95.8   97.3
1.5 s   94.5   94.6   97.6   93.3   97.5
1.2 s   93.8   93.7   95.0   93.2   97.9
0.8 s   97.0   95.5   96.0   96.4   96.7
ACKNOWLEDGMENTS
This work was supported by National Science Foundation grant #IIS-0133083.
REFERENCES
Ahmed, K. I., Ben-Akiva, M. E., Koutsopoulos, H. N., & Mishalani, R. G. (1996). Models of freeway lane changing and gap acceptance behavior. In J.-B. Lesort (Ed.), Transportation and Traffic Theory. New York: Elsevier.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273-295.
Evgeniou, T., Pontil, M., & Poggio, T. (2000). Statistical learning theory: A primer. International Journal of Computer Vision, 38, 9-13.
Federal Highway Administration (1998). Our nation's highways: Selected facts and figures (Tech. Rep. No. FHWA-PL-00-014). Washington, DC: U.S. Dept. of Transportation.
Gipps, P. G. (1986). A model for the structure of lane-changing decisions. Transportation Research Part B, 5, 403-414.
Kuge, N., Yamamura, T., & Shimoyama, O. (2000). A driver behavior recognition method based on a driver model framework. Intelligent Vehicle Systems (SP-1538).
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27-72.
Lee, S. E., Olsen, E. C. B., & Wierwille, W. W. (2004). Naturalistic lane change field data reduction, analysis, and archiving: A comprehensive examination of naturalistic lane-changes (Tech. Rep. No. DOT-HS-809-702). National Highway Traffic Safety Administration.
Olsen, E. C. B. (2003). Modeling slow lead vehicle lane changing. Doctoral dissertation, Department of Industrial and Systems Engineering, Virginia Polytechnic Institute.
Pentland, A., & Liu, A. (1999). Modeling and prediction of human behavior. Neural Computation, 11, 229-242.
Salvucci, D. D. (2004). Inferring driver intent: A case study in lane-change detection. In Proceedings of the Human Factors and Ergonomics Society 48th Annual Meeting.
Salvucci, D. D., & Liu, A. (2002). The time course of a lane change: Driver control and eye-movement behavior. Transportation Research Part F, 5, 123-132.
Salvucci, D. D., Mandalia, H. M., Kuge, N., & Yamamura, T. (in preparation). Comparing lane-change detection algorithms on driving-simulator and instrumented-vehicle data. Manuscript in preparation for submission to Human Factors.
Talmadge, S., Chu, R. Y., & Riney, R. S. (2000). Description and preliminary data from TRW's lane change collision avoidance testbed. In Proceedings of the Intelligent Transportation Society of America's Tenth Annual Meeting.
Tijerina, L. (1999). Operational and behavioral issues in the comprehensive evaluation of lane change crash avoidance systems. Journal of Transportation Human Factors, 1, 159-176.
Tijerina, L., Garott, W. R., Stoltzfus, D., & Parmer, E. (2005). Van and passenger car driver eye glance behavior during the lane change decision phase. In Proceedings of the Transportation Research Board Annual Meeting 2005.
Vapnik, V. N. (1998). Statistical Learning Theory. New York: Wiley.
Wipf, D., & Rao, B. (2003). Driver Intent Inference Annual Report. University of California, San Diego.
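The sketch below illustrates the detection scheme described in the lane-change paper above: a window slid one sample at a time, per-feature variances of two overlapping window halves as the input representation (Data Representation II), the last sample's label assigned to the window, a 2/3 vs. 1/3 train/test split, and a linear-kernel SVM. It is a minimal illustration rather than the authors' implementation; the use of NumPy and scikit-learn, the assumed sampling rate, the window length, and the synthetic placeholder signals are all assumptions made here for the sake of a runnable example.

```python
# Sketch of the windowed-variance representation and linear-SVM classifier described
# above. Placeholder signals stand in for the instrumented-vehicle data; in practice
# `data` would hold features such as the lane positions at 0/10/20/30 m and `labels`
# the LC/LK tag of each sample.
import numpy as np
from sklearn.svm import SVC

def window_features(data, labels, win, overlap=True):
    """Slide a window of `win` samples one step at a time.

    Each window is summarised by the per-feature variances of its two halves
    (overlapping two-thirds halves for Data Representation II) and labelled
    by its last sample, as in Figure 3.
    """
    X, y = [], []
    for end in range(win, len(data) + 1):
        block = data[end - win:end]
        if overlap:
            third = win // 3
            halves = (block[:2 * third], block[third:])
        else:
            half = win // 2
            halves = (block[:half], block[half:])
        X.append(np.concatenate([h.var(axis=0) for h in halves]))
        y.append(labels[end - 1])            # label of the last sample in the window
    return np.asarray(X), np.asarray(y)

# Synthetic stand-in data: 4 feature channels, binary LC (1) / LK (0) labels.
rng = np.random.default_rng(0)
data = rng.normal(size=(3000, 4))
labels = (rng.random(3000) < 0.1).astype(int)

win = 60                                     # e.g. 1.2 s at an assumed 50 Hz sampling rate
cut = 2 * len(data) // 3                     # 2/3 for training, 1/3 for testing
X_tr, y_tr = window_features(data[:cut], labels[:cut], win)
X_te, y_te = window_features(data[cut:], labels[cut:], win)

clf = SVC(kernel="linear").fit(X_tr, y_tr)   # the paper settled on a linear kernel
print("point-by-point accuracy:", clf.score(X_te, y_te))
```

With real driving data, sweeping `win` over the 0.8 s to 5 s range and toggling `overlap` reproduces the kind of comparison reported in Tables 2 and 3.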
Fuzzy Support Vector Machines Chun-Fu Lin and Sheng-De WangAbstract—A support vector machine(SVM)learns the decision surface from two distinct classes of the input points.In many appli-cations,each input point may not be fully assigned to one of these two classes.In this paper,we apply a fuzzy membership to each input point and reformulate the SVMs such that different input points can make different constributions to the learning of deci-sion surface.We call the proposed method fuzzy SVMs(FSVMs). Index Terms—Classification,fuzzy membership,quadratic pro-gramming,support vector machines(SVMs).I.I NTRODUCTIONT HE theory of support vector machines(SVMs)is a new classification technique and has drawn much attention on this topic in recent years[1]–[5].The theory of SVM is based on the idea of structural risk minimization(SRM)[3].In many ap-plications,SVM has been shown to provide higher performance than traditional learning machines[1]and has been introduced as powerful tools for solving classification problems.An SVM first maps the input points into a high-dimensional feature space and finds a separating hyperplane that maximizes the margin between two classes in this space.Maximizing the margin is a quadratic programming(QP)problem and can be solved from its dual problem by introducing Lagrangian multi-pliers.Without any knowledge of the mapping,the SVM finds the optimal hyperplane by using the dot product functions in feature space that are called kernels.The solution of the optimal hyperplane can be written as a combination of a few input points that are called support vectors.There are more and more applications using the SVM tech-niques.However,in many applications,some input points may not be exactly assigned to one of these two classes.Some are more important to be fully assinged to one class so that SVM can seperate these points more correctly.Some data points cor-rupted by noises are less meaningful and the machine should better to discard them.SVM lacks this kind of ability.In this paper,we apply a fuzzy membership to each input point of SVM and reformulate SVM into fuzzy SVM(FSVM) such that different input points can make different constribu-tions to the learning of decision surface.The proposed method enhances the SVM in reducing the effect of outliers and noises in data points.FSVM is suitable for applications in which data points have unmodeled characteristics.The rest of this paper is organized as follows.A brief review of the theory of SVM will be described in Section II.The FSVMManuscript received January25,2001;revised August27,2001.C.-F.Lin is with the Department of Electrical Engineering,National Taiwan University,Taiwan(e-mail:genelin@.tw).S.-D.Wang is with the Department of Electrical Engineering,National Taiwan University,Taiwan(e-mail:sdwang@.tw).Publisher Item Identifier S1045-9227(02)01807-6.will be derived in Section III.Three experiments are presented in Section IV.Some concluding remarks are given in Section V.II.SVMsIn this section we briefly review the basis of the theory of SVM in classification problems[2]–[4].Suppose we are given asetThe optimal hyperplane problem is then regraded as the so-lution to theproblem(6)where can be regarded as aregularization parameter.This is the only free parameter in theSVM formulation.Tuning this parameter can make balance be-tween margin maximization and classification violation.Detaildiscussions can be found in[4],[6].Searching the optimal hyperplane in(6)is a QP problem,which can be solved by constructing a Lagrangian and 
trans-formed into thedual(7)where is the vector of nonnegative Lagrangemultipliers associated with the constraints(5).The Kuhn–Tucker theorem plays an important role in thetheory of SVM.According to this theorem,thesolution(8)in(8)are those for which the constraints(5)are satisfied with theequality sign.Thepointandis classified correctly and clearlyaway the decision margin.To construct the optimalhyperplane,it followsthat,the computationof problem(7)and(11)is impossible.There is a good propertyof SVM that it is not necessary to knowaboutcalled kernel that can compute the dotproduct of the data points in featurespace(14)and the decision functionis(15)III.FSVMsIn this section,we make a detail description about the ideaand formulations of FSVMs.A.Fuzzy Property of InputSVM is a powerful tool for solving classification problems[1],but there are still some limitataions of this theory.From thetraining set(1)and formulations discussed above,each trainingpoint belongs to either one class or the other.For each class,wecan easily check that all training points of this class are treateduniformly in the theory of SVM.In many real-world applications,the effects of the trainingpoints are different.It is often that some training points are moreimportant than others in the classificaiton problem.We wouldrequire that the meaningful training points must be classifiedcorrectly and would not care about some training points likenoises whether or not they are misclassified.That is,each training point no more exactly belongs to oneof the two classes.It may90%belong to one class and10%be meaningless,and it may20%belong to one class and80%be meaningless.In other words,there is a fuzzymembershipcan be regarded as the attitude ofmeaningless.We extend the concept of SVM with fuzzy mem-bership and make it an FSVM.B.Reformulate SVM Suppose we are given aset,and sufficientsmall denote the corresponding featurespace vector with amapping.Since the fuzzymembership is the attitude of the corre-spondingpoint(17)where(19)(22)and the Kuhn–Tucker conditions are definedas(23)is misclassi-fied.An important difference between SVM and FSVM is thatthe points with the same value ofin SVM controls the tradeoff be-tween the maximization of margin and the amount of misclassi-fications.Alargermakes SVMignore more training points and get wider margin.In FSVM,we canset(25)where,and make the firstpoint.If we want to make fuzzy membershipbe a linear function of the time,we canselect(26)By applying the boundary conditions,we cangetFig.1.The result of SVM learning for data with time property. 
By applying the boundary conditions,we canget(30)where(31)suchthat1,it may belongs to this class with lower accuracy or reallybelongs to another class.For this purpose,we can select thefuzzy membership as a function of respective class.Suppose we are given a sequence of trainingpointsifFig.2.The result of FSVM learning for data with time property.Fig.3shows the result of the SVM and Fig.4shows the result of FSVM bysettingis indicated as square.In Fig.3,the SVM finds theoptimal hyperplane with errors appearing in each class.In Fig.4,we apply different fuzzy memberships to different classes,theFSVM finds the optimal hyperplane with errors appearing onlyin one class.We can easily check that the FSVM classify theclass of cross with high accuracy and the class of square withlow accuracy,while the SVM does not.e Class Center to Reduce the Effects of OutliersMany research results have shown that the SVM is very sen-sitive to noises and outliners[8],[9].The FSVM can also applyto reduce to effects of outliers.We propose a model by settingthe fuzzy membership as a function of the distance between thepoint and its class center.This setting of the membership couldnot be the best way to solve the problem of outliers.We just pro-pose a way to solve this problem.It may be better to choose a dif-ferent model of fuzzy membership function in different trainingset.Suppose we are given a sequence of trainingpoints1(37)and the radius ofclassif(39)whereis indicated as square.In Fig.5,theSVM finds the optimal hyperplane with the effect of outliers,for example,a square at( 2.2).InFig.6,the distance of the above two outliers to its correspondingmean is equal to the radius.Since the fuzzy membership is afunction of the mean and radius of each class,these two pointsare regarded as less important in FSVM training such that thereis a big difference between the hyperplanes found by SVM andFSVM.Fig.3.The result of SVM learning for data sets with different weighting.Fig.4.The result of FSVM learning for data sets with different weighting.Fig.5.The result of SVM learning for data sets with outliers.Fig.6.The result of FSVM learning for data sets with outliers.V.C ONCLUSIONIn this paper,we proposed the FSVM that imposes a fuzzy membership to each input point such that different input points can make different constributions to the learning of decision sur-face.By setting different types of fuzzy membership,we can easily apply FSVM to solve different kinds of problems.This extends the application horizon of the SVM.There remains some future work to be done.One is to select a proper fuzzy membership function to a problem.The goal is to automatically or adaptively determine a suitable model of fuzzy membership function that can reduce the effect of noises and outliers for a class of problems.R EFERENCES[1] C.Burges,“A tutorial on support vector machines for pattern recogni-tion,”Data Mining and Knowledge Discovery,vol.2,no.2,1998.[2] C.Cortes and V.N.Vapnik,“Support vector networks,”MachineLearning,vol.20,pp.273–297,1995.[3]V.N.Vapnik,The Nature of Statistical Learning Theory.New York:Springer-Verlag,1995.[4],Statistical Learning Theory.New York:Wiley,1998.[5] B.Schölkopf,C.Burges,and A.Smola,Advances in Kernel Methods:Support Vector Learning.Cambridge,MA:MIT Press,1999.[6]M.Pontil and A.Verri,Massachusetts Inst.Technol.,1997,AI Memono.1612.Properties of support vector machines.[7]N.de Freitas,o,P.Clarkson,M.Niranjan,and A.Gee,“Sequen-tial support vector machines,”Proc.IEEE NNSP’99,pp.31–40,1999.[8]I.Guyon,N.Matic´,and 
V. N. Vapnik, Discovering Information Patterns and Data Cleaning. Cambridge, MA: MIT Press, 1996, pp. 181–203.
[9] X. Zhang, “Using class-center vectors to build support vector machines,” in Proc. IEEE NNSP’99, 1999, pp. 3–11.
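As a concrete illustration of the reweighting idea in the FSVM paper above, the sketch below computes the class-centre fuzzy memberships and feeds them to an off-the-shelf SVM solver as per-sample weights, which rescale the misclassification penalty C for each point in the same spirit as the s_i C bound in FSVM. This is an approximation built on scikit-learn's sample_weight mechanism, not the authors' own quadratic-programming formulation; the toy data and parameter values are assumptions.

```python
# Sketch: class-centre fuzzy memberships (one of the membership models proposed above)
# used as per-sample weights for a standard SVM solver. Rescaling C per point via
# sample_weight mimics the 0 <= alpha_i <= s_i * C box constraint of FSVM, but it is
# an approximation, not the paper's QP implementation.
import numpy as np
from sklearn.svm import SVC

def class_centre_memberships(X, y, delta=1e-6):
    """s_i = 1 - ||x_i - class_mean|| / (class_radius + delta), per the class-centre model."""
    s = np.empty(len(X))
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        centre = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - centre, axis=1)
        s[idx] = 1.0 - dist / (dist.max() + delta)
    return s

# Toy two-class data with two outlying points placed inside the opposite class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.7, (50, 2)),      # class 0 cluster
               rng.normal(+2.0, 0.7, (50, 2)),      # class 1 cluster
               [[2.5, 2.5], [-2.5, -2.5]]])         # outliers labelled 0 and 1 respectively
y = np.array([0] * 50 + [1] * 50 + [0, 1])

s = class_centre_memberships(X, y)                  # the outliers receive memberships near 0
plain = SVC(kernel="linear", C=10.0).fit(X, y)
fuzzy = SVC(kernel="linear", C=10.0).fit(X, y, sample_weight=s)
print("memberships of the two outliers:", s[-2:])
```

Because the two outliers lie at their class radii, their memberships are close to zero and the weighted fit largely ignores them, which is the behaviour Figures 5 and 6 contrast.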
A Support Vector Machine Method for Rolling Bearing Fault Detection and Classification
1 Rolling Bearing Fault Detection
Rolling bearings are common mechanical components used for rolling motion, power transmission, and the support of machine elements. Because a bearing is fixed in place, changes in surface friction and in internal and external loads readily lead to bearing faults, which disrupt normal equipment operation. Accurate detection and precise diagnosis of bearing faults are therefore of great importance.
Current bearing fault detection techniques rely mainly on Fourier-transform analysis of magnetic and non-magnetic noise and similar methods. However, because rolling bearing fault detection and classification are inherently difficult, these techniques have significant limitations. Developing efficient techniques for fault detection and classification, so as to reduce repair and maintenance costs and keep equipment in reliable service, has therefore become an important topic of technical development.
2 The Support Vector Machine Method
The support vector machine (SVM) is a machine-learning classification method used to detect and identify fault patterns in equipment and systems. By studying fault signatures, analyzing the factors that influence them, and building a suitable fault model, it can diagnose bearing faults effectively.
An SVM-based method has three main modules: feature extraction, feature selection, and model training. First, standard digital signal processing and computer vision techniques are applied to the bearing's harmonic and spectral data to improve the accuracy of feature extraction; second, a carefully designed feature selection algorithm selects informative features efficiently, supporting the recognition and detection of bearing faults; finally, the SVM algorithm is used to build a model for each type of bearing fault, which is trained and then used to classify faults. A minimal sketch of this three-stage pipeline appears below.
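The sketch below strings these three stages together on synthetic vibration windows used as placeholder input; the specific descriptors (band energies, RMS, kurtosis), the window length, the number of selected features, and the RBF kernel are illustrative choices made here, not prescriptions from the text.

```python
# Illustrative three-stage pipeline: feature extraction -> feature selection -> SVM training.
# Synthetic vibration windows stand in for real bearing measurements; every numeric
# setting below (window length, number of bands, k, C, kernel) is an assumption.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def spectral_features(window):
    """Simple time- and frequency-domain descriptors of one vibration window."""
    spectrum = np.abs(np.fft.rfft(window))
    band_energy = [b.sum() for b in np.array_split(spectrum, 8)]
    rms = np.sqrt(np.mean(window ** 2))
    kurtosis = ((window - window.mean()) ** 4).mean() / window.var() ** 2
    return np.concatenate([[window.std(), rms, kurtosis], band_energy])

# Placeholder data set: 300 windows of 1024 samples, three fault classes (0, 1, 2).
rng = np.random.default_rng(0)
signals = rng.normal(size=(300, 1024))
fault_class = rng.integers(0, 3, size=300)

X = np.array([spectral_features(w) for w in signals])

model = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=6),                 # feature selection stage
    SVC(kernel="rbf", C=10.0, gamma="scale"),    # multi-class handled one-vs-one by SVC
)
model.fit(X, fault_class)
print(model.predict(X[:5]))                      # classify a few windows
```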
The SVM approach therefore performs well in rolling bearing fault diagnosis and classification: it can detect and analyze bearing fault patterns in real time, which means that different categories of rolling bearing faults can be detected and classified effectively.
This not only reduces the adverse effects of bearing faults on equipment, but also lowers maintenance costs, improves inspection efficiency, and helps keep equipment running safely.
UNIVERSITY OF SOUTHAMPTONSupport Vector MachinesforClassification and RegressionbySteve R.GunnTechnical ReportFaculty of Engineering,Science and Mathematics School of Electronics and Computer Science10May1998ContentsNomenclature xi1Introduction11.1Statistical Learning Theory (2)1.1.1VC Dimension (3)1.1.2Structural Risk Minimisation (4)2Support Vector Classification52.1The Optimal Separating Hyperplane (5)2.1.1Linearly Separable Example (10)2.2The Generalised Optimal Separating Hyperplane (10)2.2.1Linearly Non-Separable Example (13)2.3Generalisation in High Dimensional Feature Space (14)2.3.1Polynomial Mapping Example (16)2.4Discussion (16)3Feature Space193.1Kernel Functions (19)3.1.1Polynomial (20)3.1.2Gaussian Radial Basis Function (20)3.1.3Exponential Radial Basis Function (20)3.1.4Multi-Layer Perceptron (20)3.1.5Fourier Series (21)3.1.6Splines (21)3.1.7B splines (21)3.1.8Additive Kernels (22)3.1.9Tensor Product (22)3.2Implicit vs.Explicit Bias (22)3.3Data Normalisation (23)3.4Kernel Selection (23)4Classification Example:IRIS data254.1Applications (28)5Support Vector Regression295.1Linear Regression (30)5.1.1 -insensitive Loss Function (30)5.1.2Quadratic Loss Function (31)iiiiv CONTENTS5.1.3Huber Loss Function (32)5.1.4Example (33)5.2Non Linear Regression (33)5.2.1Examples (34)5.2.2Comments (36)6Regression Example:Titanium Data396.1Applications (42)7Conclusions43A Implementation Issues45A.1Support Vector Classification (45)A.2Support Vector Regression (47)B MATLAB SVM Toolbox51 Bibliography53List of Figures1.1Modelling Errors (2)1.2VC Dimension Illustration (3)2.1Optimal Separating Hyperplane (5)2.2Canonical Hyperplanes (6)2.3Constraining the Canonical Hyperplanes (7)2.4Optimal Separating Hyperplane (10)2.5Generalised Optimal Separating Hyperplane (11)2.6Generalised Optimal Separating Hyperplane Example(C=1) (13)2.7Generalised Optimal Separating Hyperplane Example(C=105) (14)2.8Generalised Optimal Separating Hyperplane Example(C=10−8) (14)2.9Mapping the Input Space into a High Dimensional Feature Space (14)2.10Mapping input space into Polynomial Feature Space (16)3.1Comparison between Implicit and Explicit bias for a linear kernel (22)4.1Iris data set (25)4.2Separating Setosa with a linear SVC(C=∞) (26)4.3Separating Viginica with a polynomial SVM(degree2,C=∞) (26)4.4Separating Viginica with a polynomial SVM(degree10,C=∞) (26)4.5Separating Viginica with a Radial Basis Function SVM(σ=1.0,C=∞)274.6Separating Viginica with a polynomial SVM(degree2,C=10) (27)4.7The effect of C on the separation of Versilcolor with a linear spline SVM.285.1Loss Functions (29)5.2Linear regression (33)5.3Polynomial Regression (35)5.4Radial Basis Function Regression (35)5.5Spline Regression (36)5.6B-spline Regression (36)5.7Exponential RBF Regression (36)6.1Titanium Linear Spline Regression( =0.05,C=∞) (39)6.2Titanium B-Spline Regression( =0.05,C=∞) (40)6.3Titanium Gaussian RBF Regression( =0.05,σ=1.0,C=∞) (40)6.4Titanium Gaussian RBF Regression( =0.05,σ=0.3,C=∞) (40)6.5Titanium Exponential RBF Regression( =0.05,σ=1.0,C=∞) (41)6.6Titanium Fourier Regression( =0.05,degree3,C=∞) (41)6.7Titanium Linear Spline Regression( =0.05,C=10) (42)vvi LIST OF FIGURES6.8Titanium B-Spline Regression( =0.05,C=10) (42)List of Tables2.1Linearly Separable Classification Data (10)2.2Non-Linearly Separable Classification Data (13)5.1Regression Data (33)viiListingsA.1Support Vector Classification MATLAB Code (46)A.2Support Vector Regression MATLAB Code (48)ixNomenclature0Column vector of zeros(x)+The positive part of xC SVM 
misclassification tolerance parameterD DatasetK(x,x )Kernel functionR[f]Risk functionalR emp[f]Empirical Risk functionalxiChapter1IntroductionThe problem of empirical data modelling is germane to many engineering applications. In empirical data modelling a process of induction is used to build up a model of the system,from which it is hoped to deduce responses of the system that have yet to be ob-served.Ultimately the quantity and quality of the observations govern the performance of this empirical model.By its observational nature data obtained isfinite and sampled; typically this sampling is non-uniform and due to the high dimensional nature of the problem the data will form only a sparse distribution in the input space.Consequently the problem is nearly always ill posed(Poggio et al.,1985)in the sense of Hadamard (Hadamard,1923).Traditional neural network approaches have suffered difficulties with generalisation,producing models that can overfit the data.This is a consequence of the optimisation algorithms used for parameter selection and the statistical measures used to select the’best’model.The foundations of Support Vector Machines(SVM)have been developed by Vapnik(1995)and are gaining popularity due to many attractive features,and promising empirical performance.The formulation embodies the Struc-tural Risk Minimisation(SRM)principle,which has been shown to be superior,(Gunn et al.,1997),to traditional Empirical Risk Minimisation(ERM)principle,employed by conventional neural networks.SRM minimises an upper bound on the expected risk, as opposed to ERM that minimises the error on the training data.It is this difference which equips SVM with a greater ability to generalise,which is the goal in statistical learning.SVMs were developed to solve the classification problem,but recently they have been extended to the domain of regression problems(Vapnik et al.,1997).In the literature the terminology for SVMs can be slightly confusing.The term SVM is typ-ically used to describe classification with support vector methods and support vector regression is used to describe regression with support vector methods.In this report the term SVM will refer to both classification and regression methods,and the terms Support Vector Classification(SVC)and Support Vector Regression(SVR)will be used for specification.This section continues with a brief introduction to the structural risk12Chapter1Introductionminimisation principle.In Chapter2the SVM is introduced in the setting of classifica-tion,being both historical and more accessible.This leads onto mapping the input into a higher dimensional feature space by a suitable choice of kernel function.The report then considers the problem of regression.Illustrative examples re given to show the properties of the techniques.1.1Statistical Learning TheoryThis section is a very brief introduction to statistical learning theory.For a much more in depth look at statistical learning theory,see(Vapnik,1998).Figure1.1:Modelling ErrorsThe goal in modelling is to choose a model from the hypothesis space,which is closest (with respect to some error measure)to the underlying function in the target space. 
Errors in doing this arise from two cases:Approximation Error is a consequence of the hypothesis space being smaller than the target space,and hence the underlying function may lie outside the hypothesis space.A poor choice of the model space will result in a large approximation error, and is referred to as model mismatch.Estimation Error is the error due to the learning procedure which results in a tech-nique selecting the non-optimal model from the hypothesis space.Chapter1Introduction3Together these errors form the generalisation error.Ultimately we would like tofind the function,f,which minimises the risk,R[f]=X×YL(y,f(x))P(x,y)dxdy(1.1)However,P(x,y)is unknown.It is possible tofind an approximation according to the empirical risk minimisation principle,R emp[f]=1lli=1Ly i,fx i(1.2)which minimises the empirical risk,ˆf n,l (x)=arg minf∈H nR emp[f](1.3)Empirical risk minimisation makes sense only if,liml→∞R emp[f]=R[f](1.4) which is true from the law of large numbers.However,it must also satisfy,lim l→∞minf∈H nR emp[f]=minf∈H nR[f](1.5)which is only valid when H n is’small’enough.This condition is less intuitive and requires that the minima also converge.The following bound holds with probability1−δ,R[f]≤R emp[f]+h ln2lh+1−lnδ4l(1.6)Remarkably,this expression for the expected risk is independent of the probability dis-tribution.1.1.1VC DimensionThe VC dimension is a scalar value that measures the capacity of a set offunctions.Figure1.2:VC Dimension Illustration4Chapter1IntroductionDefinition1.1(Vapnik–Chervonenkis).The VC dimension of a set of functions is p if and only if there exists a set of points{x i}pi=1such that these points can be separatedin all2p possible configurations,and that no set{x i}qi=1exists where q>p satisfying this property.Figure1.2illustrates how three points in the plane can be shattered by the set of linear indicator functions whereas four points cannot.In this case the VC dimension is equal to the number of free parameters,but in general that is not the case;e.g.the function A sin(bx)has an infinite VC dimension(Vapnik,1995).The set of linear indicator functions in n dimensional space has a VC dimension equal to n+1.1.1.2Structural Risk MinimisationCreate a structure such that S h is a hypothesis space of VC dimension h then,S1⊂S2⊂...⊂S∞(1.7) SRM consists in solving the following problemmin S h R emp[f]+h ln2lh+1−lnδ4l(1.8)If the underlying process being modelled is not deterministic the modelling problem becomes more exacting and consequently this chapter is restricted to deterministic pro-cesses.Multiple output problems can usually be reduced to a set of single output prob-lems that may be considered independent.Hence it is appropriate to consider processes with multiple inputs from which it is desired to predict a single output.Chapter2Support Vector ClassificationThe classification problem can be restricted to consideration of the two-class problem without loss of generality.In this problem the goal is to separate the two classes by a function which is induced from available examples.The goal is to produce a classifier that will work well on unseen examples,i.e.it generalises well.Consider the example in Figure2.1.Here there are many possible linear classifiers that can separate the data, but there is only one that maximises the margin(maximises the distance between it and the nearest data point of each class).This linear classifier is termed the optimal separating hyperplane.Intuitively,we would expect this boundary to generalise well as opposed to the 
other possible boundaries.Figure2.1:Optimal Separating Hyperplane2.1The Optimal Separating HyperplaneConsider the problem of separating the set of training vectors belonging to two separateclasses,D=(x1,y1),...,(x l,y l),x∈R n,y∈{−1,1},(2.1)56Chapter 2Support Vector Classificationwith a hyperplane, w,x +b =0.(2.2)The set of vectors is said to be optimally separated by the hyperplane if it is separated without error and the distance between the closest vector to the hyperplane is maximal.There is some redundancy in Equation 2.2,and without loss of generality it is appropri-ate to consider a canonical hyperplane (Vapnik ,1995),where the parameters w ,b are constrained by,min i w,x i +b =1.(2.3)This incisive constraint on the parameterisation is preferable to alternatives in simpli-fying the formulation of the problem.In words it states that:the norm of the weight vector should be equal to the inverse of the distance,of the nearest point in the data set to the hyperplane .The idea is illustrated in Figure 2.2,where the distance from the nearest point to each hyperplane is shown.Figure 2.2:Canonical HyperplanesA separating hyperplane in canonical form must satisfy the following constraints,y i w,x i +b ≥1,i =1,...,l.(2.4)The distance d (w,b ;x )of a point x from the hyperplane (w,b )is,d (w,b ;x )= w,x i +bw .(2.5)Chapter 2Support Vector Classification 7The optimal hyperplane is given by maximising the margin,ρ,subject to the constraints of Equation 2.4.The margin is given by,ρ(w,b )=min x i :y i =−1d (w,b ;x i )+min x i :y i =1d (w,b ;x i )=min x i :y i =−1 w,x i +b w +min x i :y i =1 w,x i +b w =1 w min x i :y i =−1 w,x i +b +min x i :y i =1w,x i +b =2 w (2.6)Hence the hyperplane that optimally separates the data is the one that minimisesΦ(w )=12 w 2.(2.7)It is independent of b because provided Equation 2.4is satisfied (i.e.it is a separating hyperplane)changing b will move it in the normal direction to itself.Accordingly the margin remains unchanged but the hyperplane is no longer optimal in that it will be nearer to one class than the other.To consider how minimising Equation 2.7is equivalent to implementing the SRM principle,suppose that the following bound holds,w <A.(2.8)Then from Equation 2.4and 2.5,d (w,b ;x )≥1A.(2.9)Accordingly the hyperplanes cannot be nearer than 1A to any of the data points and intuitively it can be seen in Figure 2.3how this reduces the possible hyperplanes,andhence thecapacity.Figure 2.3:Constraining the Canonical Hyperplanes8Chapter2Support Vector ClassificationThe VC dimension,h,of the set of canonical hyperplanes in n dimensional space is bounded by,h≤min[R2A2,n]+1,(2.10)where R is the radius of a hypersphere enclosing all the data points.Hence minimising Equation2.7is equivalent to minimising an upper bound on the VC dimension.The solution to the optimisation problem of Equation2.7under the constraints of Equation 2.4is given by the saddle point of the Lagrange functional(Lagrangian)(Minoux,1986),Φ(w,b,α)=12 w 2−li=1αiy iw,x i +b−1,(2.11)whereαare the Lagrange multipliers.The Lagrangian has to be minimised with respect to w,b and maximised with respect toα≥0.Classical Lagrangian duality enables the primal problem,Equation2.11,to be transformed to its dual problem,which is easier to solve.The dual problem is given by,max αW(α)=maxαminw,bΦ(w,b,α).(2.12)The minimum with respect to w and b of the Lagrangian,Φ,is given by,∂Φ∂b =0⇒li=1αi y i=0∂Φ∂w =0⇒w=li=1αi y i x i.(2.13)Hence from Equations2.11,2.12and2.13,the dual problem is,max αW(α)=maxα−12li=1lj=1αiαj y i 
y j x i,x j +lk=1αk,(2.14)and hence the solution to the problem is given by,α∗=arg minα12li=1lj=1αiαj y i y j x i,x j −lk=1αk,(2.15)with constraints,αi≥0i=1,...,llj=1αj y j=0.(2.16)Chapter2Support Vector Classification9Solving Equation2.15with constraints Equation2.16determines the Lagrange multi-pliers,and the optimal separating hyperplane is given by,w∗=li=1αi y i x ib∗=−12w∗,x r+x s .(2.17)where x r and x s are any support vector from each class satisfying,αr,αs>0,y r=−1,y s=1.(2.18)The hard classifier is then,f(x)=sgn( w∗,x +b)(2.19) Alternatively,a soft classifier may be used which linearly interpolates the margin,f(x)=h( w∗,x +b)where h(z)=−1:z<−1z:−1≤z≤1+1:z>1(2.20)This may be more appropriate than the hard classifier of Equation2.19,because it produces a real valued output between−1and1when the classifier is queried within the margin,where no training data resides.From the Kuhn-Tucker conditions,αiy iw,x i +b−1=0,i=1,...,l,(2.21)and hence only the points x i which satisfy,y iw,x i +b=1(2.22)will have non-zero Lagrange multipliers.These points are termed Support Vectors(SV). If the data is linearly separable all the SV will lie on the margin and hence the number of SV can be very small.Consequently the hyperplane is determined by a small subset of the training set;the other points could be removed from the training set and recalculating the hyperplane would produce the same answer.Hence SVM can be used to summarise the information contained in a data set by the SV produced.If the data is linearly separable the following equality will hold,w 2=li=1αi=i∈SV sαi=i∈SV sj∈SV sαiαj y i y j x i,x j .(2.23)Hence from Equation2.10the VC dimension of the classifier is bounded by,h≤min[R2i∈SV s,n]+1,(2.24)10Chapter2Support Vector Classificationx1x2y11-133113131-12 2.513 2.5-143-1Table2.1:Linearly Separable Classification Dataand if the training data,x,is normalised to lie in the unit hypersphere,,n],(2.25)h≤1+min[i∈SV s2.1.1Linearly Separable ExampleTo illustrate the method consider the training set in Table2.1.The SVC solution is shown in Figure2.4,where the dotted lines describe the locus of the margin and the circled data points represent the SV,which all lie on the margin.Figure2.4:Optimal Separating Hyperplane2.2The Generalised Optimal Separating HyperplaneSo far the discussion has been restricted to the case where the training data is linearly separable.However,in general this will not be the case,Figure2.5.There are two approaches to generalising the problem,which are dependent upon prior knowledge of the problem and an estimate of the noise on the data.In the case where it is expected (or possibly even known)that a hyperplane can correctly separate the data,a method ofChapter2Support Vector Classification11Figure2.5:Generalised Optimal Separating Hyperplaneintroducing an additional cost function associated with misclassification is appropriate. 
Alternatively a more complex function can be used to describe the boundary,as discussed in Chapter2.1.To enable the optimal separating hyperplane method to be generalised, Cortes and Vapnik(1995)introduced non-negative variables,ξi≥0,and a penalty function,Fσ(ξ)=iξσiσ>0,(2.26) where theξi are a measure of the misclassification errors.The optimisation problem is now posed so as to minimise the classification error as well as minimising the bound on the VC dimension of the classifier.The constraints of Equation2.4are modified for the non-separable case to,y iw,x i +b≥1−ξi,i=1,...,l.(2.27)whereξi≥0.The generalised optimal separating hyperplane is determined by the vector w,that minimises the functional,Φ(w,ξ)=12w 2+Ciξi,(2.28)(where C is a given value)subject to the constraints of Equation2.27.The solution to the optimisation problem of Equation2.28under the constraints of Equation2.27is given by the saddle point of the Lagrangian(Minoux,1986),Φ(w,b,α,ξ,β)=12 w 2+Ciξi−li=1αiy iw T x i+b−1+ξi−lj=1βiξi,(2.29)12Chapter2Support Vector Classificationwhereα,βare the Lagrange multipliers.The Lagrangian has to be minimised with respect to w,b,x and maximised with respect toα,β.As before,classical Lagrangian duality enables the primal problem,Equation2.29,to be transformed to its dual problem. The dual problem is given by,max αW(α,β)=maxα,βminw,b,ξΦ(w,b,α,ξ,β).(2.30)The minimum with respect to w,b andξof the Lagrangian,Φ,is given by,∂Φ∂b =0⇒li=1αi y i=0∂Φ∂w =0⇒w=li=1αi y i x i∂Φ∂ξ=0⇒αi+βi=C.(2.31) Hence from Equations2.29,2.30and2.31,the dual problem is,max αW(α)=maxα−12li=1lj=1αiαj y i y j x i,x j +lk=1αk,(2.32)and hence the solution to the problem is given by,α∗=arg minα12li=1lj=1αiαj y i y j x i,x j −lk=1αk,(2.33)with constraints,0≤αi≤C i=1,...,llj=1αj y j=0.(2.34)The solution to this minimisation problem is identical to the separable case except for a modification of the bounds of the Lagrange multipliers.The uncertain part of Cortes’s approach is that the coefficient C has to be determined.This parameter introduces additional capacity control within the classifier.C can be directly related to a regulari-sation parameter(Girosi,1997;Smola and Sch¨o lkopf,1998).Blanz et al.(1996)uses a value of C=5,but ultimately C must be chosen to reflect the knowledge of the noise on the data.This warrants further work,but a more practical discussion is given in Chapter4.Chapter2Support Vector Classification13x1x2y11-133113131-12 2.513 2.5-143-11.5 1.5112-1Table2.2:Non-Linearly Separable Classification Data2.2.1Linearly Non-Separable ExampleTwo additional data points are added to the separable data of Table2.1to produce a linearly non-separable data set,Table2.2.The resulting SVC is shown in Figure2.6,for C=1.The SV are no longer required to lie on the margin,as in Figure2.4,and the orientation of the hyperplane and the width of the margin are different.Figure2.6:Generalised Optimal Separating Hyperplane Example(C=1)In the limit,lim C→∞the solution converges towards the solution obtained by the optimal separating hyperplane(on this non-separable data),Figure2.7.In the limit,lim C→0the solution converges to one where the margin maximisation term dominates,Figure2.8.Beyond a certain point the Lagrange multipliers will all take on the value of C.There is now less emphasis on minimising the misclassification error, but purely on maximising the margin,producing a large width margin.Consequently as C decreases the width of the margin increases.The useful range of C lies between the point where all the Lagrange 
Multipliers are equal to C and when only one of them is just bounded by C.14Chapter2Support Vector ClassificationFigure2.7:Generalised Optimal Separating Hyperplane Example(C=105)Figure2.8:Generalised Optimal Separating Hyperplane Example(C=10−8)2.3Generalisation in High Dimensional Feature SpaceIn the case where a linear boundary is inappropriate the SVM can map the input vector, x,into a high dimensional feature space,z.By choosing a non-linear mapping a priori, the SVM constructs an optimal separating hyperplane in this higher dimensional space, Figure2.9.The idea exploits the method of Aizerman et al.(1964)which,enables the curse of dimensionality(Bellman,1961)to be addressed.Figure2.9:Mapping the Input Space into a High Dimensional Feature SpaceChapter2Support Vector Classification15There are some restrictions on the non-linear mapping that can be employed,see Chap-ter3,but it turns out,surprisingly,that most commonly employed functions are accept-able.Among acceptable mappings are polynomials,radial basis functions and certain sigmoid functions.The optimisation problem of Equation2.33becomes,α∗=arg minα12li=1lj=1αiαj y i y j K(x i,x j)−lk=1αk,(2.35)where K(x,x )is the kernel function performing the non-linear mapping into feature space,and the constraints are unchanged,0≤αi≤C i=1,...,llj=1αj y j=0.(2.36)Solving Equation2.35with constraints Equation2.36determines the Lagrange multipli-ers,and a hard classifier implementing the optimal separating hyperplane in the feature space is given by,f(x)=sgn(i∈SV sαi y i K(x i,x)+b)(2.37) wherew∗,x =li=1αi y i K(x i,x)b∗=−12li=1αi y i[K(x i,x r)+K(x i,x r)].(2.38)The bias is computed here using two support vectors,but can be computed using all the SV on the margin for stability(Vapnik et al.,1997).If the Kernel contains a bias term, the bias can be accommodated within the Kernel,and hence the classifier is simply,f(x)=sgn(i∈SV sαi K(x i,x))(2.39)Many employed kernels have a bias term and anyfinite Kernel can be made to have one(Girosi,1997).This simplifies the optimisation problem by removing the equality constraint of Equation2.36.Chapter3discusses the necessary conditions that must be satisfied by valid kernel functions.16Chapter2Support Vector Classification 2.3.1Polynomial Mapping ExampleConsider a polynomial kernel of the form,K(x,x )=( x,x +1)2,(2.40)which maps a two dimensional input vector into a six dimensional feature space.Apply-ing the non-linear SVC to the linearly non-separable training data of Table2.2,produces the classification illustrated in Figure2.10(C=∞).The margin is no longer of constant width due to the non-linear projection into the input space.The solution is in contrast to Figure2.7,in that the training data is now classified correctly.However,even though SVMs implement the SRM principle and hence can generalise well,a careful choice of the kernel function is necessary to produce a classification boundary that is topologically appropriate.It is always possible to map the input space into a dimension greater than the number of training points and produce a classifier with no classification errors on the training set.However,this will generalise badly.Figure2.10:Mapping input space into Polynomial Feature Space2.4DiscussionTypically the data will only be linearly separable in some,possibly very high dimensional feature space.It may not make sense to try and separate the data exactly,particularly when only afinite amount of training data is available which is potentially corrupted by noise.Hence in practice it will be 
necessary to employ the non-separable approach which places an upper bound on the Lagrange multipliers.This raises the question of how to determine the parameter C.It is similar to the problem in regularisation where the regularisation coefficient has to be determined,and it has been shown that the parameter C can be directly related to a regularisation parameter for certain kernels (Smola and Sch¨o lkopf,1998).A process of cross-validation can be used to determine thisChapter2Support Vector Classification17parameter,although more efficient and potentially better methods are sought after.In removing the training patterns that are not support vectors,the solution is unchanged and hence a fast method for validation may be available when the support vectors are sparse.Chapter3Feature SpaceThis chapter discusses the method that can be used to construct a mapping into a high dimensional feature space by the use of reproducing kernels.The idea of the kernel function is to enable operations to be performed in the input space rather than the potentially high dimensional feature space.Hence the inner product does not need to be evaluated in the feature space.This provides a way of addressing the curse of dimensionality.However,the computation is still critically dependent upon the number of training patterns and to provide a good data distribution for a high dimensional problem will generally require a large training set.3.1Kernel FunctionsThe following theory is based upon Reproducing Kernel Hilbert Spaces(RKHS)(Aron-szajn,1950;Girosi,1997;Heckman,1997;Wahba,1990).An inner product in feature space has an equivalent kernel in input space,K(x,x )= φ(x),φ(x ) ,(3.1)provided certain conditions hold.If K is a symmetric positive definite function,which satisfies Mercer’s Conditions,K(x,x )=∞ma mφm(x)φm(x ),a m≥0,(3.2)K(x,x )g(x)g(x )dxdx >0,g∈L2,(3.3) then the kernel represents a legitimate inner product in feature space.Valid functions that satisfy Mercer’s conditions are now given,which unless stated are valid for all real x and x .1920Chapter3Feature Space3.1.1PolynomialA polynomial mapping is a popular method for non-linear modelling,K(x,x )= x,x d.(3.4)K(x,x )=x,x +1d.(3.5)The second kernel is usually preferable as it avoids problems with the hessian becoming zero.3.1.2Gaussian Radial Basis FunctionRadial basis functions have received significant attention,most commonly with a Gaus-sian of the form,K(x,x )=exp−x−x 22σ2.(3.6)Classical techniques utilising radial basis functions employ some method of determining a subset of centres.Typically a method of clustering isfirst employed to select a subset of centres.An attractive feature of the SVM is that this selection is implicit,with each support vectors contributing one local Gaussian function,centred at that data point. By further considerations it is possible to select the global basis function width,s,using the SRM principle(Vapnik,1995).3.1.3Exponential Radial Basis FunctionA radial basis function of the form,K(x,x )=exp−x−x2σ2.(3.7)produces a piecewise linear solution which can be attractive when discontinuities are acceptable.3.1.4Multi-Layer PerceptronThe long established MLP,with a single hidden layer,also has a valid kernel represen-tation,K(x,x )=tanhρ x,x +(3.8)for certain values of the scale,ρ,and offset, ,parameters.Here the SV correspond to thefirst layer and the Lagrange multipliers to the weights.。
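As a concrete illustration of the kernels listed above, the sketch below computes the corresponding Gram matrices with NumPy. It is our own minimal example rather than code from the text; the parameter names (degree d, width sigma, scale rho, offset delta) simply mirror the notation of this chapter and are not prescribed values.

```python
import numpy as np

def polynomial_kernel(X, Y, d=2):
    # K(x, x') = (<x, x'> + 1)^d, the inhomogeneous form preferred in the text
    return (X @ Y.T + 1.0) ** d

def gaussian_rbf_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def exponential_rbf_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / (2 sigma^2)), giving piecewise-linear solutions
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.sqrt(np.maximum(sq, 0.0)) / (2.0 * sigma**2))

def mlp_kernel(X, Y, rho=1.0, delta=-1.0):
    # K(x, x') = tanh(rho <x, x'> + offset); only certain (rho, offset) pairs are valid
    return np.tanh(rho * (X @ Y.T) + delta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    print(gaussian_rbf_kernel(X, X).shape)  # (5, 5) Gram matrix with unit diagonal
```

Each function returns the full Gram matrix between two sample sets, which is what the dual optimisation problem of Equation 2.35 actually requires.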
Stuart Andrews,Ioannis Tsochantaridis and Thomas HofmannDepartment of Computer Science,Brown University,Providence,RI02912{stu,it,th}@AbstractThis paper presents two new formulations of multiple-instancelearning as a maximum margin problem.The proposed extensionsof the Support Vector Machine(SVM)learning approach lead tomixed integer quadratic programs that can be solved heuristically.Our generalization of SVMs makes a state-of-the-art classificationtechnique,including non-linear classification via kernels,availableto an area that up to now has been largely dominated by specialpurpose methods.We present experimental results on a pharma-ceutical data set and on applications in automated image indexingand document categorization.1IntroductionMultiple-instance learning(MIL)[4]is a generalization of supervised classification in which training class labels are associated with sets of patterns,or bags,instead of individual patterns.While every pattern may possess an associated true label,it is assumed that pattern labels are only indirectly accessible through labels attached to bags.The law of inheritance is such that a set receives a particular label,if at least one of the patterns in the set possesses the label.In the important case of binary classification,this implies that a bag is“positive”if at least one of its member patterns is a positive differs from the general set-learning problem in that the set-level classifier is by design induced by a pattern-level classifier.Hence the key challenge in MIL is to cope with the ambiguity of not knowing which of the patterns in a positive bag are the actual positive examples and which ones are not. The MIL setting has numerous interesting applications.One prominent applica-tion is the classification of molecules in the context of drug design[4].Here, each molecule is represented by a bag of possible conformations.The efficacy of a molecule can be tested experimentally,but there is no way to control for indi-vidual conformations.A second application is in image indexing for content-based image retrieval.Here,an image can be viewed as a bag of local image patches[9] or image regions.Since annotating whole images is far less time consuming then marking relevant image regions,the ability to deal with this type of weakly anno-tated data is very desirable.Finally,consider the problem of text categorization for which we are thefirst to apply the MIL ually,documents which contain a relevant passage are considered to be relevant with respect to a particular cate-gory or topic,yet class labels are rarely available on the passage level and are most commonly associated with the document as a whole.Formally,all of the above applications share the same type of label ambiguity which in our opinion makes a strong argument in favor of the relevance of the MIL setting.We present two approaches to modify and extend Support Vector Machines(SVMs) to deal with MIL problems.Thefirst approach explicitly treats the pattern labels as unobserved integer variables,subjected to constraints defined by the(positive) bag labels.The goal then is to maximize the usual pattern margin,or soft-margin, jointly over hidden label variables and a linear(or kernelized)discriminant func-tion.The second approach generalizes the notion of a margin to bags and aims at maximizing the bag margin directly.The latter seems most appropriate in cases where we mainly care about classifying new test bags,while thefirst approach seems preferable whenever the goal is to derive an accurate pattern-level 
classifier. In the case of singleton bags,both methods are identical and reduce to the standard soft-margin SVM formulation.Algorithms for the MIL problem werefirst presented in[4,1,7].These methods(and related analytical results)are based on hypothesis classes consisting of axis-aligned rectangles.Similarly,methods developed subsequently(e.g.,[8,12])have focused on specially tailored machine learning algorithms that do not compare favorably in the limiting case of the standard classification setting.A notable exception is[10]. More recently,a kernel-based approach has been suggested which derives MI-kernels on bags from a given kernel defined on the pattern-level[5].While the MI-kernel approach treats the MIL problem merely as a representational problem,we strongly believe that a deeper conceptual modification of SVMs as outlined in this paper is necessary.However,we share the ultimate goal with[5],which is to make state-of-the-art kernel-based classification methods available for multiple-instance learning. 2Multiple-Instance LearningIn statistical pattern recognition,it is usually assumed that a training set of la-beled patterns is available where each pair(x i,y i)∈ d×Y has been generated independently from an unknown distribution.The goal is to induce a classifier,i.e., a function from patterns to labels f: d→Y.In this paper,we will focus on the binary case of Y={−1,1}.Multiple-instance learning(MIL)generalizes this problem by making significantly weaker assumptions about the labeling informa-tion.Patterns are grouped into bags and a label is attached to each bag and not to every pattern.More formally,given is a set of input patterns x1,...,x n grouped into bags B1,...,B m,with B I={x i:i∈I}for given index sets I⊆{1,...,n}(typ-ically non-overlapping).With each bag B I is associated a label Y I.These labels are interpreted in the following way:if Y I=−1,then y i=−1for all i∈I,i.e.,no pattern in the bag is a positive example.If on the other hand Y I=1,then at least one pattern x i∈B I is a positive example of the underlying concept.Notice that the information provided by the label is asymmetric in the sense that a negative bag label induces a unique label for every pattern in a bag,while a positive label does not.In general,the relation between pattern labels y i and bag labels Y I can be expressed compactly as Y I=max i∈I y i or alternatively as a set of linear constraintsy i+1i∈I14Maximum Bag Margin Formulation of MILAn alternative way of applying maximum margin ideas to the MIL setting is to extend the notion of a margin from individual patterns to sets of patterns.It is natural to define the functional margin of a bag with respect to a hyperplane byγI≡Y I maxi∈I( w,x i +b).(3)This generalization reflects the fact that predictions for bag labels take the form ˆYI=sgn max i∈I( w,x i +b).Notice that for a positive bag the margin is defined by the margin of the“most positive”pattern,while the margin of a negative bag is de-fined by the“least negative”pattern.The difference between the two formulations of maximum-margin problems is illustrated in Figure1.For the pattern-centered mi-SVM formulation,the margin of every pattern in a positive bag matters,although one has the freedom to set their label variables so as to maximize the margin.In the bag-centered formulation,only one pattern per positive bag matters,since it will determine the margin of the bag.Once these“witness”patterns have been identified,the relative position of other patterns in positive bags with respect to the 
classification boundary becomes ing the above notion of a bag margin,we define an MIL version of the soft-margin classifier byMI-SVM minw,b,ξ12 w 2+CIξI(5)s.t.∀I:Y I=−1∧− w,x i −b≥1−ξI,∀i∈I,or Y I=1∧ w,x s(I) +b≥1−ξI,andξI≥0.(6) In this formulation,every positive bag B I is thus effectively represented by a single member pattern x I≡x s(I).Notice that“non-witness”patterns(x i,i∈I with i=s(I))have no impact on the objective.For given selector variables,it is straightforward to derive the dual objective function which is very similar to the standard SVM Wolfe dual.The only major difference is that the box constraints for the Lagrange parametersαare modified compared to the standard SVM solution,namely one gets0≤αI≤C,for I s.t.Y I=1and0≤i∈Iαi≤C,for I s.t.Y I=−1.(7) Hence,the influence of each bag is bounded by C.5Optimization HeuristicsAs we have shown,both formulations,mi-SVM and MI-SVM,can be cast as mixed-integer programs.In deriving optimization heuristics,we exploit the fact that forinitialize y i =Y I for i ∈IREPEATcompute SVM solution w ,b for data set with imputed labelscompute outputs f i = w ,x i +b for all x i in positive bagsset y i =sgn (f i )for every i ∈I ,Y I =1FOR (every positive bag B I )IF ( i ∈I (1+y i )/2==0)compute i ∗=arg max i ∈I f i set y i ∗=1ENDENDWHILE (imputed labels have changed)OUTPUT (w ,b )Figure 2:Pseudo-code for mi-SVM optimization heuristics (synchronous update).initialize x I = i ∈I x i /|I |for every positive bag B I REPEATcompute QP solution w ,b for data set withpositive examples {x I :Y I =1}compute outputs f i = w ,x i +b for all x i in positive bagsset x I =x s (I ),s (I )=arg max i ∈I f i for every I ,Y I =1WHILE (selector variables s (I )have changed)OUTPUT (w ,b )Figure 3:Pseudo-code for MI-SVM optimization heuristics (synchronous update).given integer variables,i.e.the hidden labels in mi-SVM and the selector variables in MI-SVM,the problem reduces to a QP that can be solved exactly.Of course,all the derivations also hold for general kernel functions K .A general scheme for a simple optimization heuristic may be described as follows.Alternate the following two steps:(i)for given integer variables,solve the associated QP and find the optimal discriminant function,(ii)for a given discriminant,update one,several,or all integer variables in a way that (locally)minimizes the objective.The latter step may involve the update of a label variable y i of a single pattern in mi-SVM,the update of a single selector variable s (I )in MI-SVM,or the simultaneous update of all integer variables.Since the integer variables are essentially decoupled given the discriminant (with the exception of the bag constraints in mi-SVM),this can be done very efficiently.Also notice that we can re-initialize the QP-solver at every iteration with the previously found solution,which will usually result in a significant speed-up.In terms of initialization of the optimization procedure,we suggest to impute positive labels for patterns in positive bags as the initial configuration in mi-SVM.In MI-SVM,x I is initialized as the centroid of the bag patterns.Figure 2and 3summarize pseudo-code descriptions for the algorithms utilized in the experiments.There are many possibilities to refine the above heuristic strategy,for example,by starting from different initial conditions,by using branch and bound techniques to explore larger parts of the discrete part of the search space,by performing stochas-tic updates (simulated annealing)or by maintaining probabilities on the integer variables 
in the spirit of deterministic annealing.However,we have been able to achieve competitive results even with the simpler optimization heuristics,which val-EMDD[12]MI-NN[10]mi-SVMMUSK184.888.987.484.089.284.3Table1:Accuracy results for various methods on the MUSK data sets. idate the maximum margin formulation of SVM.We will address further algorithmic improvements in future work.6Experimental ResultsWe have performed experiments on various data sets to evaluate the proposed tech-niques and compare them to other methods for MIL.As a reference method wehave implemented the EM Diverse Density(EM-DD)method[12],for which very competitive results have been reported on the MUSK benchmark1.6.1MUSK Data SetThe MUSK data sets are the benchmark data sets used in virtually all previous approaches and have been described in detail in the landmark paper[4].Bothdata sets,MUSK1and MUSK2,consist of descriptions of molecules using multiplelow-energy conformations.Each conformation is represented by a166-dimensionalfeature vector derived from surface properties.MUSK1contains on average ap-proximately6conformation per molecule,while MUSK2has on average more than60conformations in each bag.The averaged results of ten10-fold cross-validationruns are summarized in Table1.The SVM results are based on an RBF kernelK(x,y)=exp(−γ x−y 2)with coarsely optimizedγ.For both MUSK1andMUSK2data sets,mi-SVM achieves competitive accuracy values.While MI-SVM outperforms mi-SVM on MUSK2,it is significantly worse on MUSK1.Althoughboth methods fail to achieve the performance of the best method(iterative APR)2,they compare favorably with other approaches to MIL.6.2Automatic Image AnnotationWe have generated new MIL data sets for an image annotation task.The originaldata are color images from the Corel data set that have been preprocessed and segmented with the Blobworld system[2].In this representation,an image consistsof a set of segments(or blobs),each characterized by color,texture and shape descriptors.We have utilized three different categories(“elephant”,“fox”,“tiger”)in our experiments.In each case,the data sets have100positive and100negative example images.The latter have been randomly drawn from a pool of photos ofother animals.Due to the limited accuracy of the image segmentation,the relativesmall number of region descriptors and the small training set size,this ends up beingquite a hard classification problem.We are currently investigating alternative imageData Set MI-SVMCategory poly linear rbf 1391/23078.382.280.079.0 Fox55.257.858.8 1220/23072.178.478.981.6Dims EM-DD mi-SVMinst/feat linear rbf polyTST192.593.993.7 3344/684284.078.274.384.4TST383.382.277.4 3391/662680.582.869.682.9TST778.778.064.5 3300/698265.567.555.263.7TST1078.379.569.1 Table3:Classification accuracy of different methods on the TREC9document categorization sets.representations in the context of applying MIL to content-based image retrieval and automated image indexing,for which we hope to achieve better(absolute) classification accuracies.However,these data sets seem legitimate for a comparative performance analysis.The results are summarized in Table2.They show that both, mi-SVM and MI-SVM achieve a similar accuracy and outperform EM-DD by a few percent.While MI-SVM performed marginally better than mi-SVM,both heuristic methods were susceptible to other nearby local minima.Evidence of this effect was observed through experimentation with asynchronus updates,as described in Section5,where we varied the number of integer variables updated at each 
iteration.6.3Text CategorizationFinally,we have generated MIL data sets for text categorization.Starting from the publicly available TREC9data set,also known as OHSUMED,we have split documents into passages using overlapping windows of maximal50words each. The original data set consists of several years of selected MEDLINE articles.We have worked with the1987data set used as training data in the TREC9filtering task which consists of approximately54,000documents.MEDLINE documents are annotated with MeSH terms(Medical Subject Headings),each defining a binary concept.The total number of MeSH terms in TREC9was4903.While we are currently performing a larger scale evaluation of MIL techniques on the full data set,we report preliminary results here on a smaller,randomly subsampled data set.We have been using thefirst seven categories of the pre-test portion with at least100positive pared to the other data sets the representation is extremely sparse and high-dimensional,which makes this data an interesting addi-tional benchmark.Again,using linear and polynomial kernel functions,which are generally known to work well for text categorization,both methods show improved performance over EM-DD in almost all cases.No significant difference between the two methods is clearly evident for the text classification task.7Conclusion and Future WorkWe have presented a novel approach to multiple-instance learning based on two alternative generalizations of the maximum margin idea used in SVM classification. Although these formulations lead to hard mixed integer problems,even simple lo-cal optimization heuristics already yield quite competitive results compared to the baseline approach.We conjecture that better optimization techniques,that can for example avoid unfavorable local minima,may further improve the classification accuracy.Ongoing work will also extend the experimental evaluation to include larger scale problems.As far as the MIL research problem is concerned,we have considered a wider range of data sets and applications than is usually done and have been able to obtain very good results across a variety of data sets.We strongly suspect that many MIL methods have been optimized to perform well on the MUSK benchmark and we plan to make the data sets used in the experiments available to the public to encourage further empirical comparisons.AcknowledgmentsThis work was sponsored by an NSF-ITR grant,award number IIS-0085836. References[1]P.Auer.On learning from multi-instance examples:Empirical evaluation of a the-oretical approach.In Proc.14th International Conf.on Machine Learning,pages 21–29.Morgan Kaufmann,San Francisco,CA,1997.[2] C.Carson,M.Thomas,S.Belongie,J.M.Hellerstein,and J.Malik.Blobworld:Asystem for region-based image indexing and retrieval.In Proceedings Third Interna-tional Conference on Visual Information Systems.Springer,1999.[3] A.Demirez and K.Bennett.Optimization approaches to semisupervised learning.In M.Ferris,O.Mangasarian,and J.Pang,editors,Applications and Algorithms of Complementarity.Kluwer Academic Publishers,Boston,2000.[4]T.G.Dietterich,throp,and T.Lozano-Perez.Solving the multiple instanceproblem with axis-parallel rectangles.Artificial Intelligence,89(1-2):31–71,1997. 
[5]T.G¨a rtner,P.A.Flach,A.Kowalczyk,and A.J.Smola.Multi-instance kernels.InProc.19th International Conf.on Machine Learning.Morgan Kaufmann,San Fran-cisco,CA,2002.[6]T.Joachims.Transductive inference for text classification using support vector ma-chines.In Proceedings16th International Conference on Machine Learning,pages 200–209.Morgan Kaufmann,San Francisco,CA,1999.[7]P.M.Long and L.Tan.PAC learning axis aligned rectangles with respect to productdistributions from multiple-instance examples.In p.Learning Theory,1996.[8]O.Maron and T.Lozano-P´e rez.A framework for multiple-instance learning.InAdvances in Neural Information Processing Systems,volume10.MIT Press,1998. [9]O.Maron and A.L.Ratan.Multiple-instance learning for natural scene classifica-tion.In Proc.15th International Conf.on Machine Learning,pages341–349.Morgan Kaufmann,San Francisco,CA,1998.[10]J.Ramon and L.De Raedt.Multi instance neural networks.In Proceedings of ICML-2000,Workshop on Attribute-Value and Relational Learning,2000.[11] B.Sch¨o lkopf and A.Smola.Learning with Kernels.Support Vector Machines,Regu-larization,Optimization and Beyond.MIT Press,2002.[12]Qi Zhang and Sally A.Goldman.EM-DD:An improved multiple-instance learningtechnique.In Advances in Neural Information Processing Systems,volume14.MIT Press,2002.。
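For readers who prefer runnable code to the pseudo-code of Figure 2 above, the following is a minimal re-implementation sketch of the alternating mi-SVM heuristic. It is our own illustration under assumptions (scikit-learn's SVC as the pattern-level solver, bags supplied as index arrays), not the authors' original implementation.

```python
import numpy as np
from sklearn.svm import SVC

def mi_svm(X, bags, bag_labels, C=1.0, kernel="rbf", max_iter=20):
    """Alternating mi-SVM heuristic: impute pattern labels from bag labels,
    retrain the SVM, and repeat until the imputed labels stop changing."""
    X = np.asarray(X, dtype=float)
    bags = [np.asarray(I) for I in bags]
    y = np.empty(len(X), dtype=int)
    for I, Y_I in zip(bags, bag_labels):
        y[I] = Y_I                                 # initialise y_i = Y_I
    clf = None
    for _ in range(max_iter):
        clf = SVC(C=C, kernel=kernel).fit(X, y)
        y_old = y.copy()
        for I, Y_I in zip(bags, bag_labels):
            if Y_I != 1:
                continue                           # negative bags keep y_i = -1
            f = clf.decision_function(X[I])
            y[I] = np.where(f >= 0, 1, -1)         # impute labels in positive bags
            if not np.any(y[I] == 1):              # enforce "at least one positive"
                y[I[np.argmax(f)]] = 1
        if np.array_equal(y, y_old):               # imputed labels unchanged: stop
            break
    return clf
```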
Support vector machine:A tool for mapping mineral prospectivityRenguang Zuo a,n,Emmanuel John M.Carranza ba State Key Laboratory of Geological Processes and Mineral Resources,China University of Geosciences,Wuhan430074;Beijing100083,Chinab Department of Earth Systems Analysis,Faculty of Geo-Information Science and Earth Observation(ITC),University of Twente,Enschede,The Netherlandsa r t i c l e i n f oArticle history:Received17May2010Received in revised form3September2010Accepted25September2010Keywords:Supervised learning algorithmsKernel functionsWeights-of-evidenceTurbidite-hosted AuMeguma Terraina b s t r a c tIn this contribution,we describe an application of support vector machine(SVM),a supervised learningalgorithm,to mineral prospectivity mapping.The free R package e1071is used to construct a SVM withsigmoid kernel function to map prospectivity for Au deposits in western Meguma Terrain of Nova Scotia(Canada).The SVM classification accuracies of‘deposit’are100%,and the SVM classification accuracies ofthe‘non-deposit’are greater than85%.The SVM classifications of mineral prospectivity have5–9%lowertotal errors,13–14%higher false-positive errors and25–30%lower false-negative errors compared tothose of the WofE prediction.The prospective target areas predicted by both SVM and WofE reflect,nonetheless,controls of Au deposit occurrence in the study area by NE–SW trending anticlines andcontact zones between Goldenville and Halifax Formations.The results of the study indicate theusefulness of SVM as a tool for predictive mapping of mineral prospectivity.&2010Elsevier Ltd.All rights reserved.1.IntroductionMapping of mineral prospectivity is crucial in mineral resourcesexploration and mining.It involves integration of information fromdiverse geoscience datasets including geological data(e.g.,geologicalmap),geochemical data(e.g.,stream sediment geochemical data),geophysical data(e.g.,magnetic data)and remote sensing data(e.g.,multispectral satellite data).These sorts of data can be visualized,processed and analyzed with the support of computer and GIStechniques.Geocomputational techniques for mapping mineral pro-spectivity include weights of evidence(WofE)(Bonham-Carter et al.,1989),fuzzy WofE(Cheng and Agterberg,1999),logistic regression(Agterberg and Bonham-Carter,1999),fuzzy logic(FL)(Ping et al.,1991),evidential belief functions(EBF)(An et al.,1992;Carranza andHale,2003;Carranza et al.,2005),neural networks(NN)(Singer andKouda,1996;Porwal et al.,2003,2004),a‘wildcat’method(Carranza,2008,2010;Carranza and Hale,2002)and a hybrid method(e.g.,Porwalet al.,2006;Zuo et al.,2009).These techniques have been developed toquantify indices of occurrence of mineral deposit occurrence byintegrating multiple evidence layers.Some geocomputational techni-ques can be performed using popular software packages,such asArcWofE(a free ArcView extension)(Kemp et al.,1999),ArcSDM9.3(afree ArcGIS9.3extension)(Sawatzky et al.,2009),MI-SDM2.50(aMapInfo extension)(Avantra Geosystems,2006),GeoDAS(developedbased on MapObjects,which is an Environmental Research InstituteDevelopment Kit)(Cheng,2000).Other geocomputational techniques(e.g.,FL and NN)can be performed by using R and Matlab.Geocomputational techniques for mineral prospectivity map-ping can be categorized generally into two types–knowledge-driven and data-driven–according to the type of inferencemechanism considered(Bonham-Carter1994;Pan and Harris2000;Carranza2008).Knowledge-driven techniques,such as thosethat apply FL and EBF,are based on expert knowledge 
andexperience about spatial associations between mineral prospec-tivity criteria and mineral deposits of the type sought.On the otherhand,data-driven techniques,such as WofE and NN,are based onthe quantification of spatial associations between mineral pro-spectivity criteria and known occurrences of mineral deposits ofthe type sought.Additional,the mixing of knowledge-driven anddata-driven methods also is used for mapping of mineral prospec-tivity(e.g.,Porwal et al.,2006;Zuo et al.,2009).Every geocomputa-tional technique has advantages and disadvantages,and one or theother may be more appropriate for a given geologic environmentand exploration scenario(Harris et al.,2001).For example,one ofthe advantages of WofE is its simplicity,and straightforwardinterpretation of the weights(Pan and Harris,2000),but thismodel ignores the effects of possible correlations amongst inputpredictor patterns,which generally leads to biased prospectivitymaps by assuming conditional independence(Porwal et al.,2010).Comparisons between WofE and NN,NN and LR,WofE,NN and LRfor mineral prospectivity mapping can be found in Singer andKouda(1999),Harris and Pan(1999)and Harris et al.(2003),respectively.Mapping of mineral prospectivity is a classification process,because its product(i.e.,index of mineral deposit occurrence)forevery location is classified as either prospective or non-prospectiveaccording to certain combinations of weighted mineral prospec-tivity criteria.There are two types of classification techniques.Contents lists available at ScienceDirectjournal homepage:/locate/cageoComputers&Geosciences0098-3004/$-see front matter&2010Elsevier Ltd.All rights reserved.doi:10.1016/j.cageo.2010.09.014n Corresponding author.E-mail addresses:zrguang@,zrguang1981@(R.Zuo).Computers&Geosciences](]]]])]]]–]]]One type is known as supervised classification,which classifies mineral prospectivity of every location based on a training set of locations of known deposits and non-deposits and a set of evidential data layers.The other type is known as unsupervised classification, which classifies mineral prospectivity of every location based solely on feature statistics of individual evidential data layers.A support vector machine(SVM)is a model of algorithms for supervised classification(Vapnik,1995).Certain types of SVMs have been developed and applied successfully to text categorization, handwriting recognition,gene-function prediction,remote sensing classification and other studies(e.g.,Joachims1998;Huang et al.,2002;Cristianini and Scholkopf,2002;Guo et al.,2005; Kavzoglu and Colkesen,2009).An SVM performs classification by constructing an n-dimensional hyperplane in feature space that optimally separates evidential data of a predictor variable into two categories.In the parlance of SVM literature,a predictor variable is called an attribute whereas a transformed attribute that is used to define the hyperplane is called a feature.The task of choosing the most suitable representation of the target variable(e.g.,mineral prospectivity)is known as feature selection.A set of features that describes one case(i.e.,a row of predictor values)is called a feature vector.The feature vectors near the hyperplane are the support feature vectors.The goal of SVM modeling is tofind the optimal hyperplane that separates clusters of feature vectors in such a way that feature vectors representing one category of the target variable (e.g.,prospective)are on one side of the plane and feature vectors representing the other category of the target 
variable(e.g.,non-prospective)are on the other size of the plane.A good separation is achieved by the hyperplane that has the largest distance to the neighboring data points of both categories,since in general the larger the margin the better the generalization error of the classifier.In this paper,SVM is demonstrated as an alternative tool for integrating multiple evidential variables to map mineral prospectivity.2.Support vector machine algorithmsSupport vector machines are supervised learning algorithms, which are considered as heuristic algorithms,based on statistical learning theory(Vapnik,1995).The classical task of a SVM is binary (two-class)classification.Suppose we have a training set composed of l feature vectors x i A R n,where i(¼1,2,y,n)is the number of feature vectors in training samples.The class in which each sample is identified to belong is labeled y i,which is equal to1for one class or is equal toÀ1for the other class(i.e.y i A{À1,1})(Huang et al., 2002).If the two classes are linearly separable,then there exists a family of linear separators,also called separating hyperplanes, which satisfy the following set of equations(KavzogluandFig.1.Support vectors and optimum hyperplane for the binary case of linearly separable data sets.Table1Experimental data.yer A Layer B Layer C Layer D Target yer A Layer B Layer C Layer D Target1111112100000 2111112200000 3111112300000 4111112401000 5111112510000 6111112600000 7111112711100 8111112800000 9111012900000 10111013000000 11101113111100 12111013200000 13111013300000 14111013400000 15011013510000 16101013600000 17011013700000 18010113811100 19010112900000 20101014010000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]2Colkesen,2009)(Fig.1):wx iþb Zþ1for y i¼þ1wx iþb rÀ1for y i¼À1ð1Þwhich is equivalent toy iðwx iþbÞZ1,i¼1,2,...,nð2ÞThe separating hyperplane can then be formalized as a decision functionfðxÞ¼sgnðwxþbÞð3Þwhere,sgn is a sign function,which is defined as follows:sgnðxÞ¼1,if x400,if x¼0À1,if x o08><>:ð4ÞThe two parameters of the separating hyperplane decision func-tion,w and b,can be obtained by solving the following optimization function:Minimize tðwÞ¼12J w J2ð5Þsubject toy Iððwx iÞþbÞZ1,i¼1,...,lð6ÞThe solution to this optimization problem is the saddle point of the Lagrange functionLðw,b,aÞ¼1J w J2ÀX li¼1a iðy iððx i wÞþbÞÀ1Þð7Þ@ @b Lðw,b,aÞ¼0@@wLðw,b,aÞ¼0ð8Þwhere a i is a Lagrange multiplier.The Lagrange function is minimized with respect to w and b and is maximized with respect to a grange multipliers a i are determined by the following optimization function:MaximizeX li¼1a iÀ12X li,j¼1a i a j y i y jðx i x jÞð9Þsubject toa i Z0,i¼1,...,l,andX li¼1a i y i¼0ð10ÞThe separating rule,based on the optimal hyperplane,is the following decision function:fðxÞ¼sgnX li¼1y i a iðxx iÞþb!ð11ÞMore details about SVM algorithms can be found in Vapnik(1995) and Tax and Duin(1999).3.Experiments with kernel functionsFor spatial geocomputational analysis of mineral exploration targets,the decision function in Eq.(3)is a kernel function.The choice of a kernel function(K)and its parameters for an SVM are crucial for obtaining good results.The kernel function can be usedTable2Errors of SVM classification using linear kernel functions.l Number ofsupportvectors Testingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.2580.00.00.0180.00.00.0 1080.00.00.0 10080.00.00.0 100080.00.00.0Table3Errors of SVM classification using polynomial kernel functions when d¼3and r¼0. 
l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25120.00.00.0160.00.00.01060.00.00.010060.00.00.0 100060.00.00.0Table4Errors of SVM classification using polynomial kernel functions when l¼0.25,r¼0.d Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)11110.00.0 5.010290.00.00.0100230.045.022.5 1000200.090.045.0Table5Errors of SVM classification using polynomial kernel functions when l¼0.25and d¼3.r Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0120.00.00.01100.00.00.01080.00.00.010080.00.00.0 100080.00.00.0Table6Errors of SVM classification using radial kernel functions.l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25140.00.00.01130.00.00.010130.00.00.0100130.00.00.0 1000130.00.00.0Table7Errors of SVM classification using sigmoid kernel functions when r¼0.l Number ofsupportvectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0.25400.00.00.01400.035.017.510400.0 6.0 3.0100400.0 6.0 3.0 1000400.0 6.0 3.0R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]3to construct a non-linear decision boundary and to avoid expensive calculation of dot products in high-dimensional feature space.The four popular kernel functions are as follows:Linear:Kðx i,x jÞ¼l x i x j Polynomial of degree d:Kðx i,x jÞ¼ðl x i x jþrÞd,l40Radial basis functionðRBFÞ:Kðx i,x jÞ¼exp fÀl99x iÀx j992g,l40 Sigmoid:Kðx i,x jÞ¼tanhðl x i x jþrÞ,l40ð12ÞThe parameters l,r and d are referred to as kernel parameters. The parameter l serves as an inner product coefficient in the polynomial function.In the case of the RBF kernel(Eq.(12)),l determines the RBF width.In the sigmoid kernel,l serves as an inner product coefficient in the hyperbolic tangent function.The parameter r is used for kernels of polynomial and sigmoid types. 
The parameter d is the degree of a polynomial function.We performed some experiments to explore the performance of the parameters used in a kernel function.The dataset used in the experiments(Table1),which are derived from the study area(see below),were compiled according to the requirementfor Fig.2.Simplified geological map in western Meguma Terrain of Nova Scotia,Canada(after,Chatterjee1983;Cheng,2008).Table8Errors of SVM classification using sigmoid kernel functions when l¼0.25.r Number ofSupportVectorsTestingerror(non-deposit)(%)Testingerror(deposit)(%)Total error(%)0400.00.00.01400.00.00.010400.00.00.0100400.00.00.01000400.00.00.0R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]4classification analysis.The e1071(Dimitriadou et al.,2010),a freeware R package,was used to construct a SVM.In e1071,the default values of l,r and d are1/(number of variables),0and3,respectively.From the study area,we used40geological feature vectors of four geoscience variables and a target variable for classification of mineral prospec-tivity(Table1).The target feature vector is either the‘non-deposit’class(or0)or the‘deposit’class(or1)representing whether mineral exploration target is absent or present,respectively.For‘deposit’locations,we used the20known Au deposits.For‘non-deposit’locations,we randomly selected them according to the following four criteria(Carranza et al.,2008):(i)non-deposit locations,in contrast to deposit locations,which tend to cluster and are thus non-random, must be random so that multivariate spatial data signatures are highly non-coherent;(ii)random non-deposit locations should be distal to any deposit location,because non-deposit locations proximal to deposit locations are likely to have similar multivariate spatial data signatures as the deposit locations and thus preclude achievement of desired results;(iii)distal and random non-deposit locations must have values for all the univariate geoscience spatial data;(iv)the number of distal and random non-deposit locations must be equaltoFig.3.Evidence layers used in mapping prospectivity for Au deposits(from Cheng,2008):(a)and(b)represent optimum proximity to anticline axes(2.5km)and contacts between Goldenville and Halifax formations(4km),respectively;(c)and(d)represent,respectively,background and anomaly maps obtained via S-Afiltering of thefirst principal component of As,Cu,Pb and Zn data.R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]5the number of deposit locations.We used point pattern analysis (Diggle,1983;2003;Boots and Getis,1988)to evaluate degrees of spatial randomness of sets of non-deposit locations and tofind distance from any deposit location and corresponding probability that one deposit location is situated next to another deposit location.In the study area,we found that the farthest distance between pairs of Au deposits is71km,indicating that within that distance from any deposit location in there is100%probability of another deposit location. 
However,few non-deposit locations can be selected beyond71km of the individual Au deposits in the study area.Instead,we selected random non-deposit locations beyond11km from any deposit location because within this distance from any deposit location there is90% probability of another deposit location.When using a linear kernel function and varying l from0.25to 1000,the number of support vectors and the testing errors for both ‘deposit’and‘non-deposit’do not vary(Table2).In this experiment the total error of classification is0.0%,indicating that the accuracy of classification is not sensitive to the choice of l.With a polynomial kernel function,we tested different values of l, d and r as follows.If d¼3,r¼0and l is increased from0.25to1000,the number of support vectors decreases from12to6,but the testing errors for‘deposit’and‘non-deposit’remain nil(Table3).If l¼0.25, r¼0and d is increased from1to1000,the number of support vectors firstly increases from11to29,then decreases from23to20,the testing error for‘non-deposit’decreases from10.0%to0.0%,whereas the testing error for‘deposit’increases from0.0%to90%(Table4). In this experiment,the total error of classification is minimum(0.0%) when d¼10(Table4).If l¼0.25,d¼3and r is increased from 0to1000,the number of support vectors decreases from12to8,but the testing errors for‘deposit’and‘non-deposit’remain nil(Table5).When using a radial kernel function and varying l from0.25to 1000,the number of support vectors decreases from14to13,but the testing errors of‘deposit’and‘non-deposit’remain nil(Table6).With a sigmoid kernel function,we experimented with different values of l and r as follows.If r¼0and l is increased from0.25to1000, the number of support vectors is40,the testing errors for‘non-deposit’do not change,but the testing error of‘deposit’increases from 0.0%to35.0%,then decreases to6.0%(Table7).In this experiment,the total error of classification is minimum at0.0%when l¼0.25 (Table7).If l¼0.25and r is increased from0to1000,the numbers of support vectors and the testing errors of‘deposit’and‘non-deposit’do not change and the total error remains nil(Table8).The results of the experiments demonstrate that,for the datasets in the study area,a linear kernel function,a polynomial kernel function with d¼3and r¼0,or l¼0.25,r¼0and d¼10,or l¼0.25and d¼3,a radial kernel function,and a sigmoid kernel function with r¼0and l¼0.25are optimal kernel functions.That is because the testing errors for‘deposit’and‘non-deposit’are0%in the SVM classifications(Tables2–8).Nevertheless,a sigmoid kernel with l¼0.25and r¼0,compared to all the other kernel functions,is the most optimal kernel function because it uses all the input support vectors for either‘deposit’or‘non-deposit’(Table1)and the training and testing errors for‘deposit’and‘non-deposit’are0% in the SVM classification(Tables7and8).4.Prospectivity mapping in the study areaThe study area is located in western Meguma Terrain of Nova Scotia,Canada.It measures about7780km2.The host rock of Au deposits in this area consists of Cambro-Ordovician low-middle grade metamorphosed sedimentary rocks and a suite of Devonian aluminous granitoid intrusions(Sangster,1990;Ryan and Ramsay, 1997).The metamorphosed sedimentary strata of the Meguma Group are the lower sand-dominatedflysch Goldenville Formation and the upper shalyflysch Halifax Formation occurring in the central part of the study area.The igneous rocks occur mostly in the northern part of the study area(Fig.2).In this area,20turbidite-hosted Au deposits and 
occurrences (Ryan and Ramsay,1997)are found in the Meguma Group, especially near the contact zones between Goldenville and Halifax Formations(Chatterjee,1983).The major Au mineralization-related geological features are the contact zones between Gold-enville and Halifax Formations,NE–SW trending anticline axes and NE–SW trending shear zones(Sangster,1990;Ryan and Ramsay, 1997).This dataset has been used to test many mineral prospec-tivity mapping algorithms(e.g.,Agterberg,1989;Cheng,2008). More details about the geological settings and datasets in this area can be found in Xu and Cheng(2001).We used four evidence layers(Fig.3)derived and used by Cheng (2008)for mapping prospectivity for Au deposits in the yers A and B represent optimum proximity to anticline axes(2.5km) and optimum proximity to contacts between Goldenville and Halifax Formations(4km),yers C and D represent variations in geochemical background and anomaly,respectively, as modeled by multifractalfilter mapping of thefirst principal component of As,Cu,Pb,and Zn data.Details of how the four evidence layers were obtained can be found in Cheng(2008).4.1.Training datasetThe application of SVM requires two subsets of training loca-tions:one training subset of‘deposit’locations representing presence of mineral deposits,and a training subset of‘non-deposit’locations representing absence of mineral deposits.The value of y i is1for‘deposits’andÀ1for‘non-deposits’.For‘deposit’locations, we used the20known Au deposits(the sixth column of Table1).For ‘non-deposit’locations(last column of Table1),we obtained two ‘non-deposit’datasets(Tables9and10)according to the above-described selection criteria(Carranza et al.,2008).We combined the‘deposits’dataset with each of the two‘non-deposit’datasets to obtain two training datasets.Each training dataset commonly contains20known Au deposits but contains different20randomly selected non-deposits(Fig.4).4.2.Application of SVMBy using the software e1071,separate SVMs both with sigmoid kernel with l¼0.25and r¼0were constructed using the twoTable9The value of each evidence layer occurring in‘non-deposit’dataset1.yer A Layer B Layer C Layer D100002000031110400005000061000700008000090100 100100 110000 120000 130000 140000 150000 160100 170000 180000 190100 200000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]] 6training datasets.With training dataset1,the classification accuracies for‘non-deposits’and‘deposits’are95%and100%, respectively;With training dataset2,the classification accuracies for‘non-deposits’and‘deposits’are85%and100%,respectively.The total classification accuracies using the two training datasets are97.5%and92.5%,respectively.The patterns of the predicted prospective target areas for Au deposits(Fig.5)are defined mainly by proximity to NE–SW trending anticlines and proximity to contact zones between Goldenville and Halifax Formations.This indicates that‘geology’is better than‘geochemistry’as evidence of prospectivity for Au deposits in this area.With training dataset1,the predicted prospective target areas occupy32.6%of the study area and contain100%of the known Au deposits(Fig.5a).With training dataset2,the predicted prospec-tive target areas occupy33.3%of the study area and contain95.0% of the known Au deposits(Fig.5b).In contrast,using the same datasets,the prospective target areas predicted via WofE occupy 19.3%of study area and contain70.0%of the known Au deposits (Cheng,2008).The error matrices for two SVM classifications show that the type1(false-positive)and 
type2(false-negative)errors based on training dataset1(Table11)and training dataset2(Table12)are 32.6%and0%,and33.3%and5%,respectively.The total errors for two SVM classifications are16.3%and19.15%based on training datasets1and2,respectively.In contrast,the type1and type2 errors for the WofE prediction are19.3%and30%(Table13), respectively,and the total error for the WofE prediction is24.65%.The results show that the total errors of the SVM classifications are5–9%lower than the total error of the WofE prediction.The 13–14%higher false-positive errors of the SVM classifications compared to that of the WofE prediction suggest that theSVMFig.4.The locations of‘deposit’and‘non-deposit’.Table10The value of each evidence layer occurring in‘non-deposit’dataset2.yer A Layer B Layer C Layer D110102000030000411105000060110710108000091000101110111000120010131000140000150000161000171000180010190010200000R.Zuo,E.J.M.Carranza/Computers&Geosciences](]]]])]]]–]]]7classifications result in larger prospective areas that may not contain undiscovered deposits.However,the 25–30%higher false-negative error of the WofE prediction compared to those of the SVM classifications suggest that the WofE analysis results in larger non-prospective areas that may contain undiscovered deposits.Certainly,in mineral exploration the intentions are notto miss undiscovered deposits (i.e.,avoid false-negative error)and to minimize exploration cost in areas that may not really contain undiscovered deposits (i.e.,keep false-positive error as low as possible).Thus,results suggest the superiority of the SVM classi-fications over the WofE prediction.5.ConclusionsNowadays,SVMs have become a popular geocomputational tool for spatial analysis.In this paper,we used an SVM algorithm to integrate multiple variables for mineral prospectivity mapping.The results obtained by two SVM applications demonstrate that prospective target areas for Au deposits are defined mainly by proximity to NE–SW trending anticlines and to contact zones between the Goldenville and Halifax Formations.In the study area,the SVM classifications of mineral prospectivity have 5–9%lower total errors,13–14%higher false-positive errors and 25–30%lower false-negative errors compared to those of the WofE prediction.These results indicate that SVM is a potentially useful tool for integrating multiple evidence layers in mineral prospectivity mapping.Table 11Error matrix for SVM classification using training dataset 1.Known All ‘deposits’All ‘non-deposits’TotalPrediction ‘Deposit’10032.6132.6‘Non-deposit’067.467.4Total100100200Type 1(false-positive)error ¼32.6.Type 2(false-negative)error ¼0.Total error ¼16.3.Note :Values in the matrix are percentages of ‘deposit’and ‘non-deposit’locations.Table 12Error matrix for SVM classification using training dataset 2.Known All ‘deposits’All ‘non-deposits’TotalPrediction ‘Deposits’9533.3128.3‘Non-deposits’566.771.4Total100100200Type 1(false-positive)error ¼33.3.Type 2(false-negative)error ¼5.Total error ¼19.15.Note :Values in the matrix are percentages of ‘deposit’and ‘non-deposit’locations.Table 13Error matrix for WofE prediction.Known All ‘deposits’All ‘non-deposits’TotalPrediction ‘Deposit’7019.389.3‘Non-deposit’3080.7110.7Total100100200Type 1(false-positive)error ¼19.3.Type 2(false-negative)error ¼30.Total error ¼24.65.Note :Values in the matrix are percentages of ‘deposit’and ‘non-deposit’locations.Fig.5.Prospective targets area for Au deposits delineated by SVM.(a)and (b)are obtained using training dataset 1and 
2,respectively.R.Zuo,E.J.M.Carranza /Computers &Geosciences ](]]]])]]]–]]]8。
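Although the study above is carried out with the R package e1071, the same workflow can be sketched in Python. The snippet below is our own illustration rather than the authors' code: the binary evidence values are synthetic stand-ins for Table 1, while the sigmoid-kernel parameters gamma = 0.25 and coef0 = 0 mirror the optimal l and r reported in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical stand-in for Table 1: four binary evidence layers (A-D) at
# 20 known Au deposit locations (label 1) and 20 random non-deposit locations (label 0).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(40, 4)).astype(float)
y_train = np.r_[np.ones(20), np.zeros(20)]

# Sigmoid kernel K(x, x') = tanh(gamma <x, x'> + coef0) with gamma = 0.25, coef0 = 0.
clf = SVC(kernel="sigmoid", gamma=0.25, coef0=0.0, C=1.0).fit(X_train, y_train)

# Per-class training accuracy, analogous to the error matrices of Tables 11-12.
pred = clf.predict(X_train)
for label, name in [(1.0, "deposit"), (0.0, "non-deposit")]:
    mask = y_train == label
    print(f"{name} accuracy: {(pred[mask] == label).mean():.2f}")

# A prospectivity map is then obtained by applying clf.predict to the
# evidence vectors of every grid cell in the study area.
```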
Normalization in Support Vector MachinesArnulf B.A.Graf1,2and Silvio Borer1arnulf.graf@tuebingen.mpg.de1.Swiss Federal Institute of Technology LausanneLaboratory of Computational Neuroscience,DI-LCN1015Lausanne EPFL,Switzerland2.Max Planck Institute for Biological CyberneticsSpemannstrasse38,72076T¨u bingen,GermanyAbstract.This article deals with various aspects of normalization inthe context of Support Vector Machines.We considerfist normalizationof the vectors in the input space and point out the inherent limitations.A natural extension to the feature space is then represented by the kernelfunction normalization.A correction of the position of the Optimal Sep-arating Hyperplane is subsequently introduced so as to suit better thesenormalized kernels.Numerical experimentsfinally evaluate the differentapproaches.Keywords.Support Vector Machines,input space,feature space,nor-malization,optimal separating hyperplane1IntroductionSupport Vector Machines(SVMs)have drawn much attention because of their high classification performance[1].In this article,they are applied in a computer vision problem,namely the classification of images.SVMs are often combined with a preprocessing stage to form pattern recognition systems.Moreover,it turns out to be intrinsically necessary for the SV algorithm(see[1]-[4])to have data which is preprocessed.It has been shown[5]that normalization is a pre-processing type which plays an important role in this context.Theoretical con-siderations on the kernel interpretation of normalization and an adaptation of the SV algorithm to normalized kernel functions will be developed in this paper in order to shed new light on such pattern recognition systems.In this study,we deal with normalization aspects in SVMs.First,normaliza-tion in the input space is considered in Sec.2and a resulting problem related to SV classification is outlined.A possible solution,namely the normalization of the kernel functions in the feature space,is then subsequently presented.A modification of the SV algorithm is then presented in Sec.3.This modifica-tion takes into account the properties of the normalized kernels and amounts to a correction of the position of the Optimal Separating Hyperplane(OSH).The corresponding numerical experiments are reported in Sec.4and Sec.5concludes the paper.2Normalization in the Input and Feature SpaceThe normalization of the vectors of the input space can be considered as the most basic type of preprocessing.Assume x∈R N is an input vector,the corre-sponding normalized vector˜x may be expressed as:x˜x=function˜K(x,y)as follows:˜K(x,y)=K(x,y)K(x,x)K(y,y)∈R(2) We clearly have˜K(x,x)=1.Notice that this is always true for RBF kernels K(x,y)=exp(− x−y 2x y p=˜K(x,y).When the kernel function is replaced by the dot product in the input space(p=1),equation(2)reduces to equation (1).Moreover,note that˜K(x,y)=˜ϕ(x)·˜ϕ(y)where˜ϕ(x)=ϕ(x)√w from the OSH.However,all the datapoints lieon a unit hypersphere in the normalized feature space.It would thus be more accurate to do classification not according to an OSH computed such that the margins are symmetric around it,but according to an OSH determined such that the margins define equal distances on the hypersphere.This may be done by adjusting the value of b.The margins,and thus w,are unchanged since the problem is symmetric around w.In other words,the separating hyperplane is translated.In order to compute the correction to the value of b,consider figure2.The intersection of the two margin hyperplanes with the unit hyper-sphere are 
represented by the anglesα1andα2defined by cos(α1)=b−δand cos(α2)=b+δ.The bisection of the angle formed byα1andα2is representedby the angleϕand can be computed asϕ=α1+α22.Moreover,we set cos(ϕ)=b′where b′stands for the new position of the OSH. Finally we get the following expression:b′(b,w)=cos arccos(b−1 w )a target+1to the training images of the object i and a target-1to the training images of all the remaining objects.We thus choose a“one against all”strategy. The regularisation parameter is set a priori to1000.Classification Each testing image z is presented to each of the classifiers and is assigned to class i where C i(z)≥C k(z)∀k=i according to a“winner take all”strategy.As shown in[8],this approach is computationally very efficient for feedforward neural networks with binary gates such as those encountered in SVMs.Since the class of the testing images is known,an error is computed for each test image.The classification error for the experiment is then the average of all the individual errors for each C i and the corresponding variance over the objects is also computed.Polynomial kernel functions K(x,y)=(1+xy)p are particularly well suited for the considered studies.The results are presented in Figure3.When consideringFig.3.Classification error and variance with normalization in the input space,in the feature space or in the feature space with correction of bthe results,we notice that the three error curves are monotonically decreasing and none are crossing each other.This seems to reflect the good generalization ability of the considered algorithms.The experiments clearly point out that the normalization in the feature space outperforms the normalization in the input space for all the considered degrees except for p=1.Indeed,for linear kernels, input and feature space normalizations are of the same order of magnitude for large values of the input vectors.This is the case here since the latter represent vectors of size1024with values ranging between0and255.Thus,for higher values of p(more sensitive the kernel functions),the difference between these two normalizations gets more pronounced,yielding a bigger difference in the classification performance.Furthermore,the correction of the position of the OSH for normalized kernels decreases the classification error further(albeit not significantly)for the four most sensitive kernels.Again,the linear case is left unchanged since both normalizations are then almost identical.The correctionbrought to the value of b by the previously-introduced method can be measured as being|b−b′|。
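A small sketch of ours of the feature-space normalization of Eq. (2), assuming a precomputed kernel matrix. The corrected bias follows our reading of the geometric construction above (cos α1 = b − δ, cos α2 = b + δ, with δ taken as the geometric margin 1/||w||), so it should be treated as an interpretation rather than the paper's exact formula.

```python
import numpy as np

def normalize_kernel(K):
    # K_tilde(x, y) = K(x, y) / sqrt(K(x, x) * K(y, y))   (Eq. 2)
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def corrected_bias(b, w_norm):
    # Shift the OSH so the two margins cut off equal arcs on the unit hypersphere:
    # cos(a1) = b - delta, cos(a2) = b + delta, b' = cos((a1 + a2) / 2),
    # with delta = 1 / ||w|| assumed to be the geometric margin.
    delta = 1.0 / w_norm
    a1 = np.arccos(np.clip(b - delta, -1.0, 1.0))
    a2 = np.arccos(np.clip(b + delta, -1.0, 1.0))
    return np.cos(0.5 * (a1 + a2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(6, 4))
    K = (X @ X.T + 1.0) ** 3                 # polynomial kernel of degree 3
    K_tilde = normalize_kernel(K)
    print(np.allclose(np.diag(K_tilde), 1.0))  # normalized kernel has unit diagonal
```

A kernel matrix normalized this way can be passed to an SVM with a precomputed-kernel option, which is how the classifiers in the experiments above would be trained.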
Classification and Recognition Algorithm for Long-wave Infrared Targets Based on Support Vector Machine

WANG Zhouchun 1,2,3,4, CUI Wennan 1,2,4, ZHANG Tao 1,2,3,4
(1. Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. ShanghaiTech University, Shanghai 201210, China; 4. Key Laboratory of Intelligent Infrared Perception, Chinese Academy of Sciences, Shanghai 200083, China)

Abstract: Infrared images have low resolution and a single color, but because infrared devices can operate around the clock they play an important role in certain scenes. This paper adopts a support vector machine (SVM) based algorithm for the classification and recognition of long-wave infrared target images: for each image, the edge features and texture features extracted by the algorithm serve as the recognition features of the target, are fed into the SVM, and the target category is output. In the experiments, a combined histogram of oriented gradients + gray-level co-occurrence matrix + SVM model was designed, and images of eight person-target scenes were collected for training and testing. The results show that the model achieves a high classification accuracy for the same or different person targets wearing different clothing. The combined model can therefore meet the needs of application fields such as security surveillance, industrial inspection and military target recognition, and has certain advantages in the field of infrared target recognition.

Key words: long-wave infrared target; support vector machine; recognition feature; target recognition
CLC number: TN219    Document code: A    Article ID: 1001-8891(2021)02-0153-09

0 Introduction
Infrared radiation is electromagnetic radiation with wavelengths from 760 nm to 1 mm [1]. Infrared images have low resolution and a single color; however, because infrared devices can work around the clock, they have significant application value in certain scenes, for example in the military, transportation and security fields.
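The HOG + GLCM + SVM pipeline summarized in the abstract above can be sketched as follows. This is our own minimal illustration with scikit-image (version 0.19 or later, where the functions are spelled graycomatrix and graycoprops) and scikit-learn, not the authors' implementation; the cell size, GLCM distances and angles, and SVM parameters are assumptions.

```python
import numpy as np
from skimage.feature import hog, graycomatrix, graycoprops   # scikit-image >= 0.19
from sklearn.svm import SVC

def extract_features(ir_image):
    """Edge (HOG) + texture (GLCM) features for one 8-bit grayscale IR image."""
    # Edge features: histogram of oriented gradients.
    hog_vec = hog(ir_image, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2), feature_vector=True)
    # Texture features: gray-level co-occurrence matrix statistics.
    glcm = graycomatrix(ir_image, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    glcm_vec = np.hstack([graycoprops(glcm, p).ravel()
                          for p in ("contrast", "homogeneity", "energy", "correlation")])
    return np.hstack([hog_vec, glcm_vec])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # stand-in IR frame
    print(extract_features(demo).shape)
    # With labeled training images the classifier would then be, e.g.:
    # clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(feature_matrix, labels)
```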
Briefly describe the principle of the SVM (support vector machine) and how it handles nonlinear problems.

A support vector machine (SVM) is a widely used machine learning algorithm, commonly applied to classification and regression problems. Its principle is grounded in statistical learning theory and the structural risk minimization principle: classification is achieved by searching for an optimal separating hyperplane. When dealing with nonlinear problems, an SVM introduces kernel functions that map the data into a higher-dimensional space, where nonlinear classification becomes possible.
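As a minimal usage sketch of the two task types mentioned above (toy data, default hyperparameters, not part of the original answer), scikit-learn exposes both the classifier and the regressor variant:

import numpy as np
from sklearn.svm import SVC, SVR

X = np.array([[0.0], [1.0], [2.0], [3.0]])
print(SVC(kernel="rbf").fit(X, [0, 0, 1, 1]).predict([[1.6]]))          # class label
print(SVR(kernel="rbf").fit(X, [0.0, 0.8, 1.9, 3.1]).predict([[1.6]]))  # real value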
1. The principle of SVM

An SVM is a binary classification model; its basic idea is to find a hyperplane in feature space that separates samples of different classes. Specifically, the SVM looks for the optimal hyperplane that maximizes the margin between the samples while splitting them into two classes.

1.1 The linearly separable case

Suppose the feature space contains sample points of two classes and the two classes can be completely separated by a hyperplane. Infinitely many hyperplanes then satisfy this condition, but we look for the one with the largest margin. The margin is the sum of the distances from the training points closest to the hyperplane (one on each side) to the hyperplane. We select as our model the decision function corresponding to the maximum-margin hyperplane; the closest training points that determine it are called support vectors.
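The following short sketch, on assumed toy data, makes the maximum-margin idea concrete: fit a linear SVM, read off w and b, and recover the margin 2/‖w‖ together with the support vectors (the closest points described above).

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # a large C approximates the hard margin
w, b = clf.coef_[0], clf.intercept_[0]

margin = 2.0 / np.linalg.norm(w)                # width between the two supporting hyperplanes
distances = np.abs(X @ w + b) / np.linalg.norm(w)   # point-to-hyperplane distances
print(margin, distances.min(), clf.support_vectors_)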
1.2 The linearly non-separable case

In practice, the samples are often not linearly separable, and slack variables are introduced to handle this situation. Slack variables allow individual points to fall on the wrong side of the hyperplane, while a penalty term balances the margin against the number of misclassifications. By introducing slack variables, the linearly non-separable problem can be handled within the same framework as the separable one (a soft margin). At the same time, a regularization term can be included in the objective function to guard against overfitting.
1.3 The objective function

In an SVM, the resulting optimization is a convex quadratic programming problem; we minimize the objective function to obtain the optimal solution.
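Written out in one common formulation (standard notation, not quoted from the original text), the soft-margin problem with slack variables ξ_i and penalty parameter C is:

\begin{aligned}
\min_{w,\;b,\;\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_{i}\\
\text{subject to}\quad & y_{i}\bigl(\langle w, x_{i}\rangle + b\bigr) \ge 1-\xi_{i},\qquad \xi_{i}\ge 0,\quad i=1,\dots,m.
\end{aligned}

The first term widens the margin and acts as the regularizer mentioned above, while C trades the margin off against the total constraint violation; minimizing this objective under the linear constraints is the convex quadratic program referred to in the text.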
2. Handling nonlinear problems

SVMs were originally designed for data sets that are linearly separable or nearly so. In real applications, however, many data sets are nonlinear. To address this, the SVM introduces kernel functions. A kernel function maps the data from the low-dimensional input space into a higher-dimensional space, where a separating hyperplane can realize a nonlinear classification of the original data. Through the kernel trick, the SVM can compute the inner products between mapped sample points in the high-dimensional space while working only in the low-dimensional input space.
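A brief demonstration of the kernel trick on synthetic data (an assumed example, not from the original text): concentric circles are not linearly separable in the input space, yet an RBF kernel separates them without ever constructing the high-dimensional map explicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.08, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X_tr, y_tr)   # K(x, y) = exp(-gamma * ||x - y||^2)

print("linear kernel accuracy:", linear.score(X_te, y_te))   # near chance level
print("RBF kernel accuracy:   ", rbf.score(X_te, y_te))      # close to 1.0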
UNIVERSITÄT DORTMUND, Computational Intelligence Series (Reihe Computational Intelligence), Collaborative Research Center (Sonderforschungsbereich) 531: Design and Management of Complex Technical Processes and Systems with Methods of Computational Intelligence.
No. CI-212/06, Internal Report, ISSN 1433-3325, June 2006.
Secretariat of SFB 531 · Universität Dortmund · Fachbereich Informatik/XI, 44221 Dortmund, Germany.
This work originated in the Collaborative Research Center 531, "Computational Intelligence", of the University of Dortmund and was printed at its instigation using funds made available to it by the Deutsche Forschungsgemeinschaft.

Evolutionary Support Vector Machines and their Application for Classification
Ruxandra Stoean 1, Mike Preuss 2, Catalin Stoean 1, and D. Dumitrescu 3
1 Department of Computer Science, Faculty of Mathematics and Computer Science, University of Craiova, Romania, {ruxandra.stoean, catalin.stoean}@inf.ucv.ro
2 Chair of Algorithm Engineering, Department of Computer Science, University of Dortmund, Germany, mike.preuss@uni-dortmund.de
3 Department of Computer Science, Faculty of Mathematics and Computer Science, "Babes-Bolyai" University of Cluj-Napoca, Romania, ddumitr@cs.ubbcluj.ro

Abstract. We propose a novel learning technique for classification as a result of the hybridization between support vector machines and evolutionary algorithms. Evolutionary support vector machines consider the classification task as in support vector machines but use evolutionary algorithms to solve the optimization problem of determining the decision function. They can acquire the coefficients of the separating hyperplane, which is often not possible within classical techniques. More important, ESVMs obtain the coefficients directly from the evolutionary algorithm and can refer to them at any point during a run. The concept is furthermore extended to handle large amounts of data, a problem frequently occurring e.g. in spam mail detection, one of our test cases. Evolutionary support vector machines are validated on this and three other real-world classification tasks; the obtained results show the promise of this new technique.

Keywords: support vector machines, coefficients of decision surface, evolutionary algorithms, evolutionary support vector machines, parameter tuning

1 Introduction

Support vector machines (SVMs) represent a state-of-the-art learning technique that has managed to reach very competitive results in different types of classification and regression tasks. Their engine, however, is quite complicated, as far as proper understanding of the calculus and correct implementation of the mechanisms are concerned.
This paper presents a novel approach, evolutionary support vector machines (ESVMs), which offers a simpler alternative to the standard technique inside SVMs, delivered by evolutionary algorithms (EAs). Note that this is not the first attempt to hybridize SVMs and EAs. Existing alternatives are discussed in §2.2. Nevertheless, we claim that our approach is significantly different from these.

ESVMs as presented here are constructed solely based on SVMs applied for classification. Validation is achieved by considering four real-world classification tasks. Besides comparing results, the potential of the utilized, simplistic EA through parametrization is investigated. To enable handling large data sets, the first approach is enhanced by use of a chunking technique, resulting in a more versatile algorithm. A second solution for dealing with a high number of samples is brought by a reconsideration of the elements of the first evolutionary algorithm. Obtained results prove suitability and competitiveness of the new approach, so ESVMs qualify as a viable simpler alternative to standard SVMs in this context. However, this is only a first attempt with the new approach. Many of its components still remain to be improved.

The paper is organized as follows: §2 outlines the concepts of classical SVMs together with existing evolutionary approaches aimed at boosting their performance. §3 presents the new approach of ESVMs. Their validation is achieved on real-world examples in §4. Improved ESVMs with two new mechanisms for reducing problem size in case of large data sets are presented in §5. The last section comprises conclusions and outlines ideas for further improvement.

2 Prerequisites

Given a set of functions {f_t : R^n → {1, 2, ..., k}, t ∈ T} and a training set {(x_i, y_i)}, i = 1, 2, ..., m, where every x_i ∈ R^n represents a data vector and each y_i corresponds to a class, a classification task consists in learning the optimal function f_t* that minimizes the discrepancy between the given classes of data vectors and the actual classes produced by the learning machine. Finally, accuracy of the machine is computed on previously unseen test data vectors. In the classical architecture, SVMs reduce k-class classification problems to many binary classification tasks that are separately considered and solved. A voting system then decides the class for data vectors in the test set. SVMs regard classification from a geometrical point of view, i.e. they assume the existence of a separating surface between two classes labelled as -1 and 1, respectively. The aim of SVMs then becomes the discovery of this decision hyperplane, i.e. its coefficients.

2.1 Classical Support Vector Machines

If training data is known to be linearly separable, then there exists a linear hyperplane that performs the partition, i.e. ⟨w, x⟩ − b = 0, where w ∈ R^n is the normal to the hyperplane and represents hyperplane orientation and b ∈ R denotes hyperplane location.
The separating hyperplane is thus determined by its coefficients, w and b. Consequently, the positive data vectors lie on the corresponding side of the hyperplane and their negative counterparts on the opposite side. As a stronger statement for linear separability, the positive and negative vectors each lie on the corresponding side of a matching supporting hyperplane for the respective class (Figure 1a) [1], written in brief as: y_i(⟨w, x_i⟩ − b) > 1, i = 1, 2, ..., m.

In order to achieve the classification goal, SVMs must determine the optimal values for the coefficients of the decision hyperplane that separates the training data with as few exceptions as possible. In addition, according to the principle of Structural Risk Minimization from Statistical Learning Theory [2], separation must be performed with a maximal margin between classes. Summing up, classification of linearly separable data with a linear hyperplane through SVMs thus leads to the following optimization problem:

find w* and b* so as to minimize ‖w*‖²/2,
subject to y_i(⟨w*, x_i⟩ − b*) ≥ 1, i = 1, 2, ..., m.   (1)

Training data are not linearly separable in general, and it is obvious that a linear separating hyperplane is not able to build a partition without any errors. However, a linear separation that minimizes training error can be tried as a solution to the classification problem. Training errors can be traced by observing the deviations of data vectors from the corresponding supporting hyperplane, i.e. from the ideal condition of data separability. Such a deviation corresponds to a value of ±ξ_i/‖w‖, ξ_i ≥ 0. These values may indicate different nuanced digressions (Figure 1b), but only a ξ_i higher than unity signals a classification error. Minimization of training error is achieved by adding the indicator of error for every training data vector into the separability statement and, at the same time, by minimizing the sum of indicators for errors. Adding up, classification of linearly non-separable data with a linear hyperplane through SVMs leads to the following optimization problem, where C is a hyperparameter representing the penalty for errors:

find w* and b* so as to minimize ‖w*‖²/2 + C Σ_{i=1}^{m} ξ_i, C > 0,
subject to y_i(⟨w*, x_i⟩ − b*) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., m.   (2)

If a linear hyperplane does not provide satisfactory results for the classification task, then a nonlinear decision surface can be appointed. The initial space of training data vectors can be nonlinearly mapped into a high enough dimensional feature space, where a linear decision hyperplane can subsequently be built. The separating hyperplane will achieve an accurate classification in the feature space, which will correspond to a nonlinear decision function in the initial space (Figure 1c). The procedure therefore leads to the creation of a linear separating hyperplane that would, as before, minimize training error, only this time it would perform in the feature space. Accordingly, a nonlinear map Φ : R^n → H is considered and data vectors from the initial space are mapped into H. w is also mapped through Φ into H. As a result, the squared norm that is involved in maximizing the margin of separation is now ‖Φ(w)‖². Also, the equation of the hyperplane consequently changes to ⟨Φ(w), Φ(x_i)⟩ − b = 0. Nevertheless, as simple as it is in theory, the appointment of an appropriate map Φ with the above properties is a difficult task. As vectors appear in the training algorithm only as part of dot products, the issue would be simplified if there were a kernel function K that would obey K(x, y) = ⟨Φ(x), Φ(y)⟩, where x, y ∈ R^n. In this way, one would use K
everywhere and would never need to explicitly even know what Φ is. The remaining problem is that kernel functions with this property are those that obey the corresponding conditions of Mercer's theorem from functional analysis, which are not easy to check. There are, however, a couple of classical kernels that have been demonstrated to meet Mercer's condition, i.e. the polynomial classifier of degree p, K(x, y) = ⟨x, y⟩^p, and the radial basis function classifier, K(x, y) = e^(−‖x−y‖²/σ), where p and σ are also hyperparameters of SVMs.

Fig. 1: (a) Decision hyperplane (continuous line) that separates circles (positive) from squares (negative), and supporting hyperplanes (dotted lines). (b) Position of data and corresponding indicators for errors: correct placement, ξ_i = 0 (label 1); margin position, ξ_i < 1 (label 2); classification error, ξ_i > 1 (label 3). (c) Initial data space (left), nonlinear map into the higher dimension / its linear separation (right), and corresponding nonlinear surface (bottom).

In conclusion, classification of linearly non-separable data with a nonlinear hyperplane through SVMs leads to the same optimization problem as in (2), which is now considered in the feature space and with the use of a kernel function:

find w* and b* so as to minimize K(w*, w*)/2 + C Σ_{i=1}^{m} ξ_i, C > 0,
subject to y_i(K(w*, x_i) − b*) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., m.   (3)

The optimization problem (corresponding to either situation above) is subsequently solved. Accuracy on the test set is then computed, i.e. the side of the decision boundary on which each new data vector lies is determined. Classical SVMs approach the optimization problem through a generalized form of the method of Lagrange multipliers [3]. But the mathematics of the technique can be found to be very difficult both to grasp and to apply. This is the reason why the present approach aims to simplify (and improve) SVMs through a hybridization with EAs, by utilizing these in order to determine optimal values for the coefficients of the separating hyperplane (w and b) directly.

2.2 Evolutionary Approaches to Support Vector Machines

EAs have been widely used in hybridization with SVMs in order to boost the performance of the classical architecture. Their combination has envisaged two different directions: model selection and feature selection. Model selection concerns adjustment of hyperparameters (free parameters) within SVMs, i.e. the penalty for errors, C, and the parameters of the kernel, p or σ, which, in standard variants, is performed through grid search or gradient descent methods. Evolution of hyperparameters can be achieved through evolution strategies [4]. When dealing with high-dimensional classification problems, feature selection regards the choice of the most relevant features as input for an SVM. The optimal subset of features can be evolved using genetic algorithms in [5] and genetic programming in [6]. To the best of our knowledge, evolution of the coefficients of the separating hyperplane within SVMs has not been accomplished yet.

3 Evolutionary Support Vector Machines for Classification

Within the new hybridized technique of ESVMs, separation of positive and negative vectors proceeds as in standard SVMs, while the optimization problem is solved by means of EAs. Therefore, the coefficients of the separating hyperplane, i.e. w and b, are encoded in the representation of the EA and their evolution is performed with respect to the objective function and the constraints in the optimization problem (3) within SVMs, which is considered for reasons of
generality.

3.1 Evolving the Coefficients of the Separating Hyperplane

Representation. An individual encodes the coefficients of the separating hyperplane, w and b. Since indicators for errors of classification, ξ_i, i = 1, 2, ..., m, appear in the conditions for hyperplane optimality, ESVMs handle them through inclusion in the structure of an individual as well:

c = (w_1, ..., w_n, b, ξ_1, ..., ξ_m).   (4)

After termination of the algorithm, the best individual from all generations gives approximately optimal values for the coefficients of the decision hyperplane. If proper values for parameters of the evolutionary algorithm are chosen, training errors of classification can also result from the optimal individual (those indicators whose values are higher than unity), but with some loss in accuracy; otherwise, indicators grow in the direction of errors driving the evolutionary cycle towards its goal, but do not reach the limit of unity when the evolutionary process stops. An example of an ESVM which also provides the errors of classification is nevertheless exhibited for simple artificial 2-dimensional data sets separated by various kernels (Figure 2). In such a situation, the number of samples and the dimensionality of data are both low, thus accuracy and runtime are not affected by the choice of parameters which leads to the discovery of all training errors.

Initial population. Individuals are randomly generated such that w_i ∈ [−1, 1], i = 1, 2, ..., n, b ∈ [−1, 1] and ξ_j ∈ [0, 1], j = 1, 2, ..., m.

Fitness evaluation. The fitness function derives from the objective function of the optimization problem and has to be minimized. Constraints are handled by penalizing the infeasible individuals by appointing t : R → R, which returns the value of the argument if negative, while otherwise 0. The expression of the function is thus as follows:

f(w, b, ξ) = K(w, w) + C Σ_{i=1}^{m} ξ_i + Σ_{i=1}^{m} [t(y_i(K(w, x_i) − b) − 1 + ξ_i)]².   (5)

Genetic operators. Operators were chosen experimentally. Tournament selection, intermediate crossover and mutation with normal perturbation are applied. Mutation of errors is constrained, preventing the ξ_i s from taking negative values.

Stop condition. The algorithm stops after a predefined number of generations. As the coefficients of the separating hyperplane are found, the class for a new, unseen test data vector can be determined, following class(x) = sgn(K(w, x) − b). This is unlike classical SVMs, where it is seldom the case that coefficients can be determined following the standard technique, as the map Φ cannot always be explicitly determined.
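For illustration, here is a compact sketch (not the authors' implementation) of how the fitness in (5) could be evaluated for one candidate c = (w, b, ξ); the data, the penalty C and the choice of kernel are placeholders, with K(·, ·) taken as the plain dot product for the linear case.

import numpy as np

def penalty(v):
    # t(v): returns the argument when the constraint is violated (v < 0), else 0
    return np.where(v < 0.0, v, 0.0)

def esvm_fitness(w, b, xi, X, y, C=1.0, kernel=np.dot):
    # fitness (5): K(w, w) + C * sum(xi) + squared penalties of violated constraints
    margins = np.array([y_i * (kernel(w, x_i) - b) for x_i, y_i in zip(X, y)])
    violations = penalty(margins - 1.0 + xi)
    return kernel(w, w) + C * np.sum(xi) + np.sum(violations ** 2)

# Example call on a random candidate for a 2-feature, 4-sample training set:
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2)); y = np.array([1, 1, -1, -1])
print(esvm_fitness(rng.uniform(-1, 1, 2), rng.uniform(-1, 1), rng.uniform(0, 1, 4), X, y))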
When Φ is not explicitly available, the class for a new vector in classical SVMs follows from computational artifices.

Fig. 2: Visualization of ESVMs on 2-dimensional points. Errors of classification are squared. Odd (a) and even (b) polynomial kernels and radial (c) kernel.

3.2 ESVMs for Multi-class Classification

Multi-class ESVMs employ one classical and very successful SVM technique, ONE-AGAINST-ONE (1-a-1) [7]. As the classification problem is k-class, k > 2, 1-a-1 considers k(k−1)/2 SVMs, where each machine is trained on data from every two classes, i and j, where i corresponds to 1 and j to -1. For every SVM, the class of x is computed, and if x is in class i, the vote for the i-th class is incremented by one; conversely, the vote for class j is incremented by one. Finally, x is taken to belong to the class with the largest vote. In case two classes have an identical number of votes, the one with the smaller index is selected. Consequently, 1-a-1 multi-class ESVMs are straightforward: k(k−1)/2 ESVMs are built for every two classes and voting is subsequently conducted.

4 Experimental Evaluation: Real-World Classification

Experiments have been conducted on four data sets (with no missing values) concerning real-world problems coming from the UCI Repository of Machine Learning Databases⁴, i.e. diabetes mellitus diagnosis, spam detection, iris recognition and soybean disease diagnosis. The motivation for the choice of test cases was manifold. Diabetes and spam are two-class problems, while soybean and iris are multi-class. Differentiating, on the one hand, diagnosis is a better-known benchmark, but filtering is an issue of current major concern; moreover, the latter has a lot more features and samples, which makes a huge difference for classification as well as for optimization. On the other hand, iris has a lot more samples while soybean has a lot more attributes. For all the reasons mentioned above, the selection of test problems certainly contains all the variety of situations that is necessary for the objective validation of the new approach of ESVMs.

Brief information on the classification tasks and the SVM and EA parameter values is given in Table 1. The error penalty was invariably set to 1. For each data set, 30 runs of the ESVM were conducted; in every run, 75% randomly chosen cases were appointed to the training set and the remaining 25% went into the test set. Experiments showed the necessity for data normalization in diabetes, spam and iris.

⁴ Available at /mlearn/MLRepository.html

Table 1: Data set properties and ESVM algorithm parameter values; for each data set, manually chosen (man.) and tuned (tun.) values are listed as man./tun. The rightmost columns hold the tuned parameter sets for the modified ESVMs (modif.) and the ESVMs with chunking (chunk.), and the utilized parameter bounds.
Data set description (Diabetes; Iris; Soybean; Spam):
  Number of samples: 768; 150; 47; 4601 (434) (10/500)
  Number of attributes: 8; 4; 35; 57
  Number of classes: 2; 3; 4; 2
ESVM parameter values (Diabetes; Iris; Soybean; Spam; modif.; chunk.; bounds):
  Kernel parameter (p or σ): p=2; σ=1; p=1; p=1; p=1
  Population size: 100/198; 100/46; 100/162; 100/154; 5; 190; 10/200
  Number of generations: 250/296; 100/220; 100/293; 250/287; 180; 286; 50/300
  Crossover probability: 0.40/0.87; 0.30/0.77; 0.30/0.04; 0.30/0.84; 0.95; 0.11; 0.01/1
  Mutation probability: 0.40/0.21; 0.50/0.57; 0.50/0.39; 0.50/0.20; 0.03; 0.08; 0.01/1
  Error mutation probability: 0.50/0.20; 0.50/0.02; 0.50/0.09; 0.50/0.07; n/a; 0.80; 0.01/1
  Mutation strength: 0.10/4.11; 0.10/4.04; 0.10/0.16; 0.10/3.32; 3.75; 0.98; 0.001/5
  Error mutation strength: 0.10/0.02; 0.10/3.11; 0.10/3.80; 0.10/0.01; n/a; 0.01; 0.001/5

No further modification of the data was carried out and all data was used in the experiments. The obtained test accuracies are presented in
Table 2. Differentiated accuracies (spam/non-spam for spam filtering and ill/healthy for diabetes) are also depicted. In order to validate the manually found EA parameter values, the parameter tuning method SPO [8] was applied with a budget of 1000 optimization runs. The last 4 columns of Table 2 hold the performances and standard deviations of the best configuration of an initial Latin hypercube sample (LHS) of size 10 × #parameters, and of the finally found best parameter setting. The resulting parameter values are depicted in Table 1. They indicate that for all cases, except for the soybean data and the chunking-enhanced ESVM on the spam data, crossover probabilities were dramatically increased, while mutation probabilities were often reduced, especially for errors. It must be stated that in most cases, results achieved with manually determined parameter values are only improved by increasing effort (population size or number of generations).

A comparison to the worst and best found results of different techniques, either SVMs or others, was conducted. The assessment cannot be objective, however, as the outlined methods either use different sizes for the training/test sets or other types of cross-validation and numbers of runs, or employ various preprocessing procedures for feature or sample selection. Diabetes diagnosis was approached in SVMlight [9], where an accuracy of 76.95% was obtained, and with Critical SVMs [10], with a result of 82.29%. Accuracy on spam detection is reported in [11], where k-Nearest Neighbour on non-preprocessed data resulted in 66.5% accuracy, and in [12], where a functional link network wrapped into a genetic algorithm for input and output feature selection gave a result of 92.44%. 1-a-1 multi-class SVMs on the Iris data set were performed in [7] (accompanied by a shrinking technique) and [13]; the obtained results were 97.33% and 98.67%, respectively. Results for the Soybean small data set were provided in [14], where, among others, Naive Bayes was applied and provided an accuracy of 95.50%, reaching 100% when a pair-wise classification strategy was employed.

Table 2: Accuracies of different ESVM versions on the considered test sets, in percent (columns: Average, Worst, Best, StD, LHS best, StD, SPO, StD).
ESVMs
  Diabetes (overall):         76.30  71.35  80.73   2.24   75.82  3.27   77.31  2.45
  Diabetes (ill):             50.81  39.19  60.27   4.53   49.35  7.47   52.64  5.32
  Diabetes (healthy):         90.54  84.80  96.00   2.71   89.60  2.36   90.21  2.64
  Iris (overall):             95.18  91.11  100.0   2.48   95.11  2.95   95.11  2.95
  Soybean (overall):          99.02  94.11  100.0   2.23   99.61  1.47   99.80  1.06
  Spam filtering (overall):   87.74  85.74  89.83   1.06   89.27  1.37   90.59  0.98
  Spam filtering (spam):      77.48  70.31  82.50   2.77   80.63  3.51   83.76  2.21
  Spam filtering (non spam):  94.41  92.62  96.30   0.89   94.82  0.94   95.06  0.62
ESVMs with Chunking
  Spam filtering (overall):   87.30  83.13  90.00   1.77   87.52  1.31   88.37  1.15
  Spam filtering (spam):      83.47  75.54  86.81   2.78   86.26  2.66   86.35  2.70
  Spam filtering (non spam):  89.78  84.22  92.52   2.11   88.33  2.48   89.68  2.06
Modified ESVMs
  Spam filtering (overall):   88.40  86.52  90.35   1.02   90.06  0.99   91.25  0.83
  Spam filtering (spam):      79.63  75.39  84.70   2.16   82.73  2.28   85.52  2.08
  Spam filtering (non spam):  94.17  91.34  95.84   1.05   94.89  0.92   95.00  0.72

5 Improving Training Time

The results obtained for the tasks we have undertaken to solve have proven to be competitive as compared to the accuracies of different powerful classification techniques. However, for large problems, i.e. spam filtering, the amount of runtime needed for training is ≈ 800 s.
This stems from the large genomes employed, as indicators for the errors of every sample in the training set are included in the representation. Consequently, we tackle this problem with two approaches: an adaptation of a chunking procedure inside ESVMs and a modified version of the evolutionary approach.

5.1 Reducing Samples for Large Problems

We propose a novel algorithm to reduce the number of samples for one run of ESVMs, which is an adaptation of the widely known shrinking technique within SVMs, called chunking [15]. In standard SVMs, this technique implies the identification of the Lagrange multipliers that denote samples which do not fulfill the restrictions. As we renounced the standard solving of the optimization problem within SVMs for the EA-based approach, the chunking algorithm was adapted to fit our technique (Algorithm 1).

Algorithm 1: ESVM with Chunking
  Randomly choose N samples from the training data set, equally distributed, to make a chunk;
  while a predefined number of iterations passes with no improvement do
    if first chunk then
      Randomly initialize the population of a new EA;
    else
      Use the best evolved hyperplane coefficients and random indicators for errors to fill half of the population of a new EA and randomly initialize the other half;
    end if
    Apply the EA and find the coefficients of the hyperplane;
    Compute the side of all samples in the training set with the evolved hyperplane coefficients;
    From the incorrectly placed, randomly choose (if available) N/2 samples, equally distributed;
    Randomly choose the rest up to N from the current chunk and add all to the new chunk;
    if the obtained training accuracy is higher than the best one obtained so far then
      Update the best accuracy and the best evolved hyperplane coefficients; set improvement to true;
    end if
  end while
  Apply the best obtained coefficients on the test set and compute accuracy.

ESVM with chunking was applied to the spam data set. Values for the parameters were set as before, except the number of generations for each EA, which is now set to 100. The chunk size, i.e. N, was chosen as 200 and the number of iterations with no improvement was designated to be 5. Results are shown in Table 2. The average runtime was 103.2 s/run. Therefore, the novel algorithm of ESVM with chunking reached its goal, running 8 times faster than the previous one, at the cost of a loss in accuracy of 0.4%.

5.2 A Reconsideration of the Evolutionary Algorithm

Since ESVMs directly provide the hyperplane coefficients at all times, we propose to drop the indicators for errors from the EA representation and, instead, compute their values in a simple geometrical fashion. Consequently, this time the individual representation contains only w and b. Next, all indicators ξ_i, i = 1, 2, ..., m, are computed in order to be referred to in the fitness function (5). The current individual (which is the current separating hyperplane) is considered and, following [1], the supporting hyperplanes are determined. Then, for every training vector, the deviation to the corresponding supporting hyperplane, following its class, is calculated. If the sign of the deviation equals the class, the corresponding ξ_i = 0; else, the deviation is returned. The EA proceeds with the same values for the parameters as in Table 1 (certainly except the probabilities and step sizes for the ξ_i s) and, at the end of the run, the hyperplane coefficients are again directly acquired. Empirical results on the spam data set (Table 2) and the average runtime of 600 s seem to support the new approach. It is interesting to remark that the modified algorithm is not that much faster, but provides some improvement in accuracy. In contrast to the chunking approach, it also
seems especially better suited for achieving high non-spam recognition rates, in order to prevent the erroneous deletion of good mail. Surprisingly, SPO here decreases effort while increasing accuracies (Table 1), resulting in further speedup.

6 Conclusions and Future Work

The proposed new hybridized learning technique incorporates the vision upon classification of SVMs but solves the inherent optimization problem by means of evolutionary algorithms. ESVMs present many advantages as compared to SVMs. First of all, they are definitely much easier to understand and use. Secondly, and more important, the evolutionary solving of the optimization problem enables the acquirement of the hyperplane coefficients directly and at all times within a run. Thirdly, accuracy on several benchmark real-world problems is comparable to that of state-of-the-art SVM methods or to the results of other powerful techniques from different other machine learning fields. In order to enhance the suitability of the new technique for any classification issue, two novel mechanisms for reducing size in large problems are also proposed; the obtained results support their employment.

Although already competitive, the novel ESVMs for classification can still be improved. Other kernels may be found and used for better separation; also, the two appointed classical kernels may have their parameters evolved by means of evolutionary algorithms, as in some approaches for model selection. Also, other preprocessing techniques can be considered; feature selection through an evolutionary algorithm, as found in the literature, can surely boost accuracy and runtime. Definitely, other ways to handle large data sets can be imagined; a genetic algorithm can also be used for sample selection. Finally, the construction of ESVMs for regression problems is a task for future work.

References
1. Bosch, R.A., Smith, J.A.: Separating Hyperplanes and the Authorship of the Disputed Federalist Papers. American Mathematical Monthly, Vol. 105, No. 7 (1998) 601-608
2. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
3. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey (1999)
4. Friedrichs, F., Igel, C.: Evolutionary Tuning of Multiple SVM Parameters. In Proc. 12th European Symposium on Artificial Neural Networks (2004) 519-524
5. Feres de Souza, B., Ponce de Leon F. de Carvalho, A.: Gene Selection Based on Multi-class Support Vector Machines and Genetic Algorithms. Journal of Genetics and Molecular Research, Vol. 4, No. 3 (2005) 599-607
6. Eads, D., Hill, D., Davis, S., Perkins, S., Ma, J., Porter, R., Theiler, J.: Genetic Algorithms and Support Vector Machines for Time Series Classification. 5th Conf. on the Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation, Proc. Symposium on Optical Science and Technology, SPIE, Seattle, WA, 4787 (2002) 74-85
7. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, Vol. 13, No. 2 (2002) 415-425
8. Bartz-Beielstein, T.: Experimental Research in Evolutionary Computation - The New Experimentalism. Natural Computing Series, Springer-Verlag (2006)
9. Joachims, T.: Making Large-Scale Support Vector Machine Learning Practical. In Advances in Kernel Methods: Support Vector Learning (1999) 169-184
10. Raicharoen, T., Lursinsap, C.: Critical Support Vector Machine Without Kernel Function.
In Proc. 9th Intl. Conf. on Neural Information Processing, Singapore, Vol. 5 (2002) 2532-2536
11. Tax, D.M.J.: DDtools, the Data Description Toolbox for Matlab (2005) http://www-ict.ewi.tudelft.nl/davidt/occ/index.html
12. Sierra, A., Corbacho, F.: Input and Output Feature Selection. ICANN 2002, LNCS 2415 (2002) 625-630
13. Weston, J., Watkins, C.: Multi-class Support Vector Machines. Technical Report CSD-TR-98-04, Royal Holloway, University of London, Department of Computer Science (1998)
14. Bailey, J., Manoukian, T., Ramamohanarao, K.: Classification Using Constrained Emerging Patterns. In Proc. of WAIM 2003 (2003) 226-237
15. Perez-Cruz, F., Figueiras-Vidal, A.R., Artes-Rodriguez, A.: Double Chunking for Solving SVMs for Very Large Datasets. Learning'04, Elche, Spain (2004) /archive/00001184/01/learn04.pdf
Vol. 19, No. 1, Journal of Terahertz Science and Electronic Information Technology, Feb. 2021
Article ID: 2095-4980(2021)01-0107-05

RF Fingerprint Extraction Method Based on the Bispectrum
JIA Jicheng, QI Lin
(College of Information and Communication Engineering, Harbin Engineering University, Harbin, Heilongjiang 150001, China)

Abstract: The theory of classifying and recognizing devices of the same type from the radio frequency fingerprint (RFF) of a communication emitter is studied. The contour-integral bispectrum values of the communication signal are extracted as the feature vector for individual device identification, and a support vector machine (SVM) classifier performs the recognition. A radiation-source identification system is built and tested in simulation with measured signals. The results show that the method recognizes devices stably and that the system reaches nearly 90% classification accuracy at a signal-to-noise ratio (SNR) of −22 dB, which validates the effectiveness of the proposed bispectrum-based RFF extraction method.
Keywords: physical layer security; radio frequency fingerprint (RFF); contour-integral bispectrum; individual identification
CLC number: TN918    Document code: A    doi: 10.11805/TKYDA2019291

The rapid development of the Internet of Things and 5G is connecting more and more wireless devices, and with them comes a series of regulatory and security challenges.
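As a rough, hypothetical sketch of the kind of pipeline the abstract describes, the code below estimates a bispectrum from a single FFT segment, collapses it along square contours around the origin as a simplified stand-in for the contour-integral bispectrum feature, and feeds the resulting vectors to an SVM classifier; the signal variables, parameters and the estimator itself are illustrative only.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def bispectrum(x, nfft=128):
    # direct single-segment estimate B(f1, f2) = X(f1) * X(f2) * conj(X(f1 + f2))
    X = np.fft.fft(x, nfft)
    idx = np.arange(nfft // 2)
    B = X[idx, None] * X[None, idx] * np.conj(X[(idx[:, None] + idx[None, :]) % nfft])
    return np.abs(B)

def contour_integral_features(B):
    # sum |B| over square contours of growing "radius" around the origin
    n = B.shape[0]
    radius = np.maximum(np.arange(n)[:, None], np.arange(n)[None, :])
    return np.array([B[radius == r].sum() for r in range(1, n)])

# signals: iterable of 1-D sampled transmissions; labels: index of the emitting device
# feats = np.array([contour_integral_features(bispectrum(s)) for s in signals])
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
# clf.fit(feats, labels)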