We propose a novel regularisation method for Gaussian mixture networks, which adopts a Bayesian approach and draws on the evidence scheme to optimise the hyperparameters. This leads to a new, modified form of the EM algorithm, which is compared with the original scheme on three classification problems.
`the margin of this paper is too narrow to contain the proof', we will only state the result here and refer the interested reader to [3]. Define

$$f_{kn} := P(k \mid \mathbf{x}_n)\,\bigl[1 - P(k \mid \mathbf{x}_n)\bigr]$$   (11)

and the two submatrices

$$A_{kk'ii'} := \delta_{kk'}\,\delta_{ii'}\, N \hat{p}_k\, \beta_{ki}$$   (12)

and $B_{kk'ii'}$ (13).
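Since the explicit form of $B_{kk'ii'}$ is only referenced here (the full derivation is given in [3]), a purely numerical cross-check of the Hessian (9) may be helpful. The sketch below uses central finite differences on a user-supplied $E(\mathbf{w}) = -\ln P(D \mid \theta)$, with the kernel centres flattened into a single vector; the function and variable names are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def numerical_hessian(E, w, eps=1e-5):
    """Central finite-difference estimate of H_ab = d^2 E / (dw_a dw_b),
    eq. (9), for a scalar function E of the flattened centre weights w.
    Intended only as a numerical cross-check of the analytic result."""
    W = w.size
    H = np.zeros((W, W))
    for a in range(W):
        for b in range(W):
            w_pp = w.copy(); w_pp[a] += eps; w_pp[b] += eps
            w_pm = w.copy(); w_pm[a] += eps; w_pm[b] -= eps
            w_mp = w.copy(); w_mp[a] -= eps; w_mp[b] += eps
            w_mm = w.copy(); w_mm[a] -= eps; w_mm[b] -= eps
            H[a, b] = (E(w_pp) - E(w_pm) - E(w_mp) + E(w_mm)) / (4 * eps ** 2)
    return 0.5 * (H + H.T)  # symmetrise against round-off error
```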
2 Regularisation
Let us introduce a simple Gaussian prior on the weights $w_{ki}$ (7), centred on the mean of the data, where the $\{\alpha_k\}$ are the inverse variances or weight-decay hyperparameters. Note that this scheme implies a division of the weights $w_{ki}$ into several weight groups, with weights feeding into the same kernel sharing a common hyperparameter $\alpha_k$. In order to find the optimal values for $\alpha_k$, we follow the evidence scheme of MacKay [4]. Rather than maximising the joint likelihood $P(D \mid \{w_{ki}\}, \{\beta_{ki}\}, \{p_k\}, \{\alpha_k\})$, which usually is strongly skewed and therefore likely to lead to over-fitting, we first marginalise over the weights $w_{ki}$ and then find the maximum likelihood estimate of

$$P(D \mid \{\beta_{ki}\}, \{p_k\}, \{\alpha_k\}) = \int P(D \mid \{w_{ki}\}, \{\beta_{ki}\}, \{p_k\})\, P(\{w_{ki}\} \mid \{\alpha_k\})\, d\{w_{ki}\}$$   (8)

The integral in (8) is solved by Gaussian approximation, which requires finding the Hessian

$$H_{kk'ii'} = \frac{\partial^2 E}{\partial w_{ki}\, \partial w_{k'i'}}$$   (9)

where
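As a rough illustration of this step (and not the exact update derived in this paper), the sketch below applies the standard hyperparameter re-estimation rule of the evidence framework [4] to grouped weights, assuming the Hessian of the data term $E$ with respect to the weights is available; all function and variable names are illustrative assumptions.

```python
import numpy as np

def reestimate_alphas(H_data, w, mu, alpha, groups):
    """One evidence-style re-estimation of the weight-decay hyperparameters
    alpha_k for grouped weights (standard MacKay update; illustrative only).

    H_data: (W, W) Hessian of E = -ln P(D|theta) w.r.t. the flattened weights
    w:      (W,)   current most-probable weights
    mu:     (W,)   prior centres (here: the mean of the data, repeated)
    alpha:  (K,)   current hyperparameters, one per weight group
    groups: (W,)   integer group index k of each weight
    """
    # Posterior precision: data Hessian plus the diagonal prior precision.
    A = H_data + np.diag(alpha[groups])
    Sigma = np.linalg.inv(A)                      # Gaussian-approximation covariance
    new_alpha = np.empty_like(alpha, dtype=float)
    for k in range(alpha.size):
        idx = np.where(groups == k)[0]
        # gamma_k: effective number of well-determined weights in group k.
        gamma_k = np.sum(1.0 - alpha[k] * np.diag(Sigma)[idx])
        # alpha_k <- gamma_k / sum_{i in group k} (w_i - mu_i)^2
        new_alpha[k] = gamma_k / np.sum((w[idx] - mu[idx]) ** 2)
    return new_alpha
```

The update uses only the diagonal of the Gaussian-approximation covariance within each group, so for larger models the full matrix inverse could be replaced by a cheaper diagonal estimate.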
Regularisation of RBF-Networks with the Bayesian Evidence Scheme
Dirk Husmeier, Stephen J. Roberts
Neural Systems Research Group, Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
Exhibition Road, London SW7 2BT, United Kingdom
Email: d.husmeier@
1 Introduction
Consider the problem of inferring the probability density $P(\mathbf{x})$ of some $m$-dimensional input vector $\mathbf{x}$ from a training set $D = \{\mathbf{x}_n\}_{n=1}^{N}$. A common approach is to approximate the unknown true distribution by a mixture model of the form

$$P(\mathbf{x}) = \sum_{k=1}^{K} P(k)\, P(\mathbf{x} \mid k)$$   (1)

where the kernels $P(\mathbf{x} \mid k)$ represent unimodal densities of a simple parametric form, and the mixing coefficients $p_k := P(k)$ are positive and normalised (allowing $p_k$ to be interpreted as the prior probability for the generation of a data point by the $k$th component). A common choice for the kernels $P(\mathbf{x} \mid k)$ is a multivariate Gaussian with mean $\mathbf{w}_k$ and diagonal covariance matrix $[\mathrm{Cov}_k]_{ij} := \beta_{ki}^{-1}\, \delta_{ij}$:

$$P(\mathbf{x} \mid k) = \sqrt{\prod_{i=1}^{m} \frac{\beta_{ki}}{2\pi}}\; \exp\!\left(-\sum_{i=1}^{m} \frac{\beta_{ki}}{2}\, [x_i - w_{ki}]^2\right)$$   (2)

(To improve the clarity of the notation, we use indices in the following systematic way: (i) $n \in \{1, \dots, N\}$ labels training exemplars, e.g. $\mathbf{x}_n$; (ii) $i, j \in \{1, \dots, m\}$ label coordinates, e.g. $x_{ni}$; (iii) $k \in \{1, \dots, K\}$ labels mixture components, e.g. $\mathbf{w}_k$.)
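To make (1) and (2) concrete, the following minimal NumPy sketch evaluates the diagonal-Gaussian kernels and the resulting mixture density; array names and shapes are illustrative choices rather than notation from the paper.

```python
import numpy as np

def log_kernel(X, w_k, beta_k):
    """Log of the diagonal-Gaussian kernel P(x|k) of eq. (2).
    X: (N, m) data, w_k: (m,) centre, beta_k: (m,) inverse variances."""
    return 0.5 * np.sum(np.log(beta_k / (2 * np.pi))) \
        - 0.5 * np.sum(beta_k * (X - w_k) ** 2, axis=1)

def mixture_density(X, p, w, beta):
    """Mixture density P(x) of eq. (1) for all N exemplars.
    p: (K,) mixing coefficients, w: (K, m) centres, beta: (K, m)."""
    K = p.size
    log_px_k = np.stack([log_kernel(X, w[k], beta[k]) for k in range(K)], axis=1)
    return np.sum(p * np.exp(log_px_k), axis=1)
```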
The structure of (2) is equivalent to the Gaussian mixture model in [5], which is an RBF network with positive normalised output weights $p_k$, kernel width parameters $\beta_{ki} > 0$, and input-to-hidden layer weights $w_{ki}$. A straightforward method for optimising the network parameters $\theta := \{p_k, \beta_{ki}, w_{ki}\}$ is to maximise the likelihood $P(D \mid \theta)$. This can easily be effected with the Expectation Maximisation (EM) algorithm [1], which leads to the intuitively plausible update scheme (e.g. [6])

$$\hat{p}_k = \frac{1}{N} \sum_{n=1}^{N} P(k \mid \mathbf{x}_n)$$   (3)

$$\hat{w}_{ki} = \frac{\sum_n P(k \mid \mathbf{x}_n)\, x_{ni}}{\sum_n P(k \mid \mathbf{x}_n)}$$   (4)

$$\frac{1}{\hat{\beta}_{ki}} = \frac{\sum_n P(k \mid \mathbf{x}_n)\, [\hat{w}_{ki} - x_{ni}]^2}{\sum_n P(k \mid \mathbf{x}_n)}$$   (5)

Equations (3)-(5) constitute an iterative algorithm, where in each iteration the new parameter values (indicated by the hat over the symbol) are obtained from the old ones after computing the posterior probabilities $P(k \mid \mathbf{x}_n)$ by Bayes' rule:

$$P(k \mid \mathbf{x}_n) = \frac{P(\mathbf{x}_n \mid k)\, p_k}{P(\mathbf{x}_n)}$$   (6)

The disadvantage of this approach, however, is that training neural networks with the maximum likelihood method can lead to over-fitting. This problem is even more severe in density estimation. Obviously, the likelihood is maximised in a trivial way by concentrating all the probability mass on
one or several exemplars of the training set, that is, if a kernel centre coincides with one of the data points and the corresponding kernel variance approaches zero. To prevent this problem, Ormoneit and Tresp [7] applied a Bayesian maximum a posteriori approach with a conjugate prior on the parameters. This effectively introduces a lower bound on the variances $1/\beta_{ki}$ and thus avoids the above singularity problem. The disadvantage, however, is that the optimal values for the so-called hyperparameters introduced by this scheme are usually not known in advance and have to be guessed (or optimised by data-expensive cross-validation). The approach we propose here is to follow a similar Bayesian scheme, but to infer the optimal values of the hyperparameters from the training data itself by maximising the respective evidence.
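For reference, one full iteration of the unregularised update scheme (3)-(5), with the responsibilities obtained from Bayes' rule (6), can be sketched as follows. The array names and shapes are illustrative, and the small floor on the variances is only a guard in this toy code against the singularity discussed above; it is not part of the scheme itself.

```python
import numpy as np

def em_step(X, p, w, beta):
    """One unregularised EM iteration, eqs. (3)-(6).
    X: (N, m) data; p: (K,) mixing coefficients; w: (K, m) centres;
    beta: (K, m) inverse variances."""
    N = X.shape[0]
    # E-step: responsibilities P(k|x_n) by Bayes' rule (6), with the
    # diagonal-Gaussian kernels of eq. (2) evaluated in log space.
    log_px = 0.5 * np.sum(np.log(beta / (2 * np.pi)), axis=1)[None, :] \
        - 0.5 * np.sum(beta[None, :, :] * (X[:, None, :] - w[None, :, :]) ** 2, axis=2)
    log_joint = np.log(p)[None, :] + log_px
    resp = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)               # (N, K)
    # M-step
    Nk = resp.sum(axis=0)                                 # sum_n P(k|x_n)
    p_new = Nk / N                                        # eq. (3)
    w_new = (resp.T @ X) / Nk[:, None]                    # eq. (4)
    var_new = np.einsum('nk,nkm->km', resp,
                        (X[:, None, :] - w_new[None, :, :]) ** 2) / Nk[:, None]  # eq. (5)
    beta_new = 1.0 / np.maximum(var_new, 1e-12)           # floor only to keep the toy code finite
    return p_new, w_new, beta_new
```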
$$E = -\ln P(D \mid \theta)$$   (10)

While this concept is basically identical to MacKay's work in [4], we note that the novelty of our approach consists in the computation of the Hessian (9), which is more complex than for the generalised linear regression models studied in [4]. Since, in fact,