EM-algorithm for Training of State-space Models with Application to Time Series Prediction
Abstract: 1. Introduction 2. Basic principles of the EM algorithm 3. Applications of the EM algorithm to classification 4. Conclusion

Main text:

1. Introduction

The EM algorithm, short for the Expectation-Maximization algorithm, is a widely used optimization algorithm for probabilistic models. It is applied broadly in statistics, machine learning, and related fields, and performs particularly well on classification problems. This article focuses on the applications of the EM algorithm to classification and on its basic principles.

2. Basic principles of the EM algorithm

The EM algorithm is an iterative optimization algorithm that alternates between two steps: the E step (Expectation) and the M step (Maximization). In the E step, the expected values of the latent variables are computed from the observed data under the current parameters; in the M step, the model parameters are updated to maximize the likelihood given those expectations. The two steps alternate until convergence.

The basic principle can be summarized as follows: for a probabilistic model containing latent variables, the parameters are optimized iteratively so as to maximize the likelihood of the observed data. In this process, the EM algorithm relies on Jensen's inequality, which guarantees that the likelihood never decreases and hence that the algorithm converges.
3. Applications of the EM algorithm to classification

The EM algorithm is used very widely in classification; typical examples include the Gaussian mixture model (GMM) and the hidden Markov model (HMM).

(1) Gaussian mixture model (GMM). In traditional classification problems we usually fit the model by maximum likelihood estimation (MLE). However, when the data distribution is complex, direct MLE may not yield a good solution. In that case we can apply the EM algorithm, optimizing the model parameters iteratively to improve classification accuracy. In a GMM, the EM algorithm handles multimodal data distributions effectively and thereby improves classification performance.
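As a concrete illustration, the following sketch fits a two-component GMM with scikit-learn, whose GaussianMixture estimator is trained with exactly this EM procedure; the synthetic bimodal data set is an assumption made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic bimodal data: two Gaussian clusters (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(200, 1)),
    rng.normal(loc=3.0, scale=1.0, size=(300, 1)),
])

# GaussianMixture is fitted by EM: the E step computes responsibilities,
# the M step re-estimates the weights, means and covariances.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("weights:", gmm.weights_)       # mixing proportions
print("means:", gmm.means_.ravel())   # component means
labels = gmm.predict(X)               # hard assignment of each point
```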
(2) Hidden Markov model (HMM). The HMM is a probabilistic model for sequential data, widely used in speech recognition, time series analysis, and related fields. In an HMM, the EM algorithm is used to estimate the state-transition and emission probabilities (the most likely state path itself is found with the Viterbi algorithm). With EM, the uncertainty between the observation sequence and the hidden states can be handled effectively, improving classification performance.

4. Conclusion

As a powerful optimization algorithm for probabilistic models, the EM algorithm performs well on classification problems. By introducing latent variables and optimizing iteratively, it handles complex, uncertain data effectively and improves classification accuracy.
A comparative analysis of training schemes based on the EM algorithm

The EM algorithm (Expectation-Maximization Algorithm) is a commonly used iterative algorithm for maximizing a likelihood function. It is widely applied to parameter-estimation problems in machine learning, data mining, and related fields. In many applications involving incomplete data, the results obtained with the EM algorithm are better than those of other common estimation approaches applied to the observed data alone, such as direct maximum likelihood estimation or Bayesian estimation.

The core idea of the EM algorithm is to update the parameters iteratively so that the likelihood of the observed data keeps increasing. The algorithm consists of two steps, the E step and the M step. In the E step, the current parameters are used to compute, for each observation, the probability that it belongs to each latent component. In the M step, these probabilities are used as weights to update the parameters and obtain new estimates. The algorithm repeats the two steps until it reaches the final parameter estimate.

When applying the EM algorithm we often need to choose among several training schemes. These schemes rest on different assumptions and theoretical foundations and give us more options for model selection and optimization. In what follows, we introduce several EM-based training schemes and compare their strengths and weaknesses.
First scheme: batch EM. Batch EM is the traditional EM training scheme and the most common one in practice. In batch EM, every iteration uses all of the training data. This scheme works well when the data set is small or computational resources are plentiful, but with a large data set and limited resources it converges slowly and is prone to overfitting.

Second scheme: stochastic EM. Stochastic EM is a modified version of EM training. In stochastic EM, each iteration uses only a small randomly drawn subset of the data, which speeds up convergence. This scheme works well for large data sets and tight computational budgets, but because a different random subset is drawn each time, the direction of convergence can be unstable and the algorithm can get stuck in local optima.

Third scheme: minibatch EM. Minibatch EM is a further refinement of stochastic EM; a rough sketch of one minibatch-style update is given below.
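To make the contrast with batch EM concrete, the sketch below performs one minibatch-style EM update for a one-dimensional two-component Gaussian mixture: sufficient statistics computed on the minibatch are blended into the old parameters with a step size. The step size `rho`, the fixed variances, and all names are assumptions for the example, not part of the source.

```python
import numpy as np

def e_step(x, w, mu, sigma):
    """Responsibilities of each of the 2 components for every point in x."""
    pdf = lambda m, s: np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    num = np.stack([w[0] * pdf(mu[0], sigma[0]), w[1] * pdf(mu[1], sigma[1])])
    return num / num.sum(axis=0)

def minibatch_em_step(batch, w, mu, sigma, rho=0.1):
    """One minibatch update: blend the new statistics into the old parameters
    with step size rho instead of recomputing them from the full data set.
    The variances are kept fixed here for brevity."""
    r = e_step(batch, w, mu, sigma)          # E step on the minibatch only
    for k in range(2):                       # smoothed M step
        nk = r[k].sum()
        w[k] = (1 - rho) * w[k] + rho * nk / len(batch)
        mu[k] = (1 - rho) * mu[k] + rho * (r[k] @ batch) / nk
    return w / w.sum(), mu, sigma
```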
Solving models with latent variables: the EM algorithm

The expectation-maximization (EM) algorithm is an iterative algorithm, summarized and proposed by Dempster in 1977, for maximum likelihood or maximum a posteriori estimation of the parameters of probabilistic models that contain hidden variables. The EM algorithm has two steps. E step: compute an expectation; M step: maximize it.
1 Introducing the EM algorithm

A probabilistic model sometimes contains both observable variables and hidden or latent variables. If there are only observable variables, the parameters can be estimated from the given data by maximum likelihood or Bayesian estimation; but when the model contains hidden variables, a maximum likelihood method that accounts for them is needed: the EM algorithm.

1.1 The EM algorithm

Consider a three-coin model with coins A, B, and C, whose probabilities of landing heads are π, p, and q, respectively. Coin B is tossed only when the toss of coin A comes up heads; otherwise coin C is tossed. After n independent trials only the final outcomes are observed: we see neither how A landed nor whether the result came from B or C.

With these parameters, the probability of an observation y ∈ {0, 1} is P(y | θ) = π p^y (1 − p)^(1−y) + (1 − π) q^y (1 − q)^(1−y), where θ = (π, p, q), and the maximum likelihood estimate maximizes the product of these probabilities over all observations. This problem has no closed-form solution and can only be solved iteratively with the EM method.

E step: the probability μ_j that observation y_j came from coin B is the conditional probability μ_j = π p^(y_j) (1 − p)^(1−y_j) / [π p^(y_j) (1 − p)^(1−y_j) + (1 − π) q^(y_j) (1 − q)^(1−y_j)], where y_j takes the value 1 or 0. M step: the parameter estimates are updated as π ← (1/n) Σ_j μ_j, p ← Σ_j μ_j y_j / Σ_j μ_j, q ← Σ_j (1 − μ_j) y_j / Σ_j (1 − μ_j). The EM algorithm is strongly affected by the choice of initial values.
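A minimal sketch of this three-coin EM iteration in code, assuming the update formulas above; the observation sequence and initial values are illustrative assumptions.

```python
import numpy as np

def three_coin_em(y, pi, p, q, n_iter=100, tol=1e-8):
    """EM for the three-coin model: coin A (heads prob pi) selects coin B
    (heads prob p) or coin C (heads prob q); only the outcomes y are seen."""
    y = np.asarray(y, dtype=float)
    for _ in range(n_iter):
        # E step: posterior probability that each toss came from coin B.
        b = pi * p**y * (1 - p)**(1 - y)
        c = (1 - pi) * q**y * (1 - q)**(1 - y)
        mu = b / (b + c)
        # M step: closed-form updates of the three parameters.
        pi_new, p_new = mu.mean(), (mu @ y) / mu.sum()
        q_new = ((1 - mu) @ y) / (1 - mu).sum()
        if max(abs(pi_new - pi), abs(p_new - p), abs(q_new - q)) < tol:
            return pi_new, p_new, q_new
        pi, p, q = pi_new, p_new, q_new
    return pi, p, q

# Assumed toy data: ten observed tosses; the result depends on the init.
print(three_coin_em([1, 1, 0, 1, 0, 0, 1, 0, 1, 1], 0.5, 0.5, 0.5))
```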
In general, Y denotes the observed data and Z the hidden variables; Y together with Z is called the complete data, while Y alone is the incomplete data. Suppose the observed data Y has probability distribution P(Y | θ). The likelihood of the incomplete data Y is then P(Y | θ), with log-likelihood L(θ) = log P(Y | θ); writing the joint distribution of Y and Z as P(Y, Z | θ), the complete-data log-likelihood is log P(Y, Z | θ). The EM algorithm maximizes L(θ) iteratively.

The EM algorithm centers on the Q function, the core of the method: the expectation of the complete-data log-likelihood with respect to the conditional distribution of the unobserved data Z given the observed data Y and the current parameters θ^(i). Step 1: the initial parameter values can be chosen arbitrarily, but the EM algorithm is very sensitive to them. Step 2: compute Q(θ, θ^(i)), where the first argument is the parameter over which we maximize and the second is the current parameter estimate. Each iteration in effect computes the Q function and then maximizes it.
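In the notation above, one full EM iteration can be written compactly as follows (a standard formulation consistent with the text rather than a formula recovered from it):

```latex
\begin{aligned}
\textbf{E step:}\quad & Q(\theta, \theta^{(i)})
  = \mathbb{E}_{Z}\bigl[\log P(Y, Z \mid \theta) \,\bigm|\, Y, \theta^{(i)}\bigr]
  = \sum_{Z} P(Z \mid Y, \theta^{(i)})\, \log P(Y, Z \mid \theta), \\
\textbf{M step:}\quad & \theta^{(i+1)} = \arg\max_{\theta}\, Q(\theta, \theta^{(i)}),
\end{aligned}
```

with the iteration stopped once the change in θ or in L(θ) falls below a chosen threshold.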
Common training methods for latent semantic models

The latent semantic model is a widely used text-representation method: it represents text as points in a low-dimensional vector space, which makes tasks such as text classification and clustering convenient. In practice, training an effective latent semantic model is very important, and this article introduces the training methods commonly used for it.

1. Training methods based on matrix factorization

1.1 SVD. Singular value decomposition (SVD) is a matrix-factorization method that decomposes a matrix into a product of three matrices, A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix whose diagonal entries are the singular values. In a latent semantic model we can factor the user-item rating matrix R into the product of two low-rank matrices P and Q, R ≈ PQ^T, where P is the matrix of user vectors and Q the matrix of item vectors.
Concretely, for SVD we first preprocess the rating matrix R: typically we subtract each user's or each item's mean rating and normalize what remains. We then use SVD to decompose the preprocessed matrix and truncate it to the leading singular values, which yields the low-rank factors P and Q together with the diagonal matrix Σ of singular values. By adjusting the dimensionality of P and Q we control the complexity of the model.
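A minimal NumPy sketch of this truncated-SVD construction, assuming a small dense rating matrix that has already been mean-centered:

```python
import numpy as np

# Assumed toy rating matrix (users x items), already mean-centered.
R = np.array([[ 1.0, -0.5,  0.0],
              [-1.0,  0.5,  0.5],
              [ 0.5,  0.0, -0.5],
              [ 0.0,  1.0, -1.0]])

k = 2                                    # latent dimension controls complexity
U, s, Vt = np.linalg.svd(R, full_matrices=False)
P = U[:, :k] * np.sqrt(s[:k])            # user vectors (k-dimensional)
Q = Vt[:k, :].T * np.sqrt(s[:k])         # item vectors (k-dimensional)

R_hat = P @ Q.T                          # rank-k approximation of R
print(np.round(R_hat, 2))
```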
During training, we minimize the error between the predicted and actual ratings using gradient descent or a similar method. Concretely, in each iteration we randomly pick a user-item pair (u, i), compute the predicted rating p_ui, and update the corresponding vectors in P and Q from the actual rating r_ui. The update rules are p_u ← p_u + η (e_ui q_i − λ p_u) and q_i ← q_i + η (e_ui p_u − λ q_i), where η is the learning rate, λ is the regularization parameter, and e_ui = r_ui − p_ui is the error between the predicted and actual rating.
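A compact sketch of this stochastic-gradient update loop, assuming ratings given as (user, item, rating) triples; the hyperparameter values are illustrative:

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=8, eta=0.01, lam=0.05, epochs=20):
    """Factor R ~= P @ Q.T from (u, i, r) triples by stochastic gradient descent."""
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((n_users, k))    # user vectors p_u
    Q = 0.1 * rng.standard_normal((n_items, k))    # item vectors q_i
    for _ in range(epochs):
        rng.shuffle(ratings)
        for u, i, r in ratings:
            pu = P[u].copy()                       # cache p_u for both updates
            e = r - pu @ Q[i]                      # e_ui = r_ui - p_ui
            P[u] += eta * (e * Q[i] - lam * pu)    # update rule for p_u
            Q[i] += eta * (e * pu - lam * Q[i])    # update rule for q_i
    return P, Q

# Assumed tiny example: 3 users, 3 items, four observed ratings.
data = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 2, 1.0)]
P, Q = sgd_mf(data, n_users=3, n_items=3)
print(round(float(P[0] @ Q[0]), 2))   # predicted rating for user 0, item 0
```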
1.2 NMF. Nonnegative matrix factorization (NMF) is another matrix-factorization method with wide use in latent semantic models. Unlike SVD, NMF requires all matrix entries to be nonnegative. Concretely, in NMF we preprocess the rating matrix R and factor it into the product of two nonnegative matrices P and Q, R ≈ PQ.
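A brief sketch using scikit-learn's NMF estimator under exactly these nonnegativity constraints; the toy matrix is an assumption for the example:

```python
import numpy as np
from sklearn.decomposition import NMF

# Assumed nonnegative rating matrix (users x items); zeros mean "unrated".
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
P = model.fit_transform(R)     # nonnegative user factors
Q = model.components_          # nonnegative item factors

print(np.round(P @ Q, 2))      # R is approximated by the product PQ
```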
A summary of the principles of the EM algorithm

The EM algorithm (Expectation-Maximization algorithm) is an iterative optimization algorithm for estimating the parameters of probabilistic models that contain latent variables. It can estimate model parameters in the presence of missing data and approaches the model's maximum likelihood estimate step by step through iteration.

The principle of the EM algorithm can be summarized in the following steps (a worked M-step example follows the list):

1. Initialize the model parameters. The parameters must first be initialized, either randomly or with initial values chosen from experience.

2. E step: using the current parameter estimates, compute the expected value (expectation) of the latent variables for each data sample. This computation usually takes the form of a conditional probability: the posterior probability of the latent variables given the current parameters.

3. M step: using the expectations of the latent variables computed in the E step, maximize the likelihood function. This is usually done in the spirit of maximum likelihood estimation, by taking the partial derivatives of the likelihood function and setting them to zero to solve for the parameters.

4. Update the parameters: the estimates obtained in the M step become the starting point of the next iteration. The loop repeats until the parameters converge or a preset number of iterations is reached.
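For a Gaussian mixture, for instance, setting the derivatives of the expected complete-data log-likelihood to zero gives closed-form M-step updates. This is a standard result stated here for concreteness, with γ_jk denoting the E-step responsibility of component k for sample x_j:

```latex
\mu_k = \frac{\sum_{j=1}^{N} \gamma_{jk}\, x_j}{\sum_{j=1}^{N} \gamma_{jk}}, \qquad
\sigma_k^2 = \frac{\sum_{j=1}^{N} \gamma_{jk}\,(x_j - \mu_k)^2}{\sum_{j=1}^{N} \gamma_{jk}}, \qquad
\pi_k = \frac{1}{N}\sum_{j=1}^{N} \gamma_{jk}.
```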
A strength of the EM algorithm is that parameter estimation for probabilistic models with latent variables is comparatively stable. Because the algorithm substitutes expected values for the latent variables, it can produce valid estimates even when data are missing. In addition, its computation is simple and clear and easy to implement.

The EM algorithm nevertheless has shortcomings. First, it only finds a local maximum of the likelihood and cannot guarantee the global maximum. Second, it is very sensitive to the initial parameters and may get stuck in a poor local optimum. Moreover, its convergence can be slow, requiring many iterations to reach a satisfactory result.

To address these problems, improved variants can be used, such as the accelerated EM algorithm and the pruning-based EM algorithm. These variants build on the basic algorithm with refinements that improve convergence speed and estimation accuracy.

In summary, the EM algorithm is an optimization algorithm for estimating the parameters of probabilistic models that contain latent variables.
EM clustering algorithm workflow

Clustering algorithms are widely used in data analysis to group similar data points together based on certain criteria. These algorithms play a crucial role in various fields such as machine learning, data mining, and pattern recognition. One of the most commonly used clustering algorithms is the K-means algorithm, which is known for its simplicity and efficiency. The K-means algorithm works by partitioning a dataset into K clusters, where each data point belongs to the cluster with the nearest mean.
The workflow of the K-means algorithm begins with randomly selecting K initial cluster centroids. Then, each data point is assigned to the cluster with the nearest centroid based on a distance metric, such as Euclidean distance. After the assignment step, the centroids of the clusters are updated by taking the average of all data points assigned to that cluster. This process iterates until the centroids converge, and the clustering is considered to be stable.
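A minimal NumPy sketch of exactly this workflow, with random centroid initialization, nearest-centroid assignment, mean updates and a convergence check; the data set and K are assumed for the example:

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.linalg.norm(new - centroids) < tol:   # centroids stable: stop
            return new, labels
        centroids = new
    return centroids, labels

# Two assumed well-separated blobs in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 2))
```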
EM algorithm principles

The EM algorithm is an iterative algorithm for statistical problems with missing data or latent variables. It was first proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977 and is widely applied in many fields, such as machine learning, data mining, and biostatistics.

The full name of the EM algorithm is the Expectation-Maximization algorithm. Its basic idea is to use the information in the observed data and the latent variables to iteratively refine the parameter estimates until the parameters that fit the data best are found.

The EM algorithm proceeds as follows; a code sketch of the full loop is given after the steps.

Step 1: initialize the parameters. The model parameters must first be initialized, either randomly or from domain knowledge. The initialization strongly affects the final result, so several runs are usually needed to obtain a good solution.

Step 2: the E step (Expectation Step). In the E step, the posterior probabilities of the latent (unobservable) variables are computed from the current parameter estimates. This is done with Bayes' rule, where the observed data and the current parameter estimates are known. These posterior probabilities are the expectations of the hidden variables, which is why the step is called the E step.

Step 3: the M step (Maximization Step). In the M step, the expectations of the hidden variables obtained in the E step are used to update the parameter estimates by maximizing the complete-data likelihood function. For some simple models this step has a closed-form solution for the parameters; for complex models a numerical optimization algorithm (such as gradient descent or Newton's method) can be used to maximize the likelihood.

Step 4: repeat the E and M steps. After each M step, the E and M steps are repeated until a stopping criterion is met. Typically one sets a maximum number of iterations or a termination condition, such as the parameter change falling below some threshold.
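These four steps translate almost directly into code. The sketch below runs EM for a one-dimensional two-component Gaussian mixture and stops when the parameter change falls below a threshold or the iteration cap is reached; the synthetic data and all names are assumptions for illustration:

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, n_iter=200, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize parameters (here: random means, equal weights).
    w, mu, sigma = np.array([0.5, 0.5]), rng.choice(x, 2), np.array([1.0, 1.0])
    for _ in range(n_iter):
        # Step 2 (E step): posterior probability of each component per point.
        num = np.stack([w[k] * gaussian(x, mu[k], sigma[k]) for k in range(2)])
        r = num / num.sum(axis=0)
        # Step 3 (M step): closed-form maximization of the expected likelihood.
        nk = r.sum(axis=1)
        new_w, new_mu = nk / len(x), (r @ x) / nk
        new_sigma = np.sqrt((r @ x**2) / nk - new_mu**2)
        # Step 4: stop when the parameter change drops below the threshold.
        delta = max(np.abs(new_mu - mu).max(), np.abs(new_w - w).max())
        w, mu, sigma = new_w, new_mu, new_sigma
        if delta < tol:
            break
    return w, mu, sigma

x = np.concatenate([np.random.default_rng(1).normal(0, 1, 300),
                    np.random.default_rng(2).normal(5, 0.7, 200)])
print(em_gmm_1d(x))
```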
A strength of the EM algorithm is that it adapts well to probabilistic models with missing data or latent variables. It also converges relatively quickly, which is why it is used so widely in practice. However, the algorithm has limitations. First, it is only guaranteed to converge to a local maximum of the likelihood, not the global maximum. Second, for high-dimensional problems its computational cost is often prohibitive.
EM-algorithm for Training of State-space Models with Application to Time Series Prediction

Elia Liitiäinen, Nima Reyhani and Amaury Lendasse
Helsinki University of Technology - Neural Networks Research Center
P.O. Box 5400, FI-02015, Finland
{elia,lendasse,nreyhani}@cis.hut.fi

Abstract. In this paper, an improvement to the E step of the EM algorithm for nonlinear state-space models is presented. We also propose strategies for model structure selection when the EM-algorithm and state-space models are used for time series prediction. Experiments on the Poland electricity load time series show that the method gives good short-term predictions and can also be used for long-term prediction.

1 Introduction

Time series prediction is an important problem in many fields. Applications include finance, prediction of electricity load and ecology.

Time series prediction is a problem of system identification. Takens' theorem [1] ensures that if the dynamic of an underlying system is deterministic, a regression approach can be used for prediction. However, many real-world phenomena are stochastic and more complicated methods might be necessary. One possibility is to use a state-space model for predicting the behavior of a time series. This kind of model is very general and is able to model many kinds of phenomena.

Choosing the structure and estimating the parameters of a state-space model can be done in many ways. One approach is to use the EM-algorithm and Radial Basis Function (RBF) networks [2]. The main problem is that the E-step of the algorithm is difficult due to the nonlinearity of the model. In this paper, we apply a Gaussian filter which handles nonlinearities better than the ordinary Extended Kalman Filter (EKF) which is used in [2].

The problem of choosing the dimension of the state-space and the number of neurons is also investigated. This problem is important as too complicated a model can lead to numerical instability, high computational complexity and overfitting. We propose the use of training error curves for model structure selection. The experimental results show that the method works well in practice. We also show that the EM-algorithm maximizes the short-term prediction performance but is not optimal for long-term prediction.

2 Gaussian Linear Regression Filters

Consider the nonlinear state-space model

    x_k = f(x_{k-1}) + w_k,    (1)
    y_k = g(x_k) + v_k,    (2)

where w_k ~ N(0, Q) and v_k ~ N(0, r) are independent Gaussian random variables, x_k ∈ R^n and y_k ∈ R. Here, N(0, Q) means the normal distribution with covariance matrix Q. Correspondingly, r is the variance of the observation noise.

The filtering problem is stated as calculating p(x_k | y_1, ..., y_k). If f and g were linear, this could be done using the Kalman filter. The linear filtering theory can be applied to nonlinear problems by using Gaussian linear regression filters (LRKF), which are recursive algorithms based on statistical linearization of the model equations. The linearization can be done in various ways, resulting in slightly different algorithms.

Denote by p̃(x_k | y_1, ..., y_k) a Gaussian approximation of p(x_k | y_1, ..., y_k). The first phase in a recursive step of the algorithm is calculating p̃(x_{k+1} | y_1, ..., y_k), which can be done by linearizing f(x_k) ≈ A_k x_k + b_k so that the error

    tr(e_k) = tr( ∫_{R^n} (f(x_k) − A_k x_k − b_k)(f(x_k) − A_k x_k − b_k)^T p̃(x_k | y_0, ..., y_k) dx_k )    (3)

is minimized. Here, tr denotes the sum of the diagonal elements of a matrix.
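As a rough numerical illustration of this statistical linearization, the Monte Carlo sketch below fits A_k and b_k by least squares under a sampled Gaussian density, approximately minimizing the error in equation (3); the scalar f and the Gaussian's moments are assumptions for the example, and the residual e_k is the term that inflates Q below.

```python
import numpy as np

# Statistical linearization of f under a Gaussian density (Monte Carlo
# sketch of minimizing the error in equation (3); f, m and P are assumed).
rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                 # assumed scalar nonlinearity
m, P = 0.5, 0.2                         # moments of the Gaussian approximation

xs = rng.normal(m, np.sqrt(P), 10_000)  # samples from the Gaussian density
X = np.column_stack([xs, np.ones_like(xs)])
sol, *_ = np.linalg.lstsq(X, f(xs), rcond=None)
Ak, bk = sol                            # best linear fit f(x) ~ Ak * x + bk

e_k = np.mean((f(xs) - Ak * xs - bk) ** 2)   # linearization error e_k
print(round(Ak, 3), round(bk, 3), round(e_k, 5))
```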
In addition, Q is replaced by Q̃_k = Q_k + e_k. The linearized model is used for calculating p̃(x_{k+1} | y_1, ..., y_k) using the theory of linear filters. The measurement update p̃(x_{k+1} | y_1, ..., y_k) → p̃(x_{k+1} | y_1, ..., y_{k+1}) is done by a similar linearization.

Approximating equation (3) using central differences would lead to the central difference filter (CDF), which is related to the unscented Kalman filter. However, by using Gaussian nonlinearities as described in the following sections, no numerical integration is needed in the linearization.

The smoothed density p(x_l | y_1, ..., y_k) (l < k) is also of interest. In our experiments we use the Rauch-Tung-Striebel smoother [4] with the linearized model.

3 Expectation Maximization (EM)-algorithm for Training of Radial Basis Function Networks

In this section, a training algorithm introduced in [2] is described. The EM-algorithm is used for parameter estimation for nonlinear state-space models.

3.1 Parameter Estimation

Suppose a sequence of observations (y_k), k = 1, ..., N, is available. The underlying system that produced the observations is modelled as a state-space model. The functions f and g in equations (1) and (2) are parametrized using RBF-networks:

    f = w_f^T Φ_f,    (4)
    g = w_g^T Φ_g,    (5)

where Φ_f = [ρ_f1(x), ..., ρ_fl(x), x^T, 1]^T and Φ_g = [ρ_g1(x), ..., ρ_gj(x), x^T, 1]^T are the neurons of the RBF-networks. The nonlinear neurons are of the form

    ρ(x) = |2πS|^(-1/2) exp(-(1/2)(x - c)^T S^(-1) (x - c)).    (6)

3.3 Choosing Kernel Means and Widths

Choosing the centers of the neurons (c in equation (6)) is done with the k-means algorithm [5]. The widths S_j are chosen according to the formula (see [6])

    S_j = (1/(2l)) Σ_{i=1}^{l} ||c_j - c_{N(j,i)}||^2 I,    (8)

where I is the identity matrix and N(j,i) is the i-th nearest neighbor of j. In the experiments, we use the value l = 2.

3.4 Choosing the Number of Neurons and the Dimension of the State-Space

To estimate the dimension of the state-space, we propose the use of a validation set to estimate the generalization error. For each dimension, the model is calculated for different numbers of neurons and the one which gives the lowest validation error is chosen. This procedure is repeated to get averaged estimates which are used for dimension selection.

For choosing the number of neurons, we propose the use of the training error, which has the advantage that the whole data set is used for training. Usually there is a bend after which the training error decreases only slowly (see figure 1c). This kind of a bend is used as an estimate of the point after which overfitting becomes a problem. To avoid local minima, averaging is used.

Because iterative prediction is used, the one-step prediction error is used for model structure selection. Instead, it would also have been possible to use the likelihood.

4 Experiments

The experiments are made with a time series that represents the daily electricity consumption in Poland during 1400 days in the 90s [7]; see figure 1a. The values 1-1000 are used for training and the values 1001-1400 for testing. For the dimension selection, the values 601-1000 are kept for validation.

The model is tested using the dimensions 2 to 8 for the state-space. For each dimension, the validation errors are calculated for different numbers of neurons, going from 0 to 36 in steps of 2 (we use the same number of neurons for both networks). To calculate the prediction errors, the state at the beginning of each prediction is estimated with the linear-regression filter. The prediction is made by approximating the future states and observations with the linear-regression approach. For L in formula (7), we choose 10.
The choice is based on the results in [8], which imply that a window size of 10 contains enough information for prediction.

For each number of neurons, the training is stopped if the likelihood decreased twice in a row or more than 40 iterations have been made.

Fig. 1: a) The Poland electricity consumption time series. b) 1-step prediction validation errors for dimension selection. c) 1-step prediction errors for dimension 7; the solid line is the error on the test set and the dashed line the error on the training set. d) 5-step test error for dimension 7. e) An example of prediction.

In figure 1b are the validation MSE curves calculated by averaging the values for each number of neurons over four simulations and choosing the minimum for each dimension. It can be seen that a good choice for the dimension of the state-space is 7 (even though the validation error for dimension 6 is nearly as good). The errors are averaged over four simulations.

The training error curve for 1-step prediction shows clearly that a good choice for the number of neurons is 6. The test error shows that the model chosen based on the training error gives good results on short-term prediction.

The chosen model gives quite good but not optimal 5-step prediction results. We claim that the EM-algorithm optimizes mainly the short-term prediction performance, which can be seen by examining the test error curves. The 1-step test error curve is much smoother. Thus, an algorithm designed for long-term prediction would certainly give much better results.

In table 1 are the results corresponding to the averaged prediction over four simulations. In comparison, we have calculated the corresponding results using linear regression, lazy learning [9] and LS-SVM [8]. As expected, the 1-step MSE of the EM-algorithm is very good, but the 5-step MSE is less good.

Table 1: 5-step MSE of the four methods: 0.0013, 0.0021, 0.0016, 0.0015.