Step Size Adaptation in Evolution Strategies using Reinforcement Learning
A Survey of Performance Metrics for Multi-Objective Evolutionary Algorithms

Multi-objective evolutionary algorithms (MOEAs) are evolutionary algorithms for solving multi-objective optimization problems. An MOEA maintains a population of individuals and, through operators such as crossover and mutation, progressively searches the solution space to obtain a set of good approximate optimal solutions, solutions that perform well on the different objectives while remaining well balanced against one another. The performance of an MOEA is usually evaluated along the following dimensions.

1. Quality of the approximate optimal solution set. This is one of the most important criteria: it measures whether the algorithm finds a set of high-quality non-dominated solutions. Because the solution space of a multi-objective problem is typically very large, the solution set found by the algorithm is usually only an approximation of the true non-dominated set. A good approximation should lie as close as possible to the true non-dominated set, and its solutions should be well balanced against each other.

2. Correctness of the dominance relations. Solutions of a multi-objective problem are compared through the dominance relation: a solution A dominates a solution B if A is at least as good as B on every objective and strictly better on at least one. The solution set returned by the algorithm should reflect these dominance relations correctly and preserve the non-dominated status of its members (a minimal check is sketched below).
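To make the dominance check concrete, here is a minimal Python sketch; it assumes all objectives are minimized, and the function name is our own:

```python
from typing import Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if objective vector `a` Pareto-dominates `b`.

    Assumes all objectives are minimized: `a` must be no worse than `b`
    in every objective and strictly better in at least one.
    """
    no_worse_everywhere = all(ai <= bi for ai, bi in zip(a, b))
    strictly_better_somewhere = any(ai < bi for ai, bi in zip(a, b))
    return no_worse_everywhere and strictly_better_somewhere

# (1, 2) dominates (2, 2); (1, 3) and (3, 1) are mutually non-dominated.
assert dominates((1, 2), (2, 2))
assert not dominates((1, 3), (3, 1)) and not dominates((3, 1), (1, 3))
```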
3. Coverage of the external archive. The external archive is the set of approximate optimal solutions kept by the algorithm, and its coverage is an important performance indicator: the higher the coverage, the more completely the approximation covers the true non-dominated set. Coverage is usually quantified with indicators such as the hypervolume or the inverted generational distance (IGD); a small sketch of both follows.
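As an illustration of how such indicators can be computed, the sketch below implements a two-objective hypervolume and the IGD. Both assume minimization, and the reference point and reference front used in the example are illustrative inputs, not values from any benchmark:

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def hypervolume_2d(front: List[Point], ref: Point) -> float:
    """Area dominated by `front` and bounded by the reference point `ref`.

    Assumes two minimized objectives and that every point is <= ref in both.
    """
    pts = sorted(front)                         # sweep over the first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                        # only non-dominated slices add area
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

def igd(approx: List[Point], reference_front: List[Point]) -> float:
    """Inverted generational distance: mean distance from each reference point
    to its nearest point in the approximation (smaller is better)."""
    return sum(min(math.dist(r, a) for a in approx)
               for r in reference_front) / len(reference_front)

# Toy usage with an illustrative reference point and reference front.
front = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
print(hypervolume_2d(front, ref=(5.0, 5.0)))                      # 11.0
print(igd(front, reference_front=[(1.0, 4.0), (2.5, 2.5), (4.0, 1.0)]))
```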
4. Diversity. Diversity refers to how different the solutions in the approximate optimal set are from one another. On one hand, the algorithm should find solutions that are as diverse as possible, so that the search covers all directions of the solution space; on the other hand, the solutions should keep a certain distance from each other so that the set does not become concentrated in one region (one way to measure this is sketched below).
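One simple way to quantify this spread, in the spirit of Schott's spacing metric, is the standard deviation of each solution's distance to its nearest neighbour in objective space. The formulation below is our own illustrative sketch, not a standard library routine:

```python
import math
from typing import List, Sequence

def spacing(front: List[Sequence[float]]) -> float:
    """Spread indicator in the spirit of Schott's spacing metric: the standard
    deviation of nearest-neighbour Manhattan distances in objective space.
    A value near 0 means the solutions are evenly distributed."""
    d = []
    for i, p in enumerate(front):
        d.append(min(sum(abs(pk - qk) for pk, qk in zip(p, q))
                     for j, q in enumerate(front) if j != i))
    mean_d = sum(d) / len(d)
    return math.sqrt(sum((mean_d - di) ** 2 for di in d) / (len(d) - 1))

# Evenly spread points give a smaller value than clustered ones.
print(spacing([(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]))   # 0.0
print(spacing([(0, 4), (0.1, 3.9), (0.2, 3.8), (4, 0)]))   # > 0
```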
5. Computational efficiency and convergence speed. Even if an algorithm finds a high-quality approximate optimal set, excessive computation time limits its practical use, so the algorithm should be as fast and efficient as possible while maintaining solution quality.

In summary, the performance of a multi-objective evolutionary algorithm is mainly evaluated by the quality of the approximate optimal solution set, the correctness of the dominance relations, the coverage of the external archive, the diversity of the solutions, and the computational efficiency and convergence speed.
A Survey of Evolutionary Computation

1. What is evolutionary computation? In computer science, evolutionary computation is a subfield of artificial intelligence, or more specifically of computational intelligence, that deals with combinatorial optimization problems. Its algorithms are inspired by the "survival of the fittest" selection mechanism of biological evolution and by the rules of genetic inheritance. The process is simulated iteratively in software: the problem to be solved plays the role of the environment, and an optimal solution is sought through the natural evolution of a population of candidate solutions.

2. Origins of evolutionary computation. The idea of using Darwinian principles to solve problems dates back to the 1950s. In the 1960s it was developed independently in three places. In the United States, Lawrence J. Fogel proposed evolutionary programming, while John Henry Holland of the University of Michigan drew on Darwin's theory of evolution and Mendel's laws of inheritance, distilling, simplifying, and abstracting their basic ideas into genetic algorithms. In Germany, Ingo Rechenberg and Hans-Paul Schwefel proposed evolution strategies. These lines of work developed largely independently for about fifteen years. Before the 1980s they attracted little attention: the methods were not yet mature, and the limited memory and speed of contemporary computers prevented practical applications. In the early 1990s genetic programming emerged as a further branch, and evolutionary computation began to take shape as a discipline in its own right. The four branches now interact frequently, borrow from one another, and have been combined into new evolutionary algorithms, driving the rapid growth of the field.

Nils Aall Barricelli began using evolutionary algorithms and artificial life to simulate evolution in the 1960s, and this work was greatly extended by Alex Fraser's series of papers on the simulation of artificial selection [1]. Ingo Rechenberg's use of evolution strategies to solve complex engineering problems in the 1960s and early 1970s established artificial evolution as a widely recognized optimization method.
Classic References on Building Phylogenetic Trees from Concatenated Multi-Gene Data

1. Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4), 783-791. This classic paper introduces the bootstrap for phylogenies: by resampling the data set repeatedly and reanalysing it, it provides confidence assessments for the inferred tree.

2. Nei, M., & Kumar, S. (2000). Molecular Evolution and Phylogenetics. Oxford University Press. This classic textbook describes in detail how to build phylogenetic trees from concatenated multi-gene data, explains the different evolutionary models and computational methods, and provides worked examples and case studies.

3. Yang, Z. (2006). Computational Molecular Evolution. Oxford University Press. This textbook covers simulation and tree building with concatenated multi-gene data, explaining the commonly used evolutionary models, computational methods, and statistical inference, as well as how to assess the reliability of the resulting trees.

4. Rannala, B., & Yang, Z. (1996). Probability distribution of molecular evolutionary trees: A new method of phylogenetic inference. Journal of Molecular Evolution, 43(3), 304-311. This paper proposes a Bayesian method for building phylogenetic trees and estimating their parameters; using simulated data sets, the authors compare it with traditional approaches and demonstrate its effectiveness on concatenated multi-gene data.

5. Wiens, J. J., & Moen, D. S. (2008). Missing data and the accuracy of Bayesian phylogenetics. Journal of Systematics and Evolution, 46(3), 307-314. This paper examines the effect of missing data in concatenated multi-gene data sets and proposes a Bayesian approach for handling it.
The Development of Genetic Algorithms

Genetic algorithms (GAs) are optimization algorithms based on the principles of natural selection and genetics, proposed by the American computer scientist John Holland in the 1960s. A genetic algorithm simulates natural evolution: candidate solutions are represented by a genetic encoding, and operators such as crossover and mutation are used to explore the solution space and gradually improve the solutions.
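To illustrate the loop just described, here is a minimal genetic algorithm sketch; the binary test problem (OneMax), the population size, and the operator rates are illustrative choices rather than recommendations:

```python
import random

def onemax(bits):                     # toy fitness: number of 1-bits (maximize)
    return sum(bits)

def genetic_algorithm(n_bits=30, pop_size=40, generations=100,
                      p_cross=0.9, p_mut=1 / 30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def select():                     # binary tournament selection
        a, b = rng.sample(pop, 2)
        return a if onemax(a) >= onemax(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < p_cross:            # one-point crossover
                cut = rng.randrange(1, n_bits)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            for c in (c1, c2):                    # bit-flip mutation
                children.append([b ^ 1 if rng.random() < p_mut else b for b in c])
        pop = children
    return max(pop, key=onemax)

best = genetic_algorithm()
print(onemax(best), "of 30 bits set")
```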
The main milestones in the development of genetic algorithms are the following.

1. Early research (1960s-1970s). John Holland formulated the basic principles of genetic algorithms in the 1960s and applied them to function optimization. His work generated broad interest, but limited computing power kept the range of applications narrow.

2. Evolution strategies (1960s-1980s). In Germany, Hans-Paul Schwefel, building on joint work with Ingo Rechenberg begun in the 1960s, developed evolution strategies, a closely related selection-based optimization approach. This work broadened the field of evolutionary algorithms and consolidated basic concepts such as fitness functions, recombination, and mutation.

3. Theoretical consolidation (1990s). During the 1990s the theoretical foundations of genetic algorithms were strengthened. Holland and others put forward theoretical results on genetic operators, arguing that under idealized conditions a genetic algorithm converges step by step toward the optimum. At the same time, researchers proposed numerous refinements such as elitism and adaptive parameter control.

4. Broader applications (2000s). In the early twenty-first century, as computing power grew, genetic algorithms were applied in a much wider range of areas, including the travelling salesman problem, network optimization, and machine learning. Researchers also built on the theory to propose many variants, such as gene-expression encodings and improved selection strategies.

5. Multi-objective genetic algorithms (recent years). In recent years, research attention has shifted toward multi-objective optimization. Whereas a traditional genetic algorithm returns a single best solution, multi-objective genetic algorithms (MOGAs) optimize several objectives simultaneously and describe the globally optimal trade-offs through a set of solutions.
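As a small illustration of what "a set of solutions" means here, the following sketch (our own helper, assuming all objectives are minimized) filters the non-dominated front out of a list of objective vectors:

```python
from typing import List, Sequence

def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the subset of `points` not Pareto-dominated by any other point
    (all objectives minimized). O(n^2), fine for small populations."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

print(non_dominated([(1, 5), (2, 2), (5, 1), (3, 3)]))
# -> [(1, 5), (2, 2), (5, 1)]   ((3, 3) is dominated by (2, 2))
```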
On Many-Objective Evolutionary Algorithms Based on Estimating the Curvature of the Pareto Front

Based on the proposed topic, this article examines many-objective evolutionary algorithms that rely on estimating the curvature of the Pareto front, working from the simple to the complex so that the Pareto front, curvature estimation, and many-objective evolutionary algorithms can all be understood in context.

First, the Pareto front. Pareto optimality is a central concept in multi-objective optimization: the Pareto-optimal solutions are those for which no other solution improves all objectives at once, so trade-offs among the objectives are unavoidable.

Next, the role of curvature estimation in multi-objective optimization. Curvature estimation is a technique for estimating the curvature of the Pareto front; it helps the algorithm understand the shape of the front and therefore search for optimal solutions more effectively.
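The text does not specify how the curvature is estimated. One simple illustrative possibility, sketched below, is to fit an exponent p to a normalized front under the model f1^p + ... + fm^p = 1, where p below 1 suggests a convex front, p near 1 a linear front, and p above 1 a concave one. This is our own hedged example, not the method of any particular algorithm:

```python
import numpy as np

def estimate_front_curvature(front: np.ndarray) -> float:
    """Fit a curvature exponent p for a non-dominated front, assuming the
    normalized front is roughly of the form sum_i f_i**p = 1 on [0, 1]^m.
    Illustrative least-squares grid search, dependency-free beyond NumPy."""
    # Normalize each objective to [0, 1] using the front's own extremes.
    f = (front - front.min(axis=0)) / (front.max(axis=0) - front.min(axis=0) + 1e-12)
    f = np.clip(f, 1e-6, 1.0)

    def residual(p: float) -> float:
        return float(np.mean((np.sum(f ** p, axis=1) - 1.0) ** 2))

    candidates = np.linspace(0.1, 5.0, 491)          # exponents 0.1 .. 5.0, step 0.01
    return float(candidates[np.argmin([residual(p) for p in candidates])])

# A quarter circle f1^2 + f2^2 = 1 should yield an exponent close to 2.
theta = np.linspace(0.01, np.pi / 2 - 0.01, 50)
front = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(estimate_front_curvature(front))
```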
The article then analyses the principles and applications of many-objective evolutionary algorithms in detail. A many-objective evolutionary algorithm is an evolutionary algorithm designed for optimization problems with many objectives; by estimating the curvature of the Pareto front it can search the solution set of such problems more accurately. The strengths and limitations of this class of algorithms are discussed so that their characteristics can be understood as a whole. The closing section summarizes and reviews curvature-estimation-based many-objective evolutionary algorithms and adds some personal views and reflections from different angles.

Many-objective evolutionary algorithms based on Pareto-front curvature estimation have broad application prospects in practice. Estimating the curvature of the Pareto front makes it possible to optimize multi-objective problems effectively and to find more comprehensive solutions. This article therefore explores the concept of the Pareto front, methods for estimating its curvature, and the principles and applications of many-objective evolutionary algorithms, so that readers can better understand this important topic. To begin, consider the Pareto front in more detail: Pareto optimality is the core concept of multi-objective optimization, representing the set of solutions that achieve the best possible trade-offs among the objectives.
The Scaling Factor in Evolutionary Algorithms: Overview and Explanation

1. Introduction

1.1 Overview. This section briefly introduces evolutionary algorithms and their use in optimization problems. Evolutionary algorithms are optimization methods based on the principles of biological evolution, originally inspired by Darwin's theory. They simulate the evolution of species in nature: through fitness evaluation, selection, crossover, and mutation, the candidate solutions in a population are improved step by step until an optimal or near-optimal solution is found. Evolutionary algorithms perform well on complex optimization problems; their flexibility and robustness make them effective for many practical applications, and compared with traditional optimization methods they often find better solutions, especially on high-dimensional and nonlinear problems.

The scaling factor is an important parameter of such algorithms. Its value determines the magnitude of the perturbations applied between individuals and therefore shapes the search space explored during evolution. A well-chosen scaling factor preserves population diversity while exploiting both local and global information about the solution space, improving the convergence speed and the quality of the search. This article therefore focuses on the role of the scaling factor in evolutionary algorithms and on its influence on performance; by reviewing and analysing the existing literature, it aims to provide a deeper understanding of the scaling factor and useful guidance for future research and applications.
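The text does not tie the scaling factor to one specific algorithm; it is easiest to see in a differential-evolution-style mutation, where it multiplies the difference vector between individuals. The following is a hedged sketch under that assumption:

```python
import numpy as np

def de_mutation(population: np.ndarray, i: int, F: float,
                rng: np.random.Generator) -> np.ndarray:
    """DE/rand/1 mutant vector for target index i: v = x_r1 + F * (x_r2 - x_r3).

    F is the scaling factor discussed above: a small F gives small perturbations
    and local search, a large F gives large perturbations and broader exploration.
    """
    idx = [j for j in range(len(population)) if j != i]
    r1, r2, r3 = rng.choice(idx, size=3, replace=False)
    return population[r1] + F * (population[r2] - population[r3])

rng = np.random.default_rng(0)
pop = rng.normal(size=(20, 5))                  # 20 individuals, 5 decision variables
print(de_mutation(pop, i=0, F=0.5, rng=rng))    # a typical fixed choice of F
print(de_mutation(pop, i=0, F=0.9, rng=rng))    # larger F, larger step
```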
1.2 Structure of the article. The article is organized as follows. The introduction outlines the basic principles of evolutionary algorithms and the role the scaling factor plays in them. The main body presents the basic principles of evolutionary algorithms and then examines the specific role and applications of the scaling factor. The conclusion analyses the influence of the scaling factor on algorithm performance and discusses directions for future research.

Specifically, the main body consists of two subsections. The first introduces the basic principles of evolutionary algorithms, including common methods such as genetic algorithms and particle swarm optimization, their working principles, and how fitness is evaluated. The second focuses on the scaling factor: its definition and computation, its concrete uses in optimization problems, how it affects the search ability and convergence speed of the algorithm, and how different types of scaling factor differ in their effects.
Research on Multi-Objective Evolutionary Optimization Algorithms with Adaptive Capabilities

With the continuing progress of computer technology, multi-objective optimization algorithms play an important role in solving practical problems. Traditional multi-objective optimization algorithms, however, often struggle with high dimensionality, non-convex solution sets, and the sheer variety of problems, which degrades their performance in practice. To overcome these difficulties and improve adaptivity, researchers have proposed multi-objective evolutionary optimization algorithms with adaptive capabilities.

Such an algorithm adapts flexibly to the characteristics of the problem at hand: by continually adjusting its parameters and the selection probabilities of its operators, it fits itself to different problems. Concretely, it typically includes adaptive crossover, adaptive mutation, and adaptive selection.

For crossover, traditional multi-objective evolutionary algorithms use a fixed crossover probability, but the appropriate probability differs across problem domains. An adaptive algorithm therefore adjusts the crossover probability dynamically according to the nature of the problem; a common approach is to update it based on how the fitness of the individuals is changing.

Adaptive mutation is similar. In traditional algorithms the mutation probability is fixed, yet different problems call for different values, so an adaptive algorithm updates the mutation probability with a suitable strategy based on the characteristics of the problem. The aim is to preserve the algorithm's search ability across problem domains and thus to locate the solution set more reliably.
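One widely cited way to implement this kind of feedback, in the spirit of Srinivas and Patnaik's adaptive genetic algorithm, scales the crossover and mutation probabilities by how close an individual's fitness is to the current best; the constants below are illustrative, and maximization is assumed:

```python
def adaptive_rates(f_parents, f_max, f_avg,
                   k1=1.0, k2=0.5, k3=1.0, k4=0.5):
    """Adaptive crossover/mutation probabilities in the spirit of Srinivas &
    Patnaik's adaptive GA (maximization assumed; k1..k4 are illustrative).
    Fitter-than-average individuals get lower rates; below-average individuals
    keep the full exploratory rates."""
    spread = max(f_max - f_avg, 1e-12)        # avoid division by zero
    f_prime = max(f_parents)                  # the better parent drives crossover
    pc = k1 * (f_max - f_prime) / spread if f_prime >= f_avg else k3
    pm_per_parent = [k2 * (f_max - f) / spread if f >= f_avg else k4
                     for f in f_parents]
    return pc, pm_per_parent

# A pair near the best gets gentle treatment; a poor pair is explored hard.
print(adaptive_rates((9.5, 9.0), f_max=10.0, f_avg=6.0))   # low pc, low pm
print(adaptive_rates((4.0, 5.0), f_max=10.0, f_avg=6.0))   # pc = k3, pm = k4
```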
Adaptive selection is the third key ingredient. Traditional multi-objective evolutionary algorithms select promising individuals using strategies such as non-dominated sorting and crowding distance, but these strategies do not suit every problem. An adaptive algorithm therefore updates its selection strategy according to the characteristics of the problem, so that its selection ability remains appropriate across problem domains and the goals of multi-objective optimization are met more effectively.

Overall, multi-objective evolutionary optimization algorithms with adaptive capabilities adapt to the characteristics of different problems by refining their crossover, mutation, and selection operations. Their core idea is to adjust the algorithm's parameters and the selection probabilities of its operators dynamically, so that multi-objective optimization problems can be solved effectively.
Step Size Adaptation in Evolution Strategies using Reinforcement Learning

Sibylle D. Müller, Nicol N. Schraudolph, Petros D. Koumoutsakos
Institute of Computational Sciences, Swiss Federal Institute of Technology, CH-8092 Zürich, Switzerland
muellers, nic, petros@inf.ethz.ch

Abstract. We discuss the implementation of a learning algorithm for determining adaptation parameters in evolution strategies. As an initial test case, we consider the application of reinforcement learning for determining the relationship between success rates and the adaptation of step sizes in the (1+1)-evolution strategy. The results from the new adaptive scheme when applied to several test functions are compared with those obtained from the (1+1)-evolution strategy with a priori selected parameters. Our results indicate that assigning good reward measures seems to be crucial to the performance of the combined strategy.

1 Introduction

The development of step size adaptation schemes for evolution strategies (ES) has received much attention in the ES community. Starting from an early attempt, the so-called "1/5 success rule" [1] applied in two-membered ES's, mutative step size control [2] and self-adaptation schemes [3] were developed, followed by derandomized mutational step size control schemes [4], [5]. The latter two methods have become state-of-the-art techniques that are usually implemented in ES's. These control schemes, employing empirical rules and parameters, have proven successful for solving a wide range of real-world optimization problems.

We consider replacing a priori defined adaptation rules by a more general mechanism that can adapt the step sizes during the evolutionary optimization process automatically. The key concept is the use of a learning algorithm for online step size adaptation. This implies that the optimization algorithm is not supplied with a pre-determined step size adaptation rule; instead, the rules are evolved by means of learning. As an initial test of our approach, we consider the application of reinforcement learning (RL) to the 1/5 success rule in a two-membered ES. In Section 2, we present the concept of RL and an overview of algorithms considered for our problem. The combination of RL with the ES is shown in Section 3, and results are presented in Section 4.

2 Reinforcement Learning

Reinforcement learning (RL) is a learning technique in which an agent learns to find optimal actions for the current state by interacting with its environment. The agent should learn a control strategy, also referred to as a policy, that chooses the optimal actions for the current state so as to maximize its cumulative reward. For this purpose, the agent is given a reward by an external trainer for each action taken. The reward can be immediate or delayed. Sample RL applications are learning to play board games (e.g. Tesauro's backgammon [7]) or learning to control mobile robots, see e.g. [6]. In the robot example, the robot is the agent; it can sense its environment such that it knows its location, i.e., its state. The robot can decide which action to choose, e.g., to move ahead or to turn, from one state to the next. The goal may be to reach a particular location, and for this purpose the agent has to learn a policy. The diagram in Figure 1 shows the interaction between agent and environment in RL.

[Figure 1: The interaction between environment and agent in RL (labels: agent, environment, state, action, reward).]

2.1 The Learning Task

The learning task can be divided into discrete time steps t = 0, 1, 2, ...
At each step the agent observes the environmental state s and decides upon an action a. By performing this action, the agent is transferred to a new state s' and given a reward r. This reward is used to update a value function, which can be either a state-value function V(s) depending on states or an action-value function Q(s, a) depending on states and actions. To each state or state-action pair, the largest expected future reward is assigned by the optimal value functions V* or Q*, respectively. The optimal state-value function can be learned only if both the reward function r and the function δ that describes how the agent is transferred from state to state are explicitly known. Usually, however, r and δ are unknown; in this case, the optimal action-value function Q* can be learned.

2.2 Temporal Difference Learning

In this paper, we consider only RL methods with which the agent can learn the Q function, thereby being able to select optimal actions without explicitly knowing the reward function r and the function δ for the state resulting from applying action a in state s. From this class of Temporal Difference (TD) learning methods, we present two algorithms, namely Q-learning and SARSA. Both need to wait only until the next time step to determine the increment to Q. Such techniques, denoted as TD(0) schemes, combine the advantages of both Dynamic Programming and Monte Carlo methods [8]. The difference between Q-learning and SARSA lies in the treatment of estimation (updating the value function) and the choice of a policy (selecting an action). In off-policy algorithms such as Q-learning, the choice and the estimation of a policy are separated; in on-policy algorithms such as SARSA, they are combined. Experiments on the "Cliff Walking" example in [8] compare Q-learning and SARSA. The results of these experiments indicate that Q-learning approaches the goal faster but fails more often than SARSA.
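For readers unfamiliar with the method, the following is a minimal tabular SARSA(0) sketch in Python on a toy corridor task; the environment, the ε-greedy policy, and all constants are our illustrative assumptions and not the setup used in this paper:

```python
import random

def sarsa0(n_states=10, n_actions=2, episodes=300,
           alpha=0.1, gamma=1.0, epsilon=0.1, max_steps=200, seed=0):
    """Tabular SARSA(0) on a toy corridor: action 1 moves right, action 0 moves
    left, every step costs -1, and the episode ends at the rightmost state."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def eps_greedy(s):
        if rng.random() < epsilon:                             # explore
            return rng.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])    # exploit

    def step(s, a):
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, -1.0, s_next == n_states - 1

    for _ in range(episodes):
        s, a = 0, eps_greedy(0)
        for _ in range(max_steps):
            s_next, r, done = step(s, a)
            a_next = eps_greedy(s_next)
            # On-policy TD(0): the target uses the action actually chosen next.
            Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
            if done:
                break
    return Q

Q = sarsa0()
# Greedy action per non-terminal state after learning; mostly 1 ("move right").
print([max(range(2), key=lambda a: Q[s][a]) for s in range(9)])
```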
Focusing more on safety than on speed, we decided to implement SARSA for our online learning problem, presented in the next section. It should be noted that the convergence of SARSA(0) was proven only recently [10]. The SARSA pseudocode reads as shown in Figure 2 [8]:

    Initialize Q(s, a) arbitrarily.
    Repeat (for each episode):
        Initialize s.
        Choose a from s using a policy derived from Q,
            e.g. an ε-greedy selection scheme.
        Repeat (for each step of the episode):
            Take action a, observe r, s'.
            Choose a' from s' using a policy derived from Q,
                e.g. an ε-greedy selection scheme.
            Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') - Q(s, a)]
            s ← s'; a ← a'
        until s is terminal.

Figure 2: The SARSA algorithm.

2.3 Learning Parameters

SARSA employs three learning parameters: a learning rate α, a discount factor γ, and a greediness parameter ε. Constant learning rates cannot be used because we have to deal with a non-deterministic environment. For the considered problem, a learning rate of ...

In the RL-ES, the lower bound for the step size is set to ... while the upper bound is set to ..., based on our experience about useful step size limits. If the limits are exceeded, the reward is set to -1. From our experiments, defining proper limits turned out to be a crucial factor. In the following subsections, we discuss the results for different ways to define a reward measure.

    Problem   Iterations
    Sph1D     ...
    Sph20D    ...
    Ros2D     ...

Table 1: Number of iterations until convergence is reached for the (1+1)-ES. Results are averaged over 30 runs, and the convergence rate is 1.0 for all problems.

    Problem   RL-ES (Method 1)            RL-ES (Method 2)
              (conv. rate; iterations)    (conv. rate; iterations)
    Sph1D     ...                         (1.00; ...)
    Sph5D     ...                         (0.98; ...)
    Sph20D    ...                         (0.92; ...)
    Sph80D    ...                         (1.00; ...)
    Ros2D     ...                         (0.99; ...)
    Ros5D     ...                         (1.00; ...)

Table 2: Number of iterations until convergence is reached for two methods of the (1+1)-RL-ES. Method 1 (described in Section 4.1) denotes the original reward definition in terms of a success rate increase or decrease, while method 2 (described in Section 4.2) uses as a reward measure the difference in function values between the current and the last reward computation. Results are averaged over 1000 runs for the sphere and over 30 runs for Rosenbrock's function.

4.1 Method 1

In a first approach, called method 1, the reward from the environment is defined to be either +1 if the success rate has increased, 0 if the success rate did not change, or -1 if the success rate decreased.

For the sphere, this RL-ES method converges about 3 to 7 times slower than the (1+1)-ES. For Rosenbrock's function, the RL-ES needs an iteration number about one order of magnitude worse than the (1+1)-ES. Note that for both functions the RL-ES is extremely non-robust: convergence rates as low as 50% are unacceptable.
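To make the coupling between the learner and the (1+1)-ES concrete, here is a hedged sketch in which the state is the discretized success rate of the last observation window, the actions multiply the step size by fixed factors, and the reward follows method 1 (+1/0/-1 for an increased/unchanged/decreased success rate, -1 when the step-size limits are exceeded). The window length, the discretization, the multiplicative factors, and the learning constants are our illustrative assumptions, not the paper's settings:

```python
import random

def sphere(x):
    """Toy objective: the sphere function, minimized at the origin."""
    return sum(xi * xi for xi in x)

def rl_es_sketch(dim=5, generations=3000, window=10,
                 alpha=0.3, gamma=0.9, epsilon=0.1,
                 sigma_min=1e-12, sigma_max=1e3, seed=0):
    """(1+1)-ES whose step size sigma is adjusted by a SARSA-style learner.
    State  = success rate of the last window, discretized to 0..10.
    Action = multiply sigma by 0.8, 1.0, or 1.25 (decrease / keep / increase).
    Reward = method-1 style: +1, 0, -1 for an increased, unchanged, or
             decreased success rate; -1 if the step-size limits are exceeded.
    All constants are illustrative assumptions."""
    rng = random.Random(seed)
    x = [rng.uniform(-5.0, 5.0) for _ in range(dim)]
    fx, sigma = sphere(x), 1.0
    factors = (0.8, 1.0, 1.25)
    Q = {}
    q = lambda s: Q.setdefault(s, [0.0, 0.0, 0.0])

    def eps_greedy(s):
        if rng.random() < epsilon:
            return rng.randrange(3)
        return max(range(3), key=lambda a: q(s)[a])

    state = 0
    action = eps_greedy(state)
    sigma *= factors[action]
    prev_rate, successes, penalty = 0.0, 0, False

    for g in range(1, generations + 1):
        # One (1+1)-ES generation: isotropic Gaussian mutation, plus-selection.
        y = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fy = sphere(y)
        if fy <= fx:
            x, fx = y, fy
            successes += 1
        if g % window == 0:                       # end of an observation window
            rate = successes / window
            successes = 0
            reward = -1 if penalty else (rate > prev_rate) - (rate < prev_rate)
            next_state = int(round(rate * 10))
            next_action = eps_greedy(next_state)
            # SARSA(0) update on the success-rate states.
            q(state)[action] += alpha * (reward + gamma * q(next_state)[next_action]
                                         - q(state)[action])
            state, action, prev_rate = next_state, next_action, rate
            sigma *= factors[action]              # act on the step size
            penalty = not (sigma_min <= sigma <= sigma_max)
            sigma = min(max(sigma, sigma_min), sigma_max)
    return fx

print(rl_es_sketch())   # best sphere value reached; smaller is better
```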
From our investigation on the sphere function, the basic problem is that with the selected scheme no information is available if the success rate is zero. When the strategy is in a state of zero success, it often either oscillates between choosing the actions "increase" and "decrease", or it gets stuck by always choosing the action "keep". A no-success run usually happens if the Q values are initialized such that in the first phase of the optimization the step size is increased. This causes a zero success rate and often yields one of the two behaviors described above.

Another interesting feature is the Q table at the end of the optimization, together with the table of visit counts. From the 1/5 success rule, we would expect a value table as shown for the 1D case in Table 3. Note that "+" denotes the highest value per row.

    Success-rate state    Action: increase   Action: keep   Action: decrease
    0, 1                                                    +
    2                                        +
    3, 4, 5, 6, 7, 8, 9, 10    +

Table 3: Schema of the value table as it should look if the 1/5 success rule is learnt. The "+" denotes the highest value per row.

As it turns out, such a structure in the actual Q table at the end of an optimization using the RL-ES may be found, but it may as well be structured differently without much change in terms of convergence speed. One example of an actually obtained Q table (Table 4) and visit-count table (Table 5) is shown below for the optimization of a 1D sphere function that took 108 generations. The * symbol indicates that the state-action pair was not visited during the optimization. Note that for success rates between 0.4 and 1.0 no learning has taken place; there, the Q values are the initialized values. The highest Q values are found for the action "decrease" when the success rates are 0, 0.1, and 0.2. The action "increase" is assigned the highest value for a success rate of 0.3. Except for a success rate of 0.2, the obtained Q values match our expectations from the 1/5 success rule. The numbers of visits for each state-action pair also reflect that the correct actions (according to the 1/5 rule) have been selected in this particular case. Changing the initialization of the Q table from uniformly random numbers in the range [-1, 1] to zero values did not have any effect on the results.

    Success-rate state    Action: increase   Action: keep   Action: decrease
    0                     *                  -0.715020       0.390386
    1                      0.329887           0.017461      ...
    2                     -0.340735           *              0.451931
    3                      0.008306           *             -0.563195
    4, ..., 10            *                   *             ...

Table 4: Q values for a converged optimization using the RL-ES after 108 generations. The * symbol indicates that the state-action pair was not visited during the optimization.

[Table 5: Number of visits for each state-action pair in the same RL-ES run of 108 generations; the * symbol indicates a state-action pair that was not visited during the optimization.]

When we compare the reward assignment in method 1 with the 1/5 success rule, we observe that our definition is ill-posed. Recall that the 1/5 rule aims at an optimum success rate of 0.2. In contrast, the definition in method 1 assigns a positive reward whenever the success rate is increased, even if the success rate is already larger than 0.2. Despite this ill-posed reward assignment, the results for the sphere are not affected very much, because success rates larger than 0.2 are achieved less often than success rates below 0.2. However, for Rosenbrock's function the difference matters.

4.2 Method 2

In a second approach, we define as reward the difference between the current function value and the function value evaluated at the last reward computation (Eq. 2), where Δg is the difference in generations for which the reward is computed.
This reward assignment, referred to as method 2 in the following, is better than the initial reward computation in both convergence speed and convergence rate; values are given in Table 2. Although better than the original reward computation in terms of speed, this RL-ES remains slower than the (1+1)-ES by a factor of about 3. A factor of 3 seems reasonable given that the RL-ES has to learn which of the three actions to choose. The convergence rate is especially noteworthy: it lies in the range [0.92, 1.0], which is a great improvement over method 1.

Why is method 2 better than method 1 in terms of the convergence rate? One reason might be that the reward assignment in method 1 is ill-posed, as stated earlier. Another reason is that the second reward assignment is related to the definition of the progress rate, at least for the sphere function. The progress rate is defined [9] as

    φ = E[ ||x^(g) - x*|| - ||x^(g+1) - x*|| ].        (3)

For the sphere function, f(x) = Σ_i x_i², with the optimum at x* = 0, we have that ||x^(g) - x*|| = sqrt(f(x^(g))). For the sphere, the progress rate is therefore the difference between the square roots of function values, a result similar to the reward assignment in Eq. 2.

The results in Table 6 document the behavior of the strategy with the reward identified as the theoretical progress rate φ. The theoretical results agree well with method 2, which strengthens our assumption that method 2 works well because it indirectly incorporates information about the optimal parameter vector. In summary, assigning a good reward measure seems to be crucial to the performance of the RL-ES.

    Problem   (conv. rate; iterations)
    Sph1D     (1.00; ...)
    Sph5D     (0.98; ...)
    Sph20D    (0.90; ...)
    Sph80D    (0.98; ...)
    Ros2D     (1.00; ...)
    Ros5D     (1.00; ...)

Table 6: Number of iterations until convergence is reached for the (1+1)-RL-ES which employs the theoretical progress rate as reward, as described in Section 4.2. Results are averaged over 1000 runs for the sphere and over 30 runs for Rosenbrock's function.

4.3 Methods with Optimum-Independent Reward Assignments

As seen above, defining the reward as the theoretical progress rate is a good measure. However, the theoretical progress rate assumes knowledge about the optimum, which is usually unavailable. How can we formulate a suitable reward that approximates the theoretical progress rate using only values that can be measured while optimizing? We consider two possible forms, namely

Form 1: the sign of the difference between function values. It describes whether the realized step points into the half space in which the optimum lies.

Form 2: the realized step length. It is an approximation of the theoretical progress rate.
Method 3 employs only the first form (Eq. 4), while method 4 combines both forms (Eq. 5). Results of the two methods are summarized in Table 7.

    Problem   RL-ES (Method 3)            RL-ES (Method 4)
              (conv. rate; iterations)    (conv. rate; iterations)
    Sph1D     ...                         (1.00; ...)
    Sph5D     ...                         (0.99; ...)
    Sph20D    ...                         (0.94; ...)
    Sph80D    ...                         (1.00; ...)
    Ros2D     ...                         (1.00; ...)
    Ros5D     ...                         (1.00; ...)

Table 7: Number of iterations until convergence is reached for methods 3 and 4 of the (1+1)-RL-ES, as described in Section 4.3. Results are averaged over 1000 runs for the sphere and over 30 runs for Rosenbrock's function.

For the sphere problem, the convergence speeds of methods 3 and 4 are similar, and they are in the same range as with method 2. For Rosenbrock's function, method 3 is worse than methods 2 and 4, while methods 2 and 4 yield similar convergence speeds. The convergence rates of methods 2 and 4 are almost the same, while method 3 optimizes less reliably, especially for Rosenbrock's problem. In summary, method 4 seems to be better than method 3 and compares well with method 2, in which information about the optimum is contained. From these preliminary, problem-dependent results, we propose the fourth method as the reward assignment for the RL-ES.

4.4 Action-Selection Scheme

How does the choice of the action-selection parameter ε influence the convergence speed? The average number of iterations to reach the goal in the 1D sphere problem as a function of ε is shown in Table 8. The number of iterations is averaged over 1000 runs. For all ε values, the success rate is in the range [0.68, 0.72] for method 1 and [0.66, 1.0] for method 2. The optimum strategy parameters for the considered cases lie close to ε = ... for method 1 and ε = ... for method 2. However, these results are not conclusive.

    ε        Method 2: average number of iterations
    1.0      3066
    0.5       327
    0.1       349
    0.05      457
    0.01      677
    0.001     901

Table 8: Influence of ε on the average number of iterations for the 1D sphere function, measured over 1000 runs.

5 Summary and Conclusions

We propose an algorithm that combines elements of step size adaptation schemes in evolution strategies (ES) with reinforcement learning (RL). In particular, we tested a SARSA(0) learning algorithm on the 1/5 success rule in a (1+1)-ES: heuristics in the (1+1)-ES were reduced and replaced with a more general learning method. The results in terms of convergence speed and rate, measured on the sphere and Rosenbrock problems in several dimensions, suggest that the performance of the combined scheme (called RL-ES) depends strongly on the choice of the reward function. In particular, the RL-ES with a reward assignment based on a combination of the realized step length and the sign of the function-value difference yields the same convergence rate (100%) as the (1+1)-ES, and its convergence speed is lower than that of the (1+1) strategy by a factor of about 3 for both the sphere and Rosenbrock's function, a result that meets our expectations. Future work may answer the question whether the proposed reward computation generalizes to optimization problems not tested here.

Bibliography

[1] Rechenberg, I., "Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution," Frommann-Holzboog, Stuttgart, 1973.

[2] Rechenberg, I., "Evolutionsstrategie '94," Frommann-Holzboog, Stuttgart, 1994.

[3] Bäck, Th., "Evolutionary Algorithms in Theory and Practice," Oxford University Press, 1996.

[4] Hansen, N., Ostermeier, A., "Adapting Arbitrary Normal Mutation Distributions in Evolution Strategies: The Covariance Matrix Adaptation," Proceedings of the IEEE International Conference on Evolutionary Computation (ICEC '96), pp. 312-317, 1996.
[5] Hansen, N., Ostermeier, A., "Convergence Properties of Evolution Strategies with the Derandomized Covariance Matrix Adaptation: The (μ/μ_I, λ)-CMA-ES," Proceedings of the 5th European Congress on Intelligent Techniques and Soft Computing (EUFIT '97), pp. 650-654, 1997.

[6] Mahadevan, S., Connell, J., "Automatic programming of behavior-based robots using reinforcement learning," Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991.

[7] Tesauro, G., "Temporal difference learning and TD-Gammon," Communications of the ACM, 38(3), pp. 58-68, 1995.

[8] Sutton, R. S., Barto, A. G., "Reinforcement Learning: An Introduction," MIT Press, Cambridge, 1998.

[9] Schwefel, H.-P., "Evolution and Optimum Seeking," John Wiley and Sons, New York, 1995.

[10] Singh, S., Jaakkola, T., Littman, M. L., Szepesvári, C., "Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms," Machine Learning, 1999.

[11] Mitchell, T. M., "Machine Learning," McGraw-Hill, 1997.