Playing Atari with Deep Reinforcement Learning
- Format: pptx
- Size: 1.13 MB
- Pages: 30
A Literature Review of Artificial Intelligence

Artificial Intelligence (AI) refers to the techniques and methods that simulate, extend, and expand human intelligence. AI has already spread into many fields, such as healthcare, finance, and transportation. This review surveys several important papers in the field in order to outline the current research progress and hot topics in artificial intelligence.
1. "Deep Residual Learning for Image Recognition" (Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016). This paper introduced the deep residual network (ResNet), which uses residual learning to overcome the degradation problem that appears when networks become very deep. It achieved state-of-the-art results on the ImageNet dataset at the time and was a major contribution to the development of deep learning.
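The central idea is that each block learns a residual mapping F(x) and adds it back to its input through a skip connection, so very deep stacks of blocks remain easy to optimise. A minimal sketch of such a block (our own simplified illustration; the published architecture also uses batch normalisation and downsampling variants):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.conv2(self.relu(self.conv1(x)))  # the learned residual F(x)
        return self.relu(residual + x)                   # skip connection adds the input back
```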
2. "Playing Atari with Deep Reinforcement Learning" (Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller, 2013). This paper proposed a deep reinforcement learning method that applies deep neural networks to training agents to play Atari games. Taking the screen image as input, it learns game-playing policies directly from raw pixels and achieved better results than all previous methods. It is the pioneering work on deep reinforcement learning in the games domain.
3. "Generative Adversarial Networks" (Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, 2014). This paper proposed a new class of generative model, the generative adversarial network (GAN).
Research on Path Planning Based on the DDPG Algorithm
ZHANG Yi, GUO Kun (School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266520, China)
Abstract: Path planning is a classic problem in artificial intelligence, with wide applications in national defense, road traffic, robot simulation, and many other fields. However, most existing path planning algorithms assume a single fixed environment and a discrete action space, and require hand-built models. Reinforcement learning is a machine learning approach in which an agent learns by interacting with its environment, without manually provided training data, and deep reinforcement learning has further strengthened its ability to solve real-world problems. This paper applies the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient) to path planning, and accomplishes path planning in continuous action spaces and complex environments.
Keywords: path planning; deep reinforcement learning; DDPG; Actor-Critic; continuous action space
Traditional approaches include Dijkstra's algorithm [1], the A* algorithm [2], and the artificial potential field method [3].
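Unlike these classical planners, DDPG learns a deterministic policy over a continuous action space with an actor-critic pair: the actor maps a state to a continuous action and the critic scores state-action pairs. A minimal sketch of that structure (our own illustration under assumed state and action dimensions, not code from the cited paper):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a continuous action, squashed into [-max_action, max_action]."""
    def __init__(self, state_dim: int, action_dim: int, max_action: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Scores a (state, action) pair with a single Q-value."""
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=1))
```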
A classic reinforcement learning algorithm: DQN

DQN (Deep Q-Network) is a classic reinforcement learning algorithm proposed by DeepMind. It was first introduced in the 2013 paper "Playing Atari with Deep Reinforcement Learning" and further refined in the 2015 paper "Human-level control through deep reinforcement learning".
The core idea of DQN is to combine a deep neural network with Q-learning in order to estimate the value function in reinforcement learning. Traditional Q-learning stores the Q-value of every state-action pair in a table, which becomes infeasible as the state and action spaces grow. DQN replaces this table with a deep neural network, which makes it possible to handle high-dimensional state spaces. The network consists of an input layer, hidden layers, and an output layer: the input layer receives the state of the environment, the hidden layers extract features, and the output layer produces a Q-value for every action.
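As an illustration, such a Q-network maps one state to a vector of Q-values, one per action, in a single forward pass. The sketch below uses a small fully connected network with made-up dimensions, not the convolutional architecture from the original paper:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Replaces the tabular Q(s, a) lookup with a parametric function of the state."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)

q_net = QNetwork(state_dim=4, num_actions=2)
q_values = q_net(torch.randn(1, 4))     # Q-values for every action in one pass
greedy_action = q_values.argmax(dim=1)  # index of the action with the largest Q-value
```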
DQN is trained with experience replay: every transition collected while interacting with the environment is stored in a replay buffer, and minibatches of transitions are then sampled at random from this buffer for training. This reduces the correlation between consecutive samples and improves training efficiency and stability.
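A replay buffer can be as simple as a fixed-size queue that is sampled uniformly at random; the sketch below is illustrative (names and capacity are ours, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions, sampled uniformly at random."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)  # random minibatch breaks correlations
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```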
The training procedure also introduces a target network to address the overestimation problem of Q-learning. The target network has the same architecture as the original network, but its parameters are held fixed and only refreshed at a set interval. During training, the target network is used to compute the target Q-values, which are compared with the original network's predictions, and the original network's parameters are updated by minimising the error between the two. This makes training more stable and speeds up convergence.
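A hedged sketch of this update: the temporal-difference targets are computed with the frozen target network, a squared error is minimised (some implementations use a Huber loss instead), and the target network is periodically synchronised with the online network. Function names and the loss choice are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Squared TD error; `actions` is int64, `rewards`/`dones` are float tensors."""
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) from the online net
    with torch.no_grad():  # targets come from the frozen target network
        next_q = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_pred, q_target)

def sync_target(q_net, target_net):
    """Copy the online weights into the target network every C training steps."""
    target_net.load_state_dict(q_net.state_dict())
```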
Besides the target network, DQN relies on another important technique: the ε-greedy policy. Under ε-greedy, the agent selects the action with the highest current Q-value with probability 1−ε and a random action with probability ε. This preserves a degree of exploration and helps the agent avoid getting stuck in locally optimal behaviour.
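A minimal ε-greedy action selection, written against the hypothetical Q-network interface above (the helper name is ours):

```python
import random
import torch

def select_action(q_net, state: torch.Tensor, num_actions: int, epsilon: float) -> int:
    """With probability epsilon explore; otherwise act greedily w.r.t. the Q-network."""
    if random.random() < epsilon:
        return random.randrange(num_actions)      # explore: uniformly random action
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))      # exploit: greedy action
        return int(q_values.argmax(dim=1).item())
```

In practice ε is usually annealed from a large value towards a small one as training progresses, so the agent explores heavily at first and exploits more later.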
DQN has achieved remarkable results on many classic reinforcement learning problems.
Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
DeepMind Technologies
{vlad,koray,david,alex.graves,ioannis,daan,martin.riedmiller}@

Abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

1 Introduction

Learning to control agents directly from high-dimensional sensory inputs like vision and speech is one of the long-standing challenges of reinforcement learning (RL). Most successful RL applications that operate on these domains have relied on hand-crafted features combined with linear value functions or policy representations. Clearly, the performance of such systems heavily relies on the quality of the feature representation.

Recent advances in deep learning have made it possible to extract high-level features from raw sensory data, leading to breakthroughs in computer vision [11, 22, 16] and speech recognition [6, 7]. These methods utilise a range of neural network architectures, including convolutional networks, multilayer perceptrons, restricted Boltzmann machines and recurrent neural networks, and have exploited both supervised and unsupervised learning. It seems natural to ask whether similar techniques could also be beneficial for RL with sensory data. However reinforcement learning presents several challenges from a deep learning perspective.
Firstly,most successful deep learning applications to date have required large amounts of hand-labelled training data.RL algorithms,on the other hand,must be able to learn from a scalar reward signal that is frequently sparse,noisy and delayed.The delay between actions and resulting rewards, which can be thousands of timesteps long,seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning.Another issue is that most deep learning algorithms assume the data samples to be independent,while in reinforcement learning one typically encounters sequences of highly correlated states.Furthermore,in RL the data distribu-tion changes as the algorithm learns new behaviours,which can be problematic for deep learning methods that assume afixed underlying distribution.This paper demonstrates that a convolutional neural network can overcome these challenges to learn successful control policies from raw video data in complex RL environments.The network is trained with a variant of the Q-learning[26]algorithm,with stochastic gradient descent to updatethe weights.To alleviate the problems of correlated data and non-stationary distributions,we useFigure 1:Screen shots from five Atari 2600Games:(Left-to-right )Pong,Breakout,Space Invaders,Seaquest,Beam Rideran experience replay mechanism [13]which randomly samples previous transitions,and thereby smooths the training distribution over many past behaviors.We apply our approach to a range of Atari 2600games implemented in The Arcade Learning Envi-ronment (ALE)[3].Atari 2600is a challenging RL testbed that presents agents with a high dimen-sional visual input (210×160RGB video at 60Hz)and a diverse and interesting set of tasks that were designed to be difficult for humans players.Our goal is to create a single neural network agent that is able to successfully learn to play as many of the games as possible.The network was not pro-vided with any game-specific information or hand-designed visual features,and was not privy to the internal state of the emulator;it learned from nothing but the video input,the reward and terminal signals,and the set of possible actions—just as a human player would.Furthermore the network ar-chitecture and all hyperparameters used for training were kept constant across the games.So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them.Figure 1provides sample screenshots from five of the games used for training.2BackgroundWe consider tasks in which an agent interacts with an environment E ,in this case the Atari emulator,in a sequence of actions,observations and rewards.At each time-step the agent selects an action a t from the set of legal game actions,A ={1,...,K }.The action is passed to the emulator and modifies its internal state and the game score.In general E may be stochastic.The emulator’s internal state is not observed by the agent;instead it observes an image x t ∈R d from the emulator,which is a vector of raw pixel values representing the current screen.In addition it receives a reward r t representing the change in game score.Note that in general the game score may depend on the whole prior sequence of actions and observations;feedback about an action may only be received after many thousands of time-steps have elapsed.Since the agent only observes images of the current screen,the task is partially observed and many emulator states are perceptually 
aliased,i.e.it is impossible to fully understand the current situation from only the current screen x t .We therefore consider sequences of actions and observations,s t =x 1,a 1,x 2,...,a t −1,x t ,and learn game strategies that depend upon these sequences.All sequences in the emulator are assumed to terminate in a finite number of time-steps.This formalism gives rise to a large but finite Markov decision process (MDP)in which each sequence is a distinct state.As a result,we can apply standard reinforcement learning methods for MDPs,simply by using the complete sequence s t as the state representation at time t .The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards.We make the standard assumption that future rewards are discounted by a factor of γper time-step,and define the future discounted return at time t as R t = T t =t γt −t r t ,where T is the time-step at which the game terminates.We define the optimal action-value function Q ∗(s,a )as the maximum expected return achievable by following any strategy,after seeing some sequence s and then taking some action a ,Q ∗(s,a )=max πE [R t |s t =s,a t =a,π],where πis a policy mapping sequences to actions (or distributions over actions).The optimal action-value function obeys an important identity known as the Bellman equation .This is based on the following intuition:if the optimal value Q ∗(s ,a )of the sequence s at the next time-step was known for all possible actions a ,then the optimal strategy is to select the action amaximising the expected value of r+γQ∗(s ,a ),Q∗(s,a)=E s ∼Er+γmaxaQ∗(s ,a )s,a(1)The basic idea behind many reinforcement learning algorithms is to estimate the action-value function,by using the Bellman equation as an iterative update,Q i+1(s,a)=E[r+γmax a Q i(s ,a )|s,a].Such value iteration algorithms converge to the optimal action-value function,Q i→Q∗as i→∞[23].In practice,this basic approach is totally impractical,because the action-value function is estimated separately for each sequence,without any generali-sation.Instead,it is common to use a function approximator to estimate the action-value function,Q(s,a;θ)≈Q∗(s,a).In the reinforcement learning community this is typically a linear function approximator,but sometimes a non-linear function approximator is used instead,such as a neural network.We refer to a neural network function approximator with weightsθas a Q-network.A Q-network can be trained by minimising a sequence of loss functions L i(θi)that changes at eachiteration i,L i(θi)=E s,a∼ρ(·)(y i−Q(s,a;θi))2,(2)where y i=E s ∼E[r+γmax a Q(s ,a ;θi−1)|s,a]is the target for iteration i andρ(s,a)is a probability distribution over sequences s and actions a that we refer to as the behaviour distribution. 
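Because the inline equations above were garbled during text extraction, here is a restatement of the Bellman optimality equation (1), the loss (2) with its target y_i, and the gradient used in Equation (3) that follows, in clean LaTeX matching the surrounding definitions:

```latex
% (1) Bellman optimality equation
Q^{*}(s,a) = \mathbb{E}_{s' \sim \mathcal{E}}\!\left[\, r + \gamma \max_{a'} Q^{*}(s',a') \;\middle|\; s,a \right]

% (2) Q-learning loss at iteration i, with target y_i
L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\!\left[ \bigl( y_i - Q(s,a;\theta_i) \bigr)^{2} \right],
\qquad
y_i = \mathbb{E}_{s' \sim \mathcal{E}}\!\left[\, r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) \;\middle|\; s,a \right]

% (3) Gradient of the loss
\nabla_{\theta_i} L_i(\theta_i) =
\mathbb{E}_{s,a \sim \rho(\cdot);\, s' \sim \mathcal{E}}\!\left[
\Bigl( r + \gamma \max_{a'} Q(s',a';\theta_{i-1}) - Q(s,a;\theta_i) \Bigr)
\nabla_{\theta_i} Q(s,a;\theta_i) \right]
```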
The parameters from the previous iterationθi−1are heldfixed when optimising the loss function L i(θi).Note that the targets depend on the network weights;this is in contrast with the targets used for supervised learning,which arefixed before learning begins.Differentiating the loss function with respect to the weights we arrive at the following gradient,∇θi L i(θi)=E s,a∼ρ(·);s ∼Er+γmaxaQ(s ,a ;θi−1)−Q(s,a;θi)∇θiQ(s,a;θi).(3)Rather than computing the full expectations in the above gradient,it is often computationally expe-dient to optimise the loss function by stochastic gradient descent.If the weights are updated after every time-step,and the expectations are replaced by single samples from the behaviour distribution ρand the emulator E respectively,then we arrive at the familiar Q-learning algorithm[26].Note that this algorithm is model-free:it solves the reinforcement learning task directly using sam-ples from the emulator E,without explicitly constructing an estimate of E.It is also off-policy:it learns about the greedy strategy a=max a Q(s,a;θ),while following a behaviour distribution that ensures adequate exploration of the state space.In practice,the behaviour distribution is often se-lected by an -greedy strategy that follows the greedy strategy with probability1− and selects a random action with probability .3Related WorkPerhaps the best-known success story of reinforcement learning is TD-gammon,a backgammon-playing program which learnt entirely by reinforcement learning and self-play,and achieved a super-human level of play[24].TD-gammon used a model-free reinforcement learning algorithm similar to Q-learning,and approximated the value function using a multi-layer perceptron with one hidden layer1.However,early attempts to follow up on TD-gammon,including applications of the same method to chess,Go and checkers were less successful.This led to a widespread belief that the TD-gammon approach was a special case that only worked in backgammon,perhaps because the stochasticity in the dice rolls helps explore the state space and also makes the value function particularly smooth [19].Furthermore,it was shown that combining model-free reinforcement learning algorithms such as Q-learning with non-linear function approximators[25],or indeed with off-policy learning[1]could cause the Q-network to diverge.Subsequently,the majority of work in reinforcement learning fo-cused on linear function approximators with better convergence guarantees[25].1In fact TD-Gammon approximated the state value function V(s)rather than the action-value function Q(s,a),and learnt on-policy directly from the self-play gamesMore recently,there has been a revival of interest in combining deep learning with reinforcement learning.Deep neural networks have been used to estimate the environment E;restricted Boltzmann machines have been used to estimate the value function[21];or the policy[9].In addition,the divergence issues with Q-learning have been partially addressed by gradient temporal-difference methods.These methods are proven to converge when evaluating afixed policy with a nonlinear function approximator[14];or when learning a control policy with linear function approximation using a restricted variant of Q-learning[15].However,these methods have not yet been extended to nonlinear control.Perhaps the most similar prior work to our own approach is neuralfitted Q-learning(NFQ)[20]. 
NFQ optimises the sequence of loss functions in Equation2,using the RPROP algorithm to update the parameters of the Q-network.However,it uses a batch update that has a computational cost per iteration that is proportional to the size of the data set,whereas we consider stochastic gradient updates that have a low constant cost per iteration and scale to large data-sets.NFQ has also been successfully applied to simple real-world control tasks using purely visual input,byfirst using deep autoencoders to learn a low dimensional representation of the task,and then applying NFQ to this representation[12].In contrast our approach applies reinforcement learning end-to-end,directly from the visual inputs;as a result it may learn features that are directly relevant to discriminating action-values.Q-learning has also previously been combined with experience replay and a simple neural network[13],but again starting with a low-dimensional state rather than raw visual inputs. The use of the Atari2600emulator as a reinforcement learning platform was introduced by[3],who applied standard reinforcement learning algorithms with linear function approximation and generic visual features.Subsequently,results were improved by using a larger number of features,and using tug-of-war hashing to randomly project the features into a lower-dimensional space[2].The HyperNEAT evolutionary architecture[8]has also been applied to the Atari platform,where it was used to evolve(separately,for each distinct game)a neural network representing a strategy for that game.When trained repeatedly against deterministic sequences using the emulator’s reset facility, these strategies were able to exploit designflaws in several Atari games.4Deep Reinforcement LearningRecent breakthroughs in computer vision and speech recognition have relied on efficiently training deep neural networks on very large training sets.The most successful approaches are trained directly from the raw inputs,using lightweight updates based on stochastic gradient descent.By feeding sufficient data into deep neural networks,it is often possible to learn better representations than handcrafted features[11].These successes motivate our approach to reinforcement learning.Our goal is to connect a reinforcement learning algorithm to a deep neural network which operates directly on RGB images and efficiently process training data by using stochastic gradient updates. 
Tesauro’s TD-Gammon architecture provides a starting point for such an approach.This architec-ture updates the parameters of a network that estimates the value function,directly from on-policy samples of experience,s t,a t,r t,s t+1,a t+1,drawn from the algorithm’s interactions with the envi-ronment(or by self-play,in the case of backgammon).Since this approach was able to outperform the best human backgammon players20years ago,it is natural to wonder whether two decades of hardware improvements,coupled with modern deep neural network architectures and scalable RL algorithms might produce significant progress.In contrast to TD-Gammon and similar online approaches,we utilize a technique known as expe-rience replay[13]where we store the agent’s experiences at each time-step,e t=(s t,a t,r t,s t+1) in a data-set D=e1,...,e N,pooled over many episodes into a replay memory.During the inner loop of the algorithm,we apply Q-learning updates,or minibatch updates,to samples of experience, e∼D,drawn at random from the pool of stored samples.After performing experience replay, the agent selects and executes an action according to an -greedy policy.Since using histories of arbitrary length as inputs to a neural network can be difficult,our Q-function instead works onfixed length representation of histories produced by a functionφ.The full algorithm,which we call deep Q-learning,is presented in Algorithm1.This approach has several advantages over standard online Q-learning[23].First,each step of experience is potentially used in many weight updates,which allows for greater data efficiency.Algorithm 1Deep Q-learning with Experience ReplayInitialize replay memory D to capacity NInitialize action-value function Q with random weightsfor episode =1,M doInitialise sequence s 1={x 1}and preprocessed sequenced φ1=φ(s 1)for t =1,T doWith probability select a random action a t otherwise select a t =max a Q ∗(φ(s t ),a ;θ)Execute action a t in emulator and observe reward r t and image x t +1Set s t +1=s t ,a t ,x t +1and preprocess φt +1=φ(s t +1)Store transition (φt ,a t ,r t ,φt +1)in DSample random minibatch of transitions (φj ,a j ,r j ,φj +1)from D Set y j = r j for terminal φj +1r j +γmax a Q (φj +1,a ;θ)for non-terminal φj +1Perform a gradient descent step on (y j −Q (φj ,a j ;θ))2according to equation 3end for end forSecond,learning directly from consecutive samples is inefficient,due to the strong correlations between the samples;randomizing the samples breaks these correlations and therefore reduces the variance of the updates.Third,when learning on-policy the current parameters determine the next data sample that the parameters are trained on.For example,if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side;if the maximizing action then switches to the right then the training distribution will also switch.It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum,or even diverge catastrophically [25].By using experience replay the behavior distribution is averaged over many of its previous states,smoothing out learning and avoiding oscillations or divergence in the parameters.Note that when learning by experience replay,it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample),which motivates the choice of Q-learning.In practice,our algorithm only stores the last N experience tuples in the replay memory,and samples 
uniformly at random from D when performing updates.This approach is in some respects limited since the memory buffer does not differentiate important transitions and always overwrites with recent transitions due to the finite memory size N .Similarly,the uniform sampling gives equal importance to all transitions in the replay memory.A more sophisticated sampling strategy might emphasize transitions from which we can learn the most,similar to prioritized sweeping [17].4.1Preprocessing and Model ArchitectureWorking directly with raw Atari frames,which are 210×160pixel images with a 128color palette,can be computationally demanding,so we apply a basic preprocessing step aimed at reducing the input dimensionality.The raw frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a 110×84image.The final input representation is obtained by cropping an 84×84region of the image that roughly captures the playing area.The final cropping stage is only required because we use the GPU implementation of 2D convolutions from [11],which expects square inputs.For the experiments in this paper,the function φfrom algorithm 1applies this preprocessing to the last 4frames of a history and stacks them to produce the input to the Q -function.There are several possible ways of parameterizing Q using a neural network.Since Q maps history-action pairs to scalar estimates of their Q-value,the history and the action have been used as inputs to the neural network by some previous approaches [20,12].The main drawback of this type of architecture is that a separate forward pass is required to compute the Q-value of each action,resulting in a cost that scales linearly with the number of actions.We instead use an architecture in which there is a separate output unit for each possible action,and only the state representation is an input to the neural network.The outputs correspond to the predicted Q-values of the individual action for the input state.The main advantage of this type of architecture is the ability to compute Q-values for all possible actions in a given state with only a single forward pass through the network.We now describe the exact architecture used for all seven Atari games.The input to the neural network consists is an84×84×4image produced byφ.Thefirst hidden layer convolves168×8filters with stride4with the input image and applies a rectifier nonlinearity[10,18].The second hidden layer convolves324×4filters with stride2,again followed by a rectifier nonlinearity.The final hidden layer is fully-connected and consists of256rectifier units.The output layer is a fully-connected linear layer with a single output for each valid action.The number of valid actions varied between4and18on the games we considered.We refer to convolutional networks trained with our approach as Deep Q-Networks(DQN).5ExperimentsSo far,we have performed experiments on seven popular ATARI games–Beam Rider,Breakout, Enduro,Pong,Q*bert,Seaquest,Space Invaders.We use the same network architecture,learning algorithm and hyperparameters settings across all seven games,showing that our approach is robust enough to work on a variety of games without incorporating game-specific information.While we evaluated our agents on the real and unmodified games,we made one change to the reward structure of the games during training only.Since the scale of scores varies greatly from game to game,we fixed all positive rewards to be1and all negative rewards to be−1,leaving0rewards unchanged. 
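Stepping back to the architecture described in Section 4.1 above (16 8×8 filters with stride 4, then 32 4×4 filters with stride 2, a 256-unit fully connected layer, and one linear output per valid action), it can be written out as the following sketch. The preprocessing to an 84×84×4 stack is assumed to have been done by φ, and the pixel rescaling is our own addition for numerical convenience, not something stated in the paper:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network over a stack of 4 preprocessed 84x84 frames."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 filters of 8x8, stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 filters of 4x4, stride 2
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),   # 84x84 input -> 20x20 -> 9x9 feature maps
            nn.ReLU(),
            nn.Linear(256, num_actions),  # one linear output per valid action (4 to 18)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x / 255.0))  # rescale raw pixel values to [0, 1]

q_values = DQN(num_actions=18)(torch.zeros(1, 4, 84, 84))  # shape (1, 18)
```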
Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games.At the same time,it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.In these experiments,we used the RMSProp algorithm with minibatches of size32.The behavior policy during training was -greedy with annealed linearly from1to0.1over thefirst million frames,andfixed at0.1thereafter.We trained for a total of10million frames and used a replay memory of one million most recent frames.Following previous approaches to playing Atari games,we also use a simple frame-skipping tech-nique[3].More precisely,the agent sees and selects actions on every k th frame instead of every frame,and its last action is repeated on skipped frames.Since running the emulator forward for one step requires much less computation than having the agent select an action,this technique allows the agent to play roughly k times more games without significantly increasing the runtime.We use k=4for all games except Space Invaders where we noticed that using k=4makes the lasers invisible because of the period at which they blink.We used k=3to make the lasers visible and this change was the only difference in hyperparameter values between any of the games.5.1Training and StabilityIn supervised learning,one can easily track the performance of a model during training by evaluating it on the training and validation sets.In reinforcement learning,however,accurately evaluating the progress of an agent during training can be challenging.Since our evaluation metric,as suggested by[3],is the total reward the agent collects in an episode or game averaged over a number of games,we periodically compute it during training.The average total reward metric tends to be very noisy because small changes to the weights of a policy can lead to large changes in the distribution of states the policy visits.The leftmost two plots infigure2show how the average total reward evolves during training on the games Seaquest and Breakout.Both averaged reward plots are indeed quite noisy,giving one the impression that the learning algorithm is not making steady progress.Another, more stable,metric is the policy’s estimated action-value function Q,which provides an estimate of how much discounted reward the agent can obtain by following its policy from any given state.We collect afixed set of states by running a random policy before training starts and track the average of the maximum2predicted Q for these states.The two rightmost plots infigure2show that average predicted Q increases much more smoothly than the average total reward obtained by the agent and plotting the same metrics on the otherfive games produces similarly smooth curves.In addition to seeing relatively smooth improvement to predicted Q during training we did not experience any divergence issues in any of our experiments.This suggests that,despite lacking any theoretical convergence guarantees,our method is able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner.2The maximum for each state is taken over the possible actions.A v e r a g e R e w a r d p e r E p i s o d eTraining EpochsA v e r a g e R e w a r d p e r E p i s o d e Training EpochsA v e r a g e A c t i o n V a l u e (Q )Training EpochsA v e r a g e A c t i o n V a l u e (Q )Training EpochsFigure 2:The two plots on the left show average reward per episode on 
Breakout and Seaquest respectively during training.The statistics were computed by running an -greedy policy with =0.05for 10000steps.The two plots on the right show the average maximum predicted action-value of a held out set of states on Breakout and Seaquest respectively.One epoch corresponds to 50000minibatch weight updates or roughly 30minutes of training time.Figure 3:The leftmost plot shows the predicted value function for a 30frame segment of the game Seaquest.The three screenshots correspond to the frames labeled by A,B,and C respectively.5.2Visualizing the Value Function Figure 3shows a visualization of the learned value function on the game Seaquest.The figure shows that the predicted value jumps after an enemy appears on the left of the screen (point A).The agent then fires a torpedo at the enemy and the predicted value peaks as the torpedo is about to hit the enemy (point B).Finally,the value falls to roughly its original value after the enemy disappears (point C).Figure 3demonstrates that our method is able to learn how the value function evolves for a reasonably complex sequence of events.5.3Main EvaluationWe compare our results with the best performing methods from the RL literature [3,4].The method labeled Sarsa used the Sarsa algorithm to learn linear policies on several different feature sets hand-engineered for the Atari task and we report the score for the best performing feature set [3].Con-tingency used the same basic approach as Sarsa but augmented the feature sets with a learned representation of the parts of the screen that are under the agent’s control [4].Note that both of these methods incorporate significant prior knowledge about the visual problem by using background sub-traction and treating each of the 128colors as a separate channel.Since many of the Atari games use one distinct color for each type of object,treating each color as a separate channel can be similar to producing a separate binary map encoding the presence of each object type.In contrast,our agents only receive the raw RGB screenshots as input and must learn to detect objects on their own.In addition to the learned agents,we also report scores for an expert human game player and a policy that selects actions uniformly at random.The human performance is the median reward achieved after around two hours of playing each game.Note that our reported human scores are much higher than the ones in Bellemare et al.[3].For the learned methods,we follow the evaluation strategy used in Bellemare et al.[3,5]and report the average score obtained by running an -greedy policy with =0.05for a fixed number of steps.The first five rows of table 1show the per-game average scores on all games.Our approach (labeled DQN)outperforms the other learning methods by a substantial margin on all seven games despite incorporating almost no prior knowledge about the inputs.We also include a comparison to the evolutionary policy search approach from [8]in the last three rows of table 1.We report two sets of results for this method.The HNeat Best score reflects the results obtained by using a hand-engineered object detector algorithm that outputs the locations and。
Playing Atari with Deep Reinforcement Learning

Deep reinforcement learning has reinvigorated interest in the classic Atari games. Atari games were first released in the 1970s and 1980s and provide a challenging domain for reinforcement learning models. They are popular among researchers because they are simple to interface with yet complex to master.

Deep reinforcement learning combines reinforcement learning with deep neural networks. Reinforcement learning is the process of training an agent by rewarding desired behavior and punishing undesired behavior. Deep neural networks are capable of learning complex non-linear relationships from input data and can be used to predict actions and values in a game.

Deep reinforcement learning has shown excellent results in playing Atari games. One of the best-known examples is the Deep Q-Network (DQN) algorithm. DQN uses a neural network to evaluate the state of the game and outputs the best possible action for the agent to take in that state. The algorithm follows a largely greedy strategy, in which the agent performs the action with the highest predicted Q-value; the Q-value is an estimate of the expected future reward.

In addition to DQN, other deep reinforcement learning algorithms have been used for playing Atari games, such as Double DQN, Dueling DQN, and prioritized experience replay. Each of these builds on DQN and achieves better results by addressing a different weakness of the original algorithm.

One challenge in training reinforcement learning models for Atari games is that the agent has to learn from the visual information on the game screen. Typically, the input to a deep reinforcement learning model is the raw pixels of the game screen, so preprocessing techniques such as frame skipping, grayscale conversion, and image cropping are applied to reduce the dimensionality of the input and make training more efficient.

Another challenge is that many Atari games have sparse rewards: the agent may not receive any reward for many timesteps. To address this, techniques such as reward shaping and intrinsic motivation can be used to provide extra rewards for making progress toward the game's objective.

In conclusion, deep reinforcement learning has shown great promise in playing Atari games. The combination of reinforcement learning and deep neural networks enables models to learn complex non-linear relationships from visual input and to control actions well enough to achieve high game scores. As researchers continue to study the Atari domain, we can expect further advances in deep reinforcement learning techniques and better performance on these classic games.
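As one example of these refinements, Double DQN changes only how the bootstrap target is formed: the online network selects the argmax action and the target network evaluates it, which reduces the overestimation caused by taking a max over noisy value estimates. A hedged sketch (function and tensor names are illustrative):

```python
import torch

def double_dqn_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: action selection by the online net, evaluation by the target net."""
    with torch.no_grad():
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)        # select
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluate
        return rewards + gamma * (1.0 - dones) * next_q
```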
八年级心理健康英语阅读理解30题1<背景文章>Teenagers often face various sources of stress. One of the main stressors is academic pressure. With numerous assignments, tests, and exams, students feel constantly under pressure to perform well. Another source of stress is social pressure. Teenagers worry about fitting in with their peers, being popular, and having good relationships. They may also experience stress from family expectations. Parents may have high hopes for their children's academic achievements or future careers, which can add to the stress.1. What is one of the main stress sources for teenagers?A. Family outings.B. Academic pressure.C. Outdoor activities.D. Video games.答案:B。
解析:文章中明确提到One of the main stressors is academic pressure.,即青少年的主要压力来源之一是学业压力。
2. What do teenagers worry about in social pressure?A. Solving math problems.B. Reading books.C. Fitting in with peers.D. Watching movies.答案:C。
选修9 Unit 1 Breaking records-Reading"THE ROAD IS ALWAYS AHEAD OF YOU"Ashrita Furman is a sportsman who likes the challenge of breaking Guinness r ecords. Over the last 25 years, he hasbroken approximately 93 Guinness records. More than twenty of these he still holds, including the record for having the most r ecords. But these records are not made in any conventional sport like swimming or soccer. Rather Ashrita attempts to break records in very imaginative events and in very interesting places.Recently, Ashrita achieved his dream of breaking a record in all seven contine nts, including hula hooping in Australia, pogo stick jumping under water in South A merica, and performing deep knee bends in a hot air balloon in North America.While these activities might seem childish and cause laughter rather than respe ct, in reality they require an enormous amount of strength and fitness as well as d etermination.Think about the fine neck adjustments needed to keep a full bottle of milk on your head while you are walking. You can stop to rest or eat but the bottle has to stay on your head.While Ashrita makes standing on top of a 75 cm Swiss ball look easy, it is no t. It takes a lot of concentration and a great sense of balance to stay on it. You h ave to struggle to stay on top especially when your legs start shaking.And what about somersaulting along a road for 12 miles? Somersaulting is a t ough event as you have to overcome dizziness, extreme tiredness and pain. You are permitted to rest for only five minutes in every hour of rolling but you are allow ed to stop briefly to vomit.Covering a mile in the fastest time while doing gymnastically correct lunges is yet another event in which Ashrita is outstanding. Lunges are extremely hard on yo ur legs. You start by standing and then you step forward with the fight foot while t ouching the left knee to the ground. Then you stand up again and step forward wit h the left foot while touching the fight knee to the ground. Imagine doing this for a mile!Yet this talented sportsman is not a natural athlete. As a child he was very un fit and was not at all interested in sports. However, he was fascinated by the Guin ness Book of World Records.How Ashrita came to be a sportsman is an interesting story. As a teenager, h e began searching for a deeper meaning in life. He studied Eastern religions and, aged 16, discovered an Indian meditation teacher called Sri Chinmoy who lived in his neighbourhood in New York City. Since that time in the early 1970s, Ashrita ha s been one of Sri Chinmoy's students. Sri Chinmoy says that it is just as important for people to develop their bodies as it is to develop their minds, hearts and spirit ual selves. He believes that there is no limit to people's physical abilities.When Ashrita came third in a 24-hour bicycle marathon in New York's Central Park in 1978, he knew that he would one day get into the Guinness Book of Worl d Records. He had been urged by his spiritual leader to enter the marathon even t hough he had done no training. So, when he won third place, he came to the und erstanding that his body was just an instrument of the spirit and that he seemed to be able to use his spirit to accomplish anything. From then on, Ashrita refused t o accept any physical limitation.With this new confidence, Asharita broke his first Guinness record with 27,000 jumping jacks in 1979. The motivation to keep trying to break records comes throu gh his devotion to Sri Chinmoy. 
Every time Ashrita tries to break a record, he reac hes a point where he feels he cannot physically do any more. At that moment, he goes deep within himself and connects with his soul and his teacher.Ashrita always acknowledges his teacher in his record-breaking attempts.In fact, he often wears a T-shirt with Sri Chinmoy's words on the back. The words are: "There is only one perfect road. It is ahead of you, always ahead of you."FOCUS ON ...Lance ArmstrongDate of Birth: 8th September, 1971Country: USALance Armstrong's Guinness record for the fastest average speed at the Tour de France was set in 1999 with an average speed of 40.27 km/hr. In his teens he was a triathlete but at 16 he began to concentrate on cycling. He was an amateu r cyclist before the 1992 Olympic Games but turned professional after he had com peted in the Games. In the following few years, he won numerous titles, and by 1 996 he had become the world's number one. However, in October 1996, he discov ered he had cancer andhad to leave cycling. Successfully fighting his illness, Armstrong officially return ed to racing in 1998. In 1999 he won the Tour de France and in 2003 he achieve d his goal of winning five Tours de France.Michellie JonesDate of Birth: 9th June, 1969Country: AustraliaIn 1988 Michellie Jones helped establish the multi-sport event, the triathlon, in Australia. After completing her teaching qualifications in 1990, she concentrated on the triathlon. In 1991, she finished third at the world championships. In 1992 and 1 993, she was the International Triathlon Union World Champion. Since then, she h as never finished lower than fourth in any of the world championships she has co mpeted in. At the Sydney Olympics in 2000 she won the silver medal in the Wom en's Triathlon, the first time the event had been included in the Olympic Games. R ecently, for the first time in 15 years, Jones was not selected as part of the nation al team and therefore did not compete in the 2004 Olympics in Athens.Fu MingxiaDate of Birth: 16th August, 1978Country: ChinaFu Mingxia first stood on top of the 10-metre diving platform at the age of nin e. At 12 years old she won a Guinness Record when she became the youngest fe male to win the women's world title for platform diving at the World Championships in Australia in 1991. At the 1992 Barcelona Olympic Games, she took the gold medal in the women's 10-metre platform, becoming the youngest Olympic diving cha mpion of all time. This was followed by great success at the 1996 Atlanta Olympic Games where she won gold for both the 10-metre platform and the three-metre s pringboard. This made her the first woman in Olympic diving history to win three g old medals. She retired from diving after Atlanta and went to study economics at u niversity. While there she decided to make a comeback and went on to compete a t the Sydney Olympic Games, where she won her fourth Olympic gold, again maki ng Olympic history.Martin StrelDate of Birth: 1st October, 1954Country: SloveniaStrel was trained as a guitarist before he became a professional marathon swi mmer in 1978. He has a passion for swimming the world's great rivers. In 2000, h e was the first person ever to swim the entire length of the Danube River in Euro pe - a distance of 3,004 kilometres in 58 days. For this, he attained his first entry in the Guinness Book of World Records. Then in 2001 he broke the Guinness rec ord for non-stop swimming - 504.5 kilometres in the Danube River in 84 hours and 10 minutes. 
Martin won his third entry in the Guinness Book of World Records in 2002 when he beat his own record for long distance swimming by swimming the l ength of the Mississippi River in North America in 68 days, a total of 3,797 kilomet res. Then in 2003 he became the first man to have swum the whole 1,929 kilomet res of the difficult Parana River in South America.In 2004, Strel again broke his ow n Guinness record by swimming the length of the dangerous Changjiang River (4,6 00 km), the third longest fiver in the world.选修9 Unit 2 Sailing the oceans-ReadingSRILING THE OCERNSWe may well wonder how seamen explored the oceans before latitude and lon gitude made it possible to plot a ship's position on a map. The voyages of travelle rs before the 17th century show that they were not at the mercy of the sea even t hough they did not have modern navigational aids. So how did they navigate so w ell? Read these pages from an encyclopedia.Page 1:Using nature to help Keeping alongside the coastlineThis seems to have been the first and most useful form of exploration which c arried the minimum amount of risk.Using celestial bodiesNorth StarAt the North Pole the North Star is at its highest position in the sky, but at th e equator it is along the horizon. So accomplished navigators were able to use it t o plot their positions.SunOn a clear day especially during the summer the sailors could use the sun ov erhead at midday to navigate by. They can use the height of the sun to work out their latitude.CloudsSea captains observed the clouds over islands. There is a special cloud format ion which indicates there is land close by.Using wildlifeSeaweedSailors often saw seaweed in the sea and could tell by the colour and smell h ow long it had been them. If it was fresh and smelled strongly,then the ship was c lose to land.BirdsSea birds could be used to show the way to land when it was nowhere to be seen. In the evening nesting birds return to land and their nests. So seamen coul d follow the birds to land even if they were offshore and in the open sea.Using the weatherFogFog gathers at sea as well as over streams or rivers. Seamen used it to help identify the position of a stream or river when they were close to land.WindsWise seamen used the winds to direct their sailing. They could accelerate the speed, but they could also be dangerous. So the Vikings would observe the winds before and during their outward or return journeys.Using the seaCertain tides and currents could be used by skillful sailors to carry ships to th eir destination.These skills helped sailors explore the seas and discover new lands. They increased their ability to navigate new seas when they used instruments.Page 2:Using navigational instruments to helpFinding longitudeThere was no secure method of measuring longitude until the 17th century wh en the British solved this theoretical problem. Nobody knew that the earth moved westwards 15 degrees every hour, but sailors did know an approximate method of calculating longitude using speed and time. An early method of measuring speed i nvolved throwing a knotted rope tied to a log over the side of the ship. The rope was tied to a log which was then thrown into the sea. As the ship advanced throu gh the water the knots were counted as they passed through a seaman's hands. 
T he number of knots that were counted during a fixed period of time gave the spee d of the ship in nautical miles per hour.Later, when seamen began to use the compass in the 12th century they could calculate longitude using complicated mathematical tables. The compass has a sp ecial magnetic pointer which always indicates the North Pole, so it is used to help find the direction that the ship needs to go. In this way the ship could set a straig ht course even in the middle of the ocean.Finding latitudeThe Bearing CircleIt was the first instrument to measure the sun's position. A seaman would mea sure the sun's shadow and compare it with the height of the sun at midday. Then he could tell if he was sailing on his correct rather than a random course.A Bearing CircleThe AstrolabeThe astrolabe, quadrant and sextant are all connected. They are developments of one another. The earliest, the astrolabe, was a special all-in-one tool for telling the position of the ship in relation to the sun and various stars which covered the whole sky. This gave the seamen the local time and allowed them to find their lat itude at sea. However, it was awkward to use as one of the points of reference w as the moving ship itself.The QuadrantThis was a more precise and simplified version of the astrolabe. It measured h ow high stars were above the horizon using a quarter circle rather than the full cir cle of the astrolabe.It was easier to handle because it was more portable. Its short coming was that it still used the moving ship as one of the fixed points of referenc e. As the ship rose and plunged in the waves, it was extremely difficult to be accu rate with any reading.The sextantThe sextant was the updated version of the astrolabe and quadrant which redu ced the tendency to make mistakes. It proved to be the most accurate and reliableof these early navigational instruments. It works by measuring the angle between t wo fixed objects outside the ship using two mirrors. This made the calculations mo re precise and easier to do.THE GREATEST NAVIGATIONAL JOURNEY:A LESSON IN SURVIVALI am proud to have sailed with Captain Bligh on his journey of over 40 days t hrough about 4,000miles in an open boat across the Pacific Ocean in 1789. Our o utward voyage in the "Bounty" to Tahiti had been filled with the kind of incidents t hat I thought would be my stories when I returned home. But how wrong I was! O n our departure from Tahiti, some of the crew took over the ship.They deposited th e captain into a small boat to let him find his own way home. But who else was t o go with him? Those of us on board the "Bounty" were caught in a dilemma. Wa s it better to risk certain death by sitting close together on a small, crowded open boat with very little food and water? Or should one stay on the "Bounty" with the crew and face certain death from the British Navy if caught? The drawback of stay ing on the ship seemed to grow as I thought about how wrong it was to treat Cap tain Bligh in this way. So I joined him in the small boat. As dusk fell, we seemed to face an uncertain future. We had no charts and the only instruments the captain was allowed to take with him were a compass and a quadrant.Once we were at sea, our routine every day was the same. At sunrise and su nset the captain measured our position using the quadrant and set the course usin g the compass. It was extremely difficult for us to get a correct reading from the q uadrant as the boat moved constantly. The captain used a system called "dead rec koning". 
He knew there was land directly northwest of our original position. So his task was to make sure we stayed on that course. As you can see from the map we kept to a straight course pretty well. In addition, the captain kept us all busy reading the tables to work out our position. Although this took a great deal of time, i t didn't matter. Time was, after all, what we had a lot of!Our daily food was shared equally among us all: one piece of bread and one cup of water. It was starvation quantities but the extreme lack of water was the ha rdest to cope with psychologically. Imagine all that water around you, but none of i t was safe to drink because the salt in it would drive you mad! All the time the ca ptain tried to preserve our good spirits by telling stories and talking hopefully about what we would do when we got back to England. We only half believed him.The tension in the boat got worse as the supply of food and water gradually d isappeared. We could foresee that we would die if we could not reach land very s oon and we sank gradually into a sleepy, half-alive state. The captain was as wea k as the rest of us, but he was determined not to give up. He continued his navig ational measurements every day. He kept us busy and tried to take our minds off our stomachs and our thirst. He kept us alive.You could not imagine a more disturbing sight than what we looked like when arriving in Timor over forty days after being set loose in our small boat. Our clothe s were torn, we had fever and our faces showed the hardships we had suffered. B ut after a rest, some good meals and some new clothes, everything changed. We couldn't stop talking about our voyage and everybody wanted to hear about it. We were the heroes who had escaped the jaws of death by completing the greatest n avigational feat of all time!选修9 Unit 3 Australia-ReadingGLIMPSES OF AUSTRALIAAUSTRALIACapital: Canberra Offcial name: Commonwealth of AustraliaArea: 7,686,850 km2 Population: 20 millionHighest point: Mount Kosciuszko, 2,228 metres above sea levelLowest point: Lake Eyre, 15 metres below sea levelAustralia is the only country that is also a continent. It is the sixth largest cou ntry in the world and is in the smallest continent - Oceania. It is a mainly dry cou ntry with only a few coastal areas that have adequate rainfall to support a large p opulation. Approximately 80 of Australians live in the south-eastern coastal area, w hich includes Australia's two largest cities –Melbourne and Sydney. The centre of the continent, which is mainly desert and dry grassland, has few settlements.Australia is famous for its huge, open spaces, bright sunshine, enormous numb er of sheep and cattle and its unusual wildlife, which include kangaroos and koalas. Australia is a popular destination with tourists from all over the world who come t o experience its unique ecology.Australia is made up of six states. Like the states in America, Australian states are autonomous in some areas of government. However, Australia has a federal g overnment responsible for matters that affect people all over the country, such as defence, foreign policy and taxation. 
The federal parliament is located in Canberra.CITIZENSHIP CEREMONIES PLANNED AROUND AUSTRALIAOn 26 January, Australia Day, in over 200 locations across the nation , more t han 9,000 people will become Australian citizens."By these citizenship ceremonies we welcome those who have come from over seas from many different cultural and social backgrounds into our communities and our nation," said the Minister for Citizenship and Multicultural Affairs. "Australia Da y celebrations that include people from so many birthplaces are an excellent way t o encourage tolerance, respect and friendship among all the people of Australia."Most citizenship ceremonies will be followed by displays of singing and dancing from many of the migrants' homelands and the tasting of food from all over the w orld.Go by plane and see cloudsGo by TRAIN and see AustraliaEnjoy 3 nights on board the Indian-PacificOn this 4,352-km journey from Sydney to Perth via Adelaide you'll view some ot Australias unique scenery from the superb Blue Mountains to the treeless plains of the Nuliarbor. Along the way you will spot a fascinating variety of wildlife.Enjoy 2 nights on board the GhanAs you travel from Adelaide to Darwin via Alice Springs, you'll observe some o f Australia'smost spectacular landscapes - from the rolling hills surrounding Adelaide to the rusty reds ofAustralia's centre and the tropical splendour of Darwin.For more information, timetables and fares go to .au/trains.htm Dear Shen Ping,I wish you could see this amazing rock. It is part of one of Australia's 14 Wor m Heritage Sites andrises about 335 metres out of a vast, flat sandy plain. A t different times of th e day it appears tochange co/our, from grey-red at sunrise, to golden and finally to burning red at dusk. Aboriginal people have lived near Uluru for thousands of years and yout ca n walk around it with an Aboriainal guide to learn about their customs, art, religion and day-to-day life. It is also possible to climb the rock, but most people don't do this out of respect for the Aboriginal people who consider the rock to be sacred. I’ll be back in Sydney in a fortnight because I've made a reservation on the Indian Pacific train to Perth.love JackTours outside HobartDrive 250 km northwestwards from Hobart along the A10 highway and you'll ar rive at the southern end of the magnificent Cradle Mountain National Park and Wor ld Heritage area. This park is famous for its mountain peaks, lakes and ancient for ests. A popular attraction for active tourists is the 80-km walking track that joins th e southern and northern ends of the park. There are also a range of short walks.Reading and discussingBefore you read the following text, read the title and look at the pictures. Disc uss with a partner what you expect to read about in the text.AUSTRALIA’S DANGEROUS CREATURESAustralia is home to more than 170 different kinds of snake and 115 of these are poisonous. In fact, Australia has more kinds of venomous snake than any othe r country in the world. Luckily, the poison of most snakes can kill or paralyze only small creatures.A few varieties, however, can kill humans, so it is just as well that snakes are very shy and usually attack only if they are disturbed and feel threate ned.There are also approximately 2,000 different kinds of spider in Australia and, li ke snakes, most have a poisonous bite. However, the majority have no effect on h umans or cause only mild sickness.Only a few have venom that is powerful enoug h to kill a human being. 
The seas around Australia contain over 160 different kinds of shark, which vary in size from just 20 centimetres to over 14 metres. However, although they look dangerous because of their wide mouths and sharp teeth, all but two or three kinds are harmless to humans.

Another potentially dangerous sea animal is the jellyfish. Most kinds of poisonous jellyfish can cause severe pain to anyone who touches them, but the poison of the box jellyfish can actually kill a human, especially if that person has a weak heart. The tiniest amount of poison from a box jellyfish can kill in less than five minutes, and it is probably the most poisonous animal in the world.

There is one other dangerous animal in Australia worth mentioning, and that is the crocodile. Although two types of crocodile live in Australia, only the saltwater crocodile has been known to kill humans. This crocodile moves very quickly when it sees something it considers to be food, and from time to time a crocodile has snatched someone before he or she is even aware that the crocodile is there.

You might think that with all these dangerous animals Australia is an unsafe place to live in or visit. However, this is far from the truth. There are no more than a handful of shark attacks each year and only three deaths have been reported in the last five years. Similarly, in the last three years there have been only two reported deaths from crocodile attacks. Since 1956, when an anti-venom treatment for redback spider bites was developed, there have been no deaths from redbacks, and since 1981, when a treatment was developed for funnelweb spider poison, there have been no deaths from this spider either. Treatments for jellyfish stings and snakebites have also been developed, and in the last five years there have been only three deaths from jellyfish stings and about the same number from snakebites.

选修9 Unit 4 Exploring plants - Reading

PLANT EXPLORATION IN THE 18TH AND 19TH CENTURIES

The plants in our gardens look so familiar that often we do not realize that many of them actually come from countries far away. Collecting "exotic" plants, as they are called, dates back to the earliest times. Many ancient civilisations saw the value of bringing back plants from distant lands. The first plant collecting expedition recorded in history was around 1500 BC, when the Queen of Egypt sent ships away to gather plants, animals and other goods.

However, it was not until the eighteenth and nineteenth centuries that the exploration of the botanical world began on a large scale. Europe had become interested in scientific discovery, and the European middle classes took great interest in collecting new plants. This attraction to exotic plants grew as European nations, like the Netherlands, Britain and Spain, moved into other parts of the world like Asia and Australia. Brave young men took the opportunity of going on botanical expeditions, often facing many dangers including disease, near-starvation, severe environments and conflicts with the local people.

An important group of collectors were French Catholic missionaries who, by the middle of the 18th century, were beginning to set themselves up in China. One such missionary, Father d'Incarville, was sent to Beijing in the 1740s. He collected seeds of trees and bushes, including those of the Tree of Heaven. Just before he died, he sent some Tree of Heaven seeds to England. They arrived in 1751, and plants from these seeds were grown throughout Europe and later, in 1784, the species was introduced in North America.
Sir Joseph Banks was a very famous British plant collector, who accompanied James Cook on his first voyage from England to Oceania. The purpose of the trip for Banks was to record the plant and animal life they came across. He and his team collected examples whenever they went onto dry land. In 1769, Banks collected vast quantities of plants in the land now known as Australia. None of these plants had been recorded by Europeans before. Cook called the bay where the Endeavour had anchored Botany Bay.

Keeping plants alive during long land or sea voyages was an enormous challenge. Large numbers of seeds failed to grow after long sea voyages or trips across land between Asia and Europe. One plant explorer lost several years' work when his plants were ruined by seawater.

The world of plant exploration was completely changed with Dr Nathaniel Ward's invention of a tightly sealed portable glass container. This invention, called the Wardian case, allowed plants to be transported on long journeys. In 1833, Ward shipped two cases of British plants to Sydney, Australia. All the plants survived the six-month journey. In 1835, the cases made a return trip with some Australian species that had never been successfully transported before. After eight months at sea, they arrived safely in London.

A British man called Robert Fortune was one of the earliest plant collectors to use Wardian cases. He made several trips to China between 1843 and 1859. At that time, there were restrictions on the movement of Europeans and so, in order to travel unnoticed, he developed his fluency in Chinese and dressed as a Chinese man, even shaving his head in the Chinese style. He experienced many adventures, including huge thunderstorms in the Yellow Sea and pirates on the Yangtze River. Not only did Fortune introduce over 120 species of plants to Western gardens but he also shipped 20,000 tea plants from Shanghai to India, where a successful tea industry was established.

The second half of the nineteenth century was a very important period of plant exploration. During this time many Catholic missionaries were sent to China from France. They valued the study of the natural sciences and many of the missionaries knew a lot about plants and animals. Their expeditions resulted in huge plant collections, which were sent back to France. One of the collectors was Father Farges, who collected 37 seeds from a tree that had appealed to him. This tree was later called the Dove Tree. He sent the seeds back to France in 1897, but only one seed grew.

Although the missionaries collected large numbers of specimens, there was not enough material for growing particular species in Western gardens. However, European botanists were very excited by the knowledge that China had a vast variety of plants, so many plant collectors were sent on collecting trips to China. One of these collectors was E H Wilson who, in 1899, was able to collect a large quantity of seeds of the Dove Tree that Father Farges had discovered. Wilson and other plant collectors introduced many new plants to Western gardens.

Reading and discussing
Before you read the text on page 38, have a quick glance at it. What is the text about? What do the pictures show you? What is the chart about?

FLOWERS AND THEIR ANIMAL POLLINATORS
Over time, many flowering plants and their animal pollinators have evolved together. The plant needs the animal to pollinate it, and the animal is rewarded with food called nectar when it visits the flowers. Pollen becomes attached to the animal during its visit to a flower and is then passed on to another plant's blossom on its next visit. So pollination takes place, therefore increasing the chances of the survival of the plant species.