Markov games as a framework for multi-agent reinforcement learning
Package 'ChannelAttribution'                                May 17, 2023

Type: Package
Title: Markov Model for Online Multi-Channel Attribution
Version: 2.0.7
Date: 2023-05-17
Maintainer: Davide Altomare <**************************>
Description: Advertisers use a variety of online marketing channels to reach consumers and they want to know the degree to which each channel contributes to their marketing success. This is called the online multi-channel attribution problem. This package contains a probabilistic algorithm for the attribution problem. The model uses a k-order Markov representation to identify structural correlations in the customer journey data. The package also contains three heuristic algorithms (first-touch, last-touch and linear-touch approach) for the same problem. The algorithms are implemented in C++.
License: GPL-3 | file LICENSE
URL: https://channelattribution.io
LinkingTo: Rcpp, RcppArmadillo
Imports: Rcpp
NeedsCompilation: yes
Author: Davide Altomare [cre, aut], David Loris [aut]
Repository: CRAN
Date/Publication: 2023-05-17 19:20:02 UTC

R topics documented: ChannelAttribution-package, auto_markov_model, choose_order, Data, heuristic_models, markov_model, transition_matrix

ChannelAttribution-package    Markov Model for Online Multi-Channel Attribution

Description

Advertisers use a variety of online marketing channels to reach consumers and they want to know the degree to which each channel contributes to their marketing success. This is called the online multi-channel attribution problem. In many cases, advertisers approach this problem through simple heuristic methods that do not take into account any customer interactions and often tend to underestimate the importance of small channels in marketing contribution. This package provides a function that approaches the attribution problem in a probabilistic way. It uses a k-order Markov representation to identify structural correlations in the customer journey data. This allows advertisers to give a more reliable assessment of the marketing contribution of each channel. The approach basically follows the one presented in Eva Anderl, Ingo Becker, Florian v. Wangenheim, Jan H. Schumann (2014). Differently from them, we solved the estimation process using stochastic simulations. In this way it is also possible to take into account conversion values and their variability in the computation of channel importance. The package also contains a function that estimates three heuristic models (first-touch, last-touch and linear-touch approach) for the same problem.

Details

Package: ChannelAttribution
Type: Package
Version: 2.0.7
Date: 2023-05-17
License: GPL (>= 2)

The package contains functions for channel attribution in web marketing.

Author(s)

Davide Altomare, David Loris
Maintainer: Davide Altomare <**************************>

References

ChannelAttribution Official Website: https://channelattribution.io
Eva Anderl, Ingo Becker, Florian v. Wangenheim, Jan H. Schumann: Mapping the Customer Journey, 2014, doi:10.2139/ssrn.2343077

auto_markov_model    Automatic Markov Model

Description

Estimate a Markov model from customer journey data after automatically choosing a suitable order. It requires paths that do not lead to conversion as input.

Usage

auto_markov_model(Data, var_path, var_conv, var_null, var_value=NULL,
  max_order=10, roc_npt=100, plot=FALSE, nsim_start=1e5, max_step=NULL,
  out_more=FALSE, sep=">", ncore=1, nfold=10, seed=0, conv_par=0.05,
  rate_step_sim=1.5, verbose=TRUE, flg_adv=TRUE)

Arguments

Data: data.frame containing customer journey data.
var_path: column name containing paths.
var_conv: column name containing total conversions.
var_null: column name containing total paths that do not lead to conversion.
var_value: column name containing total conversion value.
max_order: maximum Markov model order considered.
roc_npt: number of points used for approximating ROC and AUC.
plot: if TRUE, a plot of penalized AUC with respect to order will be displayed.
nsim_start: minimum number of simulations used in computation.
max_step: maximum number of steps for a single simulated path. If NULL, it is the maximum number of steps found in Data.
out_more: if TRUE, transition probabilities between channels and removal effects will be shown.
sep: separator between the channels.
ncore: number of threads used in computation.
nfold: how many repetitions are used to verify whether convergence is reached at each iteration.
seed: random seed. Giving this parameter the same value over different runs guarantees that results will not vary.
conv_par: convergence parameter for the algorithm. The estimation process ends when the percentage of variation of the results over different repetitions is less than the convergence parameter.
rate_step_sim: the number of simulations used at each iteration is equal to the number of simulations used at the previous iteration multiplied by rate_step_sim.
verbose: if TRUE, additional information about process convergence will be shown.
flg_adv: if TRUE, the ChannelAttribution Pro banner is printed.

Value

An object of class data.frame with the estimated number of conversions and the estimated conversion value attributed to each channel.

Author(s)

Davide Altomare (<**************************>).

Examples

## Not run:
library(ChannelAttribution)
data(PathData)
auto_markov_model(Data, "path", "total_conversions", "total_null")
## End(Not run)

choose_order    Choose order for Markov model

Description

Find the minimum Markov model order that gives a good representation of customers' behaviour for the data considered. It requires paths that do not lead to conversion as input. The minimum order is found by maximizing a penalized area under the ROC curve.

Usage

choose_order(Data, var_path, var_conv, var_null, max_order=10, sep=">",
  ncore=1, roc_npt=100, plot=TRUE, flg_adv=TRUE)

Arguments

Data: data.frame containing customer journeys.
var_path: column name of Data containing paths.
var_conv: column name of Data containing total conversions.
var_null: column name of Data containing total paths that do not lead to conversion.
max_order: maximum Markov model order considered.
sep: separator between channels.
ncore: number of threads used in computation.
roc_npt: number of points used for approximating ROC and AUC.
plot: if TRUE, a plot of penalized AUC with respect to order will be displayed.
flg_adv: if TRUE, the ChannelAttribution Pro banner is printed.

Value

An object of class List with the estimated ROC, AUC and penalized AUC.

Author(s)

Davide Altomare (<**************************>).

Examples

## Not run:
library(ChannelAttribution)
data(PathData)
res=choose_order(Data, var_path="path", var_conv="total_conversions",
  var_null="total_null")
# plot auc and penalized auc
plot(res$auc$order, res$auc$auc, type="l", xlab="order", ylab="pauc", main="AUC")
lines(res$auc$order, res$auc$pauc, col="red")
legend("right", legend=c("auc", "penalized auc"), col=c("black", "red"), lty=1)
## End(Not run)

Data    Customer journeys data

Description

Example dataset.

Usage

data(PathData)

Format

Data is a data.frame with 10,000 rows and 4 columns: "path" containing customer paths, "total_conversions" containing the total number of conversions, "total_conversion_value" containing the total conversion value, and "total_null" containing the total number of paths that do not lead to conversion.

heuristic_models    Heuristic models for the online attribution problem

Description

Estimate three heuristic models (first-touch, last-touch and linear) from customer journey data.

Usage

heuristic_models(Data, var_path, var_conv, var_value=NULL, sep=">", flg_adv=TRUE)

Arguments

Data: data.frame containing paths and conversions.
var_path: column name containing paths.
var_conv: column name containing total conversions.
var_value: column name containing total conversion value.
sep: separator between the channels.
flg_adv: if TRUE, the ChannelAttribution Pro banner is printed.

Value

An object of class data.frame with the estimated number of conversions and the estimated conversion value attributed to each channel for each model.

Author(s)

Davide Altomare (<**************************>).

Examples

## Not run:
library(ChannelAttribution)
data(PathData)
heuristic_models(Data, "path", "total_conversions")
heuristic_models(Data, "path", "total_conversions", var_value="total_conversion_value")
## End(Not run)

markov_model    Markov model for the online attribution problem

Description

Estimate a k-order Markov model from customer journey data. This function iterates estimation until convergence is reached and enables multiprocessing.

Usage

markov_model(Data, var_path, var_conv, var_value=NULL, var_null=NULL,
  order=1, nsim_start=1e5, max_step=NULL, out_more=FALSE, sep=">",
  ncore=1, nfold=10, seed=0, conv_par=0.05, rate_step_sim=1.5,
  verbose=TRUE, flg_adv=TRUE)

Arguments

Data: data.frame containing customer journey data.
var_path: column name containing paths.
var_conv: column name containing total conversions.
var_value: column name containing total conversion value.
var_null: column name containing total paths that do not lead to conversion.
order: Markov model order.
nsim_start: minimum number of simulations used in computation.
max_step: maximum number of steps for a single simulated path. If NULL, it is the maximum number of steps found in Data.
out_more: if TRUE, transition probabilities between channels and removal effects will be returned.
sep: separator between the channels.
ncore: number of threads used in computation.
nfold: how many repetitions are used to verify whether convergence has been reached at each iteration.
seed: random seed. Giving this parameter the same value over different runs guarantees that results will not vary.
conv_par: convergence parameter for the algorithm. The estimation process ends when the percentage of variation of the results over different repetitions is less than the convergence parameter.
rate_step_sim: the number of simulations used at each iteration is equal to the number of simulations used at the previous iteration multiplied by rate_step_sim.
verbose: if TRUE, additional information about process convergence will be shown.
flg_adv: if TRUE, the ChannelAttribution Pro banner is printed.

Value

An object of class data.frame with the estimated number of conversions and the estimated conversion value attributed to each channel.

Author(s)

Davide Altomare (<**************************>).

Examples

## Not run:
library(ChannelAttribution)
data(PathData)

# Estimate a Markov model using total conversions
markov_model(Data, "path", "total_conversions")

# Estimate a Markov model using total conversions and revenues
markov_model(Data, "path", "total_conversions", var_value="total_conversion_value")

# Estimate a Markov model using total conversions, revenues and paths that do not lead to conversion
markov_model(Data, "path", "total_conversions", var_value="total_conversion_value",
  var_null="total_null")

# Estimate a Markov model returning the transition matrix and removal effects
markov_model(Data, "path", "total_conversions", var_value="total_conversion_value",
  var_null="total_null", out_more=TRUE)

# Estimate a Markov model using 4 threads
markov_model(Data, "path", "total_conversions", var_value="total_conversion_value",
  ncore=4)
## End(Not run)

transition_matrix    Transition matrix

Description

Estimate a k-order transition matrix from customer journey data.

Usage

transition_matrix(Data, var_path, var_conv, var_null, order=1, sep=">",
  flg_equal=TRUE, flg_adv=TRUE)

Arguments

Data: data.frame containing customer journey data.
var_path: column name containing paths.
var_conv: column name containing total conversions.
var_null: column name containing paths that do not lead to conversion.
order: Markov model order.
sep: separator between the channels.
flg_equal: if TRUE, transitions from a channel to itself will be considered.
flg_adv: if TRUE, the ChannelAttribution Pro banner is printed.

Value

An object of class List containing a data.frame with channel names and a data.frame with the estimated transition matrix.

Author(s)

Davide Altomare (<**************************>).

Examples

## Not run:
library(ChannelAttribution)
data(PathData)
transition_matrix(Data, var_path="path", var_conv="total_conversions",
  var_null="total_null", order=1, sep=">", flg_equal=TRUE)
transition_matrix(Data, var_path="path", var_conv="total_conversions",
  var_null="total_null", order=3, sep=">", flg_equal=TRUE)
## End(Not run)
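The three touch heuristics estimated above are simple enough to restate outside R. The following Python sketch is only a toy illustration of the first-touch, last-touch and linear-touch rules on PathData-shaped input; the function name and data layout are ours, and the package's actual implementation is the C++ code behind heuristic_models.

from collections import defaultdict

def heuristic_attribution(paths, sep=">"):
    """Toy first/last/linear-touch attribution.

    paths: list of (path_string, total_conversions) pairs, e.g.
           [("facebook>paid_search", 3), ("direct", 1)].
    Returns {channel: {"first": ..., "last": ..., "linear": ...}}.
    """
    scores = defaultdict(lambda: {"first": 0.0, "last": 0.0, "linear": 0.0})
    for path, conversions in paths:
        channels = [c.strip() for c in path.split(sep)]
        scores[channels[0]]["first"] += conversions   # all credit to the first touch
        scores[channels[-1]]["last"] += conversions   # all credit to the last touch
        share = conversions / len(channels)           # equal split over all touches
        for channel in channels:
            scores[channel]["linear"] += share
    return dict(scores)

print(heuristic_attribution([("facebook>paid_search", 3), ("direct", 1)]))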
Quizzes for Chapter 1

1. (Single choice) The Turing test is intended to give a satisfactory operational definition of which of the following?
A. Human thinking
B. Artificial intelligence
C. Machine intelligence
D. Machine action
Correct answer: C

2. (Multiple choice) Select the correct statements about the concept of artificial intelligence.
A. Artificial intelligence aims to create intelligent machines.
B. Artificial intelligence is the study and construction of agent programs that perform well in a given environment.
C. Artificial intelligence is defined as the study of human agents.
D. Artificial intelligence aims to develop a class of computers able to do the things normally done by people.
Correct answers: A, B, D

3. (Multiple choice) Which of the following disciplines are foundations of artificial intelligence?
A. Economics
B. Philosophy
C. Psychology
D. Mathematics
Correct answers: A, B, C, D

4. (Multiple choice) Which statements describe strong AI (general AI)?
A. It refers to a machine with the ability to apply intelligence to any problem.
B. It is an appropriately programmed computer with the right inputs and outputs, and therefore a mind with the same judgment as a human.
C. It refers to a machine aimed at only one specific problem.
D. It is defined as non-sentient computer intelligence, or AI focused on one narrow task.
Correct answers: A, B

5. (Multiple choice) Which of the following computer systems are instances of artificial intelligence?
A. Web search engines
B. Supermarket barcode scanners
C. Voice-controlled phone menus
D. Intelligent personal assistants
Correct answers: A, D

6. (Multiple choice) Which of the following are research areas of artificial intelligence?
A. Face recognition
B. Expert systems
C. Image understanding
D. Distributed computing
Correct answers: A, B, C

7. (Multiple choice) Considering some applications of AI, which of the following tasks can currently be solved by AI?
A. Playing Texas hold'em poker at a competitive level
B. Playing a decent game of table tennis
C. Buying a week's groceries on the Web
D. Buying a week's groceries at a market
Correct answers: A, B, C

8. (Fill in the blank) Rationality is the property of a system that does the right thing in a _________ environment.
Correct answer: known
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

John Lafferty LAFFERTY@
Andrew McCallum MCCALLUM@
Fernando Pereira FPEREIRA@
WhizBang! Labs-Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA

Abstract

We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

1. Introduction

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).

HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.

This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.

Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state(1) has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

(1) Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a "conservation of score mass" (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.

This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite-state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,(2) guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

(2) In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.
transitions rather than states;the symbol‘’represents the null output label.We present the model,describe two training procedures and sketch a proof of convergence.We also give experimental results on synthetic data showing that CRFs solve the clas-sical version of the label bias problem,and,more signifi-cantly,that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependen-cies than the model,as is often the case in practice.Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs,MEMMs and CRFs with identical state structure on a part-of-speech tag-ging task.2.The Label Bias ProblemClassical probabilistic automata(Paz,1971),discrimina-tive Markov models(Bottou,1991),maximum entropy taggers(Ratnaparkhi,1996),and MEMMs,as well as non-probabilistic sequence tagging and segmentation mod-els with independently trained next-state classifiers(Pun-yakanok&Roth,2001)are all potential victims of the label bias problem.For example,Figure1represents a simplefinite-state model designed to distinguish between the two words and.Suppose that the observation sequence is. In thefirst time step,matches both transitions from the start state,so the probability mass gets distributed roughly equally among those two transitions.Next we observe. Both states1and4have only one outgoing transition.State 1has seen this observation often in training,state4has al-most never seen this observation;but like state1,state4 has no choice but to pass all its mass to its single outgoing transition,since it is not generating the observation,only conditioning on it.Thus,states with a single outgoing tran-sition effectively ignore their observations.More generally, states with low-entropy next state distributions will take lit-tle notice of observations.Returning to the example,the top path and the bottom path will be about equally likely, independently of the observation sequence.If one of the two words is slightly more common in the training set,the transitions out of the start state will slightly prefer its cor-responding transition,and that word’s state sequence will always win.This behavior is demonstrated experimentally in Section5.L´e on Bottou(1991)discussed two solutions for the label bias problem.One is to change the state-transition struc-ture of the model.In the above example we could collapse states1and4,and delay the branching until we get a dis-criminating observation.This operation is a special case of determinization(Mohri,1997),but determinization of weightedfinite-state machines is not always possible,and even when possible,it may lead to combinatorial explo-sion.The other solution mentioned is to start with a fully-connected model and let the training procedurefigure out a good structure.But that would preclude the use of prior structural knowledge that has proven so valuable in infor-mation extraction tasks(Freitag&McCallum,2000). 
Proper solutions require models that account for whole state sequences at once by letting some transitions "vote" more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved, but instead individual transitions can "amplify" or "dampen" the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.(3)

In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.

(3) Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.

3. Conditional Random Fields

In what follows, X is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet 𝒴. For example, X might range over natural language sentences and Y range over part-of-speech taggings of those sentences, with 𝒴 the set of possible part-of-speech tags. The random variables X and Y are jointly distributed, but in a discriminative framework we construct a conditional model p(Y|X) from paired observation and label sequences, and do not explicitly model the marginal p(X).

Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field in case, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G. Thus, a CRF is a random field globally conditioned on the observation X. Throughout the paper we tacitly assume that the graph G is fixed. In the simplest and most important example for modeling sequences, G is a simple chain or line: G = (V = {1, 2, ..., m}, E = {(i, i+1)}). X may also have a natural graph structure; yet in general it is not necessary to assume that X and Y have the same graphical structure, or even that X has any graphical structure at all. However, in this paper we will be most concerned with sequences X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n).

If the graph G = (V, E) of Y is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence Y given X has the form

    p_θ(y|x) ∝ exp( Σ_{e∈E,k} λ_k f_k(e, y|_e, x) + Σ_{v∈V,k} μ_k g_k(v, y|_v, x) ),   (1)

where x is a data sequence, y a label sequence, and y|_S is the set of components of y associated with the vertices in subgraph S. We assume that the features f_k and g_k are given and fixed. For example, a Boolean vertex feature g_k might be true if the word X_i is upper case and the tag Y_i is "proper noun."

The parameter estimation problem is to determine the parameters θ = (λ_1, λ_2, ...; μ_1, μ_2, ...) from training data D = {(x^(i), y^(i))}, i = 1, ..., N, with empirical distribution p̃(x, y). In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function

    O(θ) = Σ_{i=1}^N log p_θ(y^(i) | x^(i)).

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair (y', y), and one feature for each state-observation pair (y, x):

    f_{y',y}(<u,v>, y|_<u,v>, x) = δ(y_u, y') δ(y_v, y),   g_{y,x}(v, y|_v, x) = δ(y_v, y) δ(x_v, x).

The corresponding parameters λ_{y',y} and μ_{y,x} play a similar role to the (logarithms of the) usual HMM parameters p(y'|y) and p(x|y). Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization Z(x) for conditional distributions. Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence.

Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.

In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of Y, conditioned on X, form a chain. To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop. Thus, we will be using the graphical structure shown in Figure 2.

For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that p_θ(Y|X) is a CRF given by (1). For each position i in the observation sequence x, we define the |𝒴| × |𝒴| matrix random variable M_i(x) = [M_i(y', y | x)] by

    M_i(y', y | x) = exp( Σ_k λ_k f_k(e_i, Y|_{e_i} = (y', y), x) + Σ_k μ_k g_k(v_i, Y|_{v_i} = y, x) ),

where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences, and therefore these matrices can be computed directly as needed from a given training or test observation sequence and the parameter vector θ. Then the normalization (partition function) Z_θ(x) is the (start, stop) entry of the product of these matrices:

    Z_θ(x) = ( M_1(x) M_2(x) ... M_{n+1}(x) )_{start, stop}.

Using this notation, the conditional probability of a label sequence y is written as

    p_θ(y|x) = ( Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) ) / Z_θ(x),

where y_0 = start and y_{n+1} = stop.
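In the chain case these matrix products make the normalization cheap to compute. The following sketch is our illustration of the formulas above, with random positive entries standing in for the exponentiated feature sums and an arbitrary two-label alphabet; it checks that the probabilities of all labelings sum to one.

import numpy as np
from itertools import product as cart

rng = np.random.default_rng(0)
labels = ["start", "A", "B", "stop"]
idx = {y: i for i, y in enumerate(labels)}
n = 3  # length of the observation sequence

# M[i][y', y] stands for exp(sum of weighted features) at position i+1;
# zeroing forbidden entries keeps start/stop at the ends of the chain.
M = []
for i in range(n + 1):
    m = rng.uniform(0.5, 2.0, size=(4, 4))
    m[:, idx["start"]] = 0.0           # nothing enters start
    m[idx["stop"], :] = 0.0            # nothing leaves stop
    if i > 0:
        m[idx["start"], :] = 0.0       # start only occupies position 0
    if i < n:
        m[:, idx["stop"]] = 0.0        # stop only occupies position n+1
    else:
        m[:, [idx["A"], idx["B"]]] = 0.0
    M.append(m)

# Z(x) is the (start, stop) entry of the product M_1 ... M_{n+1}.
Z = np.linalg.multi_dot(M)[idx["start"], idx["stop"]]

def prob(y):
    """p(y|x) for y = (y_1, ..., y_n), with y_0 = start, y_{n+1} = stop."""
    seq = ["start", *y, "stop"]
    score = 1.0
    for i in range(n + 1):
        score *= M[i][idx[seq[i]], idx[seq[i + 1]]]
    return score / Z

print(sum(prob(y) for y in cart("AB", repeat=n)))  # -> 1.0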
4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector θ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k for appropriately chosen δλ_k and δμ_k. In particular, the IIS update δλ_k for an edge feature f_k is the solution of

    Ẽ[f_k] def= Σ_{x,y} p̃(x, y) Σ_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)
            = Σ_{x,y} p̃(x) p_θ(y|x) Σ_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x) exp( δλ_k T(x, y) ),

where T(x, y) def= Σ_{i,k} f_k(e_i, y|_{e_i}, x) + Σ_{i,k} g_k(v_i, y|_{v_i}, x) is the total feature count. The equations for vertex feature updates δμ_k have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because T(x, y) is a global property of (x, y), and dynamic programming will sum over sequences with potentially varying T. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial totals.

For Algorithm S, we define the slack feature by

    s(x, y) def= S − Σ_i Σ_k f_k(e_i, y|_{e_i}, x) − Σ_i Σ_k g_k(v_i, y|_{v_i}, x),

where S is a constant chosen so that s(x^(i), y) ≥ 0 for all y and all observation vectors x^(i) in the training set, thus making T(x, y) = S. Feature s is "global," that is, it does not correspond to any particular edge or vertex.

For each index i we now define the forward vectors α_i(x) with base case

    α_0(y|x) = 1 if y = start, 0 otherwise,

and recurrence α_i(x) = α_{i−1}(x) M_i(x). Similarly, the backward vectors β_i(x) are defined by

    β_{n+1}(y|x) = 1 if y = stop, 0 otherwise,

and β_i(x) = M_{i+1}(x) β_{i+1}(x). With these definitions, the update equations are

    δλ_k = (1/S) log( Ẽ[f_k] / E[f_k] ),   δμ_k = (1/S) log( Ẽ[g_k] / E[g_k] ),

where the model expectations are computed from the forward and backward vectors, for example

    E[f_k] = Σ_x p̃(x) Σ_{i=1}^{n+1} Σ_{y',y} f_k(e_i, (y', y), x) α_{i−1}(y'|x) M_i(y', y|x) β_i(y|x) / Z_θ(x).

The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example, α_i(y|x) β_i(y|x) / Z_θ(x) is the marginal probability of label Y_i = y given that the observation sequence is x. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and MART algorithms used in image reconstruction.

The constant S in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let T(x) def= max_y T(x, y). Algorithm T accumulates feature expectations into counters indexed by T(x). More specifically, we use the forward-backward recurrences just introduced to compute the expectations a_{k,t} of feature f_k and b_{k,t} of feature g_k given that T(x) = t. Then our parameter updates are δλ_k = log β_k and δμ_k = log γ_k, where β_k and γ_k are the unique positive roots to the following polynomial equations

    Σ_{t=0}^{T_max} a_{k,t} β_k^t = Ẽ[f_k],   Σ_{t=0}^{T_max} b_{k,t} γ_k^t = Ẽ[g_k],   (2)

which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well-known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features f_k with parameters λ_k.

Given two parameter settings θ and θ' = θ + δ, we bound from below the change in the objective function with an auxiliary function A(θ', θ) as follows:

    O(θ') − O(θ) = Σ_{x,y} p̃(x, y) log( p_{θ'}(y|x) / p_θ(y|x) )
                 = Σ_k δλ_k Ẽ[f_k] − Σ_x p̃(x) log( Z_{θ'}(x) / Z_θ(x) )
                 ≥ Σ_k δλ_k Ẽ[f_k] + 1 − Σ_x p̃(x) Z_{θ'}(x) / Z_θ(x)
                 ≥ Σ_k δλ_k Ẽ[f_k] + 1 − Σ_x p̃(x) Σ_y p_θ(y|x) Σ_{i,k} ( f_k(e_i, y|_{e_i}, x) / T(x, y) ) exp( δλ_k T(x, y) )
                 def= A(θ', θ),

where the inequalities follow from the convexity of −log and exp (the first uses −log z ≥ 1 − z, the second Jensen's inequality). Differentiating A with respect to δλ_k and setting the result to zero yields equation (2).
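Because the coefficients a_{k,t} in equation (2) are expectations and hence nonnegative, the left-hand side is nondecreasing in β_k for β_k > 0, so the positive root is unique and Newton's method converges quickly. A minimal sketch with made-up coefficients (our illustration of the root-finding step, not the paper's code):

def newton_positive_root(a, target, x0=1.0, tol=1e-12, max_iter=100):
    """Solve sum_t a[t] * x**t = target for the unique positive root.

    a[t] >= 0 stands for the expectation of the feature accumulated in the
    counter for total feature count t (Algorithm T); `target` stands for
    the empirical expectation of the feature.
    """
    x = x0
    for _ in range(max_iter):
        f = sum(c * x**t for t, c in enumerate(a)) - target
        df = sum(t * c * x**(t - 1) for t, c in enumerate(a) if t > 0)
        step = f / df
        x -= step
        if abs(step) < tol:
            break
    return x

import math
beta = newton_positive_root([0.0, 0.3, 0.5, 0.2], target=2.0)
print(beta, math.log(beta))  # the update is delta lambda_k = log beta_k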
5. Experiments

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs. The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and a second-order model. Competing models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments do not use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

Figure 3. Plots of error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes "more second order," the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with α < 1/2, and a solid circle indicates a data set with α ≥ 1/2; the plot shows that when the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.

5.1 Modeling label bias

We generate data from a simple HMM which encodes a noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6% while the MEMM error is 42%, showing that the MEMM fails to discriminate between the two branches.

5.2 Modeling mixed-order sources

For these results, we use five labels, a-e (|Y| = 5), and 26 observation values, A-Z (|X| = 26); however, the results were qualitatively the same over a range of sizes for |Y| and |X|. We generate data from a mixed-order HMM with state transition probabilities given by

    p_α(y_i | y_{i−1}, y_{i−2}) = α p_2(y_i | y_{i−1}, y_{i−2}) + (1 − α) p_1(y_i | y_{i−1})

and, similarly, emission probabilities given by

    p_α(x_i | y_i, x_{i−1}) = α p_2(x_i | y_i, x_{i−1}) + (1 − α) p_1(x_i | y_i).

Thus, for α = 0 we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables are constrained to be sparse. In particular, p_2(· | y_{i−1}, y_{i−2}) can have at most two nonzero entries, for each y_{i−1}, y_{i−2}, and p_2(· | y_i, x_{i−1}) can have at most three nonzero entries for each y_i, x_{i−1}. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.

On each randomly generated training set, a CRF is trained using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC used in our experiments, each iteration takes approximately 0.2 seconds. On the same data an MEMM is trained using iterative scaling, which does not require forward-backward calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately 100 iterations. For each model, the Viterbi algorithm is used to label a test set; the experimental results do not significantly change when using forward-backward decoding to minimize the per-symbol error rate.

The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As α increases, the error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares one parameterization of the models; results for the other parameterizations considered are qualitatively the same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%-20% relative error. (The points for very small error rate, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)
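For concreteness, the sampling scheme described above can be sketched as follows. This is our reconstruction, with dense rather than sparse conditional tables and an arbitrary initial context, for brevity:

import numpy as np

rng = np.random.default_rng(1)
nY, nX, alpha, length = 5, 26, 0.5, 25

def dist(shape):
    """Random conditional tables; the last axis is a normalized distribution."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

p1_y, p2_y = dist((nY, nY)), dist((nY, nY, nY))   # p1(y|y'), p2(y|y',y'')
p1_x, p2_x = dist((nY, nX)), dist((nY, nX, nX))   # p1(x|y),  p2(x|y,x')

def sample_sequence():
    ys, xs = [0, 0], [0]                          # dummy initial context
    for _ in range(length):
        # Mixed-order transition: alpha * second order + (1-alpha) * first order.
        py = alpha * p2_y[ys[-1], ys[-2]] + (1 - alpha) * p1_y[ys[-1]]
        y = rng.choice(nY, p=py)
        # Mixed-order emission, conditioned on the previous observation.
        px = alpha * p2_x[y, xs[-1]] + (1 - alpha) * p1_x[y]
        x = rng.choice(nX, p=px)
        ys.append(y); xs.append(x)
    return ys[2:], xs[1:]

labels, observations = sample_sequence()
print(labels[:10], observations[:10])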
Model    Error    OOV error
HMM      5.69%    45.99%
MEMM     6.37%    54.61%
CRF      5.55%    48.05%
MEMM+    4.81%    26.99%
CRF+     4.27%    23.76%

Figure 4. Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million word corpus. The oov rate is 5.45%. Models marked + use the spelling features described below.

5.3 POS tagging experiments

To confirm our synthetic data results, we also compared HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be labeled with one of 45 syntactic tags.

We carried out two sets of experiments with this natural language data. First, we trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments, introducing parameters μ_{y,x} for each tag-word pair and λ_{y',y} for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM outperforms the MEMM, as a consequence of the label bias problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.

In the second set of experiments, we take advantage of the power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.

One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However, for these datasets, CRF training with that initialization is much slower than MEMM training. Fortunately, we can use the optimal MEMM parameter vector as a starting point for training the corresponding CRF. In Figure 4, MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.

6. Further Aspects of CRFs

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficiently certain feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.

Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of (X, Y) to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.

7. Related Work and Conclusions

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.

Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions assuming that neighboring labels are correctly assigned. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.
Chinese-English Translation

A configurable method for multi-style license plate recognition

Automatic license plate recognition (LPR) has been a practical technique in the past decades. Numerous applications, such as automatic toll collection, criminal pursuit and traffic law enforcement, have benefited from it. Although some novel techniques, for example RFID (radio frequency identification), WSN (wireless sensor network), etc., have been proposed for car ID identification, LPR on image data is still an indispensable technique in current intelligent transportation systems for its convenience and low cost. LPR is generally divided into three steps: license plate detection, character segmentation and character recognition. The detection step roughly classifies LP and non-LP regions, the segmentation step separates the symbols/characters from each other in one LP so that only an accurate outline of each image block of characters is left for recognition, and the recognition step finally converts grey-level image blocks into characters/symbols using predefined recognition models. Although the LPR technique has a long research history, it is still driven forward by various arising demands, the most frequent of which is the variation of LP styles, for example:

(1) Appearance variation caused by the change of image capturing conditions.
(2) Style variation from one nation to another.
(3) Style variation when the government releases a new LP format.

After comprehensive analyses of multi-style LP characteristics on real data, we summed these up into four factors, namely rotation angle, line number, character type and format. Generally speaking, any change of the above four factors can result in a change of LP style or appearance and then affect the detection, segmentation or recognition algorithms. If an LP has a large rotation angle, the segmentation and recognition algorithms for horizontal LPs may not work. If there is more than one character line in an LP, an additional line separation algorithm is needed before the segmentation process. With the variation of character types when we apply the method from one nation to another, the ability to re-define the recognition models is needed. What is more, the change of LP styles requires the method to adjust itself so that the segmented and recognized character candidates can match best with an LP format.

Several methods have been proposed for multi-national LPs or multi-format LPs in the past years, while few of them comprehensively address the style adaptation problem in terms of the abovementioned factors. Some of them only claim the ability to process multinational LPs by redefining the detection and segmentation rules or recognition models.

In this paper, we propose a configurable LPR method which is adaptable from one style to another, particularly from one nation to another, by defining the four factors as parameters. Users can constrain the scope of a parameter and at the same time the method will adjust itself so that the recognition can be faster and more accurate. Similar to existing LPR techniques, we also provide details of the detection, segmentation and recognition algorithms. The difference is that we emphasize the configurable framework for LPR and the extensibility of the proposed method for multi-style LPs instead of the performance of each algorithm.
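As a purely hypothetical illustration of this idea (the class and field names below are ours, not the authors'), the four factors read naturally as a configuration object that such a pipeline could consult:

from dataclasses import dataclass, field

@dataclass
class PlateStyle:
    """Hypothetical configuration capturing the paper's four factors."""
    max_rotation_deg: float = 15.0   # rotation angle the pipeline must tolerate
    line_number: int = 1             # number of character lines on the plate
    character_set: str = "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"  # recognizable symbols
    formats: list = field(default_factory=lambda: ["LLL-NNNN"])  # L=letter, N=digit

def needs_line_separation(style: PlateStyle) -> bool:
    # An extra line-separation step is required only for multi-line plates.
    return style.line_number > 1

single_line = PlateStyle()
two_line = PlateStyle(line_number=2, formats=["LL-NN", "NNNN"])
print(needs_line_separation(single_line), needs_line_separation(two_line))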
In the past decades, many methods have been proposed for LPR that contain detection, segmentation and recognition algorithms. In the following paragraphs, these algorithms and the LPR methods based on them are briefly reviewed.

LP detection algorithms can be mainly classified into three classes according to the features used, namely edge-based algorithms, color-based algorithms and texture-based algorithms. The most commonly used method for LP detection is certainly the combination of edge detection and mathematical morphology. In these methods, gradient (edges) is first extracted from the image and then a spatial analysis by morphology is applied to connect the edges into LP regions. Another way is counting edges on the image rows to find regions of dense edges, or describing the dense edges in LP regions by a Hough transformation. Edge analysis is the most straightforward method, with low computation complexity and good extensibility. Compared with edge-based algorithms, color-based algorithms depend more on the application conditions. Since LPs in a nation often have several predefined colors, researchers have defined color models to segment regions of interest as the LP regions. This kind of method can be affected a lot by lighting conditions. To win both high recall and low false positive rates, texture classification has been used for LP detection. In Ref., Kim et al. used an SVM to train texture classifiers to detect image blocks that contain LP pixels. In Ref., the authors used Gabor filters to extract texture features in multiple scales and orientations to describe the texture properties of LP regions. In Ref., Zhang used X and Y derivative features, grey-value variance and an Adaboost classifier to classify LP and non-LP regions in an image. In Refs., wavelet feature analysis is applied to identify LP regions. Despite the good performance of these methods, the computation complexity will limit their usability. In addition, texture-based algorithms may be affected by multi-lingual factors.

Multi-line LP segmentation algorithms can also be classified into three classes, namely algorithms based on projection, binarization and global optimization. In the projection algorithms, gradient or color projection on the vertical orientation is calculated first. The "valleys" on the projection result are regarded as the spaces between characters and used to segment characters from each other; segmented regions are further processed by vertical projection to obtain precise bounding boxes of the LP characters (see the sketch after this paragraph). Since simple segmentation methods are easily affected by the rotation of the LP, segmenting a skewed LP becomes a key issue to be solved. In the binarization algorithms, global or local methods are often used to separate the foreground from the background, and then a region connection operation is used to obtain character regions. In the most recent work, local threshold determination and slide window techniques have been developed to improve segmentation performance. In the global optimization algorithms, the goal is not to obtain a good segmentation result for independent characters but to obtain a compromise between character spatial arrangement and single character recognition results. A hidden Markov chain has been used to formulate the dynamic segmentation of characters in an LP. The advantage of this algorithm is that the global optimization improves robustness to noise; the disadvantage is that a precise format definition is necessary before the segmentation process.
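The projection idea is straightforward to sketch. The code below is an illustrative toy of ours, not taken from any of the surveyed papers: it segments a binarized plate image by treating low-sum columns of the projection profile as the gaps between characters.

import numpy as np

def segment_by_projection(binary_plate, gap_threshold=0):
    """Return (start, end) column spans of characters.

    binary_plate: 2-D array of 0/1 pixels, foreground = 1.
    Columns whose sum is <= gap_threshold are the projection "valleys"
    separating characters.
    """
    projection = binary_plate.sum(axis=0)          # column-wise projection profile
    spans, start = [], None
    for col, value in enumerate(projection):
        if value > gap_threshold and start is None:
            start = col                            # a character begins
        elif value <= gap_threshold and start is not None:
            spans.append((start, col)); start = None
    if start is not None:
        spans.append((start, len(projection)))
    return spans

plate = np.zeros((8, 12), dtype=int)
plate[2:6, 1:3] = 1; plate[2:6, 5:7] = 1; plate[2:6, 9:11] = 1
print(segment_by_projection(plate))                # -> [(1, 3), (5, 7), (9, 11)]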
Character and symbol recognition algorithms in LPR can be categorized into learning-based ones and template-matching ones. For the former, the artificial neural network (ANN) is the most used method, since it has proved able to obtain very good recognition results given a large training set. An important factor in training an ANN recognition model for LPs is to build a reasonable network structure with good features. SVM-based methods are also adopted in LPR to obtain good recognition performance with even few training samples. Recently, the cascade classifier method has also been used for LP recognition. Template matching is another widely used algorithm. Generally, researchers need to build template images by hand for the LP characters and symbols. They can assign larger weights to the important points, for example the corner points, in the template to emphasize the different characteristics of the characters. Invariance of feature points is also considered in template matching methods to improve robustness. The disadvantage is that it is difficult for users who have no professional knowledge of pattern recognition to define new templates, which restricts the application of the algorithm.

Based on the abovementioned algorithms, lots of LPR methods have been developed. However, these methods are mainly developed for specific nations or special LP formats. In Ref., the authors focus on recognizing Greek LPs by proposing new segmentation and recognition algorithms; the characters on these LPs are alphanumerics with several fixed formats. In Ref., Zhang et al. developed a learning-based method for LP detection and character recognition, mainly for LPs of Korean styles. In Ref., optical character recognition (OCR) techniques are integrated into LPR to develop a general LPR method, but the performance of OCR may drop when facing LPs of poor image quality, since it is difficult to discriminate real characters from candidates without format supervision; this method can only select the candidates with the best recognition results as LP characters, without a recovery process. Wang et al. developed a method to recognize LPs with various viewing angles; the skew factor is considered in their method. In Ref., the authors proposed an automatic LPR method which can handle changes of illumination, vehicle speed, routes and backgrounds, realized by developing new detection and segmentation algorithms that are robust to illumination changes and image blurring. The performance of the method is encouraging, but the authors do not present recognition results under multi-national or multi-style conditions. In Ref., the authors propose an LPR method for a multinational environment with character segmentation and format-independent recognition. Since no recognition information is used in character segmentation, falsely segmented characters arising from background noise may be produced. What is more, the recognition method is not learning-based, which limits its extensibility. In Ref., Mecocci et al. propose a generative recognition method. Generative models (GM) are proposed to produce many synthetic characters whose statistical variability is equivalent (for each class) to that shown by real samples. Thus a suitable statistical description of a large set of characters can be obtained by using only a limited set of images. As a result, the extension ability of character recognition is improved. This method mainly concerns the extensibility of character recognition rather than a whole LPR method.

From this review we can see that LPR for multi-style plates in multinational applications has not been fully considered.
Lots of existing LPR methods can work very well under a special application condition, while the performance drops sharply when they are extended from one condition to another, or from several styles to others.
Probability and Stochastic ProcessesProbability and stochastic processes are important concepts in the field of mathematics and have applications in various areas such as engineering, finance, and computer science. In this response, I will discuss the significance of probability and stochastic processes from multiple perspectives, highlightingtheir practical applications, theoretical foundations, and potential limitations. From a practical perspective, probability and stochastic processes play a crucial role in decision-making under uncertainty. Whether it is predicting the weather, estimating the risk of a financial investment, or designing a reliable communication system, the ability to quantify and analyze uncertainty is essential. Probability theory provides a framework for modeling and analyzing random events, enabling us to make informed decisions based on the likelihood of different outcomes. Stochastic processes, on the other hand, allow us to model systems that evolve over time in a probabilistic manner, providing valuable insights into the behavior of complex systems. In the field of engineering, probability and stochastic processes are used extensively in reliability analysis and system design. By modeling the failure rates of components and the interactions between them, engineers can evaluate the reliability of a system and identify potential weaknesses. This information is crucial for designing robust systems that can withstand uncertainties and minimize the risk of failure. Stochastic processes, such as Markov chains and queuing theory, are also used to model and analyze various engineering systems, including communication networks, manufacturing processes, and transportation systems. From a financial perspective, probability and stochastic processes are essential tools for risk management and investment analysis. Financial markets are inherently uncertain, and understanding the probabilistic nature of asset prices and returns is crucial for making informed investment decisions. By modeling the behavior of financial variables using stochastic processes, such as geometric Brownian motion or jump-diffusion processes, analysts can estimate the probabilities of different market scenarios and assess the risk associated with different investment strategies. This information is invaluable for portfolio management, option pricing, and hedging strategies. From a theoretical perspective, probability theory and stochasticprocesses provide a rigorous mathematical foundation for understanding randomness and uncertainty. Probability theory, with its axioms and theorems, allows us to reason logically about uncertain events and make precise statements about their probabilities. Stochastic processes, as mathematical models for random phenomena, provide a framework for studying the long-term behavior of systems and analyzing their statistical properties. This theoretical understanding is not only important for practical applications but also for advancing our knowledge in various scientific disciplines, including physics, biology, and social sciences. However, it is important to acknowledge the limitations of probability and stochastic processes. Firstly, these concepts are based on assumptions and simplifications that may not always hold in real-world situations. For example, many stochastic models assume that the underlying processes are stationary and independent, which may not be true in practice. 
Secondly, probability and stochastic processes can only provide probabilistic predictions and estimates, rather than deterministic outcomes. This inherent uncertainty means that even with the best models and data, there will always be a degree of unpredictability. Lastly, the accuracy of probability and stochastic models heavily relies on the availability and quality of data. In situations where data is limited or unreliable, the predictions and estimates obtained from these models may be less accurate or even misleading.

In conclusion, probability and stochastic processes are fundamental concepts with wide-ranging applications and theoretical significance. They provide a powerful framework for quantifying and analyzing uncertainty, enabling us to make informed decisions and understand the behavior of complex systems. From practical applications in engineering and finance to theoretical foundations in mathematics and science, probability and stochastic processes play a crucial role in our understanding of the world. However, it is important to recognize their limitations and the inherent uncertainties they entail. By embracing uncertainty and using probability and stochastic processes as tools for reasoning and decision-making, we can navigate the complexities of the world with greater confidence and understanding.
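As a concrete companion to the geometric Brownian motion mentioned above, here is a minimal Python sketch (an illustration added here, with made-up drift and volatility values, not figures from any particular market) that simulates GBM paths and checks the simulated mean against the known closed-form mean.

import numpy as np

def simulate_gbm(s0, mu, sigma, t_max, n_steps, n_paths, seed=0):
    # Exact log-normal discretization of dS = mu*S dt + sigma*S dW.
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    drift = (mu - 0.5 * sigma ** 2) * dt
    shocks = sigma * np.sqrt(dt) * rng.standard_normal((n_paths, n_steps))
    log_paths = np.cumsum(drift + shocks, axis=1)
    return s0 * np.exp(np.hstack([np.zeros((n_paths, 1)), log_paths]))

# Illustrative parameters: start at 100, 5% drift, 20% volatility, one year.
paths = simulate_gbm(100.0, 0.05, 0.20, 1.0, 252, 10_000)
print(paths[:, -1].mean())  # close to the theoretical mean 100*exp(0.05)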
The Carmack algorithm (map buffer reuse, for runner-style games)----------------------------Theory below--------------------------The Carmack algorithm: a game-map rendering technique devised by John Carmack, widely used in 2D side-scrolling games and mobile games.
John Carmack: one of the founders and the technical director of id Software.
A world-renowned programmer, famous for the Carmack algorithm and for 3D game engine development, he is regarded as an icon of the games industry.
He is also an all-round technical talent who now devotes himself to civilian spacecraft development, as a principal founder and technical lead of the Armadillo Aerospace team.
Maps are an indispensable element in games, and they play an especially important role in RPG, ACT and similar genres: an attractive map and smooth scrolling greatly improve the player's experience.
There are several algorithms for redrawing a scrolling map in a game. Because of handset performance limits, development schedules and other non-technical constraints, the technique has to be chosen flexibly for each situation.
This article mainly introduces how to use the OPhone API to draw the scenes of a 2D game, that is, how to render the map.
Common algorithms for map drawing and scrolling. Seamless-image scrolling: the simplest way to draw a map. No array is needed; a single seamless background image is drawn twice on the screen, which produces the simplest scrolling effect while reusing the image to save resources.
As shown in the figure below, the area inside the red dashed line is the screen. The image is pasted twice at staggered positions using an offset, and the animation effect is produced by continuously changing the offset.
Code example (the original scrolls along the X axis, so the wrap-around should be compared against the image width rather than its height; the duplicated snippet has been merged and the offset logic fixed accordingly):

// imgBack: the background Bitmap
// posX:    current scroll offset along the X axis
canvas.drawBitmap(imgBack, -posX, 0, paint);
canvas.drawBitmap(imgBack, imgBack.getWidth() - posX, 0, paint);
posX++;                                    // advance the scroll each frame
if (posX >= imgBack.getWidth()) posX = 0;  // wrap around for a seamless loop

Advantages and limitations: this algorithm is very simple. Because the background is generated by scrolling a single image repeatedly, it places few constraints on the artists, leaves room for creativity, and the visual result is good.
Consensus and Cooperation in Networked Multi-Agent Systems

Algorithms that provide rapid agreement and teamwork between all participants allow effective task performance by self-organizing networked systems.

By Reza Olfati-Saber, Member IEEE, J. Alex Fax, and Richard M. Murray, Fellow IEEE

ABSTRACT | This paper provides a theoretical framework for analysis of consensus algorithms for multi-agent networked systems with an emphasis on the role of directed information flow, robustness to changes in network topology due to link/node failures, time-delays, and performance guarantees. An overview of basic concepts of information consensus in networks and methods of convergence and performance analysis for the algorithms are provided. Our analysis framework is based on tools from matrix theory, algebraic graph theory, and control theory. We discuss the connections between consensus problems in networked dynamic systems and diverse applications including synchronization of coupled oscillators, flocking, formation control, fast consensus in small-world networks, Markov processes and gossip-based algorithms, load balancing in networks, rendezvous in space, distributed sensor fusion in sensor networks, and belief propagation. We establish direct connections between spectral and structural properties of complex networks and the speed of information diffusion of consensus algorithms. A brief introduction is provided on networked systems with nonlocal information flow that are considerably faster than distributed systems with lattice-type nearest neighbor interactions. Simulation results are presented that demonstrate the role of small-world effects on the speed of consensus algorithms and cooperative control of multivehicle formations.

KEYWORDS | Consensus algorithms; cooperative control; flocking; graph Laplacians; information fusion; multi-agent systems; networked control systems; synchronization of coupled oscillators

I. INTRODUCTION

Consensus problems have a long history in computer science and form the foundation of the field of distributed computing [1]. Formal study of consensus problems in groups of experts originated in management science and statistics in the 1960s (see DeGroot [2] and references therein). The ideas of statistical consensus theory by DeGroot reappeared two decades later in aggregation of information with uncertainty obtained from multiple sensors (Footnote 1: This is known as sensor fusion and is an important application of modern consensus algorithms that will be discussed later.) [3] and medical experts [4]. Distributed computation over networks has a tradition in systems and control theory starting with the pioneering work of Borkar and Varaiya [5] and Tsitsiklis [6] and Tsitsiklis, Bertsekas, and Athans [7] on the asynchronous asymptotic agreement problem for distributed decision-making systems and parallel computing [8].

In networks of agents (or dynamic systems), "consensus" means to reach an agreement regarding a certain quantity of interest that depends on the state of all agents. A "consensus algorithm" (or protocol) is an interaction rule that specifies the information exchange between an agent and all of its neighbors on the network. (Footnote 2: The term "nearest neighbors" is more commonly used in physics than "neighbors" when applied to particle/spin interactions over a lattice, e.g., the Ising model.)

The theoretical framework for posing and solving consensus problems for networked dynamic systems was introduced by Olfati-Saber and Murray in [9] and [10] building on the earlier work of Fax and Murray [11], [12].
The study of the alignment problem, involving reaching an agreement without computing any objective functions, appeared in the work of Jadbabaie et al. [13]. Further theoretical extensions of this work were presented in [14] and [15] with a look toward treatment of directed information flow in networks as shown in Fig. 1(a).

(Manuscript received August 8, 2005; revised September 7, 2006. This work was supported in part by the Army Research Office (ARO) under Grant W911NF-04-1-0316. R. Olfati-Saber is with Dartmouth College, Thayer School of Engineering, Hanover, NH 03755 USA. J. A. Fax is with Northrop Grumman Corp., Woodland Hills, CA 91367 USA. R. M. Murray is with the California Institute of Technology, Control and Dynamical Systems, Pasadena, CA 91125 USA. Digital Object Identifier: 10.1109/JPROC.2006.887293)

The common motivation behind the work in [5], [6], and [10] is the rich history of consensus protocols in computer science [1], whereas Jadbabaie et al. [13] attempted to provide a formal analysis of emergence of alignment in the simplified model of flocking by Vicsek et al. [16]. The setup in [10] was originally created with the vision of designing agent-based amorphous computers [17], [18] for collaborative information processing in networks. Later, [10] was used in development of flocking algorithms with guaranteed convergence and the capability to deal with obstacles and adversarial agents [19].

Graph Laplacians and their spectral properties [20]–[23] are important graph-related matrices that play a crucial role in convergence analysis of consensus and alignment algorithms. Graph Laplacians are an important point of focus of this paper. It is worth mentioning that the second smallest eigenvalue of graph Laplacians, called algebraic connectivity, quantifies the speed of convergence of consensus algorithms. The notion of algebraic connectivity of graphs has appeared in a variety of other areas including low-density parity-check codes (LDPC) in information theory and communications [24], Ramanujan graphs [25] in number theory and quantum chaos, and combinatorial optimization problems such as the max-cut problem [21].

More recently, there has been a tremendous surge of interest, among researchers from various disciplines of engineering and science, in problems related to multiagent networked systems with close ties to consensus problems. This includes subjects such as consensus [26]–[32], collective behavior of flocks and swarms [19], [33]–[37], sensor fusion [38]–[40], random networks [41], [42], synchronization of coupled oscillators [42]–[46], algebraic connectivity (to be defined in Section II-A) of complex networks [47]–[49], asynchronous distributed algorithms [30], [50], formation control for multirobot systems [51]–[59], optimization-based cooperative control [60]–[63], dynamic graphs [64]–[67], complexity of coordinated tasks [68]–[71], and consensus-based belief propagation in Bayesian networks [72], [73]. A detailed discussion of selected applications will be presented shortly.

In this paper, we focus on the work described in five key papers, namely Jadbabaie, Lin, and Morse [13], Olfati-Saber and Murray [10], Fax and Murray [12], Moreau [14], and Ren and Beard [15], that have been
instrumental in paving the way for more recent advances in the study of self-organizing networked systems, or swarms. These networked systems are comprised of locally interacting mobile/static agents equipped with dedicated sensing, computing, and communication devices. As a result, we now have a better understanding of complex phenomena such as flocking [19], or design of novel information fusion algorithms for sensor networks that are robust to node and link failures [38], [72]–[76]. Gossip-based algorithms such as the push-sum protocol [77] are important alternatives in computer science to the Laplacian-based consensus algorithms in this paper. Markov processes establish an interesting connection between the information propagation speed in these two categories of algorithms proposed by computer scientists and control theorists [78].

The contribution of this paper is to present a cohesive overview of the key results on theory and applications of consensus problems in networked systems in a unified framework. This includes basic notions in information consensus and control theoretic methods for convergence and performance analysis of consensus protocols that heavily rely on matrix theory and spectral graph theory. A byproduct of this framework is to demonstrate that seemingly different consensus algorithms in the literature [10], [12]–[15] are closely related. Applications of consensus problems in areas of interest to researchers in computer science, physics, biology, mathematics, robotics, and control theory are discussed in this introduction.

A. Consensus in Networks

The interaction topology of a network of agents is represented using a directed graph $G = (V, E)$ with the set of nodes $V = \{1, 2, \ldots, n\}$ and edges $E \subseteq V \times V$.

Fig. 1. Two equivalent forms of consensus algorithms: (a) a network of integrator agents in which agent $i$ receives the state $x_j$ of its neighbor, agent $j$, if there is a link $(i, j)$ connecting the two nodes; and (b) the block diagram for a network of interconnected dynamic systems all with identical transfer functions $P(s) = 1/s$. The collective networked system has a diagonal transfer function and is a multiple-input multiple-output (MIMO) linear system.

The neighbors of agent $i$ are denoted by $N_i = \{j \in V : (i, j) \in E\}$. According to [10], a simple consensus algorithm to reach an agreement regarding the state of $n$ integrator agents with dynamics $\dot{x}_i = u_i$ can be expressed as an $n$th-order linear system on a graph

$$\dot{x}_i(t) = \sum_{j \in N_i} \big( x_j(t) - x_i(t) \big) + b_i(t), \quad x_i(0) = z_i \in \mathbb{R}, \quad b_i(t) = 0. \qquad (1)$$

The collective dynamics of the group of agents following protocol (1) can be written as

$$\dot{x} = -Lx \qquad (2)$$

where $L = [l_{ij}]$ is the graph Laplacian of the network and its elements are defined as follows:

$$l_{ij} = \begin{cases} -1, & j \in N_i \\ |N_i|, & j = i. \end{cases} \qquad (3)$$

Here, $|N_i|$ denotes the number of neighbors of node $i$ (or out-degree of node $i$). Fig. 1 shows two equivalent forms of the consensus algorithm in (1) and (2) for agents with a scalar state. The role of the input bias $b$ in Fig. 1(b) is defined later.
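To make protocol (1)-(3) concrete, here is a minimal Python sketch (our illustration, not code from the paper; the 4-node graph and step size are arbitrary choices) that builds the Laplacian of (3) for a small undirected graph and integrates (2) with forward-Euler steps; all states converge to the average of the initial values.

import numpy as np

# Adjacency matrix of a small connected undirected graph (illustrative choice).
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))          # degree matrix
L = D - A                           # graph Laplacian

x = np.array([1.0, 3.0, 5.0, 7.0])  # initial states z_i
dt = 0.01                           # Euler step, small relative to 1/(2*max degree)
for _ in range(2000):
    x = x - dt * (L @ x)            # discretized form of dx/dt = -L x

print(x)                  # all entries approach 4.0 ...
print(x.sum() / len(x))   # ... the average of the initial values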
According to the definition of the graph Laplacian in (3), all row-sums of $L$ are zero because of $\sum_j l_{ij} = 0$. Therefore, $L$ always has a zero eigenvalue $\lambda_1 = 0$. This zero eigenvalue corresponds to the eigenvector $\mathbf{1} = (1, \ldots, 1)^T$ because $\mathbf{1}$ belongs to the null-space of $L$ ($L\mathbf{1} = 0$). In other words, an equilibrium of system (2) is a state in the form $x^* = (\alpha, \ldots, \alpha)^T = \alpha\mathbf{1}$ where all nodes agree. Based on analytical tools from algebraic graph theory [23], we later show that $x^*$ is a unique equilibrium of (2) (up to a constant multiplicative factor) for connected graphs.

One can show that for a connected network, the equilibrium $x^* = (\alpha, \ldots, \alpha)^T$ is globally exponentially stable. Moreover, the consensus value is $\alpha = \frac{1}{n}\sum_i z_i$, which is equal to the average of the initial values. This implies that irrespective of the initial value of the state of each agent, all agents reach an asymptotic consensus regarding the value of the function $f(z) = \frac{1}{n}\sum_i z_i$.

While the calculation of $f(z)$ is simple for small networks, its implications for very large networks are more interesting. For example, if a network has $n = 10^6$ nodes and each node can only talk to $\log_{10}(n) = 6$ neighbors, finding the average value of the initial conditions of the nodes is more complicated. The role of protocol (1) is to provide a systematic consensus mechanism in such a large network to compute the average. There are a variety of functions that can be computed in a similar fashion using synchronous or asynchronous distributed algorithms (see [10], [28], [30], [73], and [76]).

B. The f-Consensus Problem and Meaning of Cooperation

To understand the role of cooperation in performing coordinated tasks, we need to distinguish between unconstrained and constrained consensus problems. An unconstrained consensus problem is simply the alignment problem in which it suffices that the state of all agents asymptotically be the same. In contrast, in distributed computation of a function $f(z)$, the state of all agents has to asymptotically become equal to $f(z)$, meaning that the consensus problem is constrained. We refer to this constrained consensus problem as the f-consensus problem.

Solving the f-consensus problem is a cooperative task and requires willing participation of all the agents. To demonstrate this fact, suppose a single agent decides not to cooperate with the rest of the agents and keeps its state unchanged. Then, the overall task cannot be performed despite the fact that the rest of the agents reach an agreement. Furthermore, there could be scenarios in which multiple agents that form a coalition do not cooperate with the rest, and removal of this coalition of agents and their links might render the network disconnected. In a disconnected network, it is impossible for all nodes to reach an agreement (unless all nodes initially agree, which is a trivial case).

From the above discussion, cooperation can be informally interpreted as "giving consent to providing one's state and following a common protocol that serves the group objective." One might think that solving the alignment problem is not a cooperative task. The justification is that if a single agent (called a leader) leaves its value unchanged, all others will asymptotically agree with the leader according to the consensus protocol and an alignment is reached. However, if there are multiple leaders where two of whom are in disagreement, then no consensus can be asymptotically reached. Therefore, alignment is in general a cooperative task as well. Formal analysis of the behavior of systems that involve more than one type of agent is more complicated, particularly in presence of adversarial agents in noncooperative games [79], [80]. The focus of this paper is on cooperative multi-agent systems.

C. Iterative Consensus and Markov Chains

In Section II, we show how an iterative consensus algorithm that corresponds to the discrete-time version of system (1) is a Markov chain

$$\pi(k+1) = \pi(k) P \qquad (4)$$

with $P = I - \epsilon L$ and a sufficiently small $\epsilon > 0$. Here, the $i$th element of the row vector $\pi(k)$ denotes the probability of being in state $i$ at iteration $k$. It turns out that for any arbitrary graph $G$ with Laplacian $L$ and a sufficiently small $\epsilon$, the matrix $P$ satisfies the property $\sum_j p_{ij} = 1$ with $p_{ij} \geq 0, \forall i, j$. Hence, $P$ is a valid transition probability matrix for the Markov chain in (4). The reason matrix theory [81] is so widely used in analysis of consensus algorithms [10], [12]–[15], [64] is primarily due to the structure of $P$ in (4) and its connection to graphs. (Footnote 4: In honor of the pioneering contributions of Oscar Perron (1907) to the theory of nonnegative matrices, we refer to $P$ as the Perron Matrix of graph $G$; see Section II-C for details.) There are interesting connections between this Markov chain and the speed of information diffusion in gossip-based averaging algorithms [77], [78].

One of the early applications of consensus problems was dynamic load balancing [82] for parallel processors with the same structure as system (4). To this date, load balancing in networks proves to be an active area of research in computer science.
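Continuing the earlier Python sketch (again our illustration, with an assumed step size): for the same 4-node graph, the Perron matrix $P = I - \epsilon L$ with $\epsilon$ below $1/\Delta$ (here $\Delta = 2$) is row-stochastic and nonnegative, and the fixed-point iteration $x(k+1) = Px(k)$ reproduces average-consensus.

import numpy as np

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

eps = 0.1                      # assumed step size; needs eps <= 1/max_degree = 0.5 here
P = np.eye(4) - eps * L        # Perron matrix of the graph

print(P.sum(axis=1))           # each row sums to 1: a valid transition matrix
print((P >= 0).all())          # nonnegative entries for small enough eps

x = np.array([1.0, 3.0, 5.0, 7.0])
for _ in range(200):
    x = P @ x                  # discrete consensus iteration x(k+1) = P x(k)
print(x)                       # converges to the average, 4.0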
D. Applications

Many seemingly different problems that involve interconnection of dynamic systems in various areas of science and engineering happen to be closely related to consensus problems for multi-agent systems. In this section, we provide an account of the existing connections.

1) Synchronization of Coupled Oscillators: The problem of synchronization of coupled oscillators has attracted numerous scientists from diverse fields including physics, biology, neuroscience, and mathematics [83]–[86]. This is partly due to the emergence of synchronous oscillations in coupled neural oscillators. Let us consider the generalized Kuramoto model of coupled oscillators on a graph with dynamics

$$\dot{\theta}_i = \kappa \sum_{j \in N_i} \sin(\theta_j - \theta_i) + \omega_i \qquad (5)$$

where $\theta_i$ and $\omega_i$ are the phase and frequency of the $i$th oscillator. This model is the natural nonlinear extension of the consensus algorithm in (1), and its linearization around the aligned state $\theta_1 = \ldots = \theta_n$ is identical to system (2) plus a nonzero input bias $b_i = (\omega_i - \bar{\omega})/\kappa$ with $\bar{\omega} = \frac{1}{n}\sum_i \omega_i$ after a change of variables $x_i = (\theta_i - \bar{\omega}t)/\kappa$.

In [43], Sepulchre et al. show that if $\kappa$ is sufficiently large, then for a network with all-to-all links, synchronization to the aligned state is globally achieved for all initial states. Recently, synchronization of networked oscillators under variable time-delays was studied in [45]. We believe that the use of convergence analysis methods that utilize the spectral properties of graph Laplacians will shed light on performance and convergence analysis of self-synchrony in oscillator networks [42].
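A hedged numerical sketch of (5) in Python (our own, with an assumed all-to-all topology, coupling strength, and frequency spread, not a simulation from the paper): the Kuramoto order parameter $r = |\frac{1}{n}\sum_i e^{j\theta_i}|$ approaches 1 when the coupling is large enough.

import numpy as np

rng = np.random.default_rng(1)
n = 10
omega = rng.normal(0.0, 0.5, n)       # natural frequencies (assumed distribution)
theta = rng.uniform(-np.pi, np.pi, n)
kappa, dt = 2.0, 0.01                 # assumed coupling strength and Euler step

for _ in range(5000):
    # All-to-all version of (5): each oscillator feels sin of every phase difference.
    coupling = np.sin(theta[None, :] - theta[:, None]).sum(axis=1)
    theta = theta + dt * (kappa * coupling + omega)

r = np.abs(np.exp(1j * theta).mean())
print("order parameter r:", r)        # close to 1 indicates synchronization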
2) Flocking Theory: Flocks of mobile agents equipped with sensing and communication devices can serve as mobile sensor networks for massive distributed sensing in an environment [87]. A theoretical framework for design and analysis of flocking algorithms for mobile agents with obstacle-avoidance capabilities is developed by Olfati-Saber [19]. The role of consensus algorithms in particle-based flocking is for an agent to achieve velocity matching with respect to its neighbors. In [19], it is demonstrated that flocks are networks of dynamic systems with a dynamic topology. This topology is a proximity graph that depends on the state of all agents and is determined locally for each agent, i.e., the topology of flocks is a state-dependent graph. The notion of state-dependent graphs was introduced by Mesbahi [64] in a context that is independent of flocking.

3) Fast Consensus in Small-Worlds: In recent years, network design problems for achieving faster consensus algorithms have attracted considerable attention from a number of researchers. In Xiao and Boyd [88], design of the weights of a network is considered and solved using semi-definite convex programming. This leads to a slight increase in algebraic connectivity of a network, which is a measure of speed of convergence of consensus algorithms. An alternative approach is to keep the weights fixed and design the topology of the network to achieve a relatively high algebraic connectivity. A randomized algorithm for network design is proposed by Olfati-Saber [47] based on the random rewiring idea of Watts and Strogatz [89] that led to creation of their celebrated small-world model. The random rewiring of existing links of a network gives rise to considerably faster consensus algorithms. This is due to a multiple orders of magnitude increase in algebraic connectivity of the network in comparison to a lattice-type nearest-neighbor graph.

4) Rendezvous in Space: Another common form of consensus problems is rendezvous in space [90], [91]. This is equivalent to reaching a consensus in position by a number of agents with an interaction topology that is position induced (i.e., a proximity graph). We refer the reader to [92] and references therein for a detailed discussion. This type of rendezvous is an unconstrained consensus problem that becomes challenging under variations in the network topology. Flocking is somewhat more challenging than rendezvous in space because it requires both interagent and agent-to-obstacle collision avoidance.

5) Distributed Sensor Fusion in Sensor Networks: The most recent application of consensus problems is distributed sensor fusion in sensor networks. This is done by posing various distributed averaging problems, required to implement a Kalman filter [38], [39], approximate Kalman filter [74], or linear least-squares estimator [75], as average-consensus problems. Novel low-pass and high-pass consensus filters are also developed that dynamically calculate the average of their inputs in sensor networks [39], [93].

6) Distributed Formation Control: Multivehicle systems are an important category of networked systems due to their commercial and military applications. There are two broad approaches to distributed formation control: i) representation of formations as rigid structures [53], [94] and the use of gradient-based controls obtained from their structural potentials [52], and ii) representation of formations using the vectors of relative positions of neighboring vehicles and the use of consensus-based controllers with input bias. We discuss the latter approach here.

A theoretical framework for design and analysis of distributed controllers for multivehicle formations of type ii) was developed by Fax and Murray [12]. Moving in formation is a cooperative task and requires consent and collaboration of every agent in the formation. In [12], graph Laplacians and matrix theory were extensively used, which makes one wonder whether relative-position-based formation control is a consensus problem. The answer is yes. To see this, consider a network of self-interested agents whose individual desire is to minimize their local cost $U_i(x) = \sum_{j \in N_i} \|x_j - x_i - r_{ij}\|^2$ via a distributed algorithm ($x_i$ is the position of vehicle $i$ with dynamics $\dot{x}_i = u_i$ and $r_{ij}$ is a desired intervehicle relative-position vector).
Instead, if the agents use a gradient-descent algorithm on the collective cost $\sum_{i=1}^n U_i(x)$ using the following protocol:

$$\dot{x}_i = \sum_{j \in N_i} (x_j - x_i - r_{ij}) = \sum_{j \in N_i} (x_j - x_i) + b_i \qquad (6)$$

with input bias $b_i = \sum_{j \in N_i} r_{ji}$ [see Fig. 1(b)], the objective of every agent will be achieved. This is the same as the consensus algorithm in (1) up to the nonzero bias terms $b_i$. This nonzero bias plays no role in stability analysis of system (6). Thus, distributed formation control for integrator agents is a consensus problem. The main contribution of the work by Fax and Murray is to extend this scenario to the case where all agents are multiinput multioutput linear systems $\dot{x}_i = Ax_i + Bu_i$. Stability analysis of relative-position-based formation control for multivehicle systems is extensively covered in Section IV.

E. Outline

The outline of the paper is as follows. Basic concepts and theoretical results in information consensus are presented in Section II. Convergence and performance analysis of consensus on networks with switching topology are given in Section III. A theoretical framework for cooperative control of formations of networked multivehicle systems is provided in Section IV. Some simulation results related to consensus in complex networks including small-worlds are presented in Section V. Finally, some concluding remarks are stated in Section VI.

II. INFORMATION CONSENSUS

Consider a network of decision-making agents with dynamics $\dot{x}_i = u_i$ interested in reaching a consensus via local communication with their neighbors on a graph $G = (V, E)$. By reaching a consensus, we mean asymptotically converging to a one-dimensional agreement space characterized by the following equation:

$$x_1 = x_2 = \ldots = x_n.$$

This agreement space can be expressed as $x = \alpha\mathbf{1}$ where $\mathbf{1} = (1, \ldots, 1)^T$ and $\alpha \in \mathbb{R}$ is the collective decision of the group of agents.

Let $A = [a_{ij}]$ be the adjacency matrix of graph $G$. The set of neighbors of an agent $i$ is $N_i$, defined by

$$N_i = \{j \in V : a_{ij} \neq 0\}, \quad V = \{1, \ldots, n\}.$$

Agent $i$ communicates with agent $j$ if $j$ is a neighbor of $i$ (or $a_{ij} \neq 0$). The set of all nodes and their neighbors defines the edge set of the graph as $E = \{(i, j) \in V \times V : a_{ij} \neq 0\}$. A dynamic graph $G(t) = (V, E(t))$ is a graph in which the set of edges $E(t)$ and the adjacency matrix $A(t)$ are time-varying. Clearly, the set of neighbors $N_i(t)$ of every agent in a dynamic graph is a time-varying set as well. Dynamic graphs are useful for describing the network topology of mobile sensor networks and flocks [19].

It is shown in [10] that the linear system

$$\dot{x}_i(t) = \sum_{j \in N_i} a_{ij} \big( x_j(t) - x_i(t) \big) \qquad (7)$$

is a distributed consensus algorithm, i.e., it guarantees convergence to a collective decision via local interagent interactions. Assuming that the graph is undirected ($a_{ij} = a_{ji}$ for all $i, j$), it follows that the sum of the state of all nodes is an invariant quantity, or $\sum_i \dot{x}_i = 0$. In particular, applying this condition twice at times $t = 0$ and $t = \infty$ gives the following result:

$$\alpha = \frac{1}{n} \sum_i x_i(0).$$

In other words, if a consensus is asymptotically reached, then necessarily the collective decision is equal to the average of the initial state of all nodes. A consensus algorithm with this specific invariance property is called an average-consensus algorithm [9] and has broad applications in distributed computing on networks (e.g., sensor fusion in sensor networks).
The dynamics of system (7) can be expressed in a compact form as

$$\dot{x} = -Lx \qquad (8)$$

where $L$ is known as the graph Laplacian of $G$. The graph Laplacian is defined as

$$L = D - A \qquad (9)$$

where $D = \mathrm{diag}(d_1, \ldots, d_n)$ is the degree matrix of $G$ with elements $d_i = \sum_{j \neq i} a_{ij}$ and zero off-diagonal elements. By definition, $L$ has a right eigenvector of $\mathbf{1}$ associated with the zero eigenvalue (see Footnote 5) because of the identity $L\mathbf{1} = 0$. For the case of undirected graphs, the graph Laplacian satisfies the following sum-of-squares (SOS) property:

$$x^T L x = \frac{1}{2} \sum_{(i,j) \in E} a_{ij} (x_j - x_i)^2. \qquad (10)$$

By defining a quadratic disagreement function as

$$\varphi(x) = \frac{1}{2} x^T L x \qquad (11)$$

it becomes apparent that algorithm (7) is the same as

$$\dot{x} = -\nabla \varphi(x)$$

or the gradient-descent algorithm. This algorithm globally asymptotically converges to the agreement space provided that two conditions hold: 1) $L$ is a positive semidefinite matrix; 2) the only equilibrium of (7) is $\alpha\mathbf{1}$ for some $\alpha$. Both of these conditions hold for a connected graph and follow from the SOS property of the graph Laplacian in (10). Therefore, an average-consensus is asymptotically reached for all initial states. This fact is summarized in the following lemma.

Lemma 1: Let $G$ be a connected undirected graph. Then, the algorithm in (7) asymptotically solves an average-consensus problem for all initial states.

A. Algebraic Connectivity and Spectral Properties of Graphs

Spectral properties of the Laplacian matrix are instrumental in analysis of convergence of the class of linear consensus algorithms in (7). According to the Gershgorin theorem [81], all eigenvalues of $L$ in the complex plane are located in a closed disk centered at $\Delta + 0j$ with a radius of $\Delta = \max_i d_i$, i.e., the maximum degree of the graph. For undirected graphs, $L$ is a symmetric matrix with real eigenvalues and, therefore, the set of eigenvalues of $L$ can be ordered sequentially in an ascending order as

$$0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n \leq 2\Delta. \qquad (12)$$

The zero eigenvalue is known as the trivial eigenvalue of $L$. For a connected graph $G$, $\lambda_2 > 0$ (i.e., the zero eigenvalue is isolated). The second smallest eigenvalue of the Laplacian, $\lambda_2$, is called the algebraic connectivity of a graph [20]. Algebraic connectivity of the network topology is a measure of performance/speed of consensus algorithms [10].

Example 1: Fig. 2 shows two examples of networks of integrator agents with different topologies. Both graphs are undirected and have 0-1 weights. Every node of the graph in Fig. 2(a) is connected to its 4 nearest neighbors on a ring. The other graph is a proximity graph of points that are distributed uniformly at random in a square. Every node is connected to all of its spatial neighbors within a closed ball of radius $r > 0$. Here are the important degree information and Laplacian eigenvalues of these graphs:

$$\text{a)}\ \lambda_1 = 0,\ \lambda_2 = 0.48,\ \lambda_n = 6.24,\ \Delta = 4; \qquad \text{b)}\ \lambda_1 = 0,\ \lambda_2 = 0.25,\ \lambda_n = 9.37,\ \Delta = 8. \qquad (13)$$

In both cases, $\lambda_i < 2\Delta$ for all $i$.
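These spectral facts are easy to verify numerically. The Python sketch below (an illustrative 8-node ring with 4 nearest neighbors per node, assumed here, not the exact graph of Fig. 2) computes the Laplacian spectrum, the algebraic connectivity lambda_2, and checks the bound lambda_n <= 2*Delta from (12).

import numpy as np

n = 8
A = np.zeros((n, n))
for i in range(n):
    for k in (1, 2):                      # link each node to its 4 nearest ring neighbors
        A[i, (i + k) % n] = A[(i + k) % n, i] = 1.0

d = A.sum(axis=1)
L = np.diag(d) - A
lam = np.sort(np.linalg.eigvalsh(L))      # real eigenvalues of the symmetric Laplacian

print("lambda_1 =", round(lam[0], 6))     # 0: the trivial eigenvalue
print("lambda_2 =", round(lam[1], 4))     # algebraic connectivity > 0: graph is connected
print("lambda_n =", round(lam[-1], 4), "<= 2*Delta =", 2 * d.max())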
B. Convergence Analysis for Directed Networks

The convergence analysis of the consensus algorithm in (7) is equivalent to proving that the agreement space characterized by $x = \alpha\mathbf{1}$, $\alpha \in \mathbb{R}$ is an asymptotically stable equilibrium of system (7). The stability properties of system (7) are completely determined by the location of the Laplacian eigenvalues of the network. The eigenvalues of the adjacency matrix are irrelevant to the stability analysis of system (7), unless the network is $k$-regular (all of its nodes have the same degree $k$). The following lemma combines a well-known rank property of graph Laplacians with the Gershgorin theorem to provide a spectral characterization of the Laplacian of a fixed directed network $G$. Before stating the lemma, we need to define the notion of strong connectivity of graphs. A graph

(Footnote 5: These properties were discussed earlier in the introduction for graphs with 0-1 weights.)
Fundamentals of Reinforcement Learning: An Introduction to the Basics

Introduction:
Reinforcement Learning (RL) is an area of machine learning that focuses on teaching algorithms to make decisions based on trial and error. It is widely used in various fields such as artificial intelligence, robotics, and game theory. RL algorithms learn to make optimal decisions by interacting with an environment and receiving feedback in the form of rewards or penalties. In this article, we will explore the fundamental concepts of reinforcement learning and understand the step-by-step process involved.

1. Understanding the RL Agent and Environment:
In RL, an agent is the learner or decision-maker, and the environment is the setting in which the agent operates. The environment can be as simple as a game or as complex as self-driving cars. The agent interacts with the environment by performing actions, and the environment responds with rewards or penalties. The goal of the agent is to learn the best actions to maximize the cumulative reward over time.

2. Markov Decision Processes (MDPs):
MDPs are mathematical models that formalize decision-making problems in RL. They consist of a set of states, actions, transition probabilities, and rewards. The agent starts in a particular state, performs an action, and transitions to a new state according to the transition probabilities. The agent receives a reward based on the state-action pair. MDPs provide a framework for modeling RL problems and allow for the application of various learning algorithms.

3. Policy and Value Functions:
A policy in RL determines the behavior of the agent. It is a mapping from states to actions and represents the strategy for decision-making. The policy can be deterministic, where it directly selects an action for each state, or stochastic, where it selects actions with certain probabilities. Value functions, on the other hand, estimate the expected cumulative reward from a particular state or state-action pair. The value functions help the agent in evaluating the potential returns from different states and actions and are crucial for making optimal decisions.

4. The Q-Learning Algorithm:
Q-Learning is one of the most popular algorithms in RL. It is a model-free method that does not require prior knowledge of the environment's dynamics. The Q-Learning algorithm estimates the action-value function, or Q-values, which represent the expected cumulative reward of taking a particular action in a given state. The algorithm iteratively updates the Q-values based on the rewards received and the new information obtained from the environment. By exploring and exploiting different actions, the agent gradually learns the optimal policy.

5. Exploration and Exploitation:
In RL, exploration refers to the act of trying out new actions to gather information about the environment. Exploitation, on the other hand, involves using the learned knowledge to maximize the cumulative reward. Balancing exploration and exploitation is a fundamental challenge in RL. Too much exploration may result in inefficient learning, while too much exploitation may lead to sub-optimal solutions. Various exploration strategies, such as epsilon-greedy and softmax, are used to strike a balance. A short sketch combining points 4 and 5 follows below.
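Below is a minimal Python sketch of tabular Q-learning with epsilon-greedy exploration (the five-state chain environment and the learning-rate, discount, and exploration values are assumptions made for this illustration, not part of the original text).

import numpy as np

# Assumed toy environment: a 5-state chain; action 0 moves left, action 1 moves
# right; reaching the rightmost state pays reward 1 and ends the episode.
N_STATES = 5

def step(state, action):
    nxt = max(state - 1, 0) if action == 0 else min(state + 1, N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, 2))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

for _ in range(500):                   # training episodes
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        a = int(rng.integers(2)) if rng.random() < epsilon else int(Q[s].argmax())
        s2, r, done = step(s, a)
        # Q-learning update driven by the temporal-difference error.
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1))  # greedy policy: 1 (move right) in every non-terminal state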
6. Temporal Difference Learning:
Temporal Difference (TD) learning is another key concept in RL. It combines elements of both Monte Carlo methods (which learn from complete episodes) and dynamic programming (which learns from intermediate states). TD learning allows the agent to update its value function estimates after experiencing only a partial trajectory, making it more computationally efficient. TD algorithms, such as SARSA and Q-Learning, use the temporal-difference error to update the value function estimates incrementally.

7. Deep Reinforcement Learning:
Deep Reinforcement Learning (DRL) combines RL with deep neural networks to handle high-dimensional state spaces. Traditional RL algorithms may struggle with complex problems due to the curse of dimensionality. DRL uses deep neural networks as function approximators to learn directly from raw sensory inputs. Techniques like Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO) have achieved remarkable results in challenging domains such as video games and robotics.

Conclusion:
Reinforcement Learning is an exciting field that enables machines to learn optimal decision-making strategies through interaction with the environment. By understanding the fundamental concepts, such as agents, environments, policies, value functions, and learning algorithms like Q-Learning, exploration-exploitation trade-offs, temporal difference learning, and deep reinforcement learning, we can build intelligent systems capable of learning from experiences and adapting to new situations. RL has the potential to revolutionize various industries, making it a domain worth exploring and studying further.
Markov's "Probability Calculus" (practical bilingual edition)

In the world of mathematics, Andrey Markov's work on "Probability Calculus" is highly regarded. His book, published in 1912, was one of the first to systematically study the theory of stochastic processes.
Markov's work was groundbreaking because it introduced the concept of a Markov chain, a mathematical system in which the future state of a process depends only on the current state and not on the sequence of events that preceded it.
This idea was revolutionary because it provided a way to model and analyze complex systems that are influenced by random events. Today, Markov chains are used in a wide variety of fields, including physics, finance, economics, and computer science.
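To see the memoryless property in action, here is a small Python sketch (the two-state weather chain and its transition probabilities are invented for illustration): the next state is sampled from a distribution that depends only on the current state, and the long-run visit frequencies approach the stationary distribution.

import numpy as np

# Assumed two-state weather chain: state 0 = sunny, state 1 = rainy.
# P[i, j] is the probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state, counts = 0, np.zeros(2)
for _ in range(100_000):
    # The next state depends only on the current state: the Markov property.
    state = rng.choice(2, p=P[state])
    counts[state] += 1

print(counts / counts.sum())  # empirical frequencies, about [0.833, 0.167]
# The exact stationary distribution solves pi = pi P: pi = (5/6, 1/6).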
ABB Network Partner

Relay Configuration Tool
Specification for RE_ 54_ Configurations
Editing the RE_ 54_ Relay Terminal Configurations
Getting started
Libraries
Logical POUs
Program Organisation Unit (POU)
Physical hardware
Configuration
Resource
Hardware version
Analogue channels
Binary inputs
Measurements
Condition monitoring
Tasks
Programs and tasks
Task interval
Global variables
Declaring variables
Variables declared as global
Compiling the project
Main Configuration Rules for RE_ 54_
General
Binary inputs
Explicit feedback
Measurement function blocks
Warnings
Execution order
F-key
APPENDIX A
APPENDIX B
APPENDIX C

About this manual
This guideline describes in general the procedures for configuring the REF 54_ feeder terminals or REM 543 machine terminals correctly with the Relay Configuration Tool. Furthermore, chapter 4 provides some practical tips as well as recommendations and restrictions for doing the configuration. Please note that the examples and dialogue pictures of the Relay Configuration Tool in this manual refer to REF 54_ feeder terminals.

1. Relay Configuration Tool
The Relay Configuration Tool, which is a standard programming system for RED 500 devices, is used for configuring the protection, control, condition monitoring, measurement and logic functions of the feeder terminal. The tool is based on the IEC 1131-3 standard, which defines the programming language for relay terminals, and includes the full range of IEC features. The PLC logics are programmed with Boolean functions, timers, counters, comparators and flip-flops. The programming language described in this manual is a function block diagram (FBD) language.

2. Specification for RE_ 54_ Configurations
Prior to starting the configuration of a relay terminal, the Specification for Relay Configuration is to be filled out. The specification can be found as appendix B for REF 54_ and appendix C for REM 543 at the end of this manual. The purpose of the specification is to provide the technical information required for the proper configuration of the relay terminals.

3. Editing the RE_ 54_ Relay Terminal Configurations

3.1. Getting started
Start up the tool by double clicking the icon. After adding a new object as an empty configuration to the CAP505 environment (refer to the CAP505 Operator's Manual, 1MRS750838-RUM), the program opens an empty project template (see figure below) with a toolbar at the top. The next step is to build the project tree structure by inserting libraries, program organisation units (POUs) and target specific items to the project tree.

The project tree editor is a window in which the whole project is represented as a tree. The project tree is illustrated with several icons. Most of the icons represent a file of the project, and different looking icons represent different types of files. The tree always contains 4 subtrees: Libraries, Data Types, Logical POUs and Physical Hardware.

Fig. 3.1-1. The project tree with its four subtrees.

The project tree is the main tool for editing the project structure. Editing the project structure means inserting POUs or worksheets to the project structure or deleting existing ones.
The editors for editing the data of the code bodies and the variable declarations can be called by double clicking the corresponding object icons. If you intend to edit an old project and want to keep the original project unchanged, note that the project has to be saved under a new name ("save as") before making any changes.

3.1.1. Libraries
Before editing any worksheets of POUs, the whole project tree structure must be built. The function block library (protection, control, measurement, condition monitoring and standard functions) needed in the relay configuration is to be inserted into the "Libraries" subtree.

Before inserting the library to the project, all worksheets are to be closed; otherwise the I/O description of function blocks will be confused. The programs, function blocks (e.g. NOC3Low, the low-set stage of non-directional three-phase overcurrent protection) and functions of the library can be reused in the new project which is edited.

The library, e.g. REFLIB01 for REF54_ (see the figure below), includes all available function blocks. Therefore, attention is to be paid to which function blocks can be used in a specific configuration. For example, if the protection library PR116005 (1MRS116005) has been ordered, the auto-reclose function block AR5Func cannot be inserted to the project since AR5Func is not included in that library.

Fig. 3.1.1.-1. Libraries for REF54_ and REM 543.

3.1.2. Logical POUs
In the project tree editor and in the library editor, the "Logical POUs" subtree represents a directory for all POUs related to the project. A maximum of 20 POUs can be inserted to the subtree. The figure below shows a "Logical POUs" subtree with 6 POUs; "Measure", "Prot_Me" and "Co_CM_A" represent FBD (Function Block Diagram) programs, and "Horizon", "CONDIS" and "SWG" are FBD function blocks.

Fig. 3.1.2.-1. "Logical POUs" subtree with 6 POUs.

3.1.3. Program Organisation Unit (POU)
Each Program Organisation Unit, a POU, consists of several worksheets: a POU description worksheet for comments, a variable worksheet for variable declarations and a code body worksheet for the relay configuration. The name of each worksheet is indicated beside the corresponding icon, and the *-symbol after the name of a worksheet indicates that the worksheet has not been compiled yet.

Fig. 3.1.3.-1. Program organisation unit with three worksheets.

The description worksheet (e.g. ProtectT), illustrated below, is for describing the POU or the configuration element. The worksheet is named by adding a 'T' to the name of the POU.

Fig. 3.1.3.-2. Description worksheet.

The variable worksheet (e.g. ProtectV) is for the variable declaration. The worksheet is named by adding a 'V' to the name of the POU.

Fig. 3.1.3.-3. Variable declaration worksheet.

A code body worksheet (e.g. Protect, Measure) is for a code body declaration in the form of an FBD, a Function Block Diagram. All configurations for relays of the RED 500 platform are made in the graphical FBD language. A code body programmed in the FBD language is composed of functions and function blocks that are connected to each other using variables or lines. An output of a function block can be connected to the output of another function block only via an OR gate (refer to section 4.1).

Fig. 3.1.3.-4. Code body declaration in FBD language.

Even though the tool permits adding several graphical worksheets under one POU, only one worksheet is recommended to be used per POU.
If more space is needed for a configuration, the worksheet size can be increased or, if more I/O points are required, the functionality can be divided into several POUs. Note that one POU has altogether 511 I/O points available for function blocks. For example, the function block NOC3High in the figure above includes 15 I/O points, which means that there are still 496 I/O points available.

3.1.4. Physical hardware
In the project tree editor, the physical hardware is represented as a subtree (see below) after the hardware of the relay terminal, i.e. Configuration, Resource and Tasks, has been defined.

Fig. 3.1.4.-1. Example of a subtree for the physical hardware.

The configuration elements available in the "Physical Hardware" subtree may differ from configuration to configuration. Each terminal of the RED 500 platform can be configured separately.

3.1.4.1. Configuration
The name and target of the configuration are first defined in the dialogue Properties/Configuration.

Fig. 3.1.4.1.-1. Defining the configuration type.

3.1.4.2. Resource
The configuration "Conf" above is for the REF 543 resource (the selected processor type), which is defined in the dialogue Properties/Resource. The resource must also be given a name.

Fig. 3.1.4.2.-1. Defining the processor type.

Hardware version
After selecting the processor type, click "Settings..." in the dialogue Properties/Resource (see figure above) to define the correct hardware version.

Note! After selecting the correct hardware version (Relay Variant; see figure below), do not click OK but wait until the next dialogue opens and select "Analog Channels" (see figure 3.1.4.2.-3.).

Fig. 3.1.4.2.-2. Defining the hardware version.

Analogue channels
In the dialogue Settings/Analog Channels, click each channel in turn to select the measuring device and signal type for the channels used, and select "Not in use" for the other channels. Furthermore, the technical data and measurements for the selected channels are to be completed correctly before the configuration is used in a real application.

Fig. 3.1.4.2.-3. Defining the analogue channels.

Technical data

Fig. 3.1.4.2.-4. Defining the rated values for the selected channel.

Measurements
For information about the special measurements required for each function block, refer to the Technical Descriptions of Functions (CD-ROM 1MRS 750889-MCD).

Phase currents and voltages
If the signal type selected for an analogue channel is to be shown in the MIMIC view via the MMIDATA_ function block (MIMIC dynamic data point), the true RMS mode must be selected in the "Special Measurements" dialogue. Moreover, in case the Inrush3 function block (3-phase transformer inrush and motor start-up current detector) is to be used, the 2nd harmonic restraint must be selected for the analogue channels (IL1, IL2, IL3) used.

Fig. 3.1.4.2.-5. Selecting the required special measurement modes for phase current measurement.

Neutral current
When e.g. the DEF2_ function block (directional earth-fault protection) is going to be used, intermittent earth-fault protection must be selected for the channel via which the current Io is measured. Unless intermittent earth-fault protection has been chosen, the following configuration error indication will appear on the display of the relay terminal (# denotes the number of the analogue channel in question):

System: SUPERV
Ch # error
Fig. 3.1.4.2.-6. Selecting the required special measurement modes for neutral current measurement.

Frequency
When, for example, any of the function blocks MEFR_ (system frequency measurement), SCVCSt_ or Freq1St_ is in use, frequency measurement must be selected for the channel via which the voltage is measured for frequency measurement (for example: Channel 10, Voltage Transformer 4, Signal type U3 / Measurements button in the "Configuration of REF543" dialogue).

Fig. 3.1.4.2.-7. Selecting the required special measurement modes for frequency measurement.

Virtual channels
In case no measuring devices are applied for measuring residual voltage (Uo) and neutral current (Io), the virtual channels 11 and 12 must be used.

Fig. 3.1.4.2.-8. Using virtual channels 11 and 12 in case no measuring devices are applied for measuring Io and Uo.

In case of the virtual channel 12, voltage measuring devices must be associated with phase voltages.

Fig. 3.1.4.2.-9. Associating voltage measuring devices with phase voltages.

After a compiled configuration is downloaded to a relay and the relay is started (storing and resetting are done), the relay will internally check whether the analogue channels are correctly configured regarding the analogue inputs of function blocks. If the connected channels have been configured incorrectly, the ERR output signal of the specific function block goes active and the event E48 is sent. Furthermore, either the event E13 or E5 is sent depending on the function block in question.

Binary inputs
The filter time is set for each binary input of the relay terminal via the resource settings dialogue "Binary Inputs". Inversion of the inputs can also be set. For further information refer to the Technical Reference Manual of REF 54_ or REM 543.

Fig. 3.1.4.2.-10. Defining the binary inputs.

Measurements
When the MEPE7 function block (power and energy measurement) is used, the measuring mode must be selected via the resource settings dialogue "Measurements".

Fig. 3.1.4.2.-11. Selecting the measuring mode for power and energy measurement.

Condition monitoring
Values for the circuit-breaker wear function blocks CMBWEAR 1 and 2 can be set via the resource settings dialogue "Condition Monitoring".

Fig. 3.1.4.2.-12. Setting the values for circuit-breaker wear.

3.1.4.3. Tasks

Programs and tasks
Programs are associated with tasks via the dialogues Properties/Task and Properties/Program. Cyclic tasks are activated within a specific time interval and the program is executed periodically. The two dialogues below illustrate the association of a program type (Prot_Me) with a task (Task1) (see also figure 3.1.4.-1. in section "Physical hardware").

Fig. 3.1.4.3.-1. Naming a cyclic task.

Fig. 3.1.4.3.-2. Associating the selected task with the desired program type.

Task interval
Generally, the operation accuracy is increased when the task speed is increased, but at the same time, the load of the microprocessors is increased as well. Although the task speed can be freely chosen with the tool, it is necessary to determine a maximum task execution interval for each function block; otherwise the operation accuracy and operate times for protection functions cannot be guaranteed. The maximum task execution interval is based on test results and has also been used in the type testing of the function blocks.
The recommended task execution interval guaranteed by the manufacturer can be found in section Technical Data in the technical description of each function block.

According to the standard, the Relay Configuration Tool includes the possibility of defining the tasks on two different levels:
1) each POU (= program organisation unit) can be tied to a separate task
2) a separate function block inside a POU can be tied to any task

However, alternative 2) is not yet supported in the RED environment, which means that if a separate function block inside a POU is given a separate task definition, it will be ignored when transferred to the relay. This means that when the function blocks are being placed in different POUs, not only the category of the function (protection, control, etc.) but also the maximum task execution interval should be considered, since all function blocks inside a POU will run at the same speed.

For further information about the microprocessor loads and task execution intervals of function blocks, refer to the manual "Technical Descriptions of Functions, Introduction" (CD-ROM: Technical Descriptions of Functions, 1MRS750889-MCD).

The task execution interval for each task is defined via the dialogue Properties/Task (click "Settings..."). For example, the task execution interval for Task1 in the figure below is defined as 10 ms, which means that the program Prot_Me is run 100 times per second.

Fig. 3.1.4.3.-3. Setting the task execution interval for a program.

3.2. Global variables
The physical contacts of RE_ 54_ are defined in the "Global Variables" worksheet. Declarations for the physical contacts are automatically defined when the correct hardware version of RE_ 54_ is selected. Declarations for the analogue channels are created after the analogue channel settings defined in the resource settings dialogue have been approved.

Fig. 3.2-1. Global Variables worksheet.

3.3. Declaring variables
At its beginning, each programmable controller POU type declaration is to contain at least one declaration part that specifies the types of the variables used in the organisation unit. The declaration part shall have the textual form of one of the keywords VAR_INPUT, VAR_OUTPUT, VAR and VAR_EXTERNAL, followed by one or more declarations separated by semicolons and terminated by the keyword END_VAR. All the comments you write must be enclosed in parentheses and asterisks:

(**********************************)
(* Variable declaration            *)
(* of REF 541                      *)
(**********************************)

Caution is required regarding comments and variable declarations. The following code example will be compiled successfully, but because of the non-closed comment the END_VAR - VAR_EXTERNAL couple will be excluded, and thus the channel numbers become local variables of the POU and they get the initial value zero.

VAR (*AUTOINSERT*)
NOC3Low_1 : NOC3Low; (* Erroneous non-closed comment *
END_VAR
VAR_EXTERNAL
U12 : SINT; (* Measuring channel 8 *)
U23 : SINT; (* Measuring channel 9 *)
U31 : SINT; (* Measuring channel 10 *)
END_VAR

Three examples of creating the textual declaration for different kinds of graphical programs are given below.

EXAMPLE 1.
• POU type: FBD program
• Function block type declaration:

VAR
INPUT1 : BOOL := FALSE;
INPUT2 : BOOL := FALSE;
INPUT3 : BOOL := FALSE;
OUTPUT : BOOL := FALSE;
END_VAR
Fig. 3.3-1. Function block image.

EXAMPLE 2.
• POU type: NOC3Low, manufacturer dependent function block
• Function block type declaration:

VAR_INPUT
IL1 : SINT := 0;        (* Analogue channel *)
IL2 : SINT := 0;        (* Analogue channel *)
IL3 : SINT := 0;        (* Analogue channel *)
BS1 : BOOL := FALSE;    (* Blocking signal *)
BS2 : BOOL := FALSE;    (* Blocking signal *)
TRIGG : BOOL := FALSE;  (* Triggering *)
GROUP : BOOL := FALSE;  (* Grp1/Grp2 select *)
DOUBLE : BOOL := FALSE; (* Doubling signal *)
BSREG : BOOL := FALSE;  (* Blocking registering *)
RESET : BOOL := FALSE;  (* Reset signal *)
END_VAR
VAR_OUTPUT
START : BOOL := FALSE;  (* Start signal *)
TRIP : BOOL := FALSE;   (* Trip signal *)
CBFP : BOOL := FALSE;   (* CBFP signal *)
ERR : BOOL := FALSE;    (* Error signal *)
END_VAR

Fig. 3.3-2. Function block image of NOC3Low.

EXAMPLE 3.
• POU type: Configurer dependent FBD function block CONDIS
• Function block type declaration:

Fig. 3.3-3. Type declaration of the configurer-made function block CONDIS.

Fig. 3.3-4. Use of the configurer-made function block CONDIS.

Fig. 3.3-5. Contents of the CONDIS function block.

In example 3 above, part of the configuration has been separated into a configurer-made function block called CONDIS. Such function blocks may not be given names already belonging to library function blocks or IEC standard function blocks. The function block CONDIS has been used like any other function block in the graphical program. The order of inputs of a function block that has been inserted to a worksheet may not be changed. It must also be remembered that a function block with an instance named by the configurer can only be inserted to the project once.

3.4. Variables declared as global
The range of validity of the declarations included in the declaration part shall be "local" to the POU in which the declaration part is contained. One exception to this rule are variables that have been declared to be "global". Such variables are only accessible to a POU via a VAR_EXTERNAL declaration. The type of a variable declared in a VAR_EXTERNAL block shall agree with the type declared in the VAR_GLOBAL block of the associated program, configuration or resource.

Fig. 3.4-1. Local and global variables.
Variables declarationV ARIABLE WORKSHEET of logical POU*********************************************************************V ARTRIPPING:BOOL:= FALSE;BLOCK:BOOL:= TRUE;TMP1:BOOL:= FALSE;END_VARV AR_EXTERNALPS1_4_HSPO1 :BOOL; (* Double pole high speed power output *)(* 4.1/10,11,12,13 *)PS1_4_HSPO2 :BOOL; (* Double pole high speed power output *)(* X4.1/15,16,17,18 *)PS1_4_HSPO3 :BOOL; (* Double pole high speed power output *)(* X4.1/6,7,8,9 *)END_VARV AR_EXTERNALTCS1_ALARM:BOOL;END_VAR********************************************************************* GLOBAL V ARIABLE WORKSHEET*********************************************************************V AR_GLOBALPS1_4_HSPO1AT %QX 1.1.2 :BOOL := FALSE;(* Double pole high speed power output X4.1/10,11,12,13 *) PS1_4_HSPO2AT %QX 1.2.2 :BOOL := FALSE;(* Double pole high speed power output X4.1/15,16,17,18 *) PS1_4_HSPO3AT %QX 1.3.2 :BOOL := FALSE;(* Double pole high speed power output X4.1/6,7,8,9 *) END_VARV AR_GLOBALTCS1_ALARM:BOOL:= FALSE;END_VAR*********************************************************************piling the projectThe “Build Project” mode in the “Make” menu is used to compile the whole project forthe first time after editing, which means compiling all POUs, global variables, resourcesetc., whereas the “Make” mode can be used to compile the worksheets that have beenedited. The changed worksheets are marked with an asterisk in the project tree editor.“Make” is the standard mode for compiling and should normally be used when you havefinished editing.In the Relay ConfigurationTool you can view the execution order of the different func-tions or function blocks in your worksheet. The execution order corresponds to theintermediate PLC code created while compiling. Note that the execution order can onlybe seen if you have already compiled the worksheet using the menu item “CompleteWorksheet” in the submenu “Make”.By editing the mwt.ini file, one can affect the execution order. The line below can beadded to section [MAKE]:SH_EXECUTION_ORDER_TOP_BOTTOMSets the order of execution of functions and function blocks in graphical editor0Sets the execution order from left to right1Sets the execution order from top to bottomThe default value is …0“. The execution order can only be changed when compiling aworksheet again.4.Main Configuration Rules for RE_ 54_4.1.General Make sure that all analogue signals are connected and all necessary inputs and outputs are wired. Note that the outputs of function blocks may not be connected together. There are also many other FBD programming rules to follow. One of the most typical rules is not to use the “wired-OR” connection. All signals that are connected to the same output signal (both output relays and horizontal communication outputs) must be connected via an OR-gate (see figure below).Fig. 4.1-1. Use of an explicit Boolean OR gate (on the right).4.2.Binary inputs Do not write binary inputs - like BIO2_7_BI2.Fig. 4.2-1. Writing global binary inputs is not allowed.Binary inputs can be read from several locations.I>I>>TRIP TRIP PS1_4_HSPO1PS1_4_HSPO1I>I>>TRIP TRIP PS1_4_HSPO1"wired-OR" structure is not allowed an explicit Boolean "OR" block is required instead ORFig. 4.2-2. Correct way of using global binary inputs.4.3.Explicit feedbackAn explicit feedback loop may not be used. In case the tool gives a warning about thefeedback loop after compiling, the loop must be broken up by connecting the input ofthe function block and the output of another function block separately to the auxiliaryvariable.Fig. 4.3-1. 
Explicit feedback loop is not allowed.

Fig. 4.3-2. Explicit feedback via the auxiliary variable FEEDBACK.

4.4. Measurement function blocks
All the measurement inputs of the function blocks must be connected. If, for example, the three-phase current measurement function block MECU3A is implemented with only two phase currents, one of the available currents must also be connected to the remaining third input of MECU3A. Unless something is connected to the third current input, the function block will attempt to use sensor channel 1 instead of the empty channel, and if no measurements have been defined for channel 1, a supervision error will result.

4.5. Warnings
In case of the indication "Warning: Instance 'xx' is never used" in connection with a compilation, remove the corresponding instances of the function block from the variables worksheet of the POU. The tool will not give a separate warning for unused variables, which is why they need to be removed manually. When a global variable is added to a sheet by copy-paste, the global radio button has to be chosen (see figure below); otherwise the variable becomes a local variable of the POU, which is due to the auto-insert feature of the tool (global variable = VAR_EXTERNAL, local variable = VAR).

Fig. 4.5-1. Copying a global variable to a worksheet of a POU.

4.6. Execution order
Check the execution order after the compilation. The connection of simple variables to each other generates code whose execution order, in relation to the calling sequence of POUs, cannot be seen by means of the Layout Execution Order function.

Fig. 4.6-1. The INTERLOCKING variable is updated (TMP1) during the task execution cycle (see the execution order 1, 2, 3).

In addition, the execution order may be illogical or even incorrect considering the functionality.

Fig. 4.6-2. The explicit feedback (TMP1) delays the updating of the INTERLOCKING variable by one task execution cycle.

If the MOVE function is used instead of a direct connection, the execution order can be used to determine whether the result is the desired one.

4.7. F-key
The freely programmable F-key is declared as VAR_GLOBAL in the global variable worksheet as follows:

    F001V021 : BOOL := 0; (* Free configuration point (F-key) *)

The F-key parameter can be added to the configuration logic as an external variable (VAR_EXTERNAL). Similar parameters that can be inserted into a relay logic are listed below:

    F001V011 : BOOL  := 0; (* (W) Resetting of operation indications *)
    F001V012 : BOOL  := 0; (* (W) Resetting of operation indications & latched output signals *)
    F001V013 : BOOL  := 0; (* (W) Resetting of operation indications, latched output signals & waveform memory *)
    F001V020 : BOOL  := 0; (* (W) Resetting of accumulated energy measurement *)
    F001V021 : BOOL  := 0; (* (R, W) Free configuration point (F-key) *)
    F002V004 : BOOL  := 0; (* (R, W) Control: Interlocking bypass mode for all control objects (Enables all) *)
    F002V005 : USINT := 0; (* (W) Control: Recent control position *)
    F002V006 : BOOL  := 0; (* (W) Control: Virtual LON input poll status *)
    F900V251 : BOOL  := 0; (* (W) Control: Execute all command for selected objects (inside module) *)
    F900V252 : BOOL  := 0; (* (W) Control: Cancel all command for selected objects (inside module) *)
    F000V251 : BOOL  := 0; (* (W) Control: Execute all command for selected objects (inside module) *)
    F000V252 : BOOL  := 0; (* (W) Control: Cancel all command for selected objects (inside module) *)
Introduction: original English text

Digital image processing and pattern recognition techniques for the detection of cancer

Cancer is the second leading cause of death for both men and women in the world, and is expected to become the leading cause of death in the next few decades. In recent years, cancer detection has become a significant area of research activity in the image processing and pattern recognition community. Medical imaging technologies have already made a great impact on our capabilities of detecting cancer early and diagnosing the disease more accurately. In order to further improve the efficiency and accuracy of diagnosis and treatment, image processing and pattern recognition techniques have been widely applied to the analysis and recognition of cancer, evaluation of the effectiveness of treatment, and prediction of the development of cancer. The aim of this special issue is to bring together researchers working on image processing and pattern recognition techniques for the detection and assessment of cancer, and to promote research in image processing and pattern recognition for oncology. A number of papers were submitted to this special issue and each was peer-reviewed by at least three experts in the field. From these submitted papers, 17 were finally selected for inclusion in this special issue. These selected papers cover a broad range of topics that are representative of the state of the art in computer-aided detection or diagnosis (CAD) of cancer. They cover several imaging modalities (such as CT, MRI, and mammography) and different types of cancer (including breast cancer, skin cancer, etc.), which we summarize below.

Skin cancer is the most prevalent among all types of cancers. Three papers in this special issue deal with skin cancer. Yuan et al. propose a skin lesion segmentation method based on region fusion and narrow-band energy graph partitioning. The method can deal with challenging situations in skin lesion images, such as topological changes, weak or false edges, and asymmetry. Tang proposes a snake-based approach using multi-direction gradient vector flow (GVF) for the segmentation of skin cancer images. A new anisotropic diffusion filter is developed as a preprocessing step. After the noise is removed, the image is segmented using a GVF snake. The proposed method is robust to noise and can correctly trace the boundary of the skin cancer even if there are other objects near the skin cancer region. Serrano et al. present a method based on Markov random fields (MRF) to detect different patterns in dermoscopic images. Different from previous approaches to automatic dermatological image classification with the ABCD rule (Asymmetry, Border irregularity, Color variegation, and Diameter greater than 6 mm or growing), this paper follows a new trend of looking for specific patterns in lesions which could lead physicians to a clinical assessment.

Breast cancer is the most frequently diagnosed cancer other than skin cancer and a leading cause of cancer deaths in women in developed countries. In recent years, CAD schemes have been developed as a potentially efficacious solution to improving radiologists' diagnostic accuracy in breast cancer screening and diagnosis. The predominant approach of CAD in breast cancer, and in medical imaging in general, is to use automated image analysis to serve as a "second reader", with the aim of improving radiologists' diagnostic performance.
Thanks to intense research and development efforts, CAD schemes have now been introduced in screening mammography, and clinical studies have shown that such schemes can result in higher sensitivity at the cost of a small increase in recall rate. In this issue, we have three papers in the area of CAD for breast cancer. Wei et al. propose an image-retrieval-based approach to CAD, in which retrieved images similar to the one being evaluated (called the query image) are used to support a CAD classifier, yielding an improved measure of malignancy. This involves searching a large database for the images that are most similar to the query image, based on features that are automatically extracted from the images. Dominguez et al. investigate the use of image features characterizing the boundary contours of mass lesions in mammograms for classification of benign vs. malignant masses. They study and evaluate the impact of these features on diagnostic accuracy with several different classifier designs when the lesion contours are extracted using two different automatic segmentation techniques. Schaefer et al. study the use of thermal imaging for breast cancer detection. In their scheme, statistical features are extracted from thermograms to quantify bilateral differences between left and right breast regions, which are used subsequently as input to a fuzzy-rule-based classification system for diagnosis.

Colon cancer is the third most common cancer in men and women, and also the third most common cause of cancer-related death in the USA. Yao et al. propose a novel technique to detect colonic polyps using CT colonography. They use ideas from geographic information systems to employ topographical height maps, which mimic the procedure used by radiologists for the detection of polyps. The technique can also be used to measure the size of polyps consistently. Hafner et al. present a technique to classify and assess colonic polyps, which are precursors of colorectal cancer. The classification is performed based on the pit pattern in zoom-endoscopy images. They propose a novel color wavelet cross co-occurrence matrix which employs the wavelet transform to extract texture features from color channels.

Lung cancer occurs most commonly between the ages of 45 and 70 years, and has one of the worst survival rates of all the types of cancer. Two papers are included in this special issue on lung cancer research. Pattichis et al. evaluate new mathematical models that are based on statistics, logic functions, and several statistical classifiers to analyze reader performance in grading chest radiographs for pneumoconiosis. The technique can potentially be applied to the detection of nodules related to early stages of lung cancer. El-Baz et al. focus on the early diagnosis of pulmonary nodules that may lead to lung cancer. Their methods monitor the development of lung nodules in successive low-dose chest CT scans. They propose a new two-step registration method to align globally and locally two detected nodules. Experiments on a relatively large data set demonstrate that the proposed registration method contributes to precise identification and diagnosis of nodule development.

It is estimated that almost a quarter of a million people in the USA are living with kidney cancer and that the number increases by 51,000 every year. Linguraru et al.
propose a computer-assisted radiology tool to assess renal tumors in contrast-enhanced CT for the management of tumor diagnosis and response to treatment. The tool accurately segments, measures, and characterizes renal tumors, and has been adopted in clinical practice. Validation against manual tools shows high correlation.

Neuroblastoma is a cancer of the sympathetic nervous system and one of the most malignant diseases affecting children. Two papers in this field are included in this special issue. Sertel et al. present techniques for classification of the degree of Schwannian stromal development as either stroma-rich or stroma-poor, which is a critical decision factor affecting the prognosis. The classification is based on texture features extracted using co-occurrence statistics and local binary patterns. Their work is useful in helping pathologists in the decision-making process. Kong et al. propose image processing and pattern recognition techniques to classify the grade of neuroblastic differentiation on whole-slide histology images. The presented technique is promising for facilitating the grading of whole-slide images of neuroblastoma biopsies with high throughput.

This special issue also includes papers which are not directly focused on the detection or diagnosis of a specific type of cancer but deal with the development of techniques applicable to cancer detection. Ta et al. propose a framework of graph-based tools for the segmentation of microscopic cellular images. Based on the framework, automatic or interactive segmentation schemes are developed for color cytological and histological images. Tosun et al. propose an object-oriented segmentation algorithm for biopsy images for the detection of cancer. The proposed algorithm uses a homogeneity measure based on the distribution of the objects to characterize tissue components. Colon biopsy images were used to verify the effectiveness of the method; the segmentation accuracy was improved as compared to its pixel-based counterpart. Narasimha et al. present a machine-learning tool for automatic texton-based joint classification and segmentation of mitochondria in MNT-1 cells imaged using an ion-abrasion scanning electron microscope. The proposed approach requires minimal user intervention and can achieve high classification accuracy. El Naqa et al. investigate intensity-volume histogram metrics as well as shape and texture features extracted from PET images to predict a patient's response to treatment. Preliminary results suggest that the proposed approach could potentially provide better tools and discriminant power for functional imaging in clinical prognosis.

We hope that the collection of the selected papers in this special issue will serve as a basis for inspiring further rigorous research in CAD of various types of cancer. We invite you to explore this special issue and benefit from these papers. On behalf of the Editorial Committee, we take this opportunity to gratefully acknowledge the authors and the reviewers for their diligence in abiding by the editorial timeline. Our thanks also go to the Editors-in-Chief of Pattern Recognition, Dr. Robert S. Ledley and Dr. C. Y. Suen, for their encouragement and support for this special issue.
Source: Gümüs, A. Taskin, and A. Fuat Güneri. "Multi-echelon inventory management in supply chains with uncertain demand and lead times: literature review from an operational research perspective." Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture 221.10 (2007): 1553-1570.

Original text

Multi-echelon inventory management in supply chains with uncertain demand and lead times: literature review from an operational research perspective

A. Taskin Gümüs and A. Fuat Güneri

Abstract: Historically, the echelons of the supply chain (warehouses, distributors, retailers, etc.) have been managed independently, buffered by large inventories. Increasing competitive pressures and market globalization are forcing firms to develop supply chains that can quickly respond to customer needs. To remain competitive and decrease inventory, these firms must use multi-echelon inventory management interactively, while reducing operating costs and improving customer service. The current paper reviews the literature addressing multi-echelon inventory management in supply chains from 1996 to 2005. The behavior of the papers with respect to demand and lead-time uncertainty is the key analysis point of the literature review presented here, and it is conducted from an operational research point of view. Finally, directions for future research are suggested.

Keywords: supply chain, multi-echelon inventory management, demand uncertainty, lead-time uncertainty

1 INTRODUCTION

Supply chain management (SCM) is an integrative approach for planning and control of materials and information flows with suppliers and customers, as well as between different functions within a company. This area has drawn considerable attention in recent years and is seen as a tool that provides competitive power. SCM is a set of approaches to integrate suppliers, manufacturers, warehouses, and stores efficiently, so that merchandise is produced and distributed in the right quantities, to the right locations, and at the right time, in order to minimize system-wide costs while satisfying service-level requirements. So the supply chain consists of various members or stages. A supply chain is a dynamic, stochastic, and complex system that might involve hundreds of participants.

Inventory usually represents from 20 to 60 per cent of the total assets of manufacturing firms. Therefore inventory management policies prove critical in determining the profit of such firms [4]. Inventory management is, to an even greater extent, relevant when a whole supply chain (SC), namely a network of procurement, transformation, and delivering firms, is considered. Inventory management is indeed a major issue in SCM, i.e. an approach that addresses SC issues under an integrated perspective.

Inventories exist throughout the SC in various forms for various reasons [6]. The lack of coordinated inventory management throughout the SC often causes the bullwhip effect, namely an amplification of demand variability moving towards the upstream stages. This causes excessive inventory investments, lost revenues, misguided capacity plans, ineffective transportation, missed production schedules, and poor customer service [5].

Many scholars have studied these problems, as well as emphasized the need for integration among SC stages, to make the chain effectively and efficiently satisfy customer requests (e.g. reference [7]). Beside the integration issue, uncertainty has to be dealt with in order to define an effective SC inventory policy. In addition to the uncertainty on supply (e.g.
lead times) and demand, information delays associated with the manufacturing and distribution processes characterize SCs.

Inventory management in multi-echelon SCs is an important issue, because there are many elements that have to coordinate with each other. They must also arrange their inventories in a coordinated way. There are many factors that complicate successful inventory management, e.g. uncertain demands, lead times, production times, product prices, and costs, especially the uncertainty in demand and lead times, under which inventory cannot be managed optimally between echelons.

In the current paper, a detailed literature review is presented, addressing multi-echelon inventory management in SCs from 1996 to 2005. Here, the behavior of the papers with respect to demand and lead-time uncertainty is emphasized. First, the echelon concept and multi-echelon inventory management in SCs are defined. Then, the literature review conducted from an operational research point of view between 1996 and 2005 is presented. Finally, directions for future research are suggested.

2 MULTI-ECHELON INVENTORY MANAGEMENT IN SUPPLY CHAINS

Most manufacturing enterprises are organized into networks of manufacturing and distribution sites that procure raw material, process it into finished goods, and distribute the finished goods to customers. The terms 'multi-echelon' or 'multilevel' production/distribution networks are also synonymous with such networks (or SCs), when an item moves through more than one step before reaching the final customer. Inventories exist throughout the SC in various forms for various reasons. At any manufacturing point, they may exist as raw materials, work in progress, or finished goods. They exist at the distribution warehouses, and they exist in transit, or 'in the pipeline', on each path linking these facilities.

Manufacturers procure raw material from suppliers and process it into finished goods, sell the finished goods to distributors, and then to retailers and/or customers. When an item moves through more than one stage before reaching the final customer, it forms a 'multi-echelon' inventory system [8]. The echelon stock of a stock point equals all stock at this stock point, plus in-transit to or on-hand at any of its downstream stock points, minus the backorders at its downstream stock points.

The analysis of multi-echelon inventory systems that pervade the business world has a long history [10]. Multi-echelon inventory systems are widely employed to distribute products to customers over extensive geographical areas. Given the importance of these systems, many researchers have studied their operating characteristics under a variety of conditions and assumptions [11]. Since the development of the economic order quantity (EOQ) formula by Harris (1913), researchers and practitioners have been actively concerned with the analysis and modelling of inventory systems under different operating parameters and modelling assumptions [2]. Research on multi-echelon inventory models has gained importance over the last decade, mainly because integrated control of SCs consisting of several processing and distribution stages has become feasible through modern information technology [8, 9, 12].

Clark and Scarf [13] were the first to study the two-echelon inventory model [8, 9, 10, 14-17]. They proved the optimality of a base-stock policy for the pure-serial inventory system and developed an efficient decomposing method to compute the optimal base-stock ordering policy.
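The echelon-stock bookkeeping defined above is mechanical enough to state precisely. The following Python sketch (a hypothetical data model, not taken from any of the reviewed papers) computes the echelon stock of a stock point recursively from its local figures and those of its downstream points:

    # Sketch of the echelon-stock definition (hypothetical data model):
    # echelon stock = local on-hand stock, plus in-transit to or on-hand at
    # downstream points, minus backorders at downstream points.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class StockPoint:
        on_hand: float
        in_transit_in: float          # stock currently in transit to this point
        backorders: float
        downstream: List["StockPoint"] = field(default_factory=list)

    def echelon_stock(p: StockPoint) -> float:
        total = p.on_hand
        for d in p.downstream:
            # Recursion covers on-hand stock at every downstream point;
            # in-transit and backorders are counted per downstream point.
            total += d.in_transit_in - d.backorders + echelon_stock(d)
        return total

For a two-echelon chain, this reduces to the warehouse's own stock plus everything on hand at, or on the way to, its retailers, net of their backorders.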
Bessler and Veinott [18] extended the Clark and Scarf [13] model to include general arborescent structures. The depot warehouse problem described above was addressed by Eppen and Schrage [19] who analysed a model with a stockless central depot [20]. They derived a closed-form expression for the order-up-to-level under the equal fractile allocation assumption. Several authors have also considered this problem in various forms [11, 14–17, 20–30]. Owing to the complexity and intractability of the multi-echelon problem, Hadley and Whitin [31] recommend the adoption of single-location, single-echelon models for the inventory systems .Sherbrooke considered an ordering policy of a two-echelon model for warehouse and retailer. It is assumed that stockouts at the retailers arecompletely backlogged [8]. Also, Sherbrooke [32] constructed the METRIC (multi-echelon technique for recoverable item control) model, which identifies the stock levels that minimize the expected number of backorders at the lower-echelon subject to a budget constraint. This model is the first multi-echelon inventory model for managing the inventory of service parts [6, 10, 33]. Thereafter, a large set of models, which generally seek to identify optimal lot sizes and safety stocks in a multi-echelon framework, were produced by many researchers [27, 34–37]). In addition to analytical models, simulation models have also been developed to capture the complex interactions of the multi-echelon inventory problems.Figure 1 shows a multi-echelon system consisting of a number of suppliers, plants, warehouses, distribution centres, and customers [27, 42, 43].So far literature has devoted major attention to the forecasting of lumpy demand, and to the development of stock policies for multi-echelon SCs [13]. Inventory control policy for multi-echelon systems with stochastic demand has been a widely researched area. More recent papers have been covered by Silver and Pyke. The advantage of centralized planning, available in periodic review policies, can be obtained in continuous review policies, by defining the reorder levels of different stages, in terms of echelon stock rather than installation stock3 LITERATURE REVIEW: FROM 1996 TO 2005In this section, a detailed literature review, conducted from an operational research point of view, is presented. It addresses multi-echelon inventory management in SCs, from 1996 to 2005. The selection criteria of the papers that are reviewed are: using operational research techniques to overcome multiechelon inventory management problems, and being demand and lead time sensitive (there are uncertain demand and lead times). Here, the behavior of the papers against demand and lead time uncertainty is emphasized.The papers reviewed here are categorized into groups on the basis ofthe research techniques in which they are used. These techniques can be grouped as:(a) mathematic modelling (only);(b) mathematic modelling and other techniques (in the same paper);(c) METRIC modelling;(d) Markov decision process;(e) simulation (only);(f) Stackelberg game;(g) literature review;(h) other techniques (vari-METRIC method, heuristics, scenario analysis, fuzzy logic, etc.).While the research techniques are common for papers that are grouped according to their research techniques, the number of echelons they consider, inventory/system policies, demand and lead time assumptions, the objectives, and the solutions’ exact ness may be different. Therefore these factors are also analysed.Mathematic modelling techniqueRau et al. 
[8], Diks and de Kok [9], Dong and Lee, Mitra and Chatterjee [45], Hariga [46], Chen, Axsater and Zhang [48], Nozick and Turnquist, and So and Zheng [50] use a mathematic modelling technique in their studies to manage multi-echelon inventory in SCs. Diks and de Kok’s study [9] con siders a divergent multi-echelon inventory system, such as a distribution system or a production system, and assumes that the order arrives after a fixed lead time. Hariga [46], presents a stochastic model for a single-period production system composed of several assembly/processing and storage facilities in series.Chen [47], Axsater and Zhang [48], and Nozick and Turnquist [49] consider a two-stage inventory system in their papers. Axsater and Zhang [48] and Nozick and Turnquist [49] assume that the retailers face stationary and independent Poisson demand. Mitra and Chatterjee [45] examine De Bodt andGraves’model(1985),which they developed in their paper ‘Continuous-review policies for a multi-echelon inventory problem wi th stochastic demand’, for fast moving items from the implementation point of view. The proposed modification of the model can be extended to multi-stage serial and two-echelon assembly systems. In Rau et al.’s [8] model, shortage is not allowed, lead time is assumed to be negligible, and demand rate and production rate is deterministic and constant. So and Zheng [50] used an analytical model to analyse two important factors that can contribute to the high degree of order-quantity variability experienced by semiconductor manufacturers: supplier’s lead time and forecast demand updating. They assume that the external demands faced by the retailer are correlated between two successive time periods and that the retailer uses the latest demand information to update its future demand forecasts.Furthermore, they assu me that the supplier’s deli very lead times are variable and are affected by the retailer’s order quantities. Dong and Lee’s paper revisits the serial multi-echelon inventory system of Clark and Scarf [13] and develops three key results. First, they provide a simple lower-bound approximation to the optimal echelon inventory levels and an upper bound to the total system cost for the basic model of Clark and Scarf [13]. Second, they show that the structure of the optimal stocking policy of Clark and Scarf [13] holds under time-correlated demand processing using a Martingale model of forecast evolution. Third, they extend the approximation to the time-correlated demand process and study, in particular for an autoregressive demand model, the impact of lead times, and autocorrelation on the performance of the serial inventory system.After reviewing the literature about multiechelon inventory management in SCs using mathematic modelling technique, it can be said that, in summary, these papers consider two, three, or N-echelon systems with stochastic or deterministic demand. They assume lead times to be fixed, zero, constant, deterministic, or negligible. They gain exact or approximate solutions.Mathematic modelling and other techniques togetherDekker et al. analyses the effect of the break quantity rule on the inventory costs. The break quantity rule is to deliver large orders from the warehouse, and small orders from the nearest retailer, where a so-called break quantity determines whether an order is small or large. In most l-warehouse N-retailers distribution systems, it is assumed that all customer demand takes place at the retailers [19, 22, 24, 70, 71]. 
However, it was shown by Dekker et al. that delivering large orders from the warehouse can lead to a consid erable reduction in the retailer’s inventory costs. In Dekker et al. [54] the results of Dekker et al. [72] were extended by also including the inventory costs at the warehouse. The study by Mohebbi and Posner’s [53] contains a cost analysis in the context of a continuous-review inventory system with replenishment orders and lost sales. The policy considered in the paper by van der Haiden et al. [56] is an echelon stock, periodic review, order up-to (R,S) policy, under both stochastic demand and lead times.Andersson and Markland’s [57] approach is based on an approximate cost-evaluation technique. Axsater presents a method for exact evaluation of control policies that provides the complete probability distributions of the retailer inventory levels. Mitra and Chatterjee [65] examine the effect of utilizing demand information in a multi-echelon system. Seferlis and Giannelos [41] present an optimization-based control approach that applies multivariable model-predictive control principles to the entire network. The invent ory system under Seifbarghy and Jokar’s [68] consideration uses continuous review inventory policy (R,Q) and assumes constant lead times. In Moinzadeh’s paper [62], each retailer places their order to the supplier according to the well-known ‘Q,R’ policy. It is assumed that the supplier has online information about the demand, as well as inventory activities of the product at each retailer, and uses this information when making order/replenishment decisions. Tang formulae aredeveloped for solving the optimal planned lead times with the objective of minimizing total stock out and invent or holding costs. Axsater [43] assumes that the system is controlled by continuous review installation stock (R,Q) policies with given batch quantities and presents a simple technique for approximate optimization of the reorder points.Cachon and Fisher [58] and Tsiakis et al. [61] use mathematical modelling and scenario analysis in their studies. Cachon and Fisher [58] consider a twoechelon inventory system with stochastic demand, while Tsiakis et al.[61] consider a four-echelon inventory system with time-invariant demand, differently from most studies. Cachon and Fisher [58] study the value of sharing demand and inventory data in a two-echelon inventory system, while Tsiakis et al.’s objective is the minimization of the total annualized cost of the network Chiu and Huang [64] use mathematical modelling and simulated annealing algorithm in their studies and consider an N-echelon serial SC. Their paper proposes a multi-echelon integrated just-in-time inventory (MEIJITI) model with random-delivery lead times for a serial SC in which members exchange information to make purchase, production, and delivery decisions jointly.Parker and Kapuscinski [30] use mathematical modelling and Markov decision processes in their paper, and consider a two-echelon inventor system with stochastic demand. Extending the Clark and Scarf [13] model to include installations with production capacity limits, they demonstrate that a modified echelon base-stock policy is optimal in a twostage system when there is a smaller capacity at the downstream facility.A multi-product, multi-stage, and multi-period production and distribution planning model is proposed in Chen and Lee [66] to tackle the compromised sales prices and the total profit problem of a multi-echelon SC network with uncertain sales prices. 
They use mathematical modelling (mixed integer non-linear programming) and fuzzy optimization in their study.Jalbar et al. [67] use mathematical modelling, Schwarz heuristic, Graves and Schwarz procedure, Muckstadt and Roundy approach, and O(N log N) heuristic in their paper, and consider a two-echelon inventory system with one-warehouse and N-retailers.The goal is to determine single-cycle policies that minimize the average cost per unit time, that is, the sum of the average holding and set-up costs per unit time at the retailers and at the warehouse.In Routroy and Koda li’s paper [2] mathematical mod elling and differential evolution algorithms are used. A three-echelon inventory system is considered consisting of a retailer, a warehouse, and a manufacturer.Han and Damrongwongsiri’s [69] purpose is establishing a strategic resource allocation model to capture and encapsulate the complexity of the modern global SC management problem. A mathematical model is constructed to describe the stochastic multi-period two-echelon inventory with the many to-many demand–supplier network problem. Genetic algorithm (GA) is applied to derive near optimal solutions through a two-stage optimization process. Demand in each period can be represented by the probability distribution, such as normal distribution or exponential distribution.Most of the papers reviewed here use simulation with mathematical modelling. They consider intensively two-echelon inventory system with stochastic demand, 1, 3, or N-echelon systems are rarely considered. They gain exact or approximate solutions.METRIC modelling techniqueMoinzadeh and Aggarwal [11] use METRIC modelling and simulation techniques in their study, while Andersson and Melchiors [42] and Wang et al. [73] use METRIC modelling only. The three of them consider a two-echelon inventory system with stochastic demand, and obtain approximate solutions.Moinzadeh and Aggarwal [11] study a (S-1,S)-type multi-echelon inventory system where all the stocking locations have the option toreplenish their inventory through either a normal or a more expensive emergency resupply channel. Wang et al. [73] study the impact of such centre-dependent depotreplenishment lead times (DRLTs) on system performance. Andersson and Melchiors [42] evaluate and optimize S-1,S-policies for a two-echelon inventory system consisting of one central warehouse and an arbitrary number of retailers.Markov decision process techniqueIida [74], Chen and Song [75], Chen et al. [76], and Minner et al. [77] use the Markov decision process in their studies, while Chiang and Monahan [10] use Markov decision process and scenario analysis, and Johansen [78] uses Markov decision process, simulation, and Erlang’s loss formula together. Iida [74] and Chen and Song [75] consider an N-echelon inventory system, but under stochastic demand in the first study and Markov-modulated demand in the second one, respectively. Chen et al. [76], Minner et al. [77], and Chiang and Monahan [10] consider a two-echelon inventory system with stochastic demand. Johansen [78] considers a single-item inventory system and a sequential supply system with stochastic demand.The main purpose of Iida’s [74] paper is to show that near-myopic policies are acceptable for a multiechelon inventory problem. It is assumed that lead times at each echelon are constant. Chen and Song’s[75] objective is to minimize the long-run average costs in the system. In the system by Chen et al. 
[76], each location employs a periodic-review (R,nQ), or lot-size reorder point, inventory policy. They show that each location's inventory position is stationary and that the stationary distribution is uniform and independent of any other. In the study by Minner et al. [77], the impact of manufacturing flexibility on inventory investments in a distribution network consisting of a central depot and a number of local warehouses is examined.

Other techniques

In multi-echelon inventory management there are some other research techniques used in the literature, such as heuristics, the vari-METRIC method, fuzzy sets, model predictive control, scenario analysis, statistical analysis, and GAs. These methods are used rarely and only by a few authors.

The paper by Chandra and Grabis quantifies the bullwhip effect in the case of serially correlated external demand, if autoregressive models are applied to obtain multiple-step demand forecasts. Here, under autoregressive demand, inventory management of a two-echelon SC consisting of a retailer and a distributor is considered. It is assumed that the lead time is deterministic. The papers using the other techniques consider one-, two-, three-, four-, five-, or N-echelon systems and assume stochastic, constant, fuzzy, or deterministic demand and lead times. All of them obtain approximate solutions.

4 FINDINGS OF THE LITERATURE REVIEW

A limited number of echelons of a multi-echelon inventory system is usually considered in the literature. The studies rarely generalize their models to N echelons. Similarly, they usually consider serial systems instead of a tree conformation.

The authors generally assume demand and lead times to be stochastic, deterministic, constant, or negligible. There are only a few studies that determine these variables with heuristics, fuzzy logic, or GAs. These techniques have not yet been examined adequately in inventory management in multi-echelon SCs. In addition, the papers present mostly approximate models; only a small number of papers give exact solutions.
Vol. 42, No. 5, Acta Armamentarii, May 2021

Loitering Munition Penetration Control Decision Based on Deep Reinforcement Learning
GAO Ang(1), DONG Zhiming(1), YE Hongbing(2), SONG Jinghua(1), GUO Qisheng(1)
(1. Military Exercise and Training Center, Army Academy of Armored Forces, Beijing 100072, China; 2. Xiangnan University, Chenzhou 423099, Hunan, China)

Abstract: The loitering munition penetration control decision (LMPCD) problem is an important research direction under the "multi-domain operations" warfighting concept.
To address this problem, an LMPCD model based on the Markov decision process is established.
The LMPCD function and the flight state-action value function are fitted, an LMPCD framework based on the actor-critic method is constructed, a deep reinforcement learning solution method based on the deep deterministic policy gradient algorithm is given, and an optimal decision network for loitering munition penetration control is generated.
In 1,000 simulated loitering munition penetration tests, the results show that the mission success rate of the loitering munition is 82.1% and the average decision time is 1.48 ms, which verifies the effectiveness of the LMPCD model and its solution process.
Keywords: loitering munition; deep reinforcement learning; Markov decision process; penetration; control decision
CLC number: E911    Document code: A    Article ID: 1000-1093(2021)05-1101-10
DOI: 10.3969/j.issn.1000-1093.2021.05.023

Loitering Munition Penetration Control Decision Based on Deep Reinforcement Learning
GAO Ang(1), DONG Zhiming(1), YE Hongbing(2), SONG Jinghua(1), GUO Qisheng(1)
(1. Military Exercise and Training Center, Army Academy of Armored Forces, Beijing 100072, China; 2. Xiangnan University, Chenzhou 423099, Hunan, China)

Abstract: Loitering munition penetration control decision (LMPCD) is an important research direction under the concept of "multi-domain war". Research on real-time route planning for loitering munition penetration has important military significance. Traditional knowledge, reasoning, and planning methods do not have the ability to explore and discover new knowledge outside their framework. Bionic optimization methods are suitable for solving path-planning problems in a static environment, such as the traveling salesman problem, and are difficult to apply to the loitering munition penetration problem, which places high demands on environmental dynamics and real-time decision-making. Given the limitations of the first two kinds of methods, the applicability of the deep reinforcement learning method is analyzed, and loitering munition domain knowledge is introduced into each element of the deep reinforcement learning algorithm. The flight motion model of the loitering munition is analyzed; the state space, action space, and reward function of the loitering munition are designed; the algorithm framework of the loitering munition penetration control decision is analyzed; and the training process of the loitering munition penetration control decision algorithm is designed. A penetration simulation test of 1,000 rounds of loitering munition shows that the penetration success rate is 82.1% and the average decision time is 1.48 ms, which verifies the effectiveness of the algorithm training process and of the LMPCD model.

Received: 2020-07-21
Foundation items: Military Scientific Research Program projects (41405030302, 41401020301)
About the author: GAO Ang (b. 1988), male, Ph.D. candidate.
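The abstract above describes an actor-critic learner trained with the deep deterministic policy gradient. As a rough, generic illustration only (standard DDPG machinery with made-up network sizes; nothing here is taken from the paper), the core update for one batch can be sketched as:

    # Generic DDPG-style update sketch (standard algorithm outline with
    # hypothetical dimensions; not code from the cited paper).
    import torch
    import torch.nn as nn

    state_dim, action_dim = 6, 2          # assumed flight-state / control dims
    actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, action_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                           nn.Linear(64, 1))
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    gamma = 0.99

    def ddpg_update(s, a, r, s2, target_actor, target_critic):
        # Critic: regress Q(s, a) toward r + gamma * Q'(s', mu'(s')).
        with torch.no_grad():
            y = r + gamma * target_critic(torch.cat([s2, target_actor(s2)], dim=1))
        q = critic(torch.cat([s, a], dim=1))
        critic_loss = nn.functional.mse_loss(q, y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
        # Actor: ascend the deterministic policy gradient, i.e. maximize Q(s, mu(s)).
        actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

The paper's contribution, per the abstract, lies in the domain-specific state space, action space, and reward design around such an update, not in the update itself.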
Image retrieval based on a hybrid Markov chain model and diffusion
FENG Qiuyan

Abstract: This paper presents a simple and effective framework for general image retrieval based on graphs and multi-feature fusion. For each feature, given a query object and the initially retrieved images, an easy-to-handle graph is constructed whose nodes represent the images and whose edges carry the pairwise similarity scores between images. Based on a multi-graph stochastic path-selection model, a hybrid Markov chain model and multi-feature fusion are used to fuse the multiple easy-to-handle graphs into a fusion graph, and diffusion is applied to the fusion graph to reduce noise. Experimental comparison with existing methods demonstrates the validity of the approach and improves benchmark performance. The components, parameters, and feature combinations of the method are evaluated through experiments, which verify the rationality of the method and its robustness to parameter changes.

Journal: Computer Applications and Software
Year (volume), issue: 2018, 35(5)
Pages: 7 (pp. 258-263, 286)
Keywords: multi-feature fusion; image retrieval; easy-to-handle graph; pairwise similarity score; fusion graph; diffusion
Author: FENG Qiuyan
Affiliation: Henan University of Economics and Law, Zhengzhou 453800, Henan, China
Language: Chinese
CLC number: TP391

0 Introduction

Image retrieval occupies an increasingly important position in information retrieval.
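The fusion-and-diffusion pipeline in the abstract above can be made concrete with a small sketch. This is an interpretive illustration with assumed inputs (per-feature similarity matrices), using a uniform mixture as a stand-in for the paper's hybrid Markov chain and a standard diffusion iteration; it is not the author's implementation:

    # Interpretive sketch of multi-graph fusion plus diffusion (assumed inputs;
    # not the author's code). Each S in similarity_matrices is an n-by-n
    # pairwise similarity matrix for one feature, with nonneg rows summing > 0.
    import numpy as np

    def transition_matrix(S):
        # Row-normalize a similarity matrix into a Markov chain.
        S = np.maximum(S, 0)
        return S / S.sum(axis=1, keepdims=True)

    def fuse_and_diffuse(similarity_matrices, steps=3, alpha=0.8):
        # Fuse the per-feature chains into one (uniform mixture stand-in).
        P = np.mean([transition_matrix(S) for S in similarity_matrices], axis=0)
        # Diffuse: W <- alpha * P W P^T + (1 - alpha) * I spreads similarity
        # mass along the fused graph and damps noisy edges.
        W = P.copy()
        I = np.eye(P.shape[0])
        for _ in range(steps):
            W = alpha * P @ W @ P.T + (1 - alpha) * I
        return W

Ranking the database by the query's row of the returned matrix W would then give the diffused retrieval scores.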
Michael L. Littman
Brown University / Bellcore
Department of Computer Science, Brown University
Providence, RI 02912-1910
mlittman@

Abstract

In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.

1 INTRODUCTION

No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993; Yanco and Stein, 1993], but the mathematical framework that justifies it is inappropriate for multi-agent environments. The theory of Markov decision processes (MDP's) [Barto et al., 1989; Howard, 1960], which underlies much of the recent work on reinforcement learning, assumes that the agent's environment is stationary and as such contains no other adaptive agents.

The theory of games [von Neumann and Morgenstern, 1947] is explicitly designed for reasoning about multi-agent systems. Markov games (see, e.g., [Van Der Wal, 1981]) are an extension of game theory to MDP-like environments. This paper considers the consequences of using the Markov game framework in place of MDP's in reinforcement learning. Only the specific case of two-player zero-sum games is addressed, but even in this restricted version there are insights that can be applied to open questions in the field of reinforcement learning.

2 DEFINITIONS

An MDP [Howard, 1960] is defined by a set of states, $S$, and actions, $A$. A transition function, $T : S \times A \to PD(S)$, defines the effects of the various actions on the state of the environment ($PD(S)$ represents the set of discrete probability distributions over the set $S$). The reward function, $R : S \times A \to \mathbb{R}$, specifies the agent's task. In broad terms, the agent's objective is to find a policy mapping its interaction history to a current choice of action so as to maximize the expected sum of discounted reward, $E\{\sum_{j=0}^{\infty} \gamma^j r_j\}$, where $r_j$ is the reward received $j$ steps into the future. A discount factor, $0 \le \gamma < 1$, controls how much effect future rewards have on the optimal decisions, with small values of $\gamma$ emphasizing near-term gain and larger values giving significant weight to later rewards.

In its general form, a Markov game, sometimes called a stochastic game [Owen, 1982], is defined by a set of states, $S$, and a collection of action sets, $A_1, \ldots, A_k$, one for each agent in the environment. State transitions are controlled by the current state and one action from each agent: $T : S \times A_1 \times \cdots \times A_k \to PD(S)$. Each agent $i$ also has an associated reward function, $R_i : S \times A_1 \times \cdots \times A_k \to \mathbb{R}$, and attempts to maximize its expected sum of discounted rewards, $E\{\sum_{j=0}^{\infty} \gamma^j r_{i,j}\}$, where $r_{i,j}$ is the reward received $j$ steps into the future by agent $i$.

In this paper, we consider a well-studied specialization in which there are only two agents and they have diametrically opposed goals. This allows us to use a single reward function that one agent tries to maximize and the other, called the opponent, tries to minimize. In this paper, we use $A$ to denote the agent's action set, $O$ to denote the opponent's action set, and $R(s, a, o)$ to denote the immediate reward to the agent for taking action
$a \in A$ in state $s$ when its opponent takes action $o \in O$.

Adopting this specialization, which we call a two-player zero-sum Markov game, simplifies the mathematics but makes it impossible to consider important phenomena such as cooperation. However, it is a first step and can be considered a strict generalization of both MDP's (when $|O| = 1$) and matrix games (when $|S| = 1$).

As in MDP's, the discount factor, $\gamma$, can be thought of as the probability that the game will be allowed to continue after the current move. It is possible to define a notion of undiscounted rewards [Schwartz, 1993], but not all Markov games have optimal strategies in the undiscounted case [Owen, 1982]. This is because, in many games, it is best to postpone risky actions indefinitely. For current purposes, the discount factor has the desirable effect of goading the players into trying to win sooner rather than later.

3 OPTIMAL POLICIES

The previous section defined the agent's objective as maximizing the expected sum of discounted reward. There are subtleties in applying this definition to Markov games, however. First, we consider the parallel scenario in MDP's.

In an MDP, an optimal policy is one that maximizes the expected sum of discounted reward and is undominated, meaning that there is no state from which any other policy can achieve a better expected sum of discounted reward. Every MDP has at least one optimal policy, and of the optimal policies for a given MDP, at least one is stationary and deterministic. This means that, for any MDP, there is a policy $\pi : S \to A$ that is optimal. The policy is called stationary since it does not change as a function of time, and it is called deterministic since the same action is always chosen whenever the agent is in state $s$, for all $s \in S$.

For many Markov games, there is no policy that is undominated because performance depends critically on the choice of opponent. In the game theory literature, the resolution to this dilemma is to eliminate the choice and evaluate each policy with respect to the opponent that makes it look the worst. This performance measure prefers conservative strategies that can force any opponent to a draw to more daring ones that accrue a great deal of reward against some opponents and lose a great deal to others. This is the essence of minimax: behave so as to maximize your reward in the worst case.

Given this definition of optimality, Markov games have several important properties. Like MDP's, every Markov game has a non-empty set of optimal policies, at least one of which is stationary. Unlike MDP's, there need not be a deterministic optimal policy. Instead, the optimal stationary policy is sometimes probabilistic, mapping states to discrete probability distributions over actions, $\pi : S \to PD(A)$. A classic example is "rock, paper, scissors," in which any deterministic policy can be consistently defeated.

The idea that optimal policies are sometimes stochastic may seem strange to readers familiar with MDP's or games with alternating turns like backgammon or tic-tac-toe, since in these frameworks there is always a deterministic policy that does no worse than the best probabilistic one. The need for probabilistic action choice stems from the agent's uncertainty of its opponent's current move and its requirement to avoid being "second guessed."

                            Agent
                    rock    paper   scissors
            rock      0       1       -1
    Opponent paper   -1       0        1
            scissors  1      -1        0

Table 1: The matrix game for "rock, paper, scissors" (entries are rewards to the agent; rows give the opponent's choice, columns the agent's).

    pi_paper    - pi_scissors >= V      (vs. rock)
    pi_scissors - pi_rock     >= V      (vs. paper)
    pi_rock     - pi_paper    >= V      (vs. scissors)
    pi_rock + pi_paper + pi_scissors = 1

Table 2: Linear constraints on the solution to a matrix game.
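To make the constraints of Table 2 concrete, here is a small Python check (illustrative only; SciPy's linprog is used as the solver, with the reward matrix oriented as in Table 1, rows indexed by the opponent's choice). It recovers the values quoted in the next section: $V = 0$ and $\pi = (1/3, 1/3, 1/3)$.

    # Illustrative solve of the Table 2 linear program with SciPy.
    import numpy as np
    from scipy.optimize import linprog

    # R[o, a]: reward to the agent when the opponent plays o and the agent plays a.
    # Order: rock, paper, scissors.
    R = np.array([[ 0,  1, -1],
                  [-1,  0,  1],
                  [ 1, -1,  0]])

    # Decision variables x = (pi_rock, pi_paper, pi_scissors, V); maximize V
    # by minimizing -V.
    c = np.array([0.0, 0.0, 0.0, -1.0])

    # For each opponent action o: sum_a pi_a * R[o, a] >= V,
    # rewritten as V - sum_a pi_a * R[o, a] <= 0.
    A_ub = np.hstack([-R, np.ones((3, 1))])
    b_ub = np.zeros(3)

    # The policy must be a probability distribution.
    A_eq = np.array([[1.0, 1.0, 1.0, 0.0]])
    b_eq = np.array([1.0])

    bounds = [(0, 1), (0, 1), (0, 1), (None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi, V = res.x[:3], res.x[3]
    print(pi, V)   # approximately [1/3, 1/3, 1/3] and 0.0

Any deterministic policy, by contrast, has a worst-case value of -1 in this game, which is exactly why the optimal policy must randomize.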
4 FINDING OPTIMAL POLICIES

This section reviews methods for finding optimal policies for matrix games, MDP's, and Markov games. It uses a uniform notation that is intended to emphasize the similarities between the three frameworks. To avoid confusion, function names that appear more than once appear with different numbers of arguments each time.

4.1 MATRIX GAMES

At the core of the theory of games is the matrix game defined by a matrix, $R$, of instantaneous rewards. Component $R_{o,a}$ is the reward to the agent for choosing action $a$ when its opponent chooses action $o$. The agent strives to maximize its expected reward while the opponent tries to minimize it. Table 1 gives the matrix game corresponding to "rock, paper, scissors."

The agent's policy is a probability distribution over actions, $\pi \in PD(A)$. For "rock, paper, scissors," $\pi$ is made up of 3 components: $\pi_{rock}$, $\pi_{paper}$, and $\pi_{scissors}$. According to the notion of optimality discussed earlier, the optimal agent's minimum expected reward should be as large as possible. How can we find a policy that achieves this?

Imagine that we would be satisfied with a policy that is guaranteed an expected score of $V$ no matter which action the opponent chooses. The inequalities in Table 2, with $V = 0$, constrain the components of $\pi$ to represent exactly those policies: any solution to the inequalities would suffice. For $\pi$ to be optimal, we must identify the largest $V$ for which there is some value of $\pi$ that makes the constraints hold. Linear programming (see, e.g., [Strang, 1980]) is a general technique for solving problems of this kind. In this example, linear programming finds a value of 0 for $V$ and (1/3, 1/3, 1/3) for $\pi$. We can abbreviate this linear program as

    $V = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} R_{o,a} \, \pi_a$,

where $\sum_{a \in A} R_{o,a} \pi_a$ expresses the expected reward to the agent for using policy $\pi$ against the opponent's action $o$.

4.2 MDP's

There is a host of methods for solving MDP's. This section describes a general method known as value iteration [Bertsekas, 1987].

The value of a state, $V(s)$, is the total expected discounted reward attained by the optimal policy starting from state $s$. States for which $V(s)$ is large are "good" in that a smart agent can collect a great deal of reward starting from those states. The quality of a state-action pair, $Q(s, a)$, is the total expected discounted reward attained by the non-stationary policy that takes action $a$ from state $s$ and then follows the optimal policy from then on. These functions satisfy the following recursive relationship for all $s$ and $a$:

    $Q(s, a) = R(s, a) + \gamma \sum_{s'} T(s, a, s') \, V(s')$        (1)
    $V(s) = \max_{a'} Q(s, a')$                                        (2)

This says that the quality of a state-action pair is the immediate reward plus the discounted value of all succeeding states weighted by their likelihood. The value of a state is the quality of the best action for that state. It follows that knowing $Q$ is enough to specify an optimal policy, since in each state we can choose the action with the highest $Q$-value.

The method of value iteration starts with estimates for $V$ and $Q$ and generates new estimates by treating the equal signs in Equations 1-2 as assignment operators. It can be shown that the estimated values for $V$ and $Q$ converge to their true values [Bertsekas, 1987].

4.3 MARKOV GAMES

Given $Q(s, a)$, an agent can maximize its reward using the "greedy" strategy of always choosing the action with the highest $Q$-value. This strategy is greedy because it treats $Q(s, a)$ as a surrogate for immediate reward and then acts to maximize its immediate gain. It is optimal because the $Q$-function is an accurate summary of future rewards.

A similar observation can be used for Markov games once we redefine $V(s)$ to be the expected reward for the optimal policy starting from state $s$, and $Q(s, a, o)$ as the expected reward for taking action $a$ when the opponent chooses $o$ from state $s$ and continuing optimally
thereafter. We can then treat the $Q(s, \cdot, \cdot)$ values as immediate payoffs in an unrelated sequence of matrix games (one for each state, $s$), each of which can be solved optimally using the techniques of Section 4.1. Thus, the value of a state $s$ in a Markov game is

    $V(s) = \max_{\pi \in PD(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o) \, \pi_a$,

and the quality of action $a$ against action $o$ in state $s$ is

    $Q(s, a, o) = R(s, a, o) + \gamma \sum_{s'} T(s, a, o, s') \, V(s')$.

The resulting recursive equations look much like Equations 1-2, and indeed the analogous value iteration algorithm can be shown to converge to the correct values [Owen, 1982]. It is worth noting that in games with alternating turns, the value function need not be computed by linear programming since there is an optimal deterministic policy. In this case we can write $V(s) = \max_{a \in A} \min_{o \in O} Q(s, a, o)$.

5 LEARNING OPTIMAL POLICIES

Traditionally, solving an MDP using value iteration involves applying Equations 1-2 simultaneously over all $s \in S$. Watkins [Watkins, 1989] proposed an alternative approach that involves performing the updates asynchronously without the use of the transition function, $T$. In this Q-learning formulation, an update is performed by an agent whenever it receives a reward of $r$ when making a transition from $s$ to $s'$ after taking action $a$. The update is

    $Q(s, a) := (1 - \alpha) \, Q(s, a) + \alpha \, (r + \gamma V(s'))$,

which takes the place of Equation 1. The probability with which this happens is precisely $T(s, a, s')$, which is why it is possible for an agent to carry out the appropriate update without explicitly using $T$. This learning rule converges to the correct values for $V$ and $Q$, assuming that every action is tried in every state infinitely often and that new estimates are blended with previous ones using a slow enough exponentially weighted average [Watkins and Dayan, 1992].

It is straightforward, though seemingly novel, to apply the same technique to solving Markov games. A completely specified version of the algorithm is given in Figure 1. The variables in the figure warrant explanation since some are given to the algorithm as part of the environment, others are internal to the algorithm, and still others are parameters of the algorithm itself. Variables from the environment are: the state set, S; the action set, A; the opponent's action set, O; and the discount factor, gamma. The variables internal to the learner are: a learning rate, alpha, which is initialized to 1.0 and decays over time; the agent's estimate of the Q-function, Q; the agent's estimate of the V-function, V; and the agent's current policy for state s, pi[s,.]. The remaining variables are parameters of the algorithm: explor controls how often the agent will deviate from its current policy to ensure that the state space is adequately explored, and decay controls the rate at which the learning rate decays. This algorithm is called minimax-Q since it is essentially identical to the standard Q-learning algorithm with a minimax replacing the max.

    Initialize:
        For all s in S, a in A, and o in O, let Q[s,a,o] := 1.
        For all s in S, let V[s] := 1.
        For all s in S and a in A, let pi[s,a] := 1/|A|.
        Let alpha := 1.0.
    Choose an action:
        With probability explor, return an action uniformly at random.
        Otherwise, if current state is s, return action a with probability pi[s,a].
    Learn:
        After receiving reward rew for moving from state s to s'
        via action a and opponent's action o:
            Let Q[s,a,o] := (1 - alpha) * Q[s,a,o] + alpha * (rew + gamma * V[s']).
            Use linear programming to find pi[s,.] maximizing
                min over o' of sum over a' of pi[s,a'] * Q[s,a',o'].
            Let V[s] := min over o' of sum over a' of pi[s,a'] * Q[s,a',o'].
            Let alpha := alpha * decay.

Figure 1: The minimax-Q algorithm (only the action-selection step of this listing survived extraction; the remainder is reconstructed from the description in the text).

6 EXPERIMENTS

This section demonstrates the minimax-Q learning algorithm using a simple two-player zero-sum Markov game modeled after the game of soccer.

6.1 SOCCER

The game is played on a 4x5 grid as depicted in Figure 2.
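Before detailing the game, the loop of Figure 1 can be rendered concretely in Python. This is a sketch under assumed interfaces: env.reset/env.step and all names are hypothetical, not the paper's code, and each state's matrix game is solved with the linear program of Section 4.1.

    # Minimax-Q sketch (assumed tabular environment interface; parallels Figure 1).
    import numpy as np
    from scipy.optimize import linprog

    def solve_state_game(Q_s):
        """Q_s[a, o] -> (policy over actions, game value) via linear programming."""
        n_a, n_o = Q_s.shape
        c = np.zeros(n_a + 1); c[-1] = -1.0                 # maximize V
        A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])       # V <= sum_a pi_a Q[a,o], all o
        A_eq = np.ones((1, n_a + 1)); A_eq[0, -1] = 0.0     # pi sums to 1
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_o), A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, 1)] * n_a + [(None, None)])
        return res.x[:n_a], res.x[-1]

    def minimax_q(env, n_s, n_a, n_o, gamma=0.9, explor=0.2,
                  decay=0.9999954, steps=10**6):
        Q = np.ones((n_s, n_a, n_o)); V = np.ones(n_s)
        pi = np.full((n_s, n_a), 1.0 / n_a); alpha = 1.0
        rng = np.random.default_rng(0)
        s = env.reset()
        for _ in range(steps):
            # Choose an action: explore, else sample from the current policy.
            if rng.random() < explor:
                a = rng.integers(n_a)
            else:
                a = rng.choice(n_a, p=pi[s])
            s2, o, rew = env.step(a)            # opponent's action o is observed
            # Learn: minimax replaces the max of ordinary Q-learning.
            Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * (rew + gamma * V[s2])
            pi[s], V[s] = solve_state_game(Q[s])
            alpha *= decay
            s = s2
        return pi, V

The linear-programming call in the inner loop is the expensive step the Discussion section returns to; an alternating-turns game would replace it with a cheap max-min over the table.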
The two players, A and B, occupy distinct squares of the grid and can choose one of 5 actions on each turn: N, S, E, W, and stand. Once both players have selected their actions, the two moves are executed in random order.

The circle in the figures represents the "ball." When the player with the ball steps into the appropriate goal (left for A, right for B), that player scores a point and the board is reset to the configuration shown in the left half of the figure. Possession of the ball goes to one or the other player at random.

When a player executes an action that would take it to the square occupied by the other player, possession of the ball goes to the stationary player and the move does not take place. A good defensive maneuver, then, is to stand where the other player wants to go. Goals are worth one point and the discount factor is set to 0.9, which makes scoring sooner somewhat better than scoring later.

For an agent on the offensive to do better than breaking even against an unknown defender, the agent must use a probabilistic policy. For instance, in the example situation shown in the right half of Figure 2, any deterministic choice for A can be blocked indefinitely by a clever opponent. Only by choosing randomly between stand and S can the agent guarantee an opening and therefore an opportunity to score.

6.2 TRAINING AND TESTING

Four different policies were learned, two using the minimax-Q algorithm and two using Q-learning. For each learning algorithm, one learner was trained against a random opponent and the other against another learner of identical design. The resulting policies were named MR, MM, QR, and QQ for minimax-Q trained against random, minimax-Q trained against minimax-Q, Q trained against random, and Q trained against Q.

The minimax-Q algorithm (MR, MM) was as described in Figure 1 with $explor = 0.2$ and $decay = 10^{\log_{10}(0.01)/10^6} = 0.9999954$, and learning took place for one million steps.
(The value of decay was chosen so that the learning rate reached 0.01 at the end of the run.) The Q-learning algorithm (QR, QQ) was identical except that a "max" operator was used in place of the minimax and the Q-table did not keep information about the opponent's action. Parameters were set identically to the minimax-Q case.

For MR and QR, the opponent for training was a fixed policy that chose actions uniformly at random. For MM and QQ, the opponent was another learner identical to the first but with separate Q- and V-tables.

The resulting policies were evaluated in three ways. First, each policy was run head-to-head with a random policy for one hundred thousand steps. To emulate the discount factor, every step had a 0.1 probability of being declared a draw. Wins and losses against the random opponent were tabulated.

The second test was a head-to-head competition with a hand-built policy. This policy was deterministic and had simple rules for scoring and blocking. In 100,000 steps, it completed 5600 games against the random opponent and won 99.5% of them.

The third test used Q-learning to train a "challenger" opponent for each of MR, MM, QR and QQ. The training procedure for the challengers followed that of QR, where the "champion" policy was held fixed while the challenger was trained against it. The resulting policies were then evaluated against their respective champions. This test was repeated three times to ensure stability, with only the first reported here. All evaluations were repeated three times and averaged.

6.3 RESULTS

Table 3 summarizes the results. The columns marked "games" list the number of completed games in 100,000 steps and the columns marked "%won" list the percentage won by the associated policy. Percentages close to 50 indicate that the contest was nearly a draw.

All the policies did quite well when tested against the random opponent. The QR policy's performance was quite remarkable, however, since it completed more games than the other policies and won nearly all of them. This might be expected since QR was trained specifically to beat this opponent, whereas MR, though trained in competition with the random policy, chooses actions with an idealized opponent in mind.

Against the hand-built policy, MM and MR did well, roughly breaking even. The MM policy did marginally better. In the limit, this should not be the case since an agent trained by the minimax-Q algorithm should be insensitive to the opponent against which it was trained and always behave so as to maximize its score in the worst case.
6.3 RESULTS

Table 3 summarizes the results. The columns marked "games" list the number of completed games in 100,000 steps and the columns marked "% won" list the percentage won by the associated policy. Percentages close to 50 indicate that the contest was nearly a draw.

                         MM               QQ
                     % won   games    % won   games
vs. random            99.3    7200     99.5    8600
vs. hand-built        53.7    5300     76.3    3300
vs. MR-challenger
vs. MM-challenger     37.5    4400
vs. QR-challenger
vs. QQ-challenger                       0.0    1200

Table 3: Results for policies trained by minimax-Q (MR and MM) and Q-learning (QR and QQ).

All the policies did quite well when tested against the random opponent. The QR policy's performance was quite remarkable, however, since it completed more games than the other policies and won nearly all of them. This might be expected since QR was trained specifically to beat this opponent, whereas MR, though trained in competition with the random policy, chooses actions with an idealized opponent in mind.

Against the hand-built policy, MM and MR did well, roughly breaking even. The MM policy did marginally better. In the limit, this should not be the case, since an agent trained by the minimax-Q algorithm should be insensitive to the opponent against which it was trained and always behave so as to maximize its score in the worst case. The fact that there was a difference suggests that the algorithm had not yet converged on the optimal policy. Prior to convergence, the opponent can make a big difference to the behavior of a minimax-Q agent, since playing against a strong opponent means the training will take place in important parts of the state space.

The performance of the QQ and QR policies against the hand-built policy was strikingly different. This points out an important consequence of not using a minimax criterion. A close look at the two policies indicated that QQ, by luck, implemented a defense that was perfect against the hand-built policy. The QR policy, on the other hand, happened to converge on a strategy that was not appropriate. Against a slightly different opponent, the tables would have been turned.

The fact that the QQ policy did so well against the random and hand-built opponents, especially compared to the minimax policies, was somewhat surprising. Simultaneously training two adaptive agents using Q-learning is not mathematically justified, and in practice it is prone to "locking up," that is, reaching a mutual local maximum in which both agents stop learning prematurely (see, e.g., [Boyan, 1992]). In spite of this, some researchers have reported amazing success with this approach [Tesauro, 1992, Boyan, 1992] and it seemed to have been successful in this instance as well.

The third experiment was intended to measure the worst-case performance of each of the policies. The learned policies were held fixed while a challenger was trained to beat them. This is precisely the scenario that the minimax policies were designed for because, in a fair game such as this, a minimax policy should break even against even the strongest challenger. The MR policy did not quite achieve this level of performance, indicating that one million steps against a random opponent was insufficient for convergence to the optimal strategy. The MM policy did slightly better, winning against its challenger more often than did MR. It was beatable, but only barely so.

The algorithms trained by Q-learning did significantly worse and were incapable of scoring at all against their challengers. This was due in great part to the fact that Q-learning is designed to find deterministic policies, and every deterministic offense in this game has a perfect defense, much like rock, paper, scissors. Correctly extending Q-learning to find optimal probabilistic policies is exactly what minimax-Q was designed for.

7 DISCUSSION

This paper explores the Markov game formalism as a mathematical framework for reasoning about multi-agent environments. In particular, the paper describes a reinforcement learning approach to solving two-player zero-sum games in which the "max" operator in the update step of a standard Q-learning algorithm is replaced by a "minimax" operator that can be evaluated by solving a linear program.

The use of linear programming in the innermost loop of a learning algorithm is somewhat problematic, since the computational complexity of each step is large and typically many steps will be needed before the system reaches convergence. It is possible that approximate solutions to the linear programs would suffice. Iterative methods are also quite promising, since the relevant linear programs change slowly over time.
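Concretely, the minimax operator picks a mixed strategy pi over own actions that maximizes the worst-case expected payoff min_o sum_a pi(a) Q[a, o], which is a linear program. Below is a sketch using scipy's linprog; the paper predates such libraries, so the solver and the function name are assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(Q):
    """Value and optimal mixed strategy of the zero-sum matrix game Q,
    where Q[a, o] is the payoff to the maximizing player for own
    action a and opponent action o."""
    n_a, n_o = Q.shape
    # Decision variables: [pi_0, ..., pi_{n_a-1}, v]. linprog minimizes,
    # so minimize -v in order to maximize v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o: v - sum_a pi[a] * Q[a, o] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The pi[a] must form a probability distribution.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]

# Sanity check on rock-paper-scissors: value 0, uniform mixed strategy.
rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
print(matrix_game_value(rps))
```

On rock-paper-scissors this returns value 0 and the uniform strategy, exactly the kind of probabilistic policy the Q-learned agents could not represent; in the minimax-Q update sketched earlier, the routine would be run on the payoff matrix of action pairs at each visited next state.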
For most applications of reinforcement learning to zero-sum games, this is not an impediment. Games such as checkers [Samuel, 1959], tic-tac-toe [Boyan, 1992], backgammon [Tesauro, 1992], and Go [Schraudolph et al., 1994] consist of a series of alternating moves, and in such games the minimax operator can be implemented extremely efficiently.

The strength of the minimax criterion is that it allows the agent to converge to a fixed strategy that is guaranteed to be "safe" in that it does as well as possible against the worst possible opponent. It can be argued that this is unnecessary if the agent is allowed to adapt continually to its opponent. This is certainly true to some extent, but any such agent will in principle be vulnerable to a devious form of trickery in which the opponent leads the agent to learn a poor policy and then exploits it. Identifying an opponent of this type for the Q-learning agent described in this paper would be an interesting topic for future research.

The use of the minimax criterion and probabilistic policies is closely connected to other current research. First, a minimax criterion can be used in single-agent environments to produce more risk-averse behavior [Heger, 1994]. Here, the random transitions of the environment play the role of the opponent. Secondly, probabilistic policies have been used in the context of acting optimally in environments where the agent's perception is incomplete [Singh et al., 1994]. In these environments, random actions are used to combat the agent's uncertainty as to the true state of its environment, much as random actions in games help deal with the agent's uncertainty of the opponent's move.

Although two-player Markov games are a fairly restricted class of multi-agent environments, they are of independent interest and include Markov decision processes as a special case. Applying insights from the theory of cooperative and multi-player games could also prove fruitful, although finding useful connections may be challenging.

Acknowledgments

Thanks to David Ackley, Justin Boyan, Tony Cassandra, and Leslie Kaelbling for ideas and suggestions.

References

[Barto et al., 1989] Barto, A. G.; Sutton, R. S.; and Watkins, C. J. C. H. 1989. Learning and sequential decision making. Technical Report 89-95, Department of Computer and Information Science, University of Massachusetts, Amherst, Massachusetts. Also published in Learning and Computational Neuroscience: Foundations of Adaptive Networks, Michael Gabriel and John Moore, editors. The MIT Press, Cambridge, Massachusetts, 1991.

[Bertsekas, 1987] Bertsekas, D. P. 1987. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall.

[Boyan, 1992] Boyan, Justin A. 1992. Modular neural networks for learning context-dependent game strategies. Master's thesis, Department of Engineering and Computer Laboratory, University of Cambridge, Cambridge, England.

[Heger, 1994] Heger, Matthias 1994. Consideration of risk in reinforcement learning. In Proceedings of the Machine Learning Conference. To appear.

[Howard, 1960] Howard, Ronald A. 1960. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, Massachusetts.

[Owen, 1982] Owen, Guillermo 1982. Game Theory: Second Edition. Academic Press, Orlando, Florida.

[Samuel, 1959] Samuel, A. L. 1959. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development 3:211–229. Reprinted in E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, New York, 1963.
[Schraudolph et al., 1994] Schraudolph, Nicol N.; Dayan, Peter; and Sejnowski, Terrence J. 1994. Using the TD(lambda) algorithm to learn an evaluation function for the game of Go. In Advances in Neural Information Processing Systems 6, San Mateo, CA. Morgan Kaufmann. To appear.

[Schwartz, 1993] Schwartz, Anton 1993. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts. Morgan Kaufmann. 298–305.

[Singh et al., 1994] Singh, Satinder Pal; Jaakkola, Tommi; and Jordan, Michael I. 1994. Model-free reinforcement learning for non-Markovian decision problems. In Proceedings of the Machine Learning Conference. To appear.

[Strang, 1980] Strang, Gilbert 1980. Linear Algebra and Its Applications: Second Edition. Academic Press, Orlando, Florida.

[Tan, 1993] Tan, M. 1993. Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts. Morgan Kaufmann.

[Tesauro, 1992] Tesauro, G. J. 1992. Practical issues in temporal difference learning. In Moody, J. E.; Lippmann, R. P.; and Hanson, S. J., editors 1992, Advances in Neural Information Processing Systems 4, San Mateo, CA. Morgan Kaufmann. 259–266.

[Van Der Wal, 1981] Van Der Wal, J. 1981. Stochastic dynamic programming. In Mathematical Centre Tracts 139. Morgan Kaufmann, Amsterdam.

[von Neumann and Morgenstern, 1947] von Neumann, J. and Morgenstern, O. 1947. Theory of Games and Economic Behavior. Princeton University Press, Princeton, New Jersey.

[Watkins and Dayan, 1992] Watkins, C. J. C. H. and Dayan, P. 1992. Q-learning. Machine Learning 8(3):279–292.

[Watkins, 1989] Watkins, C. J. C. H. 1989. Learning from Delayed Rewards. Ph.D. Dissertation, Cambridge University.

[Yanco and Stein, 1993] Yanco, Holly and Stein, Lynn Andrea 1993. An adaptive communication protocol for cooperating mobile robots. In Meyer, Jean-Arcady; Roitblat, H. L.; and Wilson, Stewart W., editors 1993, From Animals to Animats: Proceedings of the Second International Conference on the Simulation of Adaptive Behavior. MIT Press/Bradford Books. 478–485.