Comparison of Parametric and Nonparametric Models
Senior Three English multiple-choice questions on statistics (60 items with answers)

1. The average score of a class is calculated by adding all the scores and dividing by the number of students. This is an example of _____.
A. mean  B. median  C. mode  D. range
Answer: A.
This question tests basic statistical concepts.
Option A, mean, is obtained by adding all the data values and dividing by the number of values, which matches the description in the stem.
Option B, median, is the value that lies in the middle after the data are sorted from smallest to largest.
Option C, mode, is the value that occurs most frequently in the data.
Option D, range, is the difference between the maximum and minimum values in the data.
2. In a set of data, if there is a value that occurs most frequently, it is called the _____.
A. mean  B. median  C. mode  D. range
Answer: C.
Option A, mean, is the average.
Option B, median, is the middle value.
Option C, mode, is the most frequently occurring value, which matches the question.
Option D, range, is the difference between the extremes.
3. The middle value in a sorted list of data is called the _____.
A. mean  B. median  C. mode  D. range
Answer: B.
Option A, mean, is the average.
Option B, median, is the middle value after sorting, which matches the description in the stem.
Option C, mode, is the most frequently occurring value.
Option D, range, is the difference between the extremes.
4. The difference between the highest and lowest values in a set of data is known as the _____.
A. mean  B. median  C. mode  D. range
Answer: D.
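For reference, all four statistics tested in questions 1-4 are one-liners in Python's standard library; a minimal sketch with a made-up list of class scores:

```python
from statistics import mean, median, mode

scores = [72, 85, 85, 90, 64]                 # hypothetical class scores

print("mean:  ", mean(scores))                # sum of all scores / number of scores
print("median:", median(scores))              # middle value of the sorted list
print("mode:  ", mode(scores))                # most frequently occurring value
print("range: ", max(scores) - min(scores))   # highest value minus lowest value
```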
Parametric and Nonparametric Volatility Measurement*

Torben G. Andersen (a), Tim Bollerslev (b), and Francis X. Diebold (c)

July 2002

* This paper is prepared for Yacine Aït-Sahalia and Lars Peter Hansen (eds.), Handbook of Financial Econometrics, Amsterdam: North-Holland. We are grateful to the National Science Foundation for research support, and to Nour Meddahi, Neil Shephard and Sean Campbell for useful discussions and detailed comments on earlier drafts.

(a) Department of Finance, Kellogg School of Management, Northwestern University, Evanston, IL 60208, and NBER. Phone: 847-467-1285, e-mail: t-andersen@
(b) Departments of Economics and Finance, Duke University, Durham, NC 27708, and NBER. Phone: 919-660-1846, e-mail: boller@
(c) Departments of Economics, Finance and Statistics, University of Pennsylvania, Philadelphia, PA 19104, and NBER. Phone: 215-898-1507, e-mail: fdiebold@

Citation: Andersen, T.G., Bollerslev, T., and Diebold, F.X. (2005), "Parametric and Nonparametric Volatility Measurement," in L.P. Hansen and Y. Aït-Sahalia (eds.), Handbook of Financial Econometrics. Amsterdam: North-Holland, forthcoming.

Table of Contents

Abstract
1. Introduction
2. Volatility Definitions
2.1. Continuous-Time No-Arbitrage Pricing
2.2. Notional, Expected, and Instantaneous Volatility
2.3. Volatility Models and Measurements
3. Parametric Methods
3.1. Discrete-Time Models
3.1.1. ARCH Models
3.1.2. Stochastic Volatility Models
3.2. Continuous-Time Models
3.2.1. Continuous Sample Path Diffusions
3.2.2. Jump Diffusions and Lévy Driven Processes
4. Nonparametric Methods
4.1. ARCH Filters and Smoothers
4.2. Realized Volatility
5. Conclusion
References

ABSTRACT

Volatility has been one of the most active areas of research in empirical finance and time series econometrics during the past decade. This chapter provides a unified continuous-time, frictionless, no-arbitrage framework for systematically categorizing the various volatility concepts, measurement procedures, and modeling procedures. We define three different volatility concepts: (i) the notional volatility corresponding to the ex-post sample-path return variability over a fixed time interval, (ii) the ex-ante expected volatility over a fixed time interval, and (iii) the instantaneous volatility corresponding to the strength of the volatility process at a point in time. The parametric procedures rely on explicit functional form assumptions regarding the expected and/or instantaneous volatility. In the discrete-time ARCH class of models, the expectations are formulated in terms of directly observable variables, while the discrete- and continuous-time stochastic volatility models involve latent state variable(s). The nonparametric procedures are generally free from such functional form assumptions and hence afford estimates of notional volatility that are flexible yet consistent (as the sampling frequency of the underlying returns increases). The nonparametric procedures include ARCH filters and smoothers designed to measure the volatility over infinitesimally short horizons, as well as the recently-popularized realized volatility measures for (non-trivial) fixed-length time intervals.

1. INTRODUCTION

Since Engle's (1982) seminal paper on ARCH models, the econometrics literature has focused considerable attention on time-varying volatility and the development of new tools for volatility measurement, modeling and forecasting.
These advances have in large part been motivated by the empirical observation that financial asset return volatility is time-varying in a persistent fashion, across assets, asset classes, time periods, and countries.¹ Asset return volatility, moreover, is central to finance, whether in asset pricing, portfolio allocation, or risk management, and standard financial econometric methods and models take on a very different, conditional, flavor when volatility is properly recognized to be time-varying.

¹ See, for example, Bollerslev, Chou and Kroner (1992).

The combination of powerful methodological advances and tremendous relevance in empirical finance produced explosive growth in the financial econometrics of volatility dynamics, with the econometrics and finance literatures cross-fertilizing each other furiously. Initial developments were tightly parametric, but the recent literature has moved in less parametric, and even fully nonparametric, directions. Here we review and provide a unified framework for interpreting both the parametric and nonparametric approaches.

In section 2, we define three different volatility concepts: (i) the notional volatility corresponding to the ex-post sample-path return variability over a fixed time interval, (ii) the ex-ante expected volatility over a fixed time interval, and (iii) the instantaneous volatility corresponding to the strength of the volatility process at a point in time.

In section 3, we survey parametric approaches to volatility modeling, which are based on explicit functional form assumptions regarding the expected and/or instantaneous volatility. In the discrete-time ARCH class of models, the expectations are formulated in terms of directly observable variables, while the discrete- and continuous-time stochastic volatility models both involve latent state variable(s).

In section 4, we survey nonparametric approaches to volatility modeling, which are generally free from such functional form assumptions and hence afford estimates of notional volatility that are flexible yet consistent (as the sampling frequency of the underlying returns increases). The nonparametric approaches include ARCH filters and smoothers designed to measure the volatility over infinitesimally short horizons, as well as the recently-popularized realized volatility measures for (non-trivial) fixed-length time intervals.

We conclude in section 5 by highlighting promising directions for future research.

2. VOLATILITY DEFINITIONS

Here we introduce a unified framework for defining and classifying different notions of return volatility in a continuous-time no-arbitrage setting. We begin by outlining the minimal set of regularity conditions invoked on the price processes and establish the notation used for the decomposition of returns into an expected, or mean, return and an innovation component. The resulting characterization of the price process is central to the development of our different volatility measures, and we rely on the concepts and notation introduced in this section throughout the chapter.

2.1. Continuous-Time No-Arbitrage Price Processes

Measurement of return volatility requires determination of the component of a given price increment that represents a return innovation as opposed to an expected price movement. In a discrete-time setting this identification may only be achieved through a direct specification of the conditional mean return, for example through an asset pricing model, as economic principles impose few binding constraints on the price process.
However, within a frictionless continuous-time framework the no-arbitrage requirement quite generally guarantees that the return innovation is an order of magnitude larger than the mean return. This result is not only critical to the characterization of arbitrage-free continuous-time price processes, but it also has important implications for the approach one may employ for measurement and modeling of volatility over short intradaily return horizons.

We take as given a univariate risky logarithmic price process p(t) defined on a complete probability space, (Ω, ℱ, P). The price process evolves in continuous time over the interval [0,T], where T is a (finite) integer. The associated natural filtration is denoted (ℱ_t)_{t∈[0,T]} ⊆ ℱ, where the information set, ℱ_t, contains the full history (up to time t) of the realized values of the asset price and other relevant (possibly latent) state variables, and is otherwise assumed to satisfy the usual conditions. It is sometimes useful to consider the information set generated by the asset price history alone. We refer to this coarser filtration, consisting of the initial conditions and the history of the asset prices only, by (F_t)_{t∈[0,T]} ⊆ F ≡ F_T, so that by definition, F_t ⊆ ℱ_t. Finally, we assume there is an asset guaranteeing an instantaneously risk-free rate of interest, although we shall not refer to this rate explicitly. Many more risky assets may, of course, be available, but we explicitly retain a univariate focus for notational simplicity. The extension to the multivariate setting is conceptually straightforward, as discussed in specific instances below. The continuously compounded return over the time interval [t−h,t] is then

r(t,h) = p(t) − p(t−h), 0 ≤ h ≤ t ≤ T.    (2.1)

We also adopt the following shorthand notation for the cumulative return up to time t, i.e., the return over the [0,t] time interval:

r(t) ≡ r(t,t) = p(t) − p(0), 0 ≤ t ≤ T.    (2.2)

These definitions imply a simple relation between the period-by-period and the cumulative returns that we use repeatedly in the sequel:

r(t,h) = r(t) − r(t−h), 0 ≤ h ≤ t ≤ T.    (2.3)

A maintained assumption throughout is that - almost surely (P) (henceforth denoted (a.s.)) - the asset price process remains strictly positive and finite, so that p(t) and r(t) are well defined over [0,T] (a.s.). It follows that r(t) has only countably many jump points over [0,T], and we adopt the convention of equating functions that have identical left and right limits everywhere. Defining r(t−) ≡ lim_{τ→t, τ<t} r(τ) and r(t+) ≡ lim_{τ→t, τ>t} r(τ) uniquely determines the right-continuous, left-limit (càdlàg) version of the process, for which r(t) = r(t+) (a.s.), and the left-continuous, right-limit (càglàd) version, for which r(t) = r(t−) (a.s.), for all t in [0,T]. In the following, we assume without loss of generality that we are working with the càdlàg version of the various return processes and components.

The jumps in the cumulative price and return process are then

Δr(t) ≡ r(t) − r(t−), 0 ≤ t ≤ T.    (2.4)

Obviously, at continuity points for r(t), we have Δr(t) = 0. Notice also that, because there are at most a countably infinite number of jumps, a jump occurrence is unusual in the sense that we generically have

P(Δr(t) ≠ 0) = 0,    (2.5)

for an arbitrarily chosen t in [0,T]. This does not imply that jumps necessarily are rare. In fact, equation (2.5) is consistent with there being a (countably) infinite number of jumps over any discrete interval - a phenomenon referred to as an explosion. Jump processes that do not explode are termed regular.
For regular processes, the anticipated jump frequency is conveniently characterized by the instantaneous jump intensity, i.e., the probability of a jump over the next instant of time, expressed in units that reflect the expected (and finite) number of jumps per unit time interval.

In the following, we invoke the standard assumptions of no arbitrage opportunities and a finite expected return. Within our frictionless setting, it is well known that these conditions imply that the log-price process must constitute a (special) semi-martingale (e.g., Back, 1991). This, in turn, affords the following unique canonical return decomposition (e.g., Protter, 1992).

PROPOSITION 1 - Return Decomposition

Any arbitrage-free logarithmic price process subject to the regularity conditions outlined above may be uniquely represented as

r(t) ≡ p(t) − p(0) = μ(t) + M(t) = μ(t) + M^c(t) + M^J(t),    (2.6)

where μ(t) is a predictable and finite variation process, M(t) is a local martingale which may be further decomposed into M^c(t), a continuous sample path, infinite variation local martingale component, and M^J(t), a compensated jump martingale. All components may be assumed to have initial conditions normalized such that μ(0) ≡ M(0) ≡ M^c(0) ≡ M^J(0) ≡ 0, which implies that r(t) ≡ p(t).

Proposition 1 provides a unique decomposition of the instantaneous return into an expected return component and a (martingale) innovation. Over discrete intervals, the relation becomes slightly more complex. Letting the expected returns over [t−h,t] be denoted by m(t,h), equation (2.6) implies

m(t,h) ≡ E[r(t,h) | ℱ_{t−h}] = E[μ(t,h) | ℱ_{t−h}], 0 < h ≤ t ≤ T,    (2.7)

where

μ(t,h) ≡ μ(t) − μ(t−h), 0 < h ≤ t ≤ T,    (2.8)

and the return innovation takes the form

r(t,h) − m(t,h) = (μ(t,h) − m(t,h)) + M(t,h), 0 < h ≤ t ≤ T.    (2.9)

The first term on the right hand side of (2.9) signifies that the expected return process, even though it is (locally) predictable, may evolve stochastically over the [t−h,t] interval.² If μ(t,h) is predetermined (measurable with respect to ℱ_{t−h}), and thus known at time t−h, then the discrete-time return innovation reduces to M(t,h) ≡ M(t) − M(t−h). However, any shift in the expected return process during the interval will generally render the initial term on the right hand side of (2.9) non-zero and thus contribute to the return innovation over [t−h,t].

Although the discrete-time return innovation incorporates two distinct terms, the martingale component, M(t,h), is generally the dominant contributor to the return variation over short intervals, i.e., for h small. In order to discuss the intuition behind this result, which we formalize in the following section, it is convenient to decompose the expected return process into a purely continuous, predictable finite variation part, μ^c(t), and a purely predictable jump part, μ^J(t).

² In other words, even though the conditional mean is locally predictable, all return components in the special semi-martingale decomposition are generally stochastic: not only volatility, but also the jump intensity, the jump size distribution and the conditional mean process may evolve randomly over a finite interval.

Because the continuous component, μ^c(t), is of finite variation, it is locally an order of magnitude smaller than the corresponding contribution from the continuous component of the innovation term, M^c(t).
The reason is - loosely speaking - that an asset earning, say, a positive expected return over the risk-free rate must have innovations that are an order of magnitude larger than the expected return over infinitesimal intervals. Otherwise, a sustained long position (infinitely many periods over any interval) in the risky asset will tend to be perfectly diversified due to a Law of Large Numbers, as the martingale part is uncorrelated. Thus, the risk-return relation becomes unbalanced. Only if the innovations are large, preventing the Law of Large Numbers from becoming operative, will this not constitute a violation of the no-arbitrage condition (e.g., Maheswaran and Sims, 1993). The presence of a non-trivial M^J(t) component may similarly serve to eliminate arbitrage and retain a balanced risk-return trade-off relationship.

Analogous considerations apply to the jump component of the expected return process, μ^J(t), if this factor is present. There cannot be a predictable jump in the mean - i.e., a perfectly anticipated jump in terms of both time and size - unless it is accompanied by large jump innovation risk as well, so that Pr(ΔM(t) ≠ 0) > 0. Again - intuitively - if there were a known, say, positive jump, then this would induce arbitrage (by going long the asset) unless there were offsetting (jump) innovation risk.³ Most of the continuous-time asset pricing literature ignores predictable jumps, even if they are logically consistent with the framework. One reason may be that their existence is fragile in the following sense. A fully anticipated jump must be associated with release of new (price relevant) information at a given point in time. However, if there is any uncertainty about the timing of the announcement, so that it is only known to occur within a given minute, or even a few seconds, then the timing of the jump is more aptly modelled by a continuous hazard function where the jump probability at each point in time is zero, and the predictable jump event is thus eliminated. In addition, even if there were predictable jumps associated with scheduled news releases, the size of the predictable component of the jump is likely much smaller than the size of the associated jump innovation, so that the descriptive power lost by ignoring the possibility of predictable jumps is minimal. Thus, rather than modify the standard setup to allow for the presence of predictable (but empirically negligible) jumps, we follow the tradition in the literature and assume away such jumps.

Although we will not discuss specific model classes at length until later sections, it may be useful to briefly consider two simple examples to illustrate the somewhat abstract definitions given in the current section.

³ This point is perhaps most readily understood by analogy to a discrete-time setting. When there is a predictable jump at time t, the instant from t− to t is effectively equivalent to a trading period, say from t−1 to t, within a discrete-time model.
In that setting, no asset can earn a positive (or negative) excess return relative to the risk-free rate over (t−1,t] without bearing genuine risk, as this would otherwise imply a trivial arbitrage opportunity. A predictable price jump without an associated positive probability of a jump innovation works entirely analogously.

EXAMPLE 1: Stochastic Volatility Jump Diffusion with Non-Zero Mean Jumps

Consider the following continuous-time jump diffusion expressed in stochastic differential equation (sde) form,

dp(t) = (μ + βσ²(t)) dt + σ(t) dW(t) + κ(t) dq(t), 0 ≤ t ≤ T,

where σ(t) is a strictly positive continuous sample path process (a.s.), W(t) denotes a standard Brownian motion, q(t) is a pure jump process with dq(t) = 1 corresponding to a jump at time t and dq(t) = 0 otherwise, while κ(t) refers to the size of the corresponding jumps. We assume the jump size distribution has a constant mean of μ_κ and variance of σ_κ². Finally, the jump intensity is assumed constant (and finite) at a rate λ per unit time. In the notation of Proposition 1, we then have the return components

μ(t) = μ^c(t) = μ·t + β ∫₀ᵗ σ²(s) ds + λ·μ_κ·t,

M^c(t) = ∫₀ᵗ σ(s) dW(s),

M^J(t) = Σ_{0≤s≤t} κ(s)·dq(s) − λ·μ_κ·t.

Notice that the last term of the mean representation captures the expected contribution coming from the jumps, while the corresponding term is subtracted from the jump innovation process to provide a unique (compensated) jump martingale representation for M^J.

EXAMPLE 2: Discrete-Time Stochastic Volatility (ARCH) Model

Consider the discrete-time (jump) process for p(t) defined over the unit time interval,

p(t) = p(t−1) + μ + βσ²(t) + σ(t) z(t), t = 1, 2, ..., T,

where (implicitly)

p(t+τ) ≡ p(t), t = 0, 1, ..., T−1, 0 ≤ τ < 1,

and z(t) denotes a martingale difference sequence with unit variance, while σ(t) is a (possibly latent) positive (a.s.) stochastic process that is measurable with respect to ℱ_{t−1}. Of course, in this situation the continuous-time no-arbitrage arguments based on infinitely many long-short positions over fixed length time intervals are no longer operative. Nonetheless, we shall later argue that the orders of magnitude of the corresponding terms in the decomposition in Proposition 1 remain suggestive and useful from an empirical perspective. In fact, we may still think of the price process evolving in continuous time, but the market being open and prices observed only at discrete points in time. In this specific model we have, for t = 1, 2, ..., T,

μ(t) = μ·t + β Σ_{s=1,...,t} σ²(s),

M(t) ≡ M^J(t) = Σ_{s=1,...,t} σ(s) z(s).

A final comment is in order. We purposely express the price changes and associated returns in Proposition 1 over a discrete time interval. The concept of an instantaneous return employed in the formulation of continuous-time models, given in sde form (as in Example 1 above), is pure shorthand notation that is formally defined only through the corresponding integral representation, such as equation (2.6). Although this is a technical point, it has an important empirical analogy: real-time price data are not available at every instant and, due to pertinent microstructure features, prices are invariably constrained to lie on a discrete grid, both in the price and time dimension. Hence, there is no real-world counterpart to the notion of a continuous sample path martingale with infinite variation over arbitrarily small time intervals (say, less than a second). It is only feasible to measure return (and volatility) realizations over discrete time intervals.
Moreover, sensible measures can typically only be constructed over much longer horizons than given by the minimal interval length for which consecutive trade prices or quotes are recorded. We return to this point later. For now, we simply note that our main conceptualization of volatility in the next section conforms directly with the focus on realizations measured over non-trivial discrete time intervals rather than vanishing, or instantaneous, interval lengths.

2.2. Notional, Expected, and Instantaneous Volatility

This section introduces three distinct volatility concepts that serve to formalize the process of measuring and modeling volatility within our frictionless, arbitrage-free setting.

By definition, volatility seeks to capture the strength of the (unexpected) return variation over a given period of time. However, two distinct features importantly differentiate the construction of all (reasonable) volatility measures. First, given a set of actual return observations, how is the realized volatility computed? Here, the emphasis is explicitly on ex-post measurement of the volatility. Second, decision making often requires forecasts of future return volatility. The focus is then on ex-ante expected volatility. The latter concept naturally calls for a model that may be used to map the current information set into a volatility forecast. In contrast, the (ex-post) realized volatility may be computed (or approximated) without reference to any specific model, thus rendering the task of volatility measurement essentially a nonparametric procedure.

It is natural first to concentrate on the behavior of the martingale component in the return decomposition (2.6). However, a prerequisite for observing the M(t) process is that we have access to a continuous record of price data. Such data are simply not available and, even for extremely liquid markets, microstructure effects (discrete price grids, bid-ask bounce effects, etc.) prevent us from ever getting really close to a true continuous sample path realization. Consequently, we focus on measures that represent the (average) volatility over a discrete time interval, rather than the instantaneous (point-in-time) volatility.⁴ This, in turn, suggests a natural and general notion of volatility based on the quadratic variation process for the local martingale component in the unique semi-martingale return decomposition.

Specifically, let X(t) denote any (special) semi-martingale. The unique quadratic variation process, [X,X]_t, t ∈ [0,T], associated with X(t) is then formally defined by

[X,X]_t ≡ X(t)² − 2 ∫₀ᵗ X(s−) dX(s), 0 < t ≤ T,    (2.10)

where the stochastic integral of the adapted càglàd process, X(s−), with respect to the càdlàg semi-martingale, X(s), is well-defined (e.g., Protter, 1992). It follows directly that the quadratic variation, [X,X], is an increasing stochastic process. Also, jumps in the sample path of the quadratic variation process necessarily occur concurrent with the jumps in the underlying semi-martingale process, Δ[X,X]_t = (ΔX(t))².

Importantly, if M is a locally square integrable martingale, then the associated (M² − [M,M]) process is a local martingale,

E[M(t,h)² − ([M,M]_t − [M,M]_{t−h}) | ℱ_{t−h}] = 0, 0 < h ≤ t ≤ T.    (2.11)

This relation, along with the following well-known result, provides the key to the interpretation of the quadratic variation process as one of our volatility measures.

PROPOSITION 2 - Theory of Quadratic Variation

Let a sequence of possibly random partitions of [0,T], (τ_m), be given such that (τ_m) ≡ {τ_{m,j}}_{j≥0}, m = 1, 2, ...,
where τ_{m,0} ≤ τ_{m,1} ≤ τ_{m,2} ≤ ... satisfy, with probability one, for m → ∞,

τ_{m,0} → 0;  sup_{j≥1} τ_{m,j} → T;  sup_{j≥0} (τ_{m,j+1} − τ_{m,j}) → 0.

Then, for t ∈ [0,T],

lim_{m→∞} { Σ_{j≥1} (X(t ∧ τ_{m,j}) − X(t ∧ τ_{m,j−1}))² } → [X,X]_t,

where t ∧ τ ≡ min(t,τ), and the convergence is uniform in probability.

Intuitively, the proposition says that the quadratic variation process represents the (cumulative) realized sample path variability of X(t) over the [0,t] time interval. This observation, together with the martingale property of the quadratic variation process in (2.11), immediately points to the following theoretical notion of ex-post return variability.

⁴ Of course, by choosing the interval very small, one may in principle approximate the notion of point-in-time volatility, as discussed further below.

DEFINITION 1 - Notional Volatility

The Notional Volatility over [t−h,t], 0 < h ≤ t ≤ T, is

υ²(t,h) ≡ [M,M]_t − [M,M]_{t−h} = [M^c,M^c]_t − [M^c,M^c]_{t−h} + Σ_{t−h<s≤t} ΔM²(s).    (2.12)

This same volatility concept has recently been highlighted in a series of papers by Andersen, Bollerslev, Diebold and Labys (2001a,b) and Barndorff-Nielsen and Shephard (2002a,b,c). The latter authors term the corresponding concept Actual Volatility.

Under the maintained assumption of no predictable jumps in the return process, and noting that the quadratic variation of any finite variation process, such as μ^c(t), is zero, we also have

υ²(t,h) ≡ [r,r]_t − [r,r]_{t−h} = [M^c,M^c]_t − [M^c,M^c]_{t−h} + Σ_{t−h<s≤t} Δr²(s).    (2.13)

Consequently, the notional volatility equals (the increment to) the quadratic variation for the return series. Equation (2.13) and Proposition 2 also suggest that (ex post) it is possible to approximate the notional volatility arbitrarily well through the accumulation of ever more finely sampled high-frequency squared returns, and that this approach remains consistent independent of the expected return process. We shall return to a much more detailed analysis of this idea in our discussion of nonparametric ex-post volatility measures in section 4 below.

Similarly, from (2.13) and Proposition 2 it is evident that the notional volatility, υ²(t,h), directly captures the sample path variability of the log-price process over the [t−h,t] time interval. In particular, the notional volatility explicitly incorporates the effect of (realized) jumps in the price process: jumps contribute to the realized return variability, and forecasts of volatility must account for the potential occurrence of such jumps. It also follows, from the properties of the quadratic variation process, that

E[υ²(t,h) | ℱ_{t−h}] = E[M(t,h)² | ℱ_{t−h}] = E[M²(t) | ℱ_{t−h}] − M²(t−h), 0 < h ≤ t ≤ T.    (2.14)

Hence, the expected notional volatility represents the expected future (cumulative) squared return innovation. As argued in section 2.1, this component is typically the dominant determinant of the expected return volatility.

For illustration, consider again the two examples introduced in section 2.1 above. Alternative, more complicated specifications and issues related to longer horizon returns are considered in section 3.

EXAMPLE 1: Stochastic Volatility Jump Diffusion with Non-Zero Mean Jumps (Revisited)

The log-price process evolves according to

dp(t) = (μ + βσ²(t)) dt + σ(t) dW(t) + κ(t) dq(t), 0 ≤ t ≤ T.

The notional volatility is then

υ²(t,h) = ∫₀ʰ σ²(t−h+s) ds + Σ_{t−h<s≤t} κ²(s).

The expected notional volatility involves taking the conditional expectation of this expression. Without an explicit model for the volatility process, this cannot be given in closed form.
However, for small h the (expected) notional volatility is typically very close to the value attained if volatility is constant. In particular, to a first-order approximation,

E[υ²(t,h) | ℱ_{t−h}] ≈ σ²(t−h)·h + λ·h·(μ_κ² + σ_κ²) = [σ²(t−h) + λ(μ_κ² + σ_κ²)]·h,

while

m(t,h) ≈ [μ + β·σ²(t−h) + λ·μ_κ]·h.

Thus, the expected notional volatility is of order h, the expected return is of order h (and the variation of the mean return of order h²), whereas the martingale (return innovation) is of order h^{1/2}, and hence an order of magnitude larger for small h.

EXAMPLE 2: Discrete-Time Stochastic Volatility (ARCH) Model (Revisited)

The discrete-time (jump) process for p(t) defined over the unit time interval is

p(t) = p(t−1) + μ + βσ²(t) + σ(t) z(t), t = 1, 2, ..., T.

The one-period notional volatility measure is then

υ²(t,1) = σ²(t) z²(t),

while, of course, the expected notional volatility is

E[υ²(t,1) | ℱ_{t−1}] = σ²(t),

and the expected return is

m(t,1) = μ + βσ²(t).
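Both Proposition 2 and the first-order approximation in Example 1 are easy to check numerically. The sketch below simulates Example 1 over a short interval h under simplifying assumptions (volatility held constant over the interval, at most one jump per Euler sub-step, all parameter values arbitrary illustrations), cumulates squared high-frequency returns as an approximation to υ²(t,h), and compares the Monte Carlo average against [σ² + λ(μ_κ² + σ_κ²)]·h:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, lam = 0.2, 100.0                 # diffusive volatility and jump intensity (illustrative)
mu_k, s_k = 0.0, 0.01                   # jump-size mean and standard deviation
h, n, reps = 1 / 252, 500, 4000         # interval length h, sub-steps per interval, MC paths

dt = h / n
v2 = []
for _ in range(reps):
    dW = rng.standard_normal(n) * np.sqrt(dt)   # Brownian increments
    jump = rng.random(n) < lam * dt             # at most one jump per sub-step (Euler)
    kappa = rng.normal(mu_k, s_k, n) * jump     # realized jump sizes
    r = sigma * dW + kappa                      # high-frequency return innovations
    v2.append(np.sum(r ** 2))                   # sum of squared returns ~ notional volatility

print(np.mean(v2))                                       # Monte Carlo E[v^2(t,h)]
print((sigma ** 2 + lam * (mu_k ** 2 + s_k ** 2)) * h)   # first-order approximation
```

As the sub-step count n grows, the realized sum also converges path-by-path to the increment of the quadratic variation, which is the content of Proposition 2.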
Chinese Library Classification: H314
UDC: 802.0

Graduation Thesis for the M.A. Degree

A Comparative Study of Humor Strategies in Chinese and English Sitcoms
-- A Case Study of Ipartment and Friends

Candidate: XU Jingyu
Supervisor: Prof. WANG Jinghui
Academic Degree Applied for: Master of Arts
Specialty: Foreign Linguistics and Applied Linguistics
Affiliation: School of Foreign Languages
Date of Oral Examination: July 1, 2012
Degree Conferring Institution: Harbin Institute of Technology

Abstract

Humor strategies in sitcoms have become a hot topic in linguistic research in recent years.
Western humor research has focused mainly on the three traditional theories of humor, the Semantic Script Theory of Humor, and the General Theory of Verbal Humor.
Chinese scholars, for their part, have mostly carried out textual analyses of verbal humor in sitcoms, crosstalk, and talk shows from the perspectives of pragmatics, cognitive linguistics, and rhetoric.
Grade 8 English multiple-choice questions on argumentation methods (40 items)

1. In the essay, the author mentions a story about a famous scientist to support his idea. This is an example of _____.
A. analogy  B. example  C. comparison  D. metaphor
Answer: B.
This question tests the ability to distinguish methods of argumentation.
Option A, analogy, is reasoning by analogy; option B, example, is giving an example; option C, comparison, is comparing; option D, metaphor, is a figure of speech.
The essay mentions a story about a famous scientist to support the point, which is argument by example.

2. The writer uses the experience of his own life to prove his point. This kind of method is called _____.
A. personal story  B. example giving  C. case study  D. reference
Answer: B.
Option A, personal story, is too narrow in scope; option B, example giving, means giving examples; option C, case study, is a case analysis; option D, reference, means citing a source.
The writer uses his own life experience to prove his point, which is argument by example.

3. The author cites several historical events to strengthen his argument. What is this method?
A. citing facts  B. giving examples  C. making comparisons  D. using analogies
Answer: B.
Option A, citing facts, means quoting facts, but here the historical events serve as examples, so the method is giving examples; option B, giving examples, matches; option C, making comparisons, is comparing; option D, using analogies, is reasoning by analogy.
Chapter 1: Software Engineering Overview

1. Software deteriorates rather than wears out because
A: Software suffers from exposure to hostile environments
B: Defects are more likely to arise after software has been used often
C: Multiple change requests introduce errors in component interactions
D: Software spare parts become harder to order

2. Today the increased power of the personal computer has brought about an abandonment of the practice of team development of software.
A: True
B: False

3. Which question no longer concerns the modern software engineer?
A: Why does computer hardware cost so much?
B: Why does software take a long time to finish?
C: Why does it cost so much to develop a piece of software?
D: Why can't software errors be removed from products prior to delivery?

4. In general software only succeeds if its behavior is consistent with the objectives of its designers.
A: True
B: False
ISLR Chapter 2: Statistical Learning

1.1 What Is Statistical Learning?

More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

Y = f(X) + ε.

Statistical learning refers to a set of approaches for estimating f.

1.1.1 Why Estimate f?

There are two main reasons: prediction and inference.

1. Prediction

Ŷ = f̂(X), where f̂ represents our estimate for f, and Ŷ represents the resulting prediction for Y. f̂ is often treated as a black box.

The accuracy of Ŷ as a prediction for Y depends on two quantities, which we will call the reducible error and the irreducible error.
- Reducible error: in general, f̂ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f.
- Irreducible error: Y is also a function of ε, which, by definition, cannot be predicted using X. Therefore, variability associated with ε also affects the accuracy of our prediction.

2. Inference

To understand the relationship between X and Y, or more specifically, to understand how Y changes as a function of X1, ..., Xp. Now f̂ cannot be treated as a black box, because we need to know its exact form.

1.1.2 How Do We Estimate f?

1. Parametric methods

Parametric methods involve a two-step model-based approach.
(1) First, we make an assumption about the functional form, or shape, of f. For example, one very simple assumption is that f is linear in X:

f(X) = β0 + β1·X1 + β2·X2 + ... + βp·Xp.

(2) After a model has been selected, we need a procedure that uses the training data to fit or train the model.

Advantage: simplifies the problem.
Disadvantage: the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor.

2. Non-parametric methods

Non-parametric methods do not make explicit assumptions about the functional form of f.

Advantage: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f.
Disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.

1.1.3 The Trade-Off Between Prediction Accuracy and Model Interpretability

We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. Surprisingly, we will often obtain more accurate predictions using a less flexible method.

1.1.4 Supervised Versus Unsupervised Learning

1. Supervised learning

For each observation of the predictor measurement(s) xi, i = 1, ..., n there is an associated response measurement yi. Examples include linear regression, logistic regression, GAMs, boosting, and support vector machines.

2. Unsupervised learning

Unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, ..., n, we observe a vector of measurements xi but no associated response yi.
Examples include cluster analysis.

1.2 Assessing Model Accuracy

1.2.1 Measuring the Quality of Fit

Mean squared error (MSE):

MSE = (1/n) Σ_{i=1..n} (yi − f̂(xi))²,

where f̂(xi) is the prediction that f̂ gives for the ith observation.
- Training MSE: the MSE computed using the training data that was used to fit the model.
- Test MSE: the MSE computed on previously unseen test observations; the smaller the test MSE, the better.

1.2.2 The Bias-Variance Trade-Off

Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

1.2.3 The Classification Setting

Training error rate: (1/n) Σ_{i=1..n} I(yi ≠ ŷi).
Test error rate: Ave(I(y0 ≠ ŷ0)).

1. The Bayes classifier

In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class one if Pr(Y = 1 | X = x0) > 0.5, and class two otherwise. The overall Bayes error rate is given by

1 − E[ max_j Pr(Y = j | X) ].

2. K-nearest neighbors

Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j:

Pr(Y = j | X = x0) = (1/K) Σ_{i ∈ N0} I(yi = j).
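The KNN conditional-probability estimate above translates directly into code; a minimal NumPy sketch on a made-up two-class toy dataset:

```python
import numpy as np

def knn_predict(X, y, x0, K):
    """Classify x0 by the highest estimated Pr(Y = j | X = x0) among the K nearest neighbors."""
    dists = np.linalg.norm(X - x0, axis=1)     # Euclidean distance to each training point
    N0 = np.argsort(dists)[:K]                 # indices of the K closest points
    classes, counts = np.unique(y[N0], return_counts=True)
    return classes[np.argmax(counts)]          # majority class in N0

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 0.8], [1.2, 1.0]])
y = np.array([0, 0, 1, 1, 1])
print(knn_predict(X, y, np.array([1.0, 1.0]), K=3))   # -> 1
```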
A COMPARISON OF LINEAR AND HYPERTEXT FORMATS IN INFORMATION RETRIEVAL

Cliff McKnight, Andrew Dillon and John Richardson
HUSAT Research Centre, Department of Human Sciences, University of Loughborough

This item is not the definitive copy. Please use the following citation when referencing this material: McKnight, C., Dillon, A. and Richardson, J. (1990) A comparison of linear and hypertext formats in information retrieval. In: R. McAleese and C. Green, HYPERTEXT: State of the Art, Oxford: Intellect, 10-19.

Abstract

An exploratory study is described in which the same text was presented to subjects in one of four formats, of which two were hypertext (TIES and Hypercard) and two were linear (Word Processor and paper). Subjects were required to use the text to answer 12 questions. Measurement was made of their time and accuracy, and their movement through the document was recorded, in addition to a variety of subjective data being collected. Although there was no significant difference between conditions for task completion time, subjects performed more accurately with linear formats. The implications of these findings and the other data collected are discussed.

Introduction

It has been suggested that the introduction of hypertexts could lead to improved access and (human) processing of information across a broad range of situations (Kreitzberg and Shneiderman, 1988). However, the term 'hypertext' has been used as though it were a unitary concept when, in fact, major differences exist between the various implementations which are currently available, and some (Apple's Hypercard, for example) are powerful enough to allow the construction of a range of different applications. In addition, these views tend to disregard the fact that written texts have evolved over several hundred years to support a range of task requirements in a variety of formats. This technical evolution has been accompanied by a comparable evolution in the skills that readers have in terms of rapidly scanning, searching and manipulating paper texts (McKnight et al., 1990).

The recent introduction of cheap, commercial hypertext systems has been made possible by the widespread availability of powerful microcomputers. However, the recency of this development means that there is little evidence about the effectiveness of hypertexts and few guidelines for successful implementations. Although a small number of studies have been carried out, their findings have typically been contradictory (cf. Weldon et al., 1985, and Gordon et al., 1988). This outcome is predictable if allowances are not made for the range of text types (e.g., on-line help, technical documentation, tutorial packages) and reading purposes (e.g., familiarisation, searching for specific items, reading for understanding). Some text types would appear to receive no advantage from electronic implementation, let alone hypertext treatment at the intra-document level (poetry or fiction, for example, where the task is straightforward reading rather than study or analysis of the text per se). Thus there appears to be justification for suggesting that some hypertext packages may be appropriate for some document types and not others. A discussion of text types can be found in Dillon and McKnight (1989).

Marchionini and Shneiderman (1988) differentiate between the procedural and often iterative types of information retrieval undertaken by experts on behalf of end users and the more informal methods employed by end users themselves.
They suggest that hypertext systems may be well suited to end users because they encourage "informal, personalized, content-oriented information seeking strategies" (p. 71).

The current exploratory study was designed to evaluate a number of document formats using a task that would tend to elicit 'content-oriented information seeking strategies'. The study was also undertaken to evaluate a methodology and indicate specific questions to be investigated in future experiments.

Method

Subjects

16 subjects participated in the study, 9 male and 7 female, age range 21-36. All were members of HUSAT staff and all had experience of using a variety of computer systems and applications.

Materials

The text used was "Introduction to Wines" by Elizabeth A. Buie and W. Edgar Hassell, a basic guide to the history, production and appreciation of wine. This hypertext was widely distributed by Ben Shneiderman as a demonstration of the TIES (The Interactive Encyclopedia System) package prior to its marketing as HyperTIES.

This text was adapted for use in other formats by the present authors. In the TIES version, each topic is held as a separate file, resulting in 40 individual small files. For the Hypercard version, a topic card was created for each corresponding TIES file. For the word processor version, the text was arranged in an order which seemed sensible to the authors, starting with the TIES 'Introduction' text and grouping the various topics under more general headings. A pilot test confirmed that the final version was generally consistent in structure with the present sample's likely ordering.

The Hypercard and word processor versions were displayed on a monochrome Macintosh II screen and the TIES version was displayed on an IBM PC colour screen. The paper version was a print-out of the word processor text.

Task

Subjects were required to use the text to answer a set of 12 questions. These were specially developed by the authors to ensure that a range of information retrieval strategies were employed to answer them and that the questions did not unduly favour any one medium (e.g., one with a search facility).

Design

A four-condition, independent subjects design was employed with presentation format (Hypercard, TIES, Paper and Word Processor) as the independent variable. The dependent variables were speed, accuracy, access strategy and subjects' estimate of document size.

Procedure

Subjects were tested individually. One experimenter described the nature of the investigation and introduced the subject to the text and system. Any questions the subject had were answered before a three minute familiarisation period commenced, during which the subject was encouraged to browse through the text. After three minutes the subjects were asked several questions pertaining to estimated document size and range of contents viewed. They were then given the question set and asked to attempt all questions in the presented order. Subjects were encouraged to verbalise their thoughts and a small tie-pin microphone was used to record their comments. Movement through the text was captured by video camera.

Results

Estimates of document size

After familiarisation with the text, subjects were asked to estimate the size of the document in pages or screens. The linear formats contained 13 pages, the Hypercard version contained 53 cards, and the TIES version contained 78 screens. Therefore raw scores were converted to percentages.
The responses are presented in Table 1 (where a correct response is 100).

Table 1: Subjects' estimates of document size.

Subject   TIES     Paper    HyperCard   Word Processor
1         641.03    76.92    150.94      92.31
2          58.97    92.31     56.60      76.92
3          51.28    76.92    465.17     100.00
4         153.84   153.85     75.47      93.21
Mean      226.28   100.00    187.05      90.61
SD        280.41    36.63    189.84       9.75

Subjects in the linear format conditions estimated the size of the document reasonably accurately. However, subjects who read the hypertexts were less accurate, several of them over-estimating the size by a very high margin. While a one-way ANOVA revealed no significant effect (F[3,12] = 0.61, NS), these data are interesting and suggest that subjective assessment of text size as a function of format is an issue worthy of further investigation. Such an assessment may well influence an estimation of the level of detail involved in the content as well as choice of appropriate access strategy.

Speed

Time taken to complete the twelve tasks was recorded for each subject. Total time per subject and mean time per condition are presented in Table 2 (all data are in seconds).

Table 2: Task completion times (in seconds).

Subject   TIES   Paper   HyperCard   Word Processor
1         1753     795       1161        1480
2         1159    1147        655         827
3         2139    2231       1013        1014
4         1073    1115       1610        1836
Mean      1531    1322       1110        1289

Clearly, while there is variation at the subject level, there is little apparent difference between conditions. A one-way ANOVA confirmed this (F[3,12] = 0.47, p > 0.7), and even a t-test between the fastest and slowest conditions, Hypercard and TIES, failed to show a significant effect for speed (t = 1.31, d.f. = 6, p > 0.2).

Accuracy

The term 'accuracy' in this instance refers to how many items a subject answered correctly. This was assessed by the experimenters, who awarded one point for an unambiguously correct answer, a half-point for a partly correct answer and no points for a wrong answer or abandoned question. The accuracy scores per subject and mean accuracy scores per condition are shown in Table 3.

Table 3: Accuracy scores.

Subject   TIES   Paper   HyperCard   Word Processor
1          6.0    11.0        8.5        11.5
2          9.5    12.0        7.5        12.0
3          8.0    10.5       10.0        10.5
4          6.0    11.0        7.5         9.0
Mean      7.38   11.12       8.38       10.75
SD        1.70    0.63       1.18        1.32

As can be seen from these data, subjects performed better in both linear-format conditions than in the hypertext conditions. A one-way ANOVA revealed a significant effect for format (F[3,12] = 8.24, p < 0.005) and a series of post-hoc tests revealed significant differences between Paper and TIES (t = 4.13, d.f. = 6, p < 0.01), Word Processor and TIES (t = 3.13, d.f. = 6, p < 0.05) and between Paper and Hypercard (t = 4.11, d.f. = 6, p < 0.01). Even using a more rigorous basis for rejection than the 5 per cent level, i.e., the 10/k(k−1) level, where k is the number of groups, suggested by Ferguson (1959), which results in a critical rejection level of p < 0.0083 in this instance, the Paper/TIES and Paper/Hypercard differences are still significant.

The number of questions abandoned by subjects was also identified. Although there was no significant difference between conditions (F[3,12] = 1.85, NS), subjects using the linear formats abandoned fewer questions than those using the hypertext formats (total abandon rates: Paper = 1; Word Processor = 2; Hypercard = 4 and TIES = 9).

Navigation

Time spent viewing the Contents/Index (where applicable) was determined for each subject and converted to a percentage of total time.
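The one-way ANOVA on accuracy can be reproduced from Table 3; a sketch using scipy, with the four groups transcribed from the table:

```python
from scipy.stats import f_oneway

ties      = [6.0, 9.5, 8.0, 6.0]
paper     = [11.0, 12.0, 10.5, 11.0]
hypercard = [8.5, 7.5, 10.0, 7.5]
word_proc = [11.5, 12.0, 10.5, 9.0]

F, p = f_oneway(ties, paper, hypercard, word_proc)
print(F, p)   # F[3,12] = 8.24, p < 0.005, matching the reported effect of format
```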
These scores are presented in Table 4.

Table 4: Time spent viewing Contents/Index as a percentage of total time.

Subject   TIES    Paper   HyperCard   Word Processor
1         53.28    2.72       47.16        6.34
2         25.36    1.49       19.10       13.93
3         49.50   10.24       17.50       12.87
4         30.84    5.36       23.40        7.54
Mean      39.74    4.95       26.79       10.17
SD        13.72    3.88       13.81        3.79

This table demonstrates a very large difference between both hypertext formats and the linear formats. A one-way ANOVA revealed a significant effect for condition (F[3,12] = 9.95, p < 0.005). Once more, applying the more conservative critical rejection level, post-hoc tests revealed significant differences between Paper and TIES (t = 4.90, d.f. = 6, p < 0.003), between Word Processor and TIES (t = 4.16, d.f. = 6, p < 0.006) and between Hypercard and Paper (t = 3.06, d.f. = 6, p < 0.03). Thus, interacting with a hypertext document may necessitate heavy usage of indices in order to navigate effectively through the information space.

Summary

In general, subjects performed better with the linear format texts than with the hypertexts. The linear formats led to significantly more accurate performance and to significantly less time spent in the index and contents. Not surprisingly, estimating document size seems easier with linear formats.

Discussion

While there was no significant effect for the estimation of document size data, a number of observations can be made. The accurate estimates in the Paper and Word Processor conditions may well have resulted from the fact that the Contents pages indicated the total number of pages and that the page number was displayed for each page, so that browsing through the document would have given repeated cues to the document size. Finally, in the Paper condition the subjects would have received tactile feedback as they manipulated the document.

Subjects in the hypertext conditions had none of this information to help them form an impression of the document's size. While many of the cards were discrete (i.e., there were few continuation cards) and were individually listed in the indices, this information did not prevent some subjects from making large over-estimates. A poor estimate of document size could lead to incorrect assumptions concerning the range of coverage and level of a document and the adoption of an inappropriate reading strategy. Future studies might usefully explore the relationship between manipulation strategy and subjective impression of size for larger documents.

A number of factors are likely to have influenced the subjects' task completion times, and as a result the lack of an overall speed effect is to be expected. These factors include variation in the subjects' familiarity with the topic area (wines); variation in the subjects' reading speeds; the presence or absence of a string search facility in the electronic versions; variation in the subjects' familiarity with the different software packages; and their determination to continue searching until an answer is found. However, there does not appear to be a speed/accuracy trade-off.

The strong effect found for the navigation measure appears to be consistent with the significant difference in accuracy scores between the four conditions. Subjects in the hypertext conditions spent considerably more time viewing the index and contents lists than did subjects in the linear conditions but were less successful in finding the correct answers to the questions.
The hypertext systems elicited a style of text manipulation which consisted of selecting likely topics from the index, jumping to them and then returning to the index if the answer was not found. Relatively little time was spent browsing between linked nodes in the database. This is a surprising finding, since hypertext systems in general are assumed to be designed with this style of manipulation in mind. It may be argued that a comprehension or summarisation task would have resulted in this style of manipulation but, in contrast to the above, subjects in the linear conditions tended to refer once to the Contents/Index and then scan the text for the answer rather than make repeated use of the Contents or Index.

Further evidence of the superiority of scanning the text in the linear conditions, as opposed to frequent recourse to the Contents/Index in the hypertext conditions, is suggested by considering the relationship between use of the string search facilities and the number of questions abandoned before an answer was found. Three of the questions were designed so that a string search strategy would be optimal, and two of the electronic text conditions (one linear, one hypertext) supported string searching. The lack of a string search facility in the TIES condition was associated with a very high proportion of abandoned questions (58%), whilst these three questions were answered with 100% accuracy by subjects in the paper condition.

In the other two conditions in which string searching was supported, it was used to different degrees of effectiveness. In the Hypercard condition the subjects employed string searching with 92% of the relevant questions, and this resulted in 66% of the questions being answered correctly (and 17% being abandoned). In the Word Processor condition string searching was employed on 50% of the relevant questions and 92% were answered correctly (0% abandoned). Thus, although string searching was available to the subjects in the Word Processor condition, it was used with less frequency than in the Hypercard condition. However, the subjects in the Word Processor condition answered substantially more of the questions correctly, presumably using strategies based on visual scanning.

Conclusion

Although some caution should be exercised in interpreting the results of this study, it is clear that for some texts and some tasks hypertext is not the universal panacea which some have claimed it to be. Furthermore, the various implementations of hypertext will support performance to a greater or lesser extent in different tasks. Future work should attempt to establish clearly the situations in which each general style of hypertext confers a positive advantage so that the potential of the medium can be realised.

Acknowledgement

This work was funded by the British Library Research and Development Department and was carried out as part of Project Quartet.

References

Dillon A. and McKnight C. (1989) Towards the classification of text types: a repertory grid approach. International Journal of Man-Machine Studies, in press.

Ferguson G.A. (1959) Statistical Analysis in Psychology and Education. McGraw-Hill, New York.

Gordon S., Gustavel J., Moore J. and Hankey J. (1988) The effects of hypertext on reader knowledge representation. Proceedings of the Human Factors Society - 32nd Annual Meeting.

Kreitzberg C. and Shneiderman B. (1988) Restructuring knowledge for an electronic encyclopedia.
Proceedings of the International Ergonomics Association's 10th Congress.

Marchionini G. and Shneiderman B. (1988) Finding facts vs. browsing knowledge in hypertext systems. Computer, January, 70-80.

McKnight C., Dillon A. and Richardson J. (1990) From Here to Hypertext. Cambridge University Press, Cambridge, in press.

Weldon L.J., Mills C.B., Koved L. and Shneiderman B. (1985) The structure of information in online and paper technical manuals. Proceedings of the Human Factors Society - 29th Annual Meeting.
A Comparison of String Distance Metrics for Name-Matching Tasks

William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg
Center for Automated Learning and Discovery / Center for Computer and Communications Security / Department of Statistics
Carnegie Mellon University, Pittsburgh PA 15213
wcohen@ pradeepr@ fienberg@

Abstract

Using an open-source, Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.

Introduction

The task of matching entity names has been explored by a number of communities, including statistics, databases, and artificial intelligence. Each community has formulated the problem differently, and different techniques have been proposed.

In statistics, a long line of research has been conducted in probabilistic record linkage, largely based on the seminal paper by Fellegi and Sunter (1969). This paper formulates entity matching as a classification problem, where the basic goal is to classify entity pairs as matching or non-matching. Fellegi and Sunter propose using largely unsupervised methods for this task, based on a feature-based representation of pairs which is manually designed and to some extent problem-specific. These proposals have been, by and large, adopted by subsequent researchers, although often with elaborations of the underlying statistical model (Jaro 1989; 1995; Winkler 1999; Larsen 1999; Belin & Rubin 1997). These methods have been used to match individuals and/or families between samples and censuses, e.g., in evaluation of the coverage of the U.S. decennial census; or between administrative records and survey data bases, e.g., in the creation of an anonymized research data base combining tax information from the Internal Revenue Service and data from the Current Population Survey.

In the database community, some work on record matching has been based on knowledge-intensive approaches (Hernandez & Stolfo 1995; Galhardas et al. 2000; Raman & Hellerstein 2001). However, the use of string-edit distances as a general-purpose record matching scheme was proposed by Monge and Elkan (Monge & Elkan 1997; 1996), and in previous work, we proposed use of the TFIDF distance metric for the same purpose (Cohen 2000).

In the AI community, supervised learning has been used for learning the parameters of string-edit distance metrics (Ristad & Yianilos 1998; Bilenko & Mooney 2002) and combining the results of different distance functions (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002). More recently, probabilistic object identification methods have been adapted to matching tasks (Pasula et al. 2002). In these communities there has been more emphasis on developing "autonomous" matching techniques which require little or no configuration for a new task, and less emphasis on developing "toolkits" of methods that can be applied to new tasks by experts.
Recently, we have begun implementing an open-source, Java toolkit of name-matching methods which includes a variety of different techniques. In this paper we use this toolkit to conduct a comparison of several string distances on the tasks of matching and clustering lists of entity names.

This experimental use of string distance metrics, while similar to previous experiments in the database and AI communities, is a substantial departure from their usual use in statistics. In statistics, databases tend to have more structure and specification, by design. Thus the statistical literature on probabilistic record linkage represents pairs of entities not by pairs of strings, but by vectors of "match features" such as names and categories for variables in survey databases. By developing appropriate match features, and appropriate statistical models of matching and non-matching pairs, this approach can achieve better matching performance (at least potentially).

The use of string distances considered here is most useful for matching problems with little prior knowledge, or ill-structured data. Better string distance metrics might also be useful in the generation of "match features" in more structured database situations.

Methods

Edit-distance like functions
Distance functions map a pair of strings s and t to a real number r, where a smaller value of r indicates greater similarity between s and t. Similarity functions are analogous, except that larger values indicate greater similarity; at some risk of confusion to the reader, we will use these terms interchangeably, depending on which interpretation is most natural. One important class of distance functions are edit distances, in which distance is the cost of the best sequence of edit operations that convert s to t. Typical edit operations are character insertion, deletion, and substitution, and each operation must be assigned a cost.

We will consider two edit-distance functions. The simple Levenshtein distance assigns a unit cost to all edit operations. As an example of a more complex, well-tuned distance function, we also consider the Monge-Elkan distance function (Monge & Elkan 1996), which is an affine variant of the Smith-Waterman distance function (Durban et al. 1998) with particular cost parameters, and scaled to the interval [0,1]. (Affine edit-distance functions assign a relatively lower cost to a sequence of insertions or deletions.)

A broadly similar metric, which is not based on an edit-distance model, is the Jaro metric (Jaro 1995; 1989; Winkler 1999). In the record-linkage literature, good results have been obtained using variants of this method, which is based on the number and order of the common characters between two strings. Given strings s = a_1...a_K and t = b_1...b_L, define a character a_i in s to be common with t if there is a b_j = a_i in t such that i - H <= j <= i + H, where H = min(|s|,|t|)/2. Let s' = a'_1...a'_{K'} be the characters in s which are common with t (in the same order they appear in s) and let t' = b'_1...b'_{L'} be analogous; now define a transposition for s', t' to be a position i such that a'_i differs from b'_i. Let T_{s',t'} be half the number of transpositions for s' and t'. The Jaro similarity metric for s and t is

Jaro(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{|s'|}\right)

A variant of this due to Winkler (1999) also uses the length P of the longest common prefix of s and t. Letting P' = min(P, 4), we define

JaroWinkler(s,t) = Jaro(s,t) + \frac{P'}{10}\cdot(1 - Jaro(s,t))

The Jaro and Jaro-Winkler metrics seem to be intended primarily for short strings (e.g., personal first or last names).
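As a concrete illustration of the definitions above, here is a minimal Python sketch of the Jaro and Jaro-Winkler computations. It is illustrative only, not the toolkit's Java implementation, and follows the matching-window definition given in the text.

```python
def common_chars(s, t, H):
    """Characters of s matching an unused character of t within window H."""
    used = [False] * len(t)
    out = []
    for i, c in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not used[j] and t[j] == c:
                used[j] = True
                out.append(c)
                break
    return out

def jaro(s, t):
    H = min(len(s), len(t)) // 2          # matching window from the definition
    s1, t1 = common_chars(s, t, H), common_chars(t, s, H)
    if not s1 or not t1:
        return 0.0
    T = sum(a != b for a, b in zip(s1, t1)) / 2.0   # half the transpositions
    return (len(s1) / len(s) + len(t1) / len(t) + (len(s1) - T) / len(s1)) / 3.0

def jaro_winkler(s, t):
    P = 0
    while P < min(len(s), len(t), 4) and s[P] == t[P]:
        P += 1                             # common prefix length, capped at 4
    j = jaro(s, t)
    return j + (P / 10.0) * (1.0 - j)
```

For example, jaro_winkler("MARTHA", "MARHTA") yields a similarity close to 1, reflecting the transposed but otherwise matching characters.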
Token-based distance functions
Two strings s and t can also be considered as multisets (or bags) of words (or tokens). We also considered several token-based distance metrics. The Jaccard similarity between the word sets S and T is simply |S ∩ T| / |S ∪ T|. TFIDF or cosine similarity, which is widely used in the information retrieval community, can be defined as

TFIDF(S,T) = \sum_{w \in S \cap T} V(w,S) \cdot V(w,T)

where TF_{w,S} is the frequency of word w in S, N is the size of the "corpus", IDF_w is the inverse of the fraction of names in the corpus that contain w,

V'(w,S) = \log(TF_{w,S} + 1) \cdot \log(IDF_w)

and V(w,S) = V'(w,S) / \sqrt{\sum_{w'} V'(w',S)^2}. Our implementation measures all document frequency statistics from the complete corpus of names to be matched.

Following Dagan et al. (1999), a token set S can be viewed as samples from an unknown distribution P_S of tokens, and a distance between S and T can be computed based on these distributions. Inspired by Dagan et al. we considered the Jensen-Shannon distance between P_S and P_T. Letting KL(P||Q) be the Kullback-Leibler divergence and letting Q(w) = (P_S(w) + P_T(w))/2, this is simply

JensenShannon(S,T) = \frac{1}{2}\left(KL(P_S||Q) + KL(P_T||Q)\right)

Distributions P_S were estimated using maximum likelihood, a Dirichlet prior, and a Jelinek-Mercer mixture model (Lafferty & Zhai 2001). The latter two cases require parameters; for Dirichlet, we used α = 1, and for Jelinek-Mercer, we used the mixture coefficient λ = 0.5.

From the record-linkage literature, a method proposed by Fellegi and Sunter (1969) can be easily extended to a token distance. As notation, let A and B be two sets of records to match, let C = A ∩ B, let D = A ∪ B, and for X = A, B, C, D let P_X(w) be the empirical probability of a name containing the word w in the set X. Also let e_X be the empirical probability of an error in a name in set X; e_{X,0} be the probability of a missing name in set X; e_T be the probability of two correct but differing names in A and B; and let e = e_A + e_B + e_T + e_{A,0} + e_{B,0}.

Fellegi and Sunter propose ranking pairs s, t by the odds ratio log(Pr(M|s,t) / Pr(U|s,t)), where M is the class of matched pairs and U is the class of unmatched pairs. Letting AGREE(s,t,w) denote the event "s and t both agree in containing word w", Fellegi and Sunter note that under certain assumptions

Pr(M | AGREE(s,t,w)) ≈ P_C(w)(1 - e)
Pr(U | AGREE(s,t,w)) ≈ P_A(w) · P_B(w)(1 - e)

If we make two additional assumptions suggested by Fellegi and Sunter, and assume that (a) matches on a word w are independent, and (b) P_A(w) ≈ P_B(w) ≈ P_C(w) ≈ P_D(w), we find that the incremental score for the odds ratio associated with AGREE(s,t,w) is simply -log P_D(w). In information retrieval terms, this is a simple unnormalized IDF weight.

Unfortunately, updating the log-odds score of a pair on discovering a disagreeing token w is not as simple. Estimates are provided by Fellegi and Sunter for Pr(M | ¬AGREE(s,t,w)) and Pr(U | ¬AGREE(s,t,w)), but the error parameters e_A, e_B, ... do not cancel out; instead one is left with a constant penalty term, independent of w. Departing slightly from this (and following the intuition that disagreement on frequent terms is less important) we introduce a variable penalty term of k log P_D(w), where k is a parameter to be set by the user. In the experiments we used k = 0.5, and call this method SFS distance (for Simplified Fellegi-Sunter).
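To make the TFIDF scheme concrete, the following is a minimal Python sketch of cosine similarity over token bags, with the IDF statistics measured from the complete corpus of names, as described above. The corpus and token values are hypothetical, and this is an illustration rather than the toolkit's implementation.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Compute the normalized V(w, S) weights for one bag of tokens."""
    tf = Counter(tokens)
    v = {w: math.log(tf[w] + 1) * math.log(idf.get(w, 1.0)) for w in tf}
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {w: x / norm for w, x in v.items()}

def tfidf_sim(s_tokens, t_tokens, idf):
    """TFIDF (cosine) similarity: sum of V(w,S)*V(w,T) over shared tokens."""
    vs, vt = tfidf_vector(s_tokens, idf), tfidf_vector(t_tokens, idf)
    return sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())

# IDF statistics come from the complete corpus of names to be matched:
corpus = [["cohen", "william"], ["cohen", "w"], ["fienberg", "stephen"]]
n = len(corpus)
df = Counter(w for name in corpus for w in set(name))
idf = {w: n / df[w] for w in df}   # inverse fraction of names containing w
print(tfidf_sim(["cohen", "william"], ["cohen", "w"], idf))
```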
Hybrid distance functions
Monge and Elkan propose the following recursive matching scheme for comparing two long strings s and t. First, s and t are broken into substrings s = a_1...a_K and t = b_1...b_L. Then, similarity is defined as

sim(s,t) = \frac{1}{K}\sum_{i=1}^{K}\max_{j=1}^{L} sim'(a_i, b_j)

where sim' is some secondary distance function. We considered an implementation of this scheme in which the substrings are tokens; following Monge and Elkan, we call this a level two distance function. We experimented with level two similarity functions which used Monge-Elkan, Jaro, and Jaro-Winkler as their base function sim'.

We also considered a "soft" version of TFIDF, in which similar tokens are considered as well as tokens in S ∩ T. Again let sim' be a secondary similarity function. Let CLOSE(θ, S, T) be the set of words w ∈ S such that there is some v ∈ T such that sim'(w,v) > θ, and for w ∈ CLOSE(θ, S, T), let D(w,T) = max_{v ∈ T} sim'(w,v). We define

SoftTFIDF(S,T) = \sum_{w \in CLOSE(\theta,S,T)} V(w,S) \cdot V(w,T) \cdot D(w,T)

In the experiments, we used Jaro-Winkler as the secondary distance and θ = 0.9.

"Blocking" or pruning methods
In matching or clustering large lists of names, it is not computationally practical to measure distances between all pairs of strings. In statistical record linkage, it is common to group records by some variable which is known a priori to be usually the same for matching pairs. In census work, this grouping variable often names a small geographic region, and perhaps for this reason the technique is usually called "blocking".

Since this paper focuses on the behavior of string matching tools when little prior knowledge is available, we will use here knowledge-free approaches to reducing the set of string pairs to compare. In a matching task, there are two sets A and B. We consider as candidates all pairs of strings (s,t) ∈ A × B such that s and t share some substring v which appears in at most a fraction f of all names. In a clustering task, there is one set C, and we consider all candidates (s,t) ∈ C × C such that s ≠ t, and again s and t share some not-too-frequent substring v. For purely token-based methods, the substring v must be a token, and otherwise, it must be a 4-gram. Using inverted indices this set of pairs can be enumerated quickly. For the moderate-size test sets considered here, we used f = 1.

On the matching datasets (see Table 1, below), the token blocker finds between 93.3% and 100.0% of the correct pairs, with an average of 98.9%. The 4-gram blocker also finds between 93.3% and 100.0% of the correct pairs, with an average of 99.0%.
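A minimal Python sketch of the SoftTFIDF combination defined above follows, reusing the illustrative tfidf_vector and jaro_winkler helpers sketched earlier. Since the paper's formula does not spell out how V(w,T) is evaluated when w itself does not occur in T, the sketch follows the common reading that the weight of the closest token v in T is used; this interpretation, like the code itself, is an assumption.

```python
def soft_tfidf(s_tokens, t_tokens, idf, sim=jaro_winkler, theta=0.9):
    """SoftTFIDF: TFIDF weighting over approximately matching token pairs."""
    vs, vt = tfidf_vector(s_tokens, idf), tfidf_vector(t_tokens, idf)
    if not vt:
        return 0.0
    total = 0.0
    for w in vs:
        # D(w, T): best secondary similarity of w against any token of T.
        best_v, best_sim = max(((v, sim(w, v)) for v in vt), key=lambda p: p[1])
        if best_sim > theta:           # then w is in CLOSE(theta, S, T)
            total += vs[w] * vt[best_v] * best_sim
    return total
```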
Experiments

Data and methodology
The data used to evaluate these methods is shown in Table 1. Most of the datasets have been described elsewhere in the literature. The "coraATDV" dataset includes the fields author, title, date, and venue in a single string. The "census" dataset is a synthetic, census-like dataset, from which only textual fields were used (last name, first name, middle initial, house number, and street).

Table 1: Datasets used in experiments. Column 2 indicates the source of the data, and column 3 indicates if it is a matching (M) or clustering (C) problem. Original sources are 1. (Cohen 2000) 2. (Tejada, Knoblock, & Minton 2001) 3. (Monge & Elkan 1996) 4. William Winkler (personal communication) 5. (McCallum, Nigam, & Ungar 2000).

Name         Src  M/C  #Strings  #Tokens
animal       1    M    5709      30,006
bird1        1    M    377       1,977
bird2        1    M    982       4,905
bird3        1    M    38        188
bird4        1    M    719       4,618
business     1    M    2139      10,526
game         1    M    911       5,060
park         1    M    654       3,425
fodorZagrat  2    M    863       10,846
ucdFolks     3    M    90        454
census       4    M    841       5,765
UVA          3    C    116       6,845
coraATDV     5    C    1916      47,512

To evaluate a method on a dataset, we ranked by distance all candidate pairs from the appropriate grouping algorithm. We computed the non-interpolated average precision of this ranking, the maximum F1 score of the ranking, and also interpolated precision at the eleven recall levels 0.0, 0.1, ..., 0.9, 1.0. The non-interpolated average precision of a ranking containing N pairs for a task with m correct matches is

\frac{1}{m}\sum_{i=1}^{N}\frac{c(i)\,\delta(i)}{i}

where c(i) is the number of correct pairs ranked before position i, and δ(i) = 1 if the pair at rank i is correct and 0 otherwise. Interpolated precision at recall r is max_i c(i)/i, where the max is taken over all ranks i such that c(i)/m ≥ r. Maximum F1 is max_{i>0} F1(i), where F1(i) is the harmonic mean of recall at rank i (i.e., c(i)/m) and precision at rank i (i.e., c(i)/i).

[Figure 1: Relative performance of token-based measures. Left, max F1 of methods on matching problems, with points above the line y = x indicating better performance of TFIDF. Middle, same for non-interpolated average precision. Right, precision-recall curves averaged over all matching problems. Smoothed versions of Jensen-Shannon (not shown) are comparable in performance to the unsmoothed version.]

[Figure 2: Relative performance of edit-distance measures. Left and middle, points above (below) the line y = x indicate better (worse) performance for Monge-Elkan, the system performing best on average.]

Results for matching
We will first consider the matching datasets. As shown in Figure 1, TFIDF seems generally the best among the token-based distance metrics. It does slightly better on average than the others, and is seldom much worse than any other method with respect to interpolated average precision or maximum F1.

As shown in Figure 2, the situation is less clear for the edit-distance based methods. The Monge-Elkan method does best on average, but the Jaro and Jaro-Winkler methods are close in average performance, and do noticeably better than Monge-Elkan on several problems. The Jaro variants are also substantially more efficient (at least in our implementation), taking about 1/10 the time of the Monge-Elkan method. (The token-based methods are faster still, averaging less than 1/10 the time of the Jaro variants.)

As shown in Figure 3, SoftTFIDF is generally the best among the hybrid methods we considered. In general, the time for the hybrid methods is comparable to using the underlying string edit distance. (For instance, the average matching time for SoftTFIDF is about 13 seconds on these problems, versus about 20 seconds for the Jaro-Winkler method, and 1.2 seconds for pure token-based TFIDF.)

Finally, Figure 4 compares the three best performing edit-distance like methods, the two best token-based methods, and the two best hybrid methods, using a similar methodology. Generally speaking, SoftTFIDF is the best overall distance measure for these datasets.
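The ranking measures used throughout these comparisons (defined under "Data and methodology" above) are straightforward to compute from a ranked list of candidate pairs. The following Python sketch is illustrative; labels is a hypothetical 0/1 list of pair correctness in rank order.

```python
def ranking_scores(labels, m):
    """Non-interpolated average precision and maximum F1 for a ranking.

    labels: 0/1 correctness of each ranked pair, best-ranked first.
    m: total number of correct matches in the task.
    """
    avg_prec, best_f1, c = 0.0, 0.0, 0
    for i, correct in enumerate(labels, start=1):
        c += correct                    # c(i): correct pairs up to rank i
        avg_prec += correct * c / i     # delta(i) * c(i) / i
        recall, precision = c / m, c / i
        if recall + precision > 0:
            best_f1 = max(best_f1, 2 * recall * precision / (recall + precision))
    return avg_prec / m, best_f1
```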
Results for clustering
It should be noted that the test suite of matching problems is dominated by problems from one source: eight of the eleven test cases are associated with the WHIRL project, and a different distribution of problems might well lead to quite different results. For instance, while the token-based methods perform well on average, they perform poorly on the census-like dataset, which contains many misspellings.

As an additional experiment, we evaluated the four best-performing distance metrics (SoftTFIDF, TFIDF, SFS, and Level 2 Jaro-Winkler) on the two clustering problems, which are taken from sources other than the WHIRL project. Table 2 shows MaxF1 and non-interpolated average precision for each method on each problem. SoftTFIDF again slightly outperforms the other methods on both of these tasks.

Table 2: Results for selected distance metrics on two clustering problems.

                    UVA                CoraATDV
Method          MaxF1   AvgPrec    MaxF1   AvgPrec
SoftTFIDF       0.89    0.91       0.85    0.914
TFIDF           0.79    0.84       0.84    0.907
SFS             0.71    0.75       0.82    0.864
Level2 J-W      0.73    0.69       0.76    0.804

[Figure 3: Relative performance of hybrid distance measures on matching problems, relative to the SoftTFIDF metric, which performs best on average.]

[Figure 4: Relative performance of best distance measures of each type on matching problems, relative to the SoftTFIDF metric.]

Learning to combine distance metrics
Another type of hybrid distance function can be obtained by combining other distance metrics. Following previous researchers (Tejada, Knoblock, & Minton 2001; Cohen & Richman 2002; Bilenko & Mooney 2002) we used a learning scheme to combine several of the distance functions above. Specifically, we represented pairs as feature vectors, using as features the numeric scores of Monge-Elkan, Jaro-Winkler, TFIDF, SFS, and SoftTFIDF. We then trained a binary SVM classifier (using SVMlight (Joachims 2002)) using these features, and used its confidence in the "match" class as a score.

Figure 5 shows the results of a three-fold cross-validation on nine of the matching problems. The learned combination generally slightly outperforms the individual metrics, including SoftTFIDF, particularly at extreme recall levels. Note, however, that the learned metric uses much more information: in particular, in most cases it has been trained on several thousand labeled candidate pairs, while the other metrics considered here require no training data.
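A minimal sketch of this learned combination follows, using scikit-learn's SVC in place of the SVMlight classifier the paper used; the feature values and training pairs below are hypothetical, and the decision-function value stands in for the classifier's confidence in the match class.

```python
import numpy as np
from sklearn.svm import SVC

# One row per candidate pair: scores of Monge-Elkan, Jaro-Winkler,
# TFIDF, SFS, and SoftTFIDF (hypothetical values).
X_train = np.array([[0.91, 0.95, 0.72, 0.80, 0.88],
                    [0.35, 0.48, 0.05, 0.21, 0.12],
                    [0.88, 0.90, 0.65, 0.74, 0.81],
                    [0.40, 0.52, 0.10, 0.18, 0.20]])
y_train = np.array([1, 0, 1, 0])   # 1 = match, 0 = non-match

clf = SVC(kernel="linear").fit(X_train, y_train)
# Rank new candidate pairs by the classifier's confidence in the match class.
score = clf.decision_function(np.array([[0.86, 0.92, 0.70, 0.77, 0.85]]))[0]
print(score)
```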
Concluding remarks
Recently, we have begun implementing an open-source, Java toolkit of name-matching methods. This toolkit includes a variety of different techniques, as well as the infrastructure to combine techniques readily, and evaluate them systematically on test data. Using this toolkit, we conducted a comparison of several string distances on the tasks of matching and clustering lists of entity names. Many of these were techniques previously proposed in the literature, and some are novel hybrids of previous methods.

We compared the accuracy of these methods for use in an automatic matching scheme, in which pairs of names are proposed by a simple grouping method, and then ranked according to distance. Used in this way, we saw that the TFIDF ranking performed best among several token-based distance metrics, and that a tuned affine-gap edit-distance metric proposed by Monge and Elkan performed best among several string edit-distance metrics. A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999). This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster.

One simple way of combining the TFIDF method and the Jaro-Winkler method is to replace the exact token matches used in TFIDF with approximate token matches based on the Jaro-Winkler scheme. This combination performs slightly better than either Jaro-Winkler or TFIDF on average, and occasionally performs much better. It is also close in performance to a learned combination of several of the best metrics considered in this paper.

[Figure 5: Relative performance of best distance measures of each type on matching problems, relative to a learned combination of the same measures.]

Acknowledgements
The preparation of this paper was supported in part by National Science Foundation Grant No. EIA-0131884 to the National Institute of Statistical Sciences and by a contract from the Army Research Office to the Center for Computer and Communications Security with Carnegie Mellon University.

References
Belin, T. R., and Rubin, D. B. 1997. A method for calibrating false-match rates in record linkage. In Record Linkage – 1997: Proceedings of an International Workshop and Exposition, 81–94. U.S. Office of Management and Budget (Washington).
Bilenko, M., and Mooney, R. 2002. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin. Available from /users/ml/papers/marlin-tr-02.pdf.
Cohen, W. W., and Richman, J. 2002. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002).
Cohen, W. W. 2000. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems 18(3):288–321.
Dagan, I.; Lee, L.; and Pereira, F. 1999. Similarity-based models of word cooccurrence probabilities. Machine Learning 34(1-3).
Durban, R.; Eddy, S. R.; Krogh, A.; and Mitchison, G. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press.
Fellegi, I. P., and Sunter, A. B. 1969. A theory for record linkage. Journal of the American Statistical Association 64:1183–1210.
Galhardas, H.; Florescu, D.; Shasha, D.; and Simon, E. 2000. An extensible framework for data cleaning. In ICDE, 312.
Hernandez, M., and Stolfo, S. 1995. The merge/purge problem for large databases. In Proceedings of the 1995 ACM SIGMOD.
Jaro, M. A. 1989. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association 84:414–420.
Jaro, M. A. 1995. Probabilistic linkage of large public health data files. Statistics in Medicine 14:491–498.
Joachims, T. 2002. Learning to Classify Text Using Support Vector Machines. Kluwer.
Lafferty, J., and Zhai, C. 2001. A study of smoothing methods for language models applied to ad hoc information retrieval. In 2001 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
Larsen, M. 1999. Multiple imputation analysis of records linked using mixture models. In Statistical Society of Canada Proceedings of the Survey Methods Section, 65–71. Statistical Society of Canada (McGill University, Montreal).
McCallum, A.; Nigam, K.; and Ungar, L. H. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, 169–178.
Monge, A., and Elkan, C. 1996. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
Monge, A., and Elkan, C. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The Proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery.
Pasula, H.; Marthi, B.; Milch, B.; Russell, S.; and Shpitser, I. 2002. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. Vancouver, British Columbia: MIT Press.
Raman, V., and Hellerstein, J. 2001. Potter's wheel: An interactive data cleaning system. In The VLDB Journal, 381–390.
Ristad, E. S., and Yianilos, P. N. 1998. Learning string edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522–532.
Tejada, S.; Knoblock, C. A.; and Minton, S. 2001. Learning object identification rules for information integration. Information Systems 26(8):607–633.
Winkler, W. E. 1999. The state of record linkage and current research problems. Statistics of Income Division, Internal Revenue Service Publication R99/04. Available from /srd/www/byname.html.
Package 'mixcat'  (October 13, 2022)

Title: Mixed Effects Cumulative Link and Logistic Regression Models
Version: 1.0-4
Date: 2019-12-20
Author: Georgios Papageorgiou [aut, cre], John Hinde [aut]
Maintainer: Georgios Papageorgiou <******************>
Depends: R (>= 2.8.1), statmod
LazyLoad: yes
Description: Mixed effects cumulative and baseline logit link models for the analysis of ordinal or nominal responses, with non-parametric distribution for the random effects.
License: GPL (>= 2)
NeedsCompilation: yes
Repository: CRAN
Date/Publication: 2019-12-20 12:20:02 UTC

R topics documented: mixcat-package, npmlt, schizo, summary.npmreg, Index

mixcat-package    Mixed effects cumulative link and logistic regression models

Description
Mixed effects models for the analysis of binary or multinomial (ordinal or nominal) data with non-parametric distribution for the random effects. The main function is npmlt and it fits cumulative and baseline logit models.

Details
Package: mixcat
Type: Package
Version: 1.0-4
Date: 2019-12-20
License: GPL (>= 2)
LazyLoad: no

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. For details on the GNU General Public License see /copyleft/gpl.html or write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

Acknowledgments
Papageorgiou's work was supported by the Science Foundation Ireland Research Frontiers grant 07/RFP/MATF448.

Author(s)
Georgios Papageorgiou and John Hinde (2011)
Maintainer: Georgios Papageorgiou <******************>

References
Papageorgiou, G. and Hinde, J. (2012). Multivariate generalized linear mixed models with semi-nonparametric and smooth nonparametric random effects densities. Statistics and Computing 22, 79-92.

npmlt    Mixed effects cumulative link and logistic regression models

Description
Fits cumulative logit and baseline logit link mixed effects regression models with non-parametric distribution for the random effects.

Usage
npmlt(formula, formula.npo=~1, random=~1, id, k=1, eps=0.0001,
      start.int=NULL, start.reg=NULL, start.mp=NULL, start.m=NULL,
      link="clogit", EB=FALSE, maxit=500, na.rm=TRUE, tol=0.0001)

Arguments
formula: a formula defining the response and the fixed, proportional odds, effects part of the model, e.g. y ~ x.
formula.npo: a formula defining non proportional odds variables of the model. A response is not needed as it has been provided in formula. Intercepts need not be provided as they are always non proportional. Variables in formula.npo must be a subset of the variables that appear in the right hand side of formula, e.g. ~ x.
random: a formula defining the random part of the model. For instance, random = ~1 defines a random intercept model, while random = ~1+x defines a model with random intercept and random slope for the variable x. If argument k=1, the resulting model is a fixed effects model (see below). Variables in random must be a subset of the variables that appear in the right hand side of formula.
id: a factor that defines the primary sampling units, e.g. groups, clusters, classes, or individuals in longitudinal studies. These sampling units have their own random coefficient, as defined in random.
If argument id is missing it is taken to be id = seq(N), where N is the total number of observations, suitable for overdispersed independent multinomial data.
k: the number of mass points and masses for the non-parametric (discrete) random effects distribution. If k=1 the function fits a fixed effects model, regardless of the random specification, as with k=1 the random effects distribution is degenerate at zero.
eps: positive convergence tolerance epsilon. Convergence is declared when the maximum of the absolute value of the score vector is less than epsilon.
start.int: a vector of length (number of categories minus one) with the starting values for the fixed intercept(s).
start.reg: a vector with the starting values for the regression coefficients. One starting value for the proportional odds effects and (number of categories minus one) starting values for the non proportional effects, in the same order as they appear in formula.
start.mp: starting values for the mass points of the random effects distribution in the form: (k starting values for the intercepts, k starting values for the first random slope, ...).
start.m: starting values for the masses of the random effects distribution: a vector of length k with non-negative elements that sum to 1.
link: for a cumulative logit model set link="clogit" (default). For a baseline logit model, set link="blogit". Baseline category is the last category.
EB: if EB=TRUE the empirical Bayes estimates of the random effects are calculated and stored in the component eBayes. Further, fitted values of the linear predictor (stored in the component fitted) and fitted probabilities (stored in object prob) are obtained at the empirical Bayes estimates of the random effects. Otherwise, if EB=FALSE (default), empirical Bayes estimates are not calculated, and fitted values of the linear predictors and probabilities are calculated at the zero value of the random effects.
maxit: integer giving the maximal number of iterations of the fitting algorithm until convergence. By default this number is set to 500.
na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds.
tol: positive tolerance level used for calculating generalised inverses (g-inverses). Consider matrix A = P D P^T, where D = Diag{eigen_i} is diagonal with entries the eigenvalues of A. Its g-inverse is calculated as A^- = P D^- P^T, where D^- is diagonal with entries 1/eigen_i if eigen_i > tol, and 0 otherwise.
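For illustration only (this is not part of the mixcat package), the g-inverse described for the tol argument can be sketched in a few lines of NumPy for a symmetric matrix A:

```python
import numpy as np

def g_inverse(A, tol=1e-4):
    # A = P D P^T for symmetric A; invert only eigenvalues above tol.
    eigvals, P = np.linalg.eigh(A)
    d_inv = np.zeros_like(eigvals)
    keep = eigvals > tol
    d_inv[keep] = 1.0 / eigvals[keep]
    return P @ np.diag(d_inv) @ P.T
```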
Details
Maximizing a likelihood over an unspecified random effects distribution results in a discrete mass point estimate of this distribution (Laird, 1978; Lindsay, 1983). Thus, the terms 'non-parametric' (NP) and 'discrete' random effects distribution are used here interchangeably. Function npmlt allows the user to choose the number k of mass points/masses of the discrete distribution, a choice that should be based on the log-likelihood. Note that the mean of the NP distribution is constrained to be zero and thus for k=1 the fitted model is equivalent to a fixed effects model. For k>1 and a random slope in the model, the mass points are bivariate with a component that corresponds to the intercept and another that corresponds to the slope. General treatments of non-parametric modeling can be found in Aitkin (1999) and Aitkin et al. (2009). For more details on multinomial data see Hartzel et al. (2001).

The response variable y can be binary or multinomial. A binary response should take values 1 and 2, and the function npmlt will model the probability of 1. For an ordinal response, taking values 1, ..., q, a cumulative logit model can be fit. Ignoring the random effects, such a model, with formula y ~ x, takes the form

log[P(Y <= r) / (1 - P(Y <= r))] = beta_r + gamma * x,

where beta_r, r = 1, ..., q-1, are the cut-points and gamma is the slope. Further, if argument formula.npo is specified as ~x, the model becomes

log[P(Y <= r) / (1 - P(Y <= r))] = beta_r + gamma_r * x.

Similarly, for a nominal response with q categories, a baseline logit model can be fit. The fixed effects part of the model, y ~ x, takes the form

log[P(Y = r) / P(Y = q)] = beta_r + gamma * x,

where r = 1, ..., q-1. Again, formula.npo can be specified as ~x, in which case slope gamma will be replaced by category specific slopes, gamma_r.

The user is provided with the option of specifying starting values for some or all the model parameters. This option allows for starting the algorithm at different starting points, in order to ensure that it has converged to the point of maximum likelihood. Further, if the fitting algorithm fails, the user can start by fitting a less complex model and use the estimates of this model as starting values for the more complex one.

With reference to the tol argument, the fitting algorithm calculates g-inverses of two matrices: 1. the information matrix of the model, and 2. the covariance matrix of multinomial proportions. The covariance matrix of a multinomial proportion p of length q is calculated as Diag{p*} - p* p*^T, where p* is of length q-1. A g-inverse for this matrix is needed because elements of p* can become zero or one.

Value
The function npmlt returns an object of class 'npmreg', a list containing at least the following components:
call: the matched call.
formula: the formula supplied.
formula.npo: the formula for the non proportional odds supplied.
random: the random effects formula supplied.
coefficients: a named vector of regression coefficients.
mass.points: a vector or a table that contains the mass point estimates.
masses: the masses (probabilities) corresponding to the mass points.
vcvremat: the estimated variance-covariance matrix of the random effects.
var.cor.mat: the estimated variance-covariance matrix of the random effects, with the upper triangular covariances replaced by the corresponding correlations.
m2LogL: minus twice the maximized log-likelihood of the chosen model.
SE.coefficients: a named vector of standard errors of the estimated regression coefficients.
SE.mass.points: a vector or a table that contains the standard errors of the estimated mass points.
SE.masses: the standard errors of the estimated masses.
VRESE: the standard errors of the estimates of the variances of random effects.
CVmat: the inverse of the observed information matrix of the model.
eBayes: if EB=TRUE it contains the empirical Bayes estimates of the random effects. Otherwise it contains vector(s) of zeros.
fitted: the fitted values of the linear predictors computed at the empirical Bayes estimates of the random effects, if EB=TRUE. Otherwise, if EB=FALSE (default), these fitted values are computed at the zero value of the random effects.
prob: the estimated probabilities of observing a response at one of the categories. These probabilities are computed at the empirical Bayes estimates of the random effects, if EB=TRUE. If EB=FALSE (default) these estimated probabilities are computed at the zero value of the random effects.
nrp: number of random slopes specified.
iter: the number of iterations of the fitting algorithm.
maxit: the maximal allowed number of iterations of the fitting algorithm until convergence.
flagcvm: last iteration at which eigenvalue(s) of the covariance matrix of the multinomial variable were less than the tol argument.
flaginfo: last iteration at which eigenvalue(s) of the model information matrix were less than the tol argument.

Author(s)
Georgios Papageorgiou <******************>

References
Aitkin, M. (1999). A general maximum likelihood analysis of variance components in generalized linear models. Biometrics 55, 117-128.
Aitkin, M., Francis, B., Hinde, J., and Darnell, R. (2009). Statistical Modelling in R. Oxford Statistical Science Series, Oxford, UK.
Hedeker, D. and Gibbons, R. (2006). Longitudinal Data Analysis. Wiley, Palo Alto, CA.
Hartzel, J., Agresti, A., and Caffo, B. (2001). Multinomial logit random effects models. Statistical Modelling, 1(2), 81-102.
Laird, N. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73, 805-811.
Lindsay, B. G. (1983). The geometry of mixture likelihoods, Part II: The exponential family. The Annals of Statistics, 11, 783-792.

See Also
summary.npmreg

Examples
data(schizo)
attach(schizo)
npmlt(y ~ trt*sqrt(wk), formula.npo = ~trt, random = ~1+trt, id = id, k = 2, EB = FALSE)

schizo    National Institute of Mental Health schizophrenia study

Description
Schizophrenia data from a randomized controlled trial with patients assigned to either drug or placebo group. "Severity of Illness" was measured, at weeks 0, 1, ..., 6, on a four category ordered scale: 1. normal or borderline mentally ill, 2. mildly or moderately ill, 3. markedly ill, and 4. severely or among the most extremely ill. Most of the observations were made on weeks 0, 1, 3, and 6.

Usage
data(schizo)

Format
A data frame with 1603 observations on 437 subjects. Four numerical vectors contain information on:
id: patient ID.
y: ordinal response on a 4 category scale.
trt: treatment indicator: 1 for drug, 0 for placebo.
wk: week.

Source
/~hedeker/ml.html

References
Hedeker, D. and Gibbons, R. (2006). Longitudinal Data Analysis. Wiley, Palo Alto, CA.
summary.npmreg    Summarizing mixed multinomial regression model fits

Description
summary and print methods for objects of type npmreg.

Usage
## S3 method for class 'npmreg'
summary(object, digits = max(3, getOption("digits") - 3), ...)
## S3 method for class 'npmreg'
print(x, digits = max(3, getOption("digits") - 3), ...)

Arguments
object: an object of class npmreg.
x: an object of class npmreg.
digits: the minimum number of significant digits to be printed in values.
...: further arguments, which will mostly be ignored.

Details
The function npmlt returns an object of class "npmreg". The function summary (i.e., summary.npmreg) can be used to obtain or print a summary of the results, and the function print (i.e., print.npmreg) to print the results.

Value
Summary or print.

Author(s)
Georgios Papageorgiou <******************>

See Also
npmlt

Examples
data(schizo)
attach(schizo)
fit1 <- npmlt(y ~ trt*sqrt(wk), formula.npo = ~trt, random = ~1, id = id, k = 2)
print(fit1)
summary(fit1)

Index
* datasets: schizo
* models: mixcat-package, npmlt, summary.npmreg
* regression: mixcat-package, npmlt, summary.npmreg
mixcat (mixcat-package); mixcat-package; npmlt; print.npmreg (summary.npmreg); schizo; summary.npmreg
Transportation Research Part C 10 (2002) 303-321
/locate/trc

Comparison of parametric and nonparametric models for traffic flow forecasting

Brian L. Smith a,*, Billy M. Williams b, R. Keith Oswald c
a Department of Civil Engineering, Thornton Hall, University of Virginia, Charlottesville, VA 22903-2442, USA
b School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0355, USA
c Department of Systems Engineering, Olsson Hall, University of Virginia, Charlottesville, VA 22903-2442, USA
* Corresponding author. Tel.: +1-434-243-8585; fax: +1-434-982-2951. E-mail address: briansmith@ (B.L. Smith).
Received 12 July 1999; accepted 5 July 2001

Abstract
Single point short-term traffic flow forecasting will play a key role in supporting demand forecasts needed by operational network models. Seasonal autoregressive integrated moving average (ARIMA), a classic parametric modeling approach to time series, and nonparametric regression models have been proposed as well suited for application to single point short-term traffic flow forecasting. Past research has shown seasonal ARIMA models to deliver results that are statistically superior to basic implementations of nonparametric regression. However, the advantages associated with a data-driven nonparametric forecasting approach motivate further investigation of refined nonparametric forecasting methods. Following this motivation, this research effort seeks to examine the theoretical foundation of nonparametric regression and to answer the question of whether nonparametric regression based on heuristically improved forecast generation methods approaches the single interval traffic flow prediction performance of seasonal ARIMA models. © 2002 Elsevier Science Ltd. All rights reserved.

Keywords: Traffic forecasting; Nonparametric regression; ARIMA models; Motorway flows; Short-term traffic prediction; Statistical forecasting

1. Introduction
As intelligent transportation systems (ITS) are implemented widely throughout the world, managers of transportation systems have access to large amounts of "real-time" status data. While this data is key, it has been recognized by researchers and practitioners alike that the full benefits of ITS cannot be realized without an ability to anticipate traffic conditions in the short term (less than one hour into the future). The development of traffic condition forecasting models will provide this ability, and support proactive transportation management and comprehensive traveller information services. What distinguishes traffic condition forecasting from the more traditional forecasts of transportation planning is the length of the prediction horizon. While planning models use socioeconomic data and trends to forecast over a period of years, traffic condition forecasting models predict conditions within the hour using data from roadway sensors. Without a predictive capability, ITS will provide services in a reactive manner. In other words, there will be a lag between the collection of data and the implementation of traffic control strategies. Therefore, the system will be controlled based on old information. In order to control the system in a proactive manner, ITS must have a predictive capability.
"The ability to make and continuously update predictions of traffic flows and link times for several minutes into the future using real-time data is a major requirement for providing dynamic traffic control" (Cheslow et al., 1992). Furthermore, traffic condition forecasting is important in traveller information systems. This is illustrated in the statement, "the rationale behind using predictive information (for route guidance) is that traveller's decisions are affected by future traffic conditions expected to be in effect when they reach downstream sections of the network" (Kaysi et al., 1993). In fact, current traveller information activities are being hampered by the lack of a capability to predict future traffic conditions.

1.1. Transportation condition forecasting
Predicting future traffic conditions is a general concept. In fact, there are a wide variety of needs for condition forecasts depending on particular applications. For example, forecasts of traffic flow, speed, travel time, and queue length are required for specific transportation management applications. Furthermore, it is certainly true that future traffic conditions will be dependent on transportation management strategies employed, and that traffic flow can be tracked, given a sufficient number and placement of sensors, as it progresses through a network. For these reasons, an ideal approach to traffic condition forecasting is a network-based model that simulates the system as it responds to transportation management strategies. In other words, a model that takes advantage of spatial information and traffic flow dynamics.

However, every network must have boundary points, locations that serve as entry "nodes" to the network that do not have the advantage of upstream detectors to use in predicting their state. In fact, any dynamic traffic network models, such as commonly used simulation models (for example, CORSIM), are "driven" by the traffic flow rate (usually in vehicles per hour) at these boundary points. In essence, the boundary points define the short-term demand for travel through the network. For this reason, an important subset of traffic condition forecasting is the ability to predict future traffic flow at a location (a boundary point), given only data describing past conditions at this point. The research described in this paper addresses this specific challenge of single point traffic flow prediction.

Based on this definition of the problem, one will note that single point traffic flow prediction can be described as a time series problem. Given a series of past flow readings measured in regular intervals, a model is desired to predict the flow at the next interval. Thus, we consider the demand for travel (measured by flow) at a point to be a dynamic system, which by definition consists of two parts: a state, or the essential information describing the system, and a dynamic, which is the rule that governs how the state changes over time (Mulhern and Caprara, 1994). A dynamic system evolves towards an attractor, which is the set of values to which the system eventually settles (Thearling, 1995). Dynamic systems are often complex due to chaotic attractors, which have a bounded path that is fully determined, yet never recurs (Mulhern and Caprara, 1994). The presence of chaotic attractors tends to cloud the distinction between deterministic and stochastic behavior, and makes the selection of an ideal modeling approach difficult.
A considerable amount of research has addressed dynamic system modeling in areas such as fluid turbulence (Takens, 1981), disease cases (Sugihara and May, 1990), and marketing (Mulhern and Caprara, 1994). In addition, this area has seen a large level of interest in the transportation research community of late.

When evaluating dynamic systems models for traffic flow forecasting, it is certainly important to consider the accuracy of the model. However, it is equally important to consider the environment in which the model will be developed, used, and maintained. Intelligent transportation systems deployed in metropolitan regions include thousands of traffic detectors. For example, in the Hampton Roads region of Virginia, a moderate-sized metropolitan area, the state department of transportation is in the process of installing over 1200 sensor stations on the freeways alone. In addition, the department, like most others, faces severe personnel shortages. As a result, models that require significant expertise and time to calibrate and maintain, on a site-by-site basis, are simply infeasible for use in this environment. Therefore, one must explicitly consider the requirements for dynamic system model calibration and maintenance.

1.2. Modeling approaches
Predicting traffic flow at a point decomposes into a time-series modeling problem in which one attempts to predict the value of a variable based on a series of past samples of the variable at regular intervals. Time series models have been studied in many disciplines, including transportation. In particular, transportation researchers have developed time series traffic prediction models using techniques such as autoregressive integrated moving average (ARIMA), nonparametric regression, neural networks, and heuristics. A complete review of transportation applications in this area can be found in Smith (1995) and Williams (1999).

Recent research by Williams (1999) has applied traditional parametric Box-Jenkins time series models to the dynamic system of single point traffic flow forecasting. This work has addressed most of the parametric model concerns for traffic condition data by establishing a theoretical foundation for using seasonal ARIMA forecast models (see Section 3.3). ARIMA forecasting is founded in stochastic system theory. ARIMA processes are nondeterministic with linear state transitions and can be periodic with the periodicity accounted for through seasonal differencing. There are, however, some significant concerns with the ability to fit and maintain seasonal ARIMA models on a "production" basis in ITS systems, given the level of expertise and amount of time required to do so. For example, the researchers recently developed a number of seasonal ARIMA models for the Hampton Roads system described above. It was found that the necessary outlier detection and parameter estimation for the seasonal ARIMA models was quite time-consuming. Using only three months of ten-minute time series data, it took more than six days to detect and model the outliers and estimate the parameters for a seasonal ARIMA model at a single sensor location. Estimating parameters sequentially for each location in a moderately sized system with 200 detector locations would take more than 3 years. Clearly, this is an extreme example resulting from a non-time critical research environment.
In fact, the researchers are currently exploring methods to automate and expedite the seasonal ARIMA "fitting" process. However, it does illustrate a significant practical concern regarding the wide-scale use of seasonal ARIMA models in traffic condition forecasting.

The alternative modeling approach that we will focus upon in this research is nonparametric regression. Nonparametric regression was selected as a high-potential alternative to seasonal ARIMA modeling due to the fact that the authors have demonstrated the advantages of nonparametric regression over other approaches, such as neural networks, in previous research efforts (Smith, 1995). Nonparametric regression forecasting is founded on chaotic system theory. By definition, chaotic systems are defined by state transitions that are deterministic and non-linear. Furthermore, the state transitions of a chaotic system are ergodic, not periodic.

Given that ARIMA models are founded on stochastic system theory, it may seem inappropriate on the surface to simultaneously consider both nonparametric regression and ARIMA as candidates for modeling and forecasting of the same data type. However, although an argument that traffic condition data is characteristically stochastic may be more easily asserted, the presence of "chaos like" behavior cannot be completely dismissed, especially during congestion when traffic flow is unstable and a stronger causative link may be operating in the time dimension. For example, Disbro and Frame presented evidence that traffic flow exhibits chaotic behavior in a 1989 paper (Disbro and Frame, 1989). Furthermore, the characteristic weekly pattern of traffic condition data supports the assertion that historical neighbor states should yield effective forecasts in the same way that past chaotic orbits passing through nearby points provide "good" short-term forecasts. Coupling this expectation of reasonably accurate forecasts with the attractive implementation characteristics of a data-driven approach provides ample motivation to investigate nonparametric regression performance.

2. Nearest neighbor nonparametric regression
Nonparametric regression relies on data describing the relationship between dependent and independent variables. The basic approach of nonparametric regression is heavily influenced by its roots in pattern recognition (Karlsson and Yakowitz, 1987). In essence, the approach locates the state of the system (defined by the independent variables) in a "neighborhood" of past, similar states. Once this neighborhood has been established, the past cases in the neighborhood are used to estimate the value of the dependent variable.

Clearly, this approach assumes that the bulk of the knowledge about the relationship lies in the data rather than the person developing the model (Eubank, 1988). Of course, the quality of the database, particularly in storing past cases that represent the breadth of all possible future conditions, dictates the effectiveness of a nonparametric regression model. To put it simply, if similar cases are not present in the database, an informed forecast cannot be generated.

2.1. Implementation challenges
There are four fundamental challenges when applying nonparametric regression: definition of an appropriate state space, definition of a distance metric to determine nearness of historical observations to the current conditions, selection of a forecast generation method given a collection of nearest neighbors, and management of the potential neighbors database.

2.1.1. State space
For data that is time series in nature, a state vector defines each record with a measurement at time t, t-1, ..., t-d, where d is an appropriate number of lags. For example, a state vector x(t) of flow rate measurements collected every 10 min with d = 3 can be written as

x(t) = [V(t), V(t-1), V(t-2), V(t-3)],    (1)

where V(t) is the flow rate during the current time interval, V(t-1) is the flow rate during the previous 10-min interval, and so on. Note that an infinite number of possible state vectors exist. Furthermore, they are not restricted to incorporating only lagged values but may also be supplemented with aggregate measures such as historical averages.

2.1.2. Distance metric
A common approach to measuring "nearness" in nonparametric regression is to use Euclidean distance in the independent variable space. Such an approach "weights" the value of each independent variable equally. In other words, it considers the information content of each to be of equal value. In many cases, knowledge of the problem domain may make such an assumption unreasonable. In that case, a weighted distance metric, where the "dimension" of variables with higher information content would be weighted more heavily, may be appropriate. While this makes intuitive sense, it is clear that such an approach is heuristic in nature, and requires careful consideration by the modeler.

2.1.3. Forecast generation
The most straightforward approach to generating the forecast for the dependent variable is to compute a simple average of the dependent variable values of the neighbors that have fallen within the nonparametric regression neighborhood. The weakness of such an approach is that it ignores all information provided by the distance metric. In other words, it is logical to assume that past cases "nearer" the current case have higher information content and should play a larger role in generating the forecast.

To address this concern, a number of weighting schemes have been proposed for use within nonparametric regression. In general these weights are proportional to the distance between the neighbor and the current condition. An alternative to the use of weights is to fit a linear or nonlinear model to the cases in the neighborhood, and then use the model to forecast the value of the dependent variable. For example, Mulhern and Caprara (1994) use a regression model in the selected neighborhood to generate marketing forecasts with promising results. However, we chose not to investigate this approach due to the fact that it introduces a parametric model to a nonparametric technique.

Finally, it should be clear that there are an infinite number of approaches to forecast generation. As we will discuss later in the paper, we were able to improve the effectiveness of our models by weighing our forecast directly with historical data.
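To make these pieces concrete, the following Python sketch combines the lagged state vector of Eq. (1), a Euclidean distance metric, and an inverse-distance weighting of the neighbors' successors. It is illustrative only; the inverse-distance weighting rule is one of many possible schemes noted above, not the authors' exact method.

```python
import numpy as np

def knn_forecast(history, current_state, k=10, d=3):
    """Forecast the next flow value from the k nearest historical states.

    history: sequence of past flow readings at regular intervals.
    current_state: [V(t), V(t-1), ..., V(t-d)], as in Eq. (1).
    """
    states, successors = [], []
    for i in range(d, len(history) - 1):
        # Lagged state vector ending at interval i, most recent value first.
        states.append(list(reversed(history[i - d:i + 1])))
        successors.append(history[i + 1])       # the value that followed
    states, successors = np.array(states), np.array(successors)
    dists = np.linalg.norm(states - np.asarray(current_state), axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-9)     # inverse-distance weighting
    return float(np.dot(weights, successors[nearest]) / weights.sum())
```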
2.1.4. Management of potential neighbor database
As stated earlier, the effectiveness of nonparametric regression is directly dependent on the quality of the database of potential neighbors. Clearly, it is desirable to possess a large database of cases that span the likely conditions that a system is expected to face. However, while as large a database as possible is desirable for increasing the accuracy of the model, the size of the database has significant implications on the timeliness of model execution.

When considering how nonparametric models work, one will see that the majority of effort at runtime is expended in searching the database for neighbors. As the database grows, this search process grows accordingly. For real-time applications, such as traffic flow forecasting, this is a significant issue. Steps must be taken to keep the database size manageable, while ensuring that the database has the depth and breadth necessary to support forecasting.

Again, the number of approaches to accomplish this is infinite. One approach would be to cluster the database, and only search those cases in the cluster in which the current state falls. Another approach would be to periodically delete records from the database. This process would involve searching for multiple records that are nearly identical. Such an approach would require the use of a distance metric such as those discussed earlier. While examining this issue fully is beyond the scope of this paper, it is important to realize that the management of the database is a critically important issue, particularly in real-time applications of nonparametric regression.

2.2. Nonparametric regression theoretical foundation
It is important to recognize that nonparametric regression has a strong theoretical basis. The statistical properties of the nearest neighbor approach are attractive. Using such an approach will result in an asymptotically optimal forecaster (Karlsson and Yakowitz, 1987). This means that for a state space of size m values, the nearest neighbor model will asymptotically be at least as good as any mth order parametric model. This property indicates that if one has access to a large, high-quality database, the nearest neighbor approach should yield results comparable to parametric models.

While the state definition for a nonparametric model is flexible, and can be defined in an infinite number of ways, a common approach for time-series problems is to define the state as the series of system values (in our case, traffic flow measurements) measured during the past d intervals. In dynamic systems, an attractor is defined as a value to which a system eventually settles as t approaches infinity (Takens, 1981). In other words, the system dynamic is driven by the attractor. In his work, Takens found that for a D dimension attractor, it is necessary to define the number of lags (d) in the state as 2D + 1 (Takens, 1981). Therefore, given that D is not known for single point traffic flow prediction, there is theoretical justification to consider multiple values of d when defining the state.

Furthermore, there is justification to investigate the inclusion of other values in the state definition beyond lagged time series values. For example, since nearest neighbor models "geometrically attempt to reconstruct whatever attractor is operating in a time series" (Mulhern and Caprara, 1994), including historical averages in the state vector further clarifies the position of each observation along the cyclical flow-time curve, which may improve forecast accuracy by finding neighbors that are more similar to the current conditions.

Thus, based on dynamic systems literature, there is theoretical justification for considering multiple system state definitions when applying nonparametric regression.
This highlights a promising opportunity to improve the accuracy of nonparametric regression forecasting.

3. Experimental approach

3.1. Research question
Earlier work by Smith (1995) found that a simple implementation of the nearest neighbor forecasting approach provides reasonably accurate traffic condition forecasts. Subsequent research by Williams (1999) demonstrated that seasonal ARIMA models outperformed the straight average k-nearest neighbor method. However, the advantages associated with a data-driven nonparametric forecasting approach motivate further investigation of refined nonparametric regression forecasting methods. Following this motivation, the research effort presented in this article sought to answer the question of whether nonparametric regression based on heuristically improved forecast generation methods would approach the single interval traffic flow prediction performance of seasonal ARIMA models.

3.2. The data
The data were collected by the United Kingdom Highways Agency using the motorway incident detection and automatic signaling (MIDAS) system. The pilot project for the MIDAS system included extensive instrumentation of the southwest quadrant of the London Orbital Motorway, M25. The instrumentation included four travel lane specific loop detectors at 500-m intervals. Traffic condition data from these detectors is collected on a one-minute interval, 24 h a day. The data were collected from 4 September to 30 November, 1996. The data are 15-min traffic flow rates from two loop detector locations, detector number 4757A and 4762A, which lie between the M3 and M23 interchanges. The 15-min data were formed by averaging the raw 1-min data over 15-min intervals and aggregating across all travel lanes. The data are for traffic flow in the clockwise direction on the outer loop of the M25. Fig. 1 shows the general location of the data sites.

[Fig. 1. M25 detector locations.]

For this research the data were split into two independent samples. The data collected between 4 September and 18 October provided the basis for model development. The models were then evaluated using the data collected between 19 October and 30 November.

Table 1 presents descriptive statistics for the data sets. The data contain very few missing observations, with only approximately 1% of the data missing for each development data set and no missing observations for the evaluation data. The PM peak flow rates fall in the range of 7000-8000 vehicles per hour (vph).

Table 1: Descriptive statistics--M25 data

Data series                    Series length    Missing   Percent      Series mean   Mean absolute
                               (no. of obs.)    values    missing (%)  (veh/h)       one-step change
Detector 4757A Sep/Oct 1996    4320             44        1.02         3106          309
Detector 4757A Oct/Nov 1996    4128             0         0.0          2916          282
Detector 4762A Sep/Oct 1996    4320             51        1.18         3100          315
Detector 4762A Oct/Nov 1996    4128             0         0.0          2915          287

3.3. Benchmark models
As stated in Section 3.1, this research focuses on improving the performance of nonparametric regression. To assess the quality of the forecasts, two benchmark models were used to "frame" the performance of the proposed models. First, a naïve forecasting technique was used to establish a minimum standard: a proposed model must certainly outperform a naïve model. Secondly, a seasonal ARIMA model was used in the context of a best case. In other words, the goal of modifying the nonparametric regression models was to approach the performance of parametric seasonal ARIMA models. The benchmark models utilized in the work are described more fully below.
3.3.2. Seasonal ARIMA parametric model

Recent work by Williams (1999) put forth a strong theoretical foundation for modeling traffic condition data as seasonal ARIMA processes and demonstrated the forecast accuracy of fitted seasonal ARIMA models. The theoretical foundation rests on two fundamental time series concepts, namely the definition of time series stationarity and the theorem known as the Wold Decomposition.

3.3.2.1. Seasonal ARIMA definition. The formal definition of a seasonal ARIMA process follows:

Seasonal ARIMA process. A time series {X_t} is a seasonal ARIMA (p, d, q)(P, D, Q)_S process with period S if d and D are nonnegative integers and if the differenced series Y_t = (1 − B)^d (1 − B^S)^D X_t is a stationary autoregressive moving average process defined by the expression

φ(B) Φ(B^S) Y_t = θ(B) Θ(B^S) ε_t,    (3)

where B is the backshift operator defined by B^a W_t = W_{t−a},

φ(z) = 1 − φ_1 z − ... − φ_p z^p,
Φ(z) = 1 − Φ_1 z − ... − Φ_P z^P,
θ(z) = 1 − θ_1 z − ... − θ_q z^q,
Θ(z) = 1 − Θ_1 z − ... − Θ_Q z^Q,

and ε_t is identically and normally distributed with mean zero and variance σ², with cov(ε_t, ε_{t−k}) = 0 for all k ≠ 0, i.e., {ε_t} ~ WN(0, σ²).

In the definition above, the parameters p and q represent the autoregressive and moving average order, respectively, and the parameters P and Q represent the autoregressive and moving average order at the model's seasonal period length, S. The parameters d and D represent the order of ordinary and seasonal differencing, respectively. The reader is referred to a comprehensive time series text, such as Brockwell and Davis (1996) or Fuller (1996), for additional background on ARIMA time series models.
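For concreteness, a model of this form can be fit with standard time series software. The sketch below uses the SARIMAX class from statsmodels; the order (1, 0, 1)(0, 1, 1)_672, the simulated placeholder series, and the weekly seasonal period of S = 672 fifteen-minute intervals (96 per day × 7 days) are assumptions for illustration, not necessarily the specification fitted in the work described here.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# 15-min flow rates (veh/h); in practice this would be the development
# data for one detector. Here we simulate a placeholder series.
rng = np.random.default_rng(0)
flows = pd.Series(3000 + 200 * rng.standard_normal(4320))

# Illustrative order: ARIMA(1,0,1)(0,1,1) with a weekly seasonal period
# of S = 672. A seasonal period this long makes estimation slow; this is
# a sketch of the mechanics, not a tuned implementation.
model = SARIMAX(flows, order=(1, 0, 1), seasonal_order=(0, 1, 1, 672))
fit = model.fit(disp=False)

# One-step-ahead (single interval) forecast, matching the paper's
# single interval prediction setting.
print(fit.forecast(steps=1))
```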
3.3.2.2. Theoretical justification. Formal ARIMA model definitions apply to time series that are weakly stationary or that can be made weakly stationary through differencing. By definition, a time series {X_t} is said to be weakly stationary if
(i) the unconditional expected value of X_t is the same for all t, and
(ii) the covariance between any two observations in the series depends only on the lag between the observations (independent of t).

Given the characteristic cyclical pattern of traffic condition data, it is obvious that neither raw data series nor data series that have been transformed by a first ordinary difference are stationary. Traffic flow, for example, rises to and falls from characteristic peaks. Although the daily patterns for the weekdays are quite similar to each other, the weekday patterns do vary from day to day.
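To illustrate why seasonal differencing, rather than a first ordinary difference alone, is the natural route to stationarity for such data, the sketch below applies the (1 − B^S) transformation at the weekly period. The value S = 672 follows from the data's resolution (96 fifteen-minute intervals per day × 7 days); the sinusoid-plus-noise series is a stand-in, not the M25 data.

```python
import numpy as np

S = 96 * 7  # weekly seasonal period for 15-min data: 672 intervals

def seasonal_difference(x, s=S):
    """Apply (1 - B^s): subtract the observation one season earlier."""
    x = np.asarray(x, dtype=float)
    return x[s:] - x[:-s]

# Stand-in series with a weekly cycle plus noise; the seasonal difference
# removes the repeating pattern, leaving approximately stationary noise.
rng = np.random.default_rng(1)
t = np.arange(672 * 4)
flows = 3000 + 1500 * np.sin(2 * np.pi * t / 672) + 50 * rng.standard_normal(t.size)
y = seasonal_difference(flows)
print(flows.std(), y.std())  # the differenced series varies far less
```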