Ensemble Approaches for Regression: a Survey

João M. Moreira a,*, Carlos Soares b,c, Alípio M. Jorge b,c and Jorge Freire de Sousa a

a Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
b Faculty of Economics, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-464 Porto, Portugal
c LIAAD, INESC Porto L.A., R. de Ceuta, 118, 6, 4050-190 Porto, Portugal

* Tel.: +351 225081639; fax: +351 225081538. Email address: jmoreira@fe.up.pt (João M. Moreira).

Preprint submitted to Elsevier, 19 December 2007.

Abstract

This paper discusses approaches from different research areas to ensemble regression. The goal of ensemble regression is to combine several models in order to improve the prediction accuracy on learning problems with a numerical target variable. The process of ensemble learning for regression can be divided into three phases: the generation phase, in which a set of candidate models is induced; the pruning phase, in which a subset of those models is selected; and the integration phase, in which the outputs of the models are combined to generate a prediction. We discuss different approaches to each of these phases, categorizing them in terms of relevant characteristics and relating contributions from different fields. Given that previous surveys have focused on classification, we expect that this work will provide a useful overview of existing work on ensemble regression and enable the identification of interesting lines for further research.

Key words: ensembles, regression, supervised learning

1 Introduction

Ensemble learning typically refers to methods that generate several models which are combined to make a prediction, either in classification or regression problems. This approach has been the object of a significant amount of research in recent years and good results have been reported (e.g., [1-3]). The advantage of ensembles with respect to single models has been reported in terms of increased robustness and accuracy [4].

Most work on ensemble learning focuses on classification problems. However, techniques that are successful for classification are often not directly applicable to regression. Therefore, although the two problems are related, ensemble learning approaches have been developed somewhat independently, and existing surveys on ensemble methods for classification [5,6] are not suitable to provide an overview of existing approaches for regression. This paper surveys existing approaches to ensemble learning for regression.
The relevance of this paper is strengthened by the fact that ensemble learning is an object of research in different communities, including pattern recognition, machine learning, statistics and neural networks. These communities have different conferences and journals and often use different terminology and notation, which makes it quite hard for a researcher to be aware of all the contributions that are relevant to his/her own work. Therefore, besides attempting to provide a thorough account of the work in the area, we also organize those approaches independently of the research area in which they were originally proposed. Hopefully, this organization will enable the identification of opportunities for further research and facilitate the classification of new approaches.

In the next section, we provide a general discussion of the process of ensemble learning. This discussion lays out the basis according to which the remaining sections of the paper are presented: ensemble generation (Sect. 3), ensemble pruning (Sect. 4) and ensemble integration (Sect. 5). Sect. 6 concludes the paper with a summary.

2 Ensemble Learning for Regression

In this section we provide a more accurate definition of ensemble learning and introduce the associated terminology. Additionally, we present a general description of the process of ensemble learning and describe a taxonomy of different approaches, both of which define the structure of the rest of the paper. Next we discuss the experimental setup for ensemble learning. Finally, we analyze the error decomposition of ensemble learning methods for regression.

2.1 Definition

First of all we need to define clearly what ensemble learning is, and to define a taxonomy of methods. As far as we know, there is no widely accepted definition of ensemble learning. Some of the existing definitions are partial in the sense that they focus only on the classification problem or on part of the ensemble learning process [7]. For these reasons we propose the following definition:

Definition 1: Ensemble learning is a process that uses a set of models, each of them obtained by applying a learning process to a given problem. This set of models (ensemble) is integrated in some way to obtain the final prediction.

This definition has important characteristics. In the first place, contrary to the informal definition given at the beginning of the paper, this one covers not only ensembles in supervised learning (both classification and regression problems), but also in unsupervised learning, namely the emerging research area of ensembles of clusters [8]. Additionally, it clearly separates ensemble and divide-and-conquer approaches.
This last family of approaches splits the input space into several sub-regions and trains a separate model in each sub-region. With this approach, the initial problem is converted into the resolution of several simpler sub-problems. Finally, the definition does not separate the combination and selection approaches, as is usually done. According to this definition, selection is a special case of combination where the weights are all zero except for one of them (to be discussed in Sect. 5).

More formally, an ensemble F is composed of a set of predictors of a function f, denoted \hat{f}_i:

F = \{ \hat{f}_i, \; i = 1, \dots, k \}.  (1)

The resulting ensemble predictor is denoted \hat{f}_f.

2.1.1 The Ensemble Learning Process

The ensemble process can be divided into three steps [9] (Fig. 1), usually referred to as the overproduce-and-choose approach. The first step is ensemble generation, which consists of generating a set of models. It often happens that, during this first step, a number of redundant models are generated. In the ensemble pruning step, the ensemble is pruned by eliminating some of the models generated earlier. Finally, in the ensemble integration step, a strategy to combine the base models is defined. This strategy is then used to obtain the prediction of the ensemble for new cases, based on the predictions of the base models.

Our characterization of the ensemble learning process is slightly more detailed than the one presented by Rooney et al. [10]. For those authors, ensemble learning consists of the solution of two problems: (1) how to generate the ensemble of models (ensemble generation); and (2) how to integrate the predictions of the models from the ensemble in order to obtain the final ensemble prediction (ensemble integration). This last approach (without the pruning step) is named direct, and can be seen as a particular case of the model presented in Fig. 1, named overproduce-and-choose. Ensemble pruning has been reported, at least in some cases, to reduce the size of the ensembles obtained without degrading the accuracy. Pruning has also been added to direct methods, successfully increasing the accuracy [11,12], a subject to be discussed further in Sect. 4.

2.1.2 Taxonomy and Terminology

Concerning the categorization of the different approaches to ensemble learning, we will mainly follow the taxonomy presented by the same authors [10]. They divide ensemble generation approaches into homogeneous, if all the models are generated using the same induction algorithm, and heterogeneous, otherwise.

Ensemble integration methods are classified by some authors [10,13] as combination (also called fusion) or as selection. The former approach combines the predictions of the models from the ensemble in order to obtain the final ensemble prediction. The latter approach selects from the ensemble the most promising model(s), and the prediction of the ensemble is based on the selected model(s) only. Here we use, instead, the classification into constant and non-constant weighting functions given by Merz [14]. In the first case, the predictions of the base models are always combined in the same way. In the second case, the way the predictions are combined can be different for different input values.
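To make the notation concrete, the following base-R sketch represents an ensemble as a list of fitted predictors, as in Eq. (1), and integrates them with a constant-weighted average of the kind discussed in Sect. 5.1. The data set (mtcars), the three base models and the weights are illustrative assumptions, not part of the survey.

## Minimal sketch of Definition 1 and Eq. (1): an ensemble is a set of
## fitted predictors whose outputs are integrated (here, by a constant
## weighted average) to give the final prediction. The built-in mtcars
## data, the model choices and the weights are purely illustrative.
data(mtcars)

# F = {f_hat_i, i = 1, ..., k}: three base predictors of mpg
ensemble <- list(
  lm(mpg ~ wt,        data = mtcars),
  lm(mpg ~ hp,        data = mtcars),
  lm(mpg ~ wt + qsec, data = mtcars)
)

# Constant weights alpha_i >= 0 summing to 1 (see Sect. 5.1)
alpha <- c(0.4, 0.2, 0.4)

# Ensemble predictor f_hat_f(x) = sum_i alpha_i * f_hat_i(x)
predict_ensemble <- function(ensemble, alpha, newdata) {
  preds <- sapply(ensemble, predict, newdata = newdata)  # n x k matrix
  drop(preds %*% alpha)
}

predict_ensemble(ensemble, alpha, mtcars[1:5, ])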
As mentioned earlier, research on ensemble learning is carried out in different communities. Therefore, different terms are sometimes used for the same concept. In Table 1 we list several groups of synonyms, extended from a previous list by Kuncheva [5]. The first column contains the terms most frequently used in this paper.

Table 1: Synonyms
ensemble: committee, multiple models, multiple classifiers (regressors)
predictor: model, regressor (classifier), learner, hypothesis, expert
example: instance, case, data point, object
combination: fusion, competitive classifiers (regressors), ensemble approach, multiple topology
selection: cooperative classifiers (regressors), modular approach, hybrid topology

2.2 Experimental setup

The experimental setups used in ensemble learning methods differ considerably depending on communities and authors. Our aim is to propose a general framework rather than to survey the different experimental setups described in the literature. The most common approach is to split the data into three parts: (1) the training set, used to obtain the base predictors; (2) the validation set, used for assessment of the generalization error of the base predictors; and (3) the test set, used for assessment of the generalization error of the final ensemble method. If a pruning algorithm is used, it is tested together with the integration method on the test set. Hastie et al. [15] propose to use 50% of the data for training, 25% for validation and the remaining 25% as the test set. This strategy works for large data sets, say, data sets with more than one thousand examples. For large data sets we propose the use of this approach mixed with cross-validation. To do this, and for this particular partition (50%, 25% and 25%), the data set is randomly divided into four equal parts, two of them being used as the training set, another one as the validation set and the last one as the test set. This process is repeated using all the combinations of training, validation and test sets among the four parts. With this partition there are twelve combinations. For smaller data sets, the percentage of data used for training must be higher, for example 80%, 10% and 10%. In this case the number of combinations is ninety. The main advantage is to train the base predictors with more examples (which can be critical for small data sets), but it has the disadvantage of increasing the computational cost. The process can be repeated several times in order to obtain different sample values for the evaluation criterion, namely the mse (Eq. 3).

2.3 Regression

In this paper we assume a typical regression problem. The data consist of a set of n examples of the form {(x_1, f(x_1)), ..., (x_n, f(x_n))}. The goal is to induce a function \hat{f} from the data,

\hat{f} : X \to \mathbb{R}, \quad \hat{f}(x) = f(x), \; \forall x \in X,  (2)

where f represents the unknown true function. The algorithm used to obtain the \hat{f} function is called the induction algorithm or learner. The \hat{f} function is called the model or predictor. The usual goal for regression is to minimize a squared error loss function, namely the mean squared error (mse),

mse = \frac{1}{n} \sum_{i=1}^{n} (\hat{f}(x_i) - f(x_i))^2.  (3)
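The partition scheme of Sect. 2.2 and the mse of Eq. (3) can be sketched in a few lines of R. The data set, the single base learner and the random seed below are illustrative assumptions; the point is only to show the twelve training/validation/test combinations that arise from a 50%/25%/25% split over four folds.

## Sketch of the partition scheme described in Sect. 2.2: the data are cut
## into four equal folds; every assignment of two folds to training, one to
## validation and one to test is enumerated (12 combinations), and the mse
## of Eq. (3) is used as the evaluation criterion.
mse <- function(y_hat, y) mean((y_hat - y)^2)

data(mtcars)
set.seed(1)
fold <- sample(rep(1:4, length.out = nrow(mtcars)))  # random 4-way split

combos <- list()
for (v in 1:4) for (t in setdiff(1:4, v))            # validation fold v, test fold t
  combos[[length(combos) + 1]] <- c(validation = v, test = t)
length(combos)  # 12 combinations, as stated in the text

results <- sapply(combos, function(cmb) {
  train <- mtcars[!fold %in% cmb, ]                  # the remaining two folds
  valid <- mtcars[fold == cmb["validation"], ]
  test  <- mtcars[fold == cmb["test"], ]
  base  <- lm(mpg ~ wt + hp, data = train)           # one base predictor
  c(valid_mse = mse(predict(base, valid), valid$mpg),
    test_mse  = mse(predict(base, test),  test$mpg))
})
rowMeans(results)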
2.4 Understanding the generalization error of ensembles

To accomplish the task of ensemble generation, it is necessary to know the characteristics that the ensemble should have. Empirically, several authors state that a good ensemble is one whose predictors are accurate and make their errors in different parts of the input space. For the regression problem it is possible to decompose the generalization error into different components, which can guide the process of optimizing the ensemble generation. Here, the functions are represented, when appropriate, without the input variables, just for the sake of simplicity; for example, instead of f(x) we use f. We closely follow Brown [16].

Understanding the ensemble generalization error enables us to know which characteristics the ensemble members should have in order to reduce the overall generalization error. The generalization error decomposition for regression is straightforward. What follows concerns the decomposition of the mse (Eq. 3). Despite the fact that the majority of these works were presented in the context of neural network ensembles, the results in this section do not depend on the induction algorithm used.

Geman et al. present the bias/variance decomposition for a single neural network [17]:

E\{[\hat{f} - E(f)]^2\} = [E(\hat{f}) - E(f)]^2 + E\{[\hat{f} - E(\hat{f})]^2\}.  (4)

The first term on the right hand side is called the bias and represents the distance between the expected value of the estimator \hat{f} and the unknown population average. The second term, the variance component, measures how the predictions vary with respect to the average prediction. This can be rewritten as

mse(\hat{f}) = bias(\hat{f})^2 + var(\hat{f}).  (5)

Krogh & Vedelsby describe the ambiguity decomposition for an ensemble of k neural networks [18]. Assuming that \hat{f}_f(x) = \sum_{i=1}^{k} \alpha_i \hat{f}_i(x) (see Sect. 5.1), where \sum_{i=1}^{k} \alpha_i = 1 and \alpha_i \geq 0, i = 1, \dots, k, they show that the error for a single example is

(\hat{f}_f - f)^2 = \sum_{i=1}^{k} \alpha_i (\hat{f}_i - f)^2 - \sum_{i=1}^{k} \alpha_i (\hat{f}_i - \hat{f}_f)^2.  (6)

This expression shows explicitly that the ensemble generalization error is less than or equal to the generalization error of a randomly selected single predictor. This is true because the ambiguity component (the second term on the right) is always non-negative. Another important result of this decomposition is that it is possible to reduce the ensemble generalization error by increasing the ambiguity without increasing the bias. The ambiguity term measures the disagreement among the base predictors on a given input x (omitted in the formulae just for the sake of simplicity, as previously mentioned). Two full proofs of the ambiguity decomposition [18] are presented in [16].

Later, Ueda & Nakano presented the bias/variance/covariance decomposition of the generalization error of ensemble estimators [19]. In this decomposition it is assumed that \hat{f}_f(x) = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_i(x):

E[(\hat{f}_f - f)^2] = \overline{bias}^2 + \frac{1}{k} \overline{var} + \left(1 - \frac{1}{k}\right) \overline{covar},  (7)

where

\overline{bias} = \frac{1}{k} \sum_{i=1}^{k} [E_i(\hat{f}_i) - f],  (8)

\overline{var} = \frac{1}{k} \sum_{i=1}^{k} E_i\{[\hat{f}_i - E_i(\hat{f}_i)]^2\},  (9)

\overline{covar} = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} E_{i,j}\{[\hat{f}_i - E_i(\hat{f}_i)][\hat{f}_j - E_j(\hat{f}_j)]\}.  (10)

The indexes i, j of the expectations mean that the expression holds for particular training sets, respectively L_i and L_j.

Brown provides a good discussion of the relation between ambiguity and covariance [16]. An important result obtained from the study of this relation is the confirmation that it is not possible to maximize the ensemble ambiguity without affecting the ensemble bias component as well, i.e., it is not possible to maximize the ambiguity component and minimize the bias component simultaneously. The discussion in the present section is usually referred to in the context of ensemble diversity, i.e., the study of the degree of disagreement between the base predictors. Many of the above statements are related to the well known statistical problem of point estimation. This discussion is also related to the multi-collinearity problem that will be discussed in Sect. 5.
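The ambiguity decomposition of Eq. (6) is easy to verify numerically. The following R fragment checks the identity on one made-up example; the predictions, target and weights are arbitrary illustrative values.

## Numerical check of the ambiguity decomposition, Eq. (6), on a single
## example: the ensemble squared error equals the weighted average error
## of the members minus the (non-negative) ambiguity term.
f_true <- 10                      # true value f(x) for one example
f_hat  <- c(9.2, 10.5, 11.1)      # base predictions f_hat_i(x)
alpha  <- c(0.5, 0.3, 0.2)        # convex combination weights

f_ens <- sum(alpha * f_hat)       # ensemble prediction f_hat_f(x)

lhs       <- (f_ens - f_true)^2                 # ensemble squared error
avg_error <- sum(alpha * (f_hat - f_true)^2)    # weighted member error
ambiguity <- sum(alpha * (f_hat - f_ens)^2)     # disagreement term (>= 0)

all.equal(lhs, avg_error - ambiguity)  # TRUE: Eq. (6) holds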
3 Ensemble generation

The goal of ensemble generation is to generate a set of models, F = \{\hat{f}_i, i = 1, \dots, k\}. If the models are generated using the same induction algorithm the ensemble is called homogeneous; otherwise it is called heterogeneous.

Homogeneous ensemble generation is the best covered area of ensemble learning in the literature. See, for example, the state of the art surveys from Dietterich [7] or Brown et al. [20]. In this section we mainly follow the former [7]. In homogeneous ensembles, the models are generated using the same algorithm. Thus, as explained in the following sections, diversity can be achieved by manipulating the data (Section 3.1) or by manipulating the model generation process (Section 3.2).

Heterogeneous ensembles are obtained when more than one learning algorithm is used. This approach is expected to obtain models with higher diversity [21]. The problem is the lack of control over the diversity of the ensemble during the generation phase. In homogeneous ensembles, diversity can be systematically controlled during their generation, as will be discussed in the following sections. Conversely, when using several algorithms, it may not be so easy to control the differences between the generated models. This difficulty can be addressed by using the overproduce-and-choose approach, in which diversity is guaranteed in the pruning phase [22]. Another approach, commonly followed, combines the two, by using different induction algorithms mixed with different parameter sets [23,10] (Sect. 3.2.1). Some authors claim that the use of heterogeneous ensembles improves on the performance of homogeneous ensemble generation. Note that heterogeneous ensembles can use homogeneous ensembles as base learners.

3.1 Data manipulation

Data can be manipulated in three different ways: subsampling from the training set, manipulating the input features and manipulating the output targets.

3.1.1 Subsampling from the training set

These methods have in common that the models are obtained using different subsamples from the training set. This approach generally assumes that the algorithm is unstable, i.e., that small changes in the training set imply important changes in the result. Decision trees, neural networks, rule learning algorithms and MARS are well known unstable algorithms [24,7]. However, some of the methods based on subsampling (e.g., bagging and boosting) have been successfully applied to algorithms usually regarded as stable, such as Support Vector Machines (SVM) [25].

One of the most popular of such methods is bagging [26]. It uses randomly generated training sets to obtain an ensemble of predictors. If the original training set L has m examples, bagging (bootstrap aggregating) generates a model by sampling uniformly m examples with replacement (some examples appear several times while others do not appear at all). Both Breiman [26] and Domingos [27] give insights on why bagging works.

Based on [28], Freund & Schapire present the AdaBoost (ADAptive BOOSTing) algorithm, the most popular boosting algorithm [29]. The main idea is that it is possible to convert a weak learning algorithm into one that achieves arbitrarily high accuracy. A weak learning algorithm is one that performs slightly better than random prediction. This conversion is done by combining the estimations of several predictors. As in bagging [26], the examples are randomly selected with replacement but, in AdaBoost, each example has a different probability of being selected. Initially, this probability is
equal for all the examples, but in the following iterations examples with more inaccurate predictions have a higher probability of being selected. In each new iteration there are more 'difficult' examples in the training set. Although boosting was originally developed for classification, several algorithms have been proposed for regression, but none has emerged as the most appropriate one [30].

Parmanto et al. describe the cross-validated committees technique for neural network ensemble generation using υ-fold cross-validation [31]. The main idea is to use as the ensemble the models obtained from the υ training sets of the cross-validation process.

3.1.2 Manipulating the input features

In this approach, different training sets are obtained by changing the representation of the examples. A new training set j is generated by replacing the original representation \{(x_i, f(x_i))\} with a new one \{(x_i^j, f(x_i))\}. There are two types of approaches. The first one is feature selection, i.e., x_i^j \subset x_i. In the second approach, the representation is obtained by applying some transformation to the original attributes, i.e., x_i^j = g(x_i).

A simple feature selection approach is the random subspace method, consisting of a random selection [32]. The models in the ensemble are independently constructed using a randomly selected feature subset. Originally, decision trees were used as base learners and the ensemble was called a decision forest [32]. The final prediction is the combination of the predictions of all the trees in the forest.

Alternatively, iterative search methods can be used to select the different feature subsets. Opitz uses a genetic algorithm approach that continuously generates new subsets starting from a random feature selection [33]. The author uses neural networks for the classification problem. He reports better results using this approach than using the popular bagging and AdaBoost methods. In [34] the search method is a wrapper-like hill-climbing strategy. The criteria used to select the feature subsets are the minimization of the individual error and the maximization of ambiguity (Sect. 2.4).

A feature selection approach can also be used to generate ensembles for algorithms that are stable with respect to the training set but unstable w.r.t. the set of features, namely the nearest neighbors induction algorithm. In [35] the feature subset selection is done using adaptive sampling in order to reduce the risk of discarding discriminating features. Compared to random feature selection, this approach reduces diversity between base predictors but increases their accuracy.

A simple transformation approach is input smearing [36]. It aims to increase the diversity of the ensemble by adding Gaussian noise to the inputs. The goal is to improve the results of bagging. Each input value x is changed into a smeared value x' using

x' = x + p \times N(0, \hat{\sigma}_X),  (11)

where p is an input parameter of the input smearing algorithm and \hat{\sigma}_X is the sample standard deviation of X, computed using the training set data. In this case, the examples are changed, but the training set keeps the same number of examples. In this work only the numeric input variables are smeared, even though the nominal ones could also be smeared using a different strategy. The results compare favorably to bagging. A similar approach, called BEN (Bootstrap Ensemble with Noise), was previously presented by Raviv & Intrator [37].
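As a concrete illustration of two of the generation strategies above, the following R sketch builds a homogeneous ensemble by bootstrap subsampling (bagging, Sect. 3.1.1) and perturbs the numeric inputs of each bootstrap sample as in Eq. (11). The data set, base learner and smearing parameter are illustrative assumptions and do not reproduce the experimental settings of the cited papers.

## Sketch combining bootstrap subsampling with input smearing, Eq. (11).
## mtcars, lm and p = 0.2 are illustrative choices only.
data(mtcars)
set.seed(42)

smear <- function(x, p) x + p * rnorm(length(x), mean = 0, sd = sd(x))

generate_ensemble <- function(data, k = 25, p = 0.2) {
  m <- nrow(data)
  lapply(seq_len(k), function(i) {
    boot <- data[sample(m, m, replace = TRUE), ]   # bootstrap sample of size m
    boot$wt <- smear(boot$wt, p)                   # smear numeric inputs only
    boot$hp <- smear(boot$hp, p)
    lm(mpg ~ wt + hp, data = boot)                 # one base model per sample
  })
}

ensemble <- generate_ensemble(mtcars)
# Simple average of the member predictions (constant weights 1/k)
pred <- rowMeans(sapply(ensemble, predict, newdata = mtcars))
head(pred)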
Rodriguez et al. [3] present a method that combines selection and transformation, called rotation forests. The original set of features is divided into k disjoint subsets to increase the chance of obtaining higher diversity. Then, for each subset, a principal component analysis (PCA) approach is used to project the examples onto a set of new features, consisting of linear combinations of the original ones. Using decision trees as base learners, this strategy assures diversity (decision trees are sensitive to the rotation of the axes) and accuracy (PCA concentrates most of the information contained in the data in a few features). The authors claim that rotation forests outperform bagging, AdaBoost and random forests (to be discussed further in Sect. 3.2.2). However, the adaptation of rotation forests to regression does not seem to be straightforward.

3.1.3 Manipulating the output targets

The manipulation of the output targets can also be used to generate different training sets. However, not much research follows this approach and most of it focuses on classification. An exception is the work of Breiman, called output smearing [38]. The basic idea is to add Gaussian noise to the target variable of the training set, in the same way as is done for the input features in the input smearing method (Sect. 3.1.2). Using this approach it is possible to generate as many models as desired. Although it was originally proposed using CART trees as base models, it can be used with other base algorithms. The comparison between output smearing and bagging shows a consistent, even if not outstanding, reduction of the generalization error.

An alternative approach consists of the following steps. First, it generates a model using the original data. Second, it generates a model that estimates the error of the predictions of the first model, and builds an ensemble that combines the prediction of the first model with the correction given by the second one. Finally, it iteratively generates models that predict the error of the current ensemble and then updates the ensemble with the new model. The training set used to generate the new model in each iteration is obtained by replacing the output targets with the errors of the current ensemble. This approach was proposed by Breiman, using bagging as the base algorithm, and was called iterated bagging [39]. Iterated bagging reduces the generalization error when compared with bagging, mainly due to the bias reduction achieved during the iteration process.

3.2 Model generation manipulation

As an alternative to manipulating the training set, it is possible to change the model generation process. This can be done by using different parameter sets, by manipulating the induction algorithm or by manipulating the resulting model.

3.2.1 Manipulating the parameter sets

Each induction algorithm is sensitive to the values of the input parameters.
The degree of sensitivity of the induction algorithm is different for different input parameters. To maximize the diversity of the models generated, one should focus on the parameters to which the algorithm is most sensitive.

Neural network ensemble approaches quite often use different initial weights to obtain different models. This is done because the resulting models vary significantly with different initial weights [40]. Several authors, like Rosen, for example, use randomly generated seeds (initial weights) to obtain different models [41], while other authors mix this strategy with the use of different numbers of layers and hidden units [42,43].

The k-nearest neighbors ensemble proposed by Yankov et al. [44] has just two members. They differ in the number of nearest neighbors used. Both are sub-optimal: one because the number of nearest neighbors is too small, and the other because it is too large. The purpose is to increase diversity (see Sect. 5.2.1).

3.2.2 Manipulating the induction algorithm

Diversity can also be attained by changing the way induction is done. Therefore, the same learning algorithm may have different results on the same data. Two main categories of approaches can be identified: sequential and parallel. In sequential approaches, the induction of a model is influenced only by the previous ones. In parallel approaches it is possible to have more extensive collaboration: (1) each process takes into account the overall quality of the ensemble and (2) information about the models is exchanged between processes.

Rosen [41] generates ensembles of neural networks by sequentially training networks, adding a decorrelation penalty to the error function to increase diversity. Using this approach, the training of each network tries to minimize a function that has a covariance component, thus decreasing the generalization error of the ensemble, as stated in [19]. This was the first approach using the decomposition of the generalization error by Ueda & Nakano [19] (Sect. 2.4) to guide the ensemble generation process.

Another sequential method to generate ensembles of neural networks is called SECA (Stepwise Ensemble Construction Algorithm) [30]. It uses bagging to obtain the training set for each neural network. The neural networks are trained sequentially. The process stops when adding another neural network to the current ensemble increases the generalization error.

The Cooperative Neural Network Ensembles (CNNE) method [45] also uses a sequential approach. In this work, the ensemble begins with two neural networks and then, iteratively, CNNE tries to minimize the ensemble error, firstly by training the existing networks, then by adding a hidden node to an existing neural network, and finally by adding a new neural network. As in Rosen's approach, the error function includes a term representing the correlation between the models in the ensemble. Therefore, to maximize the diversity, all the models already generated are trained again at each iteration of the process. The authors test their method not only on classification datasets but also on one regression data set, with promising results.

Tsang et al. [46] propose an adaptation of the CVM (Core Vector Machines) algorithm [47] that maximizes the diversity of the models in the ensemble by guaranteeing that they are orthogonal. This is achieved by adding constraints to the quadratic programming problem that is solved by the CVM algorithm.
This approach can be related to AdaBoost because higher weights are given to instances which are incorrectly classified in previous iterations.

Note that the sequential approaches mentioned above add a penalty term to the error function of the learning algorithm. This sort of added penalty has also been used in the parallel method Ensemble Learning via Negative Correlation (ELNC) to generate neural networks that are learned simultaneously, so that the overall quality of the ensemble is taken into account [48].

Parallel approaches that exchange information during the process typically integrate the learning algorithm with an evolutionary framework. Opitz & Shavlik [49] present the ADDEMUP (Accurate anD Diverse Ensemble-Maker giving United Predictions) method to generate ensembles of neural networks. In this approach, the fitness metric for each network weighs the accuracy of the network and the diversity of this network within the ensemble. The decomposition presented by Krogh & Vedelsby [18] is used. Genetic operators of mutation and crossover are used to generate new models from previous ones. The new networks are trained emphasizing misclassified examples. The best networks are selected and the process is repeated until a stopping criterion is met. This approach can be used with other induction algorithms. A similar approach is the Evolutionary Ensembles with Negative Correlation Learning (EENCL) method, which combines the ELNC method with an evolutionary framework.
Risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood

Report of a Joint FAO/WHO Expert Consultation
Bangkok, Thailand, 5-9 August 2002

FAO Food and Nutrition Paper, Food Safety Consultations
Food and Agriculture Organization of the United Nations
World Health Organization

Acknowledgements

The Food and Agriculture Organization of the United Nations and the World Health Organization would like to express their appreciation to the expert drafting groups (see Annex 2) for the time and effort which they dedicated to the preparation of thorough and extensive technical documents on risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafoods. The deliberations of this expert consultation were based on these documents.

Contents

1 Introduction
2 Background
3 Objectives of the consultation
4 Summary of the general discussions
5 Risk assessment of thermophilic Campylobacter spp. in broiler chickens
5.1.1 Summary of the Risk Assessment
5.1.2 Introduction
5.1.3 Scope
5.1.4 Approach
5.1.5 Key findings
5.1.6 Risk assessment and developing countries
5.1.7 Limitations and caveats
5.1.8 Gaps in the data
5.2 Review of the Risk Assessment
5.2.1 Introduction
5.2.2 Hazard identification
5.2.3 Exposure assessment
5.2.4 Hazard characterization
5.2.5 Risk characterization
5.2.6 Risk assessment and developing countries
5.2.7 Data deficiencies
5.3 Utility and Applicability
5.4 Response to the specific risk management questions posed by the Codex Committee on Food Hygiene
5.5 Conclusions and Recommendations
6 Risk assessment of Vibrio spp. in seafood
6.1 Summary of the risk assessments
6.1.1 Introduction: Vibrio spp. in seafood
6.1.2 Scope
6.1.3 Vibrio parahaemolyticus in raw oysters consumed in Japan, New Zealand, Canada, Australia and the United States of America
6.1.4 Vibrio parahaemolyticus in bloody clams
6.1.5 Vibrio parahaemolyticus in finfish
6.1.6 Vibrio vulnificus in raw oysters
6.1.7 Choleragenic Vibrio cholerae O1 and O139 in warm-water shrimps for export
6.2 Review of the Risk Assessments
6.2.1 Introduction: Vibrio spp. in seafood
6.2.2 Vibrio parahaemolyticus in raw oysters consumed in Japan, New Zealand, Canada, Australia and the United States of America
6.2.3 Vibrio parahaemolyticus in bloody clams
6.2.4 Vibrio parahaemolyticus in finfish
6.2.5 Vibrio vulnificus in raw oysters
6.2.6 Choleragenic Vibrio cholerae O1 and O139 in warm-water shrimp for export
6.3 Utility and applicability
6.4 Response to the specific questions posed by the Codex Committee on Fish and Fishery Products
6.5 Response to the needs of the Codex Committee on Food Hygiene
6.6 Conclusions and recommendations
7 Conclusions
8 Recommendations
Annex 1: List of participants
Annex 2: List of working documents

List of Abbreviations

CAC: Codex Alimentarius Commission
CCFFP: Codex Committee on Fish and Fishery Products
CCFH: Codex Committee on Food Hygiene
CFU: colony forming unit
CT: cholera toxin
ctx: cholera toxin gene
FAO: Food and Agriculture Organization of the United Nations
FDA-VPRA: The United States Food and Drug Administration draft risk assessment on the "Public Health Impact of Vibrio parahaemolyticus in Raw Molluscan Shellfish"
g: gram(s)
GHP: Good Hygienic Practices
GMP: Good Manufacturing Practices
h: hour(s)
ISSC: Interstate Seafood Sanitation Conference
MPN: Most Probable Number
PCR: Polymerase Chain Reaction
ppt: parts per thousand
RAP: FAO Regional Office for Asia and the Pacific
TDH: Thermostable direct haemolysin
tdh: thermostable direct haemolysin gene
TLH: Thermolabile haemolysin
TRH: TDH-related haemolysin
trh: tdh-related haemolysin gene
USFDA: United States Food and Drug Administration
WHO: World Health Organization

1 Introduction

The Food and Agriculture Organization of the United Nations (FAO) and the World Health Organization (WHO) convened an expert consultation on "Risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood" in the FAO Regional Office for Asia and the Pacific (RAP), Bangkok, Thailand, on 5 - 9 August 2002. The list of participants is presented in Annex 1.

Mr Dong Qingsong, FAO Deputy Regional Representative for Asia and the Pacific and Officer-in-charge, RAP, opened the meeting on behalf of the two sponsoring organizations. In welcoming the participants, Mr Qingsong noted the increasing significance of microbiological hazards in relation to food safety. He noted that international trade had amplified the opportunity for these hazards to be disseminated from the original point of production to locations thousands of miles away, thereby permitting such food safety hazards to impact on public health and trade in more than one country. Mr Qingsong observed that this underlined the need to first consider microbiological hazards at the international level and provide the means by which they can then be addressed at regional and national levels. He highlighted the commitment of FAO and WHO to provide a neutral international forum to consider new approaches to achieving food safety, and in particular to address microbiological risk assessment.

As the meeting was convened in Asia, Mr Qingsong also highlighted the importance of the work on microbiological risk assessment for this region. He noted that seafood and poultry are important products in both domestic and international trade and provide a valuable contribution to the economy of the region. He also noted the concerns for public health raised by the microbiological hazards associated with these foods.
Therefore, in conclusion he requested the expert consultation to consider, in particular, the practical application of their work so that the output would be of value to a range of end users and to countries in different stages of development.The consultation appointed Dr Servé Notermans (the Netherlands) as chairperson of the consultation and Ms Dorothy-Jean McCoubrey (New Zealand) as rapporteur. Dr Notermans also served as chairperson of the campylobacter working group and Prof. Diane Newell (United Kingdom) served as rapporteur. Dr Mark Tamplin (United States of America) served as chairperson of the working group on vibrio and Dr Ron Lee (United Kingdom) as rapporteur of that group.2 BackgroundRisk assessment of microbiological hazards in foods has been identified as a priority area of work for the Codex Alimentarius Commission (CAC). At its 32nd session the Codex Committee on Food Hygiene (CCFH) identified a list of pathogen-commodity combinations for which it requires expert risk assessment advice. In response, FAO and WHO jointly launched a programme of work with the objective of providing expert advice on risk assessment of microbiological hazards in foods to their member countries and to the CAC.Dr. Hajime Toyofuku, WHO, and Dr. Sarah Cahill, FAO provided participants with an overview of the joint FAO/WHO activities on microbiological risk assessment. In their presentation, they also highlighted the objectives and expected outcomes of the current meeting.The FAO/WHO programme of activities on microbiological risk assessment aims to serve two customers - the CAC and the FAO and WHO member countries. The CAC, and in particular the CCFH, needs sound scientific advice as a basis for the development of guidelines and recommendations as well as the answers to specific risk management questions on certain pathogen-commodity combinations. Member countries on the other hand need specific risk assessment tools to use in conducting their own assessments and, if possible, some modules that can be directly applied in a national risk assessment.To implement this programme of work, FAO and WHO are convening a series of joint expert consultations. To date three expert consultations have been held. The first of these was held on 17 - 21July 20001 and the second on 30 April - 4 May 20012. Both of these expert consultations addressed risk assessment of Salmonella spp. in broiler chickens and eggs and Listeria monocytogenes in ready-to-eat foods. In March 2001, FAO and WHO initiated risk assessment work on the two pathogen-commodity combinations being considered in this expert consultation. Two ad hoc expert drafting groups were established to examine the available relevant information on Campylobacter spp. in broiler chickens and Vibrio spp. in seafood and prepare documentation on both the exposure assessment and hazard characterization steps of the risk assessment. These documents were reviewed and evaluated by a joint expert consultation convened on 23 - 27 July 20013.In October 2001, the report of that expert consultation was delivered to CCFH. Additional risk management guidance was sought from the Committee in relation to the finalization of the risk assessments. In response to this, the Committee established risk management drafting groups to consider the approach it could take towards providing guidance on managing problems associated with Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. 
A limited amount of feedback was received; therefore the risk assessment drafting groups, FAO and WHO met early in 2002 to consider how to complete the risk assessment work. Both risk assessments have been progressed towards completion, although to varying extents. The risk assessment documents that have been developed to date were reviewed and evaluated by the expert consultation.The purpose of this report is to present the summary of the draft documents on risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood as well as the discussions and recommendations of the expert consultation.3 Objectives of the consultationThe expert consultation examined the information provided by the expert drafting groups on risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood with the following objectives;•To critically review the documents prepared by the ad hoc expert drafting groups giving particular attention to;•the risk characterization component and the manner in which the risk assessment outputs were generated;•the assumptions made in the risk assessment;•the associated uncertainty and variability;•the potential application of the work.•To provide scientific advice to FAO and WHO member countries on the risk assessment of Vibrio spp. in seafood and Campylobacter spp. in broiler chickens based on the available documentation and the discussions during the expert consultation.•To respond to the risk management needs of Codex.1 FAO. 2000. Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods, FAO Headquarters, Rome, Italy, 17 - 20 July 2000. FAO Food and Nutrition Paper 71./es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm2 FAO. 2001. Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods: Risk characterization of Salmonella spp. in eggs and broiler chickens and Listeria monocytogenes in ready-to-eat foods, FAO Headquarters, Rome, Italy, 30 April - 4 May 2001. FAO Food and Nutrition Paper 72./es/ESN/food/risk_mra_salmonella_en.stm / http://www.who.int/fsf/Micro/index.htm3 WHO. 2001. Report of the Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods; Hazard identification, exposure assessment and hazard characterization of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. WHO Headquarters, Geneva, Switzerland 23 - 27 July 2001. WHO 2001./es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm4 Summary of the general discussionsThe drafting groups presented overviews of the risk assessment documents on Campylobacter spp. in broiler chickens and Vibrio spp. in seafood to the expert consultation. A summary of the draft risk assessments and the discussions of the expert consultation are given in sections five and six of this report. However, a number of issues relevant to risk assessment in general that were addressed are summarized below.It was anticipated that the campylobacter and vibrio risk assessments will prove useful for decision makers such as Codex committees, food industries, food control authorities, regulators and other stake holders. It was not the task of the risk assessment groups to undertake risk profiling or determine acceptable risk levels – this is the role of risk managers. 
However, the outputs of this work should provide tools and information useful to both risk managers and risk assessors.It was not possible at this stage to complete full quantitative risk assessments for all pathogens, inclusive of full validation exercises. This was due to the lack of extensive quantitative data being available, the uncertainties that exist and the need to make assumptions e.g. that all strains of Campylobacter spp. are pathogenic. However, it was noted that even preliminary quantitative and good qualitative risk assessments would be very useful in decision making. The risk assessment of Vibrio parahaemolyticus in bloody clams provides an example of how readily accessible data can be collated into the risk assessment framework to give valuable information that could be used towards improving public health.The models on Campylobacter spp. and Vibrio spp provided examples of how risk assessment can be applied to pathogen-commodity combinations. However, it is important that countries do not just transfer any conclusions from this risk assessment work to make risk management decisions without giving due consideration to the effect of different species, environments and populations in their own country.The challenge will be to show the world, including developing countries, the advantages of undertaking risk assessment. Training, complete with easy to understand guidelines on how to apply the risk assessments was considered extremely important. The training and guidelines material should be written in simple language that is applicable to different audiences and countries with different levels of development.There is concern by many countries that not having a full risk assessments will create trade barriers. Assistance may need to be given to many developing countries to generate and collect data and undertake modelling exercises so that such concerns can be addressed.The consultation formed two working groups to address the risk assessment documentation on Campylobacter spp. and Vibrio spp. respectively. The composition of the two working groups is outlined in the table below.Campylobacter spp. in broiler chickensIndependent experts Expert members of the drafting group Members of the secretariat John Cowden Bjarke Christensen Sarah CahillHeriberto Fernández Aamir Fazil Karen HulebackGeoffrey C. Mead Emma Hartnett Jeronimas Maskeliunas Diane G. Newell Anna Lammerding Hajime ToyofukuServé Notermans Maarten NautaSasitorn Kanarat Greg PaoliPaul Brett VanderlindeVibrio spp. in seafoodIndependent experts Expert members of the drafting group Members of the secretariat Nourredine Bouchriti Anders Dalsgaard Lahsen AbabouchJean-Michel Fournier Angelo DePaola Peter Karim BenEmbarek Ron Lee I. Karunasager Maria de Lourdes Costarrica Carlos A. Lima dos Santos Thomas McMeekinDorothy Jean McCoubrey Ken OsakaMarianne Miliotis John SumnerMitsuaki Nishibuchi Mark WalderhaugPensri RodmaMark Tamplin5 Risk Assessment of thermophilic Campylobacter spp. inbroiler chickens5.1.1 Summary of the Risk Assessment5.1.2 IntroductionAn understanding of thermophilic Campylobacter spp., and specifically Campylobacter jejuniin broiler chickens is important from both public health and international trade perspectives. 
In order to achieve such an understanding, an evaluation of this pathogen-commodity combination by quantitative risk assessment methodology was undertaken.

The first steps of this process, hazard identification, hazard characterization and the conceptual model for exposure assessment, were presented at an expert consultation in July 2001 and summarized in the report of that consultation [4]. The exposure assessment model described the production-to-consumption pathway for fresh and frozen broiler chicken carcasses prepared and consumed in the home.

The final step, risk characterization, integrates the exposure and dose-response information in order to attempt to quantify the human-health risk attributable to pathogenic thermophilic Campylobacter spp. in broiler chickens. The risk assessment model and results were presented to the present expert consultation in the working document MRA 02/01. Recommendations for modifications and/or areas for further development of the exposure assessment model, arising from the 2001 consultation, were taken into consideration and, where possible, were incorporated during preparation of the MRA 02/01 report. Recommendations that were not incorporated are noted in the exposure assessment summary section below.

5.1.3 Scope

The purpose of the work was to develop a risk assessment that attempts to understand how the incidence of human campylobacteriosis is influenced by various factors during chicken rearing, processing, distribution, retail storage, consumer handling, meal preparation and finally consumption. A benefit of this approach is that it enables consideration of the broadest range of intervention strategies. The risk characterization estimates the probability of illness per serving of chicken associated with the presence of thermophilic Campylobacter spp. on fresh and frozen whole broiler chicken carcasses with the skin intact and which are cooked in the domestic kitchen for immediate consumption. A schematic representation of the risk assessment is shown in Figure 5.1.

FIGURE 5.1: Schematic representation of the risk assessment of Campylobacter spp. in broiler chickens.

[4] WHO. 2001. Report of the Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods; Hazard identification, exposure assessment and hazard characterization of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. WHO Headquarters, Geneva, Switzerland, 23 - 27 July 2001. /es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm

5.1.4 Approach

5.1.4.1 Hazard identification

Thermophilic Campylobacter spp. are a leading cause of zoonotic enteric illness in most developed countries (Friedman et al., 2000). Human cases are usually caused by C. jejuni and, to a lesser extent, by Campylobacter coli. Information on the burden of human campylobacteriosis in developing countries is more limited (Oberhelman & Taylor, 2000). Nevertheless, it is reported that asymptomatic infection with C. jejuni and C. coli is frequent in adults. In children under the age of two, C. jejuni, C. coli and other Campylobacter spp. are all associated with enteric disease.

Campylobacter spp. may be transferred to humans by direct contact with contaminated animals or animal carcasses, or indirectly through the ingestion of contaminated food or water. The principal reservoirs of thermophilic Campylobacter spp. are the alimentary tracts of wild and domesticated mammals and birds. Campylobacter spp.
are commonly found in poultry, cattle, pigs, sheep, wild animals and birds, and in dogs and cats. Foodstuffs, including, poultry, beef, pork, other meat products, raw milk and milk products, and, less frequently, fish and fish products, mussels and fresh vegetables can also be contaminated (Jacobs-Reitsma, 20007). Findings of analytical epidemiological studies are conflicting. Some have identified handling raw poultry and eating poultry products as important risk factors for sporadic campylobacteriosis (Friedman, et al., 20005), however others have found that contact in the home with such products is protective (Adak et al., 19958).5 Friedman, C., Neimann, J., Wegener, H. & Tauxe, R. 2000. Epidemiology of Campylobacter jejuni infections in the United States and other industrialized nations. In Campylobacter 2nd Edition, pp. 121-138. Edited by I. Nachamkin & M. J. Blaser. Washington DC: ASM Press.6 Oberhelman, R. & Taylor, D. 2000. Campylobacter infections in developing countries. In Campylobacter 2nd Edition, pp. 139-154. Edited by I. Nachamkin & M. Blaser. Washington DC: ASM Press.7 Jacobs-Reitsma, W. 2000. Campylobacter in the food supply. In Campylobacter 2nd Ed., pp. 467-481. Edited by I. Nachamkin & M. J. Blaser. Washington DC: ASM Press.8 Adak, G. K., Cowden, J. M., Nicholas, S. & Evans, H. S. 1995. The Public Health Laboratory Service national case-control study of primary indigenous sporadic cases of campylobacter infection. Epidemiology and Infection 115, 15-22.Information for the hazard identification section was compiled from published literature and from unpublished data submitted to FAO and WHO by public health agencies and other interested parties.5.1.4.2 Hazard characterizationHazard characterization provides a description of the public health outcomes following infection, including sequelae, pathogen characteristics influencing the organism's ability to elicit infection and illness, host characteristics that influence the acquisition of infection, and food-related factors that may affect the survival of C. jejuni in the human gastrointestinal tract. A dose-response model was derived that mathematically describes the relationship between the numbers of organisms that might be present in a food and consumed (the dose), and the human health outcome (the response). In order to achieve this, human feeding trial data (Black et al., 19889)for two strains of C jejuni were pooled and used to derive estimates for the probability of infection (colonization of the gastrointestinal track without overt symptoms) as well as the probability of illness once infected. Campylobacteriosis predominantly occurs as sporadic cases (Friedman, et al., 200010), and hence dose-response data from outbreak investigations are essentially non-existent.The probability of infection upon ingestion of a dose of C. jejuni was estimated with the caveat that the data are from a feeding study involving healthy volunteers, and using a milk matrix and a limited number of campylobacter strains. Whether the probability of infection and/or illness are different for immune individuals or those individuals with an increased susceptibility to illness (e.g. immunosuppressed, very young, etc.) compared to the volunteers could not be determined from the feeding trial data. The probability of illness following infection was also estimated based upon these investigations, and assumed to be dose-independent. 
Again, the impact of other factors, such as susceptibility, on the probability of illness cannot be quantified due to a lack of adequate epidemiological data and resolution to this level. The progression of the illness to more serious outcomes and the development of some sequelae can be crudely estimated from the approximate proportions reported in the literature, but these were not included in the present work. For the purposes of this model it was assumed that C. coli has the same properties as C. jejuni.5.1.4.3 Exposure assessmentThe exposure model describes the stages Rearing and Transport, Slaughter and Processing, and Preparation and Consumption (Figure 5.1). This modular approach estimates the prevalence and concentration of thermophilic Campylobacter spp., and the changes in these associated with each stage of processing, and refrigerated and frozen storage and consumer handling of the product.Routes contributing to initial introduction of Campylobacter spp. into a poultry flock11 on the farm have been identified in epidemiological studies. However, these remain poorly understood and the phenomenon may be multi-factorial, and hence, colonization was modelled based on an unspecified route of introduction. However, an epidemiology module has been created as a supplement to the model, which allows evaluation of interventions aimed at changing the frequency with which risk factors occur. As such risk factors are usually interdependent and may be country specific, incorporation of this module into the current risk model was not appropriate: however, it does illustrate how such information may be used in a specific situation.The effects of cooling and freezing, and adding chlorine during water chilling, are evaluated in the model. Effects of other processing treatments, e.g. lactic acid and irradiation, on reduction of Campylobacter spp. contamination on poultry carcasses were not explicitly evaluated as the quantitative data are lacking: however, if the level of reduction as a consequence of any type of 9 Black, R. E., Levine, M. M., Clements, M. L., Hughes, T. P. & Blaser, M. J. 1988. Experimental Campylobacter jejuni infection in humans. Journal of Infectious Diseases 157, 472-9.10 Friedman, C., Neimann, J., Wegener, H. & Tauxe, R. 2000. Epidemiology of Campylobacter jejuni infections in the United States and other industrialized nations. In Campylobacter 2nd Edition, pp. 121-138. Edited by I. Nachamkin & M. J. Blaser. Washington DC: ASM Press.11 A flock is a group of hens of similar age that are managed and housed together.processing modification, can be quantified for any treatment, then the impact on risk can be readily evaluated within the risk model.The home preparation module considered two routes of ingestion of viable organisms: via consumption of contaminated, undercooked chicken meat, and via cross-contamination, from contaminated raw poultry onto another food that is eaten without further cooking, or directly from the hands (Figure 5.2). As a result of recommendations from the previous expert consultation12 the “protected areas” cooking approach and the drip-fluid model approach for cross-contamination were selected. The "protected areas" approach to modelling the cooking step considers that campylobacter cells may be located in parts of the carcass that reach cooking temperature more slowly than other locations. 
The drip-fluid model for cross-contamination relates to the spread of campylobacter cells via drip-fluid from raw carcasses.

FIGURE 5.2: Schematic representation of the home preparation module.

5.1.4.4 Risk characterization

The risk characterization step integrates the information collected during the hazard identification, hazard characterization and exposure assessment steps to arrive at estimates of adverse events that may arise due to consumption of chicken (Figure 5.1). This step links the probability and magnitude of exposure to campylobacter associated with consumption of chicken to adverse outcomes that might occur. The resulting risk is expressed as individual risk or the risk per serving of chicken. Although this model does not address risk to a specific population, data on amounts of chicken consumed can be incorporated into the model to arrive at estimates of the population-based risk.

The current model is unable to provide a central estimate of risk due to virtually unbounded uncertainty [13] on two key components of the model, namely the impact of undercooking and the impact of cross-contamination. Since these processes are the ultimate determinant of the final exposure of the consumer, the unbounded and unresolvable nature of the uncertainty undermines the establishment of a central estimate of consumer risk. In addition, the model has not yet been applied

[12] WHO. 2001. Report of the Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods; Hazard identification, exposure assessment and hazard characterization of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. WHO Headquarters, Geneva, Switzerland, 23 - 27 July 2001. /es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm

[13] As the upper and lower boundaries of the values describing the impact of undercooking and cross-contamination are unknown, the distribution of possible values is potentially infinite and therefore unbounded.
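To illustrate how the hazard characterization and risk characterization steps fit together computationally, the following R sketch combines a dose-response function with a hypothetical exposure distribution to obtain a risk-per-serving estimate. The approximate Beta-Poisson form, every parameter value and the exposure distribution are illustrative assumptions for demonstration only; they are not the dose-response model or the exposure estimates of the FAO/WHO risk assessment.

## Hedged sketch of the dose-response / risk-per-serving logic described in
## Sects. 5.1.4.2 and 5.1.4.4. All numbers below are illustrative, NOT the
## values used in the FAO/WHO work.
p_infection <- function(dose, alpha = 0.15, n50 = 900) {
  # Approximate Beta-Poisson: P(inf) = 1 - (1 + dose * (2^(1/alpha) - 1) / N50)^(-alpha)
  1 - (1 + dose * (2^(1 / alpha) - 1) / n50)^(-alpha)
}

p_ill_given_inf <- 0.3   # assumed dose-independent, as in the model description

# Monte Carlo over a hypothetical distribution of ingested doses per serving
set.seed(123)
doses <- rlnorm(1e5, meanlog = 2, sdlog = 1.5)     # illustrative exposure only
risk_per_serving <- mean(p_infection(doses) * p_ill_given_inf)
risk_per_serving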
Journal of Statistical SoftwareNovember2008,Volume28,Issue5./ Building Predictive Models in R Using thecaret PackageMax KuhnPfizer Global R&DAbstractThe caret package,short for classification and regression training,contains numerous tools for developing predictive models using the rich set of models available in R.Thepackage focuses on simplifying model training and tuning across a wide variety of modelingtechniques.It also includes methods for pre-processing training data,calculating variableimportance,and model visualizations.An example from computational chemistry is usedto illustrate the functionality on a real data set and to benchmark the benefits of parallelprocessing with several types of models.Keywords:model building,tuning parameters,parallel processing,R,NetWorkSpaces.1.IntroductionThe use of complex classification and regression models is becoming more and more com-monplace in science,finance and a myriad of other domains(Ayres2007).The R language (R Development Core Team2008)has a rich set of modeling functions for both classification and regression,so many in fact,that it is becoming increasingly more difficult to keep track of the syntactical nuances of each function.The caret package,short for classification and regression training,was built with several goals in mind:to eliminate syntactical differences between many of the functions for building and predicting models,to develop a set of semi-automated,reasonable approaches for optimizing the values of the tuning parameters for many of these models andcreate a package that can easily be extended to parallel processing systems.2caret:Building Predictive Models in RThe package contains functionality useful in the beginning stages of a project(e.g.,data splitting and pre-processing),as well as unsupervised feature selection routines and methods to tune models using resampling that helps diagnose over-fitting.The package is available at the Comprehensive R Archive Network at http://CRAN.R-project. 
org/package=caret.caret depends on over25other packages,although many of these are listed as“suggested”packages which are not automatically loaded when caret is started.Pack-ages are loaded individually when a model is trained or predicted.After the package is in-stalled and loaded,help pages can be found using help(package="caret").There are three package vignettes that can be found using the vignette function(e.g.,vignette(package= "caret")).For the remainder of this paper,the capabilities of the package are discussed for:data splitting and pre-processing;tuning and building models;characterizing performance and variable importance;and parallel processing tools for decreasing the model build time.Rather than discussing the capabilities of the package in a vacuum,an analysis of an illustrative example is used to demonstrate the functionality of the package.It is assumed that the readers are familiar with the various tools that are mentioned.Hastie et al.(2001)is a good technical introduction to these tools.2.An illustrative exampleIn computational chemistry,chemists often attempt to build predictive relationships between the structure of a chemical and some observed endpoint,such as activity against a biological ing the structural formula of a compound,chemical descriptors can be generated that attempt to capture specific characteristics of the chemical,such as its size,complexity,“greasiness”etc.Models can be built that use these descriptors to predict the outcome of interest.See Leach and Gillet(2003)for examples of descriptors and how they are used. Kazius et al.(2005)investigated using chemical structure to predict mutagenicity(the in-crease of mutations due to the damage to genetic material).An Ames test(Ames et al.1972) was used to evaluate the mutagenicity potential of various chemicals.There were4,337com-pounds included in the data set with a mutagenicity rate of55.3%.Using these compounds, the dragonX software(Talete SRL2007)was used to generate a baseline set of1,579predic-tors,including constitutional,topological and connectivity descriptors,among others.These variables consist of basic numeric variables(such as molecular weight)and counts variables (e.g.,number of halogen atoms).The descriptor data are contained in an R data frame names descr and the outcome data are in a factor vector called mutagen with levels"mutagen"and"nonmutagen".These data are available from the package website /.3.Data preparationSince there is afinite amount of data to use for model training,tuning and evaluation,one of thefirst tasks is to determine how the samples should be utilized.There are a few schools of thought.Statistically,the most efficient use of the data is to train the model using all of the samples and use resampling(e.g.,cross-validation,the bootstrap etc.)to evaluate the efficacy of the model.Although it is possible to use resampling incorrectly(Ambroise andJournal of Statistical Software3 McLachlan2002),this is generally true.However,there are some non-technical reasons why resampling alone may be insufficient.Depending on how the model will be used,an external test/validation sample set may be needed so that the model performance can be characterized on data that were not used in the model training.As an example,if the model predictions are to be used in a highly regulated environment(e.g.,clinical diagnostics),the user may be constrained to“hold-back”samples in a validation set.For illustrative purposes,we will do an initial split of the data into training and test sets. 
The test set will be used only to evaluate performance(such as to compare models)and the training set will be used for all other activities.The function createDataPartition can be used to create stratified random splits of a data set.In this case,75%of the data will be used for model training and the remainder will be used for evaluating model performance.The function creates the random splits within each class so that the overall class distribution is preserved as well as possible.R>library("caret")R>set.seed(1)R>inTrain<-createDataPartition(mutagen,p=3/4,list=FALSE)R>R>trainDescr<-descr[inTrain,]R>testDescr<-descr[-inTrain,]R>trainClass<-mutagen[inTrain]R>testClass<-mutagen[-inTrain]R>R>prop.table(table(mutagen))mutagenmutagen nonmutagen0.55363320.4463668R>prop.table(table(trainClass))trainClassmutagen nonmutagen0.55350550.4464945In cases where the outcome is numeric,the samples are split into quartiles and the sampling is done within each quartile.Although not discussed in this paper,the package also contains method for selecting samples using maximum dissimilarity sampling(Willett1999).This approach to sampling can be used to partition the samples into training and test sets on the basis of their predictor values.There are many models where predictors with a single unique value(also known as“zero-variance predictors”)will cause the model to fail.Since we will be tuning models using resampling methods,a random sample of the training set may result in some predictors with more than one unique value to become a zero-variance predictor(in our data,the simple split of the data into a test and training set caused three descriptors to have a single unique value in the training set).These so-called“near zero-variance predictors”can cause numerical problems during resampling for some models,such as linear regression.As an example of such a predictor,the variable nR04is the number of number of4-membered rings in a compound.For the training set,almost all of the samples(n=3,233)have no4caret:Building Predictive Models in R4-member rings while18compounds have one and a single compound has2such rings.If these data are resampled,this predictor might become a problem for some models.To identify this kind of predictors,two properties can be examined:First,the percentage of unique values in the training set can be calculated for each predictor.Variables with low percentages have a higher probability of becoming a zero variance predictor during resampling.For nR04,the percentage of unique values in the training set is low(9.2%).However,this in itself is not a problem.Binary predictors, such as dummy variables,are likely to have low percentages and should not be discarded for this simple reason.The other important criterion to examine is the skewness of the frequency distribution of the variable.If the ratio of most frequent value of a predictor to the second most frequent value is large,the distribution of the predictor may be highly skewed.For nR04,the frequency ratio is large(179=3233/18),indicating a significant imbalance in the frequency of values.If both these criteria areflagged,the predictor may be a near zero-variance predictor.It is suggested that if:1.the percentage of unique values is less than20%and2.the ratio of the most frequent to the second most frequent value is greater than20, the predictor may cause problem for some models.The function nearZeroVar can be used to identify near zero-variance predictors in a dataset.It returns an index of the column numbers that violate the two 
conditions above.Also,some models are susceptible to multicollinearity(i.e.,high correlations between pre-dictors).Linear models,neural networks and other models can have poor performance in these situations or may generate unstable solutions.Other models,such as classification or regression trees,might be resistant to highly correlated predictors,but multicollinearity may negatively affect interpretability of the model.For example,a classification tree may have good performance with highly correlated predictors,but the determination of which predictors are in the model is random.If there is a need to minimize the effect of multicollinearity,there are a few options.First, models that are resistant to large between-predictor correlations,such as partial least squares, can be used.Also,principal component analysis can be used to reduce the number of dimen-sions in a way that removes correlations(see below).Alternatively,we can identify and remove predictors that contribute the most to the correlations.In linear models,the traditional method for reducing multicollinearity is to identify the of-fending predictors using the variable inflation factor(VIF).For each variable,this statistic measures the increase in the variation of the model parameter estimate in comparison to the optimal situation(i.e.,an orthogonal design).This is an acceptable technique when linear models are used and there are more samples than predictors.In other cases,it may not be as appropriate.As an alternative,we can compute the correlation matrix of the predictors and use an al-gorithm to remove the a subset of the problematic predictors such that all of the pairwiseJournal of Statistical Software5correlations are below a threshold:repeatFind the pair of predictors with the largest absolute correlation;For both predictors,compute the average correlation between each predictor and all of the other variables;Flag the variable with the largest mean correlation for removal;Remove this row and column from the correlation matrix;until no correlations are above a threshold;This algorithm can be used tofind the minimal set of predictors that can be removed so that the pairwise correlations are below a specific threshold.Note that,if two variables have a high correlation,the algorithm determines which one is involved with the most pairwise correlations and is removed.For illustration,predictors that result in absolute pairwise correlations greater than0.90can be removed using the findCorrelation function.This function returns an index of column numbers for removal.R>ncol(trainDescr)[1]1576R>descrCorr<-cor(trainDescr)R>highCorr<-findCorrelation(descrCorr,0.90)R>trainDescr<-trainDescr[,-highCorr]R>testDescr<-testDescr[,-highCorr]R>ncol(trainDescr)[1]650For chemical descriptors,it is not uncommon to have many very large correlations between the predictors.In this case,using a threshold of0.90,we eliminated926descriptors from the data.Once thefinal set of predictors is determined,the values may require transformations be-fore being used in a model.Some models,such as partial least squares,neural networks and support vector machines,need the predictor variables to be centered and/or scaled. 
The preProcess function can be used to determine values for predictor transformations us-ing the training set and can be applied to the test set or future samples.The function has an argument,method,that can have possible values of"center","scale","pca"and "spatialSign".Thefirst two options provide simple location and scale transformations of each predictor(and are the default values of method).The predict method for this class is then used to apply the processing to new samplesR>xTrans<-preProcess(trainDescr)R>trainDescr<-predict(xTrans,trainDescr)R>testDescr<-predict(xTrans,testDescr)The"pca"option computes loadings for principal component analysis that can be applied to any other data set.In order to determine how many components should be retained,the preProcess function has an argument called thresh that is a threshold for the cumulative percentage of variance captured by the principal components.The function will add compo-nents until the cumulative percentage of variance is above the threshold.Note that the data6caret:Building Predictive Models in Rare automatically scaled when method="pca",even if the method value did not indicate that scaling was needed.For PCA transformations,the predict method generates values with column names"PC1","PC2",etc.Specifying method="spatialSign"applies the spatial sign transformation(Serneels et al. 2006)where the predictor values for each sample are projected onto a unit circle using x∗= x/||x||.This transformation may help when there are outliers in the x space of the training set.4.Building and tuning modelsThe train function can be used to select values of model tuning parameters(if any)and/or estimate model performance using resampling.As an example,a radial basis function support vector machine(SVM)can be used to classify the samples in our computational chemistry data.This model has two tuning parameters.Thefirst is the scale functionσin the radial basis functionK(a,b)=exp(−σ||a−b||2)and the other is the cost value C used to control the complexity of the decision boundary. We can create a grid of candidate tuning values to ing resampling methods,such as the bootstrap or cross-validation,a set of modified data sets are created from the training samples.Each data set has a corresponding set of hold-out samples.For each candidate tuning parameter combination,a model isfit to each resampled data set and is used to predict the corresponding held out samples.The resampling performance is estimated by aggregating the results of each hold-out sample set.These performance estimates are used to evaluate which combination(s)of the tuning parameters are appropriate.Once thefinal tuning values are assigned,thefinal model is refit using the entire training set.For the train function,the possible resampling methods are:bootstrapping,k-fold cross-validation,leave-one-out cross-validation,and leave-group-out cross-validation(i.e.,repeated splits without replacement).By default,25iterations of the bootstrap are used as the resam-pling scheme.In this case,the number of iterations was increased to200due to the large number of samples in the training set.For this particular model,it turns out that there is an analytical method for directly estimating a suitable value ofσfrom the training data(Caputo et al.2002).By default,the train function uses the sigest function in the kernlab package(Karatzoglou et al.2004)to initialize this parameter.In doing this,the value of the cost parameter C is the only tuning parameter. 
The train function has the following arguments:x:a matrix or data frame of predictors.Currently,the function only accepts numeric values(i.e.,no factors or character variables).In some cases,the model.matrix function may be needed to generate a data frame or matrix of purely numeric datay:a numeric or factor vector of outcomes.The function determines the type of problem (classification or regression)from the type of the response given in this argument.method:a character string specifying the type of model to be used.See Table1for the possible values.Journal of Statistical Software7 metric:a character string with values of"Accuracy","Kappa","RMSE"or"Rsquared".This value determines the objective function used to select thefinal model.For example, selecting"Kappa"makes the function select the tuning parameters with the largest value of the mean Kappa statistic computed from the held-out samples.trControl:takes a list of control parameters for the function.The type of resampling as well as the number of resampling iterations can be set using this list.The function trainControl can be used to compute default parameters.The default number of resampling iterations is25,which may be too small to obtain accurate performance estimates in some cases.tuneLength:controls the size of the default grid of tuning parameters.For each model, train will select a grid of complexity parameters as candidate values.For the SVM model,the function will tune over C=10−1,1,10.To expand the size of the default list,the tuneLength argument can be used.By selecting tuneLength=5,values of C ranging from0.1to1,000are evaluated.tuneGrid:can be used to define a specific grid of tuning parameters.See the example below....:the three dots can be used to pass additional arguments to the functions listed in Table1.For example,we have already centered and scaled the predictors,so the argument scaled=FALSE can be passed to the ksvm function to avoid duplication of the pre-processing.We can tune and build the SVM model using the code below.R>bootControl<-trainControl(number=200)R>set.seed(2)R>svmFit<-train(trainDescr,trainClass,+method="svmRadial",tuneLength=5,+trControl=bootControl,scaled=FALSE)Model1:sigma=0.0004329517,C=1e-01Model2:sigma=0.0004329517,C=1e+00Model3:sigma=0.0004329517,C=1e+01Model4:sigma=0.0004329517,C=1e+02Model5:sigma=0.0004329517,C=1e+03R>svmFitCall:train.default(x=trainDescr,y=trainClass,method="svmRadial", scaled=FALSE,trControl=bootControl,tuneLength=5)3252samples650predictorssummary of bootstrap(200reps)sample sizes:3252,3252,3252,3252,3252,3252,...8caret:Building Predictive Models in Rboot resampled training results across tuning parameters:sigma C Accuracy Kappa Accuracy SD Kappa SD Selected0.0004330.10.7050.3950.01220.02520.00043310.8060.6060.01090.02180.000433100.8180.6310.01040.0211*0.0004331000.80.5950.01120.02270.00043310000.7820.5580.01110.0223Accuracy was used to select the optimal modelIn this output,each row in the table corresponds to a specific combination of tuning param-eters.The“Accuracy”column is the average accuracy of the200held-out samples and the column labeled as“Accuracy SD”is the standard deviation of the200accuracies.The Kappa statistic is a measure of concordance for categorical data that measures agreement relative to what would be expected by chance.Values of1indicate perfect agreement,while a value of zero would indicate a lack of agreement.Negative Kappa values can also occur, but are less common since it would indicate a negative association between the observed and predicted 
data.Kappa is an excellent performance measure when the classes are highly unbalanced.For example,if the mutagenicity rate in the data had been very small,say5%, most models could achieve high accuracy by predicting all compounds to be nonmutagenic.In this case,the Kappa statistic would result in a value near zero.The Kappa statistic given here is the unweighted version computed by the classAgreement function in the e1071package (Dimitriadou et al.2008).The Kappa columns in the output above are also summarized across the200resampled Kappa estimates.As previously mentioned,the“optimal”model is selected to be the candidate model with the largest accuracy.If more than one tuning parameter is“optimal”then the function will try to choose the combination that corresponds to the least complex model.For these data,σwas estimated to be0.000433and C=10appears to be optimal.Based on these values,the model was refit to the original set of3,252samples and this object is stored in svmFit$finalModel. R>svmFit$finalModelSupport Vector Machine object of class"ksvm"SV type:C-svc(classification)parameter:cost C=10Gaussian Radial Basis kernel function.Hyperparameter:sigma=0.000432951668058316Number of Support Vectors:1616Objective Function Value:-9516.185Training error:0.082411Probability model included.Journal of Statistical Software9 Model method value Package Tuning parameters Recursive partitioning rpart rpart∗maxdepthctree party mincriterion Boosted trees gbm gbm∗interaction.depth,n.trees,shrinkageblackboost mboost maxdepth,mstopada ada maxdepth,iter,nu Other boosted models glmboost mboost mstopgamboost mboost mstoplogitboost caTools nIter Random forests rf randomForest∗mtrycforest party mtry Bagged trees treebag ipred NoneNeural networks nnet nnet decay,sizePartial least squares pls∗pls,caret ncompSupport vector machines svmRadial kernlab sigma,C(RBF kernel)Support vector machines svmPoly kernlab scale,degree,C(polynomial kernel)Gaussian processes gaussprRadial kernlab sigma(RBF kernel)Gaussian processes gaussprPoly kernlab scale,degree(polynomial kernel)Linear least squares lm∗stats NoneMultivariate adaptive earth∗,mars earth degree,npruneregression splinesBagged MARS bagEarth∗caret,earth degree,npruneElastic net enet elasticnet lambda,fraction The lasso lasso elasticnet fractionRelevance vector machines rvmRadial kernlab sigma(RBF kernel)Relevance vector machines rvmPoly kernlab scale,degree(polynomial kernel)Linear discriminant analysis lda MASS NoneStepwise diagonal sddaLDA,SDDA Nonediscriminant analysis sddaQDALogistic/multinomial multinom nnet decayregressionRegularized discriminant rda klaR lambda,gammaanalysisFlexible discriminant fda∗mda,earth degree,npruneanalysis(MARS basis)Table1:Models used in train(∗indicates that a model-specific variable importance method is available,see Section9.).(continued on next page)10caret:Building Predictive Models in RModel method value Package Tuning parameters Bagged FDA bagFDA∗caret,earth degree,npruneLeast squares support vector lssvmRadial kernlab sigmamachines(RBF kernel)k nearest neighbors knn3caret kNearest shrunken centroids pam∗pamr thresholdNaive Bayes nb klaR usekernelGeneralized partial gpls gpls K.provleast squaresLearned vector quantization lvq class kTable1:Models used in train(∗indicates that a model-specific variable importance method is available,see Section9.).In many cases,more control over the grid of tuning parameters is needed.For example, for boosted trees using the gbm function in the gbm package(Ridgeway2007),we can tune over the 
number of trees (i.e., boosting iterations), the complexity of the tree (indexed by interaction.depth) and the learning rate (also known as shrinkage). As an example, a user could specify a grid of values to tune over using a data frame where the rows correspond to tuning parameter combinations and the columns are the names of the tuning variables (preceded by a dot). For our data, we will generate a grid of 50 combinations and use the tuneGrid argument to the train function to use these values.
R> gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2,
+   .n.trees = (1:10) * 25, .shrinkage = .1)
R> set.seed(2)
R> gbmFit <- train(trainDescr, trainClass,
+   method = "gbm", trControl = bootControl, verbose = FALSE,
+   bag.fraction = 0.5, tuneGrid = gbmGrid)
Model 1: interaction.depth = 2, shrinkage = 0.1, n.trees = 250
collapsing over other values of n.trees
Model 2: interaction.depth = 4, shrinkage = 0.1, n.trees = 250
collapsing over other values of n.trees
Model 3: interaction.depth = 6, shrinkage = 0.1, n.trees = 250
collapsing over other values of n.trees
Model 4: interaction.depth = 8, shrinkage = 0.1, n.trees = 250
collapsing over other values of n.trees
Model 5: interaction.depth = 10, shrinkage = 0.1, n.trees = 250
collapsing over other values of n.trees
In this model, we generated 200 bootstrap replications for each of the 50 candidate models, computed performance and selected the model with the largest accuracy. In this case the model automatically selected an interaction depth of 8 and used 250 boosting iterations (although other values may very well be appropriate; see Figure 1). There are a variety of different visualizations for train objects. Figure 1 shows several example plots created using plot.train and resampleHist.
Figure 1: Examples of plot functions for train objects. (a) A plot of the classification accuracy versus the tuning factors (using plot(gbmFit)). (b) Similarly, a plot of the Kappa statistic profiles (plot(gbmFit, metric = "Kappa")). (c) A level plot of the accuracy values (plot(gbmFit, plotType = "level")). (d) Density plots of the 200 bootstrap estimates of accuracy and Kappa for the final model (resampleHist(gbmFit)).
Note that the output states that the procedure was "collapsing over other values of n.trees".
For some models (method values of pls, plsda, earth, rpart, gbm, gamboost, glmboost, blackboost, ctree, pam, enet and lasso), the train function will fit a model that can be used to derive predictions for some sub-models. For example, since boosting models save the model results for each iteration of boosting, train can fit the model with the largest number of iterations and derive the other models where the other tuning parameters are the same but fewer boosting iterations are requested. In the example above, for a model with interaction.depth = 2 and shrinkage = .1, we only need to fit the model with the largest number of iterations (250 in this example). Holding the interaction depth and shrinkage constant, the computational time to get predictions for models with less than 250 iterations is relatively cheap. For the example above, we fit a total of 200 × 5 = 1,000 models instead of 200 × 5 × 10 = 10,000. The train function tries to exploit this idea for as many models as possible.
For recursive partitioning models, an initial model is fit to all of the training data to obtain the possible values of the maximum depth of any node (maxdepth). The tuning grid is created based on these values. If tuneLength is larger than the number of possible maxdepth values determined by the initial model, the grid will be truncated to the maxdepth list. The same is also true for nearest shrunken centroid models, where an initial model is fit to find the range of possible threshold values, and MARS models (see Section 7).
Also, for the glmboost and gamboost functions from the mboost package (Hothorn and Bühlmann 2007), an additional tuning parameter, prune, is used by train. If prune = "yes", the number of trees is reduced based on the AIC statistic. If "no", the number of trees is kept at the value specified by the mstop parameter. See Bühlmann and Hothorn (2007) for more details about AIC pruning.
In general, the functions in the caret package assume that there are no missing values in the data or that these values have been handled via imputation or other means. For the train function, there are some models (such as rpart) that can handle missing values. In these cases, the data passed to the x argument can contain missing values.
5. Prediction of new samples
As previously noted, an object of class train contains an element called finalModel, which is the fitted model with the tuning parameter values selected by resampling. This object can be used in the traditional way to generate predictions for new samples, using that model's predict function. For the most part, the prediction functions in R follow a consistent syntax, but there are exceptions. For example, boosted tree models produced by the gbm function also require the number of trees to be specified. Also, predict.mvr from the pls package (Mevik and Wehrens 2007) will produce predictions for every candidate value of ncomp that was tested. To avoid having to remember these nuances, caret offers several functions to deal with these issues. The function predict.train is an interface to the model's predict method that handles any extra parameter specifications (such as previously mentioned for gbm and PLS models). For example:
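The example that followed this sentence is not included in this extract. As a hedged sketch only, a call to predict.train on the objects built earlier in the paper (svmFit, gbmFit, testDescr and testClass, assumed to still be in the workspace) would look like the following; caret's confusionMatrix can then summarize the results, including the Kappa statistic discussed above:
R> predClasses <- predict(svmFit, newdata = testDescr)   # class predictions from the tuned SVM
R> head(predClasses)
R> predict(gbmFit, newdata = testDescr[1:5, ])           # the same interface works for the boosted trees
R> confusionMatrix(data = predClasses, reference = testClass)   # accuracy and Kappa on the test set
Because predict.train remembers the tuning parameter values chosen during resampling, no model-specific arguments (such as n.trees for gbm) need to be supplied here.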
Essay template on predictive ability (English)
English answer: Predictive Modeling
Predictive modeling is a powerful technique that uses data and statistical algorithms to make predictions about future events. It is widely used in various industries, such as marketing, finance, and healthcare, to improve decision-making and optimize outcomes.
The process of predictive modeling involves collecting relevant data, cleaning and preparing it, selecting appropriate algorithms, training the models, and evaluating their performance. Once a model is trained, it can be used to predict future values or outcomes based on new input data.
There are various types of predictive modeling techniques available, including:
- Regression: predicts continuous values, such as sales revenue or monthly churn rate.
- Classification: predicts categorical values, such as whether an individual will default on a loan or purchase a product.
- Clustering: groups data into similar categories based on their characteristics.
- Time series forecasting: predicts future values of a time series based on historical patterns.
Predictive modeling offers several benefits, including:
- Improved decision-making: enables businesses to make informed decisions based on data-driven insights.
- Increased efficiency: automates predictions and frees up resources for other tasks.
- Personalized experiences: allows for targeted marketing, product recommendations, and customer segmentation.
- Risk assessment: identifies risks and potential vulnerabilities in various areas.
However, it is important to note that predictive modeling is not a perfect science and comes with certain limitations and challenges:
- Data quality: the accuracy of predictions depends heavily on the quality and availability of data.
- Model selection: choosing the appropriate algorithm is crucial for effective predictions.
- Overfitting: models can be trained too closely to the training data, resulting in poor performance on new data.
- Ethical considerations: predictive modeling can raise ethical concerns related to data privacy, bias, and fairness.
To mitigate these challenges, it is essential to ensure data quality, carefully evaluate models, avoid overfitting, and address ethical considerations. Predictive modeling remains a valuable tool when used responsibly and with appropriate caution.
Chinese answer: Predictive modeling.
validation for predictive models - Reply
What is predictive model validation? Predictive model validation is the process of assessing whether a predictive model is effective and reliable. Before a predictive model is put to use, we need to confirm its accuracy and stability to ensure that it can achieve good predictive performance in a real environment. Validation is a key step that helps us judge whether the model is reliable enough to be used for predictions on unseen data.
Steps in validating a predictive model:
1. Data set splitting: First, we need to split the data set into a training set and a test set, typically using 70% of the data for training and the remaining 30% for testing. The purpose of the split is to fit the model parameters on the training set and then use the test set to verify the model's predictive performance.
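A minimal base-R sketch of such a 70/30 split; the toy data frame below is invented purely for illustration (caret's createDataPartition, shown earlier in this document, does the same job while preserving the outcome distribution across the two sets):
set.seed(123)
# toy data standing in for a real modelling data set
mydata <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
mydata$y <- 2 * mydata$x1 - mydata$x2 + rnorm(200)

n <- nrow(mydata)
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- mydata[train_idx, ]   # 70% of the rows, used to fit the model
test_set  <- mydata[-train_idx, ]  # remaining 30%, held back for validation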
2. Feature selection: During feature selection, we need to choose the features that have an important influence on the prediction target. Feature importance is usually assessed with statistical methods (e.g., correlation analysis, analysis of variance) and machine learning algorithms (e.g., decision trees, random forests). Choosing suitable features can improve the model's predictive power and its ability to generalize.
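A rough sketch of the correlation-based screening mentioned in this step; the data are simulated, so the ranking below is only illustrative (for the machine-learning side of the same idea, caret's varImp or a random forest's importance scores play a similar role):
set.seed(1)
train_set <- data.frame(x1 = rnorm(150), x2 = rnorm(150), x3 = rnorm(150))
train_set$y <- 1.5 * train_set$x1 + 0.2 * train_set$x3 + rnorm(150)

# absolute correlation of each candidate feature with the prediction target
cors <- sapply(train_set[, c("x1", "x2", "x3")],
               function(col) abs(cor(col, train_set$y)))
sort(cors, decreasing = TRUE)   # x1 should rank first, x2 last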
3. Model building: Choosing a suitable predictive model is a key step in model validation. Common predictive models include linear regression, logistic regression, decision trees, and support vector machines. During model building, we need to extract informative features and choose appropriate model parameters and algorithms.
4. Model training: Use the training set created earlier to train the model and adjust its parameters so that the model fits the training data better. During training, we can use cross-validation to evaluate the model and to select the best combination of parameters.
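A hedged sketch of this step using the caret package introduced earlier in this document; the simulated data and the choice of a k-nearest-neighbour model are assumptions made only for the example:
library(caret)
set.seed(10)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(200)

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
fit  <- train(y ~ x1 + x2, data = dat,
              method = "knn",      # the model type is an arbitrary choice here
              tuneLength = 5,      # evaluate 5 candidate values of k
              trControl = ctrl)
fit$bestTune                       # parameter value selected by cross-validation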
5. Model evaluation: Use the held-out test set to evaluate the model's predictive performance. Commonly used evaluation metrics include the mean absolute error (MAE), the root mean squared error (RMSE), accuracy, and recall. From the values of these metrics, we can judge the stability and accuracy of the model.
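A short sketch of computing MAE and RMSE; the observed and predicted vectors below are simulated stand-ins for real test-set values (caret's postResample function returns a similar summary):
set.seed(7)
actual    <- rnorm(30, mean = 10)            # stand-in for observed test-set values
predicted <- actual + rnorm(30, sd = 0.8)    # stand-in for the model's predictions

mae  <- mean(abs(predicted - actual))        # mean absolute error
rmse <- sqrt(mean((predicted - actual)^2))   # root mean squared error
c(MAE = mae, RMSE = rmse)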
6. Model tuning: If the model's predictive performance is poor, we can improve it by adjusting model parameters, adding features, or changing the algorithm. Tuning can improve not only the model's accuracy but also its robustness.
7. Model application: Once validation is complete, we can apply the model to make predictions on unseen data. By predicting on unseen data, we can assess the model's generalization ability and practical usefulness.
Mock AI Interview Questions and Answers (English)
1. Question: What is the difference between a neural network and a deep learning model?
Answer: A neural network is a set of algorithms modeled loosely after the human brain that are designed to recognize patterns. A deep learning model is a neural network with multiple layers, allowing it to learn more complex patterns and features from data.
2. Question: Explain the concept of 'overfitting' in machine learning.
Answer: Overfitting occurs when a machine learning model learns the training data too well, including its noise and outliers, resulting in poor generalization to new, unseen data.
3. Question: What is the role of 'bias' in an AI model?
Answer: Bias in an AI model refers to the systematic errors introduced by the model during the learning process. It can be due to the choice of model, the training data, or the algorithm's assumptions, and it can lead to unfair or inaccurate predictions.
4. Question: Describe the importance of data preprocessing in AI.
Answer: Data preprocessing is crucial in AI as it involves cleaning, transforming, and reducing the data to a suitable format for the model to learn effectively. Proper preprocessing can significantly improve the performance of AI models by ensuring that the input data is relevant, accurate, and free from noise.
5. Question: How does reinforcement learning differ from supervised learning?
Answer: Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward signal. It differs from supervised learning, where the model learns from labeled data to predict outcomes based on input features.
6. Question: What is the purpose of a 'convolutional neural network' (CNN)?
Answer: A convolutional neural network (CNN) is a type of deep learning model that is particularly effective for processing data with a grid-like topology, such as images. CNNs use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.
7. Question: Explain the concept of 'feature extraction' in AI.
Answer: Feature extraction in AI is the process of identifying and extracting relevant pieces of information from the raw data. It is a crucial step in many machine learning algorithms, as it helps to reduce the dimensionality of the data and to focus on the most informative aspects that can be used to make predictions or classifications.
8. Question: What is the significance of 'gradient descent' in training AI models?
Answer: Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In the context of AI, it is used to minimize the loss function of a model, thus refining the model's parameters to improve its accuracy.
9. Question: How does 'transfer learning' work in AI?
Answer: Transfer learning is a technique where a pre-trained model is used as the starting point for learning a new task. It leverages the knowledge gained from one problem to improve performance on a different but related problem, reducing the need for large amounts of labeled data and computational resources.
10. Question: What is the role of 'regularization' in preventing overfitting?
Answer: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages overly complex models. It helps to control the model's capacity, forcing it to generalize better to new data by not fitting too closely to the training data.
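Question 8 describes gradient descent only in words. The short R sketch below is not part of the original Q&A; it uses a made-up one-parameter least-squares problem simply to show the idea of repeatedly stepping against the gradient of a loss function:
# loss(b) = mean((y - b * x)^2), minimized by gradient descent
set.seed(42)
x <- rnorm(100)
y <- 3 * x + rnorm(100, sd = 0.5)     # the true slope is 3

b  <- 0       # initial parameter value
lr <- 0.1     # learning rate (step size)
for (i in 1:200) {
  grad <- -2 * mean(x * (y - b * x))  # derivative of the loss with respect to b
  b <- b - lr * grad                  # move against the gradient
}
b                                     # should end up close to 3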
validation for predictive models - Reply
What is validation of a predictive model? Validation of a predictive model is the process of evaluating and verifying a model that has been built. In machine learning and statistics, a predictive model is a tool that uses historical data to predict future outcomes. Predictive models can help us forecast customer purchasing behaviour, stock price fluctuations, climate change, and so on. However, building an accurate and reliable predictive model is not easy, so verifying a model's accuracy and reliability is essential.
Why do predictive models need validation? When building a predictive model, we use historical data to train it and then use the model to predict future data. However, historical data only provides the basis for building the model; it cannot prove that the model will predict accurately in the future. Validating the model helps us determine its predictive precision on future data and thereby assess its accuracy.
Validation methods for predictive models:
1. Hold-out validation: Split the data set into two independent data sets, one for training the model and one for validating it. Typically, 70% of the data is used for training and 30% for validation. This method is simple to apply, but the results can be unstable, because the model tends to fit the training set more closely than the validation set.
2. Cross-validation: Split the data set into k subsets; each time, use k-1 subsets to train the model and the remaining subset to validate it. This yields the mean of the validation results across the k models. Cross-validation reduces the model's dependence on any particular split of the data and improves the stability of the estimate.
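A minimal sketch of the k-fold mechanics just described, written out by hand with k = 5 and a toy linear model; every object in it is invented for illustration:
set.seed(99)
dat <- data.frame(x = rnorm(100))
dat$y <- 3 * dat$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  # assign each row to one of k folds
rmse <- numeric(k)
for (i in 1:k) {
  train_i <- dat[folds != i, ]                     # k-1 folds used for training
  test_i  <- dat[folds == i, ]                     # the remaining fold used for validation
  fit  <- lm(y ~ x, data = train_i)
  pred <- predict(fit, newdata = test_i)
  rmse[i] <- sqrt(mean((test_i$y - pred)^2))
}
mean(rmse)                                         # average of the k validation results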
3. Bootstrap: Draw n samples with replacement from the original data set to form a new training set for model training, and use the samples that were not drawn for model validation. This method can help when the data set is small, but it may introduce more sampling noise.
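A small sketch of the bootstrap idea described above, again on invented data; the rows never drawn into the bootstrap sample (the out-of-bag rows) serve as the validation set:
set.seed(5)
dat <- data.frame(x = rnorm(80))
dat$y <- 1.2 * dat$x + rnorm(80)

boot_idx   <- sample(seq_len(nrow(dat)), replace = TRUE)  # draw n rows with replacement
boot_train <- dat[boot_idx, ]                             # bootstrap training set
oob        <- dat[-unique(boot_idx), ]                    # rows never drawn: validation set

fit  <- lm(y ~ x, data = boot_train)
pred <- predict(fit, newdata = oob)
sqrt(mean((oob$y - pred)^2))                              # out-of-bag RMSE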
4. Time-series validation: For data with a time dimension, time-series validation can be used. The data are split into training and validation sets in chronological order, to mimic how predictions are made in practice.
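A sketch of such a chronological split on a simulated series; with real data the split point simply follows the actual time order, so that no future observations leak into training:
set.seed(3)
n <- 120
ts_dat <- data.frame(t = 1:n, y = cumsum(rnorm(n)))  # simulated series, already in time order

split_point <- 90
train_ts <- ts_dat[ts_dat$t <= split_point, ]  # earlier observations: model fitting
valid_ts <- ts_dat[ts_dat$t >  split_point, ]  # later observations: validation only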
Evaluation metrics for predictive models:
1. Mean squared error (MSE): MSE measures the average of the squared errors between the model's predictions and the actual values.
Pre-training corpora for NLP (English)
Pre-trained Corpora for NLP
Natural language processing (NLP) is a field of computer science and artificial intelligence that focuses on the interaction between computers and human language. To develop effective NLP models, a large amount of textual data is required for training. This data is referred to as a "corpus."
A corpus is a collection of texts or utterances in a particular language or domain. It serves as the input to train NLP models, enabling them to learn language patterns, grammar, semantics, and pragmatics. By analyzing and processing the data in a corpus, NLP models can support tasks such as text generation, machine translation, sentiment analysis, and named entity recognition.
There are various types of corpora available for NLP, including:
1. General corpora: These are large collections of text from a wide range of sources, such as the Internet, books, newspapers, and magazines. Examples include the Corpus of Contemporary American English (COCA) and the British National Corpus (BNC).
2. Domain-specific corpora: These corpora focus on specific domains or topics, such as medicine, law, or finance. They contain text related to that particular domain, enabling NLP models to specialize in processing texts within that field.
3. Annotated corpora: Annotated corpora include additional metadata or annotations to provide further information about the text. Examples include sentiment-annotated corpora for sentiment analysis or named entity recognition corpora for identifying entities in text.
4. Multilingual corpora: These corpora contain text in multiple languages, facilitating the development of multilingual NLP models or cross-lingual applications.
To create a corpus, text data is typically collected from various sources and cleaned and preprocessed to remove noise and standardize the format. The corpus can then be divided into training, validation, and test sets to evaluate the performance of NLP models during training and inference.
In recent years, the availability of large-scale pre-trained corpora has revolutionized the field of NLP. Pre-trained language models, such as GPT-3, BERT, and RoBERTa, are trained on massive amounts of text data from the Internet. These models can be fine-tuned on specific tasks or domains using smaller datasets, significantly improving the performance and efficiency of NLP applications.
In summary, pre-trained corpora are essential for developing and advancing NLP models. They provide the necessary data to train models, enabling them to understand and process human language with greater accuracy and effectiveness. The continued development and availability of high-quality corpora will drive further innovations in NLP and its applications across various domains.
The difference between predictive models and predictive modeling [reading difficulty: easy]
I had always thought there was no great difference between predictive models and predictive modeling, and always assumed that the model produced by predictive modeling is naturally a predictive model. That was until recently, when an article by an actuary at Transamerica Re overturned that view.
The article introduces his point with a small question. The question looks very simple: which of the following is a predictive model?
(A) A table of pure risk loss rates (a loose translation; the original refers to a life insurance mortality table)
(B) An underwriting (u/w) risk classification manual
(C) A fortune cookie containing an imaginative lottery number
(D) All of the above
(E) None of the above
Facing this question, my first answer was (B); although (A) also looked right, as a single-choice question (B) seemed a better answer than (A) — a typical exam-taking mindset (heh).
However, the standard answer given by the Transamerica Re actuary genuinely surprised me: the correct answer to the above question is (E).
In the article he argues that predictive models and predictive modeling are two entirely different things, and he gives definitions of the two terms as follows:
Predictive modeling: refers to statistical methodology or processes for harnessing the non-random correlations between interested outcomes and possible predictors.
Predictive model: is an estimated non-deterministic (or random) relationship between some interested outcomes such as mortality and observable predictors such as applicants' age, sex, etc.
Put simply, he sees predictive modeling as the process that exploits the non-random correlations between variables, whereas a predictive model is the estimated random, non-deterministic relationship between them.
A G U I D E T OC L A I M S-B A S E DI D E N T I T Y A N DA C C E S S C O N T R O LAuthentication and Authorization forServices and the WebDominick BaierVittorio BertocciKeith BrownEugenio PaceMatias WoloskiA Guide to Claims-Based Identity and Access ControlContentsforewordKim Cameron ixStuart Kwanxi prefaceWho This Book Is For xii Why This Book Is Pertinent Now xivA Note About Terminology xiv How This Book Is Structured xvi What You Need to Use the Code xxi Who’s Whoxix acknowledgements xxi1 An Introduction to ClaimsWhat Do Claims Provide? 1 Not Every System Needs Claims 2 Claims Simplify Authentication Logic 3 A Familiar Example 3 What Makes a Good Claim? 5 Understanding Issuers and ADFS 6 User Anonymity 7 Implementing Claims-Based Identity 7 Step 1: Add Logic to Your Applications to Support Claims 7 Step 2: Acquire or Build an Issuer 8 Step 3: Confi gure Your Application to Trust the Issuer 8 Step 4: Confi gure the Issuer to Know About the Application 9 A Summary of Benefi ts 10Moving On 102 Cla i ms-Based Arch i tecturesA Closer Look at Claims-Based Architectures 12 Browser-Based Applications 13 Smart Clients20 Federating Identity Across Realms 22 The Benefi ts of Cross-Realm Identity 23 How Federated Identity Works 24 Home Realm Discovery26 Design Considerations for Claims-Based Applications 28 What Makes a Good Claim?28 How Can You Uniquely Identify One User From Another?29How Can You Get a List of All Possible Users29and All Possible Claims?Where Should Claims Be Issued?30ISBN: 9780735640597This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice. You bear the risk of using it.Some examples depicted herein are provided for illustration only and are fi ctitious. No real association or connection is intended or should be inferred.© 2010 Microsoft. All rights reserved.Microsoft, Active Directory, Active Directory Federation Services, MSDN, Visual Studio, Windows, Windows Identity Foundation, and Windows Live are trademarks of the Microsoft group of companies. 
All other trademarks are property of their respective owners.6 Federated Identity with Multiple PartnersThe Premise81 Goals and Requirements 82 Overview of the Solution83 Using Claims in Fabrikam Shipping 86 Inside the Implementation88 Setup and Physical Deployment 97 Establishing the Trust Relationship 97 User-Confi gurable Claims Transformation Rules99appendix a using fedutil 101appendix b message sequences103 The Browser-Based Scenario 104The Active Client Scenario116appendix c industry standards123 Security Assertion Markup Language (SAML)123 WS-Federation123 WS-Federation: Passive Requestor Profi le 123 WS-Security124 WS-SecureConversation 124 WS-Trust 124 XM LEncryption124appendix d certificates125 Certifi cates for Browser-Based Applications 125 On the Issuer (Browser Scenario) 127 On the Web Application Server 127 Certifi cates for Active Clients 128 On the Issuer (Active Scenario) 128 On the Web Service Host 130 On the Active Client Host132glossary 133index1433 Claims-Based Single Sign-On for the WebThe Premise33Goals and Requirements 35 Overview of the Solution 36 Inside the Implementation 38 a-Expense Before Claims 39 a-Expense with Claims 41 a-Order Before Claims 48 a-Order with Claims49 Signing Out of an Application 50 Setup and Physical Deployment 50 Using a Mock Issuer50 Isolating Active Directory51 Converting to a Production Issuer 52 Enabling Internet Access52 Variation—Moving to Windows Azure 52 More Information564 Federated Identity for Web ApplicationsThe Premise57 Goals and Requirements 58 Overview of the Solution 58 Benefi ts and Limitations 63 Inside the Implementation63 Setup and Physical Deployment63 Using Mock Issuers for Development and Testing 63 Establishing Trust Relationships 64 More Information655 Federated Identity for Web ServicesThe Premise67 Goals and Requirements 68 Overview of the Solution 68 Inside the Implementation 70 Implementing the Web Service 70 Implementing the Active Client72 Implementing the Authorization Strategy 75 Debugging the Application76 Setup and Physical Deployment 77 Confi guring ADFS 2.0 for Web Services77Foreword Claims-based identity means to control the digital experience and touse digital resources based on things that are said by one party aboutanother. A party can be a person, organization, government, Web site,Web service, or even a device. The very simplest example of a claim issomething that a party says about itself.As the authors of this book point out, there is nothing new aboutthe use of claims. As far back as the early days of mainframe comput-ing, the operating system asked users for passwords and then passedeach new application a “claim” about who was using it. But this worldwas a kind of “Garden of Eden” because applications didn’t questionwhat they were told.As systems became interconnected and more complicated, weneeded ways to identify parties across multiple computers. One wayto do this was for the parties that used applications on one computerto authenticate to the applications (and/or operating systems) thatran on the other computers. This mechanism is still widely used—forexample, when logging on to a great number of Web sites.However, this approach becomes unmanageable when you havemany co-operating systems (as is the case, for example in the enter-prise). Therefore, specialized services were invented that would regis-ter and authenticate users, and subsequently provide claims aboutthem to interested applications. 
Some well-known examples areNTLM, Kerberos, Public Key Infrastructure (PKI), and the SecurityAssertion Markup Language (SAML).If systems that use claims have been around for so long, how can“claims-based computing” be new or important? The answer is a vari-ant of the old adage that “All tables have legs, but not all legs havetables.” The claims-based model embraces and subsumes the capabili-ties of all the systems that have existed to date, but it also allowsmany new things to be accomplished. This book gives a great sense ofthe resultant opportunities.xIn the spring of 2008, months before the Windows Identity Founda-tion made its fi rst public appearance, I was on the phone with the chief software architect of a Fortune 500 company when I experi-enced one of those vivid, clarifying moments that come during the course of a software project. We were chatting about how diffi cult it was to manage an environment with hundreds or even thousands of developers, all building different kinds of applications for different audiences. In such an environment, the burden of consistent applica-tion security usually falls on the shoulders of one designated security architect.A big part of that architect’s job is to guide developers on how to handle authentication. Developers have many technologies to choose from. Windows Integrated Authentication, SAML, LDAP, and X.509 are just a few. The security architect is responsible for writing detailed implementation guidance on when and how to use all of them. I imag-ined a document with hundreds of pages of technology overviews, decision fl owcharts, and code appendices that demonstrate the cor-rect use of technology X for scenario Y. “If you are building a Web application, for employees, on the intranet, on Windows, use Win-dows Integrated Authentication and LDAP, send your queries to the enterprise directory....”I could already tell that this document, despite the architect’s best efforts, was destined to sit unread on the corner of every developer’s desk. It was all just too hard; although every developer knows security is important, no one has the time to read all that. Nevertheless, every organization needed an architect to write these guidelines. It was the only meaningful thing they could do to manage this complexity.It was at that moment that I realized the true purpose of the forthcoming Windows Identity Foundation. It was to render the tech-nology decision trivial. Architects would no longer need to create com-plex guidelines for authentication. This was an epiphany of sorts. ForewordFor one thing, identity no longer depends on the use of unique identifi ers. NTLM, Kerberos, and public key certifi cates conveyed, above all else, an identifi cation number or name. This unique number could be used as a directory key to look up other attributes and to track activities. But once we start thinking in terms of claims-based computing, identifi ers were not mandatory. We don’t need to say that a person is associated with the number X, and then look in a database to see if number X is married. We just say the person is married. An identifi er is reduced to one potential claim (a thing said by some party) among many.This opens up the possibility of many more directly usable and substantive claims, such as a family name, a person’s citizenship, the right to do something, or the fact that someone is in a certain age group or is a great customer. One can make this kind of claim without revealing a party’s unique identity. 
This has immense implications for privacy, which becomes an increasingly important concern as digital identity is applied to our personal lives.Further, while the earlier systems were all hermetic worlds, we can now look at them as examples of the same thing and transform a claim made in one world to a claim made in another. We can use “claims transformers” to convert claims from one system to another, to interpret meanings, apply policies, and to provide elasticity. This is what makes claims essential for connecting our organizations and en-terprises into a cloud. Because they are standardized, we can use them across platforms and look at the distributed fabric as a real circuit board on which we can assemble our services and components.Claims offer a single conceptual model, programming interface, and end-user paradigm; whereas before claims, we had a cacophony of disjoint approaches. In my experience, the people who use these new approaches to build products universally agree that they solve many pressing problems that were impossibly difficult before. Yet these people also offer a word of advice. Though embracing what has existed, the claims-based paradigm is fundamentally a new one; the biggest challenge is to understand this and take advantage of it.That’s why this book is so useful. It deals with the fundamental issues, but it is practical and concise. The time spent reading it will be repaid many times over as you become an expert in one of the trans-formative technologies of our time.Kim CameronDistinguished Engineer—Microsoft Identity DivisionxiixiiiAs an application designer or developer, imagine a world where you don’t have to worry about authentication. Imagine instead that all requests to your application already include the information you need to make access control decisions and to personalize the application for the user.In this world, your applications can trust another system compo-nent to securely provide user information, such as the user’s name or e-mail address, a manager’s e-mail address, or even a purchasing autho-rization limit. The user’s information always arrives in the same simple format, regardless of the authentication mechanism, whether it’s Microsoft® Windows® integrated authentication, forms-based authentication in a Web browser, an X.509 client certifi cate, or something more exotic. Even if someone in charge of your company’s security policy changes how users authenticate, you still get the information, and it’s always in the same format.This is the utopia of claims-based identity that A Guide to Claims-Based Identity and Access Control describes. As you’ll see, claims provide an innovative approach for building applications that authen-ticate and authorize users.Who This Book Is ForThis book gives you enough information to evaluate claims-based identity as a possible option when you’re planning a new application or making changes to an existing one. It is intended for any architect, developer, or information technology (IT) professional who designs, builds, or operates Web applications and services that require identity information about their users. Although applications that use claims-based identity exist on many platforms, this book is written for people who work with Windows-based systems. 
A predictive model (ETLM) for arsenate adsorption and surface speciation on oxides consistent with spectroscopic and theoretical molecular evidence

Keisuke Fukushi 1, Dimitri A. Sverjensky *

Department of Earth and Planetary Sciences, Johns Hopkins University, Baltimore, MD 21218, USA

Received 7 September 2006; accepted in revised form 9 May 2007; available online 7 June 2007

Geochimica et Cosmochimica Acta 71 (2007) 3717–3745

0016-7037/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.gca.2007.05.018
* Corresponding author. Fax: +1 410 516 7933. E-mail addresses: fukushi@kenroku.kanazawa-u.ac.jp (K. Fukushi), sver@ (D.A. Sverjensky).
1 Present address: Institute of Nature and Environmental Technology, Kanazawa University, Kakuma-machi, Kanazawa, Ishikawa 920-1192, Japan.

Abstract

The nature of adsorbed arsenate species for a wide range of minerals and environmental conditions is fundamental to prediction of the migration and long-term fate of arsenate in natural environments. Spectroscopic experiments and theoretical calculations have demonstrated the potential importance of a variety of arsenate surface species on several iron and aluminum oxides. However, integration of the results of these studies with surface complexation models and extrapolation over wide ranges of conditions and for many oxides remains a challenge. In the present study, in situ X-ray and infrared spectroscopic and theoretical molecular evidence of arsenate (and the analogous phosphate) surface speciation are integrated with an extended triple layer model (ETLM) of surface complexation, which takes into account the electrostatic work associated with the ions and the water dipoles involved in inner-sphere surface complexation by the ligand exchange mechanism. Three reactions forming inner-sphere arsenate surface species,

2 >SOH + H_3AsO_4^0 = (>SO)_2AsO_2^- + H^+ + 2 H_2O,

2 >SOH + H_3AsO_4^0 = (>SO)_2AsOOH + 2 H_2O, and

>SOH + H_3AsO_4^0 = >SOAsO_3^{2-} + 2 H^+ + H_2O,

were found to be consistent with adsorption envelope, adsorption isotherm, proton surface titration and proton coadsorption data for arsenate on hydrous ferric oxide (HFO), ferrihydrite, four goethites, amorphous aluminum oxide, α-Al2O3, β-Al(OH)3, and α-Al(OH)3 up to surface coverages of about 2.5 μmol m^-2. At higher surface coverages, adsorption is not the predominant mode of arsenate sorption. The four goethites showed a spectrum of model arsenate surface speciation behavior from predominantly binuclear to mononuclear. Goethite in the middle of this spectrum of behavior (selected as a model goethite) showed predicted changes in arsenate surface speciation with changes in pH, ionic strength and surface coverage very closely consistent with qualitative trends inferred in published in situ X-ray and infrared spectroscopic studies of arsenate and phosphate on several additional goethites. The model speciation results for arsenate on HFO and α- and β-Al(OH)3 were also consistent with X-ray and molecular evidence. The equilibrium constants for arsenate adsorption expressed in terms of site-occupancy standard states show systematic differences for different solids, including the model goethite. The differences can be explained with the aid of Born solvation theory, which enables the development of a set of predictive equations for arsenate adsorption equilibrium constants on all oxides. The predictive equations indicate that the Born solvation effect for mononuclear species is much stronger than for binuclear species. This favors the development of the mononuclear species on iron oxides such as HFO with high values of the dielectric constant relative to aluminum oxides such as gibbsite with much lower dielectric constants.
However, on hematite and corundum, with similar dielectric constants, the predicted surface speciations of arsenate are similar: at lower pH values and/or higher surface coverages, binuclear species are predicted to predominate; at higher pH values and/or lower surface coverages, mononuclear species predominate.
© 2007 Elsevier Ltd. All rights reserved.

1. INTRODUCTION

Arsenic has received a great deal of public attention because of its links to certain types of cancers and its high levels in some drinking water supplies (Nordstrom, 2002; Hopenhayn, 2006). Arsenate is the predominant species of arsenic in oxidized aquatic systems (Myneni et al., 1998). The concentration of arsenate in natural waters is strongly influenced by adsorption on oxide surfaces (Fuller and Davis, 1989; Pichler et al., 1999; Fukushi et al., 2003). To predict the migration and long-term fate of arsenate in natural environments, the behavior and the nature of adsorbed arsenate species must be known for a wide variety of minerals and over the full range of environmental conditions.

The surface speciation of arsenate on oxide powders and on single crystal surfaces has been studied through X-ray spectroscopic and infrared investigations as well as theoretical molecular calculations. On powders, a variety of surface species have been inferred from extended X-ray absorption fine structure (EXAFS) studies. The majority of these studies have been concerned with iron oxides and a very limited range of ionic strengths. For arsenate on ferrihydrite and FeOOH polymorphs (goethite, akaganeite and lepidocrocite) at pH 8 and I = 0.1, it was inferred that arsenate adsorbed mainly as an inner-sphere bidentate-binuclear complex (Waychunas et al., 1993). Monodentate-mononuclear complexes were also inferred. The monodentate/bidentate ratio decreased with increasing arsenate surface coverage. In contrast, Manceau (1995) reinterpreted the EXAFS spectra as indicating bidentate-mononuclear species. The possible existence of such a species, in addition to the ones described by Waychunas et al. (1993), was described in Waychunas et al. (2005). All three of the above species were subsequently inferred for arsenate adsorption on goethite from EXAFS results obtained at pH 6–9 and I = 0.1 (Fendorf et al., 1997). The monodentate-mononuclear species was inferred to dominate at higher pH values and lower surface coverages, whereas the bidentate-binuclear species dominated at lower pH and higher surface coverages. The bidentate-mononuclear species was detected in only minor amounts at high surface coverage. For arsenate on goethite and lepidocrocite at pH 6, a bidentate-binuclear species was inferred (Farquhar et al., 2002). The same species was inferred for arsenate on goethite, lepidocrocite, hematite and ferrihydrite at pH 4 and 7 and I = 0.1 through EXAFS combined with molecular calculations (Sherman and Randall, 2003).
For arsenate on hematite, as a function of pH 4.5–8 and surface coverage, Arai et al. (2004) interpreted EXAFS data to infer a predominant bidentate-binuclear and a minor amount of a bidentate-mononuclear species. In the only powder studies of aluminum oxides, EXAFS and XANES of arsenate adsorption on β-Al(OH)3 at pH 4, 8, and 10 and I = 0.1 and 0.8 by Arai et al. (2001) showed that arsenate adsorbs to β-Al(OH)3 as a bidentate-binuclear inner-sphere species (their XANES data did not indicate any outer-sphere species regardless of pH and ionic strength), and an EXAFS study of arsenate on α-Al(OH)3 at pH 5.5 (in dilute Na-arsenate solutions) also showed that arsenate forms an inner-sphere bidentate-binuclear species (Ladeira et al., 2001).

On single crystal surfaces of corundum and hematite, several arsenate surface species have been reported recently. For the (0001) and (10–12) planes of hematite, prepared at pH 5 with 10^-4 M sodium arsenate solutions, then studied with GIXAFS and surface diffraction in a humid atmosphere, 71–78% of the adsorbed arsenate was determined to be a bidentate-mononuclear species, with the remainder present as a bidentate-binuclear species (Waychunas et al., 2005). However, in bulk water at pH 5, resonant anomalous X-ray reflectivity and standing-wave studies of the (01–12) planes of corundum and hematite have established a bidentate-binuclear species, and the first reported occurrence of an outer-sphere or H-bonded arsenate species (Catalano et al., 2006a,b; 2007). In the present study, we wish to emphasize that distinguishing between an outer-sphere and an H-bonded species is not possible in surface complexation models of reaction stoichiometry, although considerations of Born solvation theory may facilitate the distinction.

Overall, the X-ray studies on powders and single crystals provide strong indications of a variety of surface arsenate species, presumably present under different conditions of pH, ionic strength and surface coverage, as well as being present on different solids and crystallographically different surfaces. Even on a single crystal surface, a great variety of possible surface species coordination geometries can be proposed, e.g. monodentate, bidentate and tridentate possibilities (Waychunas et al., 2005). However, the most frequently reported inner-sphere arsenate species coordination geometries in X-ray studies remain the bidentate-binuclear, bidentate-mononuclear, and the monodentate-mononuclear types of species. In this regard, it should also be emphasized that none of the protonation states of these coordination geometries are established by X-ray studies. A variety of possible protonation states adds even further potential complexity to the speciation of adsorbed arsenate. It is here that other spectroscopic techniques (e.g. infrared and Raman) and molecular calculations, as well as the reaction stoichiometries in surface complexation models, can possibly help.

In situ ATR-FTIR studies of arsenate on ferrihydrite and amorphous hydrous ferric hydroxide have inferred that adsorption of arsenate occurred as inner-sphere species (Roddick-Lanzilotta et al., 2002; Voegelin and Hug, 2003). An ex situ FTIR study (Sun and Doner, 1996) has inferred binuclear-bridging complexes of arsenate on goethite. However, the latter noted that their results could be affected by not having bulk water present in their spectroscopic experiment. In addition, IR studies of arsenate adsorption may only show small or no shifts with changes of pH (Goldberg and Johnston, 2001).
As a consequence, the assignment of arsenate surface species from infrared and Raman spectra is sufficiently difficult (Myneni et al., 1998) that a definitive in situ study has yet to be done. Consequently, in the present study we have made use of in situ infrared results for an analogous anion, phosphate.

Phosphate and arsenate have very similar aqueous protonation characteristics and are widely considered to have very similar adsorption characteristics (Waychunas et al., 1993; Hiemstra and van Riemsdijk, 1999; Goldberg and Johnston, 2001; Arai et al., 2004). Furthermore, infrared studies of phosphate adsorption indicate enough shifts of the spectra with experimental conditions to assign the phosphate surface species (Tejedor-Tejedor and Anderson, 1990; Persson et al., 1996; Arai and Sparks, 2001; Loring et al., 2006; Nelson et al., 2006). The phosphate surface species inferred by Tejedor-Tejedor and Anderson (1990) have already been used to develop a CD surface complexation model for arsenate on goethite (Hiemstra and van Riemsdijk, 1999). In the present study, we take a similar approach using the studies by Tejedor-Tejedor and Anderson (1990), Arai and Sparks (2001), and Elzinga and Sparks (2007), which are the definitive published studies of phosphate adsorption referring to in situ conditions. It should be emphasized here that, in detail, there are differences between some of the surface species inferred in the above studies. In the present study, we investigated a surface complexation application of the species suggested by these authors to see if a consistent set of species would apply to a wide range of oxides.

According to Tejedor-Tejedor and Anderson (1990), phosphate on goethite adopts three surface species: protonated bidentate-binuclear, deprotonated bidentate-binuclear and deprotonated monodentate species. The relative abundances of these species were shown to be strong functions of pH and surface coverage at an ionic strength of 0.01. The protonated bidentate-binuclear species predominated at pH values of about 3.6 to 5.5–6.0 (at surface coverages of 190 and 150 μmol/g). The deprotonated bidentate-binuclear species became predominant between pH values of about 6–8.3. The monodentate species was detectable at pH values as low as 6 at surface coverages of 190 μmol/g and increased in relative abundance as the surface coverage decreased towards 50 μmol/g. Consistent results have been obtained for phosphate on ferrihydrite (Arai and Sparks, 2001). The latter assigned protonated bidentate-binuclear species at pH values of 4 to 6 and surface coverages of 0.38–2.69 μmol/m2 and a deprotonated bidentate-binuclear species at pH > 7.5, as well as a possible non-protonated Na-phosphate surface species at high pH values. More recently, in an investigation of phosphate adsorption on hematite, Elzinga and Sparks (2007) suggested a bridging binuclear protonated species, a protonated mononuclear species and a deprotonated mononuclear species. It was also suggested that the "bridging" species may instead be mononuclear with a strong H-bond to an adjacent Fe-octahedron. Broadly speaking, the in situ infrared studies for phosphate adsorption on iron oxides are consistent with the results of the X-ray studies for arsenate discussed above (although the X-ray studies cannot provide information on protonation states). However, the above infrared studies of phosphate provide invaluable additional information on the possible states of protonation of the surface phosphate species, which we use as a first step in developing a surface complexation model for arsenate adsorption below.
The surface structure and protonation state of adsorbed arsenate have been addressed through theoretical density functional theory (DFT) and molecular orbital density functional theory (MO/DFT) calculations (Ladeira et al., 2001; Sherman and Randall, 2003; Kubicki, 2005). DFT calculations for arsenate with aluminum oxide clusters indicated that a bidentate-binuclear species was more stable than bidentate-mononuclear, monodentate-mononuclear or monodentate-binuclear species (Ladeira et al., 2001). From predicted geometries of arsenate adsorbed on iron hydroxide clusters using DFT compared with EXAFS data, it was suggested that arsenate was adsorbed as a doubly protonated species (Sherman and Randall, 2003). Predicted energetics indicated that a bidentate-binuclear species was more stable than a bidentate-mononuclear or monodentate species. In contrast to these DFT calculations, MO/DFT calculations using a different type of iron hydroxide cluster (containing solvated waters) showed that deprotonated arsenate surface species resulted in good agreement with spectroscopic data (Kubicki, 2005). The bidentate-binuclear configuration is most consistent with spectroscopic data, but the model Gibbs free energies of adsorption of clusters suggested that the monodentate configuration is energetically more stable as a cluster. The MO/DFT calculations also indicated a strong preference for arsenate to bind to iron clusters relative to aluminum hydroxide clusters. Detailed molecular simulations of phosphate on iron oxide clusters (Kwon and Kubicki, 2004) emphasize mononuclear species with differing protonation states that agree very well with an ex situ infrared study of phosphate on goethite (Persson et al., 1996).

Numerous studies have focused on surface complexation modeling of arsenate on oxides (Goldberg, 1986; Dzombak and Morel, 1990; Manning and Goldberg, 1996; Grossl et al., 1997; Manning and Goldberg, 1997; Hiemstra and van Riemsdijk, 1999; Gao and Mucci, 2001; Goldberg and Johnston, 2001; Halter and Pfeifer, 2001; Dixit and Hering, 2003; Arai et al., 2004; Hering and Dixit, 2005).
Surface complexation modeling has the capability to predict the behavior and the nature of adsorbed arsenate species as functions of environmental parameters. However, the complexation reactions have not always been consistent with in situ spectroscopic results. Instead, with the exception of the study by Hiemstra and van Riemsdijk (1999), regression of macroscopic adsorption data has often been carried out merely to fit the data with the minimum number of surface species (Hering and Dixit, 2005). It is now widely recognized that the evidence of oxyanion speciation from spectroscopic studies should be integrated with models describing macroscopic adsorption and electrokinetic data (Suarez et al., 1997; Hiemstra and van Riemsdijk, 1999; Blesa et al., 2000; Goldberg and Johnston, 2001).

In the present study, we use the results of the in situ X-ray and infrared studies of arsenate and phosphate discussed above to guide the choice of surface species in the extended triple-layer model (ETLM) recently developed to account for the electrostatic effects of water-dipole desorption during inner-sphere surface complexation (Sverjensky and Fukushi, 2006a,b). We investigate the applicability of three spectroscopically identified species by fitting adsorption, proton surface titration in the presence of arsenate, and proton coadsorption with arsenate on a wide range of oxides including goethite, hydrous ferric oxide (HFO), ferrihydrite, hematite, amorphous aluminum oxide, β-Al(OH)3, α-Al(OH)3 and α-Al2O3 (Manning and Goldberg, 1996; Jain et al., 1999; Jain and Loeppert, 2000; Arai et al., 2001, 2004; Gao and Mucci, 2001; Goldberg and Johnston, 2001; Halter and Pfeifer, 2001; Dixit and Hering, 2003). These data were selected to cover wide ranges of pH, ionic strength and surface coverage, and a wider range of oxide types than have been investigated spectroscopically.
Other arsenate data examined in the present study (Rietra et al., 1999; Arai et al., 2004; Dutta et al., 2004; Pena et al., 2005, 2006; Zhang and Stanforth, 2005) did not involve a wide enough range of conditions to permit retrieval of equilibrium constants for three surface species. Electrokinetic data involving arsenate adsorption were also examined in the present study (e.g. Anderson and Malotky, 1979; Arai et al., 2001; Goldberg and Johnston, 2001; Pena et al., 2006). However, these data refer to a much more limited range of conditions.

The main purpose of the present study is to determine the effects of pH, ionic strength, surface coverage and type of solid on the surface speciation of arsenate identified spectroscopically. The results are then compared with independent trends established in X-ray and infrared spectroscopic studies for arsenate and phosphate. A second goal of the present study is to provide a predictive basis for arsenate surface speciation on all oxides. By adopting an internally consistent set of standard states, systematic differences in the equilibrium constants for the surface arsenate species from one oxide to another can be established and explained with the aid of Born solvation theory. This makes it possible to predict arsenate adsorption and surface speciation on all oxides in 1:1 electrolyte systems.

2. ETLM TREATMENT OF ARSENATE ADSORPTION

2.1. Aqueous speciation, surface protonation and electrolyte adsorption

Aqueous speciation calculations were carried out taking into account aqueous ionic activity coefficients appropriate to single electrolytes up to high ionic strengths, calculated with the extended Debye-Hückel equation (Helgeson et al., 1981; Criscenti and Sverjensky, 1999). Electrolyte ion pairs used were consistent with previous studies (Criscenti and Sverjensky, 1999, 2002). Aqueous arsenate protonation equilibria were taken from a recent study (Nordstrom and Archer, 2003):

H^+ + H_2AsO_4^- = H_3AsO_4^0,   log K = 2.30   (1)

H^+ + HAsO_4^{2-} = H_2AsO_4^-,   log K = 6.99   (2)

H^+ + AsO_4^{3-} = HAsO_4^{2-},   log K = 11.8   (3)

Aqueous arsenate complexation with the electrolyte cations used in experimental studies can be expected to occur, based on the analogous phosphate systems. However, the lack of experimental characterization of arsenate-electrolyte cation complexing prevented including this in the present study. Preliminary calculations carried out using Na-complexes of phosphate as a guide to the strength of Na-arsenate complexes indicated that the results of the present study were not significantly affected at ionic strengths of 0.1 and below.
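To make the role of the aqueous protonation equilibria concrete, the minimal sketch below computes the distribution of the aqueous arsenate species as a function of pH from the log K values quoted in Eqs. (1)-(3). It is an illustration only, not the GEOSURF code used in the paper: activity coefficients (the extended Debye-Hückel corrections mentioned above) are deliberately omitted, and the function and variable names are my own rather than anything defined in the text.

```python
# Minimal sketch (not the GEOSURF code used in the paper): aqueous arsenate
# speciation vs. pH from the stepwise protonation constants of Nordstrom and
# Archer (2003) quoted in Eqs. (1)-(3). Activity coefficients are neglected
# (infinite-dilution limit); names below are illustrative, not from the paper.

LOG_K1 = 2.30   # H+ + H2AsO4-  = H3AsO4(aq)   (Eq. 1)
LOG_K2 = 6.99   # H+ + HAsO4-2  = H2AsO4-      (Eq. 2)
LOG_K3 = 11.8   # H+ + AsO4-3   = HAsO4-2      (Eq. 3)

def arsenate_fractions(pH):
    """Mole fractions of H3AsO4(aq), H2AsO4-, HAsO4-2 and AsO4-3 at a given pH."""
    h = 10.0 ** (-pH)
    # Cumulative formation constants relative to AsO4-3.
    beta1 = 10.0 ** LOG_K3                       # AsO4-3 + H+  -> HAsO4-2
    beta2 = 10.0 ** (LOG_K3 + LOG_K2)            # AsO4-3 + 2H+ -> H2AsO4-
    beta3 = 10.0 ** (LOG_K3 + LOG_K2 + LOG_K1)   # AsO4-3 + 3H+ -> H3AsO4(aq)
    denom = 1.0 + beta1 * h + beta2 * h**2 + beta3 * h**3
    return {
        "H3AsO4(aq)": beta3 * h**3 / denom,
        "H2AsO4-":    beta2 * h**2 / denom,
        "HAsO4-2":    beta1 * h / denom,
        "AsO4-3":     1.0 / denom,
    }

for pH in (4.0, 7.0, 10.0):
    print(pH, {name: round(x, 3) for name, x in arsenate_fractions(pH).items()})
```

At pH 7, for example, the sketch returns roughly equal proportions of H2AsO4^- and HAsO4^2-, as expected from the log K of Eq. (2).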
The sample characteristics and surface protonation and electrolyte adsorption equilibrium constants used in the present study are summarized in Table 1. Although HFO and ferrihydrite might be thought to have extremely similar surface chemical properties, we distinguished between them in our previous study of As(III) adsorption (Sverjensky and Fukushi, 2006b). By using differences in the pH_ZPC (i.e. 7.9 vs. 8.5, respectively, Table 1), we inferred different effective dielectric constants for these solids, which facilitated the use of Born solvation theory to explain the different magnitudes of the log K values for As(III) adsorption. It will be shown below that the same approach works for differences in arsenate adsorption, at least for these specific HFO and ferrihydrite samples. Surface protonation constants referring to the site-occupancy standard states (denoted by the superscript "θ"), i.e. log K_1^θ and log K_2^θ, were calculated from values of pH_ZPC and ΔpK_n^θ (Sverjensky, 2005) using

\log K_1^\theta = pH_{ZPC} - \Delta pK_n^\theta / 2   (4)

and

\log K_2^\theta = pH_{ZPC} + \Delta pK_n^\theta / 2   (5)

Values of pH_ZPC were taken from measured low ionic strength isoelectric points or point-of-zero-salt effects corrected for electrolyte adsorption (Sverjensky, 2005). For goethite and HFO examined by Dixit and Hering (2003), neither isoelectric points nor surface titration data were reported. The value of pH_ZPC for the goethite was predicted for the present study using Born solvation and crystal chemical theory (Sverjensky, 2005). For HFO, the pH_ZPC was assumed to be the same as in Davis and Leckie (1978), consistent with our previous analysis for arsenite adsorption on the same sample of HFO (Sverjensky and Fukushi, 2006). Values of ΔpK_n^θ were predicted theoretically (Sverjensky, 2005).

For convenience, protonation constants referring to the hypothetical 1.0 molar standard state (denoted by the superscript "0") are also given in Table 1. The relationship between the two standard states is given by (Sverjensky, 2003)

\log K_1^0 = \log K_1^\theta - \log\frac{N_S A_S}{N^\ddagger A^\ddagger}   (6)

\log K_2^0 = \log K_2^\theta + \log\frac{N_S A_S}{N^\ddagger A^\ddagger}   (7)

where N_S represents the surface site density on the s-th solid sorbent (sites m^-2); N^‡ represents the standard state sorbate species site density (sites m^-2); A_S represents the BET surface area of the s-th solid sorbent (m^2 g^-1); and A^‡ represents a standard state BET surface area (m^2 g^-1). In the present study, values of N^‡ = 10 × 10^18 sites m^-2 and A^‡ = 10 m^2 g^-1 are selected for all solids.
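The standard-state conversion in Eqs. (6) and (7) is a simple logarithmic shift, which the short sketch below makes explicit. The reference values are the ones quoted in the text (N^‡ = 10 × 10^18 sites m^-2, A^‡ = 10 m^2 g^-1), but the sample site density, surface area and protonation constants passed to the function are illustrative assumptions of mine, not entries from Table 1.

```python
# Minimal sketch of the standard-state conversion in Eqs. (6) and (7): moving
# surface protonation constants from the site-occupancy standard state (log K^theta)
# to the hypothetical 1.0 molar standard state (log K^0). Example inputs are
# illustrative assumptions, not values from Table 1 of the paper.

import math

N_STD = 10.0e18   # standard-state sorbate site density, sites m^-2 (quoted in text)
A_STD = 10.0      # standard-state BET surface area, m^2 g^-1 (quoted in text)

def hypothetical_1molar_constants(log_k1_theta, log_k2_theta, n_s, a_s):
    """Apply Eqs. (6) and (7):
       log K1^0 = log K1^theta - log(N_S A_S / (N^std A^std))
       log K2^0 = log K2^theta + log(N_S A_S / (N^std A^std))."""
    shift = math.log10((n_s * a_s) / (N_STD * A_STD))
    return log_k1_theta - shift, log_k2_theta + shift

# Illustrative usage with assumed sample properties (3e18 sites m^-2, 70 m^2 g^-1)
# and assumed site-occupancy protonation constants:
log_k1_0, log_k2_0 = hypothetical_1molar_constants(6.0, 11.0, 3.0e18, 70.0)
print(round(log_k1_0, 2), round(log_k2_0, 2))
```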
Electrolyte adsorption equilibrium constants referring to site-occupancy standard states, log K_M+^θ and log K_L-^θ, and capacitances, C_1, were obtained from regression of proton surface charge data when such data were available. Otherwise, these parameters were obtained from theoretical predictions (Sverjensky, 2005). For convenience, Table 1 also contains values of the electrolyte adsorption equilibrium constants relative to the >SOH species (denoted by the superscript "*") and with respect to the hypothetical 1.0 molar standard state, i.e. log *K_M+^0 and log *K_L-^0. The relationships of log *K_M+^0 and log *K_L-^0 to log K_M+^θ and log K_L-^θ are given by (Sverjensky, 2003, 2005)

\log {}^{*}K_{M^+}^0 = \log K_{M^+}^\theta - pH_{ZPC} - \Delta pK_n^\theta/2 - \log\frac{N_S A_S}{N^\ddagger A^\ddagger}   (8)

\log {}^{*}K_{L^-}^0 = \log K_{L^-}^\theta + pH_{ZPC} - \Delta pK_n^\theta/2 - \log\frac{N_S A_S}{N^\ddagger A^\ddagger}   (9)

Surface areas (A_S) were taken from experimental measurements (BET) for goethite, α-Al(OH)3, β-Al(OH)3 and α-Al2O3. However, for samples of HFO, ferrihydrite and amorphous Al oxide, it was assumed that the surface areas are 600 m^2 g^-1, consistent with recommendations made previously (Davis and Leckie, 1978; Dzombak and Morel, 1990) and our previous analyses of arsenite and sulfate adsorption on these oxides.

Site densities (N_S) together with arsenate adsorption equilibrium constants were derived from regression of adsorption data where these data referred to a wide enough range of surface coverages. This procedure was adopted because oxyanions probably form inner-sphere complexes on only a subset of the total sites available, most likely the singly coordinated oxygens at the surface (Catalano et al., 2006a,b,c,d; 2007; Hiemstra and van Riemsdijk, 1999). Theoretical estimation of these site densities is impossible for powdered samples when the proportions of the crystal faces are unknown, as is the case for all the experimental studies analyzed below. Furthermore, for goethites, surface chemical characteristics vary widely, even for those synthesized in the absence of CO2 (Sverjensky, 2005). Together with site densities for goethites already published (Sverjensky and Fukushi, 2006a,b), the goethite site densities obtained by regression in the present study, with one exception, support an empirical correlation between site density and surface area of goethite (Fukushi and Sverjensky, 2007). This correlation should enable estimation of site densities for those goethites for which oxyanion adsorption data over a wide range of surface coverage are not available. In the cases of arsenate adsorption on HFO, ferrihydrite, goethite from Dixit and Hering (2003), and β-Al(OH)3 and amorphous Al oxide, the site densities were taken from our previous regression of arsenite adsorption data (Sverjensky and Fukushi, 2006b). We used a single site density for each sample applied to all the surface equilibria, as described previously (Sverjensky and Fukushi, 2006b). All the calculations were carried out with the aid of the computer code GEOSURF (Sahai and Sverjensky, 1998).

2.2. Arsenate adsorption reaction stoichiometries

We constructed our ETLM for arsenate adsorption on oxides by choosing surface species based on the in situ X-ray and infrared spectroscopic studies of phosphate and arsenate and the molecular calculations for arsenate interaction with metal oxide clusters. We used three reaction stoichiometries expressed as the formation of inner-sphere surface species, as illustrated in Fig. 1. Additional species could be incorporated into the model, but even the data analyzed in the present study, which refers to a wide range of experimental conditions, did not permit retrieval of equilibrium constants for more than three surface species. We tested numerous reaction stoichiometries, but found that the best three reaction stoichiometries involved surface species identical to the ones proposed in the in situ ATR-FTIR study of phosphate adsorption on goethite (Tejedor-Tejedor and Anderson, 1990) discussed above. For the most part, these species are also consistent with the X-ray studies cited above. The three reactions produce a deprotonated bidentate-binuclear species according to

2 >SOH + H_3AsO_4^0 = (>SO)_2AsO_2^- + H^+ + 2 H_2O   (10)

{}^{*}K^\theta_{(>SO)_2AsO_2^-} = \frac{a_{(>SO)_2AsO_2^-}\; a_{H^+}\; a_{H_2O}^2}{a_{>SOH}^2\; a_{H_3AsO_4^0}} \times 10^{F\Delta\Psi_{r,10}/2.303RT}   (11)

a protonated bidentate-binuclear species according to

2 >SOH + H_3AsO_4^0 = (>SO)_2AsOOH + 2 H_2O   (12)

{}^{*}K^\theta_{(>SO)_2AsOOH} = \frac{a_{(>SO)_2AsOOH}\; a_{H_2O}^2}{a_{>SOH}^2\; a_{H_3AsO_4^0}} \times 10^{F\Delta\Psi_{r,12}/2.303RT}   (13)

and a deprotonated monodentate species according to

>SOH + H_3AsO_4^0 = >SOAsO_3^{2-} + 2 H^+ + H_2O   (14)

{}^{*}K^\theta_{>SOAsO_3^{2-}} = \frac{a_{>SOAsO_3^{2-}}\; a_{H^+}^2\; a_{H_2O}}{a_{>SOH}\; a_{H_3AsO_4^0}} \times 10^{F\Delta\Psi_{r,14}/2.303RT}   (15)

Here, the superscript "θ" again represents site-occupancy standard states (Sverjensky, 2003, 2005). The exponential terms contain the electrostatic factor, ΔΨ_r, which is a reaction property arising from the work done in an electric field when species in the reaction move on or off the charged surface. It is evaluated taking into account the adsorbing ions and the water dipoles released in Eqs. (10), (12) and (14). Electrostatic work is done in both instances and is reflected in the formulation of ΔΨ_r (Sverjensky and Fukushi, 2006a,b).
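To illustrate how a mass-action expression of this form is used, the sketch below rearranges Eq. (11) for the activity of the deprotonated bidentate-binuclear species once a value of the electrostatic term ΔΨ_r (in volts) is available. In the ETLM, ΔΨ_r is assembled from the triple-layer potentials and the water dipoles released in reaction (10); that machinery is not reproduced here, so the delta_psi_r value, the activities and the log *K^θ value below are purely illustrative assumptions, and the function names are mine.

```python
# Minimal sketch around Eq. (11) of the paper. All numerical inputs are
# illustrative assumptions, not fitted ETLM values; the ETLM evaluation of
# DeltaPsi_r from the triple-layer potentials is not reproduced here.

import math

F = 96485.332   # Faraday constant, C mol^-1
R = 8.314462    # gas constant, J mol^-1 K^-1

def log_coulombic_term(delta_psi_r, temperature=298.15):
    """log10 of the factor 10^(F * DeltaPsi_r / (2.303 R T)) appearing in Eq. (11)."""
    return F * delta_psi_r / (2.303 * R * temperature)

def log_activity_bidentate_binuclear(log_k_theta, a_soh, a_h, a_h2o, a_h3aso4, delta_psi_r):
    """Rearrange Eq. (11) for the surface species activity:
       a_(>SO)2AsO2- = K * a_>SOH^2 * a_H3AsO4 / (a_H+ * a_H2O^2 * 10^(F DeltaPsi_r / 2.303RT))."""
    return (log_k_theta
            + 2.0 * math.log10(a_soh)
            + math.log10(a_h3aso4)
            - math.log10(a_h)
            - 2.0 * math.log10(a_h2o)
            - log_coulombic_term(delta_psi_r))

# Illustrative numbers only (not values from the paper):
print(round(log_activity_bidentate_binuclear(
    log_k_theta=4.0, a_soh=1e-4, a_h=1e-6, a_h2o=1.0,
    a_h3aso4=1e-8, delta_psi_r=-0.05), 2))
```

The sign convention follows the expression as written in Eq. (11): a more negative ΔΨ_r makes the coulombic factor smaller and therefore raises the computed surface-species activity for a fixed set of aqueous activities.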