Ensemble Approaches for Regression: a Survey

João M. Moreira a,∗, Carlos Soares b,c, Alípio M. Jorge b,c and Jorge Freire de Sousa a

a Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, PORTUGAL
b Faculty of Economics, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-464 Porto, PORTUGAL
c LIAAD, INESC Porto L.A., R. de Ceuta, 118, 6, 4050-190 Porto, PORTUGAL

Abstract

This paper discusses approaches from different research areas to ensemble regression. The goal of ensemble regression is to combine several models in order to improve the prediction accuracy on learning problems with a numerical target variable. The process of ensemble learning for regression can be divided into three phases: the generation phase, in which a set of candidate models is induced; the pruning phase, in which a subset of those models is selected; and the integration phase, in which the output of the models is combined to generate a prediction. We discuss different approaches to each of these phases, categorizing them in terms of relevant characteristics and relating contributions from different fields. Given that previous surveys have focused on classification, we expect that this work will provide a useful overview of existing work on ensemble regression and enable the identification of interesting lines for further research.

Key words: ensembles, regression, supervised learning

∗ Corresponding author. Tel.: +351 22 508 1639; fax: +351 22 508 1538. Email address: jmoreira@fe.up.pt (João M. Moreira).

Preprint submitted to Elsevier, 19 December 2007

1 Introduction

Ensemble learning typically refers to methods that generate several models which are combined to make a prediction, either in classification or regression problems. This approach has been the object of a significant amount of research in recent years and good results have been reported (e.g., [1-3]). The advantage of ensembles with respect to single models has been reported in terms of increased robustness and accuracy [4].

Most work on ensemble learning focuses on classification problems. However, techniques that are successful for classification are often not directly applicable to regression, so, although the two problems are related, ensemble learning approaches for each have been developed somewhat independently. As a consequence, existing surveys on ensemble methods for classification [5,6] are not suitable to provide an overview of existing approaches for regression. This paper surveys existing approaches to ensemble learning for regression.
The relevance of this paper is strengthened by the fact that ensemble learning is an object of research in different communities, including pattern recognition, machine learning, statistics and neural networks. These communities have different conferences and journals and often use different terminology and notation, which makes it quite hard for a researcher to be aware of all the contributions that are relevant to his/her own work. Therefore, besides attempting to provide a thorough account of the work in the area, we also organize those approaches independently of the research area in which they were originally proposed. Hopefully, this organization will enable the identification of opportunities for further research and facilitate the classification of new approaches.

In the next section, we provide a general discussion of the process of ensemble learning. This discussion lays out the basis according to which the remaining sections of the paper are presented: ensemble generation (Sect. 3), ensemble pruning (Sect. 4) and ensemble integration (Sect. 5). Sect. 6 concludes the paper with a summary.

2 Ensemble Learning for Regression

In this section we give a more precise definition of ensemble learning and introduce the necessary terminology. Additionally, we present a general description of the process of ensemble learning and describe a taxonomy of different approaches, both of which define the structure of the rest of the paper. Next we discuss the experimental setup for ensemble learning. Finally, we analyze the error decomposition of ensemble learning methods for regression.

2.1 Definition

First of all we need to define clearly what ensemble learning is, and to define a taxonomy of methods. As far as we know, there is no widely accepted definition of ensemble learning. Some of the existing definitions are partial in the sense that they focus just on the classification problem or on part of the ensemble learning process [7]. For these reasons we propose the following definition:

Definition 1 Ensemble learning is a process that uses a set of models, each of them obtained by applying a learning process to a given problem. This set of models (ensemble) is integrated in some way to obtain the final prediction.

This definition has important characteristics. In the first place, contrary to the informal definition given at the beginning of the paper, this one covers not only ensembles in supervised learning (both classification and regression problems), but also in unsupervised learning, namely the emerging research area of ensembles of clusters [8]. Additionally, it clearly separates ensemble and divide-and-conquer approaches.
This last family of approaches splits the input space into several sub-regions and trains a separate model in each of the sub-regions. With this approach, the initial problem is converted into the resolution of several simpler sub-problems. Finally, the definition does not separate the combination and selection approaches, as is usually done. According to this definition, selection is a special case of combination where the weights are all zero except for one of them (to be discussed in Sect. 5).

More formally, an ensemble $F$ is composed of a set of predictors $\hat{f}_i$ of a function $f$:

$F = \{\hat{f}_i,\ i = 1, \ldots, k\}.$   (1)

The resulting ensemble predictor is denoted as $\hat{f}_f$.

2.1.1 The Ensemble Learning Process

The ensemble process can be divided into three steps [9] (Fig. 1), usually referred to as the overproduce-and-choose approach. The first step is ensemble generation, which consists of generating a set of models. It often happens that, during this first step, a number of redundant models are generated. In the ensemble pruning step, the ensemble is pruned by eliminating some of the models generated earlier. Finally, in the ensemble integration step, a strategy to combine the base models is defined. This strategy is then used to obtain the prediction of the ensemble for new cases, based on the predictions of the base models.

Our characterization of the ensemble learning process is slightly more detailed than the one presented by Rooney et al. [10]. For those authors, ensemble learning consists of the solution of two problems: (1) how to generate the ensemble of models (ensemble generation); and (2) how to integrate the predictions of the models from the ensemble in order to obtain the final ensemble prediction (ensemble integration). This last approach (without the pruning step) is named direct, and can be seen as a particular case of the model presented in Fig. 1, named overproduce-and-choose. Ensemble pruning has been reported, at least in some cases, to reduce the size of the ensembles obtained without degrading the accuracy. Pruning has also been added to direct methods, successfully increasing the accuracy [11,12], a subject to be discussed further in Sect. 4.

2.1.2 Taxonomy and Terminology

Concerning the categorization of the different approaches to ensemble learning, we will mainly follow the taxonomy presented by the same authors [10]. They divide ensemble generation approaches into homogeneous, if all the models are generated using the same induction algorithm, and heterogeneous, otherwise.

Ensemble integration methods are classified by some authors [10,13] as combination (also called fusion) or as selection. The former approach combines the predictions of the models from the ensemble in order to obtain the final ensemble prediction. The latter approach selects from the ensemble the most promising model(s), and the prediction of the ensemble is based on the selected model(s) only.
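The contrast between combination and selection can be made concrete in a few lines of code. The sketch below (R, with made-up prediction values; the survey itself defines no code) shows selection as the special case of combination whose weight vector is all zeros except for a single one, as stated in Sect. 2.1.

```r
preds <- c(0.8, 1.3, 0.9)            # \hat{f}_i(x) for k = 3 base models
combine <- function(preds, w) sum(w * preds)

combine(preds, rep(1/3, 3))          # combination: simple average, 1.0
combine(preds, c(0, 1, 0))           # selection: only the second model, 1.3
```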
Here, we use instead the classification into constant and non-constant weighting functions given by Merz [14]. In the first case, the predictions of the base models are always combined in the same way. In the second case, the way the predictions are combined can be different for different input values.

As mentioned earlier, research on ensemble learning is carried out in different communities. Therefore, different terms are sometimes used for the same concept. In Table 1 we list several groups of synonyms, extended from a previous list by Kuncheva [5]. The first column contains the terms used most frequently in this paper.

Table 1. Synonyms
ensemble: committee, multiple models, multiple classifiers (regressors)
predictor: model, regressor (classifier), learner, hypothesis, expert
example: instance, case, data point, object
combination: fusion, competitive classifiers (regressors), ensemble approach, multiple topology
selection: cooperative classifiers (regressors), modular approach, hybrid topology

2.2 Experimental setup

The experimental setups used in ensemble learning methods vary widely across communities and authors. Our aim is to propose a general framework rather than to survey the different experimental setups described in the literature. The most common approach is to split the data into three parts: (1) the training set, used to obtain the base predictors; (2) the validation set, used for assessment of the generalization error of the base predictors; and (3) the test set, used for assessment of the generalization error of the final ensemble method. If a pruning algorithm is used, it is tested together with the integration method on the test set. Hastie et al. [15] propose to use 50% of the data for training, 25% for validation and the remaining 25% as the test set. This strategy works for large data sets, say, data sets with more than one thousand examples. For large data sets we propose the use of this approach mixed with cross-validation. To do this, and for this particular partition (50%, 25% and 25%), the data set is randomly divided into four equal parts, two of them being used as the training set, another one as the validation set and the last one as the test set. This process is repeated using all the combinations of training, validation and test sets among the four parts. With this partition there are twelve combinations. For smaller data sets, the percentage of data used for training must be higher. It can be, for example, 80%, 10% and 10%, in which case the number of combinations is ninety. The main advantage is to train the base predictors with more examples (which can be critical for small data sets), but it has the disadvantage of increasing the computational cost. The process can be repeated several times in order to obtain different sample values for the evaluation criterion, namely the mse (Eq. 3).

2.3 Regression

In this paper we assume a typical regression problem. The data consist of a set of n examples of the form $\{(x_1, f(x_1)), \ldots, (x_n, f(x_n))\}$. The goal is to induce a function $\hat{f}$ from the data, where

$\hat{f}: X \to \Re, \quad \hat{f}(x) = f(x),\ \forall x \in X,$   (2)

and where $f$ represents the unknown true function. The algorithm used to obtain the $\hat{f}$ function is called the induction algorithm or learner. The $\hat{f}$ function is called the model or predictor. The usual goal for regression is to minimize a squared error loss function, namely the mean squared error (mse):

$mse = \frac{1}{n} \sum_{i=1}^{n} (\hat{f}(x_i) - f(x_i))^2.$   (3)
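Putting Sects. 2.2 and 2.3 together, the sketch below (R) enumerates the twelve training/validation/test combinations of the (50%, 25%, 25%) scheme and records the mse of Eq. 3 on each test fold. The toy data and the lm base learner are illustrative assumptions, not part of the survey.

```r
set.seed(1)
n <- 400
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.2)

mse <- function(pred, obs) mean((pred - obs)^2)

folds <- split(sample(seq_len(n)), rep(1:4, length.out = n))  # four equal parts

results <- c()
for (val in 1:4) {                      # choose the validation part ...
  for (test in setdiff(1:4, val)) {     # ... then the test part: 4 x 3 = 12
    train <- unlist(folds[setdiff(1:4, c(val, test))])        # remaining 50%
    fit <- lm(y ~ poly(x, 5), data = d[train, ])
    ## the validation fold d[folds[[val]], ] would be used to assess the base
    ## predictors; here we only record the test-fold evaluation criterion
    results <- c(results,
                 mse(predict(fit, d[folds[[test]], ]), d$y[folds[[test]]]))
  }
}
length(results)   # 12 combinations, as stated above
mean(results)     # sample values of the mse criterion
```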
2.4 Understanding the generalization error of ensembles

To accomplish the task of ensemble generation, it is necessary to know the characteristics that the ensemble should have. Empirically, several authors state that a good ensemble is one whose predictors are accurate and make their errors in different parts of the input space. For the regression problem it is possible to decompose the generalization error into different components, which can guide the process of optimizing the ensemble generation.

Here, the functions are represented, when appropriate, without the input variables, just for the sake of simplicity: for example, instead of $f(x)$ we write $f$. We closely follow Brown [16]. Understanding the ensemble generalization error enables us to know which characteristics the ensemble members should have in order to reduce the overall generalization error. The generalization error decomposition for regression is straightforward. What follows concerns the decomposition of the mse (Eq. 3). Although the majority of these works were presented in the context of neural network ensembles, the results in this section do not depend on the induction algorithm used.

Geman et al. present the bias/variance decomposition for a single neural network [17]:

$E\{[\hat{f} - E(f)]^2\} = [E(\hat{f}) - E(f)]^2 + E\{[\hat{f} - E(\hat{f})]^2\}.$   (4)

The first term on the right-hand side is called the bias and represents the distance between the expected value of the estimator $\hat{f}$ and the unknown population average. The second term, the variance component, measures how the predictions vary with respect to the average prediction. This can be rewritten as:

$mse(f) = bias(f)^2 + var(f).$   (5)

Krogh & Vedelsby describe the ambiguity decomposition for an ensemble of k neural networks [18]. Assuming that $\hat{f}_f(x) = \sum_{i=1}^{k} \alpha_i \hat{f}_i(x)$ (see Sect. 5.1), where $\sum_{i=1}^{k} \alpha_i = 1$ and $\alpha_i \geq 0,\ i = 1, \ldots, k$, they show that the error for a single example is:

$(\hat{f}_f - f)^2 = \sum_{i=1}^{k} \alpha_i (\hat{f}_i - f)^2 - \sum_{i=1}^{k} \alpha_i (\hat{f}_i - \hat{f}_f)^2.$   (6)

This expression shows explicitly that the ensemble generalization error is less than or equal to the generalization error of a randomly selected single predictor. This is true because the ambiguity component (the second term on the right) is always non-negative. Another important result of this decomposition is that it is possible to reduce the ensemble generalization error by increasing the ambiguity without increasing the bias. The ambiguity term measures the disagreement among the base predictors on a given input x (omitted in the formulae just for the sake of simplicity, as previously mentioned). Two full proofs of the ambiguity decomposition [18] are presented in [16].

Later, Ueda & Nakano presented the bias/variance/covariance decomposition of the generalization error of ensemble estimators [19]. In this decomposition it is assumed that $\hat{f}_f(x) = \frac{1}{k} \sum_{i=1}^{k} \hat{f}_i(x)$:

$E[(\hat{f}_f - f)^2] = \overline{bias}^2 + \frac{1}{k} \times \overline{var} + \left(1 - \frac{1}{k}\right) \times \overline{covar},$   (7)

where

$\overline{bias} = \frac{1}{k} \times \sum_{i=1}^{k} [E_i(\hat{f}_i) - f],$   (8)

$\overline{var} = \frac{1}{k} \times \sum_{i=1}^{k} E_i\{[\hat{f}_i - E_i(\hat{f}_i)]^2\},$   (9)

$\overline{covar} = \frac{1}{k \times (k-1)} \times \sum_{i=1}^{k} \sum_{j=1, j \neq i}^{k} E_{i,j}\{[\hat{f}_i - E_i(\hat{f}_i)][\hat{f}_j - E_j(\hat{f}_j)]\}.$   (10)

The indexes i, j of the expectations mean that the expressions are taken over the respective training sets, $L_i$ and $L_j$.

Brown provides a good discussion of the relation between ambiguity and covariance [16]. An important result obtained from the study of this relation is the confirmation that it is not possible to maximize the ensemble ambiguity without affecting the ensemble bias component as well, i.e., it is not possible to maximize the ambiguity component and minimize the bias component simultaneously. The discussion in the present section is usually framed in the context of ensemble diversity, i.e., the study of the degree of disagreement between the base predictors. Many of the above statements are related to the well-known statistical problem of point estimation. This discussion is also related to the multi-collinearity problem that will be discussed in Sect. 5.
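Before moving on, Eq. 6 is easy to verify numerically. The sketch below (R; the prediction values and weights are arbitrary illustrations) computes both sides of the ambiguity decomposition for a single example and confirms they agree exactly.

```r
f     <- 1.0                      # true value f(x) for a single example
f_i   <- c(0.8, 1.3, 0.9)         # predictions of k = 3 base models
alpha <- c(0.5, 0.3, 0.2)         # convex weights: alpha_i >= 0, sum to 1
f_ens <- sum(alpha * f_i)         # ensemble prediction \hat{f}_f

lhs <- (f_ens - f)^2                                            # ensemble error
rhs <- sum(alpha * (f_i - f)^2) - sum(alpha * (f_i - f_ens)^2)  # Eq. 6
all.equal(lhs, rhs)               # TRUE: the decomposition holds exactly
```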
3 Ensemble generation

The goal of ensemble generation is to generate a set of models, $F = \{\hat{f}_i,\ i = 1, \ldots, k\}$. If the models are generated using the same induction algorithm the ensemble is called homogeneous; otherwise it is called heterogeneous.

Homogeneous ensemble generation is the best covered area of ensemble learning in the literature. See, for example, the state-of-the-art surveys by Dietterich [7] or Brown et al. [20]. In this section we mainly follow the former [7]. In homogeneous ensembles, the models are generated using the same algorithm. Thus, as explained in the following sections, diversity can be achieved by manipulating the data (Sect. 3.1) or by manipulating the model generation process (Sect. 3.2).

Heterogeneous ensembles are obtained when more than one learning algorithm is used. This approach is expected to obtain models with higher diversity [21]. The problem is the lack of control over the diversity of the ensemble during the generation phase. In homogeneous ensembles, diversity can be systematically controlled during generation, as will be discussed in the following sections. Conversely, when using several algorithms, it may not be so easy to control the differences between the generated models. This difficulty can be addressed by the use of the overproduce-and-choose approach, in which diversity is guaranteed in the pruning phase [22]. Another approach, commonly followed, combines the two, using different induction algorithms mixed with different parameter sets [23,10] (Sect. 3.2.1). Some authors claim that the use of heterogeneous ensembles improves on the performance of homogeneous ensemble generation. Note that heterogeneous ensembles can use homogeneous ensemble models as base learners.

3.1 Data manipulation

Data can be manipulated in three different ways: subsampling from the training set, manipulating the input features, and manipulating the output targets.

3.1.1 Subsampling from the training set

These methods have in common that the models are obtained using different subsamples from the training set. This approach generally assumes that the algorithm is unstable, i.e., that small changes in the training set imply important changes in the result. Decision trees, neural networks, rule learning algorithms and MARS are well-known unstable algorithms [24,7]. However, some of the methods based on subsampling (e.g., bagging and boosting) have been successfully applied to algorithms usually regarded as stable, such as Support Vector Machines (SVM) [25].

One of the most popular of such methods is bagging [26]. It uses randomly generated training sets to obtain an ensemble of predictors. If the original training set L has m examples, bagging (bootstrap aggregating) generates a model by sampling uniformly m examples with replacement (so some examples appear several times while others do not appear at all). Both Breiman [26] and Domingos [27] give insights on why bagging works.
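A minimal bagging sketch for regression follows (R). The choice of rpart regression trees as the unstable base learner and the simple averaging at prediction time are our illustrative assumptions; the survey describes the method only in prose.

```r
library(rpart)   # regression trees: a typically unstable base learner

bagging_fit <- function(data, k = 25) {
  lapply(seq_len(k), function(i) {
    boot <- data[sample(nrow(data), replace = TRUE), ]  # m draws with replacement
    rpart(y ~ ., data = boot)
  })
}

bagging_predict <- function(models, newdata) {
  rowMeans(sapply(models, predict, newdata = newdata))  # average the k predictions
}

## usage with the toy data frame d from Sect. 2.2:
## ens <- bagging_fit(d); bagging_predict(ens, head(d))
```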
Based on [28], Freund & Schapire present the AdaBoost (ADAptive BOOSTing) algorithm, the most popular boosting algorithm [29]. The main idea is that it is possible to convert a weak learning algorithm into one that achieves arbitrarily high accuracy. A weak learning algorithm is one that performs slightly better than random prediction. This conversion is done by combining the estimations of several predictors. As in bagging [26], the examples are randomly selected with replacement, but in AdaBoost each example has a different probability of being selected. Initially, this probability is equal for all the examples, but in the following iterations examples whose predictions were more inaccurate have a higher probability of being selected. In each new iteration, therefore, there are more 'difficult' examples in the training set. Although boosting was originally developed for classification, several algorithms have been proposed for regression, but none has emerged as the standard one [30].

Parmanto et al. describe the cross-validated committees technique for neural network ensemble generation using υ-fold cross-validation [31]. The main idea is to use as the ensemble the models obtained from the υ training sets of the cross-validation process.

3.1.2 Manipulating the input features

In this approach, different training sets are obtained by changing the representation of the examples. A new training set j is generated by replacing the original representation $\{(x_i, f(x_i))\}$ with a new one $\{(x_i', f(x_i))\}$. There are two types of approaches. The first one is feature selection, i.e., $x_i' \subset x_i$. In the second approach, the representation is obtained by applying some transformation to the original attributes, i.e., $x_i' = g(x_i)$.

A simple feature selection approach is the random subspace method, consisting of a random selection of features [32]. The models in the ensemble are independently constructed using randomly selected feature subsets. Originally, decision trees were used as base learners and the ensemble was called decision forests [32]. The final prediction is the combination of the predictions of all the trees in the forest.

Alternatively, iterative search methods can be used to select the different feature subsets. Opitz uses a genetic algorithm approach that continuously generates new subsets starting from a random feature selection [33]. The author uses neural networks for the classification problem, and reports better results using this approach than using the popular bagging and AdaBoost methods. In [34] the search method is a wrapper-like hill-climbing strategy. The criteria used to select the feature subsets are the minimization of the individual error and the maximization of ambiguity (Sect. 2.4).

A feature selection approach can also be used to generate ensembles for algorithms that are stable with respect to the training set but unstable with respect to the set of features, namely the nearest neighbors induction algorithm. In [35] the feature subset selection is done using adaptive sampling in order to reduce the risk of discarding discriminating information. Compared to random feature selection, this approach reduces the diversity between base predictors but increases their accuracy.

A simple transformation approach is input smearing [36]. It aims to increase the diversity of the ensemble by adding Gaussian noise to the inputs, with the goal of improving on the results of bagging. Each input value x is changed into a smeared value x' using:

$x' = x + p \times N(0, \hat{\sigma}_X),$   (11)

where p is an input parameter of the input smearing algorithm and $\hat{\sigma}_X$ is the sample standard deviation of X, computed using the training set data. In this case the examples are changed, but the training set keeps the same number of examples. In that work only the numeric input variables are smeared, although the nominal ones could also be smeared using a different strategy. Results compare favorably to bagging.
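Equation 11 translates into a few lines of R. The sketch below assumes numeric inputs only, as in the original proposal, and pairs the smearing with a bootstrap sample since the method is presented as a refinement of bagging; the parameter value p = 0.3 is an arbitrary choice.

```r
## one smeared training set: bootstrap the rows, then jitter the inputs
smeared_sample <- function(data, inputs, p = 0.3) {
  boot <- data[sample(nrow(data), replace = TRUE), ]
  for (v in inputs) {   # Eq. 11, applied independently to each input variable
    boot[[v]] <- boot[[v]] + p * rnorm(nrow(boot), mean = 0, sd = sd(data[[v]]))
  }
  boot
}

## e.g., k such training sets, each feeding one base model:
## smeared_sets <- lapply(1:25, function(i) smeared_sample(d, "x"))
```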
Rodriguez et al.[3]present a method that combines selection and transfor-mation,called rotation forests.The original set of features is divided into k disjoint subsets to increase the chance of obtaining higher diversity.Then,for each subset,a principal component analysis(PCA)approach is used to project the examples into a set of new features,consisting of linear combinations of the original ing decision trees as base learners,this strategy assures diversity,(decision trees are sensitive to the rotation of the axis)and accuracy (PCA concentrates in a few features most of the information contained in the data).The authors claim that rotation forests outperform bagging,AdaBoost and random forests(to be discussed further away in Sect.3.2.2).However,the adaptation of rotation forests for regression does not seem to be straightfor-ward.3.1.3Manipulating the output targetsThe manipulation of the output targets can also be used to generate different training sets.However,not much research follows this approach and most of it focus on classification.An exception is the work of Breiman,called output smearing[38].The basic idea is to add Gaussian noise to the target variable of the training set,in the same way as it is done for input features in the input smearing method (Sect.3.1.2).Using this approach it is possible to generate as many models as desired.Although it was originally proposed using CART trees as base models, it can be used with other base algorithms.The comparison between output smearing and bagging shows a consistent generalization error reduction,even if not outstanding.An alternative approach consists of the following steps.First it generates a model using the original data.Second,it generates a model that estimates the error of the predictions of thefirst model and generates an ensemble that combines the prediction of the previous model with the correction of the cur-rent one.Finally,it iteratively generates models that predict the error of the current ensemble and then updates the ensemble with the new model.The training set used to generate the new model in each iteration is obtained by replacing the output targets with the errors of the current ensemble.This ap-proach was proposed by Breiman,using bagging as the base algorithm and was called iterated bagging[39].Iterated bagging reduces generalization error when compared with bagging,mainly due to the bias reduction during the iteration process.113.2Model generation manipulationAs an alternative to manipulating the training set,it is possible to change the model generation process.This can be done by using different parameter sets, by manipulating the induction algorithm or by manipulating the resulting model.3.2.1Manipulating the parameter setsEach induction algorithm is sensitive to the values of the input parameters. 
3.2 Model generation manipulation

As an alternative to manipulating the training set, it is possible to change the model generation process. This can be done by using different parameter sets, by manipulating the induction algorithm, or by manipulating the resulting model.

3.2.1 Manipulating the parameter sets

Each induction algorithm is sensitive to the values of its input parameters, and the degree of sensitivity differs from parameter to parameter. To maximize the diversity of the models generated, one should focus on the parameters to which the algorithm is most sensitive.

Neural network ensemble approaches quite often use different initial weights to obtain different models. This is done because the resulting models vary significantly with different initial weights [40]. Several authors, like Rosen, use randomly generated seeds (initial weights) to obtain different models [41], while other authors mix this strategy with the use of different numbers of layers and hidden units [42,43].

The k-nearest neighbors ensemble proposed by Yankov et al. [44] has just two members, which differ in the number of nearest neighbors used. Both are deliberately sub-optimal: in one of them the number of nearest neighbors is too small, and in the other it is too large. The purpose is to increase diversity (see Sect. 5.2.1).

3.2.2 Manipulating the induction algorithm

Diversity can also be attained by changing the way induction is done. Therefore, the same learning algorithm may produce different results on the same data. Two main categories of approaches can be identified: sequential and parallel. In sequential approaches, the induction of a model is influenced only by the previous ones. In parallel approaches it is possible to have more extensive collaboration: (1) each process takes into account the overall quality of the ensemble, and (2) information about the models is exchanged between processes.

Rosen [41] generates ensembles of neural networks by sequentially training networks, adding a decorrelation penalty to the error function to increase diversity. Using this approach, the training of each network tries to minimize a function that has a covariance component, thus decreasing the generalization error of the ensemble, as stated in [19]. This was the first approach to use the decomposition of the generalization error by Ueda & Nakano [19] (Sect. 2.4) to guide the ensemble generation process.

Another sequential method to generate ensembles of neural networks is called SECA (Stepwise Ensemble Construction Algorithm) [30]. It uses bagging to obtain the training set for each neural network. The neural networks are trained sequentially, and the process stops when adding another neural network to the current ensemble increases the generalization error.

The Cooperative Neural Network Ensembles (CNNE) method [45] also uses a sequential approach. In this work, the ensemble begins with two neural networks and then, iteratively, CNNE tries to minimize the ensemble error, firstly by training the existing networks, then by adding a hidden node to an existing neural network, and finally by adding a new neural network. Like Rosen's approach, the error function includes a term representing the correlation between the models in the ensemble; therefore, to maximize the diversity, all the models already generated are retrained at each iteration of the process. The authors test their method not only on classification data sets but also on one regression data set, with promising results.

Tsang et al. [46] propose an adaptation of the CVM (Core Vector Machines) algorithm [47] that maximizes the diversity of the models in the ensemble by guaranteeing that they are orthogonal. This is achieved by adding constraints to the quadratic programming problem that is solved by the CVM algorithm.
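Rosen's method and CNNE share one mechanism: a diversity term added to each member's error function. The R sketch below shows one such penalized error; the exact penalty differs between the decorrelation and negative-correlation variants, so the functional form and the lambda value here are illustrative assumptions only.

```r
## preds: n x k matrix of current member predictions; y: length-n targets;
## i: index of the member being trained; lambda: accuracy/diversity trade-off
penalized_error <- function(preds, y, i, lambda = 0.5) {
  f_bar <- rowMeans(preds)                    # current ensemble prediction
  div   <- (preds[, i] - f_bar) *
           rowSums(preds[, -i, drop = FALSE] - f_bar)
  mean((preds[, i] - y)^2 + lambda * div)     # accuracy term + penalty term
}
```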
The CVM-based approach can be related to AdaBoost, because higher weights are given to instances that were incorrectly classified in previous iterations.

Note that the sequential approaches mentioned above add a penalty term to the error function of the learning algorithm. This sort of penalty has also been used in the parallel method Ensemble Learning via Negative Correlation (ELNC), which learns the neural networks simultaneously so that the overall quality of the ensemble is taken into account [48].

Parallel approaches that exchange information during the process typically integrate the learning algorithm with an evolutionary framework. Opitz & Shavlik [49] present the ADDEMUP (Accurate anD Diverse Ensemble-Maker giving United Predictions) method to generate ensembles of neural networks. In this approach, the fitness metric for each network weighs the accuracy of the network against the diversity of this network within the ensemble. The error decomposition presented by Krogh & Vedelsby [18] is used. Genetic operators of mutation and crossover are used to generate new models from previous ones. The new networks are trained emphasizing misclassified examples. The best networks are selected and the process is repeated until a stopping criterion is met. This approach can be used with other induction algorithms. A similar approach is the Evolutionary Ensembles with Negative Correlation Learning (EENCL) method, which combines the ELNC method with an evolutionary framework.
Risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood

Report of a Joint FAO/WHO Expert Consultation
Bangkok, Thailand, 5-9 August 2002

FAO Food and Nutrition Paper, Food Safety Consultations
Food and Agriculture Organization of the United Nations / World Health Organization

Acknowledgements

The Food and Agriculture Organization of the United Nations and the World Health Organization would like to express their appreciation to the expert drafting groups (see Annex 2) for the time and effort which they dedicated to the preparation of thorough and extensive technical documents on risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafoods. The deliberations of this expert consultation were based on these documents.

Contents

1 Introduction
2 Background
3 Objectives of the consultation
4 Summary of the general discussions
5 Risk assessment of thermophilic Campylobacter spp. in broiler chickens
  5.1.1 Summary of the risk assessment
  5.1.2 Introduction
  5.1.3 Scope
  5.1.4 Approach
  5.1.5 Key findings
  5.1.6 Risk assessment and developing countries
  5.1.7 Limitations and caveats
  5.1.8 Gaps in the data
  5.2 Review of the risk assessment
    5.2.1 Introduction
    5.2.2 Hazard identification
    5.2.3 Exposure assessment
    5.2.4 Hazard characterization
    5.2.5 Risk characterization
    5.2.6 Risk assessment and developing countries
    5.2.7 Data deficiencies
  5.3 Utility and applicability
  5.4 Response to the specific risk management questions posed by the Codex Committee on Food Hygiene
  5.5 Conclusions and recommendations
6 Risk assessment of Vibrio spp. in seafood
  6.1 Summary of the risk assessments
    6.1.1 Introduction: Vibrio spp. in seafood
    6.1.2 Scope
    6.1.3 Vibrio parahaemolyticus in raw oysters consumed in Japan, New Zealand, Canada, Australia and the United States of America
    6.1.4 Vibrio parahaemolyticus in bloody clams
    6.1.5 Vibrio parahaemolyticus in finfish
    6.1.6 Vibrio vulnificus in raw oysters
    6.1.7 Choleragenic Vibrio cholerae O1 and O139 in warm-water shrimps for export
  6.2 Review of the risk assessments
    6.2.1 Introduction: Vibrio spp. in seafood
    6.2.2 Vibrio parahaemolyticus in raw oysters consumed in Japan, New Zealand, Canada, Australia and the United States of America
    6.2.3 Vibrio parahaemolyticus in bloody clams
    6.2.4 Vibrio parahaemolyticus in finfish
    6.2.5 Vibrio vulnificus in raw oysters
    6.2.6 Choleragenic Vibrio cholerae O1 and O139 in warm-water shrimp for export
  6.3 Utility and applicability
  6.4 Response to the specific questions posed by the Codex Committee on Fish and Fishery Products
  6.5 Response to the needs of the Codex Committee on Food Hygiene
  6.6 Conclusions and recommendations
7 Conclusions
8 Recommendations
Annex 1: List of participants
Annex 2: List of working documents
List of Abbreviations

CAC: Codex Alimentarius Commission
CCFFP: Codex Committee on Fish and Fishery Products
CCFH: Codex Committee on Food Hygiene
CFU: colony forming unit
CT: cholera toxin
ctx: cholera toxin gene
FAO: Food and Agriculture Organization of the United Nations
FDA-VPRA: The United States Food and Drug Administration draft risk assessment on the "Public Health Impact of Vibrio parahaemolyticus in Raw Molluscan Shellfish"
g: gram(s)
GHP: Good Hygienic Practices
GMP: Good Manufacturing Practices
h: hour(s)
ISSC: Interstate Shellfish Sanitation Conference
MPN: Most Probable Number
PCR: Polymerase Chain Reaction
ppt: parts per thousand
RAP: FAO Regional Office for Asia and the Pacific
TDH: thermostable direct haemolysin
tdh: thermostable direct haemolysin gene
TLH: thermolabile haemolysin
TRH: TDH-related haemolysin
trh: tdh-related haemolysin gene
USFDA: United States Food and Drug Administration
WHO: World Health Organization

1 Introduction

The Food and Agriculture Organization of the United Nations (FAO) and the World Health Organization (WHO) convened an expert consultation on "Risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood" at the FAO Regional Office for Asia and the Pacific (RAP), Bangkok, Thailand, on 5-9 August 2002. The list of participants is presented in Annex 1.

Mr Dong Qingsong, FAO Deputy Regional Representative for Asia and the Pacific and Officer-in-charge, RAP, opened the meeting on behalf of the two sponsoring organizations. In welcoming the participants, Mr Qingsong noted the increasing significance of microbiological hazards in relation to food safety. He noted that international trade had amplified the opportunity for these hazards to be disseminated from the original point of production to locations thousands of miles away, thereby permitting such food safety hazards to impact public health and trade in more than one country. Mr Qingsong observed that this underlined the need to first consider microbiological hazards at the international level and then provide the means by which they can be addressed at regional and national levels. He highlighted the commitment of FAO and WHO to provide a neutral international forum to consider new approaches to achieving food safety, and in particular to address microbiological risk assessment.

As the meeting was convened in Asia, Mr Qingsong also highlighted the importance of the work on microbiological risk assessment for this region. He noted that seafood and poultry are important products in both domestic and international trade and provide a valuable contribution to the economy of the region. He also noted the concerns for public health raised by the microbiological hazards associated with these foods.
Therefore, in conclusion, he requested the expert consultation to consider in particular the practical application of their work, so that the output would be of value to a range of end users and to countries at different stages of development.

The consultation appointed Dr Servé Notermans (the Netherlands) as chairperson of the consultation and Ms Dorothy-Jean McCoubrey (New Zealand) as rapporteur. Dr Notermans also served as chairperson of the campylobacter working group, and Prof. Diane Newell (United Kingdom) served as its rapporteur. Dr Mark Tamplin (United States of America) served as chairperson of the working group on vibrio and Dr Ron Lee (United Kingdom) as rapporteur of that group.

2 Background

Risk assessment of microbiological hazards in foods has been identified as a priority area of work for the Codex Alimentarius Commission (CAC). At its 32nd session, the Codex Committee on Food Hygiene (CCFH) identified a list of pathogen-commodity combinations for which it requires expert risk assessment advice. In response, FAO and WHO jointly launched a programme of work with the objective of providing expert advice on risk assessment of microbiological hazards in foods to their member countries and to the CAC.

Dr Hajime Toyofuku, WHO, and Dr Sarah Cahill, FAO, provided participants with an overview of the joint FAO/WHO activities on microbiological risk assessment. In their presentation, they also highlighted the objectives and expected outcomes of the current meeting.

The FAO/WHO programme of activities on microbiological risk assessment aims to serve two customers: the CAC, and the FAO and WHO member countries. The CAC, and in particular the CCFH, needs sound scientific advice as a basis for the development of guidelines and recommendations, as well as answers to specific risk management questions on certain pathogen-commodity combinations. Member countries, on the other hand, need specific risk assessment tools to use in conducting their own assessments and, if possible, some modules that can be directly applied in a national risk assessment.

To implement this programme of work, FAO and WHO are convening a series of joint expert consultations. To date, three expert consultations have been held. The first of these was held on 17-21 July 2000 [1] and the second on 30 April - 4 May 2001 [2]. Both of these expert consultations addressed risk assessment of Salmonella spp. in broiler chickens and eggs and Listeria monocytogenes in ready-to-eat foods. In March 2001, FAO and WHO initiated risk assessment work on the two pathogen-commodity combinations considered in this expert consultation. Two ad hoc expert drafting groups were established to examine the available relevant information on Campylobacter spp. in broiler chickens and Vibrio spp. in seafood and to prepare documentation on both the exposure assessment and hazard characterization steps of the risk assessment. These documents were reviewed and evaluated by a joint expert consultation convened on 23-27 July 2001 [3].

In October 2001, the report of that expert consultation was delivered to the CCFH. Additional risk management guidance was sought from the Committee in relation to the finalization of the risk assessments. In response, the Committee established risk management drafting groups to consider the approach it could take towards providing guidance on managing the problems associated with Campylobacter spp. in broiler chickens and Vibrio spp. in seafood.
A limited amount of feedback was received; therefore the risk assessment drafting groups, FAO and WHO met early in 2002 to consider how to complete the risk assessment work. Both risk assessments have been progressed towards completion, although to varying extents. The risk assessment documents that have been developed to date were reviewed and evaluated by the expert consultation.

The purpose of this report is to present the summary of the draft documents on risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood, as well as the discussions and recommendations of the expert consultation.

3 Objectives of the consultation

The expert consultation examined the information provided by the expert drafting groups on risk assessment of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood with the following objectives:

• To critically review the documents prepared by the ad hoc expert drafting groups, giving particular attention to: the risk characterization component and the manner in which the risk assessment outputs were generated; the assumptions made in the risk assessment; the associated uncertainty and variability; and the potential application of the work.
• To provide scientific advice to FAO and WHO member countries on the risk assessment of Vibrio spp. in seafood and Campylobacter spp. in broiler chickens, based on the available documentation and the discussions during the expert consultation.
• To respond to the risk management needs of Codex.

[1] FAO. 2000. Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods, FAO Headquarters, Rome, Italy, 17-20 July 2000. FAO Food and Nutrition Paper 71. /es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm
[2] FAO. 2001. Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods: Risk characterization of Salmonella spp. in eggs and broiler chickens and Listeria monocytogenes in ready-to-eat foods, FAO Headquarters, Rome, Italy, 30 April - 4 May 2001. FAO Food and Nutrition Paper 72. /es/ESN/food/risk_mra_salmonella_en.stm / http://www.who.int/fsf/Micro/index.htm
[3] WHO. 2001. Report of the Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods: Hazard identification, exposure assessment and hazard characterization of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. WHO Headquarters, Geneva, Switzerland, 23-27 July 2001. /es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm

4 Summary of the general discussions

The drafting groups presented overviews of the risk assessment documents on Campylobacter spp. in broiler chickens and Vibrio spp. in seafood to the expert consultation. A summary of the draft risk assessments and the discussions of the expert consultation are given in sections five and six of this report. However, a number of issues relevant to risk assessment in general were addressed and are summarized below.

It was anticipated that the campylobacter and vibrio risk assessments will prove useful for decision makers such as Codex committees, food industries, food control authorities, regulators and other stakeholders. It was not the task of the risk assessment groups to undertake risk profiling or to determine acceptable risk levels; this is the role of risk managers.
However, the outputs of this work should provide tools and information useful to both risk managers and risk assessors.

It was not possible at this stage to complete full quantitative risk assessments for all pathogens, inclusive of full validation exercises. This was due to the lack of extensive quantitative data, the uncertainties that exist and the need to make assumptions, e.g. that all strains of Campylobacter spp. are pathogenic. However, it was noted that even preliminary quantitative and good qualitative risk assessments would be very useful in decision making. The risk assessment of Vibrio parahaemolyticus in bloody clams provides an example of how readily accessible data can be collated into the risk assessment framework to give valuable information that could be used towards improving public health.

The models on Campylobacter spp. and Vibrio spp. provided examples of how risk assessment can be applied to pathogen-commodity combinations. However, it is important that countries do not simply transfer conclusions from this risk assessment work into their own risk management decisions without giving due consideration to the effect of different species, environments and populations in their own country.

The challenge will be to show the world, including developing countries, the advantages of undertaking risk assessment. Training, complete with easy-to-understand guidelines on how to apply the risk assessments, was considered extremely important. The training and guideline materials should be written in simple language that is applicable to different audiences and to countries with different levels of development.

Many countries are concerned that not having full risk assessments will create trade barriers. Assistance may need to be given to many developing countries to generate and collect data and to undertake modelling exercises so that such concerns can be addressed.

The consultation formed two working groups to address the risk assessment documentation on Campylobacter spp. and Vibrio spp. respectively. The composition of the two working groups is outlined below.

Campylobacter spp. in broiler chickens
• Independent experts: John Cowden, Heriberto Fernández, Geoffrey C. Mead, Diane G. Newell, Servé Notermans, Sasitorn Kanarat, Paul Brett Vanderlinde
• Expert members of the drafting group: Bjarke Christensen, Aamir Fazil, Emma Hartnett, Anna Lammerding, Maarten Nauta, Greg Paoli
• Members of the secretariat: Sarah Cahill, Karen Huleback, Jeronimas Maskeliunas, Hajime Toyofuku

Vibrio spp. in seafood
• Independent experts: Nourredine Bouchriti, Jean-Michel Fournier, Ron Lee, Carlos A. Lima dos Santos, Dorothy Jean McCoubrey, Marianne Miliotis, Mitsuaki Nishibuchi, Pensri Rodma, Mark Tamplin
• Expert members of the drafting group: Anders Dalsgaard, Angelo DePaola, I. Karunasager, Thomas McMeekin, Ken Osaka, John Sumner, Mark Walderhaug
• Members of the secretariat: Lahsen Ababouch, Peter Karim BenEmbarek, Maria de Lourdes Costarrica

5 Risk Assessment of thermophilic Campylobacter spp. in broiler chickens

5.1.1 Summary of the Risk Assessment

5.1.2 Introduction

An understanding of thermophilic Campylobacter spp., and specifically Campylobacter jejuni, in broiler chickens is important from both public health and international trade perspectives.
In order to achieve such an understanding, an evaluation of this pathogen-commodity combination by quantitative risk assessment methodology was undertaken.

The first steps of this process (hazard identification, hazard characterization and the conceptual model for exposure assessment) were presented at an expert consultation in July 2001 and summarized in the report of that consultation [4]. The exposure assessment model described the production-to-consumption pathway for fresh and frozen broiler chicken carcasses prepared and consumed in the home.

The final step, risk characterization, integrates the exposure and dose-response information in order to quantify the human-health risk attributable to pathogenic thermophilic Campylobacter spp. in broiler chickens. The risk assessment model and results were presented to the present expert consultation in the working document MRA 02/01. Recommendations for modifications and/or areas for further development of the exposure assessment model arising from the 2001 consultation were taken into consideration and, where possible, incorporated during the preparation of the MRA 02/01 report. Recommendations that were not incorporated are noted in the exposure assessment summary section below.

5.1.3 Scope

The purpose of the work was to develop a risk assessment that attempts to understand how the incidence of human campylobacteriosis is influenced by various factors during chicken rearing, processing, distribution, retail storage, consumer handling, meal preparation and, finally, consumption. A benefit of this approach is that it enables consideration of the broadest range of intervention strategies. The risk characterization estimates the probability of illness per serving of chicken associated with the presence of thermophilic Campylobacter spp. on fresh and frozen whole broiler chicken carcasses with the skin intact, which are cooked in the domestic kitchen for immediate consumption. A schematic representation of the risk assessment is shown in Figure 5.1.

[FIGURE 5.1: Schematic representation of the risk assessment of Campylobacter spp. in broiler chickens.]

[4] WHO. 2001. Report of the Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods: Hazard identification, exposure assessment and hazard characterization of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. WHO Headquarters, Geneva, Switzerland, 23-27 July 2001. /es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm

5.1.4 Approach

5.1.4.1 Hazard identification

Thermophilic Campylobacter spp. are a leading cause of zoonotic enteric illness in most developed countries (Friedman et al., 2000 [5]). Human cases are usually caused by C. jejuni and, to a lesser extent, by Campylobacter coli. Information on the burden of human campylobacteriosis in developing countries is more limited (Oberhelman & Taylor, 2000 [6]). Nevertheless, it is reported that asymptomatic infection with C. jejuni and C. coli is frequent in adults. In children under the age of two, C. jejuni, C. coli and other Campylobacter spp. are all associated with enteric disease.

Campylobacter spp. may be transferred to humans by direct contact with contaminated animals or animal carcasses, or indirectly through the ingestion of contaminated food or water. The principal reservoirs of thermophilic Campylobacter spp. are the alimentary tracts of wild and domesticated mammals and birds.
Campylobacter spp. are commonly found in poultry, cattle, pigs, sheep, wild animals and birds, and in dogs and cats. Foodstuffs including poultry, beef, pork, other meat products, raw milk and milk products and, less frequently, fish and fish products, mussels and fresh vegetables can also be contaminated (Jacobs-Reitsma, 2000 [7]). The findings of analytical epidemiological studies are conflicting. Some have identified handling raw poultry and eating poultry products as important risk factors for sporadic campylobacteriosis (Friedman et al., 2000 [5]), while others have found that contact in the home with such products is protective (Adak et al., 1995 [8]).

Information for the hazard identification section was compiled from published literature and from unpublished data submitted to FAO and WHO by public health agencies and other interested parties.

[5] Friedman, C., Neimann, J., Wegener, H. & Tauxe, R. 2000. Epidemiology of Campylobacter jejuni infections in the United States and other industrialized nations. In Campylobacter, 2nd Edition, pp. 121-138. Edited by I. Nachamkin & M. J. Blaser. Washington DC: ASM Press.
[6] Oberhelman, R. & Taylor, D. 2000. Campylobacter infections in developing countries. In Campylobacter, 2nd Edition, pp. 139-154. Edited by I. Nachamkin & M. Blaser. Washington DC: ASM Press.
[7] Jacobs-Reitsma, W. 2000. Campylobacter in the food supply. In Campylobacter, 2nd Edition, pp. 467-481. Edited by I. Nachamkin & M. J. Blaser. Washington DC: ASM Press.
[8] Adak, G. K., Cowden, J. M., Nicholas, S. & Evans, H. S. 1995. The Public Health Laboratory Service national case-control study of primary indigenous sporadic cases of campylobacter infection. Epidemiology and Infection 115, 15-22.

5.1.4.2 Hazard characterization

Hazard characterization provides a description of the public health outcomes following infection, including sequelae; the pathogen characteristics influencing the organism's ability to elicit infection and illness; the host characteristics that influence the acquisition of infection; and the food-related factors that may affect the survival of C. jejuni in the human gastrointestinal tract. A dose-response model was derived that mathematically describes the relationship between the number of organisms that might be present in a food and consumed (the dose) and the human health outcome (the response). To achieve this, human feeding trial data (Black et al., 1988 [9]) for two strains of C. jejuni were pooled and used to derive estimates for the probability of infection (colonization of the gastrointestinal tract without overt symptoms) as well as the probability of illness once infected. Campylobacteriosis predominantly occurs as sporadic cases (Friedman et al., 2000 [10]), and hence dose-response data from outbreak investigations are essentially non-existent.

The probability of infection upon ingestion of a dose of C. jejuni was estimated with the caveat that the data are from a feeding study involving healthy volunteers, using a milk matrix and a limited number of campylobacter strains. Whether the probabilities of infection and/or illness are different for immune individuals, or for individuals with an increased susceptibility to illness (e.g. the immunosuppressed, the very young, etc.), compared to the volunteers could not be determined from the feeding trial data. The probability of illness following infection was also estimated based upon these investigations, and was assumed to be dose-independent. Again, the impact of other factors, such as susceptibility, on the probability of illness cannot be quantified due to a lack of adequate epidemiological data and of resolution to this level. The progression of the illness to more serious outcomes and the development of some sequelae can be crudely estimated from the approximate proportions reported in the literature, but these were not included in the present work. For the purposes of this model it was assumed that C. coli has the same properties as C. jejuni.

[9] Black, R. E., Levine, M. M., Clements, M. L., Hughes, T. P. & Blaser, M. J. 1988. Experimental Campylobacter jejuni infection in humans. Journal of Infectious Diseases 157, 472-9.
[10] Friedman, C., Neimann, J., Wegener, H. & Tauxe, R. 2000. Epidemiology of Campylobacter jejuni infections in the United States and other industrialized nations. In Campylobacter, 2nd Edition, pp. 121-138. Edited by I. Nachamkin & M. J. Blaser. Washington DC: ASM Press.
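To illustrate the kind of dose-response relationship described above, the R sketch below uses the approximate Beta-Poisson form that is common in microbial risk assessment. The report does not give its fitted parameters here, so the alpha, beta and probability-of-illness values are placeholders, not the values derived from the Black et al. feeding trial data.

```r
## approximate Beta-Poisson dose-response: P(infection | ingested dose)
p_infection <- function(dose, alpha = 0.15, beta = 8) {
  1 - (1 + dose / beta)^(-alpha)        # placeholder parameters
}

p_ill_given_inf <- 0.33                 # dose-independent, per the text; value illustrative

risk_per_serving <- function(dose) p_infection(dose) * p_ill_given_inf

risk_per_serving(c(10, 100, 1000))      # risk rises, but less than linearly, with dose
```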
5.1.4.3 Exposure assessment

The exposure model describes the stages Rearing and Transport, Slaughter and Processing, and Preparation and Consumption (Figure 5.1). This modular approach estimates the prevalence and concentration of thermophilic Campylobacter spp., and the changes in these, associated with each stage of processing, with refrigerated and frozen storage, and with consumer handling of the product.

Routes contributing to the initial introduction of Campylobacter spp. into a poultry flock [11] on the farm have been identified in epidemiological studies. However, these remain poorly understood and the phenomenon may be multi-factorial; hence, colonization was modelled on the basis of an unspecified route of introduction. An epidemiology module has nevertheless been created as a supplement to the model, which allows the evaluation of interventions aimed at changing the frequency with which risk factors occur. As such risk factors are usually interdependent and may be country-specific, incorporation of this module into the current risk model was not appropriate; however, it does illustrate how such information may be used in a specific situation.

[11] A flock is a group of hens of similar age that are managed and housed together.

The effects of cooling and freezing, and of adding chlorine during water chilling, are evaluated in the model. The effects of other processing treatments, e.g. lactic acid and irradiation, on the reduction of Campylobacter spp. contamination on poultry carcasses were not explicitly evaluated, as the quantitative data are lacking; however, if the level of reduction achieved by any type of processing modification can be quantified, then its impact on risk can be readily evaluated within the risk model.

The home preparation module considered two routes of ingestion of viable organisms: via consumption of contaminated, undercooked chicken meat, and via cross-contamination, from contaminated raw poultry onto another food that is eaten without further cooking, or directly from the hands (Figure 5.2). As a result of recommendations from the previous expert consultation [12], the "protected areas" cooking approach and the drip-fluid model approach for cross-contamination were selected. The "protected areas" approach to modelling the cooking step considers that campylobacter cells may be located in parts of the carcass that reach cooking temperature more slowly than other locations.
The drip-fluid model for cross-contamination relates to the spread of campylobacter cells via drip-fluid from raw carcasses.

[FIGURE 5.2: Schematic representation of the home preparation module.]

5.1.4.4 Risk characterization

The risk characterization step integrates the information collected during the hazard identification, hazard characterization and exposure assessment steps to arrive at estimates of the adverse events that may arise from the consumption of chicken (Figure 5.1). This step links the probability and magnitude of exposure to campylobacter associated with the consumption of chicken to the adverse outcomes that might occur. The resulting risk is expressed as individual risk or as the risk per serving of chicken. Although this model does not address the risk to a specific population, data on the amounts of chicken consumed can be incorporated into the model to arrive at estimates of population-based risk.

The current model is unable to provide a central estimate of risk due to virtually unbounded uncertainty [13] in two key components of the model, namely the impact of undercooking and the impact of cross-contamination. Since these processes are the ultimate determinants of the final exposure of the consumer, the unbounded and unresolvable nature of the uncertainty undermines the establishment of a central estimate of consumer risk. In addition, the model has not yet been applied

[12] WHO. 2001. Report of the Joint FAO/WHO Expert Consultation on Risk Assessment of Microbiological Hazards in Foods: Hazard identification, exposure assessment and hazard characterization of Campylobacter spp. in broiler chickens and Vibrio spp. in seafood. WHO Headquarters, Geneva, Switzerland, 23-27 July 2001. /es/ESN/food/risk_mra_campylobacter_en.stm / http://www.who.int/fsf/Micro/index.htm
[13] As the upper and lower boundaries of the values describing the impact of undercooking and cross-contamination are unknown, the distribution of possible values is potentially infinite and therefore unbounded.
Journal of Statistical Software, November 2008, Volume 28, Issue 5.

Building Predictive Models in R Using the caret Package

Max Kuhn
Pfizer Global R&D

Abstract

The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R. The package focuses on simplifying model training and tuning across a wide variety of modeling techniques. It also includes methods for pre-processing training data, calculating variable importance, and model visualizations. An example from computational chemistry is used to illustrate the functionality on a real data set and to benchmark the benefits of parallel processing with several types of models.

Keywords: model building, tuning parameters, parallel processing, R, NetWorkSpaces.

1. Introduction

The use of complex classification and regression models is becoming more and more commonplace in science, finance and a myriad of other domains (Ayres 2007). The R language (R Development Core Team 2008) has a rich set of modeling functions for both classification and regression, so many in fact, that it is becoming increasingly more difficult to keep track of the syntactical nuances of each function. The caret package, short for classification and regression training, was built with several goals in mind: to eliminate syntactical differences between many of the functions for building and predicting models, to develop a set of semi-automated, reasonable approaches for optimizing the values of the tuning parameters for many of these models, and to create a package that can easily be extended to parallel processing systems.

The package contains functionality useful in the beginning stages of a project (e.g., data splitting and pre-processing), as well as unsupervised feature selection routines and methods to tune models using resampling that helps diagnose over-fitting.

The package is available at the Comprehensive R Archive Network at http://CRAN.R-project.org/package=caret.
caret depends on over 25 other packages, although many of these are listed as "suggested" packages which are not automatically loaded when caret is started. Packages are loaded individually when a model is trained or predicted. After the package is installed and loaded, help pages can be found using help(package = "caret"). There are three package vignettes that can be found using the vignette function (e.g., vignette(package = "caret")).

For the remainder of this paper, the capabilities of the package are discussed for: data splitting and pre-processing; tuning and building models; characterizing performance and variable importance; and parallel processing tools for decreasing the model build time. Rather than discussing the capabilities of the package in a vacuum, an analysis of an illustrative example is used to demonstrate the functionality of the package. It is assumed that the readers are familiar with the various tools that are mentioned. Hastie et al. (2001) is a good technical introduction to these tools.

2. An illustrative example

In computational chemistry, chemists often attempt to build predictive relationships between the structure of a chemical and some observed endpoint, such as activity against a biological target. Using the structural formula of a compound, chemical descriptors can be generated that attempt to capture specific characteristics of the chemical, such as its size, complexity, "greasiness" etc. Models can be built that use these descriptors to predict the outcome of interest. See Leach and Gillet (2003) for examples of descriptors and how they are used.

Kazius et al. (2005) investigated using chemical structure to predict mutagenicity (the increase of mutations due to the damage to genetic material). An Ames test (Ames et al. 1972) was used to evaluate the mutagenicity potential of various chemicals. There were 4,337 compounds included in the data set with a mutagenicity rate of 55.3%. Using these compounds, the dragonX software (Talete SRL 2007) was used to generate a baseline set of 1,579 predictors, including constitutional, topological and connectivity descriptors, among others. These variables consist of basic numeric variables (such as molecular weight) and count variables (e.g., number of halogen atoms).

The descriptor data are contained in an R data frame named descr and the outcome data are in a factor vector called mutagen with levels "mutagen" and "nonmutagen". These data are available from the package website.

3. Data preparation

Since there is a finite amount of data to use for model training, tuning and evaluation, one of the first tasks is to determine how the samples should be utilized. There are a few schools of thought. Statistically, the most efficient use of the data is to train the model using all of the samples and use resampling (e.g., cross-validation, the bootstrap etc.) to evaluate the efficacy of the model. Although it is possible to use resampling incorrectly (Ambroise and McLachlan 2002), this is generally true. However, there are some non-technical reasons why resampling alone may be insufficient. Depending on how the model will be used, an external test/validation sample set may be needed so that the model performance can be characterized on data that were not used in the model training. As an example, if the model predictions are to be used in a highly regulated environment (e.g., clinical diagnostics), the user may be constrained to "hold-back" samples in a validation set.

For illustrative purposes, we will do an initial split of the data into training and test sets.
The test set will be used only to evaluate performance (such as to compare models) and the training set will be used for all other activities. The function createDataPartition can be used to create stratified random splits of a data set. In this case, 75% of the data will be used for model training and the remainder will be used for evaluating model performance. The function creates the random splits within each class so that the overall class distribution is preserved as well as possible.

R> library("caret")
R> set.seed(1)
R> inTrain <- createDataPartition(mutagen, p = 3/4, list = FALSE)
R> trainDescr <- descr[inTrain, ]
R> testDescr <- descr[-inTrain, ]
R> trainClass <- mutagen[inTrain]
R> testClass <- mutagen[-inTrain]
R> prop.table(table(mutagen))
mutagen
   mutagen nonmutagen
 0.5536332  0.4463668
R> prop.table(table(trainClass))
trainClass
   mutagen nonmutagen
 0.5535055  0.4464945

In cases where the outcome is numeric, the samples are split into quartiles and the sampling is done within each quartile. Although not discussed in this paper, the package also contains a method for selecting samples using maximum dissimilarity sampling (Willett 1999). This approach to sampling can be used to partition the samples into training and test sets on the basis of their predictor values.

There are many models where predictors with a single unique value (also known as "zero-variance predictors") will cause the model to fail. Since we will be tuning models using resampling methods, a random sample of the training set may result in some predictors with more than one unique value becoming zero-variance predictors (in our data, the simple split of the data into a test and training set caused three descriptors to have a single unique value in the training set). These so-called "near zero-variance predictors" can cause numerical problems during resampling for some models, such as linear regression.

As an example of such a predictor, the variable nR04 is the number of 4-membered rings in a compound. For the training set, almost all of the samples (n = 3,233) have no 4-membered rings, while 18 compounds have one and a single compound has 2 such rings. If these data are resampled, this predictor might become a problem for some models. To identify this kind of predictor, two properties can be examined. First, the percentage of unique values in the training set can be calculated for each predictor. Variables with low percentages have a higher probability of becoming a zero-variance predictor during resampling. For nR04, the percentage of unique values in the training set is low (9.2%). However, this in itself is not a problem. Binary predictors, such as dummy variables, are likely to have low percentages and should not be discarded for this simple reason. The other important criterion to examine is the skewness of the frequency distribution of the variable. If the ratio of the most frequent value of a predictor to the second most frequent value is large, the distribution of the predictor may be highly skewed. For nR04, the frequency ratio is large (179 = 3233/18), indicating a significant imbalance in the frequency of values. If both these criteria are flagged, the predictor may be a near zero-variance predictor. It is suggested that if: (1) the percentage of unique values is less than 20% and (2) the ratio of the most frequent to the second most frequent value is greater than 20, the predictor may cause problems for some models. The function nearZeroVar can be used to identify near zero-variance predictors in a data set. It returns an index of the column numbers that violate the two conditions above.
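A minimal sketch of how the index returned by nearZeroVar might be used to drop such columns (illustrative only; this filtering step is not part of the analysis in this paper, and the function's default cutoffs may differ from the 20/20 rule of thumb above):

nzv <- nearZeroVar(trainDescr)   # column indices of near zero-variance predictors
if (length(nzv) > 0) {
  trainDescr <- trainDescr[, -nzv]
  testDescr  <- testDescr[, -nzv]
}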
Also, some models are susceptible to multicollinearity (i.e., high correlations between predictors). Linear models, neural networks and other models can have poor performance in these situations or may generate unstable solutions. Other models, such as classification or regression trees, might be resistant to highly correlated predictors, but multicollinearity may negatively affect interpretability of the model. For example, a classification tree may have good performance with highly correlated predictors, but the determination of which predictors are in the model is random.

If there is a need to minimize the effect of multicollinearity, there are a few options. First, models that are resistant to large between-predictor correlations, such as partial least squares, can be used. Also, principal component analysis can be used to reduce the number of dimensions in a way that removes correlations (see below). Alternatively, we can identify and remove the predictors that contribute the most to the correlations.

In linear models, the traditional method for reducing multicollinearity is to identify the offending predictors using the variance inflation factor (VIF). For each variable, this statistic measures the increase in the variation of the model parameter estimate in comparison to the optimal situation (i.e., an orthogonal design). This is an acceptable technique when linear models are used and there are more samples than predictors. In other cases, it may not be as appropriate.

As an alternative, we can compute the correlation matrix of the predictors and use an algorithm to remove a subset of the problematic predictors such that all of the pairwise correlations are below a threshold:

repeat
    Find the pair of predictors with the largest absolute correlation;
    For both predictors, compute the average correlation between each predictor and all of the other variables;
    Flag the variable with the largest mean correlation for removal;
    Remove this row and column from the correlation matrix;
until no correlations are above a threshold;

This algorithm can be used to find the minimal set of predictors that can be removed so that the pairwise correlations are below a specific threshold. Note that, if two variables have a high correlation, the algorithm determines which one is involved with the most pairwise correlations and removes it. For illustration, predictors that result in absolute pairwise correlations greater than 0.90 can be removed using the findCorrelation function. This function returns an index of column numbers for removal.

R> ncol(trainDescr)
[1] 1576
R> descrCorr <- cor(trainDescr)
R> highCorr <- findCorrelation(descrCorr, 0.90)
R> trainDescr <- trainDescr[, -highCorr]
R> testDescr <- testDescr[, -highCorr]
R> ncol(trainDescr)
[1] 650

For chemical descriptors, it is not uncommon to have many very large correlations between the predictors. In this case, using a threshold of 0.90, we eliminated 926 descriptors from the data.
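The repeat-until procedure above can also be written out directly. The following is a simplified re-implementation for illustration, not the packaged findCorrelation code; it returns column names rather than indices.

# Greedy removal of predictors until all pairwise |correlations| <= threshold.
corrFilter <- function(cmat, threshold = 0.90) {
  drop <- character(0)
  diag(cmat) <- 0                       # ignore self-correlations
  repeat {
    amax <- max(abs(cmat))
    if (amax <= threshold) break
    idx <- which(abs(cmat) == amax, arr.ind = TRUE)[1, ]
    # mean absolute correlation of each member of the offending pair
    m1 <- mean(abs(cmat[idx[1], ]))
    m2 <- mean(abs(cmat[idx[2], ]))
    out <- if (m1 > m2) idx[1] else idx[2]
    drop <- c(drop, rownames(cmat)[out])
    cmat <- cmat[-out, -out, drop = FALSE]
  }
  drop
}
# e.g., dropNames <- corrFilter(descrCorr, 0.90)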
Once the final set of predictors is determined, the values may require transformations before being used in a model. Some models, such as partial least squares, neural networks and support vector machines, need the predictor variables to be centered and/or scaled. The preProcess function can be used to determine values for predictor transformations using the training set and can be applied to the test set or future samples. The function has an argument, method, that can have possible values of "center", "scale", "pca" and "spatialSign". The first two options provide simple location and scale transformations of each predictor (and are the default values of method). The predict method for this class is then used to apply the processing to new samples:

R> xTrans <- preProcess(trainDescr)
R> trainDescr <- predict(xTrans, trainDescr)
R> testDescr <- predict(xTrans, testDescr)

The "pca" option computes loadings for principal component analysis that can be applied to any other data set. In order to determine how many components should be retained, the preProcess function has an argument called thresh that is a threshold for the cumulative percentage of variance captured by the principal components. The function will add components until the cumulative percentage of variance is above the threshold. Note that the data are automatically scaled when method = "pca", even if the method value did not indicate that scaling was needed. For PCA transformations, the predict method generates values with column names "PC1", "PC2", etc.

Specifying method = "spatialSign" applies the spatial sign transformation (Serneels et al. 2006) where the predictor values for each sample are projected onto a unit circle using x* = x/||x||. This transformation may help when there are outliers in the x space of the training set.
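As a quick illustration of the x* = x/||x|| projection, independent of preProcess (a hedged base-R sketch; caret's own implementation may differ in detail):

spatialSignSketch <- function(X) {
  norms <- sqrt(rowSums(X^2))    # Euclidean length of each sample (row)
  norms[norms == 0] <- 1         # leave all-zero rows unchanged
  X / norms                      # recycling divides each row by its length
}
X <- matrix(rnorm(20), nrow = 5)
rowSums(spatialSignSketch(X)^2)  # every row now has unit length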
4. Building and tuning models

The train function can be used to select values of model tuning parameters (if any) and/or estimate model performance using resampling. As an example, a radial basis function support vector machine (SVM) can be used to classify the samples in our computational chemistry data. This model has two tuning parameters. The first is the scale parameter σ in the radial basis function

K(a, b) = exp(−σ ||a − b||^2)

and the other is the cost value C used to control the complexity of the decision boundary. We can create a grid of candidate tuning values to evaluate. Using resampling methods, such as the bootstrap or cross-validation, a set of modified data sets are created from the training samples. Each data set has a corresponding set of hold-out samples. For each candidate tuning parameter combination, a model is fit to each resampled data set and is used to predict the corresponding held-out samples. The resampling performance is estimated by aggregating the results of each hold-out sample set. These performance estimates are used to evaluate which combination(s) of the tuning parameters are appropriate. Once the final tuning values are assigned, the final model is refit using the entire training set.

For the train function, the possible resampling methods are: bootstrapping, k-fold cross-validation, leave-one-out cross-validation, and leave-group-out cross-validation (i.e., repeated splits without replacement). By default, 25 iterations of the bootstrap are used as the resampling scheme. In this case, the number of iterations was increased to 200 due to the large number of samples in the training set. For this particular model, it turns out that there is an analytical method for directly estimating a suitable value of σ from the training data (Caputo et al. 2002). By default, the train function uses the sigest function in the kernlab package (Karatzoglou et al. 2004) to initialize this parameter. In doing this, the value of the cost parameter C is the only tuning parameter.

The train function has the following arguments:

x: a matrix or data frame of predictors. Currently, the function only accepts numeric values (i.e., no factors or character variables). In some cases, the model.matrix function may be needed to generate a data frame or matrix of purely numeric data.

y: a numeric or factor vector of outcomes. The function determines the type of problem (classification or regression) from the type of the response given in this argument.

method: a character string specifying the type of model to be used. See Table 1 for the possible values.

metric: a character string with values of "Accuracy", "Kappa", "RMSE" or "Rsquared". This value determines the objective function used to select the final model. For example, selecting "Kappa" makes the function select the tuning parameters with the largest value of the mean Kappa statistic computed from the held-out samples.

trControl: takes a list of control parameters for the function. The type of resampling as well as the number of resampling iterations can be set using this list. The function trainControl can be used to compute default parameters. The default number of resampling iterations is 25, which may be too small to obtain accurate performance estimates in some cases.

tuneLength: controls the size of the default grid of tuning parameters. For each model, train will select a grid of complexity parameters as candidate values. For the SVM model, the function will tune over C = 10^−1, 10^0, 10^1. To expand the size of the default list, the tuneLength argument can be used. By selecting tuneLength = 5, values of C ranging from 0.1 to 1,000 are evaluated.

tuneGrid: can be used to define a specific grid of tuning parameters. See the example below.

...: the three dots can be used to pass additional arguments to the functions listed in Table 1. For example, we have already centered and scaled the predictors, so the argument scaled = FALSE can be passed to the ksvm function to avoid duplication of the pre-processing.

We can tune and build the SVM model using the code below.

R> bootControl <- trainControl(number = 200)
R> set.seed(2)
R> svmFit <- train(trainDescr, trainClass,
+      method = "svmRadial", tuneLength = 5,
+      trControl = bootControl, scaled = FALSE)
Model 1: sigma=0.0004329517, C=1e-01
Model 2: sigma=0.0004329517, C=1e+00
Model 3: sigma=0.0004329517, C=1e+01
Model 4: sigma=0.0004329517, C=1e+02
Model 5: sigma=0.0004329517, C=1e+03
R> svmFit

Call:
train.default(x = trainDescr, y = trainClass, method = "svmRadial",
    scaled = FALSE, trControl = bootControl, tuneLength = 5)

3252 samples
 650 predictors

summary of bootstrap (200 reps) sample sizes:
    3252, 3252, 3252, 3252, 3252, 3252, ...

boot resampled training results across tuning parameters:

  sigma     C     Accuracy  Kappa  Accuracy SD  Kappa SD  Selected
  0.000433  0.1   0.705     0.395  0.0122       0.0252
  0.000433  1     0.806     0.606  0.0109       0.0218
  0.000433  10    0.818     0.631  0.0104       0.0211   *
  0.000433  100   0.8       0.595  0.0112       0.0227
  0.000433  1000  0.782     0.558  0.0111       0.0223

Accuracy was used to select the optimal model

In this output, each row in the table corresponds to a specific combination of tuning parameters. The "Accuracy" column is the average accuracy of the 200 held-out samples and the column labeled "Accuracy SD" is the standard deviation of the 200 accuracies.

The Kappa statistic is a measure of concordance for categorical data that measures agreement relative to what would be expected by chance. Values of 1 indicate perfect agreement, while a value of zero would indicate a lack of agreement. Negative Kappa values can also occur, but are less common since they would indicate a negative association between the observed and predicted data.
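For concreteness, the unweighted Kappa statistic can be computed from a confusion matrix as (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. A hedged base-R sketch with illustrative counts:

kappa_stat <- function(tab) {
  n  <- sum(tab)
  po <- sum(diag(tab)) / n                      # observed agreement
  pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (po - pe) / (1 - pe)
}
tab <- matrix(c(1400, 250, 340, 1262), nrow = 2) # illustrative 2x2 counts
kappa_stat(tab)                                  # roughly 0.64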
Kappa is an excellent performance measure when the classes are highly unbalanced. For example, if the mutagenicity rate in the data had been very small, say 5%, most models could achieve high accuracy by predicting all compounds to be nonmutagenic. In this case, the Kappa statistic would result in a value near zero. The Kappa statistic given here is the unweighted version computed by the classAgreement function in the e1071 package (Dimitriadou et al. 2008). The Kappa columns in the output above are also summarized across the 200 resampled Kappa estimates.

As previously mentioned, the "optimal" model is selected to be the candidate model with the largest accuracy. If more than one tuning parameter is "optimal" then the function will try to choose the combination that corresponds to the least complex model. For these data, σ was estimated to be 0.000433 and C = 10 appears to be optimal. Based on these values, the model was refit to the original set of 3,252 samples and this object is stored in svmFit$finalModel.

R> svmFit$finalModel
Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter: cost C = 10

Gaussian Radial Basis kernel function.
 Hyperparameter: sigma = 0.000432951668058316

Number of Support Vectors: 1616
Objective Function Value: -9516.185
Training error: 0.082411
Probability model included.

Table 1: Models used in train (* indicates that a model-specific variable importance method is available, see Section 9).

Model                                method value       Package         Tuning parameters
Recursive partitioning               rpart              rpart*          maxdepth
                                     ctree              party           mincriterion
Boosted trees                        gbm                gbm*            interaction.depth, n.trees, shrinkage
                                     blackboost         mboost          maxdepth, mstop
                                     ada                ada             maxdepth, iter, nu
Other boosted models                 glmboost           mboost          mstop
                                     gamboost           mboost          mstop
                                     logitboost         caTools         nIter
Random forests                       rf                 randomForest*   mtry
                                     cforest            party           mtry
Bagged trees                         treebag            ipred           None
Neural networks                      nnet               nnet            decay, size
Partial least squares                pls*               pls, caret      ncomp
Support vector machines (RBF)        svmRadial          kernlab         sigma, C
Support vector machines (poly)       svmPoly            kernlab         scale, degree, C
Gaussian processes (RBF)             gaussprRadial      kernlab         sigma
Gaussian processes (poly)            gaussprPoly        kernlab         scale, degree
Linear least squares                 lm*                stats           None
Multivariate adaptive regression     earth*, mars       earth           degree, nprune
  splines
Bagged MARS                          bagEarth*          caret, earth    degree, nprune
Elastic net                          enet               elasticnet      lambda, fraction
The lasso                            lasso              elasticnet      fraction
Relevance vector machines (RBF)      rvmRadial          kernlab         sigma
Relevance vector machines (poly)     rvmPoly            kernlab         scale, degree
Linear discriminant analysis         lda                MASS            None
Stepwise diagonal discriminant       sddaLDA, sddaQDA   SDDA            None
  analysis
Logistic/multinomial regression      multinom           nnet            decay
Regularized discriminant analysis    rda                klaR            lambda, gamma
Flexible discriminant analysis       fda*               mda, earth      degree, nprune
  (MARS basis)
Bagged FDA                           bagFDA*            caret, earth    degree, nprune
Least squares support vector         lssvmRadial        kernlab         sigma
  machines (RBF)
k nearest neighbors                  knn3               caret           k
Nearest shrunken centroids           pam*               pamr            threshold
Naive Bayes                          nb                 klaR            usekernel
Generalized partial least squares    gpls               gpls            K.prov
Learned vector quantization          lvq                class           k
In many cases, more control over the grid of tuning parameters is needed. For example, for boosted trees using the gbm function in the gbm package (Ridgeway 2007), we can tune over the number of trees (i.e., boosting iterations), the complexity of the tree (indexed by interaction.depth) and the learning rate (also known as shrinkage). As an example, a user could specify a grid of values to tune over using a data frame where the rows correspond to tuning parameter combinations and the columns are the names of the tuning variables (preceded by a dot). For our data, we will generate a grid of 50 combinations and use the tuneGrid argument to the train function to use these values.

R> gbmGrid <- expand.grid(.interaction.depth = (1:5) * 2,
+      .n.trees = (1:10) * 25, .shrinkage = .1)
R> set.seed(2)
R> gbmFit <- train(trainDescr, trainClass,
+      method = "gbm", trControl = bootControl, verbose = FALSE,
+      bag.fraction = 0.5, tuneGrid = gbmGrid)
Model 1: interaction.depth=2, shrinkage=0.1, n.trees=250
collapsing over other values of n.trees
Model 2: interaction.depth=4, shrinkage=0.1, n.trees=250
collapsing over other values of n.trees
Model 3: interaction.depth=6, shrinkage=0.1, n.trees=250
collapsing over other values of n.trees
Model 4: interaction.depth=8, shrinkage=0.1, n.trees=250
collapsing over other values of n.trees
Model 5: interaction.depth=10, shrinkage=0.1, n.trees=250
collapsing over other values of n.trees

In this model, we generated 200 bootstrap replications for each of the 50 candidate models, computed performance and selected the model with the largest accuracy. In this case the model automatically selected an interaction depth of 8 and used 250 boosting iterations (although other values may very well be appropriate; see Figure 1).

[Figure 1 omitted here. Caption: Examples of plot functions for train objects. (a) A plot of the classification accuracy versus the tuning factors (using plot(gbmFit)). (b) Similarly, a plot of the Kappa statistic profiles (plot(gbmFit, metric = "Kappa")). (c) A level plot of the accuracy values (plot(gbmFit, plotType = "level")). (d) Density plots of the 200 bootstrap estimates of accuracy and Kappa for the final model (resampleHist(gbmFit)).]

There are a variety of different visualizations for train objects. Figure 1 shows several example plots created using plot.train and resampleHist. Note that the output states that the procedure was "collapsing over other values of n.trees".
For some models (method values of pls, plsda, earth, rpart, gbm, gamboost, glmboost, blackboost, ctree, pam, enet and lasso), the train function will fit a model that can be used to derive predictions for some sub-models. For example, since boosting models save the model results for each iteration of boosting, train can fit the model with the largest number of iterations and derive the other models where the other tuning parameters are the same but a smaller number of boosting iterations is requested. In the example above, for a model with interaction.depth = 2 and shrinkage = .1, we only need to fit the model with the largest number of iterations (250 in this example). Holding the interaction depth and shrinkage constant, the computational cost of getting predictions for models with fewer than 250 iterations is relatively cheap. For the example above, we fit a total of 200 × 5 = 1,000 models instead of 200 × 5 × 10 = 10,000. The train function tries to exploit this idea for as many models as possible.

For recursive partitioning models, an initial model is fit to all of the training data to obtain the possible values of the maximum depth of any node (maxdepth). The tuning grid is created based on these values. If tuneLength is larger than the number of possible maxdepth values determined by the initial model, the grid will be truncated to the maxdepth list. The same is also true for nearest shrunken centroid models, where an initial model is fit to find the range of possible threshold values, and MARS models (see Section 7).

Also, for the glmboost and gamboost functions from the mboost package (Hothorn and Bühlmann 2007), an additional tuning parameter, prune, is used by train. If prune = "yes", the number of trees is reduced based on the AIC statistic. If "no", the number of trees is kept at the value specified by the mstop parameter. See Bühlmann and Hothorn (2007) for more details about AIC pruning.

In general, the functions in the caret package assume that there are no missing values in the data or that these values have been handled via imputation or other means. For the train function, there are some models (such as rpart) that can handle missing values. In these cases, the data passed to the x argument can contain missing values.

5. Prediction of new samples

As previously noted, an object of class train contains an element called finalModel, which is the fitted model with the tuning parameter values selected by resampling. This object can be used in the traditional way to generate predictions for new samples, using that model's predict function. For the most part, the prediction functions in R follow a consistent syntax, but there are exceptions. For example, boosted tree models produced by the gbm function also require the number of trees to be specified. Also, predict.mvr from the pls package (Mevik and Wehrens 2007) will produce predictions for every candidate value of ncomp that was tested. To avoid having to remember these nuances, caret offers several functions to deal with these issues. The function predict.train is an interface to the model's predict method that handles any extra parameter specifications (such as previously mentioned for gbm and PLS models). For example:
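The example itself is cut off in this copy of the paper. A hedged sketch of the intended call pattern, assuming the usual predict syntax for train objects:

predClasses <- predict(svmFit, newdata = testDescr)  # class predictions
predClasses[1:5]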
English essay template: predictive ability

English answer:

Predictive Modeling
Predictive modeling is a powerful technique that uses data and statistical algorithms to make predictions about future events. It is widely used in various industries, such as marketing, finance, and healthcare, to improve decision-making and optimize outcomes.

The process of predictive modeling involves collecting relevant data, cleaning and preparing it, selecting appropriate algorithms, training the models, and evaluating their performance. Once a model is trained, it can be used to predict future values or outcomes based on new input data.

There are various types of predictive modeling techniques available, including:

Regression: Predicts continuous values, such as sales revenue or customer churn.

Classification: Predicts categorical values, such as whether an individual will default on a loan or purchase a product.

Clustering: Groups data into similar categories based on their characteristics.

Time Series Forecasting: Predicts future values of a time series based on historical patterns.

Predictive modeling offers several benefits, including:

Improved decision-making: Enables businesses to make informed decisions based on data-driven insights.

Increased efficiency: Automates predictions and frees up resources for other tasks.

Personalized experiences: Allows for targeted marketing, product recommendations, and customer segmentation.

Risk assessment: Identifies risks and potential vulnerabilities in various areas.

However, it is important to note that predictive modeling is not a perfect science and comes with certain limitations and challenges:

Data quality: The accuracy of predictions depends heavily on the quality and availability of data.

Model selection: Choosing the appropriate algorithm is crucial for effective predictions.

Overfitting: Models can be trained too closely to the training data, resulting in poor performance on new data.

Ethical considerations: Predictive modeling can raise ethical concerns related to data privacy, bias, and fairness.

To mitigate these challenges, it is essential to ensure data quality, carefully evaluate models, avoid overfitting, and address ethical considerations. Predictive modeling remains a valuable tool when used responsibly and with appropriate caution.
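As a concrete, hedged illustration of the regression case, using R's built-in mtcars data:

fit <- lm(mpg ~ wt + hp, data = mtcars)    # train a linear regression model
new_car <- data.frame(wt = 2.8, hp = 120)  # new input data
predict(fit, newdata = new_car)            # predicted outcome (fuel economy)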
Validation for predictive models — a reply

What is predictive model validation? Predictive model validation is the process of assessing whether a predictive model is effective and reliable. Before a predictive model is put to use, we need to confirm its accuracy and stability, to ensure that it can deliver good predictive performance in a real-world environment. Validation is a critical step: it helps us judge whether the model is reliable enough to be used for predictions on unseen data.

The steps of predictive model validation are as follows; a combined sketch in R follows step 7.

1. Dataset splitting: First, split the dataset into a training set and a test set, typically using 70% of the data for training and the remaining 30% for testing. The purpose of the split is to fit the model parameters on the training set and then verify the model's predictive performance on the test set.

2. Feature selection: During feature selection, choose the features that have an important influence on the prediction target. Feature importance is usually assessed with statistical methods (e.g., correlation analysis, analysis of variance) and machine learning algorithms (e.g., decision trees, random forests). Choosing suitable features can improve the model's predictive power and its ability to generalize.

3. Model building: Choosing an appropriate predictive model is a key step in validation. Common predictive models include linear regression, logistic regression, decision trees, and support vector machines. During model building, extract effective features and choose suitable model parameters and algorithms.

4. Model training: Train the model on the training split, adjusting the model parameters so that it fits the training data better. During training, cross-validation can be used to assess model performance and select the best parameter combination.

5. Model evaluation: Use the held-out test set to evaluate the model's predictive performance. Common evaluation metrics include mean absolute error (MAE), root mean square error (RMSE), accuracy, and recall. From these metrics we can judge the model's stability and accuracy.

6. Model tuning: If predictive performance is poor, it can be improved by adjusting model parameters, adding features, or changing the algorithm. Tuning improves not only the model's accuracy but also its robustness.

7. Model application: Once validation is complete, the model can be applied to make predictions on unseen data. Predictions on unseen data allow us to assess the model's generalization ability and practical usefulness.
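A minimal end-to-end sketch of steps 1-5 in base R (hedged; the dataset, features and the exact 70/30 split are illustrative):

set.seed(42)
n   <- nrow(mtcars)
idx <- sample(n, size = round(0.7 * n))  # step 1: 70/30 split
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]
fit  <- lm(mpg ~ wt + hp, data = train)  # steps 3-4: build and train
pred <- predict(fit, newdata = test)     # prediction on held-out data
mae  <- mean(abs(pred - test$mpg))       # step 5: mean absolute error
rmse <- sqrt(mean((pred - test$mpg)^2))  # step 5: root mean square error
c(MAE = mae, RMSE = rmse)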
Mock AI interview questions and answers (English)

1. Question: What is the difference between a neural network and a deep learning model?
Answer: A neural network is a set of algorithms modeled loosely after the human brain that are designed to recognize patterns. A deep learning model is a neural network with multiple layers, allowing it to learn more complex patterns and features from data.

2. Question: Explain the concept of 'overfitting' in machine learning.
Answer: Overfitting occurs when a machine learning model learns the training data too well, including its noise and outliers, resulting in poor generalization to new, unseen data.

3. Question: What is the role of 'bias' in an AI model?
Answer: Bias in an AI model refers to the systematic errors introduced by the model during the learning process. It can be due to the choice of model, the training data, or the algorithm's assumptions, and it can lead to unfair or inaccurate predictions.

4. Question: Describe the importance of data preprocessing in AI.
Answer: Data preprocessing is crucial in AI as it involves cleaning, transforming, and reducing the data to a suitable format for the model to learn effectively. Proper preprocessing can significantly improve the performance of AI models by ensuring that the input data is relevant, accurate, and free from noise.

5. Question: How does reinforcement learning differ from supervised learning?
Answer: Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward signal. It differs from supervised learning, where the model learns from labeled data to predict outcomes based on input features.

6. Question: What is the purpose of a 'convolutional neural network' (CNN)?
Answer: A convolutional neural network (CNN) is a type of deep learning model that is particularly effective for processing data with a grid-like topology, such as images. CNNs use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.

7. Question: Explain the concept of 'feature extraction' in AI.
Answer: Feature extraction in AI is the process of identifying and extracting relevant pieces of information from the raw data. It is a crucial step in many machine learning algorithms, as it helps to reduce the dimensionality of the data and to focus on the most informative aspects that can be used to make predictions or classifications.

8. Question: What is the significance of 'gradient descent' in training AI models?
Answer: Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. In the context of AI, it is used to minimize the loss function of a model, thus refining the model's parameters to improve its accuracy.

9. Question: How does 'transfer learning' work in AI?
Answer: Transfer learning is a technique where a pre-trained model is used as the starting point for learning a new task. It leverages the knowledge gained from one problem to improve performance on a different but related problem, reducing the need for large amounts of labeled data and computational resources.

10. Question: What is the role of 'regularization' in preventing overfitting?
Answer: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages overly complex models. It helps to control the model's capacity, forcing it to generalize better to new data by not fitting too closely to the training data.
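To make question 8 concrete, here is a hedged sketch of gradient descent minimizing mean squared error for a one-variable linear model; the learning rate and iteration count are arbitrary illustrative choices:

# Gradient descent for y ~ a + b*x, minimizing mean squared error.
gd_linreg <- function(x, y, lr = 0.01, iters = 5000) {
  a <- 0; b <- 0
  for (i in seq_len(iters)) {
    resid <- a + b * x - y
    a <- a - lr * 2 * mean(resid)      # partial derivative of MSE w.r.t. a
    b <- b - lr * 2 * mean(resid * x)  # partial derivative of MSE w.r.t. b
  }
  c(intercept = a, slope = b)
}
x <- seq(0, 1, length.out = 50)
y <- 2 + 3 * x + rnorm(50, sd = 0.1)
gd_linreg(x, y)  # should approach intercept 2, slope 3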
A predictive model (ETLM) for arsenate adsorption and surface speciation on oxides consistent with spectroscopic and theoretical molecular evidence

Keisuke Fukushi [1] and Dimitri A. Sverjensky*
Department of Earth and Planetary Sciences, Johns Hopkins University, Baltimore, MD 21218, USA
Received 7 September 2006; accepted in revised form 9 May 2007; available online 7 June 2007
Geochimica et Cosmochimica Acta 71 (2007) 3717-3745. doi:10.1016/j.gca.2007.05.018
* Corresponding author. Fax: +1 410 516 7933. E-mail addresses: fukushi@kenroku.kanazawa-u.ac.jp (K. Fukushi), sver@ (D.A. Sverjensky).
[1] Present address: Institute of Nature and Environmental Technology, Kanazawa University, Kakuma-machi, Kanazawa, Ishikawa 920-1192, Japan.

Abstract

The nature of adsorbed arsenate species for a wide range of minerals and environmental conditions is fundamental to prediction of the migration and long-term fate of arsenate in natural environments. Spectroscopic experiments and theoretical calculations have demonstrated the potential importance of a variety of arsenate surface species on several iron and aluminum oxides. However, integration of the results of these studies with surface complexation models and extrapolation over wide ranges of conditions and for many oxides remains a challenge. In the present study, in situ X-ray and infrared spectroscopic and theoretical molecular evidence of arsenate (and the analogous phosphate) surface speciation are integrated with an extended triple layer model (ETLM) of surface complexation, which takes into account the electrostatic work associated with the ions and the water dipoles involved in inner-sphere surface complexation by the ligand exchange mechanism. Three reactions forming inner-sphere arsenate surface species,

2>SOH + H3AsO4^0 = (>SO)2AsO2^- + H+ + 2H2O

2>SOH + H3AsO4^0 = (>SO)2AsOOH + 2H2O

and

>SOH + H3AsO4^0 = >SOAsO3^2- + 2H+ + H2O

were found to be consistent with adsorption envelope, adsorption isotherm, proton surface titration and proton coadsorption data for arsenate on hydrous ferric oxide (HFO), ferrihydrite, four goethites, amorphous aluminum oxide, α-Al2O3, β-Al(OH)3, and α-Al(OH)3 up to surface coverages of about 2.5 μmol m^-2. At higher surface coverages, adsorption is not the predominant mode of arsenate sorption. The four goethites showed a spectrum of model arsenate surface speciation behavior from predominantly binuclear to mononuclear. Goethite in the middle of this spectrum of behavior (selected as a model goethite) showed predicted changes in arsenate surface speciation with changes in pH, ionic strength and surface coverage very closely consistent with qualitative trends inferred in published in situ X-ray and infrared spectroscopic studies of arsenate and phosphate on several additional goethites. The model speciation results for arsenate on HFO, and on α- and β-Al(OH)3, were also consistent with X-ray and molecular evidence.

The equilibrium constants for arsenate adsorption expressed in terms of site-occupancy standard states show systematic differences for different solids, including the model goethite. The differences can be explained with the aid of Born solvation theory, which enables the development of a set of predictive equations for arsenate adsorption equilibrium constants on all oxides. The predictive equations indicate that the Born solvation effect for mononuclear species is much stronger than for binuclear species. This favors the development of the mononuclear species on iron oxides such as HFO, with high values of the dielectric constant, relative to aluminum oxides such as gibbsite, with much lower dielectric constants.
However, on hematite and corundum, with similar dielectric constants, the predicted surface speciations of arsenate are similar: at lower pH values and/or higher surface coverages, binuclear species are predicted to predominate; at higher pH values and/or lower surface coverages, mononuclear species predominate. © 2007 Elsevier Ltd. All rights reserved.

1. INTRODUCTION

Arsenic has received a great deal of public attention because of its links to certain types of cancers and its high levels in some drinking water supplies (Nordstrom, 2002; Hopenhayn, 2006). Arsenate is the predominant species of arsenic in oxidized aquatic systems (Myneni et al., 1998). The concentration of arsenate in natural waters is strongly influenced by adsorption on oxide surfaces (Fuller and Davis, 1989; Pichler et al., 1999; Fukushi et al., 2003). To predict the migration and long-term fate of arsenate in natural environments, the behavior and the nature of adsorbed arsenate species must be known for a wide variety of minerals and over the full range of environmental conditions.

The surface speciation of arsenate on oxide powders and on single-crystal surfaces has been studied through X-ray spectroscopic and infrared investigations as well as theoretical molecular calculations. On powders, a variety of surface species have been inferred from extended X-ray absorption fine structure (EXAFS) studies. The majority of these studies have been concerned with iron oxides and a very limited range of ionic strengths. For arsenate on ferrihydrite and FeOOH polymorphs (goethite, akaganeite and lepidocrocite) at pH 8 and I = 0.1, it was inferred that arsenate adsorbed mainly as an inner-sphere bidentate-binuclear complex (Waychunas et al., 1993). Monodentate-mononuclear complexes were also inferred. The monodentate/bidentate ratio decreased with increasing arsenate surface coverage. In contrast, Manceau (1995) reinterpreted the EXAFS spectra as indicating bidentate-mononuclear species. The possible existence of such a species, in addition to the ones described by Waychunas et al. (1993), was described in Waychunas et al. (2005). All three of the above species were subsequently inferred for arsenate adsorption on goethite from EXAFS results obtained at pH 6-9 and I = 0.1 (Fendorf et al., 1997). The monodentate-mononuclear species was inferred to dominate at higher pH values and lower surface coverages, whereas the bidentate-binuclear species dominated at lower pH and higher surface coverages. The bidentate-mononuclear species was detected in only minor amounts at high surface coverage. For arsenate on goethite and lepidocrocite at pH 6, a bidentate-binuclear species was inferred (Farquhar et al., 2002). The same species was inferred for arsenate on goethite, lepidocrocite, hematite and ferrihydrite at pH 4 and 7 and I = 0.1 through EXAFS combined with molecular calculations (Sherman and Randall, 2003).
For arsenate on hematite, as a function of pH (4.5-8) and surface coverage, Arai et al. (2004) interpreted EXAFS data to infer a predominant bidentate-binuclear and a minor amount of a bidentate-mononuclear species. In the only powder studies of aluminum oxides, EXAFS and XANES of arsenate adsorption on β-Al(OH)3 at pH 4, 8, and 10 and I = 0.1 and 0.8 by Arai et al. (2001) showed that arsenate adsorbs to β-Al(OH)3 as a bidentate-binuclear inner-sphere species (their XANES data did not indicate any outer-sphere species regardless of pH and ionic strength), and an EXAFS study of arsenate on α-Al(OH)3 at pH 5.5 (in dilute Na-arsenate solutions) also showed that arsenate forms an inner-sphere bidentate-binuclear species (Ladeira et al., 2001).

On single-crystal surfaces of corundum and hematite, several arsenate surface species have been reported recently. For the (0001) and (10-12) planes of hematite, prepared at pH 5 with 10^-4 M sodium arsenate solutions, then studied with GIXAFS and surface diffraction in a humid atmosphere, 71-78% of the adsorbed arsenate was determined to be a bidentate-mononuclear species, with the remainder present as a bidentate-binuclear species (Waychunas et al., 2005). However, in bulk water at pH 5, resonant anomalous X-ray reflectivity and standing-wave studies of the (01-12) planes of corundum and hematite have established a bidentate-binuclear species, and the first reported occurrence of an outer-sphere or H-bonded arsenate species (Catalano et al., 2006a,b, 2007). In the present study, we wish to emphasize that distinguishing between an outer-sphere and an H-bonded species is not possible in surface complexation models of reaction stoichiometry, although considerations of Born solvation theory may facilitate the distinction.

Overall, the X-ray studies on powders and single crystals provide strong indications of a variety of surface arsenate species, presumably present under different conditions of pH, ionic strength and surface coverage, as well as being present on different solids and crystallographically different surfaces. Even on a single crystal surface, a great variety of possible surface species coordination geometries can be proposed, e.g. monodentate, bidentate and tridentate possibilities (Waychunas et al., 2005). However, the most frequently reported inner-sphere arsenate species coordination geometries in X-ray studies remain the bidentate-binuclear, bidentate-mononuclear, and monodentate-mononuclear types of species. In this regard, it should also be emphasized that none of the protonation states of these coordination geometries are established by X-ray studies. A variety of possible protonation states adds even further potential complexity to the speciation of adsorbed arsenate. It is here that other spectroscopic techniques (e.g. infrared and Raman) and molecular calculations, as well as the reaction stoichiometries in surface complexation models, can possibly help.

In situ ATR-FTIR studies of arsenate on ferrihydrite and amorphous hydrous ferric hydroxide have inferred that adsorption of arsenate occurred as inner-sphere species (Roddick-Lanzilotta et al., 2002; Voegelin and Hug, 2003). An ex situ FTIR study (Sun and Doner, 1996) has inferred binuclear-bridging complexes of arsenate on goethite. However, the latter noted that their results could be affected by not having bulk water present in their spectroscopic experiment. In addition, IR studies of arsenate adsorption may show only small or no shifts with changes of pH (Goldberg and Johnston, 2001).
As a consequence, the assignment of arsenate surface species from infrared and Raman spectra is sufficiently difficult (Myneni et al., 1998) that a definitive in situ study has yet to be done. Consequently, in the present study we have made use of in situ infrared results for an analogous anion, phosphate.

Phosphate and arsenate have very similar aqueous protonation characteristics and are widely considered to have very similar adsorption characteristics (Waychunas et al., 1993; Hiemstra and van Riemsdijk, 1999; Goldberg and Johnston, 2001; Arai et al., 2004). Furthermore, infrared studies of phosphate adsorption indicate enough shifts of the spectra with experimental conditions to assign the phosphate surface species (Tejedor-Tejedor and Anderson, 1990; Persson et al., 1996; Arai and Sparks, 2001; Loring et al., 2006; Nelson et al., 2006). The phosphate surface species inferred by Tejedor-Tejedor and Anderson (1990) have already been used to develop a CD surface complexation model for arsenate on goethite (Hiemstra and van Riemsdijk, 1999). In the present study, we take a similar approach using the studies by Tejedor-Tejedor and Anderson (1990), Arai and Sparks (2001), and Elzinga and Sparks (2007), which are the definitive published studies of phosphate adsorption referring to in situ conditions. It should be emphasized here that, in detail, there are differences between some of the surface species inferred in the above studies. In the present study, we investigated a surface complexation application of the species suggested by these authors to see if a consistent set of species would apply to a wide range of oxides.

According to Tejedor-Tejedor and Anderson (1990), phosphate on goethite adopts three surface species: protonated bidentate-binuclear, deprotonated bidentate-binuclear and deprotonated monodentate species. The relative abundances of these species were shown to be strong functions of pH and surface coverage at an ionic strength of 0.01. The protonated bidentate-binuclear species predominated at pH values of about 3.6 to 5.5-6.0 (at surface coverages of 190 and 150 μmol/g). The deprotonated bidentate-binuclear species became predominant between pH values of about 6-8.3. The monodentate species was detectable at pH values as low as 6 at surface coverages of 190 μmol/g and increased in relative abundance as the surface coverage decreased towards 50 μmol/g. Consistent results have been obtained for phosphate on ferrihydrite (Arai and Sparks, 2001). The latter assigned protonated bidentate-binuclear species at pH values of 4 to 6 and surface coverages of 0.38-2.69 μmol/m², and a deprotonated bidentate-binuclear species at pH > 7.5, as well as a possible non-protonated Na-phosphate surface species at high pH values. More recently, in an investigation of phosphate adsorption on hematite, Elzinga and Sparks (2007) suggested a bridging binuclear protonated species, a protonated mononuclear species and a deprotonated mononuclear species. It was also suggested that the "bridging" species may instead be mononuclear with a strong H-bond to an adjacent Fe-octahedron. Broadly speaking, the in situ infrared studies for phosphate adsorption on iron oxides are consistent with the results of the X-ray studies for arsenate discussed above (although the X-ray studies cannot provide information on protonation states). However, the above infrared studies of phosphate provide invaluable additional information on the possible states of protonation of the surface phosphate species, which we use as a first step in developing a surface complexation model for arsenate adsorption below.
The surface structure and protonation state of adsorbed arsenate have been addressed through theoretical density functional theory (DFT) and molecular orbital density functional theory (MO/DFT) calculations (Ladeira et al., 2001; Sherman and Randall, 2003; Kubicki, 2005). DFT calculations for arsenate with aluminum oxide clusters indicated that a bidentate-binuclear species was more stable than bidentate-mononuclear, monodentate-mononuclear or monodentate-binuclear species (Ladeira et al., 2001). From predicted geometries of arsenate adsorbed on iron hydroxide clusters using DFT, compared with EXAFS data, it was suggested that arsenate was adsorbed as a doubly protonated species (Sherman and Randall, 2003). Predicted energetics indicated that a bidentate-binuclear species was more stable than bidentate-mononuclear and monodentate species. In contrast to these DFT calculations, MO/DFT calculations using a different type of iron hydroxide cluster (containing solvated waters) showed that deprotonated arsenate surface species resulted in good agreement with spectroscopic data (Kubicki, 2005). The bidentate-binuclear configuration is most consistent with spectroscopic data, but the model Gibbs free energies of adsorption of clusters suggested that the monodentate configuration is energetically more stable as a cluster. The MO/DFT calculations also indicated a strong preference for arsenate to bind to iron clusters relative to aluminum hydroxide clusters. Detailed molecular simulations of phosphate on iron oxide clusters (Kwon and Kubicki, 2004) emphasize mononuclear species with differing protonation states that agree very well with an ex situ infrared study of phosphate on goethite (Persson et al., 1996).

Numerous studies have focused on surface complexation modeling of arsenate on oxides (Goldberg, 1986; Dzombak and Morel, 1990; Manning and Goldberg, 1996; Grossl et al., 1997; Manning and Goldberg, 1997; Hiemstra and van Riemsdijk, 1999; Gao and Mucci, 2001; Goldberg and Johnston, 2001; Halter and Pfeifer, 2001; Dixit and Hering, 2003; Arai et al., 2004; Hering and Dixit, 2005).
Surface complexation modeling has the capability to predict the behavior and the nature of adsorbed arsenate species as functions of environmental parameters. However, the complexation reactions have not always been consistent with in situ spectroscopic results. Instead, with the exception of the study by Hiemstra and van Riemsdijk (1999), regression of macroscopic adsorption data has often been carried out merely to fit the data with the minimum number of surface species (Hering and Dixit, 2005). It is now widely recognized that the evidence of oxyanion speciation from spectroscopic studies should be integrated with models describing macroscopic adsorption and electrokinetic data (Suarez et al., 1997; Hiemstra and van Riemsdijk, 1999; Blesa et al., 2000; Goldberg and Johnston, 2001).

In the present study, we use the results of the in situ X-ray and infrared studies of arsenate and phosphate discussed above to guide the choice of surface species in the extended triple-layer model (ETLM) recently developed to account for the electrostatic effects of water-dipole desorption during inner-sphere surface complexation (Sverjensky and Fukushi, 2006a,b). We investigate the applicability of three spectroscopically identified species by fitting adsorption, proton surface titration in the presence of arsenate, and proton coadsorption with arsenate on a wide range of oxides including goethite, hydrous ferric oxide (HFO), ferrihydrite, hematite, amorphous aluminum oxide, β-Al(OH)3, α-Al(OH)3 and α-Al2O3 (Manning and Goldberg, 1996; Jain et al., 1999; Jain and Loeppert, 2000; Arai et al., 2001, 2004; Gao and Mucci, 2001; Goldberg and Johnston, 2001; Halter and Pfeifer, 2001; Dixit and Hering, 2003). These data were selected to cover wide ranges of pH, ionic strength and surface coverage, and a wider range of oxide types than have been investigated spectroscopically.
Other arsenate data examined in the present study (Rietra et al., 1999; Arai et al., 2004; Dutta et al., 2004; Pena et al., 2005, 2006; Zhang and Stanforth, 2005) did not involve a wide enough range of conditions to permit retrieval of equilibrium constants for three surface species. Electrokinetic data involving arsenate adsorption were also examined in the present study (e.g. Anderson and Malotky, 1979; Arai et al., 2001; Goldberg and Johnston, 2001; Pena et al., 2006). However, these data refer to a much more limited range of conditions.

The main purpose of the present study is to determine the effects of pH, ionic strength, surface coverage and type of solid on the surface speciation of arsenate identified spectroscopically. The results are then compared with independent trends established in X-ray and infrared spectroscopic studies for arsenate and phosphate. A second goal of the present study is to provide a predictive basis for arsenate surface speciation on all oxides. By adopting an internally consistent set of standard states, systematic differences in the equilibrium constants for the surface arsenate species from one oxide to another can be established and explained with the aid of Born solvation theory. This makes it possible to predict arsenate adsorption and surface speciation on all oxides in 1:1 electrolyte systems.

2. ETLM TREATMENT OF ARSENATE ADSORPTION

2.1. Aqueous speciation, surface protonation and electrolyte adsorption

Aqueous speciation calculations were carried out taking into account aqueous ionic activity coefficients appropriate to single electrolytes up to high ionic strengths, calculated with the extended Debye-Hückel equation (Helgeson et al., 1981; Criscenti and Sverjensky, 1999). Electrolyte ion pairs used were consistent with previous studies (Criscenti and Sverjensky, 1999, 2002). Aqueous arsenate protonation equilibria were taken from a recent study (Nordstrom and Archer, 2003):

H+ + H2AsO4^- = H3AsO4^0,    log K = 2.30    (1)
H+ + HAsO4^2- = H2AsO4^-,    log K = 6.99    (2)
H+ + AsO4^3- = HAsO4^2-,     log K = 11.8    (3)

Aqueous arsenate complexation with the electrolyte cations used in experimental studies can be expected to occur, based on the analogous phosphate systems. However, the lack of experimental characterization of arsenate-electrolyte cation complexing prevented including this in the present study. Preliminary calculations carried out using Na-complexes of phosphate as a guide to the strength of Na-arsenate complexes indicated that the results of the present study were not significantly affected at ionic strengths of 0.1 and below.

The sample characteristics and the surface protonation and electrolyte adsorption equilibrium constants used in the present study are summarized in Table 1. Although HFO and ferrihydrite might be thought to have extremely similar surface chemical properties, we distinguished between them in our previous study of As(III) adsorption (Sverjensky and Fukushi, 2006b). By using differences in the pH_ZPC (i.e. 7.9 vs. 8.5, respectively; Table 1), we inferred different effective dielectric constants for these solids, which facilitated the use of Born solvation theory to explain the different magnitudes of the log K values for As(III) adsorption. It will be shown below that the same approach works for differences in arsenate adsorption, at least for these specific HFO and ferrihydrite samples.

Surface protonation constants referring to the site-occupancy standard states (denoted by the superscript "θ"), i.e. log K1^θ and log K2^θ, were calculated from values of pH_ZPC and ΔpK_n^θ (Sverjensky, 2005) using

log K1^θ = pH_ZPC − ΔpK_n^θ/2    (4)

and

log K2^θ = pH_ZPC + ΔpK_n^θ/2    (5)

Values of pH_ZPC were taken from measured low ionic strength isoelectric points or points-of-zero-salt-effect corrected for electrolyte adsorption (Sverjensky, 2005). For the goethite and HFO examined by Dixit and Hering (2003), neither isoelectric points nor surface titration data were reported. The value of pH_ZPC for the goethite was predicted for the present study using Born solvation and crystal chemical theory (Sverjensky, 2005). For HFO, the pH_ZPC was assumed to be the same as in Davis and Leckie (1978), consistent with our previous analysis for arsenite adsorption on the same sample of HFO (Sverjensky and Fukushi, 2006). Values of ΔpK_n^θ were predicted theoretically (Sverjensky, 2005).
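Returning to Eqs. (1)-(3), a hedged R sketch of the aqueous speciation they imply (activity coefficients are ignored, so this is only a rough illustration of the pH dependence):

logK <- c(2.30, 6.99, 11.8)  # stepwise protonation constants, Eqs. (1)-(3)
arsenate_fractions <- function(pH) {
  h <- 10^(-pH)
  # relative abundances with AsO4^3- as the reference species
  rel <- c(AsO4   = 1,
           HAsO4  = 10^logK[3] * h,
           H2AsO4 = 10^(logK[3] + logK[2]) * h^2,
           H3AsO4 = 10^(logK[3] + logK[2] + logK[1]) * h^3)
  rel / sum(rel)
}
round(arsenate_fractions(7), 3)  # H2AsO4^- and HAsO4^2- near-equal at pH 7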
h2¼pH ZPCþD p K hn2ð5ÞValues of pH ZPC were taken from measured low ionic strength isoelectric points or point-of-zero-salt effects cor-rected for electrolyte adsorption(Sverjensky,2005).For goethite and HFO examined by Dixit and Hering(2003), neither isoelectric points nor surface titration data were re-ported.The value of pH ZPC for the goethite was predicted for the present study using Born solvation and crystal chemical theory(Sverjensky,2005).For HFO,the pH ZPC was assumed to be the same as in Davis and Leckie (1978),consistent with our previous analysis for arsenite adsorption on the same sample of HFO(Sverjensky andFukushi,2006).Values of D p K hnwere predicted theoreti-cally(Sverjensky,2005).For convenience,protonation constants referring to the hypothetical 1.0molar standard state(denoted by the superscript‘‘0’’)are also given in Table1.The relationship between the two standard states is given by(Sverjensky, 2003)3720K.Fukushi,D.A.Sverjensky/Geochimica et Cosmochimica Acta71(2007)3717–3745log K01¼log K h1ÀlogN S A SN Að6Þlog K02¼log K h2þlogN S A SN Að7Þwhere,N S represents the surface site density on the s th solid sorbent(sites mÀ2);Nàrepresents the standard state sorbate species site density(sites mÀ2);A S represents the BET sur-face area of the s th solid sorbent(m2gÀ1);Aàrepresents a standard state BET surface area(m2gÀ1).In the present study,values of Nà=10·1018sites mÀ2 and Aà=10m2gÀ1are selected for all solids.Electrolyte adsorption equilibrium constants referring tosite-occupancy standard states,log K hMþand log K hLÀ,andcapacitances,C1,were obtained from regression of proton surface charge data when such data were available.Other-wise,these parameters were obtained from theoretical pre-dictions(Sverjensky,2005).For convenience,Table1also contains values of the electrolyte adsorption equilibrium constants relative to the>SOH species(denoted by the superscript‘‘*’’)and with respect to the hypothetical1.0molar standard state,i.e.logÃK0Mþand logÃK0LÀ.The rela-tionships of logÃK0Mþand logÃK0LÀto log K hMþand log K hLÀare given by(Sverjensky,2003,2005)logÃK0Mþ¼log K hMþÀpH ZPCÀD p K hnÀlogN S A SN Að8ÞlogÃK0LÀ¼log K hLÀþpH ZPCÀD p K hn2ÀlogN S A SN Að9ÞSurface areas(A S)were taken from experimental measure-ments(BET)for goethite,a-Al(OH)3,b-Al(OH)3and a-Al2O3.However,for samples of HFO,ferrihydrite and amÆAlO,it was assumed that the surface areas are 600m2.gÀ1consistent with recommendations made previ-ously(Davis and Leckie,1978;Dzombak and Morel, 1990)and our previous analyses of arsenite and sulfate adsorption on these oxides.Site densities(N S)together with arsenate adsorption equilibrium constants were derived from regression of adsorption data where these data referred to a wide enough range of surface coverages.This procedure was adopted because oxyanions probably form inner-sphere complexes on only a subset of the total sites available,most likely the singly coordinated oxygens at the surface(Catalano et al.,2006a,b,c,d;2007;Hiemstra and van Riemsdijk, 1999).Theoretical estimation of these site densities is impossible for powdered samples when the proportions of the crystal faces are unknown,as is the case for all the experimental studies analyzed below.Furthermore,for goethites,surface chemical characteristics vary widely,even for those synthesized in the absence of CO2(Sverjensky, 2005).Together with site densities for goethites already published(Sverjensky and Fukushi,2006a,b),the goethite site densities obtained by regression in the present study, with one 
Electrolyte adsorption equilibrium constants referring to site-occupancy standard states, log K_M+^θ and log K_L-^θ, and capacitances, C1, were obtained from regression of proton surface charge data when such data were available. Otherwise, these parameters were obtained from theoretical predictions (Sverjensky, 2005). For convenience, Table 1 also contains values of the electrolyte adsorption equilibrium constants relative to the >SOH species (denoted by the superscript "*") and with respect to the hypothetical 1.0 molar standard state, i.e. log *K_M+^0 and log *K_L-^0. The relationships of log *K_M+^0 and log *K_L-^0 to log K_M+^θ and log K_L-^θ are given by (Sverjensky, 2003, 2005)

log *K_M+^0 = log K_M+^θ − pH_ZPC − ΔpK_n^θ/2 − log[(N_S · A_S)/(N‡ · A‡)]    (8)

log *K_L-^0 = log K_L-^θ + pH_ZPC − ΔpK_n^θ/2 − log[(N_S · A_S)/(N‡ · A‡)]    (9)

Surface areas (A_S) were taken from experimental measurements (BET) for goethite, α-Al(OH)3, β-Al(OH)3 and α-Al2O3. However, for samples of HFO, ferrihydrite and am·AlO it was assumed that the surface areas are 600 m² g^-1, consistent with recommendations made previously (Davis and Leckie, 1978; Dzombak and Morel, 1990) and our previous analyses of arsenite and sulfate adsorption on these oxides.

Site densities (N_S), together with arsenate adsorption equilibrium constants, were derived from regression of adsorption data where these data referred to a wide enough range of surface coverages. This procedure was adopted because oxyanions probably form inner-sphere complexes on only a subset of the total sites available, most likely the singly coordinated oxygens at the surface (Catalano et al., 2006a,b,c,d, 2007; Hiemstra and van Riemsdijk, 1999). Theoretical estimation of these site densities is impossible for powdered samples when the proportions of the crystal faces are unknown, as is the case for all the experimental studies analyzed below. Furthermore, for goethites, surface chemical characteristics vary widely, even for those synthesized in the absence of CO2 (Sverjensky, 2005). Together with site densities for goethites already published (Sverjensky and Fukushi, 2006a,b), the goethite site densities obtained by regression in the present study, with one exception, support an empirical correlation between site density and surface area of goethite (Fukushi and Sverjensky, 2007). This correlation should enable estimation of site densities for those goethites for which oxyanion adsorption data over a wide range of surface coverage are not available. In the cases of arsenate adsorption on HFO, ferrihydrite, the goethite from Dixit and Hering (2003), and β-Al(OH)3 and am·AlO, the site densities were taken from our previous regression of arsenite adsorption data (Sverjensky and Fukushi, 2006b). We used a single site density for each sample, applied to all the surface equilibria, as described previously (Sverjensky and Fukushi, 2006b). All the calculations were carried out with the aid of the computer code GEOSURF (Sahai and Sverjensky, 1998).

2.2. Arsenate adsorption reaction stoichiometries

We constructed our ETLM for arsenate adsorption on oxides by choosing surface species based on the in situ X-ray and infrared spectroscopic studies of phosphate and arsenate and the molecular calculations for arsenate interaction with metal oxide clusters. We used three reaction stoichiometries expressed as the formation of inner-sphere surface species, as illustrated in Fig. 1. Additional species could be incorporated into the model, but even the data analyzed in the present study, which refer to a wide range of experimental conditions, did not permit retrieval of equilibrium constants for more than three surface species. We tested numerous reaction stoichiometries, but found that the best three involved surface species identical to the ones proposed in the in situ ATR-FTIR study of phosphate adsorption on goethite (Tejedor-Tejedor and Anderson, 1990) discussed above. For the most part, these species are also consistent with the X-ray studies cited above.

The three reactions produce a deprotonated bidentate-binuclear species according to

2>SOH + H3AsO4^0 = (>SO)2AsO2^- + H+ + 2H2O    (10)

*K^θ_(>SO)2AsO2- = [a_(>SO)2AsO2- · a_H+ · a_H2O^2] / [a_>SOH^2 · a_H3AsO4^0] × 10^(F(ΔΨ_r,10)/2.303RT)    (11)

a protonated bidentate-binuclear species according to

2>SOH + H3AsO4^0 = (>SO)2AsOOH + 2H2O    (12)

*K^θ_(>SO)2AsOOH = [a_(>SO)2AsOOH · a_H2O^2] / [a_>SOH^2 · a_H3AsO4^0] × 10^(F(ΔΨ_r,12)/2.303RT)    (13)

and a deprotonated monodentate species according to

>SOH + H3AsO4^0 = >SOAsO3^2- + 2H+ + H2O    (14)

*K^θ_>SOAsO3^2- = [a_>SOAsO3^2- · a_H+^2 · a_H2O] / [a_>SOH · a_H3AsO4^0] × 10^(F(ΔΨ_r,14)/2.303RT)    (15)

Here, the superscript "θ" again represents site-occupancy standard states (Sverjensky, 2003, 2005). The exponential terms contain the electrostatic factor, ΔΨ_r, which is a reaction property arising from the work done in an electric field when species in the reaction move on or off the charged surface. It is evaluated taking into account the adsorbing ions and the water dipoles released in Eqs. (10), (12) and (14). Electrostatic work is done in both instances and is reflected in the formulation of ΔΨ_r (Sverjensky and Fukushi, 2006a,b).
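Finally, a hedged sketch of the electrostatic factor common to Eqs. (11), (13) and (15); the ΔΨ_r value below is purely illustrative, since its evaluation from the individual plane potentials is not reproduced in this excerpt:

# 10^(F * dPsi_r / (2.303 * R * T)), the electrostatic factor in Eqs. (11)-(15).
elec_factor <- function(dPsi_r, TK = 298.15) {
  F_const <- 96485.3  # Faraday constant, C/mol
  R_const <- 8.3145   # gas constant, J/(mol K)
  10^(F_const * dPsi_r / (2.303 * R_const * TK))
}
elec_factor(-0.05)    # e.g., dPsi_r = -0.05 V (illustrative value)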