Statistical Models for Assessing the Individuality of Fingerprints
- 格式:pdf
- 大小:1.68 MB
- 文档页数:13
hansen检验stata指令-回复Hansen Test: An Introduction to its Importance and Stata CommandIntroduction:In econometrics, it is crucial to ensure the validity and reliability of statistical models that are used to analyze and draw conclusions from data. One important aspect of model validation is ensuring that the assumptions of the model are met, as violating these assumptions can lead to biased and inefficient estimates.The Hansen test, also known as the Sargan-Hansen test, is a statistical test used to examine the validity of instrumental variables (IV) regression models. It enables researchers to determine whether the instruments used in an IV regression model are strong and valid. The test provides a statistical assessment of the null hypothesis that the instruments are valid or that there is no correlation between the instruments and the error term in the regression model.Why is the Hansen test important?The use of instrumental variables regression models is prevalent in various fields, including economics, finance, and social sciences. These models are commonly used when dealing with endogeneity, where a variable of interest is correlated with the error term, violating the assumptions of ordinary least squares.To address endogeneity in these situations, instrumental variables are employed to replace the endogenous variable with one or more exogenous variables that are assumed to be correlated with the endogenous variable but not with the error term. However, the validity of instrumental variables is crucial to ensure unbiased estimates and valid inferences.The Hansen test is an essential tool for validating the instrumental variable assumption. It examines whether the instruments are exogenous and uncorrelated with the error term, which is a vital condition for reliable results from instrumental variables models. By assessing the validity of the instruments, the test helps researchers make informed decisions regarding the suitability of instrumental variables in their models.Using Hansen test in Stata:Stata, a popular statistical software package widely used in academia and industry, offers a command for conducting the Hansen test. In Stata, the command "xtoverid" is used to implement this test.Step 1: First, ensure that you have installed and loaded the necessary IV regression packages in Stata. The commonly used packages include "ivregress" and "reghdfe."Step 2: Once the packages are loaded, estimate your IV regression model using the appropriate command. For instance, the "ivregress 2sls" command can be used to estimate a two-stage least squares (2SLS) regression model.Step 3: After estimating the model, use the "xtoverid" command to perform the Hansen test. This command takes the estimated model as input. For example, if your model is named "my_iv_model," the command would be written as "xtoverid my_iv_model."Interpreting the results:When executed, the "xtoverid" command provides several statistics to assess the validity of the instrumental variables. The primary statistic of interest is the Hansen J-statistic. This statistic follows a chi-square distribution, and its significance level determines the validity of the instruments.If the p-value associated with the Hansen J-statistic is greater than the chosen significance level (commonly 0.05), the null hypothesis of instrument validity is not rejected. This implies that the instruments used in the IV regression model are valid, and the model provides unbiased estimates. On the other hand, if thep-value is less than the chosen significance level, the null hypothesis is rejected, suggesting potential problems with the instruments' validity.Conclusion:The Hansen test is a fundamental tool for validating instrumental variables in IV regression models. By assessing the correlation between the instruments and the error term, the test helps researchers ensure the reliability of their results and draw validinferences.Utilizing the "xtoverid" command in Stata, researchers can easily implement the Hansen test and obtain essential statistics to assess the validity of their instruments. This test enhances the rigor and validity of IV regression models, making them valuable tools in empirical research across various disciplines.。
NLOS Identification and Mitigation for Localization Based on UWB Experimental Data Stefano Maran`o,Student Member,IEEE,Wesley M.Gifford,Student Member,IEEE,Henk Wymeersch,Member,IEEE,Moe Z.Win,Fellow,IEEEAbstract—Sensor networks can benefit greatly from location-awareness,since it allows information gathered by the sensors to be tied to their physical locations.Ultra-wide bandwidth(UWB) transmission is a promising technology for location-aware sensor networks,due to its power efficiency,fine delay resolution,and robust operation in harsh environments.However,the presence of walls and other obstacles presents a significant challenge in terms of localization,as they can result in positively biased distance estimates.We have performed an extensive indoor measurement campaign with FCC-compliant UWB radios to quantify the effect of non-line-of-sight(NLOS)propagation.From these channel pulse responses,we extract features that are representative of the propagation conditions.We then develop classification and regression algorithms based on machine learning techniques, which are capable of:(i)assessing whether a signal was trans-mitted in LOS or NLOS conditions;and(ii)reducing ranging error caused by NLOS conditions.We evaluate the resulting performance through Monte Carlo simulations and compare with existing techniques.In contrast to common probabilistic approaches that require statistical models of the features,the proposed optimization-based approach is more robust against modeling errors.Index Terms—Localization,UWB,NLOS Identification,NLOS Mitigation,Support Vector Machine.I.I NTRODUCTIONL OCATION-AW ARENESS is fast becoming an essential aspect of wireless sensor networks and will enable a myr-iad of applications,in both the commercial and the military sectors[1],[2].Ultra-wide bandwidth(UWB)transmission [3]–[8]provides robust signaling[8],[9],as well as through-wall propagation and high-resolution ranging capabilities[10], [11].Therefore,UWB represents a promising technology for localization applications in harsh environments and accuracy-critical applications[10]–[15].In practical scenarios,however, a number of challenges remain before UWB localization and communication can be deployed.These include signal Manuscript received15May2009;revised15February2010.This research was supported,in part,by the National Science Foundation under grant ECCS-0901034,the Office of Naval Research Presidential Early Career Award for Scientists and Engineers(PECASE)N00014-09-1-0435,the Defense University Research Instrumentation Program under grant N00014-08-1-0826, and the MIT Institute for Soldier Nanotechnologies.S.Maran`o was with Laboratory for Information and Decision Systems (LIDS),Massachusetts Institute of Technology(MIT),and is now with the Swiss Seismological Service,ETH Z¨u rich,Z¨u rich,Switzerland(e-mail: stefano.marano@sed.ethz.ch).H.Wymeersch was with LIDS,MIT,and is now with Chalmers University of Technology,G¨o teborg,Sweden(e-mail:henkw@chalmers.se).Wesley M.Gifford and Moe Z.Win are with LIDS,MIT,Cambridge,MA 02139USA(e-mail:wgifford@,moewin@).Digital Object Identifier10.1109/JSAC.2010.100907.acquisition[16],multi-user interference[17],[18],multipath effects[19],[20],and non-line-of-sight(NLOS)propagation [10],[11].The latter issue is especially critical[10]–[15]for high-resolution localization systems,since NLOS propagation introduces positive biases in distance estimation algorithms, thus seriously affecting the localization performance.Typical harsh environments such as enclosed areas,urban canyons, or under tree canopies inherently have a high occurrence of NLOS situations.It is therefore critical to understand the impact of NLOS conditions on localization systems and to develop techniques that mitigate their effects.There are several ways to deal with ranging bias in NLOS conditions,which we classify as identification and mitigation. NLOS identification attempts to distinguish between LOS and NLOS conditions,and is commonly based on range estimates[21]–[23]or on the channel pulse response(CPR) [24],[25].Recent,detailed overviews of NLOS identification techniques can be found in[22],[26].NLOS mitigation goes beyond identification and attempts to counter the positive bias introduced in NLOS signals.Several techniques[27]–[31]rely on a number of redundant range estimates,both LOS and NLOS,in order to reduce the impact of NLOS range estimates on the estimated agent position.In[32]–[34]the geometry of the environment is explicitly taken into account to cope with NLOS situations.Other approaches,such as[35],attempt to detect the earliest path in the CPR in order to better estimate the TOA in NLOS prehensive overviews of NLOS mitigation techniques can be found in[26],[36]. The main drawbacks of existing NLOS identification and mitigation techniques are:(i)loss of information due to the direct use of ranges instead of the CPRs;(ii)latency incurred during the collection of range estimates to establish a history; and(iii)difficulty in determining the joint probability distribu-tions of the features required by many statistical approaches. In this paper,we consider an optimization-based approach. In particular,we propose the use of non-parametric ma-chine learning techniques to perform NLOS identification and NLOS mitigation.Hence,they do not require a statistical characterization of LOS and NLOS channels,and can perform identification and mitigation under a common framework.The main contributions of this paper are as follows:•characterization of differences in the CPRs under LOS and NLOS conditions based on an extensive indoor mea-surement campaign with FCC-compliant UWB radios;•determination of novel features extracted from the CPR that capture the salient properties in LOS and NLOS conditions;0733-8716/10/$25.00c 2010IEEE•demonstration that a support vector machine (SVM)clas-si fier can be used to distinguish between LOS and NLOS conditions,without the need for statistical modeling of the features under either condition;and•development of SVM regressor-based techniques to mit-igate the ranging bias in NLOS situations,again without the need for statistical modeling of the features under either condition.The remainder of the paper is organized as follows.In Section II,we introduce the system model,problem statement,and describe the effect of NLOS conditions on ranging.In Section III,we describe the equipment and methodologies of the LOS/NLOS measurement campaign and its contribu-tion to this work.The proposed techniques for identi fication and mitigation are described in Section IV,while different strategies for incorporating the proposed techniques within any localization system are discussed in Section V.Numerical performance results are provided in Section VI,and we draw our conclusions in Section VII.II.P ROBLEM S TATEMENT AND S YSTEM M ODEL In this section,we describe the ranging and localization algorithm,and demonstrate the need for NLOS identi fication and mitigation.A.Single-node LocalizationA network consists of two types of nodes:anchors are nodes with known positions,while agents are nodes with unknown positions.For notational convenience,we consider the point of view of a single agent,with unknown position p ,surrounded by N b anchors,with positions,p i ,i =1,...,N b .The distance between the agent and anchor i is d i = p −p i .The agent estimates the distance between itself and the anchors,using a ranging protocol.We denote the estimateddistance between the agent and anchor i by ˆdi ,the ranging error by εi =ˆdi −d i ,the estimate of the ranging error by ˆεi ,the channel condition between the agent and anchor i by λi ∈{LOS ,NLOS },and the estimate of the channelcondition by ˆλi .The mitigated distance estimate of d i is ˆd m i=ˆd i −ˆεi .The residual ranging error after mitigation is de fined as εm i =ˆd m i −d i .Given a set of at least three distance estimates,the agent will then determine its position.While there are numerous positioning algorithms,we focus on the least squares (LS)criterion,due to its simplicity and because it makes no assumptions regarding ranging errors.The agent can infer its position by minimizing the LS cost functionˆp=arg min p(p i ,ˆdi )∈Sˆd i − p −p i 2.(1)Note that we have introduced the concept of the set of useful neighbors S ,consisting of couplesp i ,ˆdi .The optimization problem (1)can be solved numerically using steepest descent.B.Sources of ErrorThe localization algorithm will lead to erroneous results when the ranging errors are large.In practice the estimated distances are not equal to the true distances,because of a number of effects including thermal noise,multipath propa-gation,interference,and ranging algorithm inaccuracies.Ad-ditionally,the direct path between requester and responder may be obstructed,leading to NLOS propagation.In NLOS conditions,the direct path is either attenuated due to through-material propagation,or completely blocked.In the former case,the distance estimates will be positively biased due to the reduced propagation speed (i.e.,less than the expected speed of light,c ).In the latter case the distance estimate is also positively biased,as it corresponds to a re flected path.These bias effects can be accounted for in either the ranging or localization phase.In the remainder of this paper,we focus on techniques that identify and mitigate the effects of NLOS signals during the ranging phase.In NLOS identi fication,the terms in (1)corre-sponding to NLOS distance estimates are omitted.In NLOS mitigation,the distance estimates corresponding to NLOS signals are corrected for improved accuracy.The localization algorithm can then adopt different strategies,depending on the quality and the quantity of available range estimates.III.E XPERIMENTAL A CTIVITIESThis section describes the UWB LOS/NLOS measurement campaign performed at the Massachusetts Institute of Tech-nology by the Wireless Communication and Network Sciences Laboratory during Fall 2007.A.OverviewThe aim of this experimental effort is to build a large database containing a variety of propagation conditions in the indoor of fice environment.The measurements were made using two FCC-compliant UWB radios.These radios repre-sent off-the-shelf transceivers and therefore an appropriate benchmark for developing techniques using currently available technology.The primary focus is to characterize the effects of obstructions.Thus,measurement positions (see Fig.1)were chosen such that half of the collected waveforms were cap-tured in NLOS conditions.The distance between transmitter and receiver varies widely,from roughly 0.6m up to 18m,to capture a variety of operating conditions.Several of fices,hallways,one laboratory,and a large lobby constitute the physical setting of this campaign.While the campaign was conducted in one particular indoor of fice envi-ronment,because of the large number of measurements and the variety of propagation scenarios encountered,we expect that our results are applicable in other of fice environments.The physical arrangement of the campaign is depicted in Fig.1.In each measurement location,the received waveform and the associated range estimate,as well as the actual distance are recorded.The waveforms are then post-processed in order to reduce dependencies on the speci fic algorithm and hardware,e.g.,on the leading edge detection (LED)algorithm embedded in the radios.Fig.1.Measurements were taken in clusters over several different rooms and hallways to capture different propagation conditions.B.Experimental ApparatusThe commercially-available radios used during the data collection process are capable of performing communications and ranging using UWB signals.The radio complies with the emission limit set forth by the FCC[37].Specifically, the10dB bandwidth spans from3.1GHz to6.3GHz.The radio is equipped with a bottom fed planar elliptical antenna. This type of dipole antenna is reported to be well matched and radiation efficient.Most importantly,it is omni-directional and thus suited for ad-hoc networks with arbitrary azimuthal orientation[38].Each radio is mounted on the top of a plastic cart at a height of90cm above the ground.The radios perform a round-trip time-of-arrival(RTOA)ranging protocol1and are capable of capturing waveforms while performing the ranging procedure.Each waveform r(t)captured at the receiving radio is sampled at41.3ps over an observation window of190ns.C.Measurement ArrangementMeasurements were taken at more than one hundred points in the considered area.A map,depicting the topological organization of the clusters within the building,is shown 1RTOA allows ranging between two radios without a common time reference;and thus alleviates the need for networksynchronization.Fig.2.The measurement setup for collecting waveforms between D675CA and H6around the corner of the WCNS Laboratory.in Fig.1,and a typical measurement scenario is shown in Fig.2.Points are placed randomly,but are restricted to areas which are accessible by the carts.The measurement points are grouped into non-overlapping clusters,i.e.,each point only belongs to a single cluster.Typically,a cluster corresponds to a room or a region of a hallway.Within each cluster,measurements between every possible pair of points were captured.When two clusters were within transmission range, every inter-cluster measurement was collected as well.Overall, more than one thousand unique point-to-point measurements were performed.For each pair of points,several received waveforms and distance estimates are recorded,along with the actual distance.During each measurement the radios remain stationary and care is taken to limit movement of other objects in the nearby surroundings.D.DatabaseUsing the measurements collected during the measurement phase,a database was created and used to develop and evaluate the proposed identification and mitigation techniques. It includes1024measurements consisting of512waveforms captured in the LOS condition and512waveforms captured in the NLOS condition.The term LOS is used to denote the existence of a visual path between transmitter and receiver,i.e., a measurement is labeled as LOS when the straight line be-tween the transmitting and receiving antenna is unobstructed. The ranging estimate was obtained by an RTOA algorithm embedded on the radio.The actual position of the radio during each measurement was manually recorded,and the ranging error was calculated with the aid of computer-aided design(CAD)software.The collected waveforms were then processed to align thefirst path in the delay domain using a simple threshold-based method.The alignment process creates a time reference independent of the LED algorithm embedded on the radio.IV.NLOS I DENTIFICATION AND M ITIGATIONThe collected measurement data illustrates that NLOS prop-agation conditions significantly impact ranging performance. For example,Fig.3shows the empirical CDFs of the ranging error over the ensemble of all measurements collected under the two different channel conditions.In LOS conditions a ranging error below one meter occurs in more than95%of the measurements.On the other hand,in NLOS conditions a ranging error below one meter occurs in less than30%of the measurements.Clearly,LOS and NLOS range estimates have very dif-ferent characteristics.In this section,we develop techniques to distinguish between LOS and NLOS situations,and to mitigate the positive biases present in NLOS range estimates. Our techniques are non-parametric,and rely on least-squares support-vector machines(LS-SVM)[39],[40].Wefirst de-scribe the features for distinguishing LOS and NLOS situa-tions,followed by a brief introduction to LS-SVM.We then describe how LS-SVM can be used for NLOS identification and mitigation in localization applications,without needing to determine parametric joint distributions of the features for both the LOS and NLOS conditions.A.Feature Selection for NLOS ClassificationWe have extracted a number of features,which we expect to capture the salient differences between LOS and NLOS signals,from every received waveform r(t).These featuresFig.3.CDF of the ranging error for the LOS and NLOS condition. were selected based on the following observations:(i)in NLOS conditions,signals are considerably more attenuated and have smaller energy and amplitude due to reflections or obstructions;(ii)in LOS conditions,the strongest path of the signal typically corresponds to thefirst path,while in NLOS conditions weak components typically precede the strongest path,resulting in a longer rise time;and(iii)the root-mean-square(RMS)delay spread,which captures the temporal dispersion of the signal’s energy,is larger for NLOS signals. Fig.4depicts two waveforms received in the LOS and NLOS condition supporting our observations.We also include some features that have been presented in the literature.Taking these considerations into account,the features we will consider are as follows:1)Energy of the received signal:E r=+∞−∞|r(t)|2dt(2) 2)Maximum amplitude of the received signal:r max=maxt|r(t)|(3) 3)Rise time:t rise=t H−t L(4) wheret L=min{t:|r(t)|≥ασn}t H=min{t:|r(t)|≥βr max},andσn is the standard deviation of the thermal noise.The values ofα>0and0<β≤1are chosen empirically in order to capture the rise time;in our case, we usedα=6andβ=0.6.4)Mean excess delay:τMED=+∞−∞tψ(t)dt(5) whereψ(t)=|r(t)|2/E r.Fig.4.In some situations there is a clear difference between LOS(upper waveform)and NLOS(lower waveform)signals.5)RMS delay spread:τRMS=+∞−∞(t−τMED)2ψ(t)dt(6)6)Kurtosis:κ=1σ4|r|TT|r(t)|−μ|r|4dt(7)whereμ|r|=1TT|r(t)|dtσ2|r|=1TT|r(t)|−μ|r|2dt.B.Least Squares SVMThe SVM is a supervised learning technique used both for classification and regression problems[41].It represents one of the most widely used classification techniques because of its robustness,its rigorous underpinning,the fact that it requires few user-defined parameters,and its superior performance compared to other techniques such as neural networks.LS-SVM is a low-complexity variation of the standard SVM, which has been applied successfully to classification and regression problems[39],[40].1)Classification:A linear classifier is a function R n→{−1,+1}of the forml(x)=sign[y(x)](8) withy(x)=w Tϕ(x)+b(9) whereϕ(·)is a predetermined function,and w and b are unknown parameters of the classifier.These parameters are de-termined based on the training set{x k,l k}N k=1,where x k∈R n and l k∈{−1,+1}are the inputs and labels,respectively.In the case where the two classes can be separated the SVM determines the separating hyperplane which maximizes the margin between the two classes.2Typically,most practical problems involve classes which are not separable.In this case,the SVM classifier is obtained by solving the following optimization problem:arg minw,b,ξ12w 2+γNk=1ξk(10)s.t.l k y(x k)≥1−ξk,∀k(11)ξk≥0,∀k,(12) where theξk are slack variables that allow the SVM to tolerate misclassifications andγcontrols the trade-off between minimizing training errors and model complexity.It can be shown that the Lagrangian dual is a quadratic program(QP) [40,eqn.2.26].To further simplify the problem,the LS-SVM replaces the inequality(11)by an equality:arg minw,b,e12w 2+γ12Nk=1e2k(13)s.t.l k y(x k)=1−e k,∀k.(14) Now,the Lagrangian dual is a linear program(LP)[40,eqn.3.5],which can be solved efficiently by standard optimization toolboxes.The resulting classifier can be written asl(x)=signNk=1αk l k K(x,x k)+b,(15)whereαk,the Lagrange multipliers,and b are found from the solution of the Lagrangian dual.The function K(x k,x l)=ϕ(x k)Tϕ(x l)is known as the kernel which enables the SVM to perform nonlinear classification.2)Regression:A linear regressor is a function R n→R of the formy(x)=w Tϕ(x)+b(16) whereϕ(·)is a predetermined function,and w and b are unknown parameters of the regressor.These parameters are determined based on the training set{x k,y k}N k=1,where x k∈R n and y k∈R are the inputs and outputs,respectively. The LS-SVM regressor is obtained by solving the following optimization problem:arg minw,b,e12w 2+γ12e 2(17)s.t.y k=y(x k)+e k,∀k,(18) whereγcontrols the trade-off between minimizing training errors and model complexity.Again,the Lagrangian dual is an LP[40,eqn.3.32],whose solution results in the following LS-SVM regressory(x)=Nk=1αk K(x,x k)+b.(19)2The margin is given by1/ w ,and is defined as the smallest distance between the decision boundary w Tϕ(x)+b=0and any of the training examplesϕ(x k).C.LS-SVM for NLOS Identi fication and MitigationWe now apply the non-parametric LS-SVM classi fier to NLOS identi fication,and the LS-SVM regressor to NLOS mitigation.We use 10-fold cross-validation 3to assess the performance of our features and the SVM.Not only are we interested in the performance of LS-SVM for certain features,but we are also interested in which subsets of the available features give the best performance.1)Classi fication:To distinguish between LOS and NLOS signals,we train an LS-SVM classi fier with inputs x k and corresponding labels l k =+1when λk =LOS and l k =−1when λk =NLOS .The input x k is composed of a subset of the features given in Section IV-A.A trade-off between classi fier complexity and performance can be made by using a different size feature subset.2)Regression:To mitigate the effect of NLOS propagation,we train an LS-SVM regressor with inputs x k and corre-sponding outputs y k =εk associated with the NLOS signals.Similar to the classi fication case,x k is composed of a subset of features,selected from those given in Section IV-A and therange estimate ˆdk .Again,the performance achieved by the regressor will depend on the size of the feature subset and the combination of features used.V.L OCALIZATION S TRATEGIESBased on the LS-SVM classi fier and regressor,we can develop the following localization strategies:(i)localization via identi fication ,where only classi fication is employed;(ii)localization via identi fication and mitigation ,where the re-ceived waveform is first classi fied and error mitigation is performed only on the range estimates from those signals identi fied as NLOS;and (iii)a hybrid approach which discards mitigated NLOS range estimates when a suf ficient number of LOS range estimates are present.A.Strategy 1:StandardIn the standard strategy,all the range estimates ˆdi from neighboring anchor nodes are used by the LS algorithm (1)for localization.In other words,S S =p i ,ˆd i :1≤i ≤N b .(20)B.Strategy 2:Identi ficationIn the second strategy,waveforms are classi fied as LOS or NLOS using the LS-SVM classi fier.Range estimates are used by the localization algorithm only if the associated waveform was classi fied as LOS,while range estimates from waveforms classi fied as NLOS are discarded:S I = p i ,ˆd i:1≤i ≤N b ,ˆλi =LOS .(21)Whenever the cardinality of S I is less than three,the agent isunable to localize.4In this case,we set the localization error to +∞.3InK -fold cross-validation,the dataset is randomly partitioned into K parts of approximately equal size,each containing 50%LOS and 50%NLOS waveforms.The SVM is trained on K −1parts and the performance is evaluated on the remaining part.This is done a total of K times,using each of the K parts exactly once for evaluation and K −1times for training.4Note that three is the minimum number of anchor nodes needed to localize in two-dimensions.TABLE IF ALSE ALARM PROBABILITY (P F ),MISSED DETECTION PROBABILITY (P M ),AND OVERALL ERROR PROBABILITY (P E )FOR DIFFERENT NLOSIDENTIFICATION TECHNIQUES .T HE SET F i IDENOTES THE SET OF i FEATURES WITH THE SMALLEST P E USING THE LS-SVM TECHNIQUE .Identi fication Technique P F P M P E Parametric technique given in [42]0.1840.1430.164LS-SVM using features from [42]0.1290.1520.141F 1I ={r max }0.1370.1230.130F 2I ={r max ,t rise }0.0920.1090.100F 3I ={E r ,t rise ,κ}0.0820.0900.086F 4I={E r ,r max ,t rise ,κ}0.0820.0900.086F 5I ={E r ,r max ,t rise ,τMED ,κ}0.0860.0900.088F 6I={E r ,r max ,t rise ,τMED ,τRMS ,κ}0.0920.0900.091C.Strategy 3:Identi fication and MitigationThis strategy is an extension to the previous strategy,wherethe received waveform is first classi fied as LOS or NLOS,and then the mitigation algorithm is applied to those signals with ˆλi =NLOS .For this case S IM =S I ∪S M ,where S M = p i ,ˆd m i :1≤i ≤N b ,ˆλi =NLOS ,(22)and the mitigated range estimate ˆd m iis described in Sec.II.This approach is motivated by the observation that mitigation is not necessary for range estimates associated with LOS waveforms,since their accuracy is suf ficiently high.D.Strategy 4:Hybrid Identi fication and Mitigation In the hybrid approach,range estimates are mitigated as in the previous strategy.However,mitigated range estimates are only used when less than three LOS anchors are available:5S H =S I if |S I |≥3S IMotherwise(23)This approach is motivated by the fact that mitigated range es-timates are often still less accurate than LOS range estimates.Hence,only LOS range estimates should be used,unless there is an insuf ficient number of them to make an unambiguous location estimate.VI.P ERFORMANCE E VALUATION AND D ISCUSSION In this section,we quantify the performance of the LS-SVM classi fier and regressor from Section IV,as well as the four localization strategies from Section V.We will first consider identi fication,then mitigation,and finally localization.For every technique,we will provide the relevant performance measures as well as the quantitative details of how the results were obtained.5Inpractice the angular separation of the anchors should be suf ficiently large to obtain an accurate estimate.If this is not the case,more than three anchors may be needed.TABLE IIM EAN AND RMS VALUES OF RRE FOR LS-SVM REGRESSION -BASEDMITIGATION .T HE SET F i MDENOTES THE SET OF i FEATURES WHICH ACHIEVES THE MINIMUM RMS RRE.Mitigation Technique with LS-SVM RegressionMean [m]RMS [m ]No Mitigation2.63223.589F 1M={ˆd}-0.0004 1.718F 2M ={κ,ˆd }-0.0042 1.572F 3M ={t rise ,κ,ˆd }0.0005 1.457F 4M ={t rise ,τMED ,κ,ˆd }0.0029 1.433F 5M={E r ,t rise ,τMED ,κ,ˆd }0.0131 1.425F 6M ={E r ,t rise ,τMED ,τRMS ,κ,ˆd }0.0181 1.419F 7M={E r ,r max ,t rise ,τMED ,τRMS ,κ,ˆd}0.01801.425A.LOS/NLOS Identi ficationIdenti fication results,showing the performance 6for eachfeature set size,are given in Table I.For the sake of compari-son,we also evaluate the performance of the parametric identi-fication technique from [42],which relies on three features:the mean excess delay,the RMS delay spread,and the kurtosis of the waveform.For fair comparison,these features are extracted from our database.The performance is measured in terms of the misclassi fication rate:P E =(P F +P M )/2,where P F is the false alarm probability (i.e.,deciding NLOS when the signal was LOS),and P M is the missed detection probability (i.e.,deciding LOS when the signal was NLOS).The table only lists the feature sets which achieved the minimum misclassi fication rate for each feature set size.We observe that the LS-SVM,using the three features from [42],reduces the false alarm probability compared to the parametric technique.It was shown in [43]that the features from [42],in fact,give rise to the worst performance among all possible sets of size three considered ing the features from Section IV-A and considering all feature set sizes,our results indicate that the feature set of size three,F 3I ={E r ,t rise ,κ},provides the best pared to the parametric technique,this set reduces both the false alarm and missed detection probabilities and achieves a correct classi fication rate of above 91%.In particular,among all feature sets of size three (results not shown,see [43]),there are seven sets that yield a P E of roughly 10%.All seven of these sets have t rise in common,while four have r max in common,indicating that these two features play an important role.Their importance is also corroborated by the presence of r max and t rise in the selected sets listed in Table I.For the remainder of this paper we will use the feature set F 3I for identi fication.B.NLOS MitigationMitigation results,showing the performance 7for different feature set sizes are given in Table II.The performance is measured in terms of the root mean square residual ranging6Wehave used an RBF kernel of the form K (x ,x k )=exp “− x −x k 2”and set γ=0.1.Features are first converted to the log domain in order to reduce the dynamic range.7Here we used a kernel given by K (x ,x k )=exp “− x −x k2/162”and set γ=10.Again,features are first converted to the log domain.Fig.5.CDF of the ranging error for the NLOS case,before and after mitigation.error (RMS RRE): 1/N N i =1(εm i )2.A detailed analysis ofthe experimental data indicates that large range estimates are likely to exhibit large positive ranging errors.This means that ˆditself is a useful feature,as con firmed by the presence of ˆdin all of the best feature sets listed in the table.Increasing the feature set size can further improve the RMS RRE.Thefeature set of size six,F 6M={E r ,t rise ,τMED ,τRMS ,κ,ˆd },offers the best performance.For the remainder of this paper,we will use this feature set for NLOS mitigation.Fig.5shows the CDF of the ranging error before and after mitigation using this feature set.We observe that without mitigation around 30%of the NLOS waveforms achieved an accuracy of less than one meter (|ε|<1).Whereas,after the mitigation process,60%of the cases have an accuracy less than 1m.C.Localization Performance1)Simulation Setup:We evaluate the localization perfor-mance for fixed number of anchors (N b )and a varying prob-ability of NLOS condition 0≤P NLOS ≤1.We place an agent at a position p =(0,0).For every anchor i (1≤i ≤N b ),we draw a waveform from the database:with probability P NLOS we draw from the NLOS database and with probability 1−P NLOS from the LOS database.The true distance d i corre-sponding to that waveform is then used to place the i th anchor at position p i =(d i sin(2π(i −1)/N b ),d i cos(2π(i −1)/N b )),while the estimated distance ˆdi is provided to the agent.This creates a scenario where the anchors are located at different distances from the agent with equal angular spacing.The agent estimates its position,based on a set of useful neighbors S ,using the LS algorithm from Section II.The arithmetic mean 8of the anchor positions is used as the initial estimate of the agent’s position.2)Performance Measure:To capture the accuracy and availability of localization,we introduce the notion of outage probability .For a certain scenario (N b and P NLOS )and a8Thisis a fair setting for the simulation,as all the strategies are initialized inthe same way.Indeed,despite the identical initialization,strategies converge to signi ficantly different final position estimates.In addition,we note that such an initial position estimate is always available to the agent.。
fisher's scoring method代码-回复Fisher's Scoring Method for Statistical InferenceIntroduction:Fisher's scoring method is an iterative numerical algorithm used in statistical inference to estimate the parameters of a statistical model. It is named after Sir Ronald A. Fisher, who developed this technique in the early 20th century. This scoring method is widely used in fields such as biostatistics, econometrics, and machine learning, among others. In this article, we will explore the steps involved in Fisher's scoring method and explain its significance in statistical analysis.Step 1: Defining the statistical modelThe first step in Fisher's scoring method is to define the statistical model that we want to estimate. This model consists of a set of parameters that characterize the underlying distribution of the data. For example, in a linear regression model, the parameters would include the intercept and slope coefficients.Step 2: Obtaining the likelihood functionThe next step is to derive the likelihood function, which measures the probability of observing the given data given the parameter values. The likelihood function is typically derived from the assumed distributional form of the data. For example, in a linear regression model with normally distributed errors, the likelihood function would be based on the normal distribution.Step 3: Taking the derivative of the log-likelihood functionIn Fisher's scoring method, we take the derivative of thelog-likelihood function with respect to each parameter. Thelog-likelihood function allows us to work with sums of probabilities rather than products, making the calculations more tractable. The derivatives give us the score or gradient of the log-likelihood function.Step 4: Evaluating the expected Fisher information matrixNext, we compute the expected Fisher information matrix, which measures the curvature of the log-likelihood function at the trueparameter values. The Fisher information matrix is defined as the negative of the expected second derivative of the log-likelihood function with respect to each parameter. It provides information about the precision of the parameter estimates.Step 5: Iteratively updating the parameter estimatesUsing the derivative of the log-likelihood function and the Fisher information matrix, we iteratively update the parameter estimates. The update equation is given by θ(k+1) = θ(k) + (I(θ(k)))^-1 * ∇l(θ(k)), where θ(k) represents the parameter estimates at iteration k, I(θ(k)) is the Fisher information matrix evaluated at θ(k), and ∇l(θ(k)) is the score vector evaluated at θ(k). This iterative process continues until convergence is achieved, typically when the absolute change in the parameter estimates falls below a predetermined threshold.Step 6: Assessing the convergence and accuracyOnce the parameter estimates have converged, we evaluate the convergence and accuracy of the estimates. This can be done by checking the convergence criterion, such as the absolute change inthe parameter estimates, or conducting hypothesis tests to assess the significance of the estimated parameters. Additionally, we can calculate standard errors and confidence intervals to quantify the uncertainty associated with the parameter estimates.Concluding Remarks:Fisher's scoring method is a powerful technique for estimating the parameters of a statistical model. By iteratively updating the parameter estimates using the score vector and the Fisher information matrix, this method provides efficient and robust estimates. It is particularly useful when the maximum likelihood estimation technique becomes computationally challenging or fails to converge. With its wide applicability in various fields, Fisher's scoring method continues to be an important tool in statistical inference and data analysis.。
OrdinalLogisticRegressionAnalysis methods you might considerBelow is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.Ordered logistic regression: the focus of this page.OLS regression: This analysis is problematic because the assumptions of OLS are violated when it is used with a non-interval outcome variable.ANOVA: If you use only one continuous predictor, you could “flip” the model around so that, say, gpa was the outcome variable and apply was the predictor variable. Then you could run a one-way ANOVA. This isn’t a bad thing to do if you only have one predictor variable (from the logistic model), and it is continuous.Multinomial logistic regression: This is similar to doing ordered logistic regression, except that it is assumed that there is no order to the categories of the outcome variable (i.e., the categories are nominal). The downside of this approach is that the information contained in the ordering is lost.for a one unit increase in parental education, i.e., going from 0 (Low) to 1 (High), the odds of “very likely” applying versus “somewhat likely” or “unlikely” applying combined are 2.85 greater, given that all of the other variables in the model are held constant. Likewise, the odds “very likely” or “somewhat likely” applying versus “unlikely” applying is 2.85 times greater, given that all of the other variables in the model are held constant. For gpa (and other continuous variables), the interpretation is that when a student’s gpa moves 1 unit, the odds of moving from “unlikely” applying to “somewhat likely” or “very likley”applying (or from the lower and middle categories to the high category) are multiplied by 1.85.One of the assumptions underlying ordinal logistic (and ordinal probit) regression is that the relationship between each pair of outcome groups is the same. In other words, ordinal logistic regression assumes that the coefficients that describe the relationship between, say, the lowest versus all higher categories of the response variable are the same as those that describe the relationship between the next lowest category and all higher categories, etc. This is called the proportional odds assumption or the parallel regression assumption. Because the relationship between all pairs of groups is the same, there is only one set of coefficients. If this was not the case, we would need different sets of coefficients in the model to describe the relationship between each pair of outcome groups. Thus, in order to asses the appropriateness of our model, we need to evaluate whether the proportional odds assumption is tenable. Statistical tests to do this are available in some software packages. However, these tests have been criticized for having a tendency to reject the null hypothesis (that the sets of coefficients are the same), and hence, indicate that there the parallel slopes assumption does not hold, in cases where the assumption does hold (see Harrell 2001 p. 335). We were unable to locate a facility in R to perform any of the tests commonly used to test the parallel slopes assumption. However, Harrell does recommend a graphical method for assessing the parallel slopes assumption. The values displayed in this graph are essentially (linear) predictions from a logit model, used to model the probability that y is greater than or equal to a given value (for each level of y), using one predictor (x) variable at a time. In order create this graph, you will need the Hmisc library.The code below contains two commands (the first command falls on multiple lines) and is used to create this graph to test the proportional odds assumption. Basically, we will graph predicted logits from individual logistic regressions with a single predictor where the outcome groups are defined by either apply >= 2 and apply >= 3. If the difference between predicted logits for varying levels of a predictor, say pared, are the same whether the outcome is defined by apply >= 2 or apply >=3, then we can be confident that the proportional odds assumption holds. In other words, if the difference between logits for pared = 0 and pared = 1 is the same when the outcome is apply >= 2 as the difference when the outcome is apply >= 3, then the proportional odds assumption likely holds.The first command creates the function that estimates the values that will be graphed. The first line of this command tells R that sf is a function, and that this function takes one argument, which we label y. The sf function will calculate the log odds of being greater than or equal to each value of the target variable. For our purposes, we would like the log odds of apply being greater than or equal to 2, and then greater than or equal to 3. Depending on the number of categories in your dependent variable, and the coding of your variables, you may have to edit this function. Below the function is configured for a y variablewith three levels, 1, 2, 3. If your dependent variable has 4 levels, labeled 1, 2, 3, 4 you would need toadd 'Y>=4'=qlogis(mean(y >= 4)) (minus the quotation marks) inside the first set of parentheses. If your dependent variable were coded 0, 1, 2 instead of 1, 2, 3, you would need to edit the code, replacing each instance of 1 with 0, 2 with 1, and so on. Insidethe sf function we find the qlogis function, which transforms a probability to a logit. So, we will basically feed probabilities of apply being greater than 2 or 3 to qlogis, and it will return the logit transformations of these probabilites. Inside the qlogis function we see that we want the log odds of the mean of y >= 2. When we supply a y argument, such as apply, toOnce we are done assessing whether the assumptions of our model hold, we can obtain predicted probabilities, which are usually easier to understand than either the coefficients orThings to considerPerfect prediction: Perfect prediction means that one value of a predictor variable is associated with only one value of the response variable. If this happens, Stata will usually issue a note at the top of the output and will drop the cases so that the model can run. Sample size: Both ordered logistic and ordered probit, using maximum likelihood estimates, require sufficient sample size. How big is big is a topic of some debate, but they almost always require more cases than OLS regression. Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases, the model may become unstable or it might not run at all.Pseudo-R-squared: There is no exact analog of the R-squared found in OLS. There are many versions of pseudo-R-squares. Please see Long and Freese 2005 for more details and explanations of various pseudo-R-squares.Diagnostics: Doing diagnostics for non-linear models is difficult,ReferencesAgresti, A. (1996) An Introduction to Categorical Data Analysis. New York: John Wiley & Sons, IncAgresti, A. (2002) Categorical Data Analysis, Second Edition. Hoboken, New Jersey: John Wiley & Sons, Inc.Harrell, F. E, (2001) Regression Modeling Strategies. New York: Springer-Verlag.Liao, T. F. (1994) Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models. Thousand Oaks, CA: Sage Publications, Inc.Powers, D. and Xie, Yu. Statistical Methods for Categorical Data Analysis. Bingley, UK:Emerald Group Publishing Limited.。
保险行业专业术语英语The insurance industry is a critical component of the global economy, providing individuals and businesses with financial protection against a wide range of risks. The industry is characterized by a unique set of specialized terminology, which can be challenging for those not familiar with the field. In this essay, we will explore the most common and important professional terms used in the insurance industry, along with their definitions and applications.One of the fundamental concepts in insurance is the "policy." A policy is a legal contract between the insurance provider and the policyholder, outlining the terms and conditions of the coverage. The policyholder is the individual or entity that purchases the insurance coverage, while the insurance provider is the company that underwrites and administers the policy.Another key term is the "premium," which refers to the amount of money the policyholder pays to the insurance provider in exchange for the coverage. Premiums are typically based on factors such as the type of coverage, the level of risk, and the policyholder's individual characteristics.The "deductible" is the amount of money the policyholder must pay out-of-pocket before the insurance provider begins to cover the remaining costs. Deductibles can vary widely depending on the type of insurance and the level of coverage.The "claim" is a request made by the policyholder to the insurance provider for the payment of benefits under the terms of the policy. The insurance provider will then evaluate the claim and determine the appropriate amount of coverage to be provided.The "underwriter" is the insurance professional responsible for assessing the risk associated with a particular policy and determining the appropriate premium. Underwriters use a variety of tools and data sources to analyze risk, including actuarial tables, industry data, and customer information.The "actuarial" is the science of using statistical and mathematical models to assess the likelihood and severity of various risks. Actuaries play a crucial role in the insurance industry, helping to price policies, manage risk, and ensure the financial stability of insurance companies.The "reinsurance" is the practice of insurance companies purchasing coverage from other insurance providers to manage their own risk exposure. Reinsurance allows insurance companies to spread theirrisk across multiple parties, making it easier to meet their obligations to policyholders.The "broker" is a professional who acts as an intermediary between the policyholder and the insurance provider, helping to facilitate the purchase of insurance coverage. Brokers can provide valuable advice and guidance to their clients, as well as access to a wide range of insurance products.The "adjuster" is the insurance professional responsible for investigating and evaluating claims made by policyholders. Adjusters work to determine the appropriate amount of coverage and ensure that claims are handled fairly and efficiently.The "liability" refers to the legal responsibility of the policyholder for damages or injuries caused to others. Liability insurance is designed to protect policyholders from such risks, covering the costs of legal fees, settlements, and judgments.The "life insurance" is a type of insurance coverage that provides financial protection to the policyholder's beneficiaries in the event of the policyholder's death. Life insurance can be used to replace lost income, pay off debts, or provide for the policyholder's family.The "health insurance" is a type of insurance coverage that helps topay for the cost of medical care, including doctor visits, hospital stays, and prescription medications. Health insurance can be obtained through an employer, purchased directly from an insurance provider, or provided by the government.The "property insurance" is a type of insurance coverage that protects the policyholder's personal or commercial property from damage or loss. Property insurance can cover a wide range of risks, including fire, theft, natural disasters, and vandalism.The "auto insurance" is a type of insurance coverage that provides financial protection for the policyholder in the event of a car accident. Auto insurance can cover the cost of repairs, medical expenses, and liability claims.The "homeowners insurance" is a type of insurance coverage that protects the policyholder's home and personal belongings from damage or loss. Homeowners insurance can cover a wide range of risks, including fire, theft, and natural disasters.The "business insurance" is a type of insurance coverage that provides financial protection for businesses against a variety of risks, including property damage, liability claims, and employee-related issues.Overall, the insurance industry is a complex and multifaceted field, with a unique set of specialized terminology. Understanding these terms is essential for anyone working in or interacting with the insurance industry, as they provide a common language for communicating about the various products, services, and processes involved. By becoming familiar with these key terms, individuals can better navigate the insurance landscape and make informed decisions about their coverage needs.。
research proposal 的格式-回复Research Proposal Format:Title: The Impact of Social Media on Mental Health: A Systematic Review and Meta-analysisIntroduction:- Briefly explain the background and significance of the topic.- Highlight the existing gap in knowledge or controversy surrounding the impact of social media on mental health.- State the research questions and objectives.Literature Review:- Conduct a comprehensive review of studies exploring the relationship between social media use and mental health outcomes.- Summarize the main findings and controversies in the existing literature.- Identify any limitations or gaps in the current research.Methodology:1. Research Design:- Specify whether the study will be a systematic review or ameta-analysis.- Justify the chosen methodology and explain how it will contribute to addressing the research questions.2. Inclusion and Exclusion Criteria:- Define the population of interest, including age range and social media platforms.- Specify the types of studies that will be included (e.g., experimental, correlational, longitudinal) and any language or date restrictions.3. Search Strategy:- Describe the databases and search engines that will be utilized. - Explain the keywords and search terms that will be used.- Discuss any additional strategies for identifying relevant studies (e.g., hand-searching reference lists, contacting experts).4. Study Selection:- Outline the process for screening and selecting studies based on the predefined inclusion and exclusion criteria.- Describe the number of reviewers involved and any measures toensure inter-rater reliability.5. Data Extraction:- Specify the data items that will be collected from each study (e.g., sample size, study design, outcome measures).- Indicate how the relevant data will be extracted and recorded (e.g., using standardized forms).6. Quality Assessment:- Explain the methods for assessing the quality and risk of bias of the included studies.- Discuss any tools or criteria that will be used for this purpose.Data Analysis:- Describe the statistical methods that will be employed for data synthesis (e.g., meta-analysis, qualitative synthesis).- Explain the rationale for pooling or comparing the results of the selected studies.Ethical Considerations:- Discuss any ethical issues that may arise during the research process.- Address how participant confidentiality, informed consent, and other ethical principles will be ensured.Timeline and Resources:- Provide a detailed timeline of the proposed research activities. - Specify any necessary resources, such as research assistants, software, or funding.Conclusion:- Summarize the main points of the research proposal.- Emphasize the potential contributions and implications of the proposed study.- Discuss the feasibility and limitations of the research.References:- Include a list of all the references cited in the proposal, following appropriate citation style guidelines (e.g., APA, MLA).。
Introduction to Meta-analysisJim Derzon, PhDBattelle Memorial InstitutePresented at the AEA/CDC Summer Institute, 2008 Jim Derzon, PhDSenior Evaluation Specialist,Centers for Public Health Research and Evaluation, Battelle2101 Wilson Blvd. #800Arlington, VA, 22201-3008v. 703.248.1640, f. 703.527.5640, c. 240.505.7488derzonj@Introduction to Meta-analysisJim Derzon ( derzonj@) Presented at the AEA/CDC Summer Institute, 2008IntroductionMeta-analysis, or quantitative synthesis, is the technique of statistically combining the results of different studies, done on different samples that have each examined and presented findings on a similar relationship. While the actual methods of synthesis are complex, they are based on the fundamental assumptions underlying all quantitative, aggregate research. Because these basic assumptions are so integral to the justification for meta-analysis, this essay begins with an explication of those assumptions and then moves into the justification of why meta-analysis is the single best method currently available for synthesizing empirical knowledge.While people tend to look and act very differently, science and experience tell us that people with similar backgrounds or experiences tend to behave more alike than those who do not share that history. We also know from experience that shared experience doesn’t always lead to identical outcomes, that is, all aggregate research is probabilistic. Thus, when studying human behavior and the effects of interventions on that behavior, we look for commonalities and tendencies, not absolute or deterministic relationships. We expect people to be different, what we look, and test for, are commonalities. We estimate these commonalities using statistics that test the distribution of individual findings against mathematical models of those distributions. Now, just as people and their experiences differ, the findings obtained from different groups of individuals may differ due to differences inherent in the sample (e.g., the age, race, gender, or socio-economic status of the sample), differences in the way data were collected (e.g., whether respondents were surveyed or were interviewed), or howthe study was conducted (e.g., quality of randomization, attrition, or properties of themeasure or summary index). These influences are referred to as potential “confounds” because they may inflate or diminish the strength of the relationship they are trying to estimate. The consequence of this is that the results of any individual study (iX)may not well represent the true population value for the estimate, µ.Until fairly recently, there were no toolsavailable to researchers to systematically separate the potential influence of these confounds nor for assessing the relative stability of findings across different samples. Without the tools for disentangling these potential confounds, social scientists were often in a quandary when asked to explain differences in findings across different samples and across different studies that used different methodologies.MethodIn 1979 Gene Glass invented a method for combining estimates across samples and studies. He called this technique “meta-analysis.” In the intervening years this method has since been considerably refined and it currently provides the single best method for systematically reviewing and summarizing the evidence across multiple studies. Because meta-analysis summarizes evidence across multiple studies and samples, it produces a better (more accurate, more statistically robust) estimate of the strength and stability of a relationship or intervention impact than could be obtained in any single study.Meta-analysis is characterized by a systematic, detailed, and organized approachto systematic review that makes explicit thedomain of research covered, the nature and quality of the information extracted fromthat research, and the analytic techniques and results upon which interpretation is based. At it’s core is the concept of “effect size,” any standardized estimate of study findings. Effect sizes can take a variety of forms (e.g., percentages, logged odds ratios, correlations d-scores) depending on the literature being summarized. In method and procedure, meta-analysis is most akin to survey research except that research studies are “interviewed” instead of persons. A population of research studies is defined, a sample is drawn (or, more often, an attempt is made to retrieve the entire population), a questionnaire of data elements is completed for each study, and the resulting database is analyzed using statistical methods appropriate for groupwise data.Thus, meta-analysis is at once the logical extension of the theories and practice undergirding traditional quantitative methodologies and an improvement in that methodology for estimating quantitative results. No one study, no matter how good or how thorough, provides a fully adequate knowledge base for understanding and action. Each study inevitably has its idiosyncrasies of operationalization and constructs, method and procedure, samples and context, and sampling and non-sampling error that compromise the replicability and generalizability of its findings. Just as probabilistic estimates based on a group of individuals will be more reliable than estimates based on single individuals, the most robust, reliable knowledge comes only from some form of synthesis or integration of the results from multiple studies.When the issues involve quantitative research findings, meta-analysis has distinct advantages as a synthesis method. It provides a systematic, explicit, detailed, and statistically cogent approach to the task of identifying convergent findings, accounting for differences among studies, revealing gapsin research, and integrating the results, to the extent possible, into a coherent depiction ofthe current state of evidence for research findings.Response to criticsThis is not to suggest that meta-analysis does not have its detractors. There are those who decry meta-analysis for lumping together both good and bad studies. This is true, but using meta-analytic techniques we can both test – and adjust for – the systematic influence of questionable research methods on the strength of a relationship. The meta-analyst does not have to resort to statistical or methodological theory to make claims about the merit of any particular study. At the meta-analytic level these issues become empirical questions, ones that are readily managed using well-established statistical techniques.A second complaint is that meta-analysis lumps together “apples and oranges,” for example, combining different studies using different measures of the outcome or the predictor – differences that are often meaningful in a clinical context. However, in meta-analysis we are not interested in the outcome or the predictor, per se, we are interested in the magnitude and the stability of the relationship between the two. If estimates do not differ more than would be expected due to sampling error, the meta-analyst would respond that, while the measures or outcomes contributing to a finding may be different, the overall magnitude of the relationship between the two is similar. That is not a claim that the outcome and/or the predictor are similar, only that the strength of relationship between the measured items or practices are similar.A third complaint about meta-analysis is that it is simplistic, it only assesses the simple bivariate relationship between twoconstructs or the main effects of socialinterventions. This is true, although there are methods available for performing morecomplex multivariate “model-driven” syntheses (Becker, 1994, 1995). Yet, we, as program planners and administrators, are often interested in these simple bivariate relationships for identifying mediators, for selecting cases for intervention, for identifying best practices or interventions, for allocating resources, for delivering services to those most likely to benefit from those services, and for estimating the impact of social interventions. Thus, while this is a valid observation, it no more obviates the need for such research than it eliminates the need for such findings in the primary literature.Yet a fourth complaint about meta-analysis is that it is primarily descriptive in nature. The goal of meta-analysis is often not a highly developed conceptual theory. Rather, it focuses on what descriptive theory can be derived or supported by existing empirical research on the relationships or findings examined. In a developmental context knowing which psychological, behavioral, or interpersonal systems are best targeted for intervention, the times in the developmental sequence when interventions might be most productive, the characteristics of the individuals to whom intervention should be directed, the relative size of the groups at risk, and the potential change that might reasonably be expected from social interventions each provide sufficient warrant for using meta-analysis to summarize a literature. Deeper theory is nice, but it should be developed in the context of reliable evidence. Meta-analysis can provide such evidence.As a final note, the limitations and complexities of available research should not inhibit attempts to integrate what is known and to configure that knowledge in ways that may aid social action. Meta-analysis offers a systematic accounting of existing knowledge and an organized frameworkwithin which to separate robust, convergent information from the vagaries of samplingerror, methodological and substantive differences among studies, and the flukes and outliers that inevitably occur in a complex, diverse research domain. Just as the variability and complexities among people provide stimulus to traditional research, the intricacies of cross-study synthesis dictate the kind of systematic, careful, and unbiased handling of evidence that meta-analysis provides.Resources:Cook, T. D., Cooper, H., Cordray, D. S.,Hartmann, H., Hedges, L. V., Light,R. J., Louis, T. A., & Mosteller, F.(Eds.). (1992). Meta-analysis forexplanation: A casebook.NY: RussellSage Foundation.Cooper, H., & Hedges, L. V. (Eds.). (1994).The handbook of research synthesis. NewYork: Russell Sage Foundation. Durlak, J. A., & Lipsey, M. W. (1991). Apractitioner's guide to meta-analysis.American Journal of CommunityPsychology, 19(3), 291-332.Glass, G. V., McGaw, B., & Smith, M. L.(1981). Meta analysis in social research.Beverly Hills, CA: Sage Publications. Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. NY:Academic Press.Hunter, J. E., & Schmidt, F. L. (1990).Methods of meta-analysis: Correcting errorand bias in research findings. NewburyPark, CA: Sage.Lipsey, M.W. & Wilson, D.B. (2001).Practical Meta-analysis. Applied socialResearch Methods Series, Vol. 49.Thousand Oaks, CA: Sage. Rosenthal, R. (1991). Meta-analytic procedures for social research (Revised edition).Newbury Park, CA: Sage.。
高二英语数学建模方法单选题20题1. When doing a math modeling project, we can collect data by _____.A. making surveysB. doing experimentsC. reading booksD. asking teachers答案:A。
本题考查数学建模中数据收集的方法。
选项A“making surveys”(做调查)是一种常见的数据收集方式,可以收集到大量的实际数据。
选项B“doing experiments”(做实验)主要是为了验证假设,不一定能收集到广泛的数据。
选项C“reading books”(读书)可以获取知识,但不是直接的数据收集方法。
选项D“asking teachers”(问老师)可以得到一些指导和建议,但不是主要的数据收集方法。
2. In a math modeling project, which is NOT a proper way to collect data?A. Interviewing expertsB. Observing phenomenaC. Guessing randomlyD. Analyzing historical data答案:C。
选项A“Interviewing experts”((采访专家)可以获得专业的意见和数据。
选项B“Observing phenomena”((观察现象)能直接收集实际数据。
选项D“Analyzing historical data”((分析历史数据)也是一种有效的数据收集方法。
而选项C“Guessing randomly”((随机猜测)不是科学的数据收集方法,不能得到可靠的数据。
3. For a math modeling project about traffic flow, we can collect data by _____.A. counting cars on the roadB. imagining the traffic situationC. making up numbersD. asking friends for opinions答案:A。
1 Statistical Models for Assessing the Individuality of FingerprintsYongfang Zhu,Sarat C.Dass∗and Anil K.JainAbstractFollowing Daubert in1993,forensic evidence based onfingerprints wasfirst challenged in the1999case of USA vs.Byron Mitchell,and subsequently,in20other cases involvingfingerprint evidence.The main concern with the admissibility offingerprint evidence is the problem of individualization,namely,that the fundamental premise for asserting the uniqueness offingerprints has not been objectively tested and matching error rates are unknown. In order to assess the error rates,we require to quantify the variability offingerprint features,namely,minutiae in the target population.A family offinite mixture models has been developed in this paper to represent the distribu-tion of minutiae infingerprint images,including minutiae clustering tendencies and dependencies in different regions of thefingerprint image domain.A mathematical model that computes the probability of a random correspondence (PRC)is derived based on the mixture models.A PRC of 2.25×10−6corresponding to12matches was computed for the NIST4Special Database,when the numbers of query and template minutiae both equal46.This is also the estimate of the PRC for a target population with similar composition as that of NIST4.EDICS:BIO-FING,BIO-THEO,FOR-V ALII.IntroductionE XPERT testimony based onfingerprint evidence isdelivered in a courtroom by comparing salient fea-tures of a latent print lifted from a crime scene with those taken from the defendant.A reasonably high degree of match between the salient features leads the experts Yongfang Zhu and Sarat C.Dass are with the Department of Statistics&Probability,A430Wells Hall,Michigan State University, East Lansing,MI48824.Phone:517-355-9589.Fax:517-432-1405. Anil K.Jain is with the Department of Computer Science&Engi-neering,3115Engineering Building,Michigan State University,East Lansing,MI48824.Phone:517-355-9282.Fax:517-432-1061.Emails: {zhuyongf,sdass,jain}@ to testify irrefutably that the source of the latent print and the defendant are one and the same person.For decades,the testimony of forensicfingerprint experts was almost never excluded from these cases,and on cross-examination,the foundations and basis of this testimony were rarely questioned.Central to establishing an iden-tity based onfingerprint evidence is the assumption of discernible uniqueness;salient features offingerprints of different individuals are observably different,and there-fore,when two prints share many common features,the experts conclude that the sources of the two different prints are one and the same person.The assumption of discernible uniqueness,although lacking sound theoretical and empirical foundations[20],allows forensic experts to offer an unquestionable proof towards the defendant’s guilt.To make matters worse,forensic experts are never questioned on the uncertainty associated with their tes-timonials(that is,how frequently would an observable match between a pair of prints lead to errors in the identification of individuals).Thus,discernible uniqueness precludes the opportunity to establish error rates which should be estimated from collecting population samples, analyzing the inherent feature variability,and reporting the corresponding probability of two different persons sharing a set of common features(known as the probability of random correspondence).A significant event that questioned this trend occurred in1993in the case of Daubert vs.Merrell Dow Phar-maceuticals[7]where the U.S.Supreme Court ruled that in order for an expert forensic testimony to be allowed in courts,it had to be subject tofive main criteria of scientific validation,that is,whether(i)the particular technique or methodology has been subject to statistical hypothesis test-ing,(ii)its error rates has been established,(iii)standards controlling the technique’s operation exist and have been maintained,(iv)it has been peer reviewed,and(v)it has a general widespread acceptance[18].Forensic evidence based onfingerprints wasfirst challenged in the1999case of U.S.v.Byron C.Mitchell[23]under the Daubert ruling, stating that the fundamental premise for asserting the uniqueness offingerprints had not been objectively tested2and its potential matching error rates were unknown.After USA vs.Byron Mitchell,fingerprint based identification has been challenged in more than20court cases in the United States,see for example,United States vs.Llera Plaza[25],[26]in2002and United States vs.Crisp[24] in2003;also see[5]for additional court cases.The main issue with the admissibility offingerprint evidence stems from the realization that the individu-alization offingerprints has not been subjected to the principles of scientific validation.The uncertainty involved in assessingfingerprint individuality can be formulated as follows:Given a queryfingerprint,what is the probability offinding afingerprint in a target population having features similar to that of the query?”As mentioned earlier, a satisfactory answer to this question requires(i)collecting fingerprint samples from a target population,(ii)analyzing the variability of the features from the differentfingerprints collected,and(iii)defining a notion of similarity between fingerprints and reporting the corresponding probability of two different individuals sharing a set of common fingerprint features.We address issues(ii)and(iii)in this paper assuming that a sample of prints is available from a target population and a notion of similarity is given;see also Figure1.We do not address the issues and challenges involved in sampling from a target population.Instead, we assume that a database of prints is available and demonstrate how the methodology described in this paper can be used to obtain estimates offingerprint individuality. If the available database is representative of the target population,then the estimates offingerprint individuality obtained based on the methodology presented here would generalize to the target population.An analysis of vari-ability offingerprint features requires the development of appropriate statistical models on the space offingerprint features that are able to represent all aspects of variability observed in these features.Based on these models,the probability of a random correspondence(PRC)(alterna-tively,the probability that the observed match between features in a pair of prints is purely due to“chance”)will be determined.There have been a few previous studies that addressed the problem offingerprint individuality using statistical models onfingerprint features.All these studies utilized minutiae features infingerprints(both location and di-rection information)to assess individuality.However,the assumptions made in these studies do not satisfactorily represent the observed variations of the features in ac-tualfingerprint databases.For example,it is known that fingerprint minutiae tend to form clusters[21],[22]but Pankanti et al.[18]assumed a uniform distribution on minutiae locations and directions which was then corrected to match empirical results from the databases used in their study.Another assumption made by Pankanti et al.isFig.1.Intraclass variability in afingerprintdatabase.Rows correspond to differentfin-gers whereas columns correspond to mul-tiple impressions of the samefinger.Whiteboxes correspond to location offingerprintminutiae.that the minutiae location is distributed independently of the minutiae direction.But,minutiae in different regions of thefingerprint are observed to be associated with different region-specific minutiae directions.Moreover, minutiae points that are spatially close tend to have similar directions with each other.These observations on the distribution offingerprint minutiae need to be accounted for in eliciting reliable statistical models.The problem of establishing individuality estimates based onfingerprints is in contrast to DNA typing where the probability of a random correspondence has been studied extensively and quantified(see,for example,[10]). The DNA typing problem(inherently1-D)is in some sense simpler to analyze compared to thefingerprint individual-ity problem(inherently2-D);also,the act of acquiring fingerprint impressions as well as the condition of the physicalfinger itself(i.e.,cuts and bruises,and distortions) introduces many sources of noise.This paper proposes to determine reliable estimates of the probability of a random correspondence between twofingerprints via appropriate statistical models in a spirit similar to that of DNA typing.To address the issue of individuality,candidate models have to meet two important requirements:(i)flexibility, that is,the model can represent the observed distributions of the minutiae features infingerprint images over differ-ent databases,and(ii)associated measures offingerprint individuality can be easily obtained from these models.In practice,a forensic expert uses manyfingerprint features (minutiae location and direction,fingerprint class,inter-ridge distance,etc.)to make the match,but here we3 only use a subset of these features,namely,the minutiaelocations and directions,to keep the problem tractable.Weintroduce a family offinite mixture models to represent theobserved distribution of minutiae locations and directionsinfingerprint images.The reliability of the models is assessed using a criteria based on the degree to which the models are able to capture the observed variability in the minutiae locations and directions.We then derive a mathematical model for computing the PRCs based on the elicited mixture models.The rest of this paper is organized as follows:Section II describes thefinite mixture models proposed for the minutiae features(both location and direction).We also develop tests to demonstrate the appropriateness of the mixtures as distributional models forfingerprint minutiae compared to the uniform distribution.Section III devel-ops a new mathematical model for computing the PRC, whereas Section IV describes the experimental results based on the NIST Special Database4[17],and FVC2002 [13]databases.II.Statistical Models On Minutiae Location and DirectionA minutiae is the location of a ridge anomaly in afin-gerprint image[14].Forensic experts and most automatic fingerprint matching systems use minutiae for identifica-tion since these features have been shown to be stable and can be reliably extracted from prints.There are many types of ridge anomalies that occur infingerprint images -examples of these include ridge endings,bifurcations, islands,dots,enclosures,bridges,double bifurcations,tri-furcations,and others.However,in this paper,we only consider the two dominant types of minutiae,namely, endings and bifurcations.The main reasons for this are that the occurrence of the other ridge anomalies is relatively rare,and it is easy to consistently detect minutiae endings and bifurcations compared to other minutiae types.Each minutiae is characterized in terms of two components:(i) its location,i.e.,the spatial coordinates of its position, and(ii)its direction,i.e.,the angle subtended by the minutiae measured from the horizontal axis.We also do not distinguish between minutiae bifurcation and ending since it is often not easy to distinguish between them by automatic systems.Subsequently,the term“minutiae features”will be used to refer to the location and direction of a minutiae in afingerprint impression.See Figure 2for an example of minutiae features for afingerprint impression from the FVC2002DB1[13]database.Let X denote a generic random minutiae location and D denote its corresponding direction.Let S⊆R2denote the subset of the plane representing thefingerprint domain. Then the set of all possible configurations for X isthe(a)(b)(c)Fig. 2.Minutiae features consisting of thelocation,s,and direction,θ,for a typicalfingerprint image(b):The top(respectively,bottom)panel in(a)shows s andθfor a ridgebifurcation(respectively,ending).The top(respectively bottom)panel in(c)shows twosubregions in which orientations of minutiaepoints that are spatially close tend to be verysimilar.(x,y)≡s coordinate points in S.The minutiae direction, D,takes values in[0,2π).Denoting the total number of minutiae in afingerprint image by k,we will develop a joint distribution model for the k pairs of minutiae features (X,D):{(X j,D j),j=1,2,...k},that accounts for(i) clustering tendencies(non-uniformity)of minutiae,and(ii) dependence between minutiae location and direction(X j and D j)in different regions of S.The proposed joint distribution model is based on a mixture consisting of G components or clusters.Let c j be the cluster label of the j-th minutiae location and direction(X j,D j),c j∈{1,2,...,G},j=1,2,...,k. The labels c j are independently distributed according to a single multinomial with G classes and class probabilities τ1,τ2,...,τG,such thatτj≥0and G j=1τj=1.Given label c j=g,the minutiae location X j is distributed according to the densityf X g(s|µg,Σg)=φ2(s|µg,Σg),(1)whereφ2is the bivariate Gaussian density with meanµg and covariance matrixΣg.Equation(1)states that the minutiae locations arising from the g-th cluster follow a two-dimensional Gaussian with meanµg and covariance matrixΣg.The V on-Mises distribution[15]is a typical distribution used to model angular random variables,such as minutiae directions in our case.So,we assume the distribution of j-th minutiae direction,D j,belonging to the g-th cluster follows the densityf D g(θ|νg,κg,p g)=pg v(θ)·I{0≤θ<π}4Fig. 3.Probability distribution plots of theVon-Mises distribution with centerνg=π2and−πI0(κg)exp{κg cos2(θ−νg)},(3)with I0(κg)defined asI0(κg)= 2π0exp{κg cos(θ−νg)}dθ.(4)In(3),νg andκg represent the mean angle and the preci-sion(inverse of the variance)of the V on-Mises distribution, respectively.Figure3plots two density functions associ-ated with V on-Mises distributions with common meansνg but with two different precisionsκg<κ∗g.Thisfigure shows thatνg represents the“center”(or modal value) whileκg controls the degree of spread around the center (thus,the density with precisionκ∗g has higher concentra-tion aroundνg).The density f D g in(2)can be interpreted in the following way:The ridgeflow orientation,O,is assumed to follow the V on-Mises distribution(3)with meanνg and precisionκg.Subsequently,minutiae arising from the g-th component have directions that are either O or O+πwith probabilities p g and1−p g,respectively.Combining the distributions of the minutiae location (X)and the direction(D),it follows that each(X,D)(e)(f)Fig.4.Assessing thefit of the mixture modelsto minutiae location and direction:Observedminutiae locations(white boxes)and direc-tions(white lines)are shown in panels(a)and(b)for two differentfingerprints from theNIST Special Database4.Panels(c)and(d),respectively,show the cluster labels for eachminutiae feature in(a)and(b).The clusters in3-D space are shown in panels(e)and(f)withx,y,z as the row,column,and the orientationof the minutiae.is distributed according to the mixture densityf(s,θ|ΘG)=Gg=1τg f X g(s|µg,Σg)·f D g(θ|νg,κg,p g),(5) where f X g(·)and f D g(·)are defined as in(1)and(2), respectively.In(5),ΘG denotes all the unknown param-eters in the mixture model which includes the total num-ber of mixture components,G,the mixture probabilities τg,g=1,2,...,G,the component means and covariance matrices of f X g’s given byµG≡{µ1,µ2,....,µG},and ΣG≡{Σ1,Σ2,...,ΣG};the component mean angles and precisions of f D g’s given byνG≡{ν1,ν2,...,νG} andκG≡{κ1,κ2,...,κG},and the mixing probabilities5 p G≡{p1,p2,...,p G}.The model in(5)allows for(i)different clustering tendencies in the minutiae locationsand directions via G different clusters,and(ii)incorporatesdependence between the minutiae location and directionsince if X j is known to come from the g-th component,then it follows that the direction D j also comes from thesame mixture component.The mixture density given in(5)is defined on the entireplane R2,and is not restricted to thefingerprint domainS.We correct this by defining the mixture model on thefingerprint area A⊂S asf(s,θ|ΘG)f A(s,θ|ΘG)=6 ImageArea, ASensingPlane, SMinutiaeFig.6.Identifying the matching region for aquery minutiae.I T=I Q).To compute the PRC,wefirst define a minutiaematch between Q and T.A pair of minutiae features in Qand T,(X Q,D Q)and(X T,D T)respectively,is said tomatch if forfixed positive numbers r0and d0,|X Q−X T|s≤r0and|D Q−D T|a≤d0,(9)where|X Q−X T|s≡w!(14)for large m and n;equation(14)corresponds to the Poissonprobability mass function with meanλ(Q,T)given byλ(Q,T)=m n p(Q,T),(15)wherep(Q,T)=P(|X Q−X T|s≤r0and|D Q−D T|a≤d0)(16)denotes the probability of a match when(X Q,D Q)and(X T,D T)are random minutiae from(12)and(13),re-spectively.The mean parameterλ(Q,T)can be interpretedas the expected number of matches from the total numberof mn possible pairings between m minutiae in Q and nminutiae points in T with the probability of each matchbeing p(Q,T).The Poisson distribution in(14)is obtainedusing arguments similar to when a binomial distributionwith a large number of trials and small probability of“success”can be approximated by a Poisson distribution,provided the expected number of“successes”is moderate.For this reason,the Poisson approximation is also calledthe law of rare events.In our case,if we define“success”to be a minutiae match,then(i)the number of trials,mn,is large,(ii)the probability of a success,p(Q,T),is small,and(iii)the number of impostor matches between Q and Tis moderate(not exceeding10in the databases we workedwith),thus,justifying the validity of the Poisson law.The above discussion is general and holds true for anydistribution for the query andfile minutiae.In particular,when the distributions on the minutiae(both location anddirection)are chosen to be uniform,we get the followingexpression forλ(Q,T):λU(Q,T)=m n p L p D,(17)where p L(respectively,p D)is the probability that X Qand X T(respectively,D Q and D T)will match.Theprobability of a location and direction match appears asthe product p L p D since the minutiae location and directionare distributed independently of each other.For afingerprint database consisting of F differentfingers with a single impression perfinger,we wish tofind the most representative value for the probability ofa random correspondence,PRC,for this database.There7 are a total of F(F−1)/2pairs of impostorfingerprintimages(Q,T)from the entire database.The average PRCcorresponding to w minutiae matches is given byF(F−1)(Q,T)impostorpairsp∗(w;Q,T),(18)where p∗(w;Q,T)is as defined in(14);note thatp∗(w;Q,T)is symmetric in Q and T,and thus it issufficient to consider only the F(F−1)/2distinct impostorpairs instead of the total F(F−1).Each of the probabil-ities,p∗(w;Q,T),is a very small number such as10−6or10−7.Thus,the average PRC in(18)is highly affectedby the largest of these probabilities,and is,therefore,notreliable as an estimate of typical PRCs arising from theimpostor pairs.A better measure would be to consider anaverage of the trimmed probabilities.Letαdenote the per-centage of p∗(w;Q,T)to be trimmed,and let p∗(w;α/2)and p∗(w;1−α/2),respectively,denote the lower andupper100α/2-th percentiles of these probabilities.Theα-trimmed mean is given byF(F−1)(1−α)(Q,T)impostor p∗α(w;Q,T),(19)where if p∗(w;α/2)≤p∗(w;Q,T)≤p∗(w;1−α/2), p∗α(w;Q,T)= p∗(w;Q,T)0,otherwise.(20) A.Incorporating Multiple Impressions per FingerTo utilize multiple impressions of afinger(such as from databases in the Fingerprint Verification Competitions (FVCs)[12],[13]),we combine minutiae from different impressions into a single“master”on which the mixture model isfit.The minutiae consolidation procedure we fol-low is described in detail in[27]and[28].An illustration of the consolidation procedure is shown in Figure7where multiple impressions of the samefinger(a)are aligned to the reference image(b)to obtain the masterfingerprint(c). The process of minutiae consolidation has two advantages: (i)A more reliablefit of the mixture model is obtained, and(ii)the assumption of large m and n required for computing the individuality estimates is satisfied.PRCs for w matches are then obtained using(14)for the F(F−1)/2 impostor master pairs.The consolidation process involves averaging the location and direction of the same minutiae obtained from the multiple impressions.This helps smooth out any non-linear distortion effects that can affect the estimate offingerprint individuality.In this paper,we do not model the variability in the partial prints corresponding to eachfinger as was done in[28].Fig.7.Masterfingerprint construction;(a)4different impressions of afinger,(b)refer-ence impression,and(c)master.B.Identifying Clusters of Fitted Mixture ModelsIn order to compute the probability of random cor-respondence based on the mixture models,our method-ology involvesfitting a separate mixture model to each fingerprint impression/master from a target population.An important difference between the proposed methodology and previous work is that wefit mixture models to eachfin-ger/master,whereas previous studies assumed a common distribution for allfingers/impressions.Assuming a com-mon minutiae distribution for allfingerprint impressions has a serious drawback,namely,that the true distribution of minutiae may not be modeled well.For example,it is well-known that thefive majorfingerprint classes in the Henry system of classification(i.e.,right-loop,left-loop, whorl,arch and tented arch)have different class-specific minutiae distributions.Thus,using one common minutiae distribution may smooth out important clusters in the dif-ferentfingerprint classes.Moreover,PRCs depend heavily on the composition of each target population.Consider the following example:The proportion of right-loop,left-loop,whorl,arch and tented arch classes offingerprints is31.7%,33.8%,27.9%,3.7%and2.9%,respectively,in a population of British people as reported in[6].Thus, PRCs computed forfingerprints from this population will be largely influenced by the mixture modelsfitted to the right-loop,left-loop and whorl classes compared to arch and tented arch.More important is the fact that the PRCs will change if the class proportions change(for example, if the target population has an equal number offingerprints in each class,or with class proportions different from the8 ones given above).Byfitting separate mixture models toeachfinger,we ensure that the composition of a targetpopulation is correctly represented.To formally obtain the composition of a target popu-lation,we adopt an agglomerative hierarchical clusteringprocedure[9]on the space of allfitted mixture models.The dissimilarity measure between the estimated mixturedensities f and g is taken to be the Hellinger distance[11]H(f,g)= x∈S θ∈[0,2π)( g(x,θ))2dx dθ.(21)The Hellinger distance,H,is a number bounded between0and2,with H=0(respectively,H=2)if and only iff=g(respectively,f and g have disjoint support).For adatabase with Ffingers,we obtain a total of F(F−1)/2Hellinger distances corresponding to the F(F−1)/2mixture pairs.The resulting dendrogram can be cut toform N clusters of mixture densities,C1,C2,...,C N,say,based on a threshold T.Note that N=1when T=2,and as T decreases to0,N increases to F(F−1)/2.Whenthe number of clusters is N,we define the within clusterdissimilarity asW N=Ni=11|C i| f∈C i f(x,θ).(24)The mean parameterλ(Q,T)in(15)depends on Q and T via the mean mixture densities of the clusters from which Q and T are taken.If Q and T,respectively,belong to clusters C i and C j,we haveλ(Q,T)≡λ(C i,C j)with the mean mixture densities of C i and C j used in place of the original mixture densities in(16).Let p∗(w;C i,C j) denote the Poisson probabilityp∗(w;C i,C j)=e−λ(C i,C j)λ(C i,C j)wPRCα= (i,j)∈T|C i||C j|p∗(w;C i,C j)9(c)Fig.8.Empirical distributions of the number of minutiae in the (a)NIST database,(b)mas-ter prints constructed from the FVC2002DB1database,and (c)master prints constructed from the FVC2002DB2database.Average number of minutiae in the three distributions are 62,63and 77,respectively.the NIST Special Database 4[17],FVC2002DB1and FVC2002DB2[13]fingerprint databases.The NIST fin-gerprint database [17]is publicly available and contains 2,0008-bit gray scale fingerprint image pairs of size 512-by-512pixels.Because of the relative large size of the images in the NIST database,we used the first image of each pair for statistical modeling.Minutiae could not be automatically extracted from two images of the NIST database due to poor quality.Thus,the total number of NIST fingerprints used in our experiments is F =1,998.For the FVC2002database,also available in the public domain,we used two of its subsets DB1and DB2.The DB1impressions (images size =388×374)are acquired using the optical sensor “TouchView II”by Identix,while the DB2impressions (image size =296×560)are acquired using the optical sensor “FX2000”by Biometrika.Each database consists of F =100different fingers with 8impressions (L =8)per finger.Because of the small size of the DB1and DB2databases,the minutiae consolidation procedure was adopted to obtain a master.The mixture models were subsequently fitted to each master.The best fitting mixture model (see (5)and (6))was found for each finger for these three databases.Two types of statistical tests for checking the appropriateness of themixture model (6)as a distribution on fingerprint minutiae were carried out.The first type of test was to select between two models,either the mixture or the uniform,for the minutiae for each finger based on the likelihood ratio (the mixture and uniform models were fitted to the master whenever the consolidation procedure of Section III-A was adopted for a database).This model selection procedure can decide only between the mixture and the uniform model.However,it may be the case that the true distribution on fingerprint minutiae is neither one of these.Thus,the second type of statistical test carried out was to assess the goodness of fit of the mixture model to the observed distribution of minutiae for each finger.The quality of fit of the mixture distribution was determined via a p-value where large p-values (p-values ¿0.01)led to the conclusion that the mixture distribution is an adequate model;otherwise,when the p-value is smaller than .01,the mixture distribution is inadequate.In a similar fashion,we also tested the goodness of fit of the uniform model to the distribution of minutiae for each finger.Based on these statistical tests,we found strong evidence for the appropriateness of the mixture models as a distribution on fingerprint minutiae for all the three databases.We refer the reader to the technical report [27]for more details on the tests that were carried out as well as the experimental results.The distributions of m and n for the three fingerprint databases are shown in Figures 8(a),(b)and (c),respec-tively (the distribution of m and the distribution of n are identical,and hence only one histogram is obtained).The mean m (and n )values for the NIST,FVC2002DB1and FVC2002DB2databases are approximately 62,63and 77respectively (For the FVC databases,m and n are reported as the mean number of minutiae centers in each master).For the three databases,NIST 4,FVC2002DB1and FVC2002DB2,the agglomerative clustering procedure in Section III-B was carried out for the fitted mixture models to find N ∗.The resulting number of clusters is given in Table I.Table I also gives the means of the following quantities for each database:m and n ,the whole fingerprint area,and λfor the mixture models representing the theoretical mean number of impostor matches.The last column gives the mean PRC,。