Association Rules and Predictive Models for e-Banking Services
Class 5: ANOVA (Analysis of Variance) and F-tests

I. What is ANOVA

ANOVA is the short name for the Analysis of Variance. The essence of ANOVA is to decompose the total variance of the dependent variable into two additive components, one for the structural part and the other for the stochastic part of a regression. Today we are going to examine the easiest case.

II. ANOVA: An Introduction

Let the model be
    y = Xβ + ε.
Assuming x_i is a column vector (of length p) of independent-variable values for the i-th observation,
    y_i = x_i'β + ε_i.
Then x_i'b is the predicted value.

Sum of squares total:
    SST = Σ (y_i − Ȳ)²
        = Σ [(y_i − x_i'b) + (x_i'b − Ȳ)]²
        = Σ (y_i − x_i'b)² + 2 Σ (y_i − x_i'b)(x_i'b − Ȳ) + Σ (x_i'b − Ȳ)²
        = Σ e_i² + Σ (x_i'b − Ȳ)²
because Σ (y_i − x_i'b)(x_i'b − Ȳ) = Σ e_i (x_i'b − Ȳ) = 0. This is always true by OLS. Hence
    SST = SSE + SSR.

Important: the total variance of the dependent variable is decomposed into two additive parts: SSE, which is due to errors, and SSR, which is due to regression. Geometric interpretation: [blackboard]

Decomposition of Variance

If we treat X as a random variable, we can decompose total variance into the between-group portion and the within-group portion in any population:
    V(y_i) = V(x_i'β) + V(ε_i).     (1)
Proof:
    V(y_i) = V(x_i'β + ε_i)
           = V(x_i'β) + V(ε_i) + 2 Cov(x_i'β, ε_i)
           = V(x_i'β) + V(ε_i)
(by the assumption that Cov(x_k'β, ε) = 0 for all possible k).

The ANOVA table estimates the three quantities of equation (1) from the sample. As the sample size gets larger and larger, the ANOVA table approaches the equation closer and closer. In a sample, the decomposition of estimated variance is not strictly true. We thus need to decompose sums of squares and degrees of freedom separately. Is ANOVA a misnomer?

III. ANOVA in Matrix Form

I will try to give a simplified representation of ANOVA as follows:
    SST = Σ (y_i − Ȳ)²
        = Σ (y_i² − 2Ȳ y_i + Ȳ²)
        = Σ y_i² − 2Ȳ Σ y_i + nȲ²
        = Σ y_i² − 2nȲ² + nȲ²        (because Σ y_i = nȲ)
        = Σ y_i² − nȲ²
        = y'y − nȲ²
        = y'y − (1/n) y'Jy           (in your textbook, the monster look)

    SSE = e'e

    SSR = Σ (x_i'b − Ȳ)²
        = Σ (x_i'b)² − 2Ȳ Σ x_i'b + nȲ²
        = Σ (x_i'b)² − 2Ȳ Σ (y_i − e_i) + nȲ²
        = Σ (x_i'b)² − 2nȲ² + nȲ²    (because Σ y_i = nȲ and Σ e_i = 0, as always)
        = Σ (x_i'b)² − nȲ²
        = b'X'Xb − nȲ²
        = b'X'y − (1/n) y'Jy         (in your textbook, the monster look)

IV. ANOVA Table

Let us use a real example. Assume that we have a regression estimated to be
    y = −1.70 + 0.840 x

ANOVA Table

SOURCE      SS     DF   MS     F
Regression  6.44    1   6.44   6.44/0.19 = 33.89, with (1, 18) df
Error       3.40   18   0.19
Total       9.84   19

We know Σ x_i = 100, Σ y_i = 50, Σ x_i² = 509.12, Σ y_i² = 134.84, and Σ x_i y_i = 257.66. If we know that the DF for SST is 19, what is n?
    n = 20
    Ȳ = 50/20 = 2.5
    SST = Σ y_i² − nȲ² = 134.84 − 20 × 2.5 × 2.5 = 9.84
    SSR = Σ (−1.70 + 0.84 x_i)² − nȲ²
        = 20 × 1.7 × 1.7 + 0.84 × 0.84 × 509.12 − 2 × 1.7 × 0.84 × 100 − 125.0
        = 6.44
    SSE = SST − SSR = 9.84 − 6.44 = 3.40

DF (degrees of freedom): demonstration. Note: the intercept is discounted when calculating the DF for SST (so DF = n − 1 = 19). MS = SS/DF.
p = 0.000 [ask students]. What does the p-value say?
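For readers who want to check this table by hand, the short Python sketch below recomputes the coefficients and the sums of squares from the five sums reported above. It is only an illustration added for verification; the variable names are chosen here and are not part of the original notes.

```python
# Recompute the ANOVA table of Section IV from the summary statistics given
# in the notes: n = 20, sum(x) = 100, sum(y) = 50, sum(x^2) = 509.12,
# sum(y^2) = 134.84, sum(x*y) = 257.66.

n = 20
sum_x, sum_y = 100.0, 50.0
sum_x2, sum_y2, sum_xy = 509.12, 134.84, 257.66

x_bar, y_bar = sum_x / n, sum_y / n

# OLS slope and intercept for the simple regression y = b0 + b1*x
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # ~0.840
b0 = y_bar - b1 * x_bar                                          # ~-1.70

# Sums of squares (see the derivations in Section III)
sst = sum_y2 - n * y_bar ** 2                # ~9.84
ssr = b1 ** 2 * (sum_x2 - n * x_bar ** 2)    # ~6.43 (6.44 in the table, which uses the rounded coefficients)
sse = sst - ssr                              # ~3.41

# Mean squares and the F statistic with (1, n - 2) degrees of freedom
msr, mse = ssr / 1, sse / (n - 2)
f_stat = msr / mse                           # ~34 (33.89 in the table, after rounding)

print(b0, b1, sst, ssr, sse, f_stat)
```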
V. F-Tests

F-tests are more general than t-tests; t-tests can be seen as a special case of F-tests. If you have difficulty with F-tests, please ask your GSIs to review F-tests in the lab. An F-test takes the form of a ratio of two MS's:
    F(df1, df2) = MSR/MSE.
An F statistic has two degrees of freedom associated with it: the degrees of freedom in the numerator and the degrees of freedom in the denominator. An F statistic is usually larger than 1. The F statistic addresses whether the variance explained under the alternative hypothesis is due to chance. In other words, the null hypothesis is that the explained variance is due to chance, or that all the coefficients are zero. The larger the F statistic, the more likely it is that the null hypothesis is not true. There is a table in the back of your book from which you can find exact probability values. In our example, the F is 34, which is highly significant.

VI. R²

R² = SSR/SST: the proportion of variance explained by the model. In our example, R² = 65.4%.

VII. What happens if we add more independent variables?

1. SST stays the same.
2. SSR always increases.
3. SSE always decreases.
4. R² always increases.
5. MSR usually increases.
6. MSE usually decreases.
7. The F statistic usually increases.
Exceptions to 5 and 7: irrelevant variables may not explain the variance but they take up degrees of freedom. We really need to look at the results.

VIII. Important: General Ways of Hypothesis Testing with F-Statistics

All tests in linear regression can be performed with F-test statistics. The trick is to run "nested models." Two models are nested if the independent variables in one model are a subset, or linear combinations of a subset, of the independent variables in the other model.
That is to say, if model A has independent variables (1, x1, x2) and model B has independent variables (1, x1, x2, x3), then A and B are nested. A is called the restricted model; B is called the less restricted or unrestricted model. We call A restricted because A implies that β3 = 0. This is a restriction.
Another example: C has independent variables (1, x1, x2 + x3), and D has (1, x2 + x3). C and A are not nested. C and B are nested; one restriction in C: β2 = β3. C and D are nested; one restriction in D: β1 = 0. D and A are not nested. D and B are nested; two restrictions in D: β2 = β3 and β1 = 0.
We can always test the hypotheses implied in the restricted models. Steps: for each hypothesis, run two regressions, one for the restricted model and one for the unrestricted model. The SST should be the same across the two models. What is different is SSE and SSR; that is, what is different is R². Let SSE_r have df_r = n − p_r degrees of freedom and SSE_u have df_u = n − p_u, so that
    df_u − df_r = (n − p_u) − (n − p_r) = p_r − p_u < 0.
Use the following formula:
    F(df_r − df_u, df_u) = [(SSE_r − SSE_u)/(df_r − df_u)] / [SSE_u/df_u]
or
    F(df_r − df_u, df_u) = [(SSR_u − SSR_r)/(df(SSR_u) − df(SSR_r))] / [SSE_u/df_u]
(proof: use SST = SSE + SSR).
Note that df(SSE_r) − df(SSE_u) = df(SSR_u) − df(SSR_r) = Δdf, the number of constraints (not the number of parameters) implied by the restricted model. Equivalently,
    F(Δdf, df_u) = [(R_u² − R_r²)/Δdf] / [(1 − R_u²)/df_u].
Note that
    t²(df) = F(1, df).
That is, for 1-df tests you can do either an F-test or a t-test; they yield the same result. Another way to look at it is that the t-test is a special case of the F-test, with the numerator DF being 1.

IX. Assumptions of F-tests

What assumptions do we need to make an ANOVA table work? Not much of an assumption. All we need is the assumption that (X'X) is not singular, so that the least-squares estimate b exists. The assumption X'ε = 0 is needed if you want the ANOVA table to be an unbiased estimate of the true ANOVA (equation 1) in the population. Reason: we want b to be an unbiased estimator of β, and the covariance between b and ε to disappear.
For reasons I discussed earlier, the assumptions of homoscedasticity and non-serial correlation are necessary for the estimation of V(ε_i). The normality assumption, that ε_i follows a normal distribution, is needed for small samples.

X. The Concept of Increment

Every time you put one more independent variable into your model, you get an increase in R². We sometimes call the increase the "incremental R²." What it means is that more variance is explained, or SSR is increased and SSE is reduced. What you should understand is that the incremental R² attributed to a variable is always smaller than the R² for that variable when the other variables are absent.
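The nested-model F-test of Section VIII and the incremental R² just described can be checked numerically. The sketch below is illustrative only: it simulates a small data set (the data-generating process and all names are invented), fits a restricted and an unrestricted model with NumPy, and verifies that the one-restriction F statistic, the F computed from the incremental R², and the squared t statistic of the extra coefficient all coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented data-generating process: x3 is irrelevant, so the restriction
# beta3 = 0 imposed by the restricted model is true in the population.
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=2.0, size=n)

def ols(X, y):
    """Return OLS coefficients, SSE, and residual degrees of freedom."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return b, float(e @ e), X.shape[0] - X.shape[1]

X_r = np.column_stack([np.ones(n), x1, x2])        # restricted model
X_u = np.column_stack([np.ones(n), x1, x2, x3])    # unrestricted model

b_r, sse_r, df_r = ols(X_r, y)
b_u, sse_u, df_u = ols(X_u, y)

# Nested-model F test (Section VIII); here df_r - df_u = 1 restriction.
F = ((sse_r - sse_u) / (df_r - df_u)) / (sse_u / df_u)

# The same F computed from the incremental R^2 (Sections VIII and X).
sst = float(np.sum((y - y.mean()) ** 2))
r2_r, r2_u = 1 - sse_r / sst, 1 - sse_u / sst
F_from_r2 = ((r2_u - r2_r) / (df_r - df_u)) / ((1 - r2_u) / df_u)

# t statistic for b3 in the unrestricted model: t^2 equals F for a 1-df test.
mse_u = sse_u / df_u
se_b3 = np.sqrt(mse_u * np.linalg.inv(X_u.T @ X_u)[3, 3])
t = b_u[3] / se_b3

print(F, F_from_r2, t ** 2)   # all three agree up to floating-point error
```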
XI. Consequences of Omitting Relevant Independent Variables

Say the true model is the following:
    y_i = β0 + β1 x1i + β2 x2i + β3 x3i + ε_i.
But for some reason we only collect or consider data on y, x1, and x2. Therefore, we omit x3 from the regression. We briefly discussed this problem before. The short story is that we are likely to have a bias due to the omission of a relevant variable from the model. This is so even though our primary interest is to estimate the effect of x1 or x2 on y. Why? We will have a formal presentation of this problem.

XII. Measures of Goodness-of-Fit

There are different ways to assess the goodness-of-fit of a model.

A. R²
R² is a heuristic measure of the overall goodness-of-fit. It does not have an associated test statistic. R² measures the proportion of the variance in the dependent variable that is "explained" by the model:
    R² = SSR/SST = SSR/(SSR + SSE).

B. Model F-test
The model F-test tests the joint hypothesis that all the model coefficients except for the constant term are zero. Degrees of freedom associated with the model F-test: numerator, p − 1; denominator, n − p.

C. t-tests for individual parameters
A t-test for an individual parameter tests the hypothesis that a particular coefficient is equal to a particular number (commonly zero):
    t_k = (b_k − β_k0)/SE_k,
where SE_k is the square root of the (k, k) element of MSE·(X'X)⁻¹, with degrees of freedom = n − p.

D. Incremental R²
Relative to a restricted model, the gain in R² for the unrestricted model: ΔR² = R_u² − R_r².

E. F-tests for Nested Models
This is the most general form of F-tests and t-tests:
    F(df_r − df_u, df_u) = [(SSE_r − SSE_u)/(df(SSE_r) − df(SSE_u))] / [SSE_u/df_u].
It is equal to a t-test if the unrestricted and restricted models differ by only one parameter. It is equal to the model F-test if we set the restricted model to the constant-only model.
[Ask students] What are SST, SSE, and SSR, and their associated degrees of freedom, for the constant-only model?
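Before turning to the numerical example, here is a small simulation sketch of the omitted-variable problem raised in Section XI (the data-generating process is invented for illustration): when a relevant variable x3 that is correlated with x1 is left out, the estimated coefficient on x1 drifts away from its true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Invented true model: y = 1 + 1*x1 + 2*x3 + e, with x3 correlated with x1.
x1 = rng.normal(size=n)
x3 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 + 1.0 * x1 + 2.0 * x3 + rng.normal(size=n)

def coef(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

full = coef(np.column_stack([np.ones(n), x1, x3]), y)   # b1 close to the true 1.0
short = coef(np.column_stack([np.ones(n), x1]), y)      # b1 close to 1 + 2*0.8 = 2.6 (biased)

print(full[1], short[1])
```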
Numerical Example

A sociological study is interested in understanding the social determinants of mathematical achievement among high school students. You are now asked to answer a series of questions. The data are real but have been tailored for educational purposes. The total number of observations is 400. The variables are defined as:
    y:  math score
    x1: father's education
    x2: mother's education
    x3: family's socioeconomic status
    x4: number of siblings
    x5: class rank
    x6: parents' total education (note: x6 = x1 + x2)

For the following regression models, we know:

Table 1
Model                        SST     SSR     SSE     DF    R²
(1) y on (1 x1 x2 x3 x4)     34863   4201
(2) y on (1 x6 x3 x4)        34863                   396   .1065
(3) y on (1 x6 x3 x4 x5)     34863   10426   24437   395   .2991
(4) x5 on (1 x6 x3 x4)                       269753  396   .0210

1. Please fill in the missing cells in Table 1.
2. Test the hypothesis that the effects of father's education (x1) and mother's education (x2) on math score are the same after controlling for x3 and x4.
3. Test the hypothesis that x6, x3, and x4 in Model (2) all have zero effects on y.
4. Can we add x6 to Model (1)? Briefly explain your answer.
5. Test the hypothesis that the effect of class rank (x5) on math score is zero after controlling for x6, x3, and x4.

Answers:

1.
Model                        SST     SSR     SSE     DF    R²
(1) y on (1 x1 x2 x3 x4)     34863   4201    30662   395   .1205
(2) y on (1 x6 x3 x4)        34863   3713    31150   396   .1065
(3) y on (1 x6 x3 x4 x5)     34863   10426   24437   395   .2991
(4) x5 on (1 x6 x3 x4)       275539  5786    269753  396   .0210
Note that the SST for Model (4) is different from those for Models (1) through (3).

2. The restricted model is y = b0 + b1(x1 + x2) + b3 x3 + b4 x4 + e.
The unrestricted model is y = b0' + b1' x1 + b2' x2 + b3' x3 + b4' x4 + e'.
    F(1, 395) = [(31150 − 30662)/1] / [30662/395] = 488/77.63 = 6.29

3.
    F(3, 396) = [3713/3] / [31150/396] = 1237.67/78.66 = 15.73

4. No. x6 is a linear combination of x1 and x2, so X'X would be singular.

5.
    F(1, 395) = [(31150 − 24437)/1] / [24437/395] = 6713/61.87 = 108.50
    t = √108.50 = 10.42
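As a check on the arithmetic, the following sketch recomputes the three F statistics above directly from the sums of squares in the completed Table 1; it is a verification aid only.

```python
# SSE and DF values from the completed Table 1
sse1, df1 = 30662, 395   # Model (1): y on (1 x1 x2 x3 x4)
sse2, df2 = 31150, 396   # Model (2): y on (1 x6 x3 x4)
sse3, df3 = 24437, 395   # Model (3): y on (1 x6 x3 x4 x5)
ssr2 = 3713

def nested_f(sse_r, df_r, sse_u, df_u):
    """Nested-model F statistic from Section VIII / XII.E."""
    return ((sse_r - sse_u) / (df_r - df_u)) / (sse_u / df_u)

f_q2 = nested_f(sse2, df2, sse1, df1)    # ~6.29  (question 2)
f_q3 = (ssr2 / 3) / (sse2 / df2)         # ~15.73 (question 3, model F-test)
f_q5 = nested_f(sse2, df2, sse3, df3)    # ~108.5 (question 5); t = sqrt(F) ~ 10.4

print(f_q2, f_q3, f_q5, f_q5 ** 0.5)
```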
IndexAactuation layer, 132average brightness,102-103adaptive control, 43Badaptive cruise control, 129backpropagation algorithm, 159adaptive FLC, 43backward driving mode,163,166,168-169adaptive neural networks,237adaptive predictive model, 283Baddeley-Molchanov average, 124aerial vehicles, 240 Baddeley-Molchanov fuzzy set average, 120-121, 123aerodynamic forces,209aerodynamics analysis, 208, 220Baddeley-Molchanov mean,118,119-121alternating filter, 117altitude control, 240balance position, 98amplitude distribution, 177bang-bang controller,198analytical control surface, 179, 185BCFPI, 61-63angular velocity, 92,208bell-shaped waveform,25ARMAX model, 283beta distributions,122artificial neural networks,115Bezier curve, 56, 59, 63-64association, 251Bezier Curve Fuzzy PI controller,61attitude angle,208, 217Bezier function, 54aumann mean,118-120bilinear interpolation, 90, 300,302automated manual transmission,145,157binary classifier,253Bo105 helicopter, 208automatic formation flight control,240body frame,238boiler following mode,280,283automatic thresholding,117border pixels, 101automatic transmissions,145boundary layer, 192-193,195-198autonomous robots,130boundary of a fuzzy set,26autonomous underwater vehicle, 191braking resistance, 265AUTOPIA, 130bumpy control surface, 55autopilot signal, 228Index 326CCAE package software, 315, 318 calibration accuracy, 83, 299-300, 309, 310, 312CARIMA models, 290case-based reasoning, 253center of gravity method, 29-30, 32-33centroid defuzzification, 7 centroid defuzzification, 56 centroid Method, 106 characteristic polygon, 57 characterization, 43, 251, 293 chattering, 6, 84, 191-192, 195, 196, 198chromosomes, 59circuit breaker, 270classical control, 1classical set, 19-23, 25-26, 36, 254 classification, 106, 108, 111, 179, 185, 251-253classification model, 253close formation flight, 237close path tracking, 223-224 clustering, 104, 106, 108, 251-253, 255, 289clustering algorithm, 252 clustering function, 104clutch stroke, 147coarse fuzzy logic controller, 94 collective pitch angle, 209 collision avoidance, 166, 168 collision avoidance system, 160, 167, 169-170, 172collision avoidance system, 168 complement, 20, 23, 45 compressor contamination, 289 conditional independence graph, 259 confidence thresholds, 251 confidence-rated rules, 251coning angle, 210constant gain, 207constant pressure mode, 280 contrast intensification, 104 contrast intensificator operator, 104 control derivatives, 211control gain, 35, 72, 93, 96, 244 control gain factor, 93control gains, 53, 226control rules, 18, 27, 28, 35, 53, 64, 65, 90-91, 93, 207, 228, 230, 262, 302, 304-305, 315, 317control surfaces, 53-55, 64, 69, 73, 77, 193controller actuator faulty, 289 control-weighting matrix, 207 convex sets, 119-120Coordinate Measurement Machine, 301coordinate measuring machine, 96 core of a fuzzy set, 26corner cube retroreflector, 85 correlation-minimum, 243-244cost function, 74-75, 213, 282-283, 287coverage function, 118crisp input, 18, 51, 182crisp output, 7, 34, 41-42, 51, 184, 300, 305-306crisp sets, 19, 21, 23crisp variable, 18-19, 29critical clearing time, 270 crossover, 59crossover probability, 59-60cruise control, 129-130,132-135, 137-139cubic cell, 299, 301-302, 309cubic spline, 48cubic spline interpolation, 300 current time gap, 136custom membership function, 294 customer behav or, 249iDdamping factor, 211data cleaning, 250data integration, 250data mining, 249, 250, 251-255, 259-260data selection, 250data transformation, 250d-dimensional Euclidean space, 117, 124decision logic, 321 
decomposition, 173, 259Index327defuzzification function, 102, 105, 107-108, 111 defuzzifications, 17-18, 29, 34 defuzzifier, 181, 242density function, 122 dependency analysis, 258 dependency structure, 259 dependent loop level, 279depth control, 202-203depth controller, 202detection point, 169deviation, 79, 85, 185-188, 224, 251, 253, 262, 265, 268, 276, 288 dilation, 117discriminated rules, 251 discrimination, 251, 252distance function, 119-121 distance sensor, 167, 171 distribution function, 259domain knowledge, 254-255 domain-specific attributes, 251 Doppler frequency shift, 87 downhill simplex algorithm, 77, 79 downwash, 209drag reduction, 244driver’s intention estimator, 148 dutch roll, 212dynamic braking, 261-262 dynamic fuzzy system, 286, 304 dynamic tracking trajectory, 98Eedge composition, 108edge detection, 108 eigenvalues, 6-7, 212electrical coupling effect, 85, 88 electrical coupling effects, 87 equilibrium point, 207, 216 equivalent control, 194erosion, 117error rates, 96estimation, 34, 53, 119, 251, 283, 295, 302Euler angles, 208evaluation function, 258 evolution, 45, 133, 208, 251 execution layer, 262-266, 277 expert knowledge, 160, 191, 262 expert segmentation, 121-122 extended sup-star composition, 182 Ffault accommodation, 284fault clearing states, 271, 274fault detection, 288-289, 295fault diagnosis, 284fault durations, 271, 274fault isolation, 284, 288fault point, 270-271, 273-274fault tolerant control, 288fault trajectories, 271feature extraction, 256fiber glass hull, 193fin forces, 210final segmentation, 117final threshold, 116fine fuzzy controller, 90finer lookup table, 34finite element method, 318finite impulse responses, 288firing weights, 229fitness function, 59-60, 257flap angles, 209flight aerodynamic model, 247 flight envelope, 207, 214, 217flight path angle, 210flight trajectory, 208, 223footprint of uncertainty, 176, 179 formation geometry, 238, 247 formation trajectory, 246forward driving mode, 163, 167, 169 forward flight control, 217 forward flight speed, 217forward neural network, 288 forward velocity, 208, 214, 217, 219-220forward velocity tracking, 208 fossil power plants, 284-285, 296 four-dimensional synoptic data, 191 four-generator test system, 269 Fourier filter, 133four-quadrant detector, 79, 87, 92, 96foveal avascular zone, 123fundus images, 115, 121, 124 fuselage, 208-210Index 328fuselage axes, 208-209fuselage incidence, 210fuzz-C, 45fuzzifications, 18, 25fuzzifier, 181-182fuzzy ACC controller, 138fuzzy aggregation operator, 293 fuzzy ASICs, 37-38, 50fuzzy binarization algorithm, 110 fuzzy CC controller, 138fuzzy clustering algorithm, 106, 108 fuzzy constraints, 286, 291-292 fuzzy control surface, 54fuzzy damage-mitigating control, 284fuzzy decomposition, 108fuzzy domain, 102, 106fuzzy edge detection, 111fuzzy error interpolation, 300, 302, 305-306, 309, 313fuzzy filter, 104fuzzy gain scheduler, 217-218 fuzzy gain-scheduler, 207-208, 220 fuzzy geometry, 110-111fuzzy I controller, 76fuzzy image processing, 102, 106, 111, 124fuzzy implication rules, 27-28 fuzzy inference system, 17, 25, 27, 35-36, 207-208, 302, 304-306 fuzzy interpolation, 300, 302, 305- 307, 309, 313fuzzy interpolation method, 309 fuzzy interpolation technique, 300, 309, 313fuzzy interval control, 177fuzzy mapping rules, 27fuzzy model following control system, 84fuzzy modeling methods, 255 fuzzy navigation algorithm, 244 fuzzy operators, 104-105, 111 fuzzy P controller, 71, 73fuzzy PD controller, 69fuzzy perimeter, 110-111fuzzy PI controllers, 61fuzzy PID controllers, 53, 64-65, 
80 fuzzy production rules, 315fuzzy reference governor, 285 Fuzzy Robust Controller, 7fuzzy set averages, 116, 124-125 fuzzy sets, 7, 19, 22, 24, 27, 36, 45, 115, 120-121, 124-125, 151, 176-182, 184-188, 192, 228, 262, 265-266fuzzy sliding mode controller, 192, 196-197fuzzy sliding surface, 192fuzzy subsets, 152, 200fuzzy variable boundary layer, 192 fuzzyTECH, 45Ggain margins, 207gain scheduling, 193, 207, 208, 211, 217, 220gas turbines, 279Gaussian membership function, 7 Gaussian waveform, 25 Gaussian-Bell waveforms, 304 gear position decision, 145, 147 gear-operating lever, 147general window function, 105 general-purpose microprocessors, 37-38, 44genetic algorithm, 54, 59, 192, 208, 257-258genetic operators, 59-60genetic-inclined search, 257 geometric modeling, 56gimbal motor, 90, 96global gain-scheduling, 220global linear ARX model, 284 global navigation satellite systems, 141global position system, 224goal seeking behaviour, 186-187 governor valves80, 2HHamiltonian function, 261, 277 hard constraints, 283, 293 heading angle, 226, 228, 230, 239, 240-244, 246heading angle control, 240Index329heading controller, 194, 201-202 heading error rate, 194, 201 heading speed, 226heading velocity control, 240 heat recovery steam generator, 279 hedges, 103-104height method, 29helicopter, 207-212, 214, 217, 220 helicopter control matrix, 211 helicopter flight control, 207 Heneghan method, 116-117, 121-124heuristic search, 258 hierarchical approaches, 261 hierarchical architecture, 185 hierarchical fuzzy processors, 261 high dimensional systems, 191 high stepping rates, 84hit-miss topology, 119home position, 96horizontal tail plane, 209 horizontal tracker, 90hostile, 223human domain experts, 255 human visual system, 101hybrid system framework, 295 hyperbolic tangent function, 195 hyperplane, 192-193, 196 hysteresis thres olding, 116-117hIIF-THEN rule, 27-28image binarization, 106image complexity, 104image fuzzification function, 111 image segmentation, 124image-expert, 122-123indicator function, 121inert, 223inertia frame, 238inference decision methods, 317 inferential conclusion, 317 inferential decision, 317 injection molding process, 315 inner loop controller, 87integral time absolute error, 54 inter-class similarity, 252 internal dependencies, 169 interpolation property, 203 interpolative nature, 262 intersection, 20, 23-24, 31, 180 interval sets, 178interval type-2 FLC, 181interval type-2 fuzzy sets, 177, 180-181, 184inter-vehicle gap, 135intra-class similarity, 252inverse dynamics control, 228, 230 inverse dynamics method, 227 inverse kinema c, 299tiJ - Kjoin, 180Kalman gain, 213kinematic model, 299kinematic modeling, 299-300 knowledge based gear position decision, 148, 153knowledge reasoning layer, 132 knowledge representation, 250 knowledge-bas d GPD model, 146eLlabyrinths, 169laser interferometer transducer, 83 laser tracker, 301laser tracking system, 53, 63, 65, 75, 78-79, 83-85, 87, 98, 301lateral control, 131, 138lateral cyclic pitch angle, 209 lateral flapping angle, 210 leader, 238-239linear control surface, 55linear fuzzy PI, 61linear hover model, 213linear interpolation, 300-301, 306-307, 309, 313linear interpolation method, 309 linear optimal controller, 207, 217 linear P controller, 73linear state feedback controller, 7 linear structures, 117linear switching line, 198linear time-series models, 283 linguistic variables, 18, 25, 27, 90, 102, 175, 208, 258Index 330load shedding, 261load-following capabilities, 288, 297 loading dock, 159-161, 170, 172 longitudinal control, 130-132 
longitudinal cyclic pitch angle, 209 longitudinal flapping angle, 210 lookup table, 18, 31-35, 40, 44, 46, 47-48, 51, 65, 70, 74, 93, 300, 302, 304-305lower membership functions, 179-180LQ feedback gains, 208LQ linear controller, 208LQ optimal controller, 208LQ regulator, 208L-R fuzzy numbers, 121 Luenburger observer, 6Lyapunov func on, 5, 192, 284tiMMamdani model, 40, 46 Mamdani’s method, 242 Mamdani-type controller, 208 maneuverability, 164, 207, 209, 288 manual transmissions, 145 mapping function, 102, 104 marginal distribution functions, 259 market-basket analysis, 251-252 massive databases, 249matched filtering, 115 mathematical morphology, 117, 127 mating pool, 59-60max member principle, 106max-dot method, 40-41, 46mean distance function, 119mean max membership, 106mean of maximum method, 29 mean set, 118-121measuring beam, 86mechanical coupling effects, 87 mechanical layer, 132median filter, 105meet, 7, 50, 139, 180, 183, 302 membership degree, 39, 257 membership functions, 18, 25, 81 membership mapping processes, 56 miniature acrobatic helicopter, 208 minor steady state errors, 217 mixed-fuzzy controller, 92mobile robot control, 130, 175, 181 mobile robots, 171, 175-176, 183, 187-189model predictive control, 280, 287 model-based control, 224 modeless compensation, 300 modeless robot calibration, 299-301, 312-313modern combined-cycle power plant, 279modular structure, 172mold-design optimization, 323 mold-design process, 323molded part, 318-321, 323 morphological methods, 115motor angular acceleration, 3 motor plant, 3motor speed control, 2moving average filter, 105 multilayer fuzzy logic control, 276 multimachine power system, 262 multivariable control, 280 multivariable fuzzy PID control, 285 multivariable self-tuning controller, 283, 295mutation, 59mutation probability, 59-60mutual interference, 88Nnavigation control, 160neural fuzzy control, 19, 36neural networks, 173, 237, 255, 280, 284, 323neuro-fuzzy control, 237nominal plant, 2-4nonlinear adaptive control, 237non-linear control, 2, 159 nonlinear mapping, 55nonlinear switching curve, 198-199 nonlinear switching function, 200 nonvolatile memory, 44 normalized universe, 266Oobjective function, 59, 74-75, 77, 107, 281-282, 284, 287, 289-291,Index331295obstacle avoidance, 166, 169, 187-188, 223-225, 227, 231 obstacle avoidance behaviour, 187-188obstacle sensor, 224, 228off-line defuzzification, 34off-line fuzzy inference system, 302, 304off-line fuzzy technology, 300off-line lookup tables, 302 offsprings, 59-60on-line dynamic fuzzy inference system, 302online tuning, 203open water trial, 202operating point, 210optical platform, 92optimal control table, 300optimal feedback gain, 208, 215-216 optimal gains, 207original domain, 102outer loop controller, 85, 87outlier analysis, 251, 253output control gains, 92 overshoot, 3-4, 6-7, 60-61, 75-76, 94, 96, 193, 229, 266Ppath tracking, 223, 232-234 pattern evaluation, 250pattern vector, 150-151PD controller, 4, 54-55, 68-69, 71, 74, 76-77, 79, 134, 163, 165, 202 perception domain, 102 performance index, 60, 207 perturbed plants, 3, 7phase margins, 207phase-plan mapping fuzzy control, 19photovoltaic power systems, 261 phugoid mode, 212PID, 1-4, 8, 13, 19, 53, 61, 64-65, 74, 80, 84-85, 87-90, 92-98, 192 PID-fuzzy control, 19piecewise nonlinear surface, 193 pitch angle, 202, 209, 217pitch controller, 193, 201-202 pitch error, 193, 201pitch error rate, 193, 201pitch subsidence, 212planetary gearbox, 145point-in-time transaction, 252 polarizing beam-splitter, 86 poles, 4, 94, 96position sensor 
detectors, 84 positive definite matrix, 213post fault, 268, 270post-fault trajectory, 273pre-defined membership functions, 302prediction, 251, 258, 281-283, 287, 290predictive control, 280, 282-287, 290-291, 293-297predictive supervisory controller, 284preview distance control, 129 principal regulation level, 279 probabilistic reasoning approach, 259probability space, 118Problem understanding phases, 254 production rules, 316pursuer car, 136, 138-140 pursuer vehicle, 136, 138, 140Qquadrant detector, 79, 92 quadrant photo detector, 85 quadratic optimal technology, 208 quadrilateral ob tacle, 231sRradial basis function, 284 random closed set, 118random compact set, 118-120 rapid environment assessment, 191 reference beam, 86relative frame, 240relay control, 195release distance, 169residual forces, 217retinal vessel detection, 115, 117 RGB band, 115Riccati equation, 207, 213-214Index 332rise time, 3, 54, 60-61, 75-76road-environment estimator, 148 robot kinematics, 299robot workspace, 299-302, 309 robust control, 2, 84, 280robust controller, 2, 8, 90robust fuzzy controller, 2, 7 robustness property, 5, 203roll subsidence, 212rotor blade flap angle, 209rotor blades, 210rudder, 193, 201rule base size, 191, 199-200rule output function, 191, 193, 198-199, 203Runge-Kutta m thod, 61eSsampling period, 96saturation function, 195, 199 saturation functions, 162scaling factor, 54, 72-73scaling gains, 67, 69S-curve waveform, 25secondary membership function, 178 secondary memberships, 179, 181 selection, 59self-learning neural network, 159 self-organizing fuzzy control, 261 self-tuning adaptive control, 280 self-tuning control, 191semi-positive definite matrix, 213 sensitivity indices, 177sequence-based analysis, 251-252 sequential quadratic programming, 283, 292sets type-reduction, 184setting time, 54, 60-61settling time, 75-76, 94, 96SGA, 59shift points, 152shift schedule algorithms, 148shift schedules, 152, 156shifting control, 145, 147shifting schedules, 146, 152shift-schedule tables, 152sideslip angle, 210sigmoidal waveform, 25 sign function, 195, 199simplex optimal algorithm, 80 single gimbal system, 96single point mass obstacle, 223 singleton fuzzification, 181-182 sinusoidal waveform, 94, 300, 309 sliding function, 192sliding mode control, 1-2, 4, 8, 191, 193, 195-196, 203sliding mode fuzzy controller, 193, 198-200sliding mode fuzzy heading controller, 201sliding pressure control, 280 sliding region, 192, 201sliding surface, 5-6, 192-193, 195-198, 200sliding-mode fuzzy control, 19 soft constraints, 281, 287space-gap, 135special-purpose processors, 48 spectral mapping theorem, 216 speed adaptation, 138speed control, 2, 84, 130-131, 133, 160spiral subsidence, 212sporadic alternations, 257state feedback controller, 213 state transition, 167-169state transition matrix, 216state-weighting matrix, 207static fuzzy logic controller, 43 static MIMO system, 243steady state error, 4, 54, 79, 90, 94, 96, 98, 192steam turbine, 279steam valving, 261step response, 4, 7, 53, 76, 91, 193, 219stern plane, 193, 201sup operation, 183supervisory control, 191, 280, 289 supervisory layer, 262-264, 277 support function, 118support of a fuzzy set, 26sup-star composition, 182-183 surviving solutions, 257Index333swing curves, 271, 274-275 switching band, 198switching curve, 198, 200 switching function, 191, 194, 196-198, 200switching variable, 228system trajector192, 195y,Ttail plane, 210tail rotor, 209-210tail rotor derivation, 210Takagi-Sugeno fuzzy methodology, 287target displacement, 87target time gap, 136t-conorm 
maximum, 132 thermocouple sensor fault, 289 thickness variable, 319-320three-beam laser tracker, 85three-gimbal system, 96throttle pressure, 134throttle-opening degree, 149 thyristor control, 261time delay, 63, 75, 91, 93-94, 281 time optimal robust control, 203 time-gap, 135-137, 139-140time-gap derivative, 136time-gap error, 136time-invariant fuzzy system, 215t-norm minimum, 132torque converter, 145tracking error, 79, 84-85, 92, 244 tracking gimbals, 87tracking mirror, 85, 87tracking performance, 84-85, 88, 90, 192tracking speed, 75, 79, 83-84, 88, 90, 92, 97, 287trajectory mapping unit, 161, 172 transfer function, 2-5, 61-63 transient response, 92, 193 transient stability, 261, 268, 270, 275-276transient stability control, 268 trapezoidal waveform, 25 triangular fuzzy set, 319triangular waveform, 25 trim, 208, 210-211, 213, 217, 220, 237trimmed points, 210TS fuzzy gain scheduler, 217TS fuzzy model, 207, 290TS fuzzy system, 208, 215, 217, 220 TS gain scheduler, 217TS model, 207, 287TSK model, 40-41, 46TS-type controller, 208tuning function, 70, 72turbine following mode, 280, 283 turn rate, 210turning rate regulation, 208, 214, 217two-DOF mirror gimbals, 87two-layered FLC, 231two-level hierarchy controllers, 275-276two-module fuzzy logic control, 238 type-0 systems, 192type-1 FLC, 176-177, 181-182, 185- 188type-1 fuzzy sets, 177-179, 181, 185, 187type-1 membership functions, 176, 179, 183type-2 FLC, 176-177, 180-183, 185-189type-2 fuzzy set, 176-180type-2 interval consequent sets, 184 type-2 membership function, 176-178type-reduced set, 181, 183-185type-reduction,83-1841UUH-1H helicopter, 208uncertain poles, 94, 96uncertain system, 93-94, 96 uncertain zeros, 94, 96underlying domain, 259union, 20, 23-24, 30, 177, 180unit control level, 279universe of discourse, 19-24, 42, 57, 151, 153, 305unmanned aerial vehicles, 223 unmanned helicopter, 208Index 334unstructured dynamic environments, 177unstructured environments, 175-177, 179, 185, 187, 189upper membership function, 179Vvalve outlet pressure, 280vapor pressure, 280variable structure controller, 194, 204velocity feedback, 87vertical fin, 209vertical tracker, 90vertical tracking gimbal, 91vessel detection, 115, 121-122, 124-125vessel networks, 117vessel segmentation, 115, 120 vessel tracking algorithms, 115 vision-driven robotics, 87Vorob’ev fuzzy set average, 121-123 Vorob'ev mean, 118-120vortex, 237 WWang and Mendel’s algorithm, 257 WARP, 49weak link, 270, 273weighing factor, 305weighting coefficients, 75 weighting function, 213weld line, 315, 318-323western states coordinating council, 269Westinghouse turbine-generator, 283 wind–diesel power systems, 261 Wingman, 237-240, 246wingman aircraft, 238-239 wingman veloc y, 239itY-ZYager operator, 292Zana-Klein membership function, 124Zana-Klein method, 116-117, 121, 123-124zeros, 94, 96µ-law function, 54µ-law tuning method, 54。
Data Min Knowl Disc
DOI 10.1007/s10618-014-0372-z

Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering

Annalisa Appice · Donato Malerba

Received: 16 December 2012 / Accepted: 22 June 2014
© The Author(s) 2014

Abstract  Nowadays ubiquitous sensor stations are deployed worldwide, in order to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every space location. On the other hand, due to their huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can be used to reduce the amount of produced data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both these tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model, in order to summarize geophysical data, and computes a weighted linear combination of cluster prototypes, in order to predict data. Clustering is done by accounting for the local presence of the spatial autocorrelation property in the geophysical data. Weights of the linear combination are defined, in order to reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and predictive accuracy of the presented interpolative clustering algorithm.

Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.

A. Appice · D. Malerba
Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro", Via Orabona 4, 70125 Bari, Italy
e-mail: annalisa.appice@uniba.it
D. Malerba
e-mail: donato.malerba@uniba.it

Keywords  Spatial autocorrelation · Clustering · Inverse distance weighting · Geophysical data stream

1 Introduction

The widespread use of sensor networks has paved the way for the explosive ubiquity of geophysical data streams (i.e. streams of data that are measured repeatedly over a set of locations). Procedurally, remote sensors are installed worldwide. They gather information along a number of variables over large zones and long (potentially unbounded) periods of time. In this scenario, the spatial distribution of data sources, as well as the temporal distribution of measures, pose new challenges in the collection and querying of data. Much scientific and industrial interest has recently been focused on the deployment of data management systems that gather continuous multivariate data from several data sources, recognize and possibly adapt a behavioral model, and deal with queries that concern present and past data, as well as seen and unseen data. This poses specific issues that include storing entire (unbounded) data on disks with finite memory (Chiky and Hébrail 2008), as well as looking for predictions (estimations) where no measured data are available (Li and Heap 2008).
Summarization is one solution for addressing storage limits, while interpolation is one solution for supplementing unseen data. So far, both these tasks, namely summarization and interpolation, have been extensively
investigated in the literature. However, most of the studies consider only one task at a time. Several summarization algorithms, e.g. Rodrigues et al. (2008), Chen et al. (2010), Appice et al. (2013a), have been defined in data mining, in order to compute fast, compact summaries of geophysical data as they arrive. Data storage systems store the computed summaries, while discarding the actual data. Various interpolation algorithms, e.g. Shepard (1968b), Krige (1951), have been defined in geostatistics, in order to predict unseen measures of a geophysical variable. They use predictive inferences based upon actual measures sampled at specific locations of space. In this paper, we investigate a holistic approach that links predictive inferences to data summaries. Therefore, we introduce a summarization pattern of geophysical data, which can be computed to save memory space and make predictive inferences easier. We use predictive inferences that exploit the knowledge in data summaries, in order to yield accurate predictions covering any (seen and unseen) space location.
We begin by observing that the common factor of several summarization and interpolation algorithms is that they accommodate the spatial autocorrelation analysis in the learned model. Spatial autocorrelation is the cross-correlation of values of a variable strictly due to their relatively close locations on a two-dimensional surface. Spatial autocorrelation exists when there is systematic spatial variation in the values of a given property. This variation can exist in two forms, called positive and negative spatial autocorrelation (Legendre 1993). In the positive case, the value of a variable at a given location tends to be similar to the values of that variable in nearby locations. This means that if the value of some variable is low at a given location, the presence of spatial autocorrelation indicates that nearby values are also low. Conversely, negative spatial autocorrelation is characterized by dissimilar values at nearby locations.
Goodchild (1986) remarks that positive autocorrelation is seen much more frequently in practice than negative autocorrelation in geophysical variables. This is justified by Tobler's first law of geography, according to which "everything is related to everything else, but near things are more related than distant things" (Tobler 1979).
As observed by LeSage and Pace (2001), the analysis of spatial autocorrelation is crucial and can be fundamental for building a reliable spatial component into any (statistical) model for geophysical data. With the same viewpoint, we propose to: (i) model the property of spatial autocorrelation when collecting the data records of a number of geophysical variables, (ii) use this model to compute compact summaries of actual data that are discarded, and (iii) inject the computed summaries into predictive inferences, in order to yield accurate estimations of geophysical data at any space location.
The paper is organized as follows. The next section clarifies the motivation and the actual contribution of this paper. In Sect. 3, related works regarding spatial autocorrelation, spatial interpolators and clustering are reported. In Sect. 4, we report the basics of the presented algorithm, while in Sect. 5, we describe the proposed algorithm. An experimental study with several data collections is presented in Sect. 6 and conclusions are drawn.

2 Motivation and contributions

The analysis of the property of spatial autocorrelation in geophysical data poses specific issues. One issue is that most of the models that represent and learn data with spatial autocorrelation are based on the assumption of spatial stationarity. Thus, they assume a constant mean and a constant variance (no outliers) across space (Stojanova et al. 2012). This means that possible significant variabilities in autocorrelation dependencies throughout the space are overlooked. The variability could be caused by a different underlying latent structure of the space, which varies among its portions in terms of scale of values or density of measures. As pointed out by Angin and Neville (2008), when autocorrelation varies significantly throughout space, it may be more accurate to model the dependencies locally rather than globally.
Another issue is that the spatial autocorrelation analysis is frequently decoupled from the multivariate analysis. In this case, the learning process accounts for the spatial autocorrelation of univariate data, while dealing with distinct variables separately (Appice et al. 2013b). Bailey and Krzanowski (2012) observe that ignoring complex interactions among multiple variables may overlook interesting insights into the correlation of potentially related variables at any site. Based upon this idea, Dray and Jombart (2011) formulate a multivariate definition of the concept of spatial autocorrelation, which centers on the extent to which values for a number of variables observed at a given location show a systematic (more than likely under spatial randomness), homogeneous association with values observed at the "neighboring" locations.
In this paper, we develop an approach to modeling non-stationary spatial autocorrelation of multivariate geophysical data by using interpolative clustering. As in clustering, clusters of records that are similar to each other at nearby locations are identified, but a cluster description and a predictive model are associated to each cluster. Data records are aggregated through clusters based on the cluster descriptions.
The associated predictive models, which provide predictions for the variables, are stored as summaries of the clustered data. On any future demand, predictive models queried from databases are processed according to the requests, in order to yield accurate estimates for the variables. Interpolative clustering is a form of conceptual clustering (Michalski and Stepp 1983) since, besides the clusters themselves, it also provides symbolic descriptions (in the form of conjunctions of conditions) of the constructed clusters. Thus, we can also plan to consider this description, in order to obtain clusters in different contexts of the same domain. However, in contrast to conceptual clustering, interpolative clustering is a form of supervised learning. On the other hand, interpolative clustering is similar to predictive clustering (Blockeel et al. 1998), since it is a form of supervised learning. However, unlike predictive clustering, where the predictive space (target variables) is typically distinguished from the descriptive one (explanatory variables),¹ the variables of interpolative clustering play, in principle, both target and explanatory roles.
[Footnote 1: The predictive clustering framework is originally defined in Blockeel et al. (1998), in order to combine clustering problems and classification/regression problems. The predictive inference is performed by distinguishing between target variables and explanatory variables. Target variables are considered when evaluating the similarity between training data, such that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are grouped in separate clusters. Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run by considering the same set of variables for both explanatory and target roles, this case is not investigated in the original study.]
Interpolative clustering trees (ICTs) are a class of tree-structured models where a split node is associated with a cluster and a leaf node with a single predictive model for the target variables of interest. The top node of the ICT contains the entire sample of training records. This cluster is recursively partitioned along the target variables into smaller sub-clusters. A predictive model (the mean) is computed for each target variable and then associated with each leaf. All the variables are predicted independently. In the context of this paper, an ICT is built by integrating the consideration of a local indicator of spatial autocorrelation, in order to account for significant variabilities of the autocorrelation dependencies in training data. Spatial autocorrelation is coupled with a multivariate analysis by accounting for the spatial dependence of data and their multivariate variance simultaneously (Dray and Jombart 2011). This is done by maximizing the variance reduction of local indicators of spatial autocorrelation computed for multivariate data when evaluating the candidates for adding a new node to the tree. This solution has the merit of improving both the summarization and the predictive performance of the computed models. Memory space is saved by storing a single summary for data of multiple variables. Predictive accuracy is increased by exploiting the autocorrelation of data clustered in space.
From the summarization perspective, an ICT is used to summarize geophysical data according to a hierarchical view of the spatial autocorrelation. We can browse the generated clusters at different levels of the hierarchy. Predictive models on leaf clusters model spatial autocorrelation dependencies as stationary over the local geometry of the clusters. From the interpolation perspective, an ICT is used to compute knowledge that makes accurate predictive inferences easier.
We can use Inverse Distance Weighting² (Shepard 1968b) to predict variables at a specific location by a weighted linear combination of the predictive models on the leaf clusters. Weights are inverse functions of the distance of the query point from the clusters.
Finally, we can observe that an ICT provides a static model of a geophysical phenomenon. Nevertheless, inferences based on static models of spatial autocorrelation require temporal stationarity of the statistical properties of the variables. In the geophysical context, data are frequently subject to the temporal variation of such properties. This requires dynamic models that can be updated continuously as fresh data arrive (Gama 2010). In this paper, we propose an incremental algorithm for the construction of a time-adaptive ICT. When a new sample of records is acquired through the stations of a sensor network, a past ICT is modified, in order to model new data of the process, which may change their properties over time. In theory, a distinct tree can be learned for each time point and several trees can subsequently be combined using some general framework (e.g. Spiliopoulou et al. 2006) for tracking cluster evolution over time. However, this solution is prohibitively time-consuming when data arrive at a high rate. By taking into account that (1) geophysical variables are often slowly time-varying, and (2) a change of the properties of the data distribution of variables is often restricted to a delimited group of stations, more efficient learning algorithms can be derived. In this paper, we propose an algorithm that retains past clusters as long as they discriminate between surfaces of spatially autocorrelated data, while it mines novel clusters only if the latest become inaccurate. In this way, we can save computation time and track the evolution of the cluster model by detecting changes in the data properties at no additional cost.
The specific contributions in this paper are: (1) the investigation of the property of spatial autocorrelation in interpolative clustering; (2) the development of an approach that uses a local indicator of the spatial autocorrelation property, in order to build an ICT by taking into account non-stationarity in autocorrelation and multivariate analysis; (3) the development of an incremental algorithm to yield a time-evolving ICT that accounts for the fact that the statistical properties of the geophysical data may change over time; and (4) an extensive evaluation of the effectiveness of the proposed (incremental) approach on several real geophysical data sets.

3 Related works

This work has been motivated by the research literature on the property of spatial autocorrelation and its influence on interpolation theory and (predictive) clustering.
In the following subsections, we report related works from these research lines.
[Footnote 2: Inverse distance weighting is a common interpolation algorithm. It has several advantages that endorse its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): simplicity of implementation; lack of tunable parameters; ability to interpolate scattered data and work on any grid without suffering from multicollinearity.]

3.1 Spatial autocorrelation

Spatial autocorrelation analysis quantifies the degree of spatial clustering (positive autocorrelation) or dispersion (negative autocorrelation) in the values of a variable measured over a set of point locations. In the traditional approach to spatial autocorrelation, the "overall pattern" of spatial dependence in data is summarized into a single indicator, such as the familiar Global Moran's I, Global Geary's C or Gamma indicators of spatial association (see Legendre 1993 for details). These indicators allow us to establish whether nearby locations tend to have similar values (spatial clustering), random values, or different values (spatial dispersion) of a variable over the entire sample. They are referred to as global indicators of spatial autocorrelation, in contrast to the local indicators that we will consider in this paper. Procedurally, global indicators (see Getis 2008) are computed in order to indicate the degree of regional clustering over the entire distribution of a geophysical variable. Stojanova et al. (2013), as well as Appice et al. (2012), have recently investigated the addition of these indicators to predictive inferences. They show how global measures of spatial autocorrelation can be computed at multiple levels of a hierarchical clustering, in order to improve predictions that are consistently clustered in space. Despite the useful results of these studies, a limitation of global indicators of spatial autocorrelation is that they assume spatial stationarity across space (Angin and Neville 2008). While this may be useful when spatial associations are studied for the small areas associated with the bottom levels of a hierarchical model, it is not very meaningful, or may even be highly misleading, in the analysis of spatial associations for large areas associated with the top levels of a hierarchical model.
Local indicators offer an alternative to global modeling by looking for "local patterns" of spatial dependence within the study region (see Boots 2002 for a survey). Unlike global indicators, which return one value for the entire sample of data, local indicators return one value for each sampled location of a variable; this value expresses the degree to which that location is part of a cluster. Widely speaking, a local indicator of spatial autocorrelation allows us to discover deviations from global patterns of spatial association, as well as hot spots like local clusters or local outliers. Several local indicators of spatial autocorrelation are formulated in the literature.
Anselin's local Moran's I (Anselin 1995) is a local indicator of spatial autocorrelation that has gained wide acceptance in the literature. It is formulated as follows:

    I(i) = [(z(i) − z̄) / m₂] · Σ_{j=1, j≠i}^{n} λ(ij) (z(j) − z̄),    (1)

where z(i) is the value of a variable Z measured at the location i, z̄ = (1/n) Σ_{i=1}^{n} z(i) is the mean of the data measured for Z, λ(ij) is a spatial (Gaussian or bi-square) weight between the locations i and j over a neighborhood structure of the training data, and m₂ = (1/n) Σ_j (z(j) − z̄)² is the second moment. Anselin's local Moran's I is related to the global Moran I, as the average of I(i) is equal to the global I, up to a factor of proportionality. A positive value for I(i) indicates that z(i) is surrounded by similar values; therefore, this value is part of a cluster. A negative value for I(i) indicates that z(i) is surrounded by dissimilar values; thus, this value is an outlier.
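A minimal NumPy sketch of Eq. (1) is given below as a reading aid; it is not the authors' code. The spatial weights λ(ij) are assumed to be supplied by the caller (e.g. Gaussian or bi-square weights over a chosen neighborhood structure, as mentioned above), with a zero diagonal so that the term j = i is excluded. The toy station configuration and contiguity weights are invented for illustration.

```python
import numpy as np

def local_moran_i(z, weights):
    """Anselin's local Moran's I, one value per location (Eq. 1).

    z       : 1-D array of measurements of a variable Z.
    weights : (n, n) spatial weight matrix with weights[i, i] = 0.
    """
    z = np.asarray(z, dtype=float)
    dev = z - z.mean()                 # z(i) - z_bar
    m2 = np.mean(dev ** 2)             # second moment
    return (dev / m2) * (weights @ dev)

# Toy usage: four invented stations with row-standardized binary contiguity
# weights (any Gaussian or bi-square scheme could be used instead).
z = np.array([2.0, 2.1, 1.9, 8.0])
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)
print(local_moran_i(z, W))   # positive for the clustered low values, negative for the outlier
```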
The standardized Getis and Ord local GI* (Getis and Ord 1992) is a local indicator of spatial autocorrelation that is formulated as follows:

    GI*(i) = [ Σ_{j=1, j≠i}^{n} λ(ij) z(j) − z̄ Λ(i) ] / [ S · √( (n Σ_{j=1, j≠i}^{n} λ(ij)² − Λ(i)²) / (n − 1) ) ],    (2)

where Λ(i) = Σ_{j=1, j≠i}^{n} λ(ij) and S² = (1/n) Σ_{j=1}^{n} (z(j) − z̄)². A positive value for GI*(i) indicates clusters of high values around i, while a negative value for GI*(i) indicates clusters of low values around i. The interpretation of GI* is different from that of I: the former distinguishes clusters of high and low values, but does not capture the presence of negative spatial autocorrelation (dispersion); the latter is able to detect both positive and negative spatial autocorrelation, but does not distinguish clusters of high or low values. Getis and Ord (1992) recommend computing GI* to look for spatial clusters and I to detect spatial outliers. By following this advice, Holden and Evans (2010) apply fuzzy C-means to GI* values, in order to cluster satellite-inferred burn severity classes. Scrucca (2005) and Appice et al. (2013c) use k-means to cluster GI* values, to compute spatially aware clusters of the data measured for a geophysical variable.
Measures of both global and local spatial autocorrelation are principally defined for univariate data. However, the integration of multivariate and autocorrelation information has recently been advocated by Dray and Jombart (2011). The simplest approach considers a two-step procedure, where data are first summarized with a multivariate analysis such as PCA. In a second step, any univariate (either global or local) spatial measure can be applied to the PCA scores for each axis separately. The other approach finds coefficients to obtain a linear combination of variables, which maximizes the product between the variance and the global Moran measure of the scores. Alternatively, Stojanova et al. (2013) propose computing the mean of global measures (Moran I and global Getis C), computed for distinct variables of a vector, as a global indicator of spatial autocorrelation of the vector, by blurring cross-correlations between separate variables. Dray et al. (2006) explore the theory of the principal coordinates of neighbor matrices and develop the framework of Moran's eigenvector maps. They demonstrate that their framework can be linked to spatial autocorrelation structure functions also in multivariate domains. Blanchet et al. (2008) expand this framework by taking into account asymmetric directional spatial processes.
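To close this subsection, the local GI* of Eq. (2) can be sketched in the same style and under the same assumptions as the previous snippet (a caller-supplied weight matrix with zero diagonal); again, this is an illustration rather than the authors' implementation, and it can be applied to the same z and W as above.

```python
import numpy as np

def local_getis_ord_g_star(z, weights):
    """Standardized Getis and Ord local GI* of Eq. (2), one value per location."""
    z = np.asarray(z, dtype=float)
    n = z.size
    z_bar = z.mean()
    s = np.sqrt(np.mean((z - z_bar) ** 2))      # S, with S^2 = sum((z - z_bar)^2) / n
    lam_sum = weights.sum(axis=1)               # Lambda(i) = sum_j lambda(ij)
    lam_sq_sum = (weights ** 2).sum(axis=1)     # sum_j lambda(ij)^2
    num = weights @ z - z_bar * lam_sum
    den = s * np.sqrt((n * lam_sq_sum - lam_sum ** 2) / (n - 1))
    return num / den
```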
3.2 Interpolation theory

Studies on spatial interpolation were initially encouraged by the analysis of ore mining, water extraction or pumping, and rock inspection (Cressie 1993). In these fields, interpolation algorithms are required as the main resource to recover unknown information and account for problems like missing data, energy saving and sensor default, as well as to support data summarization and the investigation of spatial correlation between observed data (Lam 1983). The interpolation algorithms estimate a geophysical quantity in any geographic location where the variable measure is not available. The interpolated value is derived by making use of the knowledge of the nearby observed data and, sometimes, of some hypotheses or supplementary information on the data variable. The rationale behind this spatially aware estimate of a variable is the property of positive spatial autocorrelation. Any spatial interpolator accounts for this property, including within its formulation the consideration of a stronger correlation between data which are closer than for those that are farther apart.
Regression (Burrough and McDonnell 1998), Inverse Distance Weighting (IDW) (Shepard 1968a), Radial Basis Functions (RBF) (Lin and Chen 2004) and Kriging (Krige 1951) are the most common interpolation algorithms. These algorithms are studied to deal with the irregular sampling of the investigated area (Isaaks and Srivastava 1989; Stein 1999) or with the difficulty of describing the area by the local atlas of larger and irregular manifolds. Regression algorithms, that are statistical interpolators, determine a functional relationship between the variable to be predicted and the geographic coordinates of the points where the variable is measured. IDW and RBF, which are both deterministic interpolators, use mathematical functions to calculate an unknown variable value in a geographic location, based either on the degree of similarity (IDW) or the degree of smoothing (RBF) in relation to neighboring data points. Both algorithms share with Kriging the idea that the collection of variable observations can be considered as a production of correlated spatial random data with specific statistical properties. In Kriging, specifically, this correlation is used to derive a second-order model of the variable (the variogram). The variogram represents an approximate measure of the spatial dissimilarity of the observed data. The IDW interpolation is based on a linear combination of nearby observations with weights proportional to a power of the distances. It is a heuristic but efficient approach justified by the typical power-law of the spatial correlation. In this sense, IDW accomplishes the same strategy adopted by the more rigorous formulation of Kriging (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011).
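Since IDW recurs throughout this section, a minimal sketch of the basic scheme may be helpful: the estimate at a query point is a weighted linear combination of the observed values, with weights proportional to an inverse power of the distances. The power p = 2, the exact-hit handling, and the toy station layout below are choices made here for illustration, not prescriptions from the paper.

```python
import numpy as np

def idw_predict(query_xy, station_xy, station_z, power=2.0):
    """Inverse Distance Weighting estimate at one query location.

    query_xy   : (2,) coordinates of the unseen location.
    station_xy : (n, 2) coordinates of the stations with observed data.
    station_z  : (n,) observed values of the geophysical variable.
    """
    d = np.linalg.norm(station_xy - query_xy, axis=1)
    if np.any(d == 0):                    # query coincides with a station
        return float(station_z[d == 0][0])
    w = 1.0 / d ** power                  # weights ~ inverse power of distance
    return float(np.sum(w * station_z) / np.sum(w))

# Invented toy usage: three stations and one unseen location
stations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
values = np.array([10.0, 14.0, 12.0])
print(idw_predict(np.array([0.2, 0.2]), stations, values))
```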
The algorithm proves more efficient than ordinary IDW and,in several cases,also better than Kriging.Recently,Appice et al.(2013b)have started to link IDW to the spatio-temporal knowledge enclosed in specific summarization patterns,called trend clusters.Nevertheless,this investigation is restricted to univariate data.Local spatial autocorrelation and interpolative clusteringThere are several applications(e.g.Li and Revesz2002;Karydas et al.2009;Li et al.2011)where IDW is used.This contributes to highlighting IDW as a determinis-tic,quick and simple interpolation algorithm that yields accurate predictions.On the other hand,Kriging is based on the statistical properties of a variable and,hence,it is expected to be more accurate regarding the general characteristics of the recorded data and the efficacy of the model.In any case,the accuracy of Kriging depends on a reliable estimation of the variogram(Isaaks and Srivastava1989;¸Sen and¸Salhn 2001)and the variogram computation cost is proportional to the cube of the number of observed data(Cressie1990).This cost can be prohibitive in time-evolving applica-tions,where the statistical properties of a variable may change over time.In the Data Mining framework,the change of the underlying properties over time is usually called concept drift(Gama2010).It is noteworthy that the concept drift,expected in dynamic data,can be a serious complication for Kriging.It may impose the repetition of costly computation of the variogram each time the statistical properties of the variable change significantly.These considerations motivate the broad use of an interpolator like IDW that is accurate enough and whose learning phase can be run on-line when data are collected in a stream.3.3Cluster analysisCluster analysis is frequently used in geophysical data interpretation,in order to obtain meaningful and useful results(Song et al.2010).In addition,it can be used as a sum-marization paradigm for data streams,since it underlines the advantage of discovering summaries(clusters)that adjust well to the evolution of data.The seminal work is that of Aggarwal et al.(2007),where a k-means algorithm is tailored to discover micro-clusters from multidimensional transactions that arrive in a stream.Micro-clusters are adjusted each time a transaction arrives,in order to preserve the temporal locality of data along a time horizon.Another clustering algorithm to summarize data streams is presented in Nassar and Sander(2007).The main characteristic of this algorithm is that it allows us to summarize multi-source data streams.The multi-source stream is com-posed of sets of numeric values that are transmitted by a variable number of sources at consecutive time points.Timestamped values are modeled as2D(time-domain)points of a Euclidean space.Hence,the source location is neither represented as a dimension of analysis nor processed as information-bearing.The stream is broken into windows. 
Dense regions of2D points are detected in these windows and represented by means of cluster feature vectors.Although a spatial clustering algorithm is employed,the spatial arrangement of data sources is still neglected.Appice et al.(2013a)describe a clustering algorithm that accounts for the property of spatial autocorrelation when computing clusters that are compact and accurate summaries of univariate geophysical data streams.In all these studies clustering is addressed as an unsupervised task.Predictive clustering is a supervised extension of cluster analysis,which combines elements of prediction and clustering.The task,originally formulated in Blockeel et al.(1998),assumes:(1)a descriptive space of explanatory variables,(2)a predictive space of target variables and(3)a set of training records defined on both descriptive and predictive space.Training records that are similar to each other are clustered and a predictive model is associated to each cluster.This allows us to predict the unknown。
课标1.核心素养:core competences/ key competences2.关键能力:key skills3.必备品格:necessary characters4.新课标:New Curriculum Standard5.语言能力:linguistic ability6.思维品质:thinking ability7.文化品格:cultural character8.学习能力:learning ability9.英语学习活动观:English learning activities outlook10.立德树人:strengthen moral education and cultivate people11.主题语境:Theme context12.语篇类型:Discourse/Text Type13.语言知识:language knowledge14.语言技能:language skills15.学习策略:learning strategy16.文化知识:culture knowledge17.人与自我:man/human and ego18.人与自然:man/human and nature19.人与社会:man/human and society20. 语音、词汇、语法、语篇、语用:phonetics, vocabulary, grammar,discourse, pragmatics.(补充)情感态度affective attitudes认知策略cognitive strategy调控策略/元认知策略m etacognitive strategy交际策略communicative strategy资源策略resource strategy文化意识cultrual awareness综合语言运用能力comprehesive language competence自主学习:autonomous/active/independent learning 合作学习:collaborative/cooperative/group cooperation learning探究学习:inquiry learning跨文化意识:Intercultural / Cross-cultural awareness 教学反思:teaching introspection工具性和人文性:the dual nature of instrumentality and humanism面向全体学生:for all students/be open to all students 衔接cohesion:(语境上的,抽象的)连贯coherence:(语法与词汇,有形的) 语言技能听力教学teaching listening听力策略listening strategy语感 language sense自下而上模式bottom-up model自上而下模式top-down model交互模式interactive model听前 pre-listening听中 while-listening听后 post-listening主旨 gist精听 intensive listening泛听 extensive listening图式 schemata口语教学teaching speaking准确性accuracy流利性fluency得体性appropriacy连贯性coherence控制性活动controlled/mechanical activities半控制性活动semi-controlled/semi-mechanical activities开放活动open/creative activities阅读教学teaching reading自上而下模式top-down model自下而上模式bottom-up model平行阅读parallel reading交互模式interactive model预测 predicting速度 skimming扫读 scanning猜词 word-guessing推断 inferring重述 paraphrasing精读 intensive reading泛读 extensive reading写作教学teaching writing以结果为导向写作product-oriented approach以过程为导向写作process-oriented approach以内容为导向写作content-oriented approach头脑风暴brainstorming思维导图mindmapping列提纲outlining打草稿drafting修改 editing校对 proofreading控制型写作controlled writing指导性写作guided writing表达性写作expressive writing教学实施程序性提问procedral questions展示性问题display questions参阅性问题referential questions评估性问题evaluative questions开放式问题open questions封闭式问题closed questions直接纠错explicit correction间接纠错implicit correction重复法repetition重述法recast强调暗示法pinpointing指导者instructor促进者facilitator控制者controller组织者organizer参与者participant提示者prompter评估者assessor资源提供者resource-provider研究者researcher形成性评价formative assessment终结性评价summative assessment诊断性评价diagnostic assessment成绩测试achievement test水平测试proficiency test潜能测试aptitude test诊断测试diagnostic test结业测试exit test主观性测试subjective test客观性测试objective test直接测试direct test间接测试indirect test常模参照性测试Norm-referenced Test标准参照性测试Criterion-referenced Test 个体参照性测试Individual-referenced Test 信度 Reliability效度 Validity表面效度Face validity结构效度Construct validity内容效度Content validity共时效度Concurrent validity预示效度Predictive validity区分度Discrimination难度 Difficulty教学法语法翻译法grammar-translation method直接法direct method听说法audio-lingual method视听法audio-visual method交际法communicative approach全身反应法total physical response(TPR)任务型教学法task-based teaching approach语言学中英对照导论规定性与描写性Prescriptive[prɪˈskrɪptɪv] and descriptive [dɪˈskrɪptɪv]共时性与历时性Synchronic[sɪŋˈkrɒnɪk] and diachronic [ˌdaɪəˈkrɒnɪk]语言和言语Langue 美[lɑːŋɡ] and parole[pəˈrəʊl]语言能力与语言运用 competence and performance 任意性Arbitrariness[ˈɑːbɪtrərinəs]能产性Productivity[ˌprɒdʌkˈtɪvəti]双重性Duality[djuːˈæləti]移位性Displacement[dɪsˈpleɪsmənt]文化传承性Cultural transmission [trænzˈmɪʃn]语音学(phonetics[fəˈnetɪks])发音语音学articulatory [ɑːˈtɪkjuleɪtəri] phonetics听觉语音学auditory[ˈɔːdətri] phonetics声学语音学acoustic 
[əˈkuːstɪk]phonetics元音vowel [ˈvaʊəl]辅音consonant [ˈkɒnsənənt]发音方式Manner of Articulation闭塞音stops摩擦音fricative [ˈfrɪkətɪv]塞擦音affricate[ˈæfrɪkət]鼻音nasal [ˈneɪzl]边音lateral [ˈlætərəl]流音liquids[ˈlɪkwɪdz]滑音 glides[ɡlaɪdz]发音部位place of articulation双唇音bilabial[ˌbaɪˈleɪbiəl]唇齿音labiodental[ˌleɪbiəʊˈdentl]齿音 dental齿龈音alveolar[ælˈviːələ]硬颚音palatal [ˈpælətl]软腭音velar [ˈviːlə(r)]喉音glottal[ˈɡlɒtl]声带振动vocal [ˈvəʊkl]cord [kɔːd] vibration 清辅音voiceless浊辅音voiced [vɔɪst]音位学(phonology[fəˈnɒlədʒi])音素 phone音位phoneme [ˈfəʊniːm]音位变体allophone[ˈæləfəʊn]音位对立phonemic[fəʊˈniːmɪk]contrast互补分布complementary[ˌkɒmplɪˈmentri] distribution最小对立体minimal pair序列规则sequential[sɪˈkwenʃl] rule同化规则assimilation [əˌsɪməˈleɪʃn]rules省略规则deletion[dɪˈliːʃən] rule重音 stress语调intonation[ˌɪntəˈneɪʃn]升调 rising tone降调 falling tone[təʊn]形态学(morphology[mɔːˈfɒlədʒi])词素morphemes[ˈmɔːfiːmz]同质异形体allomorphs[ˈæləmɔːf]自由词素free morphemes粘着词素bound morphemes屈折词素inflectional[ɪnˈflɛkʃən(ə)l]/ grammatical morphemes派生词素derivational[ˌdɛrɪˈveɪʃən(ə)l]/ lexical morphemes词缀 affix[əˈfɪks , ˈæfɪks]前缀 prefix后缀suffix[ˈsʌfɪks] 语义学(semantics [sɪˈmæntɪks])词汇意义lexical meaning意义 sense概念意conceptual[kənˈseptʃuəl] meaning本意denotative[ˌdiː.nəʊˈteɪtiv] meaning字面意literal meaning内涵意connotative 美[ˈkɑː.nə.teɪ.t̬ɪv] meaning指称reference[ˈrefrəns]同义现象synonymy [sɪˈnɒnɪmi]方言同义词dialectal[ˌdaɪ.əˈlek.təl]synonyms[ˈsɪnənɪmz]文体同义词stylistic [staɪˈlɪstɪk]synonyms搭配同义词collocational[ˌkɒləʊˈkeɪʃən(ə)l] synonyms语义不同的同义词semantically different synonyms多义现象polysemy [pəˈlɪsɪmi]同音异义homonymy同音异义词homophones[ˈhɒməʊfəʊnz]同形异义词homographs完全同音异义词completehomonyms[ˈhɒm.ə.nɪm]下义关系hyponymy美[haɪ'pɒnɪmɪ]上义词superordinate下义词hyponyms['haɪpəʊnɪm]反义词antonymy[æn‘tɒnɪmɪ]等义反义词gradable [ˈɡreɪdəbl] antonyms [ˈæntəʊnɪmz]互补反义词complementary antonyms关系反义词relational opposites蕴含 entailment[ɪnˈteɪl.mənt]预设presuppose[ˌpriːsəˈpəʊz]语义反常semantically anomalous[əˈnɒmələs]语用学(pragmatics[præɡˈmætɪks])言语行为理论speech act theory言内行为locutionary[ləˈkjuː.ʃən.ər.i]言外行为illocutionary[ˌɪl.əˈkjuː.ʃən.ər.i]阐述类representatives[ˌrɛprɪˈzɛntətɪvz]指令类directives [dɪˈrɛktɪvz]承诺类commissives表达类expressives[ɪkˈspresɪvz]宣告类declarations [ˌdɛkləˈreɪʃənz]言后行为perlocutionary直接言语行为direct speech act间接言语行为inderect speech act会话原则principle of conversation合作原则cooperative principle数量准则the maxim [ˈmæksɪmof quantity质量准则the maxim of quality关系准则the maxim of relation方式准则the maxim of manner语言与社会地域方言regional dialect社会方言sociolect [ˈsəʊsiəʊlekt]语域register[ˈredʒɪstə(r)]语场 field of discourse语旨tenor [ˈtenə(r)]of discourse语式 mode of discourse第二语言习得语言习得机制LAD(language acquisition Device 对比分析contrastive[kənˈtrɑː.stɪv] analysis错误分析error analysis语际错误interlingual[ˌɪntəˈlɪŋɡwəl] errors语内错误intralingual [ˌɪntrəˈlɪŋɡwəl] errors过度概括overgeneralization互相联想cross-association中介语interlanguage石化fossilization [ˌfɒs.ɪ.laɪˈzeɪ.ʃən]。
2024 Grade 11 English: analysis of building cooperation mechanisms for global cooperation research — 30 multiple-choice questions

1. International cooperation is crucial for addressing global challenges. The ______ of different countries is essential.
A. efforts  B. organizations  C. cooperations  D. initiatives
Answer: B.
"International cooperation is crucial for addressing global challenges. The organizations of different countries are essential." Option A, "efforts", means effort; option C, "cooperations", repeats the idea of cooperation already in the sentence; option D, "initiatives", means proposals. Given the context, the sentence stresses the organizations of different countries, so B is the answer.

2. Global cooperation requires strong ______ among nations.
A. associations  B. partnerships  C. connections  D. relationships
Answer: B.
"Global cooperation requires strong partnerships among nations." Option A, "associations", means societies or clubs; option C, "connections", means links; option D, "relationships", means relations. "Partnerships" best reflects what global cooperation requires, so B is the answer.

3. The success of global cooperation depends on effective ______.
A. coordinations  B. arrangements  C. organizations  D. plans
Answer: C.
"The success of global cooperation depends on effective organization." Option A, "coordinations", means coordination; option B, "arrangements", means arrangements; option D, "plans", means plans. The sentence stresses the importance of organization, so C is the answer.

4. In global cooperation, ______ play an important role in promoting common development.
A. institutions  B. companies  C. factories  D. schools
Answer: A.
Contemporary Management ResearchPages 159-176, Vol. 6, No. 2, June 2010Determinants of Internal Control WeaknessesJohn J. ChehThe University of AkronE-Mail: cheh@Jang-hyung LeeDaegu UniversityE-Mail: goodljh@daegu.ac.krIl-woon KimThe University of AkronE-Mail: ikim@ABSTRACTThe major purpose of this paper is to find the financial and nonfinancial variables which would be useful in identifying companies with Sarbanes-Oxley Act internal control material weaknesses (MW). Recently, three research papers have been published on this topic based on different variables. Using the decision tree model of data mining, this paper examines the predictive power of the variables used in each paper in identifying MW companies and compares the predictive rates of the three sets of the variables used in these studies. In addition, an attempt is made to find the optimal set of the variables which gives the highest predictive rate. Our results have shown that each set of the variables is complementary to each other and strengthens the prediction accuracy. The findings from this study can provide valuable insights to external auditors in designing a cost-effective and high-quality audit decision support system to comply with the Sarbanes-Oxley Act.Keywords: Sarbanes-Oxley Act, Internal Control Material Weakness, Data MiningContemporary Management Research 160INTRODUCTIONThe Sarbanes-Oxley Act (SOX) of 2002, also known as the Public Company Accounting Reform and Investor Protection Act, was enacted on July 30, 2002 in response to a number of major corporate and accounting scandals of large companies in the U.S., such as Enron, Tyco International, Adelphia, Peregrine Systems, and WorldCom. Section 404 of the Sarbanes-Oxley Act requires management of publicly traded companies to assess their company’s internal control material weakness (MW) and to provide an internal control report as part of their periodic report to stockholders and regulators. The Act requires all public companies to maintain accurate records and an adequate system of internal accounting control. SOX also dramatically increased the penalties for false financial reporting on both management and external auditors.Even though the Act helped improve the quality and transparency of financial reports and give investors more confidence in financial reporting through added focus on internal control, increasing costs of compliance has been a major concern of many companies. A recent survey by Financial Executive International (O’Sullivan, 2006) found that public companies have incurred greater than expected costs to comply with section 404 of SOX: 58% increase in the fees charged by external auditors. It was estimated that U.S. companies would have spent $20 billion by the end of 2006 to comply with the law since it was passed in 2002. AMR Research also estimates that companies are spending about $1 million on SOX compliance for every $1 billion in revenues.The major purpose of this paper is to find the financial and nonfinancial variables which would be useful in predicting Sarbanes-Oxley Act internal control material weaknesses (MW). In addition, an attempt will made using all the variables from these published studies to find an optimal set of the variables which gives the highest predictive rate. 
Once identified, these variables can be used by external auditors as a screening device in identifying companies with MW in internal control, which can help external auditors save the costs of audit significantly.PREVIOUS RESEARCHRecently, three research papers have been published on identifying variables affecting Sarbanes-Oxley Act internal control MW using different variables and different methodologies: Cheh et al. (2006), Doyle et al. (2007) and Ogneva et al. (2007). In the study by Cheh et al., a trial and error approach was used with different combinations of many variables, and 23 financial variables were finally identifiedContemporary Management Research 161 which gave the highest predictive rate for internal control MW. Studies by Doyle et al. and Ogneva et al. used a traditional method of theorizing the situations and bringing theory-based variables into the research. Doyle et al. (2007), in particular, made an attempt to identify the determinants of weaknesses in internal control by examining 779 firms that disclosed material weaknesses from August 2002 to 2005. Eleven variables were used in their study based on two major categories: entity-wide vs. accounting specific. The study by Ogneva et al. (2007) focused on the cost of equity that would be impacted by the deterioration of information quality from the weaknesses in internal control. Ten variables which represented the firm characteristics associated with internal control weaknesses were used in their study.Although three groups of these researchers worked on a similar topic, each research group had slightly different research objectives. The aim of Cheh et al.’s work (2006) was to find data mining rules that facilitate the identification of MW companies. Doyle at al. (2007) strived to find characteristics of MW companies. On the other hand, Ogneva et al. (2007) were more interested in examining the association between cost of capital and internal control weaknesses. The variables used in each study were not necessarily selected to find MW companies, but they were certainly related to MW. Since the variables in each study are quite different from each other, it would be interesting to see if the predictability can be improved using all the variables from the three studies. Eventually, a certain set of the financial and nonfinancial variables may be able to be identified that can be powerful enough, yet most cost effective, in predicting MW companies.Besides the three papers mentioned, several papers examined some issues related to Sarbanes-Oxley Act and internal control. None of them, however, appeared to deal with ex-ante factors that might have determined the weakness in internal control. In general, they have addressed the effects after the disclosure of internal control weakness. For example, Kim et al. (2009) tested how the disclosure of the weakness in internal control would affect the loan contract, and Li et al. (2008) examined what role remediation of internal control weakness would play in internal and external corporate governance. Tang and Xu (2008) investigated how institutional ownership would affect the disclosure of internal control weakness, and Xu and Tang (2008) studied the relationship between financial analysts’ earnings forecasts and internal control weakness. 
Patterson and Smith (2007) did a theoretical investigation on the effects of the Sarbanes-Oxley Act of 2002 on auditing intensity and internal control strength, but did not address the issues on the determinants of internal control weakness.Contemporary Management Research 162RESEARCH METHODOLOGYDecision Trees ModelAs an integral part of knowledge discovery from large databases, data mining is the overall process of converting raw data into useful information. Data mining techniques tend to allow data to speak for themselves rather than coming into the data analysis with any preconceived notion of hypothetical relationships of the variables under study. Hence, data mining has been widely used in business to find useful patterns and trends that might otherwise remain unknown within the business data.Data mining, which is often called analytics, has been widely used not only for business practices but also for academic research. There are numerous conceptual models in data mining, such as decision trees, classification and rules, clustering, instances-based learning, linear regression, and Bayesian networks (Witten and Frank, 2005). Since one of the main objectives of our research was to find tree like rules that would help audit practitioners make informed decisions about MW, we decided to use the decision tree model. As a graph of decisions, the model uses a decision tree as a predictive model which maps the variables about an item to the conclusion about the item’s target value.Independent VariablesThere is conceivably a large number of varying combinatorial sets of independent variables on the dependent variable of material weakness on a company’s internal control. In these conceivably huge sets of the independent variables, we decided to use the variables which were used in three published studies: Cheh et al. (2006), Doyle et al. (2007) and Ogneva et al. (2007). In total, 46 variables were used on these studies: 23 by Cheh et al., 10 by Doyle et al., and 13 by Ogneva et al. There were several overlapping variables between studies by Doyle et al. and Ogneva et al. In addition, two of the variables used by Doyle et al. (2007) were excluded in this study because they were not available in our data bases, such as Audit Analytics, CRSP, and Research Insight. These two variables were the number of special purpose entities (SPEs) associated with each firm and governance score. Consequently, 35 variables were used in this study. Also, we excluded in our study three variables from the work by Ogneva et al.; these variables were predicted forecast errors and were known to be associated with systematic biases in analyst forecasts. The variables utilized forecast data from I/B/E/S and Value Line; but these data were not available in our aforementioned data bases.Contemporary Management Research 163 Panel A of Table 1 provides detailed descriptive explanations of the 35 independent variables, and the descriptive statistics of each variable are presented in Panel B. It can be seen in Panel B that four variables have very large standard deviations and variances. These variables are Net Income/(Total Assets – Total Liabilities), Cost of Goods Sold/Inventory, Retained Earnings/Inventory, and Total Liabilities/(Total Assets – Total Liabilities). Although we have normalized net income, cost of goods sold, retained earnings, and total assets using various measures, these variables apparently have had very large fluctuations among the sample firms. 
For net income, however, it is interesting to note that Net Income/Sales and Net Income/Total Assets have very small variances while Net Income/(Total Assets – Total Liabilities) has a significant variance. This result indicates that net income of sample firms fluctuated significantly with respect to net equity and was relatively stable with respect to sales and total assets.Sample FirmsThe initial sample companies came from Research Insight which has almost 10,000 companies listed in its Compustat database. These companies are listed in three major U.S. stock exchanges (i.e., New York Stock Exchange, American Stock Exchange and NASDAQ), regional U.S. stock exchanges, Canadian stock exchanges, and over-the-counter markets. For these companies, the data required for this study were retrieved from three research databases: Audit Analytics, CRSP, and Research Insight. The companies with SOX MW for the period of 2002 to 2006 were obtained from Audit Analytics. The data for the 35 variables were retrieved from Compustat database in Research Insight, except for information about the ages of the sample firms. The information about ages was available in the CRSP tape. To remove any companies with incomplete data for any variables, we used an in-house developed software application. This process resulted in 869 companies in the final sample, a considerably smaller subset of the initial sample.Contemporary Management Research 164Table 1 Descriptive statistics for 35 variables for 2004-2006Panel A: Variable DescriptionsVariables Description Variables Description NI/SALE Net Income/Net Sales CH/LCT Net Sales/Current LiabilitiesNI/AT Net Income/TotalAssetsWCAP/SALE Cash/CurrentLiabilitiesNI/(AT-LT) Net Income/Net Worth RE/AT Working Capital/Net SalesEBIT/AT Earnings Before IncomeTax/Total AssetsCH/AT Retained Earnings/Total AssetsSALE/AT Net Sales/Total Assets LT/(AT-LT) Cash/Total AssetsACT/AT Current Assets/TotalAssetsLOGMARKETCAP Total Liabilities/Net Worth(ACT-INVT)/AT Quick Assets/TotalAssetsLOGAGELOG of share price × numberof sharesoutstanding(CSHO*PRCCF)SALE/(ACT-INVT)Net Sales/Quick Assets AGGREGATELOSS LOG Age of companyACT/SALE Current Assets/NetSalesRZSCOREIndicator variable [1,0] if∑earning before extraordinaryitems in years t and t-1<0,otherwise(IB)INVT/SALE Inventory/NetSales LOGSEGGEOMUM LOG ∑ number of operating and geographic segments (Bus Segment-Actual number+Geo Seg Areas-Actual number)COGS/INVT Cost of GoodsSold/InventoryFOREIGNIndicator variable [1,0] if non-zero foreign currencytranslation , otherwise(FCA)LT/AT Total Liabilities/TotalAssetsACQUISITIONVALUE∑50% Ownership of theacquired firm(AQA/MKVAL)(ACT-INVT)/SALE Quick Assets/SalesEXTREMESALESGROWTHIndicator variable [1,0] if year-over-year industry-adjusted NetSales growth=top quintile,otherwiseRE/INVT RetainedEarnings/InventoryRESTRUCTURINGCHARGE∑ restructuring charges(RCA/MKVAL)ACT/LCT Current Assets/CurrentLiabilitiesLOGSEGMUMLOG Number of businesssegments(ACT-INVT)/LCT Quick Assets/CurrentLiabilitiesINVENTORY Inventory/TotalassetsLCT/AT Current Liabilities/Total AssetsSALE/LCT Sales/Current Liabilities M&AIndicator variable [1,0] ifcompany was involved in M&Aover 3 years, otherwiseContemporary Management Research 165Panel B: Descriptive statistics of the variables variables N Mean Std. 
Deviation Variance Minimum Maximum ICMW * 869 0.520138 0.499882 0.249882 0 1NI/SALE 869 -0.04617 0.585724 0.343073 -11.4922 2.506701 NI/AT 869 -0.00369 0.258762 0.066958 -2.81327 4.832774 NI/(AT-LT) 869 1.102761 33.98825 1155.201 -14.0419 1001.306 EBIT/AT 869 0.032835 0.158506 0.025124 -2.40982 0.480082 SALE/AT 869 1.098502 0.852066 0.726016 0.026889 8.613351 ACT/AT 869 0.507784 0.235754 0.05558 0.024723 0.991249 (ACT-INVT)/AT 869 0.381526 0.212887 0.045321 0.023738 0.956625 SALE/(ACT-INVT) 869 4.355389 5.787747 33.49802 0.035184 66.39423ACT/SALE 869 0.750516 1.444692 2.087135 0.033669 28.66403 INVT/SALE 869 0.12852 0.120808 0.014595 0.000373 1.631579 COGS/INVT 869 19.77532 73.00375 5329.548 0.107527 1874.644LT/AT 869 0.528867 0.418582 0.17521 0.028224 6.811726 (ACT-INVT)/SALE 869 0.621996 1.410005 1.988113 0.015062 28.42211RE/INVT 869 -17.4043 144.4622 20869.32 -2476.96 568.4035 ACT/LCT 869 2.813127 2.783589 7.748366 0.097934 35.69801 (ACT-INVT)/LCT 869 2.227998 2.66166 7.084434 0.079661 35.41416LCT/AT 869 0.250558 0.169365 0.028684 0.024158 1.388766 SALE/LCT 869 4.914681 2.970393 8.823232 0.250556 33.59769 CH/LCT 869 0.822242 1.541899 2.377452 0 20.04329 WCAP/SALE 869 0.452549 1.332053 1.774364 -1.74452 26.57621RE/AT 869 -0.43403 2.365501 5.595594 -42.8029 1.041362 CH/AT 869 0.127273 0.131966 0.017415 0 0.879367 LT/(AT-LT) 869 -4.58159 181.5619 32964.74 -5340.58 226.1027 LOGMARKETCAP 869 2.663385 0.719972 0.51836 -0.28377 4.889938LOGAGE 869 1.117768 0.18178 0.033044 0.30103 1.30103AGGREGATELOSS 869 0.219793 0.414345 0.171682 0 1RZSCORE 869 4.52359 2.906921 8.450192 0 9 LOGSEGGEOMUM 869 0.738405 0.174211 0.030349 0 1.176091FOREIGN 869 0.218642 0.413563 0.171035 0 1 ACQUISITIONVALUE 869 -0.00023 0.002234 4.99E-06 -0.03232 0.037896EXTREMESALESGROWTH 869 0.243959 0.429715 0.184655 0 1 RESTRUCTURINGCHARGE 869 -0.19341 5.104154 26.05239 -149.554 3.574735LOGSEGMUM 869 0.286145 0.290228 0.084233 0 1INVENTORY 869 0.126258 0.120126 0.01443 0.000327 0.754949 M&A 869 0.116226 0.32068 0.102836 0 1 * Note that ICMW is the dependent variable of the study while all others are independentvariables.Contemporary Management Research 166Application of Data MiningThe decision tree model in data mining was available from SQL Server 2005Analysis Services. Since each configuration could provide different results, the same configuration was used for all three groups of variables: the default configuration inthe Analysis Services. Data mining would be particularly useful to get new insightsinto the variables used in the studies by Doyle et al. and Ogneva et al. because they hypothesized the relationships between their dependent variables and independent variables.For the purpose of training the model, one-year lagged data were used. For example, the 2004 data were used for training and then the 2005 data were used for prediction, and the 2005 data for training and the 2006 data for prediction. In that way,the decision rules were actually predicting MW of the sample companies withunknown data of the next year. This process provided with four sets of data that couldbe used for both training and testing, which will be explained in the following section.ANALYSES AND FINDINGSIt appears that in some cases, the financial variables from two studies by Doyle etal. (2007) and Ogneva et al. (2007) on Sarbanes-Oxley Act of 2002 internal controlMW add value to the predictability of the financial variables which have been used inthe study by Cheh et al. 
(2006), although this finding may become somewhat less clear in special cases. Results also provide preliminary evidence that all of the variables collectively can contribute to the construction of a good predictive model. Detailed discussions of the results follow in the subsequent paragraphs.

Table 2  Training 2004 data and testing 2005 data using all 35 variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            21            4             25
1            149           174           323
Total        170           178           348

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            12.35% (Prediction Accuracy)     2.25% (Type II Error)
1            87.65% (Type I Error)            97.75% (Prediction Accuracy)
Total        100%                             100%

Figure 1  Graphic results for Table 2

Using All 35 Variables
As the first test, we used the 2004 fiscal year data for training to predict MW companies in the 2005 fiscal year, using all 35 variables. As is shown in Table 2, when there is no MW, the prediction accuracy is only 12.35%, while the error of misclassifying non-MW as MW is 87.65%. However, when there is actually MW, the prediction accuracy is quite impressive: 97.75%, with the error of misclassifying MW as non-MW being only 2.25%. It should be noted that in this data set, and also in subsequent data sets, we matched both sets of non-MW and MW companies on standard industry code (SIC) and on sales within plus or minus 25%.

The lift chart shown in Figure 1 displays a lift, which is an indication of any improvement over a random guess, and shows a graphical representation of the change in the lift that a mining model causes (Tang and MacLennan 2005). For example, the top line shows that an ideal model would capture 100% of the target using a little over 50% of the data. The 45-degree random line across the chart indicates that if a data miner were to randomly guess the result for each case, the person would capture 50% of the target using 50% of the data (Tang and MacLennan 2005). The data mining model is depicted by the line between the ideal line and the random line. According to the graph, the decision tree model is doing only a little better than a random model. This also means that there isn't sufficient information in the training data to learn the patterns about the target.

To predict MW companies in the 2006 fiscal year based on the 2005 data, our data mining model trained the 2005 fiscal year data for all 35 variables of the sample companies. As we did with the first test, we matched both sets of non-MW and MW companies on SIC and on sales within plus or minus 25%. The final results are presented in Table 3 and Figure 2, and they are similar to those of the previous test. When there is actually MW, the prediction accuracy is as high as 98.47%, and the error rate of misclassifying MW as non-MW is 1.53%. When there is no MW, the prediction accuracy is 11.02%, and the error rate of misclassifying non-MW as MW is 88.98%.

Table 3  Training 2005 data and testing 2006 data using all 35 variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            13            2             15
1            105           129           234
Total        118           131           249

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            11.02% (Prediction Accuracy)     1.53% (Type II Error)
1            88.98% (Type I Error)            98.47% (Prediction Accuracy)
Total        100%                             100%

It is interesting to see that, even though there is a slight increase in the prediction rate of MW companies when there is MW, from 97.75% to 98.47%, the prediction rate of non-MW companies when there is no MW has somewhat deteriorated from 12.35% to 11.02%.
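The one-year-lagged train-and-predict protocol used in these experiments can be sketched with any generic decision-tree learner. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's DecisionTreeClassifier in place of the SQL Server 2005 Analysis Services decision tree actually used in the study, and the file name, column names and label name ICMW are hypothetical placeholders.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical layout: one row per company-year with the 35 predictors and
# the ICMW label (1 = material weakness disclosed, 0 = none).
panel = pd.read_csv("icmw_panel.csv")          # placeholder file name
predictors = [c for c in panel.columns if c not in ("company_id", "year", "ICMW")]

def lagged_train_test(panel, train_year, test_year):
    """Train on year t and predict year t+1, mirroring the lagged protocol."""
    train = panel[panel["year"] == train_year]
    test = panel[panel["year"] == test_year]
    tree = DecisionTreeClassifier(random_state=0)   # stand-in for the SSAS decision tree
    tree.fit(train[predictors], train["ICMW"])
    pred = tree.predict(test[predictors])
    # Rows: actual 0/1; columns: predicted 0/1 (the transpose of the Panel A layout above).
    return confusion_matrix(test["ICMW"], pred, labels=[0, 1])

print(lagged_train_test(panel, 2004, 2005))
print(lagged_train_test(panel, 2005, 2006))
```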
It appears that there isn't sufficiently more information in the 2006 data than in the 2005 data to make a difference in training the data and learning patterns about the target. As expected, the lift chart in Figure 2 is similar to the previous lift chart, with only a slightly better result in predicting MW companies.

Figure 2  Graphic results for Table 3

Using the Doyle et al. Variables
We repeated the process of training the data and predicting MW with the nine variables used by Doyle et al. As mentioned before, Doyle et al. (2007) used two additional variables that we did not use in our study: the number of special purpose entities (SPEs) associated with a firm under study and its governance score. Doyle et al. do not find a significant relation between MW disclosures and corporate governance. Although the prediction rate of non-MW companies increased slightly, the MW prediction rate decreased to below 95% when training on 2004 data to predict 2005 MW companies, as shown in Table 4. The significant decrease in the predictive power of our data mining model was probably caused by, among other things, the use of a smaller number of independent variables, from 35 down to nine.

Figure 3 provides two types of information about the model used in this study. The first type of information, presented in Panel A, is what the trees reveal. The decision rules in rectangular boxes, or nodes, indicate which rule will more likely generate a desired outcome. For example, if a company's log of market capitalization is greater than or equal to 1.775, but less than 3.833, then it is highly likely that the company would have material weakness. The darker the node is, the more cases it contains.

Table 4  Training 2004 data and testing 2005 data using Doyle et al.'s variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            27            10            37
1            158           168           326
Total        185           178           363

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            14.59% (Prediction Accuracy)     5.62% (Type II Error)
1            85.41% (Type I Error)            94.38% (Prediction Accuracy)
Total        100%                             100%

In Panel B of Figure 3, the top line of the lift chart shows that almost 50% of the data indicates the desired target. According to Tang and MacLennan (2005), the bottom line is the random line, which is always a 45-degree line across the chart. What this means is that if a business analyst were to randomly guess the result for each case, the analyst would capture 50% of the target using 50% of the data.

An interesting part is the line between these two lines, the ideal line and the random line. We may call this middle line a "model line", which represents the data mining model that we ran. In this particular chart, the model line merged with the ideal line up to almost 50% of the data. Then it dipped a little and paralleled the ideal line for the remaining data until a little after 90% of the data, and eventually it merged with the ideal line again. Thus, it indicates that this data mining model worked rather well, almost as if it were an ideal model. Hence, from the point of view of data efficiency, we can conclude that the variables used by Doyle et al. (2007) are quite efficient, because the decision tree model based on the nine variables is close to the ideal model.

Figure 3  Graphic results for Table 4 (Panel A: decision tree; Panel B: lift chart)

Similar results are shown in Table 5 and Figure 4 for using 2005 data to predict 2006 MW companies. The MW prediction rate increased slightly in predicting 2006 MW companies, to 96.21%, but it is still below the 98.47% demonstrated using all 35 variables in Panel B of Table 3.
Thus, using all 35 variables is a powerful reminder that more independent variables increase the data mining model's predictive power. The lift chart for the 2005 training data in predicting 2006 MW companies using the variables of Doyle et al. shows the somewhat surprising finding that data efficiency is even across all parts of the distribution of the population. Apparently, the decision model does better than the random model only at the extremes.

Cost of Misclassification Errors
Using the following cost of misclassification model, originally developed by Masters (1993), the cost of misclassification errors can be incorporated into the process of data analysis:

Cost of misclassification errors (CME) = (1 − q) p1 c1 + q p2 c2,

where q is the prior probability that the information risk associated with a company's internal control weakness was high, p1 and p2 are the probabilities of Type I (false positive) and Type II (false negative) errors, and c1 and c2 are the costs of Type I and Type II errors.

Table 5  Training 2005 data and testing 2006 data using Doyle et al.'s variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            17            5             22
1            113           127           240
Total        130           132           262

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            13.08% (Prediction Accuracy)     3.79% (Type II Error)
1            86.92% (Type I Error)            96.21% (Prediction Accuracy)
Total        100%                             100%

The actual cost of misclassification errors can be significant to audit firms. For example, by classifying a good client without MW as a bad one with MW, an audit firm makes a Type I error and, as a result, could potentially lose a client. To avoid the Type I error, the audit firm may give too much weight to unwarranted MW risk in its audit engagement and planning processes. On the other hand, by mistakenly classifying a bad client with MW as a good one without MW, an audit firm makes a Type II error and, as a result, could miss MW in the client firm. In an effort to avoid the Type II error, the audit firm may give too little weight to MW risk in its audit engagement and planning processes.

Accordingly, Cheh et al. (1999) applied the CME model to determine the cost of misclassification in predicting takeover targets. Later, Calderon (1999) and Calderon and Cheh (2002) also used this idea in auditing and risk assessment applications. The prior probability, q, can be estimated by using training data, as Cheh et al. (1999) did in their takeover target study. For example, using the data shown in Panel A of Table 2, we can compute that q = 178/(170 + 178) = 0.5115. Then, using the information in Panel B of Table 2, the cost of misclassification errors can be computed as follows:

CME = 0.4885 × 0.8765 × c1 + 0.5115 × 0.0225 × c2.

Since this predictive model can provide p1 and p2, as shown above, each audit firm will be able to customize the cost model by using values of c1 and c2 based on the firm's professional experience and past cost data. With this cost model based on the data mining prediction model, each audit firm can prepare an expected net benefit report before it engages a potential audit client. With assistance from computer professionals, the audit firm can create another useful audit risk engagement tool based on this type of cost analysis.

CONCLUSIONS AND FUTURE RESEARCH
The purpose of this paper was to find the financial and nonfinancial variables which would be useful in predicting Sarbanes-Oxley Act internal control material weaknesses.
By examining a number of independent variables used in three published studies, we were able to construct a highly predictive data mining model with an MW prediction rate of 98.47%. It appears that each set of variables from the three groups of nine researchers is complementary to the others and strengthens the prediction accuracy. Apparently, each set of variables brings a different type of strength to the prediction of material weaknesses.
I.J. Intelligent Systems and Applications, 2016, 1, 49-59Published Online January 2016 in MECS (/)DOI: 10.5815/ijisa.2016.01.06GA_MLP NN: A Hybrid Intelligent System for Diabetes Disease DiagnosisDilip Kumar ChoubeyBirla Institute of Technology, Computer Science & Engineering, Mesra, Ranchi, IndiaEmail: dilipchoubey_1988@yahoo.inSanchita PaulBirla Institute of Technology, Computer Science & Engineering, Mesra, Ranchi, IndiaEmail: Sanchita07@Abstract—Diabetes is a condition in which the amount of sugar in the blood is higher than normal. Classification systems have been widely used in medical domain to explore patient’s data and extract a predictive model or set of rules. The prime objective of this research work is to facilitate a better diagnosis (classification) of diabetes disease. There are already several methodology which have been implemented on classification for the diabetes disease. The proposed methodology implemented work in 2 stages: (a) In the first stage Genetic Algorithm (GA) has been used as a feature selection on Pima Indian Diabetes Dataset. (b) In the second stage, Multilayer Perceptron Neural Network (MLP NN) has been used for the classification on the selected feature. GA is noted to reduce not only the cost and computation time of the diagnostic process, but the proposed approach also improved the accuracy of classification. The experimental results obtained classification accuracy (79.1304%) and ROC (0.842) show that GA and MLP NN can be successfully used for the diagnosing of diabetes disease.Index Terms—Pima Indian Diabetes Dataset, GA, MLP NN, Diabetes Disease Diagnosis, Feature Selection, Classification.I.I NTRODUCTIONDiabetes is a chronic disease and a major public health challenge worldwide. Diabetes happens when a body is not able to produce or respond properly to insulin, which is needed to maintain the rate of glucose. Diabetes can be controlled with the help of insulin injections, a controlled diet (changing eating habits) and exercise programs, but no whole cure is available. Diabetes leads to many other disease such as blindness, blood pressure, heart disease, kidney disease and nerve damage [15]. Main 3 diabetes signs are:-Increased need to urinate (Polyuria), Increased hunger (Polyphagia), Increased thirst (Polydipsia). There are two main types of diabetes: Type 1 (Juvenile or Insulin Dependent or Brittle or Sugar) Diabetes and Type 2 (Adult onset or Non Insulin Dependent) Diabetes. Type 1 Diabetes mostly happens to children and young adults but can affect at any age. For this type of diabetes, beta cells are destructed and people suffering from the condition require insulin injection regularly to survive. Type 2 Diabetes is the most common type of diabetes, in which people are suffering at least 90% of all the diabetes cases. This type mostly happens to the people more than forty years old but can also be found in younger classes. In this type, body becomes resistant to insulin and does not effectively use the insulin being produced. It can be controlled with lifestyle modification (a healthy diet plan, doing exercise regularly), oral medications (taking tablets). In some extreme cases, insulin injections may also be required but no whole cure for diabetes is available.In this paper, GA has been used as a Feature selection in which among 8 attributes, 4 attributes have been selected. The main purpose of Feature selection is to reduce the number of features used in classification while maintaining acceptable classification accuracy and ROC. 
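One common way to realize this GA-plus-classifier pipeline is as a wrapper search over bit masks of the eight Pima attributes, with a mask's fitness taken as the cross-validated accuracy of the downstream classifier. The sketch below is only an illustration of that idea under stated assumptions: the paper does not spell out its fitness function, it runs the Weka MLP from Java rather than scikit-learn, and the helper names and commented file layout are hypothetical. The GA parameter values, however, follow those reported later in the paper (population 20, 20 generations, crossover probability 0.6, mutation probability 0.033).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

def fitness(mask, X, y):
    """Fitness of a feature subset = 5-fold CV accuracy of an MLP trained on it
    (an assumed fitness function; the paper does not state its own)."""
    if not mask.any():
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

def ga_select(X, y, pop_size=20, generations=20, p_cross=0.6, p_mut=0.033):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)
    for _ in range(generations):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        # Roulette-wheel selection of parents, proportional to fitness.
        probs = fit / fit.sum() if fit.sum() > 0 else np.full(pop_size, 1 / pop_size)
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):           # single-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, n_feat)
                children[i, cut:], children[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        mut = rng.random(children.shape) < p_mut      # bit-flip mutation
        children ^= mut
        pop = children
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[fit.argmax()]                          # best feature mask found

# Usage with the Pima data (hypothetical file layout: 8 feature columns + class label).
# data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# best_mask = ga_select(data[:, :8], data[:, 8].astype(int))
```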
Limiting the number of features (dimensionality) is important in statistical learning. With the help of Feature selection process we can save Storage capacity, Computation time (shorter training time and test time), Computation cost and increases Classification rate, Comprehensibility. MLP NN are supervised learning method for classification. Here, MLP NN have been used for the classification of the Diabetes disease diagnosis. The rest of the paper is organized as follows: Brief description of GA and MLP NN are in section II, Related work is presented in section III, Proposed methodology is discussed in section IV, Results and Discussion are devoted to section V, Conclusion and Future Direction are discussed in section VI.II.B RIEF D ESCRIPTION OF GA AND MLP NNA. GAJohn Holland introduced genetic Algorithm GA in the 1970 at University of Michigan (US). GA is an adaptive population based optimization technique, which is inspired by Darwin’s theory [10] about survival of the fittest. GA mimics the natural evolution process given by the Darwin i.e., in GA the next population isevolved through simulating operators of selection, crossover and mutation. John Holland is known as the father of the original genetic algorithm who first introduced these operators in [16]. Goldberg [13] and Michalewicz [18] later improved these operators. The advantages in GA [17] are that Concepts easy to understand, solves problems with multiple solutions, global search methods, blind search methods, Gas can be easily used in parallel machines, etc. and the limitation are Certain optimization problems, no absolute assurance for a global optimum, cannot assure constant optimization response times, cannot find the exact solution, etc. GA can be applied in Artificial creativity, Bioinformatics, chemical kinetics, Gene expression profiling, control engineering, software engineering, Traveling salesman problem, Mutation testing, Quality control etc. The genetic algorithm uses three main types of rules at each step to create the next generation from the current population:1. SelectionIt is also called reproduction phase whose primary objective is to promote good solutions and eliminate bad solutions in the current population, while keeping the population size constant. This is done by identifying good solutions (in terms of fitness) in the current population and making duplicate copies of these. Now in order to maintain the population size constant, eliminate some bad solutions from the populations so that multiple copies of good solutions can be placed in the population. In other words, those parents from the current population are selected in selection phase who together will generate the next population. The various methods like Roulette –wheel selection, Boltzmann selection, Tournament selection, Rank selection, Steady –state selection, etc., are available for selection but the most commonly used selection method is Roulette wheel. Fitness value of individuals play an important role in these all selection procedures.2. CrossoverIt is to be notice that the selection operator makes only multiple copies of better solutions than the others but it does not generate any new solution. So in crossover phase, the new solutions are generated. First two solutions from the new population are selected either randomly or by applying any stochastic rule and bring them in the mating pool in order to create two off-springs. 
It is not necessary that the newly generated off-springs is more, because the off-springs have been created from those individuals which have been survived during the selection phase. So the parents have good bit strings combinations which will be carried over to off-springs. Even if the newly generated off-springs are not better in terms of fitness then we should not be bother about because they will be eliminated in next selection phase. In the crossover phase new off-springs are made from those parents which were selected in the selection phase. There are various crossover methods available like single-point crossover, two-point crossover, Multi-point crossover (N-Point crossover), uniform crossover, Matrix crossover (Two-dimensional crossover), etc.3. MutationMutation to an individual takes part with a very low probability. If any bit of an individual is selected to be muted then it is flipped with a possible alternative value for that bit. For example, the possible alternative value for 0 is 1 and 0 for 1 in binary string representation case i.e. , 0 is flipped with 1 and 1 is flipped with 0. The mutation phase is applied next to crossover to keep diversity in the population. Again it is not always to get better off-springs after mutation but it is done to search few solutions in the neighborhood of original solutions.B. MLP NNOne of the most important models in ANN or NN is MLP. The advantages in NN [17] are that Mapping capabilities or pattern association, generalization, robustness, fault tolerance and parallel and high speed information processing, good at recognizing patterns and the limitations are training to operate, require high processing time for large neural network, not good at explaining how they reach their decisions, etc. NN can be applied in Pattern recognition, image processing, optimization, constraint satisfaction, forecasting, risk assessment, control systems. MLP NN is feed forward trained with the Back-Propagation algorithm. It is supervised neural networks so they require a desired response to be trained. It learns how to transform input data in to a desired response to be trained, so they are widely used for pattern classification. The structure of MLP NN is shown in fig. 1. The type of architecture used to implement the system is MLP NN. The MLP NN consists of one input layer, one output layer, one or more hidden layers. Each layer consists of one or more nodes or neurons, represented by small circles. The lines between nodes indicate flow of information from one node to another node. The input layer is that which receives the input and this layer has no function except buffering the input signal [27], the output of input layer is given to hidden layer through weighted connection links. Any layer that is formed between the input and output layers is called hidden layer. This hidden layer is internal to the network and has no direct contact with the external environment. It should be noted that there may be zero to several hidden layers in an ANN, more the number of the hidden layers, more is the complexity of the network. This may, however, provide an efficient output response. This layer performs computations and transmits the results to output layer through weighted links, the output of the hidden layer is forwarded to output layer.The output layer generates or produces the output of the network or classification the results or this layer performs computations and produce final result.Fig.1. 
Feed Forward Neural Network Model for Diabetes DiseaseDiagnosisIII.R ELATED W ORKKemal Polat et al. [2] stated Principal Component Analysis (PCA) and Adaptive Neuro–Fuzzy Inference System (ANFIS) to improve the diagnostic accuracy of Diabetes disease in which PCA is used to reduce the dimensions of Diabetes disease dataset features and ANFIS is used for diagnosis of Diabetes disease means apply classification on that reduced features of Diabetes disease datasets. Manjeevan Seera et al. [3] introduced a new way of classification of medical data using hybrid intelligent system. The methodology implemented here is based on the hybrid combinatorial method of Fuzzy max-min based neural network and classification of data using Random forest Regression Tree. The methodology is implemented on various datasets including Breast Cancer and Pima Indian Diabetes Dataset and performs better as compared to other existing techniques. Esin Dogantekin et al. [1] used Linear Discriminant Analysis (LDA) and ANFIS for diagnosis of diabetes. LDA is used to separate feature variables between healthy and patient (diabetes) data, and ANFIS is used for classification on the result produced by LDA. The techniques used provide good accuracy then the previous existing results. So, the physicians can perform very accurate decisions by using such an efficient tool. H. Hasan Orkcu et al. [4] compares the performance of various back propagation and genetic algorithms for the classification of data. Since Back propagation is used for the efficient training of data in artificial neural network but contains some error rate, hence genetic algorithm is implemented for the binary and real-coded so that the training is efficient and more number of features can be classified. Muhammad Waqar Aslam et al. [7] introduced Genetic Programming–K-Nearest Neighbour (GP-KNN), Genetic Programming-Support Vector Machines (GP-SVM), in which KNN and SVM tested the new features generated by GP for performance evaluation. According to Pasi Luukka [5] Fuzzy Entropy Measures is used as a feature selection by which the computation cost, computation time can be reduced and also reduce noise and this way enhance the classification accuracy. Now from the previous statement it is clear that feature selection based on fuzzy entropy measures and it is tested together with similarity classifier. Kemal Polat et al. [12] proposed uses a new approach of a hybrid combination of Generalized Discriminant Analysis (GDA) and Least Square Support Vector Machine (LS–SVM) for the classification of diabetes disease. Here the methodology is implemented in two stages: in the first stage pre-processing of the data is done using the GDA such that the discrimination between healthy and patient disease can be done. In the second stage LS-SVM technique is applied for the classification of Diabetes disease patient’s. The methodology implemented here provides accuracy about 78.21% on the basis of 10 fold-cross validation from LS-SVM and the obtained accuracy for classification is about 82.05%. K Selvakuberan et al. [9] used Ranker search method, K star, REP tree, Naive bayes, Logisitic, Dagging, Multiclass in which Ranker search approach is used for feature selection and K star, REP tree, Naive bayes, Logisitic, Dagging, Multiclass are used for classification. The techniques implemented here provide a reduced feature set with higher classification accuracy. Acording to Adem Karahoca et al. 
[29] ANFIS, and Multinomial Logistic Regression (MLR) has been used for the comparison of peformance in terms of standard errors for diabetes disease diagnosis. ANFIS is used as an estimation method which has fuzzy input and output parameters, whereas MLR is used as a non-linear regression method and has fuzzy input and output parameters. Laerico Brito Goncalves et al. [13] implemented a new Neuro-fuzzy model for the classification of diabetes disease patients. Here in this paper an inverted Hierarchical Neuro-fuzzy based system is implemented which is based on binary space partitioning model and it provided embodies for the continue recursive of the input space and automatically generates own structure for the classification of inputs provided. The technique implemented finally generates a series of rules extraction on the basis of which classification can be done. T. Jayalakshmi et al. [28] proposed a new and efficient technique for the classification of diagnosis of diabetes disease using Artificial neural network (ANN). The methodology implemented here is based on the concept of ANN which requires a complete set of data for the accurate classification of Diabetes. The paper also implements an efficient technique for the improvement of classification accuracy of missing values in the dataset. It also provides a preprocessing stage during classification. Nahla H. Barakat et al. [30] worked on the classification of diabetes disease using a machine learning approach such as Support Vector Machine (SVM). The paper implements a new and efficient technique for the classification of medical diabetes mellitus using SVM. A sequential covering approach for the generation of rules extraction is implemented using the concept ofSVM,which is an efficient supervised learning algorithm. The paper also discusses Eclectic rule extraction technique for the extraction of rules set attributes from the dataset such that the selected attributes can be used for classification of medical diagnosis mellitus. Saloni et al. [31] have used various classifiers i.e ANN, linear, quadratic and SVM for the classification of parikson disease in which SVM provides the best accuracy of 96%. For the increases of classification accuracy, they have also used feature selection by which among 23 features, 15 features are selected. E.P. Ephzibah [23] used GA and Fuzzy Logic (FL) for diabetes diagnosis in which GA has been used as a feature selection method and FL is used for classification. The used methods improve the accuracy and reduced the cost.IV.P ROPOSED M ETHODOLOGYHere, The Proposed approach is implemented and evaluated by GA as a Feature Selection and MLP NN for Classification on Pima Indians Diabetes Data set from UCI repository of machine learning databases.The Proposed system of Block Diagram and the next proposed algorithm is shown below:Fig.2. Block Diagram of Proposed System Proposed AlgorithmStep1: StartStep2: Load Pima Indian Diabetes DatasetStep3: Initialize the parameters for the GAStep4: Call the GAStep5.1: Construction of the first generationStep5.2: SelectionWhile stopping criteria not met doStep5.3: CrossoverStep5.4: MutationStep5.5: SelectionEndStep6: Apply MLP NN ClassificationStep7: Training DatasetStep8: Calculation of error and accuracyStep9: Testing DatasetStep10: Calculation of error and accuracyStep11: StopThe proposed approach works in the following phases:A. Take Pima Indians Diabetes Data set from UCI repository of machine learning databases.B. 
Apply GA as a Feature Selection on Pima Indians Diabetes Data set.C. Do the Classification by using MLP NN on selected features in Pima Indians Diabetes Dataset.A. Used Diabetes Disease DatasetThe Pima Indian Diabetes Database was obtained from the UCI Repository of Machine Learning Databases [14]. The same dataset used in the reference [1-8] [11] [15] [19-26] [28].B. GA for Feature SelectionThe GA is a repetitive process of selection, crossover and mutation with the population of individuals in each iteration called a generation. Each chromosome or individual is encoded in a linear string (generally of 0s and 1s) of fix length in genetic analogy. In search space, First of all, the individual members of the population are randomly initialized. After initialization each population member is evaluated with respect to objective function being solved and is assigned a number (value of the objective function) which represents the fitness for survival for corresponding individual. The GA maintains a population of fix number of individuals with corresponding fitness value. In each generation, the more fit individuals (selected from the current population) go in mating pool for crossover to generate new off-springs, and consequently individuals with high fitness are provided more chance to generate off-springs. Now, each new offspring is modified with a very low mutation probability to maintain the diversity in the population. Now, the parents and off-springs together forms the new generation based on the fitness which will treated as parents for next generation. In this way, the new generation and hence successive generations of individual solutions is expected to be better in terms of average fitness. The algorithms stops forming new generations when either a maximum number of generations has been formed or a satisfactory fitness value is achieved for the problem.The standard pseudo code of genetic algorithm is given in Algorithm 11. Algorithm GABeginq = 0Randomly initialize individual members of population P(q)Evaluate fitness of each individual of population P(q) while termination condition is not satisfied doq = q+1selection (of better fit solutions)crossover (mating between parents to generate off-springs)mutation (random change in off-springs)end whileReturn best individual in population;In Algorithm 1, q represents the generation counter, initialization is done randomly in search space and corresponding fitness is evaluated based on objective function. After that GA algorithm requires a cycle of three phases: selection, crossover and mutation.In medical world, If we have to be diagnosed any disease then there are some tests to be performed. After getting the results of the tests performed, the diagnosed could be done better. We can consider each and every test as a feature. If we have to do a particular test then there are certain set of chemicals, equipments, may be people, more time required which can be more expensive. Basically, Feature Selection informs whether a particular test is necessary for the diagnosis or not. So, if a particular test is not required that can be avoided. When the number of tests gets reduced the cost that is required also gets reduced which helps the common people. So, Here that is why we have applied GA as a feature selection by which we reduced 4 features among 8 features. So from the above it is clear that GA is reducing the cost, storage capacity, and computation time by selected some of the feature.C. 
C. MLP NN for Classification
The MLP NN is a supervised, feed-forward learning method for classification. Since it is a supervised learner, it requires a desired response in order to be trained. The working of the MLP NN is summarized in the steps below:
1. Input data are provided to the input layer for processing, which produces a predicted output.
2. The predicted output is subtracted from the actual output and an error value is calculated.
3. The network then uses a Back-Propagation (BP) algorithm which adjusts the weights.
4. Weight adjustment starts from the weights between the output-layer nodes and the last hidden-layer nodes and works backwards through the network.
5. When BP is finished, the forward pass starts again.
6. The process is repeated until the error between predicted and actual output is minimized.

3.1. BP Algorithm for Adjusting the Network Weights
The most widely used training algorithm for multilayer, feed-forward networks is Back-Propagation. The name BP is given because the difference between actual and predicted values is propagated from the output nodes backwards to the nodes in the previous layer; this is done to improve the weights during processing. The working of the BP algorithm is summarized in the following steps:
1. Provide training data to the network.
2. Compare the actual and desired output.
3. Calculate the error in each node or neuron.
4. Calculate what the output of each node or neuron should be, and how much lower or higher its output must be adjusted to reach the desired output.
5. Then adjust the weights.
The block diagram of the working of the MLP NN using BP is given below:
Fig.3. Block Diagram of MLP NN using BP

V. RESULTS AND DISCUSSION OF PROPOSED METHODOLOGY

The work was implemented on an i3 processor with 2.30 GHz speed, 2 GB RAM and 320 GB external storage; the software used was JDK 1.6 (Java Development Kit) with the NetBeans 8.0 IDE, and the coding was done in Java. The Weka library was used for the computation of the MLP NN and the various parameters.
In the experimental studies we used a 70-30% partition for training and testing of the GA_MLP NN system for diabetes disease diagnosis. The experiments were performed on the Pima Indians Diabetes Dataset described in Section IV.A. We compared the results of our proposed system, GA_MLP NN, with the results previously reported by earlier methods [3].
The parameters of the Genetic Algorithm for our task are:
Population Size: 20
Number of Generations: 20
Probability of Crossover: 0.6
Probability of Mutation: 0.033
Report Frequency: 20
Random Number Seed: 1
As Table 2 shows, by applying the GA approach we obtained 4 features out of the 8. This means the cost is reduced to s(x) = 4/8 = 0.5 from 1, which corresponds to an improvement in training and classification by a factor of 2.
Diagnostic performance is usually evaluated in terms of Classification Accuracy, Precision, Recall, Fallout, F-Measure, ROC and the Confusion Matrix.
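Before these terms are defined, the following minimal sketch illustrates the classification step of Section C. It is a stand-in, not the paper's implementation: the paper used the Weka MLP from Java, whereas the sketch uses scikit-learn's MLPClassifier, assumes the data live in pima.csv, applies the 70-30% split used in the experiments, and the selected feature indices are placeholders for whatever the GA returns.

# Stand-in for the MLP/BP classification step (Section C); not the paper's Weka code.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = np.loadtxt("pima.csv", delimiter=",")          # assumed file layout
X, y = data[:, :-1], data[:, -1]
selected = [1, 5, 6, 7]                               # placeholder GA-selected subset

X_tr, X_te, y_tr, y_te = train_test_split(
    X[:, selected], y, test_size=0.30, random_state=1, stratify=y)

# MLPClassifier fits its weights with back-propagation, working backwards from the
# output layer exactly as summarised in the numbered steps above.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                                  random_state=1))
mlp.fit(X_tr, y_tr)

print("training accuracy:", accuracy_score(y_tr, mlp.predict(X_tr)))
print("test accuracy    :", accuracy_score(y_te, mlp.predict(X_te)))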
The evaluation terms listed above are briefly explained below.
Classification Accuracy: Classification accuracy is the probability of correctly classifying records in the test dataset, i.e. the ratio of the total number of correctly diagnosed cases to the total number of cases:
Classification accuracy (%) = (TP + TN) / (TP + FP + TN + FN)    (1)
where
TP (True Positive): sick people correctly detected as sick.
FP (False Positive): healthy people incorrectly detected as diabetic.
TN (True Negative): healthy people correctly detected as healthy.
FN (False Negative): sick people incorrectly detected as healthy.
Precision: Precision measures the rate of correctly classified samples among those predicted as diabetic, i.e. the ratio of the number of correctly classified instances to the total number of instances fetched:
Precision = No. of Correctly Classified Instances / Total No. of Instances Fetched    (2)
or Precision = TP / (TP + FP)    (3)
Recall: Recall measures the rate of correctly classified samples among those that are actually diabetic, i.e. the ratio of the number of correctly classified instances to the total number of instances in the dataset:
Recall = No. of Correctly Classified Instances / Total No. of Instances in the Dataset    (4)
or Recall = TP / (TP + FN)    (5)
Usually, when precision increases recall decreases; in other words, precision and recall stand in opposition to one another.
Fallout: Fallout is used to assess the negative cases of the dataset during classification; it corresponds to the false-positive rate.
F-Measure: The F-Measure (F-Score) combines the information-retrieval precision and recall metrics into a single average. It is calculated from precision and recall as:
F-Measure = 2 * Precision * Recall / (Precision + Recall)    (6)
Area Under the Curve (AUC): AUC is a metric used to measure the performance of a classifier. It is calculated from the area under the ROC curve on the basis of true positives and false positives, here as:
AUC = (1/2) * (TP / (TP + FN) + TN / (TN + FP))    (7)
ROC analysis is an effective method of evaluating the performance of diagnostic tests.
Confusion Matrix: A confusion matrix [12][2] contains information about the actual and predicted classifications made by a classification system.
The following terms, used in the results part, are also briefly explained.
Kappa Statistic: The kappa statistic measures the agreement of the classifier with the true classification:
K = (P0 - Pc) / (1 - Pc)    (8)
where P0 is the total agreement probability and Pc is the agreement probability due to chance.
Root Mean Squared Error (RMSE): RMSE measures the difference between the predicted values and the actual values in the learning:
RMSE = sqrt( (1/N) * Σ_{j=1..N} (Ere - Eacc)^2 )    (9)
where Ere is the resultant error rate and Eacc is the actual error rate.
Mean Absolute Error: It is defined as:
MAE = ( |p1 - a1| + … + |pn - an| ) / n    (10)
Root Mean-Squared Error: It is defined as:
RMSE = sqrt( ( (p1 - a1)^2 + … + (pn - an)^2 ) / n )    (11)
Relative Squared Error: It is defined as:
RSE = ( (p1 - a1)^2 + … + (pn - an)^2 ) / ( (ā - a1)^2 + … + (ā - an)^2 )    (12)
Relative Absolute Error: It is defined as:
RAE = ( |p1 - a1| + … + |pn - an| ) / ( |ā - a1| + … + |ā - an| )    (13)
where a1, a2, …, an are the actual target values and p1, p2, …, pn are the predicted target values.
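As a worked illustration of equations (1)-(7), the short sketch below computes the main metrics from the four confusion-matrix counts. The example numbers are taken from the training-set confusion matrix reported in the next subsection, under the assumption (made only for illustration) that tested_positive is treated as the positive class.

# Worked illustration of equations (1)-(7): diagnostic metrics from TP/FP/TN/FN counts.
def evaluate(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)                  # equation (1)
    precision = tp / (tp + fp)                                   # equation (3)
    recall    = tp / (tp + fn)                                   # equation (5)
    f_measure = 2 * precision * recall / (precision + recall)    # equation (6)
    auc       = 0.5 * (tp / (tp + fn) + tn / (tn + fp))          # equation (7)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure, "auc": auc}

# Counts from the training-set confusion matrix reported below, with tested_positive
# taken as the "positive" class (this mapping is an assumption).
print(evaluate(tp=110, fp=24, tn=325, fn=79))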
The details of the evaluation on the training set and on the test split with the MLP NN are as follows.

Time taken to build model: 2.06 seconds
Evaluation on training set
Correctly Classified Instances      435     80.855%
Incorrectly Classified Instances    103     19.145%
Kappa statistic                     0.5499
Mean absolute error                 0.2528
Root mean squared error             0.3659
Relative absolute error             55.4353%
Root relative squared error         76.6411%
Total Number of Instances           538

Detailed Accuracy by Class
Class              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
tested_negative    0.931     0.418     0.804       0.931    0.863       0.872
tested_positive    0.582     0.069     0.821       0.582    0.681       0.872
Weighted Avg.      0.809     0.295     0.810       0.809    0.799       0.872

Confusion Matrix
  a     b    <-- classified as
 325    24   | a = tested_negative
  79   110   | b = tested_positive

Time taken to build model: 2.04 seconds
Evaluation on test split
Correctly Classified Instances      180     78.2609%
Incorrectly Classified Instances     50     21.7391%
Kappa statistic                     0.4769
Mean absolute error                 0.2716
Root mean squared error             0.387
Relative absolute error             59.8716%
Root relative squared error         81.4912%
Total Number of Instances           230

Detailed Accuracy by Class
Class              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
tested_negative    0.921     0.481     0.785       0.921    0.848       0.853
tested_positive    0.519     0.079     0.774       0.519    0.621       0.853
Weighted Avg.      0.783     0.343     0.781       0.783    0.770       0.853

Confusion Matrix
  a     b    <-- classified as
 139    12   | a = tested_negative
  38    41   | b = tested_positive

The figure shown below analyses the False Positive Rate versus the True Positive Rate; Figure 4 presents the ROC graph for the MLP NN methodology. The Pima Indian Diabetes Dataset classified using the MLP NN generates a low error rate, as shown in Figure 4.
Fig.4. Analysis of Positive Rate for Pima Indian Diabetes Dataset without GA
Table 1 shows the training and testing accuracy obtained by applying the MLP NN methodology.
Table 1. Classification with MLP NN
Table 2. GA Feature Reduction
Model Predictive Control: A Survey

I. Introduction
Model Predictive Control (MPC) is an advanced control strategy that has been widely applied in many fields, including chemical engineering, energy, machinery and automotive systems. MPC uses a mathematical model of the system to make predictions and, based on the predicted results, computes the optimal control actions to achieve the control objectives. This survey gives a comprehensive introduction to the principles, methods and application areas of MPC, as well as the latest progress in related research.
II. MPC Principles
2.1 Control objective
The main goal of MPC is to drive the system state along the specified reference trajectory as close as possible to the desired values, while satisfying the system constraints. These constraints may include limits on the input variables, ranges of the output variables, operating limits of the equipment, and so on.
2.2 Control model
MPC uses a mathematical model of the system to describe its behaviour; at each time step, based on the current state and the constraints, it predicts the behaviour of the system over a certain time horizon. Common system models include linear time-invariant (LTI) models, nonlinear models and hybrid discrete-continuous models.
2.3 Optimal control strategy
At each time step, MPC computes a set of optimal control inputs so that the predicted system states track the reference trajectory as closely as possible. This computation is usually carried out with an optimization algorithm, such as linear-quadratic programming (LQP) or a nonlinear optimization algorithm.
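As an illustration of this receding-horizon computation, the sketch below implements a minimal unconstrained linear MPC for a double-integrator model: at each step a finite-horizon quadratic cost is minimised in closed form and only the first input is applied. The model, horizon, weights and reference are illustrative assumptions, not taken from any particular system discussed here.

# Minimal linear MPC sketch for an LTI model x(k+1) = A x(k) + B u(k); all numbers
# are illustrative assumptions and constraints are ignored here.
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])          # simple double integrator, dt = 0.1
B = np.array([[0.005],
              [0.1]])
N = 15                              # prediction horizon
Q, R = np.diag([10.0, 1.0]), np.array([[0.1]])
x_ref = np.array([1.0, 0.0])        # reference state

# Prediction matrices over the horizon: X = F x0 + G U.
F = np.vstack([np.linalg.matrix_power(A, i + 1) for i in range(N)])
G = np.zeros((2 * N, N))
for i in range(N):
    for j in range(i + 1):
        G[2*i:2*i+2, j:j+1] = np.linalg.matrix_power(A, i - j) @ B

Qbar = np.kron(np.eye(N), Q)        # block-diagonal state weights
Rbar = np.kron(np.eye(N), R)        # input weights
ref = np.tile(x_ref, N)

def mpc_step(x0):
    # Unconstrained finite-horizon LQ problem solved in closed form:
    # U* = (G'QG + R)^-1 G'Q (ref - F x0); apply only the first input.
    H = G.T @ Qbar @ G + Rbar
    g = G.T @ Qbar @ (ref - F @ x0)
    return np.linalg.solve(H, g)[0]

x = np.array([0.0, 0.0])
for k in range(60):                 # receding-horizon simulation of the plant
    u = mpc_step(x)
    x = A @ x + B.flatten() * u
print("final state:", x)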
III. MPC Methods
3.1 Conventional MPC
Conventional MPC methods usually describe the system behaviour with a linear time-invariant (LTI) model and assume that the system states and control inputs are measurable. The advantages of this approach are its simplicity and practicality, and it is suitable for most linear systems.
3.2 Constraint-based MPC
Constraint-based MPC methods take the system constraints into account and guarantee that these constraints are satisfied during control. This approach is very useful in practical engineering problems, such as restricting the motion range of a robot arm or bounding a furnace temperature.
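A minimal sketch of such constraint-based MPC is given below: the same illustrative double-integrator model as above, but with hard bounds on the input, solved numerically at each step with scipy.optimize.minimize. All numbers are again assumptions chosen only for illustration.

# Constraint-based MPC sketch: finite-horizon quadratic cost with input bounds
# |u| <= 0.5, solved numerically at each step (illustrative assumptions throughout).
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
N, u_max = 15, 0.5
Q, R = np.diag([10.0, 1.0]), 0.1
x_ref = np.array([1.0, 0.0])

def cost(U, x0):
    J, x = 0.0, x0.copy()
    for u in U:                              # roll the model forward over the horizon
        x = A @ x + B.flatten() * u
        e = x - x_ref
        J += e @ Q @ e + R * u * u
    return J

def mpc_step(x0):
    res = minimize(cost, np.zeros(N), args=(x0,),
                   bounds=[(-u_max, u_max)] * N)   # input constraints enforced here
    return res.x[0]                          # receding horizon: apply first input only

x = np.array([0.0, 0.0])
for k in range(60):
    u = mpc_step(x)
    x = A @ x + B.flatten() * u
print("final state:", x)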
3.3 Nonlinear MPC
Nonlinear MPC methods describe the system with a nonlinear model and compute the control strategy with nonlinear optimization algorithms. Because it can handle more complex system behaviour, this approach has attracted the attention of researchers and engineers.
3.4 Model uncertainty
Model uncertainty is an important challenge for MPC. Since the model of the real system is often incomplete and subject to parameter errors or unknown disturbances, these uncertainties have to be taken into account when designing an MPC controller, and appropriate compensation strategies have to be adopted.
J Clin Hepatol, Vol. 39, No. 7, Jul. 2023   DOI: 10.3969/j.issn.1001-5256.2023.07.011

Value of serum chitinase-3-like protein 1 (CHI3L1) in predicting the risk of decompensation events in patients with liver cirrhosis
杨航1, 赵黎莉2, 韩萍1, 陈庆灵1, 文君2, 刘洁2, 程晓静2, 李嘉2
1. Clinical College, The Second People's Hospital of Tianjin Medical University, Tianjin 300070; 2. Department of Hepatology, Tianjin Second People's Hospital, Tianjin 300192
Corresponding author: 李嘉, 163.com (ORCID: 0000-0003-0100-417X)

Abstract: Objective: Predicting decompensation events, so that active preventive measures can be taken, is the key to improving the survival of patients with liver cirrhosis. This study aimed to explore the value of serum chitinase-3-like protein 1 (CHI3L1) in predicting the risk of decompensation events in patients with liver cirrhosis.
Methods: A case-control study was conducted on 305 patients with liver cirrhosis diagnosed and treated at Tianjin Second People's Hospital from January 2019 to May 2021. At baseline, 200 patients had compensated cirrhosis and 105 had decompensated cirrhosis. The 305 patients with cirrhosis were grouped according to whether a decompensation event occurred within one year, with 79 patients in the decompensation-event group and 226 in the no-decompensation-event group; the 200 patients with compensated cirrhosis were grouped according to whether a first decompensation event occurred within one year, with 43 patients in the first-decompensation-event group and 157 in the no-first-decompensation-event group. Normally distributed continuous data were compared between two groups with the independent-samples t-test or the Mann-Whitney U test; categorical data were compared between groups with the Wilcoxon rank-sum test or the χ2 test. Binary logistic regression analysis was used to assess the association between each variable and decompensation events; the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the predictive value of each variable for decompensation events, and the maximum Youden index was used to determine the optimal cut-off value.
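As an illustration of this ROC/Youden-index procedure (not the study's actual analysis), the sketch below computes the AUC of a marker for a binary outcome and selects the cut-off that maximises sensitivity + specificity - 1. The CHI3L1-like values are synthetic placeholders; only the group sizes (79 events, 226 non-events) echo the study.

# Sketch of AUC computation and maximum-Youden-index cut-off selection.
# The marker values are synthetic placeholders, not the study's CHI3L1 measurements.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
event = np.r_[np.ones(79), np.zeros(226)]                           # 79 events, 226 non-events
chi3l1 = np.r_[rng.normal(250, 90, 79), rng.normal(130, 70, 226)]   # ng/mL, synthetic

auc = roc_auc_score(event, chi3l1)
fpr, tpr, thresholds = roc_curve(event, chi3l1)
youden = tpr - fpr                                                  # Youden index at each cut-off
best = np.argmax(youden)
print(f"AUC = {auc:.3f}, optimal cut-off = {thresholds[best]:.1f} ng/mL "
      f"(sensitivity {tpr[best]:.2f}, specificity {1 - fpr[best]:.2f})")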
Results: The baseline serum CHI3L1 level of the patients who experienced a decompensation event within one year was higher than that of those who did not [243.00 (136.00–372.00) ng/mL vs 117.50 (67.75–205.25) ng/mL, U = 4720.500, P < 0.001]; the baseline serum CHI3L1 level of the patients who experienced a first decompensation event within one year was higher than that of those who did not [227.98 (110.00–314.00) ng/mL vs 90.00 (58.00–168.50) ng/mL, U = 1681.500, P < 0.001].
Association Rules and Predictive Modelsfor e-Banking ServicesVasilis Aggelis Dimitris ChristodoulakisDepartment of Computer Engineering and InformaticsUniversity of Patras, Rio, Patras, GreeceEmail: Vasilis.Aggelis@egnatiabank.gr dxri@cti.grAbstract. The introduction of data mining methods in the banking area al-though conducted in a slower way than in other fields, mainly due to the natureand sensitivity of bank data, can already be considered of great assistance tobanks as to prediction, forecasting and decision making. One particular methodis the investigation for association rules between products and services a bankoffers. Results are generally impressive since in many cases strong relations areestablished, which are not easily observed at a first glance. These rules are usedas additional tools aiming at the continuous improvement of bank services andproducts helping also in approaching new customers. In addition, the develop-ment and continuous training of prediction models is a very significant task, es-pecially for bank organizations. The establishment of such models with the ca-pacity of accurate prediction of future facts enhances the decision making andthe fulfillment of the bank goals, especially in case these models are applied onspecific bank units. E-banking can be considered such a unit receiving influencefrom a number of different sides. Scope of this paper is the demonstration of theapplication of data mining methods to e-banking. In other words associationrules concerning e-banking are discovered using different techniques and a pre-diction model is established, depending on e-banking parameters like the trans-actions volume conducted through this alternative channel in relation with othercrucial parameters like the number of active users.1 IntroductionDetermination of association rules concerning bank data is a challenging though de-manding task since:−The volume of bank data is enormous. Therefore the data should be adequately prepared by the data miner before the final step of the application of the method.−The objective must be clearly set from the beginning. In many cases not having a clear aim results in erroneous or no results at all.−Good knowledge of the data is a prerequisite not only for the data miner but also for the final analyst (manager). Otherwise wrong results will be produced by the data miner and unreliable conclusions will be drawn by the manager.−Not all rules are of interest. The analyst should recognize powerful rules as to deci-sion making.2The challenge of the whole process rises from the fact that the relations established are not easily observed without use of data mining methods.A number of data mining methods are nowadays widely used in treatment of finan-cial and bank data [22]. The contribution of these methods in prediction making is essential. Study and prediction of bank parameter’s behavior generally receives vari-ous influences. Some of these can be attributed to internal parameters while others to external financial ones. Therefore prediction making requires the correct parameter determination that will be used for predictions and finally for decision making. Meth-ods like Classification, Regression, Neural Networks and Decision Trees are applied in order to discover and test predictive models.Prediction making can be applied in specific bank data sets. An interesting data sector is e-banking. E-banking is a relatively new alternative payment and informa-tion channel offered by the banks. 
The familiarity of continuously more people to the internet use, along with the establishment of a feeling of increasing safety in conduc-tion of electronic transactions imply that slowly but constantly e-banking users in-crease. Thus, the study of this domain is of great significance to a bank.The increase of electronic transactions during the last years is quite rapid. E-banking nowadays offers a complete sum of products and services facilitating not only the individual customers but also corporate customers to conduct their transac-tions easily and securely. The ability to discover rules between different electronic services is therefore of great significance to a bank.Identification of such rules offers advantages as the following:−Description and establishment of the relationships between different types of elec-tronic transactions.−The electronic services become more easily familiar to the public since specific groups of customers are approached, that use specific payment manners.−Customer approach is well designed with higher possibility of successful engage-ment.−The improvement of already offered bank services is classified as to those used more frequently.−Reconsidering of the usefulness of products exhibiting little or no contribution to the rules.Of certain interest is the certification of the obtained rules using other methods of data mining to assure their accuracy.This study is concerned with the identification of rules between several different ways of payment and also we focus in prediction making and testing concerning the transactions volume through e-banking in correlation to the number of active users. The software used is SPSS Clementine 7.0. Section 2 contains association rules’ basic features and algorithms. A general description of prediction methods can be found in section 3, while in section 4 the process of investigation for association rules and the process of prediction making and testing is described. Experimental results can be found in section 5 and finally section 6 contains the main conclusions of this work accompanied with future work suggestions in this area.32 Association Rules BasicsA rule consists of a left-hand side proposition (antecedent) and a right-hand side (con-sequent) [3]. Both sides consist of Boolean statements. The rule states that if the left-hand side is true, then the right-hand is also true. A probabilistic rule modifies this definition so that the right-hand side is true with probability p, given that the left-hand side is true.A formal definition of association rule [6, 7, 14, 15, 18, 19, 21] is given below.Definition. An association rule is a rule in the form ofX Æ Y (1) where X and Y are predicates or set of items.As the number of produced associations might be huge, and not all the discovered associations are meaningful, two probability measures, called support and confidence, are introduced to discard the less frequent associations in the database. The support is the joint probability to find X and Y in the same group; the confidence is the condi-tional probability to find in a group Y having found X.Formal definitions of support and confidence [1, 3, 14, 15, 19, 21] are given below.Definition Given an itemset pattern X, its frequency fr(X) is the number of cases in the data that satisfy X. Support is the frequency fr(X ∧ Y). 
Confidence is the fraction of rows that satisfy Y among those rows that satisfy X,

c(X → Y) = fr(X ∧ Y) / fr(X)    (2)

In terms of conditional probability notation, the empirical accuracy of an association rule can be viewed as a maximum likelihood (frequency-based) estimate of the conditional probability that Y is true, given that X is true.

2.1 Apriori Algorithm

Association rules are among the most popular representations for local patterns in data mining. Apriori algorithm [1, 2, 3, 6, 10, 14] is one of the earliest for finding association rules. This algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. This algorithm contains a number of passes over the database. During pass k, the algorithm finds the set of frequent itemsets Lk of length k that satisfy the minimum support requirement. The algorithm terminates when Lk is empty. A pruning step eliminates any candidate, which has a smaller subset. The pseudo code for Apriori Algorithm is following:

Ck: candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
For (k=1; Lk != null; k++) do begin
    Ck+1 = candidates generated from Lk;
    For each transaction t in database do
        Increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
End
Return Lk;

2.2 Generalized Rule Induction (GRI)

In the generalized rule induction task [3,4,5,11,15,16,17], the goal is to discover rules that predict the value of a goal attribute, given the values of other attributes. However the goal attribute can be any attribute not occurring in the antecedent of the rule.

3 Classification and Regression

There are two types of problems that can be solved by Classification and Regression methods [3].
Regression-type problems. The problem of regression consists of obtaining a functional model that relates the value of a response continuous variable Y with the values of variables X1, X2, ..., Xv (the predictors). This model is usually obtained using samples of the unknown regression function.
Classification-type problems. Classification-type problems are generally those where one attempts to predict values of a categorical response variable from one or more continuous and/or categorical predictor variables.

3.1 Classification and Regression Trees (C&RT)

C&RT method [23, 24] builds classification and regression trees for predicting continuous dependent variables (regression) and categorical predictor variables (classification). There are numerous algorithms for predicting continuous variables or categorical variables from a set of continuous predictors and/or categorical factor effects. In most general terms, the purpose of the analysis via tree-building algorithms is to determine a set of if-then logical conditions that permit accurate prediction or classification of cases.
Tree classification techniques [3, 25, 26], when they "work" and produce accurate predictions or predicted classifications based on a few logical if-then conditions, have a number of advantages over many of those alternative techniques.
Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple.
This simplicity is useful not only for purposes of rapid classifica-tion of new observations but can also often yield a much simpler "model" for explain-ing why observations are classified or predicted in a particular manner.Tree methods are nonparametric and nonlinear.The final results of using tree methods for classification or regression can be summarized in a series of logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the under-lying relationships between the predictor variables and the response variable are lin-5 ear, follow some specific non-linear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where neither a priori knowledge is available nor any coherent set of theories or predictions regarding which variables are related and how. In those types of data analysis, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.3.2 Evaluation ChartsComparison and evaluation of predictive models in order for the best to be chosen is a task easily performed by the evaluation charts [4, 27]. They reveal the performance of the models concerning particular outcomes. They are based on sorting the records as to the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and finally plotting the value of the business criterion for each quantile, from maximum to minimum.Different types of evaluation charts emphasize on a different evaluation criteria.Gains Charts: The proportion of total hits that occur in each quantile are defined as Gains. They are calculated as the result of: (number of hits in quantile/total number of hits) X 100%.Lift Charts:Lift is used to compare the percentage of records occupied by hits with the overall percentage of hits in the training data. Its values are computed as (hits in quantile/records in quantile) / (total hits/total records).Evaluation charts can also be expressed in cumulative form, meaning that each point equals the value for the corresponding quantile plus all higher quantiles. Cumu-lative charts usually express the overall performance of models in a more adequate way, whereas non-cumulative charts [4] often succeed in indicating particular prob-lem areas for models.The interpretation of an evaluation chart is a task certainly depended on the type of chart, although there are some characteristics common to all evaluation charts. Con-cerning cumulative charts, higher lines indicate more effective models, especially on the left side of the chart. In many cases, when comparing multiple models lines cross, so that one model stands higher in one part of the chart while another is elevated higher than the first in a different part of the chart. In this case, it is necessary to con-sider which portion of the sample is desirable (which defines a point on the x axis) when deciding which model is the appropriate.Most of the non-cumulative charts will be very similar. For good models, noncu-mulative charts [4] should be high toward the left side of the chart and low toward the right side of the chart. Dips on the left side of the chart or spikes on the right side can indicate areas where the model is predicting poorly. A flat line across the whole graph indicates a model that essentially provides no information.Gains charts. 
Cumulative gains charts extend from 0% starting from the left to 100% at the right side. In the case of a good model, the gains chart rises steeply towards 100% and then levels off. A model that provides no information typically follows the diagonal from lower left to upper right.
Lift charts. Cumulative lift charts tend to start above 1.0 and gradually descend until they reach 1.0 heading from left to right. The right edge of the chart represents the entire data set, so the ratio of hits in cumulative quantiles to hits in data is 1.0. In case of a good model the lift should start well above 1.0 on the left, remain on a high plateau moving to the right, trailing off sharply towards 1.0 on the right side of the chart. For a model that provides no information, the line would hover around 1.0 for the entire graph.

4 Investigation for Association Rules and Generation of Predictions in e-Banking Data Set

In this section the determination of relations between different payment types e-banking users conduct is described. These relations are of great importance to the bank as they enhance the selling of its products, by making the bank more aware of its customer's behavior and the payment way they prefer. They also contribute in improvement of its services. Namely the following types of payments were used:
1. DIASTRANSFER Payment Orders – Funds Transfers between banks that are participating in DIASTRANSFER System.
2. SWIFT Payment Orders – Funds Transfers among national and foreign banks.
3. Funds Transfers – Funds Transfers inside the bank.
4. Forward Funds Transfers – Funds Transfers inside the bank with forward value date.
5. Greek Telecommunications Organization (GTO) and Public Power Corporation (PPC) Standing Orders – Standing Orders for paying GTO and PPC bills.
6. Social Insurance Institute (SII) Payment Orders – Payment Orders for paying employers' contributions.
7. VAT Payment Orders.
8. Credit Card Payment Orders.
For the specific case of this study a random sample of e-banking customers, both individuals and companies, was used. A Boolean value is assigned to each different payment type depending on whether the payment has been conducted by the users or not. In case the user is individual, field «company» receives the value 0, else 1. A sample of the above data set is shown in Table 1. In total the sample contains 1959 users.

User     Comp  DiasTransfer P.O.  Swift P.O.  Funds Transfer  Forward Funds Transfer  GTO-PPC S.O.  SII P.O.  VAT P.O.  Cr.Card P.O.
…        …     …                  …           …               …                       …             …         …         …
User103  0     T                  F           T               T                       F             F         F         T
User104  1     T                  T           T               T                       F             T         T         F
User105  1     T                  F           T               T                       T             T         T         T
User106  0     T                  F           T               F                       T             F         F         T
…        …     …                  …           …               …                       …             …         …         …
Table 1.

In order to discover association rules the Apriori [4] and Generalized Rule Induction (GRI) [4] methods were used. These rules are statements in the form if antecedent then consequent.
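As an illustration only: the rules reported in this paper were produced with the Apriori and GRI nodes of SPSS Clementine, but the sketch below shows how the support and confidence of simple pairwise rules can be computed over a Boolean usage table shaped like Table 1. The randomly generated usage matrix is a placeholder for the real 1959-user sample, and the 10% / 40% thresholds of Section 5 are expressed as proportions.

# Support/confidence of pairwise rules over a Boolean usage table shaped like Table 1.
# The random data are placeholders; the real analysis used SPSS Clementine.
import numpy as np

services = ["DIASTRANSFER P.O.", "SWIFT P.O.", "Funds Transfer", "Forward Funds Transfer",
            "GTO-PPC S.O.", "SII P.O.", "VAT P.O.", "Credit Card P.O."]
rng = np.random.default_rng(7)
usage = rng.random((1959, len(services))) < 0.3       # placeholder for the 1959-user sample

n = usage.shape[0]
min_support, min_confidence = 0.10, 0.40              # thresholds as proportions
for i, antecedent in enumerate(services):
    for j, consequent in enumerate(services):
        if i == j:
            continue
        both = np.logical_and(usage[:, i], usage[:, j]).sum()
        support = both / n                            # fr(X and Y) / n
        confidence = both / usage[:, i].sum()         # fr(X and Y) / fr(X)
        if support >= min_support and confidence >= min_confidence:
            print(f"if {antecedent} then {consequent}  "
                  f"(support {support:.2f}, confidence {confidence:.2f})")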
In the case of this study, as stated above, another objective is the production and testing of predictions about the volume of e-banking transactions in relation to the active users. The term financial transactions stands for all payment orders or standing orders a user carries out, excluding transactions concerning information content like account balance, detailed account transactions or mini statement. The term «active» describes the user who currently makes use of the electronic services a bank offers. Active users are a subgroup of the enlisted users. One day is defined as the time unit. The number of active users is the predictor variable (Count_Of_Active_Users) while the volume of financial transactions is assumed to be the response variable (Count_Of_Payments). A sample of the above data set is shown in Table 2.

Transaction Day  Count_Of_Active_Users  Count_Of_Payments
...              ...                    ...
27/8/2002        99                     228
28/8/2002        107                    385
29/8/2002        181                    915
30/8/2002        215                    859
...              ...                    ...
Table 2.

The date range for which the data set is applicable counts from April 20th, 2001 until December 12th, 2002. The data set includes data only for active days (holidays and weekends not included), which means 387 occurrences. In order to create predictions the Classification and Regression Tree (C&R Tree) [4] was used.

5 Experimental Results

Fig. 1.
Apriori: By the use of the Apriori method and setting Minimum Rule Support = 10 and Minimum Rule Confidence = 20, eleven rules were determined, as shown in Figure 1.
GRI: Using the GRI method with Minimum Rule Support = 10 and Minimum Rule Confidence = 40 resulted in the four rules seen in Figure 2.
Fig. 2.
After comparison between the rules obtained by the two methods it can be concluded that the most powerful is:
− If SII Payment Order then VAT Payment Order (confidence=81). (Rule 1)
Additionally it can be observed that this rule is valid for the reverse case too, namely:
− If VAT Payment Order then SII Payment Order (confidence=47). (Rule 2)
Two other rules obtained exhibiting confidence > 40 are the following:
− If Funds Transfers then Forward Funds Transfers (confidence=47), (Rule 3) valid also reversely.
− If Forward Funds Transfers then Funds Transfers (confidence=42). (Rule 4)
As seen, there exists a strong relationship between VAT Payment Order and SII Payment Order, as is the case between Funds Transfers and Forward Funds Transfers. These strong relationships are visualized on the web graph of Figure 3 and the tables of its links in Figure 4.
Fig. 3.
Fig. 4.
Interestingness [8, 9, 12, 13, 19, 20] of the above rules is high, bearing in mind that generally a pattern or rule can be characterized interesting if it is easily understood, unexpected, potentially useful and actionable, or it validates some hypothesis that a user seeks to confirm. Given the fact that the payment types used in the rules exhibiting the highest confidence (VAT Payment Order, SII Payment Order) are usually payments conducted by companies and freelancers, a smaller subdivision of the whole sample was considered containing only companies (958 users), to which the Apriori and GRI methods were applied. The results are described below.
Apriori: Using the Apriori method and setting Minimum Rule Support = 10 and Minimum Rule Confidence = 40, the seven rules shown in Figure 5 were obtained.
Fig. 5.
GRI: The GRI method with Minimum Rule Support = 10 and Minimum Rule Confidence = 50 resulted in two rules, see Figure 6.
Fig. 6.
The comparison of the above rules leads to the conclusion that the validity of rules 1 and 2 is confirmed, exhibiting even increased support:
− If SII Payment Order then VAT Payment Order (confidence=81). (Rule 5)
− If VAT Payment Order then SII Payment Order (confidence=51). (Rule 6)

5.1 C&RT

In Figure 7 the conditions defining the partitioning of data discovered by the C&RT algorithm are displayed. These specific conditions compose a predictive model.
Fig. 7.
The C&RT algorithm [4] works by recursively partitioning the data based on input field values. The data partitions are called branches. The initial branch (sometimes called the root) encompasses all data records.
The root is split into subsets or child branches, based on the value of a particular input field. Each child branch may be further split into sub-branches, which may in turn be split again, and so on. At the lowest level of the tree are branches that have no more splits. Such branches are known as terminal branches, or leaves.11 Figure 7 shows the input values that define each partition or branch and a summary of output field values for the records in that split. For example a condition with many instances (201) is:If Count_of_Active_Users < 16.5 then Count_of_Payments Ö 11,473, meaning that if the number of active users during a working day is less than 16.5 then the number of transactions is predicted approximately at the value of 11.Fig. 8.Figure 8 shows a graphical display of the structure of the tree in detail. Goodness of fit of the discovered model was assessed by the use of evaluation charts. Examples of evaluation Charts (Figure 9 to Figure 10) are shown below.Gains Chart: This Chart (Figure 9) shows that the gain rises steeply towards 100% and then levels off. Using the prediction of the model, the percentage of Pre-dicted Count of Payments for the percentile is calculated and these points create the lift curve. An important notice is that the efficiency of a model is increased with the area between the lift curve and the baseline.Fig. 9.12Lift Chart: As can be seen in Figure 10, Chart starts well above 1.0 on the left, remains on a high plateau moving to the right, and then trails off sharply towards 1.0 on the right side of the chart. Using the prediction of the model shows the actual lift.Fig. 10.In Figure 11 the Count of Payment and the Predicted Count of Payment as func-tions of the number of active users can be seen. The Predicted Count of Payment line clearly reveals the partitioning of the data due to the application of the C&RT algo-rithm.Fig. 11.6 Conclusions and Future WorkIn the present paper the way of discovery of association rules between different types of e-banking payment offered by a bank is described along with experimental results.The basic outcome is that VAT Payment Orders and SII Payment Orders are the most popular and interconnected strongly. The detection of such relationships offers a bank a detailed analysis that can be used as a reference point for the conservation of the customer volume initially but also for approaching new customer groups (compa-nies, freelancers) who conduct these payments in order to increase its customer num-ber and profit.13 Similar patterns can be derived after analysis of payment types of specific cus-tomer groups resulting from various criteria either gender, residence area, age, profes-sion or any other criteria the bank assumes significant. The rules to be discovered show clearly the tendencies created in the various customer groups helping the bank to approach these groups as well as re-design the offered electronic services in order to become more competitive.In addition to the two methods used, detection of association rules can employ sev-eral other data mining methods such as decision trees in order to confirm the validity of the results.In this study the establishment of a prediction model concerning the number of payments conducted through Internet as related to the number of active users is inves-tigated along with testing its accuracy.Using the methods C&RT it was concluded that there exists strong correlation be-tween the number of active users and the number of payments these users conduct. 
It is clear that the increase of the users’ number results in increase of the transactions made. Therefore, the enlargement of the group of active users should be a strategic target for a bank since it increases the transactions and decreases its service cost. It has been proposed by certain researches that the service cost of a transaction reduces from €1.17 to just €0.13 in case it is conducted in electronic form. Therefore, a goal of the e-banking sector of a bank should be the increase of the proportion of active users compared to the whole number of enlisted customers from which a large num-ber do not use e-services and is considered inactive.Future work includes the creation of prediction models concerning other e-banking parameters like the transactions value, the use of specific payment types, commissions of electronic transactions and e-banking income in relation to internal bank issues. Also extremely interesting is the case of predictive models based on external financial parameters.The prediction model of this study could be determined using also other methods of data mining. Use of different methods offers the ability to compare between them and select the more suitable. The acceptance and continuous training of the model using the appropriate Data Mining method results in extraction of powerful conclu-sions and results.References1. M.J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. “Evaluation of Sampling for Data Min-ing of Association Rules”, Proceedings 7th Workshop Research Issues Data Engineering, (1997)2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. “Fast Discovery ofAssociations Rules”, In: Advances in Knowledge Discovery and Data Mining, (1996)3. D. Hand, H. Mannila, and P. Smyth “Principles of Data Mining”. The MIT Press, (2001)4. “Clementine 7.0 Users’s Guide”. Integral solutions Limited, (2002)5. A. Freitas. “A Genetic Programming Framework for Two Data Mining Tasks: Classificationand Generalized Rule Induction”, Proceedings 2nd Annual Conference on Genetic Pro-gramming (1997)6. E. Han, G. Karypis, and V. Kumar. “Scalable Parallel Data Mining for Association Rules”,Proceedings ACM SIGMOD Conference, (1997)147. S. Brin, R. Motwani, and C. Silverstein. “Beyond Market Baskets: Generalizing AssociationRules to Correlations”, Proceedings ACM SIGMOD Conference, (1997)8. M. Chen, J. Han, and P. Yu. “Data Mining: An Overview from Database Perspective”, IEEETransactions on Knowledge and Data Engineering, (1997)9. N. Ikizler, and H.A. Guvenir. “Mining Interesting Rules in Bank Loans Data”, Proceedings10th Turkish Symposium on Artificial Intelligence and Neural Networks, (2001)10. R. Ng, L. Lakshmanan, and J. Han. “Exploratory Mining and Pruning Optimizations ofConstrained Association Rules”, Proceedings ACM SIGMOD Conference, (1998)11. A. Freitas. “A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discov-ery”, In: A. Ghosh and S. Tsutsui. (eds.) Advances in Evolutionary Computation (2002)12. A. Freitas. “On Rule Interestingness Measures”, Knowledge-Based Systems, (1999)13. R. Hilderman, and H. Hmailton. “Knowledge Discovery and Interestingness Measures: aSurvey”, Tech Report CS 99-04, Department of Computer Science, Univ. of Regina, (1999) 14. J. Hipp, U. Guntzer, and G. Nakhaeizadeh. “Data Mining of Associations Rules and theProcess of Knowledge Discovery in Databases”, (2002)15. A. Michail. 
“Data Mining Library Reuse Patterns using Generalized Association Rules”,Proceedings 22nd International Conference on Software Engineering, (2000)16. J. Liu, and J. Kwok. “An Extended Genetic Rule Induction Algorithm”, Proceedings Con-gress on Evolutionary Computation (CEC), (2000)17. P. Smyth, and R. Goodman. “Rule Induction using Information Theory”, Knowledge Dis-covery in Databases, (1991)18. T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. “Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization”, Pro-ceedings ACM SIGMOD Conference, (1996)19. R. Srikant, and R. Agrawal. “Mining Generalized Association Rules”, Proceedings21stVLDB Conference, (1995)20. T. Yilmaz, and A. Guvenir. “Analysis and Presentation of Interesting Rules”, Proceedings10th Turkish Symposium on Artificial Intelligence and Neural Networks, (2001)21. H. Toivonen. “Discovery of Frequent Patterns in Large Data Collections”, Tech Report A-1996-5, Department of Computer Science, Univ. of Helsinki, (1996)22. D. Foster, and R. Stine. “Variable Selection in Data Mining: Building a Predictive Modelfor Bankruptcy”, Center for Financial Institutions Working Papers from Wharton School Center for Financial Institutions, Univ. of Pennsylvania, (2002)23. J. Galindo. “Credit Risk Assessment using Statistical and Machine Learning: Basic Meth-odology and Risk Modeling Applications”, Computational Economics Journal, (1997)24. R.Lewis. “An Introduction to Classification and Regression tree (CART) Analysis”, AnnualMeeting of the Society for Academic Emergency Medicine in San, (2000)25. L. Breiman, J. H. Friedman, R.A. Olshen, and C.J. Stone. “Classification and RegressionTrees”, Wadsworth Publications, (1984)26. A. Buja, and Y. Lee. “Data Mining Criteria for Tree-based Regression and Classification”,Proceedings 7th ACM SIGKDD Conference (2001)27. S.J. Hong, and S. Weiss. “Advances in Predictive Model Generation for Data Mining”Proceedings International Workshop Machine Learning and Data Mining in Pattern Recog-nition, (1999)。