Association Rules and Predictive Models for e-Banking Services
Class 5: ANOVA (Analysis of Variance) and F-tests

I. What is ANOVA

ANOVA is the short name for the Analysis of Variance. The essence of ANOVA is to decompose the total variance of the dependent variable into two additive components, one for the structural part and the other for the stochastic part of a regression. Today we are going to examine the easiest case.

II. ANOVA: An Introduction

Let the model be
    y = Xβ + ε.
Assuming x_i is a column vector (of length p) of independent-variable values for the i-th observation,
    y_i = x_i'β + ε_i.
Then x_i'b is the predicted value.

Sum of squares total:
    SST = Σ (y_i − Ȳ)²
        = Σ [(y_i − x_i'b) + (x_i'b − Ȳ)]²
        = Σ (y_i − x_i'b)² + 2 Σ (y_i − x_i'b)(x_i'b − Ȳ) + Σ (x_i'b − Ȳ)²
        = Σ e_i² + Σ (x_i'b − Ȳ)²
because Σ (y_i − x_i'b)(x_i'b − Ȳ) = Σ e_i (x_i'b − Ȳ) = 0. This is always true by OLS. Hence
    SST = SSE + SSR.

Important: the total variance of the dependent variable is decomposed into two additive parts: SSE, which is due to errors, and SSR, which is due to regression. Geometric interpretation: [blackboard]

Decomposition of Variance

If we treat X as a random variable, we can decompose total variance into the between-group portion and the within-group portion in any population:
    V(y_i) = V(x_i'β) + V(ε_i).     (1)
Proof:
    V(y_i) = V(x_i'β + ε_i)
           = V(x_i'β) + V(ε_i) + 2 Cov(x_i'β, ε_i)
           = V(x_i'β) + V(ε_i)
(by the assumption that Cov(x_k'β, ε) = 0 for all possible k).

The ANOVA table estimates the three quantities of equation (1) from the sample. As the sample size gets larger and larger, the ANOVA table approaches the equation closer and closer. In a sample, the decomposition of estimated variance is not strictly true. We thus need to decompose sums of squares and degrees of freedom separately. Is ANOVA a misnomer?

III. ANOVA in Matrix Form

I will try to give a simplified representation of ANOVA as follows:
    SST = Σ (y_i − Ȳ)²
        = Σ (y_i² − 2Ȳ y_i + Ȳ²)
        = Σ y_i² − 2Ȳ Σ y_i + nȲ²
        = Σ y_i² − 2nȲ² + nȲ²        (because Σ y_i = nȲ)
        = Σ y_i² − nȲ²
        = y'y − nȲ²
        = y'y − (1/n) y'Jy           (in your textbook, the monster look)

    SSE = e'e

    SSR = Σ (x_i'b − Ȳ)²
        = Σ (x_i'b)² − 2Ȳ Σ x_i'b + nȲ²
        = Σ (x_i'b)² − 2Ȳ Σ (y_i − e_i) + nȲ²
        = Σ (x_i'b)² − 2nȲ² + nȲ²    (because Σ y_i = nȲ and Σ e_i = 0, as always)
        = Σ (x_i'b)² − nȲ²
        = b'X'Xb − nȲ²
        = b'X'y − (1/n) y'Jy         (in your textbook, the monster look)

IV. ANOVA Table

Let us use a real example. Assume that we have a regression estimated to be
    y = −1.70 + 0.840 x

ANOVA Table

SOURCE      SS     DF   MS     F
Regression  6.44    1   6.44   6.44/0.19 = 33.89, with (1, 18) df
Error       3.40   18   0.19
Total       9.84   19

We know Σ x_i = 100, Σ y_i = 50, Σ x_i² = 509.12, Σ y_i² = 134.84, and Σ x_i y_i = 257.66. If we know that the DF for SST is 19, what is n?
    n = 20
    Ȳ = 50/20 = 2.5
    SST = Σ y_i² − nȲ² = 134.84 − 20 × 2.5 × 2.5 = 9.84
    SSR = Σ (−1.70 + 0.84 x_i)² − nȲ²
        = 20 × 1.7 × 1.7 + 0.84 × 0.84 × 509.12 − 2 × 1.7 × 0.84 × 100 − 125.0
        = 6.44
    SSE = SST − SSR = 9.84 − 6.44 = 3.40

DF (degrees of freedom): demonstration. Note: the intercept is discounted when calculating the DF for SST (so DF = n − 1 = 19). MS = SS/DF.
p = 0.000 [ask students]. What does the p-value say?
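For readers who want to check this table by hand, the short Python sketch below recomputes the coefficients and the sums of squares from the five sums reported above. It is only an illustration added for verification; the variable names are chosen here and are not part of the original notes.

```python
# Recompute the ANOVA table of Section IV from the summary statistics given
# in the notes: n = 20, sum(x) = 100, sum(y) = 50, sum(x^2) = 509.12,
# sum(y^2) = 134.84, sum(x*y) = 257.66.

n = 20
sum_x, sum_y = 100.0, 50.0
sum_x2, sum_y2, sum_xy = 509.12, 134.84, 257.66

x_bar, y_bar = sum_x / n, sum_y / n

# OLS slope and intercept for the simple regression y = b0 + b1*x
b1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)   # ~0.840
b0 = y_bar - b1 * x_bar                                          # ~-1.70

# Sums of squares (see the derivations in Section III)
sst = sum_y2 - n * y_bar ** 2                # ~9.84
ssr = b1 ** 2 * (sum_x2 - n * x_bar ** 2)    # ~6.43 (6.44 in the table, which uses the rounded coefficients)
sse = sst - ssr                              # ~3.41

# Mean squares and the F statistic with (1, n - 2) degrees of freedom
msr, mse = ssr / 1, sse / (n - 2)
f_stat = msr / mse                           # ~34 (33.89 in the table, after rounding)

print(b0, b1, sst, ssr, sse, f_stat)
```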
V. F-Tests

F-tests are more general than t-tests; t-tests can be seen as a special case of F-tests. If you have difficulty with F-tests, please ask your GSIs to review F-tests in the lab. An F-test takes the form of a ratio of two MS's:
    F(df1, df2) = MSR/MSE.
An F statistic has two degrees of freedom associated with it: the degrees of freedom in the numerator and the degrees of freedom in the denominator. An F statistic is usually larger than 1. The F statistic addresses whether the variance explained under the alternative hypothesis is due to chance. In other words, the null hypothesis is that the explained variance is due to chance, or that all the coefficients are zero. The larger the F statistic, the more likely it is that the null hypothesis is not true. There is a table in the back of your book from which you can find exact probability values. In our example, the F is 34, which is highly significant.

VI. R²

R² = SSR/SST: the proportion of variance explained by the model. In our example, R² = 65.4%.

VII. What happens if we add more independent variables?

1. SST stays the same.
2. SSR always increases.
3. SSE always decreases.
4. R² always increases.
5. MSR usually increases.
6. MSE usually decreases.
7. The F statistic usually increases.
Exceptions to 5 and 7: irrelevant variables may not explain the variance but they take up degrees of freedom. We really need to look at the results.

VIII. Important: General Ways of Hypothesis Testing with F-Statistics

All tests in linear regression can be performed with F-test statistics. The trick is to run "nested models." Two models are nested if the independent variables in one model are a subset, or linear combinations of a subset, of the independent variables in the other model.
That is to say, if model A has independent variables (1, x1, x2) and model B has independent variables (1, x1, x2, x3), then A and B are nested. A is called the restricted model; B is called the less restricted or unrestricted model. We call A restricted because A implies that β3 = 0. This is a restriction.
Another example: C has independent variables (1, x1, x2 + x3), and D has (1, x2 + x3). C and A are not nested. C and B are nested; one restriction in C: β2 = β3. C and D are nested; one restriction in D: β1 = 0. D and A are not nested. D and B are nested; two restrictions in D: β2 = β3 and β1 = 0.
We can always test the hypotheses implied in the restricted models. Steps: for each hypothesis, run two regressions, one for the restricted model and one for the unrestricted model. The SST should be the same across the two models. What is different is SSE and SSR; that is, what is different is R². Let SSE_r have df_r = n − p_r degrees of freedom and SSE_u have df_u = n − p_u, so that
    df_u − df_r = (n − p_u) − (n − p_r) = p_r − p_u < 0.
Use the following formula:
    F(df_r − df_u, df_u) = [(SSE_r − SSE_u)/(df_r − df_u)] / [SSE_u/df_u]
or
    F(df_r − df_u, df_u) = [(SSR_u − SSR_r)/(df(SSR_u) − df(SSR_r))] / [SSE_u/df_u]
(proof: use SST = SSE + SSR).
Note that df(SSE_r) − df(SSE_u) = df(SSR_u) − df(SSR_r) = Δdf, the number of constraints (not the number of parameters) implied by the restricted model. Equivalently,
    F(Δdf, df_u) = [(R_u² − R_r²)/Δdf] / [(1 − R_u²)/df_u].
Note that
    t²(df) = F(1, df).
That is, for 1-df tests you can do either an F-test or a t-test; they yield the same result. Another way to look at it is that the t-test is a special case of the F-test, with the numerator DF being 1.

IX. Assumptions of F-tests

What assumptions do we need to make an ANOVA table work? Not much of an assumption. All we need is the assumption that (X'X) is not singular, so that the least-squares estimate b exists. The assumption X'ε = 0 is needed if you want the ANOVA table to be an unbiased estimate of the true ANOVA (equation 1) in the population. Reason: we want b to be an unbiased estimator of β, and the covariance between b and ε to disappear.
For reasons I discussed earlier, the assumptions of homoscedasticity and non-serial correlation are necessary for the estimation of V(ε_i). The normality assumption, that ε_i follows a normal distribution, is needed for small samples.

X. The Concept of Increment

Every time you put one more independent variable into your model, you get an increase in R². We sometimes call the increase the "incremental R²." What it means is that more variance is explained, or SSR is increased and SSE is reduced. What you should understand is that the incremental R² attributed to a variable is always smaller than the R² for that variable when the other variables are absent.
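The nested-model F-test of Section VIII and the incremental R² just described can be checked numerically. The sketch below is illustrative only: it simulates a small data set (the data-generating process and all names are invented), fits a restricted and an unrestricted model with NumPy, and verifies that the one-restriction F statistic, the F computed from the incremental R², and the squared t statistic of the extra coefficient all coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented data-generating process: x3 is irrelevant, so the restriction
# beta3 = 0 imposed by the restricted model is true in the population.
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=2.0, size=n)

def ols(X, y):
    """Return OLS coefficients, SSE, and residual degrees of freedom."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return b, float(e @ e), X.shape[0] - X.shape[1]

X_r = np.column_stack([np.ones(n), x1, x2])        # restricted model
X_u = np.column_stack([np.ones(n), x1, x2, x3])    # unrestricted model

b_r, sse_r, df_r = ols(X_r, y)
b_u, sse_u, df_u = ols(X_u, y)

# Nested-model F test (Section VIII); here df_r - df_u = 1 restriction.
F = ((sse_r - sse_u) / (df_r - df_u)) / (sse_u / df_u)

# The same F computed from the incremental R^2 (Sections VIII and X).
sst = float(np.sum((y - y.mean()) ** 2))
r2_r, r2_u = 1 - sse_r / sst, 1 - sse_u / sst
F_from_r2 = ((r2_u - r2_r) / (df_r - df_u)) / ((1 - r2_u) / df_u)

# t statistic for b3 in the unrestricted model: t^2 equals F for a 1-df test.
mse_u = sse_u / df_u
se_b3 = np.sqrt(mse_u * np.linalg.inv(X_u.T @ X_u)[3, 3])
t = b_u[3] / se_b3

print(F, F_from_r2, t ** 2)   # all three agree up to floating-point error
```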
XI. Consequences of Omitting Relevant Independent Variables

Say the true model is the following:
    y_i = β0 + β1 x1i + β2 x2i + β3 x3i + ε_i.
But for some reason we only collect or consider data on y, x1, and x2. Therefore, we omit x3 from the regression. We briefly discussed this problem before. The short story is that we are likely to have a bias due to the omission of a relevant variable from the model. This is so even though our primary interest is to estimate the effect of x1 or x2 on y. Why? We will have a formal presentation of this problem.

XII. Measures of Goodness-of-Fit

There are different ways to assess the goodness-of-fit of a model.

A. R²
R² is a heuristic measure of the overall goodness-of-fit. It does not have an associated test statistic. R² measures the proportion of the variance in the dependent variable that is "explained" by the model:
    R² = SSR/SST = SSR/(SSR + SSE).

B. Model F-test
The model F-test tests the joint hypothesis that all the model coefficients except for the constant term are zero. Degrees of freedom associated with the model F-test: numerator, p − 1; denominator, n − p.

C. t-tests for individual parameters
A t-test for an individual parameter tests the hypothesis that a particular coefficient is equal to a particular number (commonly zero):
    t_k = (b_k − β_k0)/SE_k,
where SE_k is the square root of the (k, k) element of MSE·(X'X)⁻¹, with degrees of freedom = n − p.

D. Incremental R²
Relative to a restricted model, the gain in R² for the unrestricted model: ΔR² = R_u² − R_r².

E. F-tests for Nested Models
This is the most general form of F-tests and t-tests:
    F(df_r − df_u, df_u) = [(SSE_r − SSE_u)/(df(SSE_r) − df(SSE_u))] / [SSE_u/df_u].
It is equal to a t-test if the unrestricted and restricted models differ by only one parameter. It is equal to the model F-test if we set the restricted model to the constant-only model.
[Ask students] What are SST, SSE, and SSR, and their associated degrees of freedom, for the constant-only model?
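Before turning to the numerical example, here is a small simulation sketch of the omitted-variable problem raised in Section XI (the data-generating process is invented for illustration): when a relevant variable x3 that is correlated with x1 is left out, the estimated coefficient on x1 drifts away from its true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Invented true model: y = 1 + 1*x1 + 2*x3 + e, with x3 correlated with x1.
x1 = rng.normal(size=n)
x3 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 + 1.0 * x1 + 2.0 * x3 + rng.normal(size=n)

def coef(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

full = coef(np.column_stack([np.ones(n), x1, x3]), y)   # b1 close to the true 1.0
short = coef(np.column_stack([np.ones(n), x1]), y)      # b1 close to 1 + 2*0.8 = 2.6 (biased)

print(full[1], short[1])
```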
Numerical Example

A sociological study is interested in understanding the social determinants of mathematical achievement among high school students. You are now asked to answer a series of questions. The data are real but have been tailored for educational purposes. The total number of observations is 400. The variables are defined as:
    y:  math score
    x1: father's education
    x2: mother's education
    x3: family's socioeconomic status
    x4: number of siblings
    x5: class rank
    x6: parents' total education (note: x6 = x1 + x2)

For the following regression models, we know:

Table 1
Model                        SST     SSR     SSE     DF    R²
(1) y on (1 x1 x2 x3 x4)     34863   4201
(2) y on (1 x6 x3 x4)        34863                   396   .1065
(3) y on (1 x6 x3 x4 x5)     34863   10426   24437   395   .2991
(4) x5 on (1 x6 x3 x4)                       269753  396   .0210

1. Please fill in the missing cells in Table 1.
2. Test the hypothesis that the effects of father's education (x1) and mother's education (x2) on math score are the same after controlling for x3 and x4.
3. Test the hypothesis that x6, x3, and x4 in Model (2) all have zero effects on y.
4. Can we add x6 to Model (1)? Briefly explain your answer.
5. Test the hypothesis that the effect of class rank (x5) on math score is zero after controlling for x6, x3, and x4.

Answers:

1.
Model                        SST     SSR     SSE     DF    R²
(1) y on (1 x1 x2 x3 x4)     34863   4201    30662   395   .1205
(2) y on (1 x6 x3 x4)        34863   3713    31150   396   .1065
(3) y on (1 x6 x3 x4 x5)     34863   10426   24437   395   .2991
(4) x5 on (1 x6 x3 x4)       275539  5786    269753  396   .0210
Note that the SST for Model (4) is different from those for Models (1) through (3).

2. The restricted model is y = b0 + b1(x1 + x2) + b3 x3 + b4 x4 + e.
The unrestricted model is y = b0' + b1' x1 + b2' x2 + b3' x3 + b4' x4 + e'.
    F(1, 395) = [(31150 − 30662)/1] / [30662/395] = 488/77.63 = 6.29

3.
    F(3, 396) = [3713/3] / [31150/396] = 1237.67/78.66 = 15.73

4. No. x6 is a linear combination of x1 and x2, so X'X would be singular.

5.
    F(1, 395) = [(31150 − 24437)/1] / [24437/395] = 6713/61.87 = 108.50
    t = √108.50 = 10.42
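As a check on the arithmetic, the following sketch recomputes the three F statistics above directly from the sums of squares in the completed Table 1; it is a verification aid only.

```python
# SSE and DF values from the completed Table 1
sse1, df1 = 30662, 395   # Model (1): y on (1 x1 x2 x3 x4)
sse2, df2 = 31150, 396   # Model (2): y on (1 x6 x3 x4)
sse3, df3 = 24437, 395   # Model (3): y on (1 x6 x3 x4 x5)
ssr2 = 3713

def nested_f(sse_r, df_r, sse_u, df_u):
    """Nested-model F statistic from Section VIII / XII.E."""
    return ((sse_r - sse_u) / (df_r - df_u)) / (sse_u / df_u)

f_q2 = nested_f(sse2, df2, sse1, df1)    # ~6.29  (question 2)
f_q3 = (ssr2 / 3) / (sse2 / df2)         # ~15.73 (question 3, model F-test)
f_q5 = nested_f(sse2, df2, sse3, df3)    # ~108.5 (question 5); t = sqrt(F) ~ 10.4

print(f_q2, f_q3, f_q5, f_q5 ** 0.5)
```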
IndexAactuation layer, 132average brightness,102-103adaptive control, 43Badaptive cruise control, 129backpropagation algorithm, 159adaptive FLC, 43backward driving mode,163,166,168-169adaptive neural networks,237adaptive predictive model, 283Baddeley-Molchanov average, 124aerial vehicles, 240 Baddeley-Molchanov fuzzy set average, 120-121, 123aerodynamic forces,209aerodynamics analysis, 208, 220Baddeley-Molchanov mean,118,119-121alternating filter, 117altitude control, 240balance position, 98amplitude distribution, 177bang-bang controller,198analytical control surface, 179, 185BCFPI, 61-63angular velocity, 92,208bell-shaped waveform,25ARMAX model, 283beta distributions,122artificial neural networks,115Bezier curve, 56, 59, 63-64association, 251Bezier Curve Fuzzy PI controller,61attitude angle,208, 217Bezier function, 54aumann mean,118-120bilinear interpolation, 90, 300,302automated manual transmission,145,157binary classifier,253Bo105 helicopter, 208automatic formation flight control,240body frame,238boiler following mode,280,283automatic thresholding,117border pixels, 101automatic transmissions,145boundary layer, 192-193,195-198autonomous robots,130boundary of a fuzzy set,26autonomous underwater vehicle, 191braking resistance, 265AUTOPIA, 130bumpy control surface, 55autopilot signal, 228Index 326CCAE package software, 315, 318 calibration accuracy, 83, 299-300, 309, 310, 312CARIMA models, 290case-based reasoning, 253center of gravity method, 29-30, 32-33centroid defuzzification, 7 centroid defuzzification, 56 centroid Method, 106 characteristic polygon, 57 characterization, 43, 251, 293 chattering, 6, 84, 191-192, 195, 196, 198chromosomes, 59circuit breaker, 270classical control, 1classical set, 19-23, 25-26, 36, 254 classification, 106, 108, 111, 179, 185, 251-253classification model, 253close formation flight, 237close path tracking, 223-224 clustering, 104, 106, 108, 251-253, 255, 289clustering algorithm, 252 clustering function, 104clutch stroke, 147coarse fuzzy logic controller, 94 collective pitch angle, 209 collision avoidance, 166, 168 collision avoidance system, 160, 167, 169-170, 172collision avoidance system, 168 complement, 20, 23, 45 compressor contamination, 289 conditional independence graph, 259 confidence thresholds, 251 confidence-rated rules, 251coning angle, 210constant gain, 207constant pressure mode, 280 contrast intensification, 104 contrast intensificator operator, 104 control derivatives, 211control gain, 35, 72, 93, 96, 244 control gain factor, 93control gains, 53, 226control rules, 18, 27, 28, 35, 53, 64, 65, 90-91, 93, 207, 228, 230, 262, 302, 304-305, 315, 317control surfaces, 53-55, 64, 69, 73, 77, 193controller actuator faulty, 289 control-weighting matrix, 207 convex sets, 119-120Coordinate Measurement Machine, 301coordinate measuring machine, 96 core of a fuzzy set, 26corner cube retroreflector, 85 correlation-minimum, 243-244cost function, 74-75, 213, 282-283, 287coverage function, 118crisp input, 18, 51, 182crisp output, 7, 34, 41-42, 51, 184, 300, 305-306crisp sets, 19, 21, 23crisp variable, 18-19, 29critical clearing time, 270 crossover, 59crossover probability, 59-60cruise control, 129-130,132-135, 137-139cubic cell, 299, 301-302, 309cubic spline, 48cubic spline interpolation, 300 current time gap, 136custom membership function, 294 customer behav or, 249iDdamping factor, 211data cleaning, 250data integration, 250data mining, 249, 250, 251-255, 259-260data selection, 250data transformation, 250d-dimensional Euclidean space, 117, 124decision logic, 321 
decomposition, 173, 259Index327defuzzification function, 102, 105, 107-108, 111 defuzzifications, 17-18, 29, 34 defuzzifier, 181, 242density function, 122 dependency analysis, 258 dependency structure, 259 dependent loop level, 279depth control, 202-203depth controller, 202detection point, 169deviation, 79, 85, 185-188, 224, 251, 253, 262, 265, 268, 276, 288 dilation, 117discriminated rules, 251 discrimination, 251, 252distance function, 119-121 distance sensor, 167, 171 distribution function, 259domain knowledge, 254-255 domain-specific attributes, 251 Doppler frequency shift, 87 downhill simplex algorithm, 77, 79 downwash, 209drag reduction, 244driver’s intention estimator, 148 dutch roll, 212dynamic braking, 261-262 dynamic fuzzy system, 286, 304 dynamic tracking trajectory, 98Eedge composition, 108edge detection, 108 eigenvalues, 6-7, 212electrical coupling effect, 85, 88 electrical coupling effects, 87 equilibrium point, 207, 216 equivalent control, 194erosion, 117error rates, 96estimation, 34, 53, 119, 251, 283, 295, 302Euler angles, 208evaluation function, 258 evolution, 45, 133, 208, 251 execution layer, 262-266, 277 expert knowledge, 160, 191, 262 expert segmentation, 121-122 extended sup-star composition, 182 Ffault accommodation, 284fault clearing states, 271, 274fault detection, 288-289, 295fault diagnosis, 284fault durations, 271, 274fault isolation, 284, 288fault point, 270-271, 273-274fault tolerant control, 288fault trajectories, 271feature extraction, 256fiber glass hull, 193fin forces, 210final segmentation, 117final threshold, 116fine fuzzy controller, 90finer lookup table, 34finite element method, 318finite impulse responses, 288firing weights, 229fitness function, 59-60, 257flap angles, 209flight aerodynamic model, 247 flight envelope, 207, 214, 217flight path angle, 210flight trajectory, 208, 223footprint of uncertainty, 176, 179 formation geometry, 238, 247 formation trajectory, 246forward driving mode, 163, 167, 169 forward flight control, 217 forward flight speed, 217forward neural network, 288 forward velocity, 208, 214, 217, 219-220forward velocity tracking, 208 fossil power plants, 284-285, 296 four-dimensional synoptic data, 191 four-generator test system, 269 Fourier filter, 133four-quadrant detector, 79, 87, 92, 96foveal avascular zone, 123fundus images, 115, 121, 124 fuselage, 208-210Index 328fuselage axes, 208-209fuselage incidence, 210fuzz-C, 45fuzzifications, 18, 25fuzzifier, 181-182fuzzy ACC controller, 138fuzzy aggregation operator, 293 fuzzy ASICs, 37-38, 50fuzzy binarization algorithm, 110 fuzzy CC controller, 138fuzzy clustering algorithm, 106, 108 fuzzy constraints, 286, 291-292 fuzzy control surface, 54fuzzy damage-mitigating control, 284fuzzy decomposition, 108fuzzy domain, 102, 106fuzzy edge detection, 111fuzzy error interpolation, 300, 302, 305-306, 309, 313fuzzy filter, 104fuzzy gain scheduler, 217-218 fuzzy gain-scheduler, 207-208, 220 fuzzy geometry, 110-111fuzzy I controller, 76fuzzy image processing, 102, 106, 111, 124fuzzy implication rules, 27-28 fuzzy inference system, 17, 25, 27, 35-36, 207-208, 302, 304-306 fuzzy interpolation, 300, 302, 305- 307, 309, 313fuzzy interpolation method, 309 fuzzy interpolation technique, 300, 309, 313fuzzy interval control, 177fuzzy mapping rules, 27fuzzy model following control system, 84fuzzy modeling methods, 255 fuzzy navigation algorithm, 244 fuzzy operators, 104-105, 111 fuzzy P controller, 71, 73fuzzy PD controller, 69fuzzy perimeter, 110-111fuzzy PI controllers, 61fuzzy PID controllers, 53, 64-65, 
80 fuzzy production rules, 315fuzzy reference governor, 285 Fuzzy Robust Controller, 7fuzzy set averages, 116, 124-125 fuzzy sets, 7, 19, 22, 24, 27, 36, 45, 115, 120-121, 124-125, 151, 176-182, 184-188, 192, 228, 262, 265-266fuzzy sliding mode controller, 192, 196-197fuzzy sliding surface, 192fuzzy subsets, 152, 200fuzzy variable boundary layer, 192 fuzzyTECH, 45Ggain margins, 207gain scheduling, 193, 207, 208, 211, 217, 220gas turbines, 279Gaussian membership function, 7 Gaussian waveform, 25 Gaussian-Bell waveforms, 304 gear position decision, 145, 147 gear-operating lever, 147general window function, 105 general-purpose microprocessors, 37-38, 44genetic algorithm, 54, 59, 192, 208, 257-258genetic operators, 59-60genetic-inclined search, 257 geometric modeling, 56gimbal motor, 90, 96global gain-scheduling, 220global linear ARX model, 284 global navigation satellite systems, 141global position system, 224goal seeking behaviour, 186-187 governor valves80, 2HHamiltonian function, 261, 277 hard constraints, 283, 293 heading angle, 226, 228, 230, 239, 240-244, 246heading angle control, 240Index329heading controller, 194, 201-202 heading error rate, 194, 201 heading speed, 226heading velocity control, 240 heat recovery steam generator, 279 hedges, 103-104height method, 29helicopter, 207-212, 214, 217, 220 helicopter control matrix, 211 helicopter flight control, 207 Heneghan method, 116-117, 121-124heuristic search, 258 hierarchical approaches, 261 hierarchical architecture, 185 hierarchical fuzzy processors, 261 high dimensional systems, 191 high stepping rates, 84hit-miss topology, 119home position, 96horizontal tail plane, 209 horizontal tracker, 90hostile, 223human domain experts, 255 human visual system, 101hybrid system framework, 295 hyperbolic tangent function, 195 hyperplane, 192-193, 196 hysteresis thres olding, 116-117hIIF-THEN rule, 27-28image binarization, 106image complexity, 104image fuzzification function, 111 image segmentation, 124image-expert, 122-123indicator function, 121inert, 223inertia frame, 238inference decision methods, 317 inferential conclusion, 317 inferential decision, 317 injection molding process, 315 inner loop controller, 87integral time absolute error, 54 inter-class similarity, 252 internal dependencies, 169 interpolation property, 203 interpolative nature, 262 intersection, 20, 23-24, 31, 180 interval sets, 178interval type-2 FLC, 181interval type-2 fuzzy sets, 177, 180-181, 184inter-vehicle gap, 135intra-class similarity, 252inverse dynamics control, 228, 230 inverse dynamics method, 227 inverse kinema c, 299tiJ - Kjoin, 180Kalman gain, 213kinematic model, 299kinematic modeling, 299-300 knowledge based gear position decision, 148, 153knowledge reasoning layer, 132 knowledge representation, 250 knowledge-bas d GPD model, 146eLlabyrinths, 169laser interferometer transducer, 83 laser tracker, 301laser tracking system, 53, 63, 65, 75, 78-79, 83-85, 87, 98, 301lateral control, 131, 138lateral cyclic pitch angle, 209 lateral flapping angle, 210 leader, 238-239linear control surface, 55linear fuzzy PI, 61linear hover model, 213linear interpolation, 300-301, 306-307, 309, 313linear interpolation method, 309 linear optimal controller, 207, 217 linear P controller, 73linear state feedback controller, 7 linear structures, 117linear switching line, 198linear time-series models, 283 linguistic variables, 18, 25, 27, 90, 102, 175, 208, 258Index 330load shedding, 261load-following capabilities, 288, 297 loading dock, 159-161, 170, 172 longitudinal control, 130-132 
longitudinal cyclic pitch angle, 209 longitudinal flapping angle, 210 lookup table, 18, 31-35, 40, 44, 46, 47-48, 51, 65, 70, 74, 93, 300, 302, 304-305lower membership functions, 179-180LQ feedback gains, 208LQ linear controller, 208LQ optimal controller, 208LQ regulator, 208L-R fuzzy numbers, 121 Luenburger observer, 6Lyapunov func on, 5, 192, 284tiMMamdani model, 40, 46 Mamdani’s method, 242 Mamdani-type controller, 208 maneuverability, 164, 207, 209, 288 manual transmissions, 145 mapping function, 102, 104 marginal distribution functions, 259 market-basket analysis, 251-252 massive databases, 249matched filtering, 115 mathematical morphology, 117, 127 mating pool, 59-60max member principle, 106max-dot method, 40-41, 46mean distance function, 119mean max membership, 106mean of maximum method, 29 mean set, 118-121measuring beam, 86mechanical coupling effects, 87 mechanical layer, 132median filter, 105meet, 7, 50, 139, 180, 183, 302 membership degree, 39, 257 membership functions, 18, 25, 81 membership mapping processes, 56 miniature acrobatic helicopter, 208 minor steady state errors, 217 mixed-fuzzy controller, 92mobile robot control, 130, 175, 181 mobile robots, 171, 175-176, 183, 187-189model predictive control, 280, 287 model-based control, 224 modeless compensation, 300 modeless robot calibration, 299-301, 312-313modern combined-cycle power plant, 279modular structure, 172mold-design optimization, 323 mold-design process, 323molded part, 318-321, 323 morphological methods, 115motor angular acceleration, 3 motor plant, 3motor speed control, 2moving average filter, 105 multilayer fuzzy logic control, 276 multimachine power system, 262 multivariable control, 280 multivariable fuzzy PID control, 285 multivariable self-tuning controller, 283, 295mutation, 59mutation probability, 59-60mutual interference, 88Nnavigation control, 160neural fuzzy control, 19, 36neural networks, 173, 237, 255, 280, 284, 323neuro-fuzzy control, 237nominal plant, 2-4nonlinear adaptive control, 237non-linear control, 2, 159 nonlinear mapping, 55nonlinear switching curve, 198-199 nonlinear switching function, 200 nonvolatile memory, 44 normalized universe, 266Oobjective function, 59, 74-75, 77, 107, 281-282, 284, 287, 289-291,Index331295obstacle avoidance, 166, 169, 187-188, 223-225, 227, 231 obstacle avoidance behaviour, 187-188obstacle sensor, 224, 228off-line defuzzification, 34off-line fuzzy inference system, 302, 304off-line fuzzy technology, 300off-line lookup tables, 302 offsprings, 59-60on-line dynamic fuzzy inference system, 302online tuning, 203open water trial, 202operating point, 210optical platform, 92optimal control table, 300optimal feedback gain, 208, 215-216 optimal gains, 207original domain, 102outer loop controller, 85, 87outlier analysis, 251, 253output control gains, 92 overshoot, 3-4, 6-7, 60-61, 75-76, 94, 96, 193, 229, 266Ppath tracking, 223, 232-234 pattern evaluation, 250pattern vector, 150-151PD controller, 4, 54-55, 68-69, 71, 74, 76-77, 79, 134, 163, 165, 202 perception domain, 102 performance index, 60, 207 perturbed plants, 3, 7phase margins, 207phase-plan mapping fuzzy control, 19photovoltaic power systems, 261 phugoid mode, 212PID, 1-4, 8, 13, 19, 53, 61, 64-65, 74, 80, 84-85, 87-90, 92-98, 192 PID-fuzzy control, 19piecewise nonlinear surface, 193 pitch angle, 202, 209, 217pitch controller, 193, 201-202 pitch error, 193, 201pitch error rate, 193, 201pitch subsidence, 212planetary gearbox, 145point-in-time transaction, 252 polarizing beam-splitter, 86 poles, 4, 94, 96position sensor 
detectors, 84 positive definite matrix, 213post fault, 268, 270post-fault trajectory, 273pre-defined membership functions, 302prediction, 251, 258, 281-283, 287, 290predictive control, 280, 282-287, 290-291, 293-297predictive supervisory controller, 284preview distance control, 129 principal regulation level, 279 probabilistic reasoning approach, 259probability space, 118Problem understanding phases, 254 production rules, 316pursuer car, 136, 138-140 pursuer vehicle, 136, 138, 140Qquadrant detector, 79, 92 quadrant photo detector, 85 quadratic optimal technology, 208 quadrilateral ob tacle, 231sRradial basis function, 284 random closed set, 118random compact set, 118-120 rapid environment assessment, 191 reference beam, 86relative frame, 240relay control, 195release distance, 169residual forces, 217retinal vessel detection, 115, 117 RGB band, 115Riccati equation, 207, 213-214Index 332rise time, 3, 54, 60-61, 75-76road-environment estimator, 148 robot kinematics, 299robot workspace, 299-302, 309 robust control, 2, 84, 280robust controller, 2, 8, 90robust fuzzy controller, 2, 7 robustness property, 5, 203roll subsidence, 212rotor blade flap angle, 209rotor blades, 210rudder, 193, 201rule base size, 191, 199-200rule output function, 191, 193, 198-199, 203Runge-Kutta m thod, 61eSsampling period, 96saturation function, 195, 199 saturation functions, 162scaling factor, 54, 72-73scaling gains, 67, 69S-curve waveform, 25secondary membership function, 178 secondary memberships, 179, 181 selection, 59self-learning neural network, 159 self-organizing fuzzy control, 261 self-tuning adaptive control, 280 self-tuning control, 191semi-positive definite matrix, 213 sensitivity indices, 177sequence-based analysis, 251-252 sequential quadratic programming, 283, 292sets type-reduction, 184setting time, 54, 60-61settling time, 75-76, 94, 96SGA, 59shift points, 152shift schedule algorithms, 148shift schedules, 152, 156shifting control, 145, 147shifting schedules, 146, 152shift-schedule tables, 152sideslip angle, 210sigmoidal waveform, 25 sign function, 195, 199simplex optimal algorithm, 80 single gimbal system, 96single point mass obstacle, 223 singleton fuzzification, 181-182 sinusoidal waveform, 94, 300, 309 sliding function, 192sliding mode control, 1-2, 4, 8, 191, 193, 195-196, 203sliding mode fuzzy controller, 193, 198-200sliding mode fuzzy heading controller, 201sliding pressure control, 280 sliding region, 192, 201sliding surface, 5-6, 192-193, 195-198, 200sliding-mode fuzzy control, 19 soft constraints, 281, 287space-gap, 135special-purpose processors, 48 spectral mapping theorem, 216 speed adaptation, 138speed control, 2, 84, 130-131, 133, 160spiral subsidence, 212sporadic alternations, 257state feedback controller, 213 state transition, 167-169state transition matrix, 216state-weighting matrix, 207static fuzzy logic controller, 43 static MIMO system, 243steady state error, 4, 54, 79, 90, 94, 96, 98, 192steam turbine, 279steam valving, 261step response, 4, 7, 53, 76, 91, 193, 219stern plane, 193, 201sup operation, 183supervisory control, 191, 280, 289 supervisory layer, 262-264, 277 support function, 118support of a fuzzy set, 26sup-star composition, 182-183 surviving solutions, 257Index333swing curves, 271, 274-275 switching band, 198switching curve, 198, 200 switching function, 191, 194, 196-198, 200switching variable, 228system trajector192, 195y,Ttail plane, 210tail rotor, 209-210tail rotor derivation, 210Takagi-Sugeno fuzzy methodology, 287target displacement, 87target time gap, 136t-conorm 
maximum, 132 thermocouple sensor fault, 289 thickness variable, 319-320three-beam laser tracker, 85three-gimbal system, 96throttle pressure, 134throttle-opening degree, 149 thyristor control, 261time delay, 63, 75, 91, 93-94, 281 time optimal robust control, 203 time-gap, 135-137, 139-140time-gap derivative, 136time-gap error, 136time-invariant fuzzy system, 215t-norm minimum, 132torque converter, 145tracking error, 79, 84-85, 92, 244 tracking gimbals, 87tracking mirror, 85, 87tracking performance, 84-85, 88, 90, 192tracking speed, 75, 79, 83-84, 88, 90, 92, 97, 287trajectory mapping unit, 161, 172 transfer function, 2-5, 61-63 transient response, 92, 193 transient stability, 261, 268, 270, 275-276transient stability control, 268 trapezoidal waveform, 25 triangular fuzzy set, 319triangular waveform, 25 trim, 208, 210-211, 213, 217, 220, 237trimmed points, 210TS fuzzy gain scheduler, 217TS fuzzy model, 207, 290TS fuzzy system, 208, 215, 217, 220 TS gain scheduler, 217TS model, 207, 287TSK model, 40-41, 46TS-type controller, 208tuning function, 70, 72turbine following mode, 280, 283 turn rate, 210turning rate regulation, 208, 214, 217two-DOF mirror gimbals, 87two-layered FLC, 231two-level hierarchy controllers, 275-276two-module fuzzy logic control, 238 type-0 systems, 192type-1 FLC, 176-177, 181-182, 185- 188type-1 fuzzy sets, 177-179, 181, 185, 187type-1 membership functions, 176, 179, 183type-2 FLC, 176-177, 180-183, 185-189type-2 fuzzy set, 176-180type-2 interval consequent sets, 184 type-2 membership function, 176-178type-reduced set, 181, 183-185type-reduction,83-1841UUH-1H helicopter, 208uncertain poles, 94, 96uncertain system, 93-94, 96 uncertain zeros, 94, 96underlying domain, 259union, 20, 23-24, 30, 177, 180unit control level, 279universe of discourse, 19-24, 42, 57, 151, 153, 305unmanned aerial vehicles, 223 unmanned helicopter, 208Index 334unstructured dynamic environments, 177unstructured environments, 175-177, 179, 185, 187, 189upper membership function, 179Vvalve outlet pressure, 280vapor pressure, 280variable structure controller, 194, 204velocity feedback, 87vertical fin, 209vertical tracker, 90vertical tracking gimbal, 91vessel detection, 115, 121-122, 124-125vessel networks, 117vessel segmentation, 115, 120 vessel tracking algorithms, 115 vision-driven robotics, 87Vorob’ev fuzzy set average, 121-123 Vorob'ev mean, 118-120vortex, 237 WWang and Mendel’s algorithm, 257 WARP, 49weak link, 270, 273weighing factor, 305weighting coefficients, 75 weighting function, 213weld line, 315, 318-323western states coordinating council, 269Westinghouse turbine-generator, 283 wind–diesel power systems, 261 Wingman, 237-240, 246wingman aircraft, 238-239 wingman veloc y, 239itY-ZYager operator, 292Zana-Klein membership function, 124Zana-Klein method, 116-117, 121, 123-124zeros, 94, 96µ-law function, 54µ-law tuning method, 54。
Data Min Knowl Disc
DOI 10.1007/s10618-014-0372-z

Leveraging the power of local spatial autocorrelation in geophysical interpolative clustering

Annalisa Appice · Donato Malerba

Received: 16 December 2012 / Accepted: 22 June 2014
© The Author(s) 2014

Abstract  Nowadays ubiquitous sensor stations are deployed worldwide, in order to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every space location. On the other hand, due to their huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, summarization, i.e. the computation of aggregates of data, can be used to reduce the amount of produced data stored on disk, while interpolation, i.e. the estimation of unknown data at each location of interest, can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both these tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model, in order to summarize geophysical data, and computes a weighted linear combination of cluster prototypes, in order to predict data. Clustering is done by accounting for the local presence of the spatial autocorrelation property in the geophysical data. Weights of the linear combination are defined, in order to reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of the clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and predictive accuracy of the presented interpolative clustering algorithm.

Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.

A. Appice · D. Malerba
Dipartimento di Informatica, Università degli Studi di Bari "Aldo Moro", Via Orabona 4, 70125 Bari, Italy
e-mail: annalisa.appice@uniba.it
D. Malerba
e-mail: donato.malerba@uniba.it

Keywords  Spatial autocorrelation · Clustering · Inverse distance weighting · Geophysical data stream

1 Introduction

The widespread use of sensor networks has paved the way for the explosive ubiquity of geophysical data streams (i.e. streams of data that are measured repeatedly over a set of locations). Procedurally, remote sensors are installed worldwide. They gather information along a number of variables over large zones and long (potentially unbounded) periods of time. In this scenario, the spatial distribution of data sources, as well as the temporal distribution of measures, pose new challenges in the collection and querying of data. Much scientific and industrial interest has recently been focused on the deployment of data management systems that gather continuous multivariate data from several data sources, recognize and possibly adapt a behavioral model, and deal with queries that concern present and past data, as well as seen and unseen data. This poses specific issues that include storing entire (unbounded) data on disks with finite memory (Chiky and Hébrail 2008), as well as looking for predictions (estimations) where no measured data are available (Li and Heap 2008).
Summarization is one solution for addressing storage limits, while interpolation is one solution for supplementing unseen data. So far, both these tasks, namely summarization and interpolation, have been extensively
investigated in the literature. However, most of the studies consider only one task at a time. Several summarization algorithms, e.g. Rodrigues et al. (2008), Chen et al. (2010), Appice et al. (2013a), have been defined in data mining, in order to compute fast, compact summaries of geophysical data as they arrive. Data storage systems store the computed summaries, while discarding the actual data. Various interpolation algorithms, e.g. Shepard (1968b), Krige (1951), have been defined in geostatistics, in order to predict unseen measures of a geophysical variable. They use predictive inferences based upon actual measures sampled at specific locations of space. In this paper, we investigate a holistic approach that links predictive inferences to data summaries. Therefore, we introduce a summarization pattern of geophysical data, which can be computed to save memory space and make predictive inferences easier. We use predictive inferences that exploit the knowledge in data summaries, in order to yield accurate predictions covering any (seen and unseen) space location.
We begin by observing that the common factor of several summarization and interpolation algorithms is that they accommodate the spatial autocorrelation analysis in the learned model. Spatial autocorrelation is the cross-correlation of values of a variable strictly due to their relatively close locations on a two-dimensional surface. Spatial autocorrelation exists when there is systematic spatial variation in the values of a given property. This variation can exist in two forms, called positive and negative spatial autocorrelation (Legendre 1993). In the positive case, the value of a variable at a given location tends to be similar to the values of that variable in nearby locations. This means that if the value of some variable is low at a given location, the presence of spatial autocorrelation indicates that nearby values are also low. Conversely, negative spatial autocorrelation is characterized by dissimilar values at nearby locations.
Goodchild (1986) remarks that positive autocorrelation is seen much more frequently in practice than negative autocorrelation in geophysical variables. This is justified by Tobler's first law of geography, according to which "everything is related to everything else, but near things are more related than distant things" (Tobler 1979).
As observed by LeSage and Pace (2001), the analysis of spatial autocorrelation is crucial and can be fundamental for building a reliable spatial component into any (statistical) model for geophysical data. With the same viewpoint, we propose to: (i) model the property of spatial autocorrelation when collecting the data records of a number of geophysical variables, (ii) use this model to compute compact summaries of actual data that are discarded, and (iii) inject the computed summaries into predictive inferences, in order to yield accurate estimations of geophysical data at any space location.
The paper is organized as follows. The next section clarifies the motivation and the actual contribution of this paper. In Sect. 3, related works regarding spatial autocorrelation, spatial interpolators and clustering are reported. In Sect. 4, we report the basics of the presented algorithm, while in Sect. 5, we describe the proposed algorithm. An experimental study with several data collections is presented in Sect. 6 and conclusions are drawn.

2 Motivation and contributions

The analysis of the property of spatial autocorrelation in geophysical data poses specific issues. One issue is that most of the models that represent and learn data with spatial autocorrelation are based on the assumption of spatial stationarity. Thus, they assume a constant mean and a constant variance (no outliers) across space (Stojanova et al. 2012). This means that possible significant variabilities in autocorrelation dependencies throughout the space are overlooked. The variability could be caused by a different underlying latent structure of the space, which varies among its portions in terms of scale of values or density of measures. As pointed out by Angin and Neville (2008), when autocorrelation varies significantly throughout space, it may be more accurate to model the dependencies locally rather than globally.
Another issue is that the spatial autocorrelation analysis is frequently decoupled from the multivariate analysis. In this case, the learning process accounts for the spatial autocorrelation of univariate data, while dealing with distinct variables separately (Appice et al. 2013b). Bailey and Krzanowski (2012) observe that ignoring complex interactions among multiple variables may overlook interesting insights into the correlation of potentially related variables at any site. Based upon this idea, Dray and Jombart (2011) formulate a multivariate definition of the concept of spatial autocorrelation, which centers on the extent to which values for a number of variables observed at a given location show a systematic (more than likely under spatial randomness), homogeneous association with values observed at the "neighboring" locations.
In this paper, we develop an approach to modeling non-stationary spatial autocorrelation of multivariate geophysical data by using interpolative clustering. As in clustering, clusters of records that are similar to each other at nearby locations are identified, but a cluster description and a predictive model are associated to each cluster. Data records are aggregated through clusters based on the cluster descriptions.
The associated predictive models, which provide predictions for the variables, are stored as summaries of the clustered data. On any future demand, predictive models queried from databases are processed according to the requests, in order to yield accurate estimates for the variables. Interpolative clustering is a form of conceptual clustering (Michalski and Stepp 1983) since, besides the clusters themselves, it also provides symbolic descriptions (in the form of conjunctions of conditions) of the constructed clusters. Thus, we can also plan to consider this description, in order to obtain clusters in different contexts of the same domain. However, in contrast to conceptual clustering, interpolative clustering is a form of supervised learning. On the other hand, interpolative clustering is similar to predictive clustering (Blockeel et al. 1998), since it is a form of supervised learning. However, unlike predictive clustering, where the predictive space (target variables) is typically distinguished from the descriptive one (explanatory variables),¹ the variables of interpolative clustering play, in principle, both target and explanatory roles.
[Footnote 1: The predictive clustering framework is originally defined in Blockeel et al. (1998), in order to combine clustering problems and classification/regression problems. The predictive inference is performed by distinguishing between target variables and explanatory variables. Target variables are considered when evaluating the similarity between training data, such that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are grouped in separate clusters. Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run by considering the same set of variables for both explanatory and target roles, this case is not investigated in the original study.]
Interpolative clustering trees (ICTs) are a class of tree-structured models where a split node is associated with a cluster and a leaf node with a single predictive model for the target variables of interest. The top node of the ICT contains the entire sample of training records. This cluster is recursively partitioned along the target variables into smaller sub-clusters. A predictive model (the mean) is computed for each target variable and then associated with each leaf. All the variables are predicted independently. In the context of this paper, an ICT is built by integrating the consideration of a local indicator of spatial autocorrelation, in order to account for significant variabilities of the autocorrelation dependencies in training data. Spatial autocorrelation is coupled with a multivariate analysis by accounting for the spatial dependence of data and their multivariate variance simultaneously (Dray and Jombart 2011). This is done by maximizing the variance reduction of local indicators of spatial autocorrelation computed for multivariate data when evaluating the candidates for adding a new node to the tree. This solution has the merit of improving both the summarization and the predictive performance of the computed models. Memory space is saved by storing a single summary for data of multiple variables. Predictive accuracy is increased by exploiting the autocorrelation of data clustered in space.
From the summarization perspective, an ICT is used to summarize geophysical data according to a hierarchical view of the spatial autocorrelation. We can browse the generated clusters at different levels of the hierarchy. Predictive models on leaf clusters model spatial autocorrelation dependencies as stationary over the local geometry of the clusters. From the interpolation perspective, an ICT is used to compute knowledge that makes accurate predictive inferences easier.
We can use Inverse Distance Weighting² (Shepard 1968b) to predict variables at a specific location by a weighted linear combination of the predictive models on the leaf clusters. Weights are inverse functions of the distance of the query point from the clusters.
Finally, we can observe that an ICT provides a static model of a geophysical phenomenon. Nevertheless, inferences based on static models of spatial autocorrelation require temporal stationarity of the statistical properties of the variables. In the geophysical context, data are frequently subject to the temporal variation of such properties. This requires dynamic models that can be updated continuously as fresh data arrive (Gama 2010). In this paper, we propose an incremental algorithm for the construction of a time-adaptive ICT. When a new sample of records is acquired through the stations of a sensor network, a past ICT is modified, in order to model new data of the process, which may change their properties over time. In theory, a distinct tree can be learned for each time point and several trees can subsequently be combined using some general framework (e.g. Spiliopoulou et al. 2006) for tracking cluster evolution over time. However, this solution is prohibitively time-consuming when data arrive at a high rate. By taking into account that (1) geophysical variables are often slowly time-varying, and (2) a change of the properties of the data distribution of variables is often restricted to a delimited group of stations, more efficient learning algorithms can be derived. In this paper, we propose an algorithm that retains past clusters as long as they discriminate between surfaces of spatially autocorrelated data, while it mines novel clusters only if the latest become inaccurate. In this way, we can save computation time and track the evolution of the cluster model by detecting changes in the data properties at no additional cost.
The specific contributions in this paper are: (1) the investigation of the property of spatial autocorrelation in interpolative clustering; (2) the development of an approach that uses a local indicator of the spatial autocorrelation property, in order to build an ICT by taking into account non-stationarity in autocorrelation and multivariate analysis; (3) the development of an incremental algorithm to yield a time-evolving ICT that accounts for the fact that the statistical properties of the geophysical data may change over time; and (4) an extensive evaluation of the effectiveness of the proposed (incremental) approach on several real geophysical data sets.

3 Related works

This work has been motivated by the research literature on the property of spatial autocorrelation and its influence on interpolation theory and (predictive) clustering.
In the following subsections, we report related works from these research lines.
[Footnote 2: Inverse distance weighting is a common interpolation algorithm. It has several advantages that endorse its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): simplicity of implementation; lack of tunable parameters; ability to interpolate scattered data and work on any grid without suffering from multicollinearity.]

3.1 Spatial autocorrelation

Spatial autocorrelation analysis quantifies the degree of spatial clustering (positive autocorrelation) or dispersion (negative autocorrelation) in the values of a variable measured over a set of point locations. In the traditional approach to spatial autocorrelation, the "overall pattern" of spatial dependence in data is summarized into a single indicator, such as the familiar Global Moran's I, Global Geary's C or Gamma indicators of spatial association (see Legendre 1993 for details). These indicators allow us to establish whether nearby locations tend to have similar values (spatial clustering), random values, or different values (spatial dispersion) of a variable over the entire sample. They are referred to as global indicators of spatial autocorrelation, in contrast to the local indicators that we will consider in this paper. Procedurally, global indicators (see Getis 2008) are computed in order to indicate the degree of regional clustering over the entire distribution of a geophysical variable. Stojanova et al. (2013), as well as Appice et al. (2012), have recently investigated the addition of these indicators to predictive inferences. They show how global measures of spatial autocorrelation can be computed at multiple levels of a hierarchical clustering, in order to improve predictions that are consistently clustered in space. Despite the useful results of these studies, a limitation of global indicators of spatial autocorrelation is that they assume spatial stationarity across space (Angin and Neville 2008). While this may be useful when spatial associations are studied for the small areas associated with the bottom levels of a hierarchical model, it is not very meaningful, or may even be highly misleading, in the analysis of spatial associations for large areas associated with the top levels of a hierarchical model.
Local indicators offer an alternative to global modeling by looking for "local patterns" of spatial dependence within the study region (see Boots 2002 for a survey). Unlike global indicators, which return one value for the entire sample of data, local indicators return one value for each sampled location of a variable; this value expresses the degree to which that location is part of a cluster. Widely speaking, a local indicator of spatial autocorrelation allows us to discover deviations from global patterns of spatial association, as well as hot spots like local clusters or local outliers. Several local indicators of spatial autocorrelation are formulated in the literature.
Anselin's local Moran's I (Anselin 1995) is a local indicator of spatial autocorrelation that has gained wide acceptance in the literature. It is formulated as follows:

    I(i) = [(z(i) − z̄) / m₂] · Σ_{j=1, j≠i}^{n} λ(ij) (z(j) − z̄),    (1)

where z(i) is the value of a variable Z measured at the location i, z̄ = (1/n) Σ_{i=1}^{n} z(i) is the mean of the data measured for Z, λ(ij) is a spatial (Gaussian or bi-square) weight between the locations i and j over a neighborhood structure of the training data, and m₂ = (1/n) Σ_j (z(j) − z̄)² is the second moment. Anselin's local Moran's I is related to the global Moran I, as the average of I(i) is equal to the global I, up to a factor of proportionality. A positive value for I(i) indicates that z(i) is surrounded by similar values; therefore, this value is part of a cluster. A negative value for I(i) indicates that z(i) is surrounded by dissimilar values; thus, this value is an outlier.
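A minimal NumPy sketch of Eq. (1) is given below as a reading aid; it is not the authors' code. The spatial weights λ(ij) are assumed to be supplied by the caller (e.g. Gaussian or bi-square weights over a chosen neighborhood structure, as mentioned above), with a zero diagonal so that the term j = i is excluded. The toy station configuration and contiguity weights are invented for illustration.

```python
import numpy as np

def local_moran_i(z, weights):
    """Anselin's local Moran's I, one value per location (Eq. 1).

    z       : 1-D array of measurements of a variable Z.
    weights : (n, n) spatial weight matrix with weights[i, i] = 0.
    """
    z = np.asarray(z, dtype=float)
    dev = z - z.mean()                 # z(i) - z_bar
    m2 = np.mean(dev ** 2)             # second moment
    return (dev / m2) * (weights @ dev)

# Toy usage: four invented stations with row-standardized binary contiguity
# weights (any Gaussian or bi-square scheme could be used instead).
z = np.array([2.0, 2.1, 1.9, 8.0])
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W /= W.sum(axis=1, keepdims=True)
print(local_moran_i(z, W))   # positive for the clustered low values, negative for the outlier
```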
The standardized Getis and Ord local GI* (Getis and Ord 1992) is a local indicator of spatial autocorrelation that is formulated as follows:

    GI*(i) = [ Σ_{j=1, j≠i}^{n} λ(ij) z(j) − z̄ Λ(i) ] / [ S · √( (n Σ_{j=1, j≠i}^{n} λ(ij)² − Λ(i)²) / (n − 1) ) ],    (2)

where Λ(i) = Σ_{j=1, j≠i}^{n} λ(ij) and S² = (1/n) Σ_{j=1}^{n} (z(j) − z̄)². A positive value for GI*(i) indicates clusters of high values around i, while a negative value for GI*(i) indicates clusters of low values around i. The interpretation of GI* is different from that of I: the former distinguishes clusters of high and low values, but does not capture the presence of negative spatial autocorrelation (dispersion); the latter is able to detect both positive and negative spatial autocorrelation, but does not distinguish clusters of high or low values. Getis and Ord (1992) recommend computing GI* to look for spatial clusters and I to detect spatial outliers. By following this advice, Holden and Evans (2010) apply fuzzy C-means to GI* values, in order to cluster satellite-inferred burn severity classes. Scrucca (2005) and Appice et al. (2013c) use k-means to cluster GI* values, to compute spatially aware clusters of the data measured for a geophysical variable.
Measures of both global and local spatial autocorrelation are principally defined for univariate data. However, the integration of multivariate and autocorrelation information has recently been advocated by Dray and Jombart (2011). The simplest approach considers a two-step procedure, where data are first summarized with a multivariate analysis such as PCA. In a second step, any univariate (either global or local) spatial measure can be applied to the PCA scores for each axis separately. The other approach finds coefficients to obtain a linear combination of variables, which maximizes the product between the variance and the global Moran measure of the scores. Alternatively, Stojanova et al. (2013) propose computing the mean of global measures (Moran I and global Getis C), computed for distinct variables of a vector, as a global indicator of spatial autocorrelation of the vector, by blurring cross-correlations between separate variables. Dray et al. (2006) explore the theory of the principal coordinates of neighbor matrices and develop the framework of Moran's eigenvector maps. They demonstrate that their framework can be linked to spatial autocorrelation structure functions also in multivariate domains. Blanchet et al. (2008) expand this framework by taking into account asymmetric directional spatial processes.
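To close this subsection, the local GI* of Eq. (2) can be sketched in the same style and under the same assumptions as the previous snippet (a caller-supplied weight matrix with zero diagonal); again, this is an illustration rather than the authors' implementation, and it can be applied to the same z and W as above.

```python
import numpy as np

def local_getis_ord_g_star(z, weights):
    """Standardized Getis and Ord local GI* of Eq. (2), one value per location."""
    z = np.asarray(z, dtype=float)
    n = z.size
    z_bar = z.mean()
    s = np.sqrt(np.mean((z - z_bar) ** 2))      # S, with S^2 = sum((z - z_bar)^2) / n
    lam_sum = weights.sum(axis=1)               # Lambda(i) = sum_j lambda(ij)
    lam_sq_sum = (weights ** 2).sum(axis=1)     # sum_j lambda(ij)^2
    num = weights @ z - z_bar * lam_sum
    den = s * np.sqrt((n * lam_sq_sum - lam_sum ** 2) / (n - 1))
    return num / den
```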
3.2 Interpolation theory

Studies on spatial interpolation were initially encouraged by the analysis of ore mining, water extraction or pumping, and rock inspection (Cressie 1993). In these fields, interpolation algorithms are required as the main resource to recover unknown information and account for problems like missing data, energy saving and sensor default, as well as to support data summarization and the investigation of spatial correlation between observed data (Lam 1983). The interpolation algorithms estimate a geophysical quantity in any geographic location where the variable measure is not available. The interpolated value is derived by making use of the knowledge of the nearby observed data and, sometimes, of some hypotheses or supplementary information on the data variable. The rationale behind this spatially aware estimate of a variable is the property of positive spatial autocorrelation. Any spatial interpolator accounts for this property, including within its formulation the consideration of a stronger correlation between data which are closer than for those that are farther apart.
Regression (Burrough and McDonnell 1998), Inverse Distance Weighting (IDW) (Shepard 1968a), Radial Basis Functions (RBF) (Lin and Chen 2004) and Kriging (Krige 1951) are the most common interpolation algorithms. These algorithms are studied to deal with the irregular sampling of the investigated area (Isaaks and Srivastava 1989; Stein 1999) or with the difficulty of describing the area by the local atlas of larger and irregular manifolds. Regression algorithms, that are statistical interpolators, determine a functional relationship between the variable to be predicted and the geographic coordinates of the points where the variable is measured. IDW and RBF, which are both deterministic interpolators, use mathematical functions to calculate an unknown variable value in a geographic location, based either on the degree of similarity (IDW) or the degree of smoothing (RBF) in relation to neighboring data points. Both algorithms share with Kriging the idea that the collection of variable observations can be considered as a production of correlated spatial random data with specific statistical properties. In Kriging, specifically, this correlation is used to derive a second-order model of the variable (the variogram). The variogram represents an approximate measure of the spatial dissimilarity of the observed data. The IDW interpolation is based on a linear combination of nearby observations with weights proportional to a power of the distances. It is a heuristic but efficient approach justified by the typical power-law of the spatial correlation. In this sense, IDW accomplishes the same strategy adopted by the more rigorous formulation of Kriging (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011).
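Since IDW recurs throughout this section, a minimal sketch of the basic scheme may be helpful: the estimate at a query point is a weighted linear combination of the observed values, with weights proportional to an inverse power of the distances. The power p = 2, the exact-hit handling, and the toy station layout below are choices made here for illustration, not prescriptions from the paper.

```python
import numpy as np

def idw_predict(query_xy, station_xy, station_z, power=2.0):
    """Inverse Distance Weighting estimate at one query location.

    query_xy   : (2,) coordinates of the unseen location.
    station_xy : (n, 2) coordinates of the stations with observed data.
    station_z  : (n,) observed values of the geophysical variable.
    """
    d = np.linalg.norm(station_xy - query_xy, axis=1)
    if np.any(d == 0):                    # query coincides with a station
        return float(station_z[d == 0][0])
    w = 1.0 / d ** power                  # weights ~ inverse power of distance
    return float(np.sum(w * station_z) / np.sum(w))

# Invented toy usage: three stations and one unseen location
stations = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
values = np.array([10.0, 14.0, 12.0])
print(idw_predict(np.array([0.2, 0.2]), stations, values))
```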
The algorithm proves more efficient than ordinary IDW and,in several cases,also better than Kriging.Recently,Appice et al.(2013b)have started to link IDW to the spatio-temporal knowledge enclosed in specific summarization patterns,called trend clusters.Nevertheless,this investigation is restricted to univariate data.Local spatial autocorrelation and interpolative clusteringThere are several applications(e.g.Li and Revesz2002;Karydas et al.2009;Li et al.2011)where IDW is used.This contributes to highlighting IDW as a determinis-tic,quick and simple interpolation algorithm that yields accurate predictions.On the other hand,Kriging is based on the statistical properties of a variable and,hence,it is expected to be more accurate regarding the general characteristics of the recorded data and the efficacy of the model.In any case,the accuracy of Kriging depends on a reliable estimation of the variogram(Isaaks and Srivastava1989;¸Sen and¸Salhn 2001)and the variogram computation cost is proportional to the cube of the number of observed data(Cressie1990).This cost can be prohibitive in time-evolving applica-tions,where the statistical properties of a variable may change over time.In the Data Mining framework,the change of the underlying properties over time is usually called concept drift(Gama2010).It is noteworthy that the concept drift,expected in dynamic data,can be a serious complication for Kriging.It may impose the repetition of costly computation of the variogram each time the statistical properties of the variable change significantly.These considerations motivate the broad use of an interpolator like IDW that is accurate enough and whose learning phase can be run on-line when data are collected in a stream.3.3Cluster analysisCluster analysis is frequently used in geophysical data interpretation,in order to obtain meaningful and useful results(Song et al.2010).In addition,it can be used as a sum-marization paradigm for data streams,since it underlines the advantage of discovering summaries(clusters)that adjust well to the evolution of data.The seminal work is that of Aggarwal et al.(2007),where a k-means algorithm is tailored to discover micro-clusters from multidimensional transactions that arrive in a stream.Micro-clusters are adjusted each time a transaction arrives,in order to preserve the temporal locality of data along a time horizon.Another clustering algorithm to summarize data streams is presented in Nassar and Sander(2007).The main characteristic of this algorithm is that it allows us to summarize multi-source data streams.The multi-source stream is com-posed of sets of numeric values that are transmitted by a variable number of sources at consecutive time points.Timestamped values are modeled as2D(time-domain)points of a Euclidean space.Hence,the source location is neither represented as a dimension of analysis nor processed as information-bearing.The stream is broken into windows. 
Dense regions of2D points are detected in these windows and represented by means of cluster feature vectors.Although a spatial clustering algorithm is employed,the spatial arrangement of data sources is still neglected.Appice et al.(2013a)describe a clustering algorithm that accounts for the property of spatial autocorrelation when computing clusters that are compact and accurate summaries of univariate geophysical data streams.In all these studies clustering is addressed as an unsupervised task.Predictive clustering is a supervised extension of cluster analysis,which combines elements of prediction and clustering.The task,originally formulated in Blockeel et al.(1998),assumes:(1)a descriptive space of explanatory variables,(2)a predictive space of target variables and(3)a set of training records defined on both descriptive and predictive space.Training records that are similar to each other are clustered and a predictive model is associated to each cluster.This allows us to predict the unknown。
课标1.核心素养:core competences/ key competences2.关键能力:key skills3.必备品格:necessary characters4.新课标:New Curriculum Standard5.语言能力:linguistic ability6.思维品质:thinking ability7.文化品格:cultural character8.学习能力:learning ability9.英语学习活动观:English learning activities outlook10.立德树人:strengthen moral education and cultivate people11.主题语境:Theme context12.语篇类型:Discourse/Text Type13.语言知识:language knowledge14.语言技能:language skills15.学习策略:learning strategy16.文化知识:culture knowledge17.人与自我:man/human and ego18.人与自然:man/human and nature19.人与社会:man/human and society20. 语音、词汇、语法、语篇、语用:phonetics, vocabulary, grammar,discourse, pragmatics.(补充)情感态度affective attitudes认知策略cognitive strategy调控策略/元认知策略m etacognitive strategy交际策略communicative strategy资源策略resource strategy文化意识cultrual awareness综合语言运用能力comprehesive language competence自主学习:autonomous/active/independent learning 合作学习:collaborative/cooperative/group cooperation learning探究学习:inquiry learning跨文化意识:Intercultural / Cross-cultural awareness 教学反思:teaching introspection工具性和人文性:the dual nature of instrumentality and humanism面向全体学生:for all students/be open to all students 衔接cohesion:(语境上的,抽象的)连贯coherence:(语法与词汇,有形的) 语言技能听力教学teaching listening听力策略listening strategy语感 language sense自下而上模式bottom-up model自上而下模式top-down model交互模式interactive model听前 pre-listening听中 while-listening听后 post-listening主旨 gist精听 intensive listening泛听 extensive listening图式 schemata口语教学teaching speaking准确性accuracy流利性fluency得体性appropriacy连贯性coherence控制性活动controlled/mechanical activities半控制性活动semi-controlled/semi-mechanical activities开放活动open/creative activities阅读教学teaching reading自上而下模式top-down model自下而上模式bottom-up model平行阅读parallel reading交互模式interactive model预测 predicting速度 skimming扫读 scanning猜词 word-guessing推断 inferring重述 paraphrasing精读 intensive reading泛读 extensive reading写作教学teaching writing以结果为导向写作product-oriented approach以过程为导向写作process-oriented approach以内容为导向写作content-oriented approach头脑风暴brainstorming思维导图mindmapping列提纲outlining打草稿drafting修改 editing校对 proofreading控制型写作controlled writing指导性写作guided writing表达性写作expressive writing教学实施程序性提问procedral questions展示性问题display questions参阅性问题referential questions评估性问题evaluative questions开放式问题open questions封闭式问题closed questions直接纠错explicit correction间接纠错implicit correction重复法repetition重述法recast强调暗示法pinpointing指导者instructor促进者facilitator控制者controller组织者organizer参与者participant提示者prompter评估者assessor资源提供者resource-provider研究者researcher形成性评价formative assessment终结性评价summative assessment诊断性评价diagnostic assessment成绩测试achievement test水平测试proficiency test潜能测试aptitude test诊断测试diagnostic test结业测试exit test主观性测试subjective test客观性测试objective test直接测试direct test间接测试indirect test常模参照性测试Norm-referenced Test标准参照性测试Criterion-referenced Test 个体参照性测试Individual-referenced Test 信度 Reliability效度 Validity表面效度Face validity结构效度Construct validity内容效度Content validity共时效度Concurrent validity预示效度Predictive validity区分度Discrimination难度 Difficulty教学法语法翻译法grammar-translation method直接法direct method听说法audio-lingual method视听法audio-visual method交际法communicative approach全身反应法total physical response(TPR)任务型教学法task-based teaching approach语言学中英对照导论规定性与描写性Prescriptive[prɪˈskrɪptɪv] and descriptive [dɪˈskrɪptɪv]共时性与历时性Synchronic[sɪŋˈkrɒnɪk] and diachronic [ˌdaɪəˈkrɒnɪk]语言和言语Langue 美[lɑːŋɡ] and parole[pəˈrəʊl]语言能力与语言运用 competence and performance 任意性Arbitrariness[ˈɑːbɪtrərinəs]能产性Productivity[ˌprɒdʌkˈtɪvəti]双重性Duality[djuːˈæləti]移位性Displacement[dɪsˈpleɪsmənt]文化传承性Cultural transmission [trænzˈmɪʃn]语音学(phonetics[fəˈnetɪks])发音语音学articulatory [ɑːˈtɪkjuleɪtəri] phonetics听觉语音学auditory[ˈɔːdətri] phonetics声学语音学acoustic 
[əˈkuːstɪk]phonetics元音vowel [ˈvaʊəl]辅音consonant [ˈkɒnsənənt]发音方式Manner of Articulation闭塞音stops摩擦音fricative [ˈfrɪkətɪv]塞擦音affricate[ˈæfrɪkət]鼻音nasal [ˈneɪzl]边音lateral [ˈlætərəl]流音liquids[ˈlɪkwɪdz]滑音 glides[ɡlaɪdz]发音部位place of articulation双唇音bilabial[ˌbaɪˈleɪbiəl]唇齿音labiodental[ˌleɪbiəʊˈdentl]齿音 dental齿龈音alveolar[ælˈviːələ]硬颚音palatal [ˈpælətl]软腭音velar [ˈviːlə(r)]喉音glottal[ˈɡlɒtl]声带振动vocal [ˈvəʊkl]cord [kɔːd] vibration 清辅音voiceless浊辅音voiced [vɔɪst]音位学(phonology[fəˈnɒlədʒi])音素 phone音位phoneme [ˈfəʊniːm]音位变体allophone[ˈæləfəʊn]音位对立phonemic[fəʊˈniːmɪk]contrast互补分布complementary[ˌkɒmplɪˈmentri] distribution最小对立体minimal pair序列规则sequential[sɪˈkwenʃl] rule同化规则assimilation [əˌsɪməˈleɪʃn]rules省略规则deletion[dɪˈliːʃən] rule重音 stress语调intonation[ˌɪntəˈneɪʃn]升调 rising tone降调 falling tone[təʊn]形态学(morphology[mɔːˈfɒlədʒi])词素morphemes[ˈmɔːfiːmz]同质异形体allomorphs[ˈæləmɔːf]自由词素free morphemes粘着词素bound morphemes屈折词素inflectional[ɪnˈflɛkʃən(ə)l]/ grammatical morphemes派生词素derivational[ˌdɛrɪˈveɪʃən(ə)l]/ lexical morphemes词缀 affix[əˈfɪks , ˈæfɪks]前缀 prefix后缀suffix[ˈsʌfɪks] 语义学(semantics [sɪˈmæntɪks])词汇意义lexical meaning意义 sense概念意conceptual[kənˈseptʃuəl] meaning本意denotative[ˌdiː.nəʊˈteɪtiv] meaning字面意literal meaning内涵意connotative 美[ˈkɑː.nə.teɪ.t̬ɪv] meaning指称reference[ˈrefrəns]同义现象synonymy [sɪˈnɒnɪmi]方言同义词dialectal[ˌdaɪ.əˈlek.təl]synonyms[ˈsɪnənɪmz]文体同义词stylistic [staɪˈlɪstɪk]synonyms搭配同义词collocational[ˌkɒləʊˈkeɪʃən(ə)l] synonyms语义不同的同义词semantically different synonyms多义现象polysemy [pəˈlɪsɪmi]同音异义homonymy同音异义词homophones[ˈhɒməʊfəʊnz]同形异义词homographs完全同音异义词completehomonyms[ˈhɒm.ə.nɪm]下义关系hyponymy美[haɪ'pɒnɪmɪ]上义词superordinate下义词hyponyms['haɪpəʊnɪm]反义词antonymy[æn‘tɒnɪmɪ]等义反义词gradable [ˈɡreɪdəbl] antonyms [ˈæntəʊnɪmz]互补反义词complementary antonyms关系反义词relational opposites蕴含 entailment[ɪnˈteɪl.mənt]预设presuppose[ˌpriːsəˈpəʊz]语义反常semantically anomalous[əˈnɒmələs]语用学(pragmatics[præɡˈmætɪks])言语行为理论speech act theory言内行为locutionary[ləˈkjuː.ʃən.ər.i]言外行为illocutionary[ˌɪl.əˈkjuː.ʃən.ər.i]阐述类representatives[ˌrɛprɪˈzɛntətɪvz]指令类directives [dɪˈrɛktɪvz]承诺类commissives表达类expressives[ɪkˈspresɪvz]宣告类declarations [ˌdɛkləˈreɪʃənz]言后行为perlocutionary直接言语行为direct speech act间接言语行为inderect speech act会话原则principle of conversation合作原则cooperative principle数量准则the maxim [ˈmæksɪmof quantity质量准则the maxim of quality关系准则the maxim of relation方式准则the maxim of manner语言与社会地域方言regional dialect社会方言sociolect [ˈsəʊsiəʊlekt]语域register[ˈredʒɪstə(r)]语场 field of discourse语旨tenor [ˈtenə(r)]of discourse语式 mode of discourse第二语言习得语言习得机制LAD(language acquisition Device 对比分析contrastive[kənˈtrɑː.stɪv] analysis错误分析error analysis语际错误interlingual[ˌɪntəˈlɪŋɡwəl] errors语内错误intralingual [ˌɪntrəˈlɪŋɡwəl] errors过度概括overgeneralization互相联想cross-association中介语interlanguage石化fossilization [ˌfɒs.ɪ.laɪˈzeɪ.ʃən]。
2024 Grade 11 English: analysis of building cooperation mechanisms for global cooperation research — 30 multiple-choice questions

1. International cooperation is crucial for addressing global challenges. The ______ of different countries is essential.
A. efforts  B. organizations  C. cooperations  D. initiatives
Answer: B.
"International cooperation is crucial for addressing global challenges. The organizations of different countries are essential." Option A, "efforts", means effort; option C, "cooperations", repeats the idea of cooperation already in the sentence; option D, "initiatives", means proposals. Given the context, the sentence stresses the organizations of different countries, so B is the answer.

2. Global cooperation requires strong ______ among nations.
A. associations  B. partnerships  C. connections  D. relationships
Answer: B.
"Global cooperation requires strong partnerships among nations." Option A, "associations", means societies or clubs; option C, "connections", means links; option D, "relationships", means relations. "Partnerships" best reflects what global cooperation requires, so B is the answer.

3. The success of global cooperation depends on effective ______.
A. coordinations  B. arrangements  C. organizations  D. plans
Answer: C.
"The success of global cooperation depends on effective organization." Option A, "coordinations", means coordination; option B, "arrangements", means arrangements; option D, "plans", means plans. The sentence stresses the importance of organization, so C is the answer.

4. In global cooperation, ______ play an important role in promoting common development.
A. institutions  B. companies  C. factories  D. schools
Answer: A.
Contemporary Management ResearchPages 159-176, Vol. 6, No. 2, June 2010Determinants of Internal Control WeaknessesJohn J. ChehThe University of AkronE-Mail: cheh@Jang-hyung LeeDaegu UniversityE-Mail: goodljh@daegu.ac.krIl-woon KimThe University of AkronE-Mail: ikim@ABSTRACTThe major purpose of this paper is to find the financial and nonfinancial variables which would be useful in identifying companies with Sarbanes-Oxley Act internal control material weaknesses (MW). Recently, three research papers have been published on this topic based on different variables. Using the decision tree model of data mining, this paper examines the predictive power of the variables used in each paper in identifying MW companies and compares the predictive rates of the three sets of the variables used in these studies. In addition, an attempt is made to find the optimal set of the variables which gives the highest predictive rate. Our results have shown that each set of the variables is complementary to each other and strengthens the prediction accuracy. The findings from this study can provide valuable insights to external auditors in designing a cost-effective and high-quality audit decision support system to comply with the Sarbanes-Oxley Act.Keywords: Sarbanes-Oxley Act, Internal Control Material Weakness, Data MiningContemporary Management Research 160INTRODUCTIONThe Sarbanes-Oxley Act (SOX) of 2002, also known as the Public Company Accounting Reform and Investor Protection Act, was enacted on July 30, 2002 in response to a number of major corporate and accounting scandals of large companies in the U.S., such as Enron, Tyco International, Adelphia, Peregrine Systems, and WorldCom. Section 404 of the Sarbanes-Oxley Act requires management of publicly traded companies to assess their company’s internal control material weakness (MW) and to provide an internal control report as part of their periodic report to stockholders and regulators. The Act requires all public companies to maintain accurate records and an adequate system of internal accounting control. SOX also dramatically increased the penalties for false financial reporting on both management and external auditors.Even though the Act helped improve the quality and transparency of financial reports and give investors more confidence in financial reporting through added focus on internal control, increasing costs of compliance has been a major concern of many companies. A recent survey by Financial Executive International (O’Sullivan, 2006) found that public companies have incurred greater than expected costs to comply with section 404 of SOX: 58% increase in the fees charged by external auditors. It was estimated that U.S. companies would have spent $20 billion by the end of 2006 to comply with the law since it was passed in 2002. AMR Research also estimates that companies are spending about $1 million on SOX compliance for every $1 billion in revenues.The major purpose of this paper is to find the financial and nonfinancial variables which would be useful in predicting Sarbanes-Oxley Act internal control material weaknesses (MW). In addition, an attempt will made using all the variables from these published studies to find an optimal set of the variables which gives the highest predictive rate. 
Once identified, these variables can be used by external auditors as a screening device in identifying companies with MW in internal control, which can help external auditors save the costs of audit significantly.PREVIOUS RESEARCHRecently, three research papers have been published on identifying variables affecting Sarbanes-Oxley Act internal control MW using different variables and different methodologies: Cheh et al. (2006), Doyle et al. (2007) and Ogneva et al. (2007). In the study by Cheh et al., a trial and error approach was used with different combinations of many variables, and 23 financial variables were finally identifiedContemporary Management Research 161 which gave the highest predictive rate for internal control MW. Studies by Doyle et al. and Ogneva et al. used a traditional method of theorizing the situations and bringing theory-based variables into the research. Doyle et al. (2007), in particular, made an attempt to identify the determinants of weaknesses in internal control by examining 779 firms that disclosed material weaknesses from August 2002 to 2005. Eleven variables were used in their study based on two major categories: entity-wide vs. accounting specific. The study by Ogneva et al. (2007) focused on the cost of equity that would be impacted by the deterioration of information quality from the weaknesses in internal control. Ten variables which represented the firm characteristics associated with internal control weaknesses were used in their study.Although three groups of these researchers worked on a similar topic, each research group had slightly different research objectives. The aim of Cheh et al.’s work (2006) was to find data mining rules that facilitate the identification of MW companies. Doyle at al. (2007) strived to find characteristics of MW companies. On the other hand, Ogneva et al. (2007) were more interested in examining the association between cost of capital and internal control weaknesses. The variables used in each study were not necessarily selected to find MW companies, but they were certainly related to MW. Since the variables in each study are quite different from each other, it would be interesting to see if the predictability can be improved using all the variables from the three studies. Eventually, a certain set of the financial and nonfinancial variables may be able to be identified that can be powerful enough, yet most cost effective, in predicting MW companies.Besides the three papers mentioned, several papers examined some issues related to Sarbanes-Oxley Act and internal control. None of them, however, appeared to deal with ex-ante factors that might have determined the weakness in internal control. In general, they have addressed the effects after the disclosure of internal control weakness. For example, Kim et al. (2009) tested how the disclosure of the weakness in internal control would affect the loan contract, and Li et al. (2008) examined what role remediation of internal control weakness would play in internal and external corporate governance. Tang and Xu (2008) investigated how institutional ownership would affect the disclosure of internal control weakness, and Xu and Tang (2008) studied the relationship between financial analysts’ earnings forecasts and internal control weakness. 
Patterson and Smith (2007) did a theoretical investigation on the effects of the Sarbanes-Oxley Act of 2002 on auditing intensity and internal control strength, but did not address the issues on the determinants of internal control weakness.Contemporary Management Research 162RESEARCH METHODOLOGYDecision Trees ModelAs an integral part of knowledge discovery from large databases, data mining is the overall process of converting raw data into useful information. Data mining techniques tend to allow data to speak for themselves rather than coming into the data analysis with any preconceived notion of hypothetical relationships of the variables under study. Hence, data mining has been widely used in business to find useful patterns and trends that might otherwise remain unknown within the business data.Data mining, which is often called analytics, has been widely used not only for business practices but also for academic research. There are numerous conceptual models in data mining, such as decision trees, classification and rules, clustering, instances-based learning, linear regression, and Bayesian networks (Witten and Frank, 2005). Since one of the main objectives of our research was to find tree like rules that would help audit practitioners make informed decisions about MW, we decided to use the decision tree model. As a graph of decisions, the model uses a decision tree as a predictive model which maps the variables about an item to the conclusion about the item’s target value.Independent VariablesThere is conceivably a large number of varying combinatorial sets of independent variables on the dependent variable of material weakness on a company’s internal control. In these conceivably huge sets of the independent variables, we decided to use the variables which were used in three published studies: Cheh et al. (2006), Doyle et al. (2007) and Ogneva et al. (2007). In total, 46 variables were used on these studies: 23 by Cheh et al., 10 by Doyle et al., and 13 by Ogneva et al. There were several overlapping variables between studies by Doyle et al. and Ogneva et al. In addition, two of the variables used by Doyle et al. (2007) were excluded in this study because they were not available in our data bases, such as Audit Analytics, CRSP, and Research Insight. These two variables were the number of special purpose entities (SPEs) associated with each firm and governance score. Consequently, 35 variables were used in this study. Also, we excluded in our study three variables from the work by Ogneva et al.; these variables were predicted forecast errors and were known to be associated with systematic biases in analyst forecasts. The variables utilized forecast data from I/B/E/S and Value Line; but these data were not available in our aforementioned data bases.Contemporary Management Research 163 Panel A of Table 1 provides detailed descriptive explanations of the 35 independent variables, and the descriptive statistics of each variable are presented in Panel B. It can be seen in Panel B that four variables have very large standard deviations and variances. These variables are Net Income/(Total Assets – Total Liabilities), Cost of Goods Sold/Inventory, Retained Earnings/Inventory, and Total Liabilities/(Total Assets – Total Liabilities). Although we have normalized net income, cost of goods sold, retained earnings, and total assets using various measures, these variables apparently have had very large fluctuations among the sample firms. 
For net income, however, it is interesting to note that Net Income/Sales and Net Income/Total Assets have very small variances while Net Income/(Total Assets – Total Liabilities) has a significant variance. This result indicates that net income of sample firms fluctuated significantly with respect to net equity and was relatively stable with respect to sales and total assets.Sample FirmsThe initial sample companies came from Research Insight which has almost 10,000 companies listed in its Compustat database. These companies are listed in three major U.S. stock exchanges (i.e., New York Stock Exchange, American Stock Exchange and NASDAQ), regional U.S. stock exchanges, Canadian stock exchanges, and over-the-counter markets. For these companies, the data required for this study were retrieved from three research databases: Audit Analytics, CRSP, and Research Insight. The companies with SOX MW for the period of 2002 to 2006 were obtained from Audit Analytics. The data for the 35 variables were retrieved from Compustat database in Research Insight, except for information about the ages of the sample firms. The information about ages was available in the CRSP tape. To remove any companies with incomplete data for any variables, we used an in-house developed software application. This process resulted in 869 companies in the final sample, a considerably smaller subset of the initial sample.Contemporary Management Research 164Table 1 Descriptive statistics for 35 variables for 2004-2006Panel A: Variable DescriptionsVariables Description Variables Description NI/SALE Net Income/Net Sales CH/LCT Net Sales/Current LiabilitiesNI/AT Net Income/TotalAssetsWCAP/SALE Cash/CurrentLiabilitiesNI/(AT-LT) Net Income/Net Worth RE/AT Working Capital/Net SalesEBIT/AT Earnings Before IncomeTax/Total AssetsCH/AT Retained Earnings/Total AssetsSALE/AT Net Sales/Total Assets LT/(AT-LT) Cash/Total AssetsACT/AT Current Assets/TotalAssetsLOGMARKETCAP Total Liabilities/Net Worth(ACT-INVT)/AT Quick Assets/TotalAssetsLOGAGELOG of share price × numberof sharesoutstanding(CSHO*PRCCF)SALE/(ACT-INVT)Net Sales/Quick Assets AGGREGATELOSS LOG Age of companyACT/SALE Current Assets/NetSalesRZSCOREIndicator variable [1,0] if∑earning before extraordinaryitems in years t and t-1<0,otherwise(IB)INVT/SALE Inventory/NetSales LOGSEGGEOMUM LOG ∑ number of operating and geographic segments (Bus Segment-Actual number+Geo Seg Areas-Actual number)COGS/INVT Cost of GoodsSold/InventoryFOREIGNIndicator variable [1,0] if non-zero foreign currencytranslation , otherwise(FCA)LT/AT Total Liabilities/TotalAssetsACQUISITIONVALUE∑50% Ownership of theacquired firm(AQA/MKVAL)(ACT-INVT)/SALE Quick Assets/SalesEXTREMESALESGROWTHIndicator variable [1,0] if year-over-year industry-adjusted NetSales growth=top quintile,otherwiseRE/INVT RetainedEarnings/InventoryRESTRUCTURINGCHARGE∑ restructuring charges(RCA/MKVAL)ACT/LCT Current Assets/CurrentLiabilitiesLOGSEGMUMLOG Number of businesssegments(ACT-INVT)/LCT Quick Assets/CurrentLiabilitiesINVENTORY Inventory/TotalassetsLCT/AT Current Liabilities/Total AssetsSALE/LCT Sales/Current Liabilities M&AIndicator variable [1,0] ifcompany was involved in M&Aover 3 years, otherwiseContemporary Management Research 165Panel B: Descriptive statistics of the variables variables N Mean Std. 
Deviation Variance Minimum Maximum ICMW * 869 0.520138 0.499882 0.249882 0 1NI/SALE 869 -0.04617 0.585724 0.343073 -11.4922 2.506701 NI/AT 869 -0.00369 0.258762 0.066958 -2.81327 4.832774 NI/(AT-LT) 869 1.102761 33.98825 1155.201 -14.0419 1001.306 EBIT/AT 869 0.032835 0.158506 0.025124 -2.40982 0.480082 SALE/AT 869 1.098502 0.852066 0.726016 0.026889 8.613351 ACT/AT 869 0.507784 0.235754 0.05558 0.024723 0.991249 (ACT-INVT)/AT 869 0.381526 0.212887 0.045321 0.023738 0.956625 SALE/(ACT-INVT) 869 4.355389 5.787747 33.49802 0.035184 66.39423ACT/SALE 869 0.750516 1.444692 2.087135 0.033669 28.66403 INVT/SALE 869 0.12852 0.120808 0.014595 0.000373 1.631579 COGS/INVT 869 19.77532 73.00375 5329.548 0.107527 1874.644LT/AT 869 0.528867 0.418582 0.17521 0.028224 6.811726 (ACT-INVT)/SALE 869 0.621996 1.410005 1.988113 0.015062 28.42211RE/INVT 869 -17.4043 144.4622 20869.32 -2476.96 568.4035 ACT/LCT 869 2.813127 2.783589 7.748366 0.097934 35.69801 (ACT-INVT)/LCT 869 2.227998 2.66166 7.084434 0.079661 35.41416LCT/AT 869 0.250558 0.169365 0.028684 0.024158 1.388766 SALE/LCT 869 4.914681 2.970393 8.823232 0.250556 33.59769 CH/LCT 869 0.822242 1.541899 2.377452 0 20.04329 WCAP/SALE 869 0.452549 1.332053 1.774364 -1.74452 26.57621RE/AT 869 -0.43403 2.365501 5.595594 -42.8029 1.041362 CH/AT 869 0.127273 0.131966 0.017415 0 0.879367 LT/(AT-LT) 869 -4.58159 181.5619 32964.74 -5340.58 226.1027 LOGMARKETCAP 869 2.663385 0.719972 0.51836 -0.28377 4.889938LOGAGE 869 1.117768 0.18178 0.033044 0.30103 1.30103AGGREGATELOSS 869 0.219793 0.414345 0.171682 0 1RZSCORE 869 4.52359 2.906921 8.450192 0 9 LOGSEGGEOMUM 869 0.738405 0.174211 0.030349 0 1.176091FOREIGN 869 0.218642 0.413563 0.171035 0 1 ACQUISITIONVALUE 869 -0.00023 0.002234 4.99E-06 -0.03232 0.037896EXTREMESALESGROWTH 869 0.243959 0.429715 0.184655 0 1 RESTRUCTURINGCHARGE 869 -0.19341 5.104154 26.05239 -149.554 3.574735LOGSEGMUM 869 0.286145 0.290228 0.084233 0 1INVENTORY 869 0.126258 0.120126 0.01443 0.000327 0.754949 M&A 869 0.116226 0.32068 0.102836 0 1 * Note that ICMW is the dependent variable of the study while all others are independentvariables.Contemporary Management Research 166Application of Data MiningThe decision tree model in data mining was available from SQL Server 2005Analysis Services. Since each configuration could provide different results, the same configuration was used for all three groups of variables: the default configuration inthe Analysis Services. Data mining would be particularly useful to get new insightsinto the variables used in the studies by Doyle et al. and Ogneva et al. because they hypothesized the relationships between their dependent variables and independent variables.For the purpose of training the model, one-year lagged data were used. For example, the 2004 data were used for training and then the 2005 data were used for prediction, and the 2005 data for training and the 2006 data for prediction. In that way,the decision rules were actually predicting MW of the sample companies withunknown data of the next year. This process provided with four sets of data that couldbe used for both training and testing, which will be explained in the following section.ANALYSES AND FINDINGSIt appears that in some cases, the financial variables from two studies by Doyle etal. (2007) and Ogneva et al. (2007) on Sarbanes-Oxley Act of 2002 internal controlMW add value to the predictability of the financial variables which have been used inthe study by Cheh et al. 
(2006), although this finding may become somewhat less clear in special cases. Results also provide preliminary evidence that all of the variables collectively can contribute to the construction of a good predictive model. Detailed discussions of the results follow in the subsequent paragraphs.

Table 2  Training 2004 data and testing 2005 data using all 35 variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            21            4             25
1            149           174           323
Total        170           178           348

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            12.35% (Prediction Accuracy)     2.25% (Type II Error)
1            87.65% (Type I Error)            97.75% (Prediction Accuracy)
Total        100%                             100%

Figure 1  Graphic results for Table 2

Using All 35 Variables
As the first test, we used the 2004 fiscal year data for training to predict MW companies in the 2005 fiscal year, using all 35 variables. As is shown in Table 2, when there is no MW, the prediction accuracy is only 12.35%, while the error of misclassifying non-MW as MW is 87.65%. However, when there is actually MW, the prediction accuracy is quite impressive: 97.75%, with the error of misclassifying MW as non-MW being only 2.25%. It should be noted that in this data set, and also in subsequent data sets, we matched both sets of non-MW and MW companies on standard industry code (SIC) and on sales within plus or minus 25%.

The lift chart shown in Figure 1 displays a lift, which is an indication of any improvement over a random guess, and shows a graphical representation of the change in the lift that a mining model causes (Tang and MacLennan 2005). For example, the top line shows that an ideal model would capture 100% of the target using a little over 50% of the data. The 45-degree random line across the chart indicates that if a data miner were to randomly guess the result for each case, the person would capture 50% of the target using 50% of the data (Tang and MacLennan 2005). The data mining model is depicted by the line between the ideal line and the random line. According to the graph, the decision tree model is doing only a little better than a random model. This also means that there isn't sufficient information in the training data to learn the patterns about the target.

To predict MW companies in the 2006 fiscal year based on the 2005 data, our data mining model trained the 2005 fiscal year data for all 35 variables of the sample companies. As we did with the first test, we matched both sets of non-MW and MW companies on SIC and on sales within plus or minus 25%. The final results are presented in Table 3 and Figure 2, and they are similar to those of the previous test. When there is actually MW, the prediction accuracy is as high as 98.47%, and the error rate of misclassifying MW as non-MW is 1.53%. When there is no MW, the prediction accuracy is 11.02%, and the error rate of misclassifying non-MW as MW is 88.98%.

Table 3  Training 2005 data and testing 2006 data using all 35 variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            13            2             15
1            105           129           234
Total        118           131           249

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            11.02% (Prediction Accuracy)     1.53% (Type II Error)
1            88.98% (Type I Error)            98.47% (Prediction Accuracy)
Total        100%                             100%

It is interesting to see that, even though there is a slight increase in the prediction rate of MW companies when there is MW, from 97.75% to 98.47%, the prediction rate of non-MW companies when there is no MW has somewhat deteriorated from 12.35% to 11.02%.
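The one-year-lagged train-and-predict protocol used in these experiments can be sketched with any generic decision-tree learner. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's DecisionTreeClassifier in place of the SQL Server 2005 Analysis Services decision tree actually used in the study, and the file name, column names and label name ICMW are hypothetical placeholders.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical layout: one row per company-year with the 35 predictors and
# the ICMW label (1 = material weakness disclosed, 0 = none).
panel = pd.read_csv("icmw_panel.csv")          # placeholder file name
predictors = [c for c in panel.columns if c not in ("company_id", "year", "ICMW")]

def lagged_train_test(panel, train_year, test_year):
    """Train on year t and predict year t+1, mirroring the lagged protocol."""
    train = panel[panel["year"] == train_year]
    test = panel[panel["year"] == test_year]
    tree = DecisionTreeClassifier(random_state=0)   # stand-in for the SSAS decision tree
    tree.fit(train[predictors], train["ICMW"])
    pred = tree.predict(test[predictors])
    # Rows: actual 0/1; columns: predicted 0/1 (the transpose of the Panel A layout above).
    return confusion_matrix(test["ICMW"], pred, labels=[0, 1])

print(lagged_train_test(panel, 2004, 2005))
print(lagged_train_test(panel, 2005, 2006))
```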
It appears that there isn't sufficiently more information in the 2006 data than in the 2005 data to make a difference in training the data and learning patterns about the target. As expected, the lift chart in Figure 2 is similar to the previous lift chart, with only a slightly better result in predicting MW companies.

Figure 2  Graphic results for Table 3

Using the Doyle et al. Variables
We repeated the process of training the data and predicting MW with the nine variables used by Doyle et al. As mentioned before, Doyle et al. (2007) used two additional variables that we did not use in our study: the number of special purpose entities (SPEs) associated with a firm under study and its governance score. Doyle et al. do not find a significant relation between MW disclosures and corporate governance. Although the prediction rate of non-MW companies increased slightly, the MW prediction rate decreased to below 95% when training on 2004 data to predict 2005 MW companies, as shown in Table 4. The significant decrease in the predictive power of our data mining model was probably caused by, among other things, the use of a smaller number of independent variables, from 35 down to nine.

Figure 3 provides two types of information about the model used in this study. The first type of information, presented in Panel A, is what the trees reveal. The decision rules in rectangular boxes, or nodes, indicate which rule will more likely generate a desired outcome. For example, if a company's log of market capitalization is greater than or equal to 1.775, but less than 3.833, then it is highly likely that the company would have material weakness. The darker the node is, the more cases it contains.

Table 4  Training 2004 data and testing 2005 data using Doyle et al.'s variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            27            10            37
1            158           168           326
Total        185           178           363

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            14.59% (Prediction Accuracy)     5.62% (Type II Error)
1            85.41% (Type I Error)            94.38% (Prediction Accuracy)
Total        100%                             100%

In Panel B of Figure 3, the top line of the lift chart shows that almost 50% of the data indicates the desired target. According to Tang and MacLennan (2005), the bottom line is the random line, which is always a 45-degree line across the chart. What this means is that if a business analyst were to randomly guess the result for each case, the analyst would capture 50% of the target using 50% of the data.

An interesting part is the line between these two lines, the ideal line and the random line. We may call this middle line a "model line", which represents the data mining model that we ran. In this particular chart, the model line merged with the ideal line up to almost 50% of the data. Then it dipped a little and paralleled the ideal line for the remaining data until a little after 90% of the data, and eventually it merged with the ideal line again. Thus, it indicates that this data mining model worked rather well, almost as if it were an ideal model. Hence, from the point of view of data efficiency, we can conclude that the variables used by Doyle et al. (2007) are quite efficient, because the decision tree model based on the nine variables is close to the ideal model.

Figure 3  Graphic results for Table 4 (Panel A: decision tree; Panel B: lift chart)

Similar results are shown in Table 5 and Figure 4 for using 2005 data to predict 2006 MW companies. The MW prediction rate increased slightly in predicting 2006 MW companies, to 96.21%, but it is still below the 98.47% demonstrated using all 35 variables in Panel B of Table 3.
Thus, using all 35 variables is a powerful reminder that more independent variables increase the data mining model's predictive power. The lift chart for the 2005 training data in predicting 2006 MW companies using the variables of Doyle et al. shows the somewhat surprising finding that data efficiency is even across all parts of the distribution of the population. Apparently, the decision model does better than the random model only at the extremes.

Cost of Misclassification Errors
Using the following cost of misclassification model, originally developed by Masters (1993), the cost of misclassification errors can be incorporated into the process of data analysis:

Cost of misclassification errors (CME) = (1 − q) p1 c1 + q p2 c2,

where q is the prior probability that the information risk associated with a company's internal control weakness was high, p1 and p2 are the probabilities of Type I (false positive) and Type II (false negative) errors, and c1 and c2 are the costs of Type I and Type II errors.

Table 5  Training 2005 data and testing 2006 data using Doyle et al.'s variables

Panel A: Actual Classification
Predicted    0 (Actual)    1 (Actual)    Total
0            17            5             22
1            113           127           240
Total        130           132           262

Panel B: Prediction Accuracy
Predicted    0 (Actual)                       1 (Actual)
0            13.08% (Prediction Accuracy)     3.79% (Type II Error)
1            86.92% (Type I Error)            96.21% (Prediction Accuracy)
Total        100%                             100%

The actual cost of misclassification errors can be significant to audit firms. For example, by classifying a good client without MW as a bad one with MW, an audit firm makes a Type I error and, as a result, could potentially lose a client. To avoid the Type I error, the audit firm may give too much weight to unwarranted MW risk in its audit engagement and planning processes. On the other hand, by mistakenly classifying a bad client with MW as a good one without MW, an audit firm makes a Type II error and, as a result, could miss MW in the client firm. In an effort to avoid the Type II error, the audit firm may give too little weight to MW risk in its audit engagement and planning processes.

Accordingly, Cheh et al. (1999) applied the CME model to determine the cost of misclassification in predicting takeover targets. Later, Calderon (1999) and Calderon and Cheh (2002) also used this idea in auditing and risk assessment applications. The prior probability, q, can be estimated by using training data, as Cheh et al. (1999) did in their takeover target study. For example, using the data shown in Panel A of Table 2, we can compute that q = 178/(170 + 178) = 0.5115. Then, using the information in Panel B of Table 2, the cost of misclassification errors can be computed as follows:

CME = 0.4885 × 0.8765 × c1 + 0.5115 × 0.0225 × c2.

Since this predictive model can provide p1 and p2, as shown above, each audit firm will be able to customize the cost model by using values of c1 and c2 based on the firm's professional experience and past cost data. With this cost model based on the data mining prediction model, each audit firm can prepare an expected net benefit report before it engages a potential audit client. With assistance from computer professionals, the audit firm can create another useful audit risk engagement tool based on this type of cost analysis.

CONCLUSIONS AND FUTURE RESEARCH
The purpose of this paper was to find the financial and nonfinancial variables which would be useful in predicting Sarbanes-Oxley Act internal control material weaknesses.
By examining a number of independent variables used in three published studies, we were able to construct a highly predictive data mining model with an MW prediction rate of 98.47%. It appears that each set of variables from the three groups of nine researchers is complementary to the others and strengthens the prediction accuracy. Apparently, each set of variables brings a different type of strength to the prediction of material weaknesses.
I.J. Intelligent Systems and Applications, 2016, 1, 49-59Published Online January 2016 in MECS (/)DOI: 10.5815/ijisa.2016.01.06GA_MLP NN: A Hybrid Intelligent System for Diabetes Disease DiagnosisDilip Kumar ChoubeyBirla Institute of Technology, Computer Science & Engineering, Mesra, Ranchi, IndiaEmail: dilipchoubey_1988@yahoo.inSanchita PaulBirla Institute of Technology, Computer Science & Engineering, Mesra, Ranchi, IndiaEmail: Sanchita07@Abstract—Diabetes is a condition in which the amount of sugar in the blood is higher than normal. Classification systems have been widely used in medical domain to explore patient’s data and extract a predictive model or set of rules. The prime objective of this research work is to facilitate a better diagnosis (classification) of diabetes disease. There are already several methodology which have been implemented on classification for the diabetes disease. The proposed methodology implemented work in 2 stages: (a) In the first stage Genetic Algorithm (GA) has been used as a feature selection on Pima Indian Diabetes Dataset. (b) In the second stage, Multilayer Perceptron Neural Network (MLP NN) has been used for the classification on the selected feature. GA is noted to reduce not only the cost and computation time of the diagnostic process, but the proposed approach also improved the accuracy of classification. The experimental results obtained classification accuracy (79.1304%) and ROC (0.842) show that GA and MLP NN can be successfully used for the diagnosing of diabetes disease.Index Terms—Pima Indian Diabetes Dataset, GA, MLP NN, Diabetes Disease Diagnosis, Feature Selection, Classification.I.I NTRODUCTIONDiabetes is a chronic disease and a major public health challenge worldwide. Diabetes happens when a body is not able to produce or respond properly to insulin, which is needed to maintain the rate of glucose. Diabetes can be controlled with the help of insulin injections, a controlled diet (changing eating habits) and exercise programs, but no whole cure is available. Diabetes leads to many other disease such as blindness, blood pressure, heart disease, kidney disease and nerve damage [15]. Main 3 diabetes signs are:-Increased need to urinate (Polyuria), Increased hunger (Polyphagia), Increased thirst (Polydipsia). There are two main types of diabetes: Type 1 (Juvenile or Insulin Dependent or Brittle or Sugar) Diabetes and Type 2 (Adult onset or Non Insulin Dependent) Diabetes. Type 1 Diabetes mostly happens to children and young adults but can affect at any age. For this type of diabetes, beta cells are destructed and people suffering from the condition require insulin injection regularly to survive. Type 2 Diabetes is the most common type of diabetes, in which people are suffering at least 90% of all the diabetes cases. This type mostly happens to the people more than forty years old but can also be found in younger classes. In this type, body becomes resistant to insulin and does not effectively use the insulin being produced. It can be controlled with lifestyle modification (a healthy diet plan, doing exercise regularly), oral medications (taking tablets). In some extreme cases, insulin injections may also be required but no whole cure for diabetes is available.In this paper, GA has been used as a Feature selection in which among 8 attributes, 4 attributes have been selected. The main purpose of Feature selection is to reduce the number of features used in classification while maintaining acceptable classification accuracy and ROC. 
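One common way to realize this GA-plus-classifier pipeline is as a wrapper search over bit masks of the eight Pima attributes, with a mask's fitness taken as the cross-validated accuracy of the downstream classifier. The sketch below is only an illustration of that idea under stated assumptions: the paper does not spell out its fitness function, it runs the Weka MLP from Java rather than scikit-learn, and the helper names and commented file layout are hypothetical. The GA parameter values, however, follow those reported later in the paper (population 20, 20 generations, crossover probability 0.6, mutation probability 0.033).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

def fitness(mask, X, y):
    """Fitness of a feature subset = 5-fold CV accuracy of an MLP trained on it
    (an assumed fitness function; the paper does not state its own)."""
    if not mask.any():
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

def ga_select(X, y, pop_size=20, generations=20, p_cross=0.6, p_mut=0.033):
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat)).astype(bool)
    for _ in range(generations):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        # Roulette-wheel selection of parents, proportional to fitness.
        probs = fit / fit.sum() if fit.sum() > 0 else np.full(pop_size, 1 / pop_size)
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):           # single-point crossover
            if rng.random() < p_cross:
                cut = rng.integers(1, n_feat)
                children[i, cut:], children[i + 1, cut:] = (
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy())
        mut = rng.random(children.shape) < p_mut      # bit-flip mutation
        children ^= mut
        pop = children
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[fit.argmax()]                          # best feature mask found

# Usage with the Pima data (hypothetical file layout: 8 feature columns + class label).
# data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# best_mask = ga_select(data[:, :8], data[:, 8].astype(int))
```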
Limiting the number of features (dimensionality) is important in statistical learning. With the help of Feature selection process we can save Storage capacity, Computation time (shorter training time and test time), Computation cost and increases Classification rate, Comprehensibility. MLP NN are supervised learning method for classification. Here, MLP NN have been used for the classification of the Diabetes disease diagnosis. The rest of the paper is organized as follows: Brief description of GA and MLP NN are in section II, Related work is presented in section III, Proposed methodology is discussed in section IV, Results and Discussion are devoted to section V, Conclusion and Future Direction are discussed in section VI.II.B RIEF D ESCRIPTION OF GA AND MLP NNA. GAJohn Holland introduced genetic Algorithm GA in the 1970 at University of Michigan (US). GA is an adaptive population based optimization technique, which is inspired by Darwin’s theory [10] about survival of the fittest. GA mimics the natural evolution process given by the Darwin i.e., in GA the next population isevolved through simulating operators of selection, crossover and mutation. John Holland is known as the father of the original genetic algorithm who first introduced these operators in [16]. Goldberg [13] and Michalewicz [18] later improved these operators. The advantages in GA [17] are that Concepts easy to understand, solves problems with multiple solutions, global search methods, blind search methods, Gas can be easily used in parallel machines, etc. and the limitation are Certain optimization problems, no absolute assurance for a global optimum, cannot assure constant optimization response times, cannot find the exact solution, etc. GA can be applied in Artificial creativity, Bioinformatics, chemical kinetics, Gene expression profiling, control engineering, software engineering, Traveling salesman problem, Mutation testing, Quality control etc. The genetic algorithm uses three main types of rules at each step to create the next generation from the current population:1. SelectionIt is also called reproduction phase whose primary objective is to promote good solutions and eliminate bad solutions in the current population, while keeping the population size constant. This is done by identifying good solutions (in terms of fitness) in the current population and making duplicate copies of these. Now in order to maintain the population size constant, eliminate some bad solutions from the populations so that multiple copies of good solutions can be placed in the population. In other words, those parents from the current population are selected in selection phase who together will generate the next population. The various methods like Roulette –wheel selection, Boltzmann selection, Tournament selection, Rank selection, Steady –state selection, etc., are available for selection but the most commonly used selection method is Roulette wheel. Fitness value of individuals play an important role in these all selection procedures.2. CrossoverIt is to be notice that the selection operator makes only multiple copies of better solutions than the others but it does not generate any new solution. So in crossover phase, the new solutions are generated. First two solutions from the new population are selected either randomly or by applying any stochastic rule and bring them in the mating pool in order to create two off-springs. 
It is not necessary that the newly generated off-springs is more, because the off-springs have been created from those individuals which have been survived during the selection phase. So the parents have good bit strings combinations which will be carried over to off-springs. Even if the newly generated off-springs are not better in terms of fitness then we should not be bother about because they will be eliminated in next selection phase. In the crossover phase new off-springs are made from those parents which were selected in the selection phase. There are various crossover methods available like single-point crossover, two-point crossover, Multi-point crossover (N-Point crossover), uniform crossover, Matrix crossover (Two-dimensional crossover), etc.3. MutationMutation to an individual takes part with a very low probability. If any bit of an individual is selected to be muted then it is flipped with a possible alternative value for that bit. For example, the possible alternative value for 0 is 1 and 0 for 1 in binary string representation case i.e. , 0 is flipped with 1 and 1 is flipped with 0. The mutation phase is applied next to crossover to keep diversity in the population. Again it is not always to get better off-springs after mutation but it is done to search few solutions in the neighborhood of original solutions.B. MLP NNOne of the most important models in ANN or NN is MLP. The advantages in NN [17] are that Mapping capabilities or pattern association, generalization, robustness, fault tolerance and parallel and high speed information processing, good at recognizing patterns and the limitations are training to operate, require high processing time for large neural network, not good at explaining how they reach their decisions, etc. NN can be applied in Pattern recognition, image processing, optimization, constraint satisfaction, forecasting, risk assessment, control systems. MLP NN is feed forward trained with the Back-Propagation algorithm. It is supervised neural networks so they require a desired response to be trained. It learns how to transform input data in to a desired response to be trained, so they are widely used for pattern classification. The structure of MLP NN is shown in fig. 1. The type of architecture used to implement the system is MLP NN. The MLP NN consists of one input layer, one output layer, one or more hidden layers. Each layer consists of one or more nodes or neurons, represented by small circles. The lines between nodes indicate flow of information from one node to another node. The input layer is that which receives the input and this layer has no function except buffering the input signal [27], the output of input layer is given to hidden layer through weighted connection links. Any layer that is formed between the input and output layers is called hidden layer. This hidden layer is internal to the network and has no direct contact with the external environment. It should be noted that there may be zero to several hidden layers in an ANN, more the number of the hidden layers, more is the complexity of the network. This may, however, provide an efficient output response. This layer performs computations and transmits the results to output layer through weighted links, the output of the hidden layer is forwarded to output layer.The output layer generates or produces the output of the network or classification the results or this layer performs computations and produce final result.Fig.1. 
Feed Forward Neural Network Model for Diabetes DiseaseDiagnosisIII.R ELATED W ORKKemal Polat et al. [2] stated Principal Component Analysis (PCA) and Adaptive Neuro–Fuzzy Inference System (ANFIS) to improve the diagnostic accuracy of Diabetes disease in which PCA is used to reduce the dimensions of Diabetes disease dataset features and ANFIS is used for diagnosis of Diabetes disease means apply classification on that reduced features of Diabetes disease datasets. Manjeevan Seera et al. [3] introduced a new way of classification of medical data using hybrid intelligent system. The methodology implemented here is based on the hybrid combinatorial method of Fuzzy max-min based neural network and classification of data using Random forest Regression Tree. The methodology is implemented on various datasets including Breast Cancer and Pima Indian Diabetes Dataset and performs better as compared to other existing techniques. Esin Dogantekin et al. [1] used Linear Discriminant Analysis (LDA) and ANFIS for diagnosis of diabetes. LDA is used to separate feature variables between healthy and patient (diabetes) data, and ANFIS is used for classification on the result produced by LDA. The techniques used provide good accuracy then the previous existing results. So, the physicians can perform very accurate decisions by using such an efficient tool. H. Hasan Orkcu et al. [4] compares the performance of various back propagation and genetic algorithms for the classification of data. Since Back propagation is used for the efficient training of data in artificial neural network but contains some error rate, hence genetic algorithm is implemented for the binary and real-coded so that the training is efficient and more number of features can be classified. Muhammad Waqar Aslam et al. [7] introduced Genetic Programming–K-Nearest Neighbour (GP-KNN), Genetic Programming-Support Vector Machines (GP-SVM), in which KNN and SVM tested the new features generated by GP for performance evaluation. According to Pasi Luukka [5] Fuzzy Entropy Measures is used as a feature selection by which the computation cost, computation time can be reduced and also reduce noise and this way enhance the classification accuracy. Now from the previous statement it is clear that feature selection based on fuzzy entropy measures and it is tested together with similarity classifier. Kemal Polat et al. [12] proposed uses a new approach of a hybrid combination of Generalized Discriminant Analysis (GDA) and Least Square Support Vector Machine (LS–SVM) for the classification of diabetes disease. Here the methodology is implemented in two stages: in the first stage pre-processing of the data is done using the GDA such that the discrimination between healthy and patient disease can be done. In the second stage LS-SVM technique is applied for the classification of Diabetes disease patient’s. The methodology implemented here provides accuracy about 78.21% on the basis of 10 fold-cross validation from LS-SVM and the obtained accuracy for classification is about 82.05%. K Selvakuberan et al. [9] used Ranker search method, K star, REP tree, Naive bayes, Logisitic, Dagging, Multiclass in which Ranker search approach is used for feature selection and K star, REP tree, Naive bayes, Logisitic, Dagging, Multiclass are used for classification. The techniques implemented here provide a reduced feature set with higher classification accuracy. Acording to Adem Karahoca et al. 
[29] ANFIS, and Multinomial Logistic Regression (MLR) has been used for the comparison of peformance in terms of standard errors for diabetes disease diagnosis. ANFIS is used as an estimation method which has fuzzy input and output parameters, whereas MLR is used as a non-linear regression method and has fuzzy input and output parameters. Laerico Brito Goncalves et al. [13] implemented a new Neuro-fuzzy model for the classification of diabetes disease patients. Here in this paper an inverted Hierarchical Neuro-fuzzy based system is implemented which is based on binary space partitioning model and it provided embodies for the continue recursive of the input space and automatically generates own structure for the classification of inputs provided. The technique implemented finally generates a series of rules extraction on the basis of which classification can be done. T. Jayalakshmi et al. [28] proposed a new and efficient technique for the classification of diagnosis of diabetes disease using Artificial neural network (ANN). The methodology implemented here is based on the concept of ANN which requires a complete set of data for the accurate classification of Diabetes. The paper also implements an efficient technique for the improvement of classification accuracy of missing values in the dataset. It also provides a preprocessing stage during classification. Nahla H. Barakat et al. [30] worked on the classification of diabetes disease using a machine learning approach such as Support Vector Machine (SVM). The paper implements a new and efficient technique for the classification of medical diabetes mellitus using SVM. A sequential covering approach for the generation of rules extraction is implemented using the concept ofSVM,which is an efficient supervised learning algorithm. The paper also discusses Eclectic rule extraction technique for the extraction of rules set attributes from the dataset such that the selected attributes can be used for classification of medical diagnosis mellitus. Saloni et al. [31] have used various classifiers i.e ANN, linear, quadratic and SVM for the classification of parikson disease in which SVM provides the best accuracy of 96%. For the increases of classification accuracy, they have also used feature selection by which among 23 features, 15 features are selected. E.P. Ephzibah [23] used GA and Fuzzy Logic (FL) for diabetes diagnosis in which GA has been used as a feature selection method and FL is used for classification. The used methods improve the accuracy and reduced the cost.IV.P ROPOSED M ETHODOLOGYHere, The Proposed approach is implemented and evaluated by GA as a Feature Selection and MLP NN for Classification on Pima Indians Diabetes Data set from UCI repository of machine learning databases.The Proposed system of Block Diagram and the next proposed algorithm is shown below:Fig.2. Block Diagram of Proposed System Proposed AlgorithmStep1: StartStep2: Load Pima Indian Diabetes DatasetStep3: Initialize the parameters for the GAStep4: Call the GAStep5.1: Construction of the first generationStep5.2: SelectionWhile stopping criteria not met doStep5.3: CrossoverStep5.4: MutationStep5.5: SelectionEndStep6: Apply MLP NN ClassificationStep7: Training DatasetStep8: Calculation of error and accuracyStep9: Testing DatasetStep10: Calculation of error and accuracyStep11: StopThe proposed approach works in the following phases:A. Take Pima Indians Diabetes Data set from UCI repository of machine learning databases.B. 
Apply GA as a Feature Selection on Pima Indians Diabetes Data set.C. Do the Classification by using MLP NN on selected features in Pima Indians Diabetes Dataset.A. Used Diabetes Disease DatasetThe Pima Indian Diabetes Database was obtained from the UCI Repository of Machine Learning Databases [14]. The same dataset used in the reference [1-8] [11] [15] [19-26] [28].B. GA for Feature SelectionThe GA is a repetitive process of selection, crossover and mutation with the population of individuals in each iteration called a generation. Each chromosome or individual is encoded in a linear string (generally of 0s and 1s) of fix length in genetic analogy. In search space, First of all, the individual members of the population are randomly initialized. After initialization each population member is evaluated with respect to objective function being solved and is assigned a number (value of the objective function) which represents the fitness for survival for corresponding individual. The GA maintains a population of fix number of individuals with corresponding fitness value. In each generation, the more fit individuals (selected from the current population) go in mating pool for crossover to generate new off-springs, and consequently individuals with high fitness are provided more chance to generate off-springs. Now, each new offspring is modified with a very low mutation probability to maintain the diversity in the population. Now, the parents and off-springs together forms the new generation based on the fitness which will treated as parents for next generation. In this way, the new generation and hence successive generations of individual solutions is expected to be better in terms of average fitness. The algorithms stops forming new generations when either a maximum number of generations has been formed or a satisfactory fitness value is achieved for the problem.The standard pseudo code of genetic algorithm is given in Algorithm 11. Algorithm GABeginq = 0Randomly initialize individual members of population P(q)Evaluate fitness of each individual of population P(q) while termination condition is not satisfied doq = q+1selection (of better fit solutions)crossover (mating between parents to generate off-springs)mutation (random change in off-springs)end whileReturn best individual in population;In Algorithm 1, q represents the generation counter, initialization is done randomly in search space and corresponding fitness is evaluated based on objective function. After that GA algorithm requires a cycle of three phases: selection, crossover and mutation.In medical world, If we have to be diagnosed any disease then there are some tests to be performed. After getting the results of the tests performed, the diagnosed could be done better. We can consider each and every test as a feature. If we have to do a particular test then there are certain set of chemicals, equipments, may be people, more time required which can be more expensive. Basically, Feature Selection informs whether a particular test is necessary for the diagnosis or not. So, if a particular test is not required that can be avoided. When the number of tests gets reduced the cost that is required also gets reduced which helps the common people. So, Here that is why we have applied GA as a feature selection by which we reduced 4 features among 8 features. So from the above it is clear that GA is reducing the cost, storage capacity, and computation time by selected some of the feature.C. 
C. MLP NN for Classification
The MLP NN is a supervised, feed-forward learning method for classification. Since it is a supervised learner, it requires a desired response in order to be trained. The working of the MLP NN is summarized in the steps below:
1. Input data are provided to the input layer for processing, which produces a predicted output.
2. The predicted output is subtracted from the actual output and an error value is calculated.
3. The network then uses a Back-Propagation (BP) algorithm which adjusts the weights.
4. Weight adjustment starts from the weights between the output-layer nodes and the last hidden-layer nodes and works backwards through the network.
5. When BP is finished, the forward pass starts again.
6. The process is repeated until the error between predicted and actual output is minimized.

3.1. BP Algorithm for Adjusting the Network Weights
The most widely used training algorithm for multilayer, feed-forward networks is Back-Propagation. The name BP is given because the difference between actual and predicted values is propagated from the output nodes backwards to the nodes in the previous layer; this is done to improve the weights during processing. The working of the BP algorithm is summarized in the following steps:
1. Provide training data to the network.
2. Compare the actual and desired output.
3. Calculate the error in each node or neuron.
4. Calculate what the output of each node or neuron should be, and how much lower or higher its output must be adjusted to reach the desired output.
5. Then adjust the weights.
The block diagram of the working of the MLP NN using BP is given below:
Fig.3. Block Diagram of MLP NN using BP

V. RESULTS AND DISCUSSION OF PROPOSED METHODOLOGY

The work was implemented on an i3 processor with 2.30 GHz speed, 2 GB RAM and 320 GB external storage; the software used was JDK 1.6 (Java Development Kit) with the NetBeans 8.0 IDE, and the coding was done in Java. The Weka library was used for the computation of the MLP NN and the various parameters.
In the experimental studies we used a 70-30% partition for training and testing of the GA_MLP NN system for diabetes disease diagnosis. The experiments were performed on the Pima Indians Diabetes Dataset described in Section IV.A. We compared the results of our proposed system, GA_MLP NN, with the results previously reported by earlier methods [3].
The parameters of the Genetic Algorithm for our task are:
Population Size: 20
Number of Generations: 20
Probability of Crossover: 0.6
Probability of Mutation: 0.033
Report Frequency: 20
Random Number Seed: 1
As Table 2 shows, by applying the GA approach we obtained 4 features out of the 8. This means the cost is reduced to s(x) = 4/8 = 0.5 from 1, which corresponds to an improvement in training and classification by a factor of 2.
Diagnostic performance is usually evaluated in terms of Classification Accuracy, Precision, Recall, Fallout, F-Measure, ROC and the Confusion Matrix.
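Before these terms are defined, the following minimal sketch illustrates the classification step of Section C. It is a stand-in, not the paper's implementation: the paper used the Weka MLP from Java, whereas the sketch uses scikit-learn's MLPClassifier, assumes the data live in pima.csv, applies the 70-30% split used in the experiments, and the selected feature indices are placeholders for whatever the GA returns.

# Stand-in for the MLP/BP classification step (Section C); not the paper's Weka code.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = np.loadtxt("pima.csv", delimiter=",")          # assumed file layout
X, y = data[:, :-1], data[:, -1]
selected = [1, 5, 6, 7]                               # placeholder GA-selected subset

X_tr, X_te, y_tr, y_te = train_test_split(
    X[:, selected], y, test_size=0.30, random_state=1, stratify=y)

# MLPClassifier fits its weights with back-propagation, working backwards from the
# output layer exactly as summarised in the numbered steps above.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                                  random_state=1))
mlp.fit(X_tr, y_tr)

print("training accuracy:", accuracy_score(y_tr, mlp.predict(X_tr)))
print("test accuracy    :", accuracy_score(y_te, mlp.predict(X_te)))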
The evaluation terms listed above are briefly explained below.
Classification Accuracy: Classification accuracy is the probability of correctly classifying records in the test dataset, i.e. the ratio of the total number of correctly diagnosed cases to the total number of cases:
Classification accuracy (%) = (TP + TN) / (TP + FP + TN + FN)    (1)
where
TP (True Positive): sick people correctly detected as sick.
FP (False Positive): healthy people incorrectly detected as diabetic.
TN (True Negative): healthy people correctly detected as healthy.
FN (False Negative): sick people incorrectly detected as healthy.
Precision: Precision measures the rate of correctly classified samples among those predicted as diabetic, i.e. the ratio of the number of correctly classified instances to the total number of instances fetched:
Precision = No. of Correctly Classified Instances / Total No. of Instances Fetched    (2)
or Precision = TP / (TP + FP)    (3)
Recall: Recall measures the rate of correctly classified samples among those that are actually diabetic, i.e. the ratio of the number of correctly classified instances to the total number of instances in the dataset:
Recall = No. of Correctly Classified Instances / Total No. of Instances in the Dataset    (4)
or Recall = TP / (TP + FN)    (5)
Usually, when precision increases recall decreases; in other words, precision and recall stand in opposition to one another.
Fallout: Fallout is used to assess the negative cases of the dataset during classification; it corresponds to the false-positive rate.
F-Measure: The F-Measure (F-Score) combines the information-retrieval precision and recall metrics into a single average. It is calculated from precision and recall as:
F-Measure = 2 * Precision * Recall / (Precision + Recall)    (6)
Area Under the Curve (AUC): AUC is a metric used to measure the performance of a classifier. It is calculated from the area under the ROC curve on the basis of true positives and false positives, here as:
AUC = (1/2) * (TP / (TP + FN) + TN / (TN + FP))    (7)
ROC analysis is an effective method of evaluating the performance of diagnostic tests.
Confusion Matrix: A confusion matrix [12][2] contains information about the actual and predicted classifications made by a classification system.
The following terms, used in the results part, are also briefly explained.
Kappa Statistic: The kappa statistic measures the agreement of the classifier with the true classification:
K = (P0 - Pc) / (1 - Pc)    (8)
where P0 is the total agreement probability and Pc is the agreement probability due to chance.
Root Mean Squared Error (RMSE): RMSE measures the difference between the predicted values and the actual values in the learning:
RMSE = sqrt( (1/N) * Σ_{j=1..N} (Ere - Eacc)^2 )    (9)
where Ere is the resultant error rate and Eacc is the actual error rate.
Mean Absolute Error: It is defined as:
MAE = ( |p1 - a1| + … + |pn - an| ) / n    (10)
Root Mean-Squared Error: It is defined as:
RMSE = sqrt( ( (p1 - a1)^2 + … + (pn - an)^2 ) / n )    (11)
Relative Squared Error: It is defined as:
RSE = ( (p1 - a1)^2 + … + (pn - an)^2 ) / ( (ā - a1)^2 + … + (ā - an)^2 )    (12)
Relative Absolute Error: It is defined as:
RAE = ( |p1 - a1| + … + |pn - an| ) / ( |ā - a1| + … + |ā - an| )    (13)
where a1, a2, …, an are the actual target values and p1, p2, …, pn are the predicted target values.
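As a worked illustration of equations (1)-(7), the short sketch below computes the main metrics from the four confusion-matrix counts. The example numbers are taken from the training-set confusion matrix reported in the next subsection, under the assumption (made only for illustration) that tested_positive is treated as the positive class.

# Worked illustration of equations (1)-(7): diagnostic metrics from TP/FP/TN/FN counts.
def evaluate(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)                  # equation (1)
    precision = tp / (tp + fp)                                   # equation (3)
    recall    = tp / (tp + fn)                                   # equation (5)
    f_measure = 2 * precision * recall / (precision + recall)    # equation (6)
    auc       = 0.5 * (tp / (tp + fn) + tn / (tn + fp))          # equation (7)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure, "auc": auc}

# Counts from the training-set confusion matrix reported below, with tested_positive
# taken as the "positive" class (this mapping is an assumption).
print(evaluate(tp=110, fp=24, tn=325, fn=79))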
The details of the evaluation on the training set and on the test split with the MLP NN are as follows.

Time taken to build model: 2.06 seconds
Evaluation on training set
Correctly Classified Instances      435     80.855%
Incorrectly Classified Instances    103     19.145%
Kappa statistic                     0.5499
Mean absolute error                 0.2528
Root mean squared error             0.3659
Relative absolute error             55.4353%
Root relative squared error         76.6411%
Total Number of Instances           538

Detailed Accuracy by Class
Class              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
tested_negative    0.931     0.418     0.804       0.931    0.863       0.872
tested_positive    0.582     0.069     0.821       0.582    0.681       0.872
Weighted Avg.      0.809     0.295     0.810       0.809    0.799       0.872

Confusion Matrix
  a     b    <-- classified as
 325    24   | a = tested_negative
  79   110   | b = tested_positive

Time taken to build model: 2.04 seconds
Evaluation on test split
Correctly Classified Instances      180     78.2609%
Incorrectly Classified Instances     50     21.7391%
Kappa statistic                     0.4769
Mean absolute error                 0.2716
Root mean squared error             0.387
Relative absolute error             59.8716%
Root relative squared error         81.4912%
Total Number of Instances           230

Detailed Accuracy by Class
Class              TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
tested_negative    0.921     0.481     0.785       0.921    0.848       0.853
tested_positive    0.519     0.079     0.774       0.519    0.621       0.853
Weighted Avg.      0.783     0.343     0.781       0.783    0.770       0.853

Confusion Matrix
  a     b    <-- classified as
 139    12   | a = tested_negative
  38    41   | b = tested_positive

The figure shown below analyses the False Positive Rate versus the True Positive Rate; Figure 4 presents the ROC graph for the MLP NN methodology. The Pima Indian Diabetes Dataset classified using the MLP NN generates a low error rate, as shown in Figure 4.
Fig.4. Analysis of Positive Rate for Pima Indian Diabetes Dataset without GA
Table 1 shows the training and testing accuracy obtained by applying the MLP NN methodology.
Table 1. Classification with MLP NN
Table 2. GA Feature Reduction
Model Predictive Control: A Survey

I. Introduction
Model Predictive Control (MPC) is an advanced control strategy that has been widely applied in many fields, including chemical engineering, energy, machinery and automotive systems. MPC uses a mathematical model of the system to make predictions and, based on the predicted results, computes the optimal control actions to achieve the control objectives. This survey gives a comprehensive introduction to the principles, methods and application areas of MPC, as well as the latest progress in related research.
II. MPC Principles
2.1 Control objective
The main goal of MPC is to drive the system state along the specified reference trajectory as close as possible to the desired values, while satisfying the system constraints. These constraints may include limits on the input variables, ranges of the output variables, operating limits of the equipment, and so on.
2.2 Control model
MPC uses a mathematical model of the system to describe its behaviour; at each time step, based on the current state and the constraints, it predicts the behaviour of the system over a certain time horizon. Common system models include linear time-invariant (LTI) models, nonlinear models and hybrid discrete-continuous models.
2.3 Optimal control strategy
At each time step, MPC computes a set of optimal control inputs so that the predicted system states track the reference trajectory as closely as possible. This computation is usually carried out with an optimization algorithm, such as linear-quadratic programming (LQP) or a nonlinear optimization algorithm.
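As an illustration of this receding-horizon computation, the sketch below implements a minimal unconstrained linear MPC for a double-integrator model: at each step a finite-horizon quadratic cost is minimised in closed form and only the first input is applied. The model, horizon, weights and reference are illustrative assumptions, not taken from any particular system discussed here.

# Minimal linear MPC sketch for an LTI model x(k+1) = A x(k) + B u(k); all numbers
# are illustrative assumptions and constraints are ignored here.
import numpy as np

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])          # simple double integrator, dt = 0.1
B = np.array([[0.005],
              [0.1]])
N = 15                              # prediction horizon
Q, R = np.diag([10.0, 1.0]), np.array([[0.1]])
x_ref = np.array([1.0, 0.0])        # reference state

# Prediction matrices over the horizon: X = F x0 + G U.
F = np.vstack([np.linalg.matrix_power(A, i + 1) for i in range(N)])
G = np.zeros((2 * N, N))
for i in range(N):
    for j in range(i + 1):
        G[2*i:2*i+2, j:j+1] = np.linalg.matrix_power(A, i - j) @ B

Qbar = np.kron(np.eye(N), Q)        # block-diagonal state weights
Rbar = np.kron(np.eye(N), R)        # input weights
ref = np.tile(x_ref, N)

def mpc_step(x0):
    # Unconstrained finite-horizon LQ problem solved in closed form:
    # U* = (G'QG + R)^-1 G'Q (ref - F x0); apply only the first input.
    H = G.T @ Qbar @ G + Rbar
    g = G.T @ Qbar @ (ref - F @ x0)
    return np.linalg.solve(H, g)[0]

x = np.array([0.0, 0.0])
for k in range(60):                 # receding-horizon simulation of the plant
    u = mpc_step(x)
    x = A @ x + B.flatten() * u
print("final state:", x)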
III. MPC Methods
3.1 Conventional MPC
Conventional MPC methods usually describe the system behaviour with a linear time-invariant (LTI) model and assume that the system states and control inputs are measurable. The advantages of this approach are its simplicity and practicality, and it is suitable for most linear systems.
3.2 Constraint-based MPC
Constraint-based MPC methods take the system constraints into account and guarantee that these constraints are satisfied during control. This approach is very useful in practical engineering problems, such as restricting the motion range of a robot arm or bounding a furnace temperature.
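A minimal sketch of such constraint-based MPC is given below: the same illustrative double-integrator model as above, but with hard bounds on the input, solved numerically at each step with scipy.optimize.minimize. All numbers are again assumptions chosen only for illustration.

# Constraint-based MPC sketch: finite-horizon quadratic cost with input bounds
# |u| <= 0.5, solved numerically at each step (illustrative assumptions throughout).
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.005], [0.1]])
N, u_max = 15, 0.5
Q, R = np.diag([10.0, 1.0]), 0.1
x_ref = np.array([1.0, 0.0])

def cost(U, x0):
    J, x = 0.0, x0.copy()
    for u in U:                              # roll the model forward over the horizon
        x = A @ x + B.flatten() * u
        e = x - x_ref
        J += e @ Q @ e + R * u * u
    return J

def mpc_step(x0):
    res = minimize(cost, np.zeros(N), args=(x0,),
                   bounds=[(-u_max, u_max)] * N)   # input constraints enforced here
    return res.x[0]                          # receding horizon: apply first input only

x = np.array([0.0, 0.0])
for k in range(60):
    u = mpc_step(x)
    x = A @ x + B.flatten() * u
print("final state:", x)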
3.3 Nonlinear MPC
Nonlinear MPC methods describe the system with a nonlinear model and compute the control strategy with nonlinear optimization algorithms. Because it can handle more complex system behaviour, this approach has attracted the attention of researchers and engineers.
3.4 Model uncertainty
Model uncertainty is an important challenge for MPC. Since the model of the real system is often incomplete and subject to parameter errors or unknown disturbances, these uncertainties have to be taken into account when designing an MPC controller, and appropriate compensation strategies have to be adopted.
J Clin Hepatol, Vol. 39, No. 7, Jul. 2023   DOI: 10.3969/j.issn.1001-5256.2023.07.011

Value of serum chitinase-3-like protein 1 (CHI3L1) in predicting the risk of decompensation events in patients with liver cirrhosis
杨航1, 赵黎莉2, 韩萍1, 陈庆灵1, 文君2, 刘洁2, 程晓静2, 李嘉2
1. Clinical College, The Second People's Hospital of Tianjin Medical University, Tianjin 300070; 2. Department of Hepatology, Tianjin Second People's Hospital, Tianjin 300192
Corresponding author: 李嘉, 163.com (ORCID: 0000-0003-0100-417X)

Abstract: Objective: Predicting decompensation events, so that active preventive measures can be taken, is the key to improving the survival of patients with liver cirrhosis. This study aimed to explore the value of serum chitinase-3-like protein 1 (CHI3L1) in predicting the risk of decompensation events in patients with liver cirrhosis.
Methods: A case-control study was conducted on 305 patients with liver cirrhosis diagnosed and treated at Tianjin Second People's Hospital from January 2019 to May 2021. At baseline, 200 patients had compensated cirrhosis and 105 had decompensated cirrhosis. The 305 patients with cirrhosis were grouped according to whether a decompensation event occurred within one year, with 79 patients in the decompensation-event group and 226 in the no-decompensation-event group; the 200 patients with compensated cirrhosis were grouped according to whether a first decompensation event occurred within one year, with 43 patients in the first-decompensation-event group and 157 in the no-first-decompensation-event group. Normally distributed continuous data were compared between two groups with the independent-samples t-test or the Mann-Whitney U test; categorical data were compared between groups with the Wilcoxon rank-sum test or the χ2 test. Binary logistic regression analysis was used to assess the association between each variable and decompensation events; the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the predictive value of each variable for decompensation events, and the maximum Youden index was used to determine the optimal cut-off value.
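As an illustration of this ROC/Youden-index procedure (not the study's actual analysis), the sketch below computes the AUC of a marker for a binary outcome and selects the cut-off that maximises sensitivity + specificity - 1. The CHI3L1-like values are synthetic placeholders; only the group sizes (79 events, 226 non-events) echo the study.

# Sketch of AUC computation and maximum-Youden-index cut-off selection.
# The marker values are synthetic placeholders, not the study's CHI3L1 measurements.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
event = np.r_[np.ones(79), np.zeros(226)]                           # 79 events, 226 non-events
chi3l1 = np.r_[rng.normal(250, 90, 79), rng.normal(130, 70, 226)]   # ng/mL, synthetic

auc = roc_auc_score(event, chi3l1)
fpr, tpr, thresholds = roc_curve(event, chi3l1)
youden = tpr - fpr                                                  # Youden index at each cut-off
best = np.argmax(youden)
print(f"AUC = {auc:.3f}, optimal cut-off = {thresholds[best]:.1f} ng/mL "
      f"(sensitivity {tpr[best]:.2f}, specificity {1 - fpr[best]:.2f})")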
Results: The baseline serum CHI3L1 level of the patients who experienced a decompensation event within one year was higher than that of those who did not [243.00 (136.00–372.00) ng/mL vs 117.50 (67.75–205.25) ng/mL, U = 4720.500, P < 0.001]; the baseline serum CHI3L1 level of the patients who experienced a first decompensation event within one year was higher than that of those who did not [227.98 (110.00–314.00) ng/mL vs 90.00 (58.00–168.50) ng/mL, U = 1681.500, P < 0.001].
Association Rules and Predictive Modelsfor e-Banking ServicesVasilis Aggelis Dimitris ChristodoulakisDepartment of Computer Engineering and InformaticsUniversity of Patras, Rio, Patras, GreeceEmail: Vasilis.Aggelis@egnatiabank.gr dxri@cti.grAbstract. The introduction of data mining methods in the banking area al-though conducted in a slower way than in other fields, mainly due to the natureand sensitivity of bank data, can already be considered of great assistance tobanks as to prediction, forecasting and decision making. One particular methodis the investigation for association rules between products and services a bankoffers. Results are generally impressive since in many cases strong relations areestablished, which are not easily observed at a first glance. These rules are usedas additional tools aiming at the continuous improvement of bank services andproducts helping also in approaching new customers. In addition, the develop-ment and continuous training of prediction models is a very significant task, es-pecially for bank organizations. The establishment of such models with the ca-pacity of accurate prediction of future facts enhances the decision making andthe fulfillment of the bank goals, especially in case these models are applied onspecific bank units. E-banking can be considered such a unit receiving influencefrom a number of different sides. Scope of this paper is the demonstration of theapplication of data mining methods to e-banking. In other words associationrules concerning e-banking are discovered using different techniques and a pre-diction model is established, depending on e-banking parameters like the trans-actions volume conducted through this alternative channel in relation with othercrucial parameters like the number of active users.1 IntroductionDetermination of association rules concerning bank data is a challenging though de-manding task since:−The volume of bank data is enormous. Therefore the data should be adequately prepared by the data miner before the final step of the application of the method.−The objective must be clearly set from the beginning. In many cases not having a clear aim results in erroneous or no results at all.−Good knowledge of the data is a prerequisite not only for the data miner but also for the final analyst (manager). Otherwise wrong results will be produced by the data miner and unreliable conclusions will be drawn by the manager.−Not all rules are of interest. The analyst should recognize powerful rules as to deci-sion making.2The challenge of the whole process rises from the fact that the relations established are not easily observed without use of data mining methods.A number of data mining methods are nowadays widely used in treatment of finan-cial and bank data [22]. The contribution of these methods in prediction making is essential. Study and prediction of bank parameter’s behavior generally receives vari-ous influences. Some of these can be attributed to internal parameters while others to external financial ones. Therefore prediction making requires the correct parameter determination that will be used for predictions and finally for decision making. Meth-ods like Classification, Regression, Neural Networks and Decision Trees are applied in order to discover and test predictive models.Prediction making can be applied in specific bank data sets. An interesting data sector is e-banking. E-banking is a relatively new alternative payment and informa-tion channel offered by the banks. 
The familiarity of continuously more people to the internet use, along with the establishment of a feeling of increasing safety in conduc-tion of electronic transactions imply that slowly but constantly e-banking users in-crease. Thus, the study of this domain is of great significance to a bank.The increase of electronic transactions during the last years is quite rapid. E-banking nowadays offers a complete sum of products and services facilitating not only the individual customers but also corporate customers to conduct their transac-tions easily and securely. The ability to discover rules between different electronic services is therefore of great significance to a bank.Identification of such rules offers advantages as the following:−Description and establishment of the relationships between different types of elec-tronic transactions.−The electronic services become more easily familiar to the public since specific groups of customers are approached, that use specific payment manners.−Customer approach is well designed with higher possibility of successful engage-ment.−The improvement of already offered bank services is classified as to those used more frequently.−Reconsidering of the usefulness of products exhibiting little or no contribution to the rules.Of certain interest is the certification of the obtained rules using other methods of data mining to assure their accuracy.This study is concerned with the identification of rules between several different ways of payment and also we focus in prediction making and testing concerning the transactions volume through e-banking in correlation to the number of active users. The software used is SPSS Clementine 7.0. Section 2 contains association rules’ basic features and algorithms. A general description of prediction methods can be found in section 3, while in section 4 the process of investigation for association rules and the process of prediction making and testing is described. Experimental results can be found in section 5 and finally section 6 contains the main conclusions of this work accompanied with future work suggestions in this area.32 Association Rules BasicsA rule consists of a left-hand side proposition (antecedent) and a right-hand side (con-sequent) [3]. Both sides consist of Boolean statements. The rule states that if the left-hand side is true, then the right-hand is also true. A probabilistic rule modifies this definition so that the right-hand side is true with probability p, given that the left-hand side is true.A formal definition of association rule [6, 7, 14, 15, 18, 19, 21] is given below.Definition. An association rule is a rule in the form ofX Æ Y (1) where X and Y are predicates or set of items.As the number of produced associations might be huge, and not all the discovered associations are meaningful, two probability measures, called support and confidence, are introduced to discard the less frequent associations in the database. The support is the joint probability to find X and Y in the same group; the confidence is the condi-tional probability to find in a group Y having found X.Formal definitions of support and confidence [1, 3, 14, 15, 19, 21] are given below.Definition Given an itemset pattern X, its frequency fr(X) is the number of cases in the data that satisfy X. Support is the frequency fr(X ∧ Y). 
Confidence is the fraction of rows that satisfy Y among those rows that satisfy X,

c(X → Y) = fr(X ∧ Y) / fr(X)    (2)

In terms of conditional probability notation, the empirical accuracy of an association rule can be viewed as a maximum likelihood (frequency-based) estimate of the conditional probability that Y is true, given that X is true.

2.1 Apriori Algorithm

Association rules are among the most popular representations for local patterns in data mining. Apriori algorithm [1, 2, 3, 6, 10, 14] is one of the earliest for finding association rules. This algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. This algorithm contains a number of passes over the database. During pass k, the algorithm finds the set of frequent itemsets Lk of length k that satisfy the minimum support requirement. The algorithm terminates when Lk is empty. A pruning step eliminates any candidate, which has a smaller subset. The pseudo code for Apriori Algorithm is following:

Ck: candidate itemset of size k
Lk: frequent itemset of size k
L1 = {frequent items};
For (k=1; Lk != null; k++) do begin
    Ck+1 = candidates generated from Lk;
    For each transaction t in database do
        Increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
End
Return Lk;

2.2 Generalized Rule Induction (GRI)

In the generalized rule induction task [3,4,5,11,15,16,17], the goal is to discover rules that predict the value of a goal attribute, given the values of other attributes. However the goal attribute can be any attribute not occurring in the antecedent of the rule.

3 Classification and Regression

There are two types of problems that can be solved by Classification and Regression methods [3].
Regression-type problems. The problem of regression consists of obtaining a functional model that relates the value of a response continuous variable Y with the values of variables X1, X2, ..., Xv (the predictors). This model is usually obtained using samples of the unknown regression function.
Classification-type problems. Classification-type problems are generally those where one attempts to predict values of a categorical response variable from one or more continuous and/or categorical predictor variables.

3.1 Classification and Regression Trees (C&RT)

C&RT method [23, 24] builds classification and regression trees for predicting continuous dependent variables (regression) and categorical predictor variables (classification). There are numerous algorithms for predicting continuous variables or categorical variables from a set of continuous predictors and/or categorical factor effects. In most general terms, the purpose of the analysis via tree-building algorithms is to determine a set of if-then logical conditions that permit accurate prediction or classification of cases.
Tree classification techniques [3, 25, 26], when they "work" and produce accurate predictions or predicted classifications based on a few logical if-then conditions, have a number of advantages over many of those alternative techniques.
Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple.
This simplicity is useful not only for purposes of rapid classifica-tion of new observations but can also often yield a much simpler "model" for explain-ing why observations are classified or predicted in a particular manner.Tree methods are nonparametric and nonlinear.The final results of using tree methods for classification or regression can be summarized in a series of logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the under-lying relationships between the predictor variables and the response variable are lin-5 ear, follow some specific non-linear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where neither a priori knowledge is available nor any coherent set of theories or predictions regarding which variables are related and how. In those types of data analysis, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.3.2 Evaluation ChartsComparison and evaluation of predictive models in order for the best to be chosen is a task easily performed by the evaluation charts [4, 27]. They reveal the performance of the models concerning particular outcomes. They are based on sorting the records as to the predicted value and confidence of the prediction, splitting the records into groups of equal size (quantiles), and finally plotting the value of the business criterion for each quantile, from maximum to minimum.Different types of evaluation charts emphasize on a different evaluation criteria.Gains Charts: The proportion of total hits that occur in each quantile are defined as Gains. They are calculated as the result of: (number of hits in quantile/total number of hits) X 100%.Lift Charts:Lift is used to compare the percentage of records occupied by hits with the overall percentage of hits in the training data. Its values are computed as (hits in quantile/records in quantile) / (total hits/total records).Evaluation charts can also be expressed in cumulative form, meaning that each point equals the value for the corresponding quantile plus all higher quantiles. Cumu-lative charts usually express the overall performance of models in a more adequate way, whereas non-cumulative charts [4] often succeed in indicating particular prob-lem areas for models.The interpretation of an evaluation chart is a task certainly depended on the type of chart, although there are some characteristics common to all evaluation charts. Con-cerning cumulative charts, higher lines indicate more effective models, especially on the left side of the chart. In many cases, when comparing multiple models lines cross, so that one model stands higher in one part of the chart while another is elevated higher than the first in a different part of the chart. In this case, it is necessary to con-sider which portion of the sample is desirable (which defines a point on the x axis) when deciding which model is the appropriate.Most of the non-cumulative charts will be very similar. For good models, noncu-mulative charts [4] should be high toward the left side of the chart and low toward the right side of the chart. Dips on the left side of the chart or spikes on the right side can indicate areas where the model is predicting poorly. A flat line across the whole graph indicates a model that essentially provides no information.Gains charts. 
Cumulative gains charts extend from 0% starting from the left to 100% at the right side. In the case of a good model, the gains chart rises steeply towards 100% and then levels off. A model that provides no information typically follows the diagonal from lower left to upper right.
Lift charts. Cumulative lift charts tend to start above 1.0 and gradually descend until they reach 1.0 heading from left to right. The right edge of the chart represents the entire data set, so the ratio of hits in cumulative quantiles to hits in data is 1.0. In case of a good model the lift should start well above 1.0 on the left, remain on a high plateau moving to the right, trailing off sharply towards 1.0 on the right side of the chart. For a model that provides no information, the line would hover around 1.0 for the entire graph.

4 Investigation for Association Rules and Generation of Predictions in e-Banking Data Set

In this section the determination of relations between different payment types e-banking users conduct is described. These relations are of great importance to the bank as they enhance the selling of its products, by making the bank more aware of its customer's behavior and the payment way they prefer. They also contribute in improvement of its services. Namely the following types of payments were used:
1. DIASTRANSFER Payment Orders – Funds Transfers between banks that are participating in DIASTRANSFER System.
2. SWIFT Payment Orders – Funds Transfers among national and foreign banks.
3. Funds Transfers – Funds Transfers inside the bank.
4. Forward Funds Transfers – Funds Transfers inside the bank with forward value date.
5. Greek Telecommunications Organization (GTO) and Public Power Corporation (PPC) Standing Orders – Standing Orders for paying GTO and PPC bills.
6. Social Insurance Institute (SII) Payment Orders – Payment Orders for paying employers' contributions.
7. VAT Payment Orders.
8. Credit Card Payment Orders.
For the specific case of this study a random sample of e-banking customers, both individuals and companies, was used. A Boolean value is assigned to each different payment type depending on whether the payment has been conducted by the users or not. In case the user is individual, field «company» receives the value 0, else 1. A sample of the above data set is shown in Table 1. In total the sample contains 1959 users.

User     Comp  DiasTransfer P.O.  Swift P.O.  Funds Transfer  Forward Funds Transfer  GTO-PPC S.O.  SII P.O.  VAT P.O.  Cr.Card P.O.
…        …     …                  …           …               …                       …             …         …         …
User103  0     T                  F           T               T                       F             F         F         T
User104  1     T                  T           T               T                       F             T         T         F
User105  1     T                  F           T               T                       T             T         T         T
User106  0     T                  F           T               F                       T             F         F         T
…        …     …                  …           …               …                       …             …         …         …
Table 1.

In order to discover association rules the Apriori [4] and Generalized Rule Induction (GRI) [4] methods were used. These rules are statements in the form if antecedent then consequent.
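As an illustration only: the rules reported in this paper were produced with the Apriori and GRI nodes of SPSS Clementine, but the sketch below shows how the support and confidence of simple pairwise rules can be computed over a Boolean usage table shaped like Table 1. The randomly generated usage matrix is a placeholder for the real 1959-user sample, and the 10% / 40% thresholds of Section 5 are expressed as proportions.

# Support/confidence of pairwise rules over a Boolean usage table shaped like Table 1.
# The random data are placeholders; the real analysis used SPSS Clementine.
import numpy as np

services = ["DIASTRANSFER P.O.", "SWIFT P.O.", "Funds Transfer", "Forward Funds Transfer",
            "GTO-PPC S.O.", "SII P.O.", "VAT P.O.", "Credit Card P.O."]
rng = np.random.default_rng(7)
usage = rng.random((1959, len(services))) < 0.3       # placeholder for the 1959-user sample

n = usage.shape[0]
min_support, min_confidence = 0.10, 0.40              # thresholds as proportions
for i, antecedent in enumerate(services):
    for j, consequent in enumerate(services):
        if i == j:
            continue
        both = np.logical_and(usage[:, i], usage[:, j]).sum()
        support = both / n                            # fr(X and Y) / n
        confidence = both / usage[:, i].sum()         # fr(X and Y) / fr(X)
        if support >= min_support and confidence >= min_confidence:
            print(f"if {antecedent} then {consequent}  "
                  f"(support {support:.2f}, confidence {confidence:.2f})")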
In the case of this study, as stated above, another objective is the production and testing of predictions about the volume of e-banking transactions in relation to the active users. The term financial transactions stands for all payment orders or standing orders a user carries out, excluding transactions concerning information content like account balance, detailed account transactions or mini statement. The term «active» describes the user who currently makes use of the electronic services a bank offers. Active users are a subgroup of the enlisted users. One day is defined as the time unit. The number of active users is the predictor variable (Count_Of_Active_Users) while the volume of financial transactions is assumed to be the response variable (Count_Of_Payments). A sample of the above data set is shown in Table 2.

Transaction Day  Count_Of_Active_Users  Count_Of_Payments
...              ...                    ...
27/8/2002        99                     228
28/8/2002        107                    385
29/8/2002        181                    915
30/8/2002        215                    859
...              ...                    ...
Table 2.

The date range for which the data set is applicable counts from April 20th, 2001 until December 12th, 2002. The data set includes data only for active days (holidays and weekends not included), which means 387 occurrences. In order to create predictions the Classification and Regression Tree (C&R Tree) [4] was used.

5 Experimental Results

Fig. 1.
Apriori: By the use of the Apriori method and setting Minimum Rule Support = 10 and Minimum Rule Confidence = 20, eleven rules were determined, as shown in Figure 1.
GRI: Using the GRI method with Minimum Rule Support = 10 and Minimum Rule Confidence = 40 resulted in the four rules seen in Figure 2.
Fig. 2.
After comparison between the rules obtained by the two methods it can be concluded that the most powerful is:
− If SII Payment Order then VAT Payment Order (confidence=81). (Rule 1)
Additionally it can be observed that this rule is valid for the reverse case too, namely:
− If VAT Payment Order then SII Payment Order (confidence=47). (Rule 2)
Two other rules obtained exhibiting confidence > 40 are the following:
− If Funds Transfers then Forward Funds Transfers (confidence=47), (Rule 3) valid also reversely.
− If Forward Funds Transfers then Funds Transfers (confidence=42). (Rule 4)
As seen, there exists a strong relationship between VAT Payment Order and SII Payment Order, as is the case between Funds Transfers and Forward Funds Transfers. These strong relationships are visualized on the web graph of Figure 3 and the tables of its links in Figure 4.
Fig. 3.
Fig. 4.
Interestingness [8, 9, 12, 13, 19, 20] of the above rules is high, bearing in mind that generally a pattern or rule can be characterized interesting if it is easily understood, unexpected, potentially useful and actionable, or it validates some hypothesis that a user seeks to confirm. Given the fact that the payment types used in the rules exhibiting the highest confidence (VAT Payment Order, SII Payment Order) are usually payments conducted by companies and freelancers, a smaller subdivision of the whole sample was considered containing only companies (958 users), to which the Apriori and GRI methods were applied. The results are described below.
Apriori: Using the Apriori method and setting Minimum Rule Support = 10 and Minimum Rule Confidence = 40, the seven rules shown in Figure 5 were obtained.
Fig. 5.
GRI: The GRI method with Minimum Rule Support = 10 and Minimum Rule Confidence = 50 resulted in two rules, see Figure 6.
Fig. 6.
The comparison of the above rules leads to the conclusion that the validity of rules 1 and 2 is confirmed, exhibiting even increased support:
− If SII Payment Order then VAT Payment Order (confidence=81). (Rule 5)
− If VAT Payment Order then SII Payment Order (confidence=51). (Rule 6)

5.1 C&RT

In Figure 7 the conditions defining the partitioning of data discovered by the C&RT algorithm are displayed. These specific conditions compose a predictive model.
Fig. 7.
The C&RT algorithm [4] works by recursively partitioning the data based on input field values. The data partitions are called branches. The initial branch (sometimes called the root) encompasses all data records.
The root is split into subsets or child branches, based on the value of a particular input field. Each child branch may be further split into sub-branches, which may in turn be split again, and so on. At the lowest level of the tree are branches that have no more splits. Such branches are known as terminal branches, or leaves.11 Figure 7 shows the input values that define each partition or branch and a summary of output field values for the records in that split. For example a condition with many instances (201) is:If Count_of_Active_Users < 16.5 then Count_of_Payments Ö 11,473, meaning that if the number of active users during a working day is less than 16.5 then the number of transactions is predicted approximately at the value of 11.Fig. 8.Figure 8 shows a graphical display of the structure of the tree in detail. Goodness of fit of the discovered model was assessed by the use of evaluation charts. Examples of evaluation Charts (Figure 9 to Figure 10) are shown below.Gains Chart: This Chart (Figure 9) shows that the gain rises steeply towards 100% and then levels off. Using the prediction of the model, the percentage of Pre-dicted Count of Payments for the percentile is calculated and these points create the lift curve. An important notice is that the efficiency of a model is increased with the area between the lift curve and the baseline.Fig. 9.12Lift Chart: As can be seen in Figure 10, Chart starts well above 1.0 on the left, remains on a high plateau moving to the right, and then trails off sharply towards 1.0 on the right side of the chart. Using the prediction of the model shows the actual lift.Fig. 10.In Figure 11 the Count of Payment and the Predicted Count of Payment as func-tions of the number of active users can be seen. The Predicted Count of Payment line clearly reveals the partitioning of the data due to the application of the C&RT algo-rithm.Fig. 11.6 Conclusions and Future WorkIn the present paper the way of discovery of association rules between different types of e-banking payment offered by a bank is described along with experimental results.The basic outcome is that VAT Payment Orders and SII Payment Orders are the most popular and interconnected strongly. The detection of such relationships offers a bank a detailed analysis that can be used as a reference point for the conservation of the customer volume initially but also for approaching new customer groups (compa-nies, freelancers) who conduct these payments in order to increase its customer num-ber and profit.13 Similar patterns can be derived after analysis of payment types of specific cus-tomer groups resulting from various criteria either gender, residence area, age, profes-sion or any other criteria the bank assumes significant. The rules to be discovered show clearly the tendencies created in the various customer groups helping the bank to approach these groups as well as re-design the offered electronic services in order to become more competitive.In addition to the two methods used, detection of association rules can employ sev-eral other data mining methods such as decision trees in order to confirm the validity of the results.In this study the establishment of a prediction model concerning the number of payments conducted through Internet as related to the number of active users is inves-tigated along with testing its accuracy.Using the methods C&RT it was concluded that there exists strong correlation be-tween the number of active users and the number of payments these users conduct. 
It is clear that the increase of the users’ number results in increase of the transactions made. Therefore, the enlargement of the group of active users should be a strategic target for a bank since it increases the transactions and decreases its service cost. It has been proposed by certain researches that the service cost of a transaction reduces from €1.17 to just €0.13 in case it is conducted in electronic form. Therefore, a goal of the e-banking sector of a bank should be the increase of the proportion of active users compared to the whole number of enlisted customers from which a large num-ber do not use e-services and is considered inactive.Future work includes the creation of prediction models concerning other e-banking parameters like the transactions value, the use of specific payment types, commissions of electronic transactions and e-banking income in relation to internal bank issues. Also extremely interesting is the case of predictive models based on external financial parameters.The prediction model of this study could be determined using also other methods of data mining. Use of different methods offers the ability to compare between them and select the more suitable. The acceptance and continuous training of the model using the appropriate Data Mining method results in extraction of powerful conclu-sions and results.References1. M.J. Zaki, S. Parthasarathy, W. Li, and M. Ogihara. “Evaluation of Sampling for Data Min-ing of Association Rules”, Proceedings 7th Workshop Research Issues Data Engineering, (1997)2. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. “Fast Discovery ofAssociations Rules”, In: Advances in Knowledge Discovery and Data Mining, (1996)3. D. Hand, H. Mannila, and P. Smyth “Principles of Data Mining”. The MIT Press, (2001)4. “Clementine 7.0 Users’s Guide”. Integral solutions Limited, (2002)5. A. Freitas. “A Genetic Programming Framework for Two Data Mining Tasks: Classificationand Generalized Rule Induction”, Proceedings 2nd Annual Conference on Genetic Pro-gramming (1997)6. E. Han, G. Karypis, and V. Kumar. “Scalable Parallel Data Mining for Association Rules”,Proceedings ACM SIGMOD Conference, (1997)147. S. Brin, R. Motwani, and C. Silverstein. “Beyond Market Baskets: Generalizing AssociationRules to Correlations”, Proceedings ACM SIGMOD Conference, (1997)8. M. Chen, J. Han, and P. Yu. “Data Mining: An Overview from Database Perspective”, IEEETransactions on Knowledge and Data Engineering, (1997)9. N. Ikizler, and H.A. Guvenir. “Mining Interesting Rules in Bank Loans Data”, Proceedings10th Turkish Symposium on Artificial Intelligence and Neural Networks, (2001)10. R. Ng, L. Lakshmanan, and J. Han. “Exploratory Mining and Pruning Optimizations ofConstrained Association Rules”, Proceedings ACM SIGMOD Conference, (1998)11. A. Freitas. “A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discov-ery”, In: A. Ghosh and S. Tsutsui. (eds.) Advances in Evolutionary Computation (2002)12. A. Freitas. “On Rule Interestingness Measures”, Knowledge-Based Systems, (1999)13. R. Hilderman, and H. Hmailton. “Knowledge Discovery and Interestingness Measures: aSurvey”, Tech Report CS 99-04, Department of Computer Science, Univ. of Regina, (1999) 14. J. Hipp, U. Guntzer, and G. Nakhaeizadeh. “Data Mining of Associations Rules and theProcess of Knowledge Discovery in Databases”, (2002)15. A. Michail. 
“Data Mining Library Reuse Patterns using Generalized Association Rules”,Proceedings 22nd International Conference on Software Engineering, (2000)16. J. Liu, and J. Kwok. “An Extended Genetic Rule Induction Algorithm”, Proceedings Con-gress on Evolutionary Computation (CEC), (2000)17. P. Smyth, and R. Goodman. “Rule Induction using Information Theory”, Knowledge Dis-covery in Databases, (1991)18. T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. “Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization”, Pro-ceedings ACM SIGMOD Conference, (1996)19. R. Srikant, and R. Agrawal. “Mining Generalized Association Rules”, Proceedings21stVLDB Conference, (1995)20. T. Yilmaz, and A. Guvenir. “Analysis and Presentation of Interesting Rules”, Proceedings10th Turkish Symposium on Artificial Intelligence and Neural Networks, (2001)21. H. Toivonen. “Discovery of Frequent Patterns in Large Data Collections”, Tech Report A-1996-5, Department of Computer Science, Univ. of Helsinki, (1996)22. D. Foster, and R. Stine. “Variable Selection in Data Mining: Building a Predictive Modelfor Bankruptcy”, Center for Financial Institutions Working Papers from Wharton School Center for Financial Institutions, Univ. of Pennsylvania, (2002)23. J. Galindo. “Credit Risk Assessment using Statistical and Machine Learning: Basic Meth-odology and Risk Modeling Applications”, Computational Economics Journal, (1997)24. R.Lewis. “An Introduction to Classification and Regression tree (CART) Analysis”, AnnualMeeting of the Society for Academic Emergency Medicine in San, (2000)25. L. Breiman, J. H. Friedman, R.A. Olshen, and C.J. Stone. “Classification and RegressionTrees”, Wadsworth Publications, (1984)26. A. Buja, and Y. Lee. “Data Mining Criteria for Tree-based Regression and Classification”,Proceedings 7th ACM SIGKDD Conference (2001)27. S.J. Hong, and S. Weiss. “Advances in Predictive Model Generation for Data Mining”Proceedings International Workshop Machine Learning and Data Mining in Pattern Recog-nition, (1999)。