Grey Correlation Analysis

Introduction

Grey correlation analysis is a statistical method used to measure the correlation between two or more variables when the data is limited or uncertain. It was developed by Deng Julong in China in the 1980s and has since been widely applied in various fields including finance, economics, engineering, and the social sciences. Grey correlation analysis is particularly useful when dealing with incomplete or uncertain data: it can provide valuable insights and help in decision-making processes when traditional correlation analysis methods may not be applicable.

Principles of Grey Correlation Analysis

Grey correlation analysis is based on the principles of grey system theory, which aims to study systems with limited information and uncertain data. The method involves four main steps:

1. Data organization: Organize the available data. This may involve collecting data from various sources and arranging it in a systematic manner.
2. Data comparison: Compare the different variables or factors under consideration, using statistical measures such as the mean, range, and standard deviation.
3. Grey correlation coefficient calculation: Compute the grey correlation coefficient to measure the correlation between the variables. It is a value between 0 and 1, where a higher value indicates a stronger correlation, and it takes the uncertainties and variations in the data into account.
4. Grey correlation analysis: Determine the relationships between the variables. This can help in identifying the most influential factors and making predictions or forecasts based on the available data.

Applications of Grey Correlation Analysis

Grey correlation analysis has been widely used in various fields for different purposes. Some of the applications include:
1. Financial analysis: Grey correlation analysis has been applied in financial analysis to study the relationships between different financial indicators. It can help in identifying the key factors influencing the financial performance of companies or investment portfolios; for example, it can be used to analyze the correlation between stock prices and economic indicators such as inflation rates or interest rates.

2. Engineering design: In engineering design, grey correlation analysis can be used to evaluate the relationship between various design parameters and the performance of a system or product. It can help in optimizing the design process by identifying the most critical factors and their impact on overall performance; for example, it can be used to analyze the correlation between different manufacturing parameters and the strength of a material.

3. Economic forecasting: Grey correlation analysis has also been used in economic forecasting to predict future trends based on historical data. It can help in identifying the key factors influencing economic growth or decline; for example, it can be used to analyze the correlation between GDP growth and factors such as consumer spending, investment, and government policies.

4. Social sciences: In the social sciences, grey correlation analysis can be used to study the relationships between different social or demographic variables. It can help in understanding the factors influencing social phenomena and in making informed policy decisions; for example, it can be used to analyze the correlation between education levels, income levels, and crime rates in a specific region.

Advantages and Limitations

Grey correlation analysis has several advantages over traditional correlation analysis methods.
Some of the advantages include:

• Suitable for limited or uncertain data: grey correlation analysis can handle situations where the data is incomplete or uncertain, making it useful in real-world applications.
• Provides insights into complex systems: it can provide valuable insights in complex systems where traditional correlation analysis methods may not be effective.
• Helps in decision-making: it can support decision-making processes by identifying the most influential factors and their impact on the outcomes.

However, grey correlation analysis also has some limitations:

• Subject to data quality: the accuracy and reliability of the results depend heavily on the quality of the data used.
• Limited to linear relationships: the method assumes a linear relationship between the variables under consideration and may not be suitable for analyzing non-linear relationships.
• Interpretation challenges: interpreting the results can be difficult because of the complexity of the method and the uncertainties involved.

Conclusion

Grey correlation analysis is a valuable statistical method for measuring the correlation between variables when data is limited or uncertain. It has been widely applied in various fields and has provided valuable insight into complex systems. Its limitations and challenges should, however, be kept in mind when it is used to support decision-making. Overall, grey correlation analysis is a useful tool that complements traditional correlation analysis methods and helps address real-world problems.
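As an illustration, the four-step procedure described above can be sketched with Deng's grey relational coefficient. This is a minimal sketch, not a full implementation: the resolution coefficient rho = 0.5 is the conventional default, min-max normalization is one of several common preprocessing choices, and the sales/advertising/price figures are hypothetical.

```python
def grey_relational_grade(reference, comparison, rho=0.5):
    """Deng's grey relational grade between a normalized reference
    sequence and one normalized comparison sequence."""
    deltas = [abs(r - c) for r, c in zip(reference, comparison)]
    d_min, d_max = min(deltas), max(deltas)
    # Grey relational coefficient at each data point, then the average.
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)

def min_max(seq):
    """Step 1-2 preprocessing: scale a sequence onto [0, 1]."""
    lo, hi = min(seq), max(seq)
    return [(x - lo) / (hi - lo) for x in seq]

# Hypothetical data: which factor tracks sales most closely?
sales    = min_max([120, 135, 150, 160, 180])   # reference sequence
ad_spend = min_max([10, 12, 16, 17, 20])
price    = min_max([9.9, 9.5, 9.6, 9.2, 9.0])

g_ad    = grey_relational_grade(sales, ad_spend)
g_price = grey_relational_grade(sales, price)
# Grades lie in (0, 1]; the factor with the larger grade is judged
# more closely related to the reference sequence.
```

Ranking the factors by their grades is what step 4 ("grey correlation analysis") amounts to in practice.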
Correlation
Xu Jiajin
National Research Center for Foreign Language Education, Beijing Foreign Studies University

Key points
• Why correlation?
• What is correlation analysis about?
• How to make a correlation analysis? (case studies)

Why Correlation?

Three things that stats can do:
1. Summarizing univariate data
2. Testing the significance of differences
3. Exploring relationships between variables

Exploring relationships between things:
• Is plant growth related to how much a plant is watered, and how strong is the relationship?
• Is football performance related to physical attributes (fitness, ethnicity)?
• Higher interest, better grades?
• The more metacognitive strategies are used, the faster learners progress?
• Does learning statistics well benefit your health?

Key ideas of correlation analysis
• Correlation: co-relation, represented by a correlation coefficient, r.
• The range of the coefficient: -1 to 1.
• Three critical values: -1, 0, and 1.
• A coefficient has both a strength (how far it is from 0) and a direction (positive or negative).

Two main types of correlation
• Pearson: the standard type, suitable for interval data (e.g., scores, frequencies); reported as the Pearson r coefficient.
• Spearman: suitable for ordinal/rank data; reported as the Spearman rho coefficient.

Significance
• As with t-test and ANOVA statistics, correlation coefficients need to be statistically significant: Sig. / p value / alpha (α) < .05.

Coefficient of determination
• r → r² → % of variance explained.
• The squared correlation coefficient is called the coefficient of determination. Multiplied by 100, this proportion of variance indicates the percentage of variance that is accounted for.
• Correlation coefficients of .30 account for about 9% of the variance.
• A correlation of .70 explains about 49% of the variance (effect size).

Case Study 1
Is connector use by Chinese EFL learners correlated with their writing quality?
SPSS procedures: Analyze → Correlate → Bivariate

Reporting correlations
• In correlation tables/matrices (Dörnyei 2007: 227)
• Embedded in text, e.g.: "As one would expect from the extensive literature documenting the benefits of intrinsic motivation, there was a significant positive correlation between overall GPA and intrinsic motivation (r = .34, p < .001)." (Dörnyei 2007: 227)

Practice
Correlational analysis of CET4 and CET6 scores.

Homework
Are English scores correlated with Chinese scores?

Wrap Up & Look Forward
• Correlation coefficients provide a way to determine the strength and the direction of the relationship between two variables. This index does not demonstrate a causal association between two variables.
• The coefficient of determination indicates how much variance in one variable is explained by another variable.
• Correlation coefficients are the precursors to the more sophisticated statistics involved in multiple regression (Urdan 2005: 87).

Thank you
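The Pearson r, Spearman rho, and coefficient-of-determination statistics covered above can also be computed outside SPSS. The sketch below uses only the Python standard library; the connector-use and writing-quality scores are hypothetical, and the ranking helper ignores ties for simplicity (real rank correlations average tied ranks).

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation for interval data."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r applied to ranks.
    Simplified sketch -- ties are not averaged."""
    def ranks(seq):
        order = sorted(range(len(seq)), key=seq.__getitem__)
        r = [0.0] * len(seq)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))

# Hypothetical scores: connector use vs. writing quality.
connectors = [3, 8, 10, 15, 21]
quality    = [52, 60, 64, 70, 79]

r = pearson_r(connectors, quality)
r_squared = r ** 2            # coefficient of determination
rho = spearman_rho(connectors, quality)
```

Multiplying `r_squared` by 100 gives the percentage of variance explained, the quantity the slides report for r = .30 (about 9%) and r = .70 (about 49%).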
A Reconsideration of Testing for Competence Rather Than for Intelligence
Gerald V. Barrett and Robert L. Depinet
The University of Akron
Correspondence concerning this article should be addressed to Gerald V. Barrett, Department of Psychology, The University of Akron, Buchtel College of Arts and Sciences, Akron, OH 44325-4301.
October 1991 • American Psychologist

David C. McClelland's 1973 article has deeply influenced both professional and public opinion. In it, he presented five major themes: (a) Grades in school did not predict occupational success, (b) intelligence tests and aptitude tests did not predict occupational success or other important life outcomes, (c) tests and academic performance only predicted job performance because of an underlying relationship with social status, (d) such tests were unfair to minorities, and (e) "competencies" would be better able to predict important behaviors than would more traditional tests. Despite the pervasive influence of these assertions, this review of the literature showed only limited support for these claims.

In 1973, David C. McClelland's lead article in the American Psychologist profoundly affected both the field of psychology and popular opinion. This article was designed to "review skeptically the main lines of evidence for the validity of intelligence and aptitude tests and to draw some inferences from this review as to new lines that testing might take in the future" (p. 1). The main themes he endorsed and continues to promote (e.g., Klemp & McClelland, 1986) have been published widely in newspapers, magazines, and popular books as well as psychology textbooks. Belief in these views, however, has become so widespread that often they are presented as common knowledge (e.g., Feldman, 1990).

Table 1 reviews a number of works that cited McClelland (1973) and shows that the impact of McClelland's article has increased over time.
Soon after the article was published, McClelland's views were integrated into introductory psychology textbooks. By the late 1980s, these themes had become part of generally accepted public opinion, with newspaper and magazine writers commonly citing McClelland as an authority on intelligence testing.

It was McClelland's (1973) belief that intelligence testing should be replaced by competency-based testing. His argument against intelligence testing rested on the assertion that intelligence tests and aptitude tests have not been shown to be related to important life outcomes because psychologists were unable and unwilling to test this relationship. McClelland argued that intelligence tests have been correlated with each other and with grades in school but not with other life outcomes.

McClelland (1973) stated that intellectual ability scores and academic performance were the result of social status, and he labeled them a sort of game. He asserted that a test must resemble job performance or other criteria to be related to the performance on the criteria. He also claimed that intelligence and aptitude testing were unfair to minorities. He advocated that the profession should focus on what he termed competency testing and criterion sampling, maintaining that intelligence testing and aptitude testing should be discarded.

The main points of McClelland's (1973) article can be summarized in the following five themes: (a) Grades in school did not predict occupational success, (b) intelligence tests and aptitude tests did not predict occupational success or other important life outcomes, (c) tests and academic performance only predicted job performance as a result of an underlying relationship to social status, (d) traditional tests were unfair to minorities, and (e) "competencies" would more successfully predict important behaviors than would more traditional tests.

In the present article, these themes are examined through a comprehensive review of relevant literature.
Although McClelland's (1973) article contained many subthemes, only those themes we believe to be the main issues are addressed here. This does not imply, however, that we agree with any aspects of McClelland's article that are not addressed here.

Do Grades Predict Occupational Success?

McClelland (1973) claimed that "the games people are required to play on aptitude tests are similar to the games teachers require in the classroom" (p. 1). As evidence, McClelland presented four citations that he interpreted as support for his position, while ignoring disconfirming evidence. He also included his personal experiences at Wesleyan University as evidence, maintaining that "A" students could not be distinguished from barely passing students in later occupational success. This finding differs greatly from that found in a similar, more scientific comparison done by Nicholson (1915) at the same school. Nicholson found that academically exceptional students were much more likely to achieve distinction in later life.
The results of Nicholson's study are summarized in Table 2.

Table 1
Support for McClelland's (1973) Concepts in Newspapers, Magazines, Popular Books, and Textbooks

Newspapers
• New York Times, Goleman (1988): IQ tests severely limited as predictors of job success
• New York Times, Goleman (1984): Intelligence unrelated to career success
• Plain Dealer, Drexler (1981): Tests unrelated to accomplishments in leadership, arts, science, music, writing, speech, and drama; tests discriminate by culture

Magazines
• Atlantic Monthly, Fallows (1985): Promotes replacing aptitude tests with competence tests
• Psychology Today, Goleman (1981): Tests and grades are unrelated to career success
• Psychology Today, Koenig (1974): Tests and grades have less value than competence tests

Popular books
• More Like Us, Fallows (1989): Tests and grades are useless as predictors of occupational success
• Whiz Kids, Machlowitz (1985): Bright people do not do better in life

Psychology texts
• Psychology: An Introduction, Morris (1990): IQ and grades are unrelated to occupational success
• Introduction to Psychology, Coon (1986): IQ does not predict important behaviors or success
• Psychology: Being Human, Rubin & McNeil (1985): Suggests replacing IQ tests with competence tests
• Psychology, Crider, Goethals, Kavanaugh, & Solomon (1983): Tests are unfair by race and socioeconomic status
• Understanding Human Behavior, McConnell (1983): Ability is unrelated to career success
• Elements of Psychology, Krech & Crutchfield (1982): Tests and grades are unrelated to life outcomes
• Essentials of Psychology, Silverman (1979): Testing results in categorical labels
• Psychology: An Introduction, Mussen & Rosenzweig (1977): Test scores are unrelated to job success
• Introductory Psychology, Davids & Engen (1975): Suggests replacing IQ tests with competence tests

Some limitations do exist when grades are used as predictors.
Grades vary greatly among disciplines (Barrett & Alexander, 1989; Elliott & Strenta, 1988; Schoenfeldt & Brush, 1975) as well as among colleges (Barrett & Alexander, 1989; Humphreys, 1988; Nelson, 1975). Because different students usually take different courses, the reliability of grades is relatively low unless a common set of courses is taken (Butler & McCauley, 1987). Despite these shortcomings, a number of meta-analyses have shown that grades do have a small-to-moderate correlation with occupational success (Cohen, 1984; Dye & Reck, 1988, 1989; O'Leary, 1980; Samson, Graue, Weinstein, & Walberg, 1984). Despite an overlap among the data used by these studies and variability among results (r = .15 to .29), they all reached similar conclusions. A wide variety of measures of occupational success such as salary, promotion rate, and supervisory ratings have been positively related to grade point average.

Table 2
Success of Wesleyan Graduates

Classes/academic standing: percentage who achieved distinction in later life

1831-1859
• Valedictorians and salutatorians: 49
• Phi Beta Kappa: 31
• No scholarly distinction: 6

1860-1889
• Highest honors: 47
• Phi Beta Kappa: 31
• No scholarly distinction: 10

1890-1899
• Highest honors: 60
• Phi Beta Kappa: 30
• No scholarly distinction: 11

Note. Adapted from "Success in college and in later life" by F. W. Nicholson, 1915, School and Society, 12, pp. 229-232. In the public domain.

The results of these meta-analyses reflect the diverse individual studies that showed a relationship between academic performance and occupational success. This relationship may have stemmed from underlying associations between academic performance and intellectual ability, motivation (Howard, 1986), and attitudes toward work (Palmer, 1964). Hunter (1983, 1986) supported this possibility by demonstrating through path analysis that higher ability led to increased job knowledge, which in turn led to better job performance.
This relationship was true at all educational levels, including medical school graduates, graduate-level MBAs, college graduates in both engineering and liberal arts, technical school graduates, and high school graduates in the United States and in other countries, such as Sweden (Husen, 1969). The correlations between grades and occupational success have ranged from .14 to .59. However, some research has indicated that these relationships were underestimated because the range on the predictor grades was restricted (Dye & Reck, 1989; Elliott & Strenta, 1988). Even when limitations are considered, both meta-analyses and diverse individual studies showed grades as predictors of occupational success.

Do Intelligence Tests and Aptitude Tests Relate to Job Success or Other Life Outcomes?

Thorndike and Hagen's (1959) study was McClelland's (1973) central evidence that aptitude tests did not predict occupational success. The Thorndike and Hagen study involved more than 12,000 correlations between aptitude tests and various measures of occupational success for more than 10,000 individuals. They concluded that the number of significant correlations did not exceed the number that would be expected by chance. From these results, McClelland concluded that "in other words, the tests were invalid" (p. 3).

This characterization of the research by Thorndike and Hagen (1959) has often been quoted as proof that aptitude tests cannot predict job success (Haney, 1982; Nairn, 1980). However, McClelland (1973) did not address some extremely important points.

Perhaps the most basic point overlooked was that aptitude tests did, in fact, predict success for those professionals for whom they were designed, namely, pilots and navigators.
The test battery consisted of dial and table reading, speed of identification, two-hand coordination, complex coordination, rotary pursuit, finger dexterity, aiming stress, discrimination in reaction time, reading comprehension, mathematics, numerical operations, and mechanical principles (Dubois, 1947). All of these tests were specifically designed to predict success in avionics, and the content of these tests was directly related to that field. The mechanical principles test, for example, asked the direction of the wind as shown by a wind sock.

The validity of the test battery was demonstrated during World War II (Dubois, 1947) when an unscreened group was used as part of the validation process. Of those who failed the test battery, only 8.6% subsequently graduated from training (45 of 520), and no one in the lowest stanine (150 subjects) graduated. Conversely, 85% of those in the upper stanines graduated (Dubois, 1947).

McClelland (1973) was concerned that cultural bias was present in aptitude tests. The avionics battery studied by Thorndike and Hagen (1959) was used to predict the success of pilots during World War II (Dubois, 1947) and included West Point cadets, Chinese people, women, and Blacks as subjects. The battery was found valid for all of these groups. This agrees with later findings that, in general, aptitude tests are valid for all groups (Boehm, 1972; Hunter, Schmidt, & Hunter, 1979; Hunter, Schmidt, & Rauschenberger, 1984).

Thorndike and Hagen (1959) surveyed a sample of individuals who had taken the pilot and navigators test battery in 1943. The respondents, who ranged in age from 18 to 26 years at the time of testing, were asked to supply self-report data in seven areas, including monthly income in 1955. Validity coefficients were then computed between results on the avionics test battery and self-reported income.

This validation procedure contained obvious flaws.
The eight-year age range among subjects influenced the job experience of the respondents. Some respondents were well established in their careers; others were only beginning. Differences in job experience would translate into wide salary differences, even within the same occupation, contaminating the criterion measure.

The respondents were in diverse occupations and were dispersed geographically throughout the United States. Even if the avionics test had been appropriate for predicting the success of both an English academic and a physician, and even if they were the same ages at the time the salary data were collected, the differences in mean occupational salary would obscure any potential relationship.

While McClelland (1973) was claiming that the avionics battery was invalid for predicting occupational success, other researchers using the same data set as Thorndike and Hagen (1959) refined the procedure and obtained additional criterion data in 1969 (Beaton, 1975; Hause, 1972, 1975; Taubman & Wales, 1973, 1974). These researchers determined that the numerical aptitude factor, derived by factor analysis, was positively related to later income. These studies also showed that this relationship increased over time as the former aviators and navigators matured in their respective occupations. When the data were broken down by occupation, those respondents scoring in the top one tenth in numerical ability earned 30% more than those scoring in the bottom four tenths. When ability was held constant, education was not a significant factor in relation to earnings (Taubman & Wales, 1974).

Taubman and Wales (1974) found that those with scores in the top ability level within each educational category (from high school through professional education) had considerably higher salaries than those at the lowest ability level.
For individuals with master's degrees, those scoring in the bottom one fifth averaged an annual salary of $14,000, whereas those in the top one fifth averaged $22,200.

Comparable results were obtained in a longitudinal study in Sweden over a 26-year period (Husen, 1969). Men included in the group with the highest intellectual ability, when tested at age 10, earned twice the income of those in the lowest category, a practical and significant difference in income. The evidence presented here leads to the inevitable conclusion that intelligence tests and aptitude tests are positively related to job success.

Recent Evidence

Many researchers have tested the relationship between cognitive ability and job performance using meta-analytic techniques. Data from approximately 750 studies on the General Aptitude Test Battery (GATB) showed that the test validly predicted job performance for many different occupations (Hartigan & Wigdor, 1989). Hunter and Hunter's (1984) meta-analysis demonstrated that in entry-level positions, cognitive ability predicted job performance with an average validity of .53. This study also showed an average correlation of .45 between intellectual ability and job proficiency. Other studies using a number of different measures of job proficiency have found similar relationships to cognitive ability (Distefano & Pryer, 1985; Hunter, 1983, 1986; Pearlman, Schmidt, & Hunter, 1980; Schmidt, Hunter, & Caplan, 1981).

McClelland (1973) implied that supervisors' ratings were biased. However, research has shown that the sex and race of either the rater or ratee do not exert important influence on ratings (Pulakos, White, Oppler, & Borman, 1989). More objective criterion measures produced even higher validity coefficients with aptitude test scores. In Nathan and Alexander's (1988) meta-analysis, the criteria of ratings, rankings, work samples, and production quantities all resulted in high test validities.
Production quantity and work sample criteria resulted in substantial validity coefficients, negating McClelland's claim that validity coefficients were obtained only by using biased supervisory ratings. In fact, Smither and Reilly (1987) found that the intelligence of the rater was related to the accuracy of job performance ratings.

In a study using path analysis, Schmidt, Hunter, and Outerbridge (1986) found that cognitive ability correlated with job knowledge (.46), work samples (.38), and supervisory ratings (.16). They concluded that cognitive ability led to an increase in job knowledge, a position also supported by Gottfredson (1986).

Practical Tasks

To support his assertion that intelligence was not applicable to employment situations, McClelland (1973) stated that intelligence as measured in aptitude and intelligence testing was not useful in practical, everyday situations. Schaie (1978) explored this theory, describing the issues that must be addressed to attain external validity. He suggested that criteria should include actual real-world tasks. Willis and Schaie (1986) tested this proposition on older adults. Both the individuals tested and the criterion tasks used in the study, such as ability to comprehend the label on a medicine bottle or to understand the yellow pages of the telephone directory, differed substantially from typical academic tasks. According to McClelland's view, a relationship should not exist between mental abilities, such as fluid and crystallized intelligence, and performance on the eight categories of real-life tasks used by Willis and Schaie.

This idea was not supported by the study results. An extremely high relationship existed between intelligence and performance on real-life tasks. Intellectual ability accounted for 80% of the variance in task performance (Willis & Schaie, 1986).
In a second study, they again found intellectual ability to be related to both self-perceived performance and the ratings assigned by judges for performing a number of practical tasks. These results were replicated on several samples of older adults (Schaie, 1987).

Correlations between performance and scores on intelligence and aptitude tests are supported in other, more unstructured and ambiguous situations, including business management (Bray & Grant, 1966; Campbell, Dunnette, Lawler, & Weick, 1970; Siegel & Ghiselli, 1971), performance in groups (Mann, 1959), and success in science (Price, 1963). Michell and Lambourne (1979) studied 16-year-old students and found that those with higher cognitive ability were better able to answer open-ended questions. Students with higher cognitive ability were also able to sustain discussion longer, ask more interpretive questions, and achieve a more complex understanding of issues. In addition, intelligence has been shown to be related to musical ability (Lynn & Gault, 1986) and creativity (Cropley & Maslany, 1969; Drevdahl & Cattell, 1958; Hocevar, 1980; MacKinnon, 1962; McDermid, 1965; Richards, Kinney, Benet, & Merzel, 1988). From examining these studies, we find cognitive ability to be positively related to a variety of real-world behaviors.

Summary

A review of the relevant literature shows that intelligence tests are valid predictors of job success and other important life outcomes. Cognitive ability is the best predictor of performance in most employment situations (Arvey, 1986; Hunter, 1986), and this relationship remains stable over extended periods of time (Austin & Hanisch, 1990). Using samples of the size usually found in personnel work, Thorndike (1986) concluded that cognitive "g" is the best predictor of job success.
Ironically, this was the same author whose earlier study was presented in McClelland's (1973) article as evidence that aptitude tests cannot be used to predict job performance.

The evidence from these varied scientific studies leads again and again to the same conclusion: Intelligence and aptitude tests are positively related to job performance.

Is There an Artifactual Relationship Between Intellectual Ability and Job Success Based on Social Status?

A major part of McClelland's (1973) argument against the use of intelligence or aptitude tests was his claim that "the tests are clearly discriminatory against those who have not been exposed to the culture, entrance to which is guarded by the tests" (p. 7). Available scientific evidence has refuted this contention; IQ is related to occupational success. However, McClelland maintained that "the correlation between intelligence test scores and job success often may be an artifact, the product of their joint association with class status" (p. 3).

Despite the numerous ways of defining socioeconomic status (SES), we will show that occupational success is primarily a result of individual cognitive ability and education, both factors that are relatively independent of social origin. We will also show that the strength of the relationship between IQ and job success is not strongly related to the social prestige of particular careers, regardless of variations between occupations. We agree with Gottfredson (1986) that it is more useful to focus on areas such as individual ability rather than irrelevant SES factors, such as family income, over which individuals have no control.

Definition of Socioeconomic Status

McClelland's (1973) definition of SES differs considerably from those used by other researchers. To McClelland, socioeconomic status belongs to the power elite: those who have credentials, power, pull, opportunities, values, aspirations, money, and material advantages.
Some of these factors (e.g., values and aspirations) have been shown to be related to later success (Sewell & Hauser, 1976). They have not been described as socioeconomic status by other researchers, however, because these factors do not belong exclusively to the wealthy (Greenberg & Davidson, 1972).

McClelland (1973) also described SES in terms of income. Other researchers in the area (e.g., Scarr & Weinberg, 1978; Sewell & Hauser, 1976) have found income to have weak connections with later success, with correlations of only .17 between the adult's income and the income of his or her parents (Sewell & Hauser, 1976). These findings are consistent with Alwin and Thornton (1984) and Williams (1976), who found correlations between .12 and .25 between family income and the intelligence of the children. Although variation exists in the correlations found, none of the results supported McClelland's view of strong financial effects.

Some variables that have been examined as operational measures of SES include family structure, dwelling conditions, and school attendance record (Greenberg & Davidson, 1972); number of siblings in the family, region of residence, and size of community (Peterson & Karplus, 1981); number of people per room in the home (Greenberg & Davidson, 1972; Herzog, Newcomb, & Cisin, 1972); mother's educational level (Herzog et al., 1972; Peterson & Karplus, 1981; Sewell & Hauser, 1976; Willerman, 1979); father's educational level (Duncan, Featherman, & Duncan, 1972; Peterson & Karplus, 1981; Sewell & Hauser, 1976; Willerman, 1979); father's occupation (Duncan et al., 1972; Greenberg & Davidson, 1972; Peterson & Karplus, 1981; Sewell & Hauser, 1976; Willerman, 1979); family income (Peterson & Karplus, 1981; Sewell & Hauser, 1976); and median neighborhood income and educational level (Scarr, 1981). Socioeconomic status has often been operationally defined as a combination of these factors.
Because SES has been defined in so many ways, the specific variables explored were theoretically more important and practical than the general term socioeconomic status.

Effects of Socioeconomic Status Variables

Measures described as SES, such as parental education, have been related to children's success (Duncan et al., 1972; Scarr & Weinberg, 1978; Sewell & Hauser, 1976). These factors were most likely proxies for explanatory factors such as orderliness in the home and the value placed on education. Studies show that parental background variables make little contribution to the distribution of individuals to occupations, whereas years of education and cognitive ability make a large contribution (Duncan et al., 1972; Gottfredson & Brown, 1981). A well-known longitudinal study (Vaillant, 1977) found that broad measures of SES before an individual's enrollment in college had no relation to outcome variables 30 years later. However, among people of equal ability, the most significant predictor of adult occupational achievement was the parents' attitude toward school and education (Kraus, 1984).

The operational measures of SES that have been found to be important determinants of later outcomes (e.g., values and attitudes) were factors that could be influenced. Even the poorest of families could develop and use these factors to benefit their children (Greenberg & Davidson, 1972). Unfortunately, some families are so destitute that their environment would not even be considered humane, and this deprivation would have detrimental effects on later accomplishments. For the vast majority of people in all socioeconomic and racial subgroups, however, this is not the case (Scarr, 1981). Education and measured cognitive ability were shown to be more important to later outcomes than were such factors as income. However, the effect of SES on these variables must be examined further.

Test performance.
Oakland (1983) found that the relationship between IQ scores and achievement test performance was the same across SES levels. A factor analysis of ability measures in different SES groups showed that factor structure was not contingent on SES (Humphreys & Taber, 1973). Spaeth (1976) and Valencia, Henderson, and Rankin (1985) found that the effects of parental SES on a child's IQ score were mediated by family interaction and exposure to stimuli provided by parents. In addition, Spaeth concluded that parental influence was a great deal more important than that of teachers and schools; the effects of the latter were much less personal and direct. He concluded that the direct effect of parental SES on a child's IQ was -.03. In related research, SES has not been found to have a significant effect on the IQ scores of adult, adopted twins reared apart (Bouchard, Lykken, McGue, Segal, & Tellegen, 1990).

Simple measures of SES did not adequately capture the parts of the environment that produced individual differences, even within families (Mercy & Steelman, 1982; Rowe & Plomin, 1981). Even such simple, specific variables as amount of time spent on homework and amount of time spent watching TV on weekdays were related in the expected direction to performance on academic achievement tests (Keith, Reimers, Fehrmann, Pottebaum, & Aubey, 1986). Ultimately, parents could help children learn to cope with cognitive complexity, an effect independent of SES (Spaeth, 1976).

College attendance. Contrary to McClelland's (1973, p. 3) assertion that entrance into prestigious jobs was based on social background, entrance into higher status jobs has instead been shown to be primarily determined by educational attainment (Alexander & Eckland, 1975; Bajema, 1968; Gottfredson & Brown, 1981; Schiefelbein & Farrell, 1984; Sewell & Hauser, 1976).
Therefore, what determines attendance at college is very important. McClelland (1973) stated that an individual's socioeconomic class was the primary factor in determining his or her ability to attend college. Research has shown the flaws in this assertion. Although socioeconomic background is associated with college attendance, other factors are
Decay of correlation (mathematical term)

Decay of correlation refers to the decrease in correlation between two variables as the distance between them increases. It is a mathematical concept used to quantify the relationship between two variables across different spatial or temporal distances.

1. The decay of correlation between rainfall and crop yield was observed as the distance between the two fields increased.
2. The study analyzed the decay of correlation between interest rates and stock market performance over a one-year timespan.
3. As the distance between two cities increased, the decay of correlation between their population sizes became more noticeable.
4. The researchers used statistical methods to determine the decay of correlation between air pollution and respiratory diseases in different neighborhoods.
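The examples above describe correlation falling off with spatial or temporal distance. A minimal numeric illustration, under the assumption of a stationary first-order autoregressive (AR(1)) time series X_t = phi*X_{t-1} + noise, is that the lag-k autocorrelation equals phi**k, so correlation decays geometrically with temporal distance. The function name and parameter values below are purely illustrative:

```python
def ar1_autocorrelation(phi: float, lag: int) -> float:
    """Lag-k autocorrelation of a stationary AR(1) process X_t = phi*X_{t-1} + noise.

    For |phi| < 1 the process is stationary and corr(X_t, X_{t+k}) = phi**k.
    """
    if not -1 < phi < 1:
        raise ValueError("stationarity requires |phi| < 1")
    return phi ** lag

# Correlation shrinks geometrically as the temporal distance (lag) grows:
acf = [ar1_autocorrelation(0.8, k) for k in range(5)]
# approximately [1.0, 0.8, 0.64, 0.51, 0.41]
```

The same qualitative picture (monotone decay toward zero) is what the four example sentences describe in words.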
Correlation: a standard overview

Correlation is a statistical measure that expresses the extent to which two variables are linearly related. It is a value between -1 and 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.

The correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. The covariance is a measure of how much the two variables vary together, and the standard deviation is a measure of how much each variable varies on its own.

Correlation is a useful tool for understanding the relationship between two variables. It can be used to identify trends, make predictions, and test hypotheses. However, it is important to note that correlation does not imply causation. Just because two variables are correlated does not mean that one causes the other.

Types of Correlation

There are three main types of correlation:

Positive correlation: This type of correlation occurs when two variables increase or decrease together. For example, the number of hours you study for a test and your score on the test are positively correlated.

Negative correlation: This type of correlation occurs when one variable increases and the other variable decreases. For example, the amount of money you spend on gas and your car's gas mileage are negatively correlated.

No correlation: This type of correlation occurs when there is no relationship between two variables. For example, a person's shoe size and his or her taste in music are not correlated.

Strength of Correlation

The strength of a correlation is determined by the absolute value of the correlation coefficient. The closer the correlation coefficient is to 1 or -1, the stronger the correlation. A correlation coefficient of 0 indicates that there is no correlation between the two variables.

Significance of Correlation

The significance of a correlation is determined by the p-value.
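The definition above (covariance divided by the product of the standard deviations) can be sketched in a few lines of plain Python. The helper name `pearson_r` and the sample data are illustrative, not from any standard library:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

study_hours = [1, 2, 3, 4, 5]
test_scores = [52, 55, 61, 70, 72]              # rises with study time
print(pearson_r(study_hours, test_scores))      # close to +1 (strong positive)
print(pearson_r(study_hours, [30, 25, 21, 14, 11]))  # close to -1 (strong negative)
```

Note that population (divide by n) rather than sample (divide by n-1) moments are used here; the n's cancel, so the correlation coefficient is the same either way.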
The p-value is the probability of obtaining a correlation coefficient as large as or larger than the one that was observed, assuming that there is no correlation between the two variables. A p-value less than 0.05 is conventionally considered statistically significant.

Correlation Analysis

Correlation analysis is a statistical technique that is used to identify and measure the relationship between two or more variables. Correlation analysis can be used to:

Identify trends.
Make predictions.
Test hypotheses.
Control for confounding variables.

Correlation analysis is a valuable tool for understanding the relationships between variables. However, it is important to note that correlation does not imply causation.
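One assumption-light way to obtain such a p-value is a permutation test: shuffle one variable many times and count how often the shuffled data produce a correlation at least as strong as the observed one. This is a sketch of the idea, not the t-distribution test most statistical software reports, and all names here are illustrative:

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: the fraction of random shuffles of y whose
    |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return hits / n_perm

x = list(range(20))
y = [2 * v + (-1) ** v for v in x]   # strongly linear with a small wiggle
p = permutation_p_value(x, y)
# p is near 0: a correlation this strong almost never arises from shuffled data
```

A small p-value says only that the observed correlation is unlikely under the no-correlation assumption; it says nothing about causation.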
Embarrassing to admit, but only today did I notice that, in statistics, association and correlation are not the same thing. Although I have been teaching for many years, I always assumed that association and correlation meant the same thing, merely rendered differently in Chinese translation. Today something felt off, so I looked into it, and it turns out the difference between the two is actually quite large.
The original English text is as follows:

Association vs Correlation

Association and correlation are two methods of explaining a relationship between two statistical variables. Association is the more general term, and correlation can be considered a special case of association in which the relationship between the variables is linear in nature.

What is Association?

The statistical term association is defined as a relationship between two random variables which makes them statistically dependent. It refers to a rather general relationship, without the specifics of the relationship being stated, and it need not be a causal relationship. Many statistical methods are used to establish the association between two variables. Pearson's correlation coefficient, the odds ratio, distance correlation, Goodman and Kruskal's lambda, and Spearman's rho (ρ) are a few examples.

What is Correlation?

Correlation is a measure of the strength of the relationship between two variables. The correlation coefficient quantifies the degree of change of one variable based on the change of the other variable. In statistics, correlation is connected to the concept of dependence, which is the statistical relationship between two variables. Pearson's correlation coefficient, or just the correlation coefficient r, is a value between -1 and 1 (-1 ≤ r ≤ +1). It is the most commonly used correlation coefficient and is valid only for a linear relationship between the variables. If r = 0, no linear relationship exists; if r > 0, the relationship is directly proportional, and the value of one variable increases with the increase in the other.
If r < 0, the relationship is inversely proportional; one variable decreases as the other increases. Because of the linearity condition, the correlation coefficient r can also be used to establish the presence of a linear relationship between the variables.

Spearman's rank correlation coefficient and Kendall's rank correlation coefficient measure the strength of the relationship without the linearity requirement. They consider the extent to which one variable increases or decreases with the other. If both variables increase together, the coefficient will be positive; if one variable increases while the other decreases, the coefficient will be negative. The rank correlation coefficients are used just to establish the type of the relationship, not to investigate it in the detail that Pearson's correlation coefficient allows. They are also used to reduce the calculations and to make the results more independent of the non-normality of the distributions considered.

What is the difference between Association and Correlation?

· Association refers to the general relationship between two random variables, while correlation refers to a more or less linear relationship between the random variables.
· Association is a concept; correlation is a measure of association, and mathematical tools are provided to measure the magnitude of the correlation.
· Pearson's product-moment correlation coefficient establishes the presence of a linear relationship and determines the nature of the relationship (whether the variables are proportional or inversely proportional).
· Rank correlation coefficients are used to determine the nature of the relationship only, excluding the linearity of the relation (it may or may not be linear, but they will tell whether the variables increase together, decrease together, or one increases while the other decreases).
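The contrast between Pearson's r and a rank correlation can be checked in plain Python. Spearman's rho is simply the Pearson correlation of the ranks, so a monotonic but nonlinear relationship gives rho = 1 while r stays below 1. The helper names are illustrative, not library calls:

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1      # average of tied positions, 1-based
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]      # monotonic but strongly nonlinear
# spearman_rho(x, y) is 1 (perfect monotonic relation);
# pearson_r(x, y) is high but clearly below 1.
```

This is exactly the distinction drawn above: the rank coefficient reports only that the variables rise together, while Pearson's r additionally penalizes departure from linearity.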
Correlation Analysis (Correlate)

Correlation and dependence

In statistics, correlation and dependence are any of a broad class of statistical relationships between two or more random variables or observed data values. Correlation is computed into what is known as the correlation coefficient, which ranges between -1 and +1. Perfect positive correlation (a correlation coefficient of +1) implies that as one security moves, either up or down, the other security will move in lockstep, in the same direction. Alternatively, perfect negative correlation means that if one security moves in either direction, the security that is perfectly negatively correlated will move by an equal amount in the opposite direction. If the correlation is 0, the movements of the securities are said to have no correlation; they are completely random.

There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation, or more sensitive to nonlinear relationships. Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (τ), measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship. If, as the one variable increases, the other decreases, the rank correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions.
However, this view has little mathematical basis, as rank correlation coefficients measure a different type of relationship than the Pearson product-moment correlation coefficient, and are best seen as measures of a different type of association rather than as an alternative measure of the population correlation coefficient.

Common misconceptions

Correlation and causality

The conventional dictum that "correlation does not imply causation" means that correlation cannot by itself be used to infer a causal relationship between the variables.

Correlation and linearity

[Figure: four sets of data with the same correlation of 0.816]

The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of Y given X, denoted E(Y|X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y|X). The figure shows scatterplots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe. The four y variables have the same mean (7.5), variance (4.12), correlation (0.816), and regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distributions of the variables are very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two correlated variables following the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear. In this case the Pearson correlation coefficient does not indicate that there is an exact functional relationship: only the extent to which that relationship can be approximated by a linear relationship.
In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows a case in which one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear. (An outlier can thus either lower or raise the correlation of a data set.)
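Anscombe's point is easy to verify numerically. Using the published quartet values, all four pairs give (to within rounding) the same Pearson correlation of about 0.816, even though their scatterplots look nothing alike. `pearson_r` here is a hand-rolled helper, not a library call:

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Anscombe's quartet (datasets I-III share the same x values):
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
for xs, ys in quartet:
    print(round(pearson_r(xs, ys), 2))   # 0.82 for all four data sets
```

The identical coefficients despite wildly different shapes are exactly why plotting the data, not just computing r, is essential.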
A Tutorial on Principal Component Analysis

By Y. Zee Ma, Schlumberger, Denver, CO

Principal component analysis (PCA), introduced by Pearson (1901), is an orthogonal transform of correlated variables into a set of linearly uncorrelated variables - principal components (PCs). Each principal component is a linear combination of weighted original variables. The number of principal components is equal to the number of original variables, but the number of meaningful PCs might be fewer, depending on the correlations between the original variables. The transform is defined in such a way that the first principal component represents as much of the variability in the data as possible, under the condition of orthogonality between any pair of components. Each succeeding component in turn has the highest variance possible not accounted for by the preceding PCs, under the orthogonality condition. Hence, principal components are uncorrelated with each other.

PCA is mathematically defined as a linear transform that converts the data to a new coordinate system such that the first principal component lies on the coordinate that has the largest variance by projection of the data (Fig. 1a), the second principal component lies on the coordinate with the second largest variance, and so on. The procedure includes several steps:

(1) Calculating the (multivariate) covariance or correlation matrix from the sample data,
(2) Computing eigenvalues and eigenvectors of the covariance or correlation matrix, and
(3) Generating the PCs; each PC is a linear combination of optimally weighted original variables, such as:

P_i = b_i1 X_1 + b_i2 X_2 + ... + b_ik X_k    (1)

where P_i is the i-th principal component and b_ik is the weight (some call it a regression coefficient) for the variable X_k. It is often convenient that all the variables X_k are normalized to zero mean and unit standard deviation. The weights b_ik are calculated using the covariance or correlation matrix.
As the covariance or correlation matrix is symmetric positive semidefinite, it yields an orthogonal basis of eigenvectors, each of which has a nonnegative eigenvalue. These eigenvectors correspond to the principal components, and the eigenvalues to the variances explained by the principal components. For more mathematical insights into PCA, readers can refer to Basilevsky (1994), Everitt and Dunn (2002), and Abdi and Williams (2010).

PCA is a non-parametric statistical method and it provides analytical solutions based on linear algebra; statistical moments, such as the mean and covariance, are simply calculated from the data without any assumption. Because of its efficiency in removing redundancy and its capability of extracting interpretable information, PCA has a wide range of applications spanning nearly all industries, from computer vision to neuroscience, from medical data analysis to psychology, from chemical research to seismic data analysis, among others. In fact, PCA is one of the most used multivariate statistical tools; with the explosion of data in modern society, its application is ever increasing.

A simple bivariate example with two petrophysical variables, neutron and density (RHOB), is presented here to illustrate the method. The two PCs from PCA of the neutron and RHOB logs are overlain on the neutron-RHOB crossplots (Figs. 1a and 1b). The first PC (PC1) represents the major axis that describes the maximum variability of the data, and the second PC (PC2) represents the minor axis that describes the remaining variability not accounted for by the first PC. In this example, the major axis, PC1, approximately represents porosity, and the minor axis, PC2, approximately represents the lithology. This explains why lithofacies clustered by artificial neural networks (ANN) or statistical clustering methods using PC1 are not good (Fig. 1c), whereas lithofacies clustered using PC2 are more consistent with the benchmark chart (Fig. 1d).
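For the two-variable case used in this tutorial, the whole procedure (correlation matrix, eigendecomposition, PC construction) collapses to closed form: the 2x2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r with fixed eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The sketch below uses synthetic data standing in for the neutron/RHOB logs; all names are illustrative:

```python
from math import sqrt

def standardize(v):
    """Normalize to zero mean and unit standard deviation."""
    n = len(v)
    m = sum(v) / n
    s = sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / s for a in v]

def pca_2var(x, y):
    """PCA of two variables via the 2x2 correlation matrix [[1, r], [r, 1]].

    Its eigenvalues are 1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2)
    and (1, -1)/sqrt(2); for r > 0 the first pair is the major axis (PC1).
    Returns (pc1, pc2, r)."""
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(zx)   # correlation coefficient
    w = 1 / sqrt(2)
    pc1 = [w * (a + b) for a, b in zip(zx, zy)]        # variance 1 + r
    pc2 = [w * (a - b) for a, b in zip(zx, zy)]        # variance 1 - r
    # Note: the standardized data are exactly recoverable, e.g. zx = w*(pc1 + pc2),
    # which is the 2-variable analogue of the reconstruction equation (Eqn. 2).
    return pc1, pc2, r

# Two highly correlated synthetic "logs":
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
pc1, pc2, r = pca_2var(x, y)
# pc1 carries most of the variance; pc1 and pc2 are uncorrelated.
```

With more than two variables the eigenvectors are no longer fixed and a numerical eigendecomposition (e.g., of the covariance or correlation matrix) is needed, but the structure of the result is the same.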
In many other cases, however, major PCs, such as PC1, are important; sometimes, lithofacies classification using PC1 alone is good enough (Ma et al., 2011).

Principal components can be rotated to align with a physically more meaningful variable. This can be illustrated with a bivariate example, in which the two original variables are equally weighted in the principal components before rotation. In the neutron-RHOB analysis, neutron and RHOB contribute equally to both PC1 and PC2. However, if, for example, neutron is more important than RHOB for porosity determination, PC1 can be rotated to correlate more highly with neutron. Fig. 1e shows a component rotated from PC1 that has an increased correlation to neutron and a decreased correlation to RHOB (Table 1). Similarly, if RHOB is more important than neutron in determining lithofacies, PC2 can be rotated to reflect that. Fig. 1f shows a component rotated from PC2 that has an increased correlation to RHOB and a decreased correlation to neutron (Table 1). The two rotated components do not have to be orthogonal, as shown in this example. The main criterion of rotation is to make a component physically meaningful.

Fig. 1 Illustrating two principal components from PCA of neutron and density (RHOB) on neutron-RHOB or their PC1-PC2 crossplots. (a) Overlay of PC1 on the neutron-RHOB crossplot (arrow indicates the coordinate on which PC1 is defined). (b) Overlay of PC2 on the neutron-RHOB crossplot (arrow indicates the coordinate on which PC2 is defined). (c) PC1-PC2 crossplot (their correlation is zero). (d) Overlay of lithofacies clustered by ANN using PC1 on the neutron-RHOB crossplot (red: sandstone, green: limestone, and blue: dolostone). (e) Overlay of lithofacies clustered by ANN using PC2 on the neutron-RHOB crossplot.
(f) Overlay of a rotated PC2 on the neutron-RHOB crossplot.

Table 1 Correlation matrix between pairs of six variables: neutron (NPHI), density (RHOB), their principal components (PC1 and PC2), and two rotated components (PC1_rotated and PC2_rotated).

The original data can be reconstructed from the principal components. The general equation for reconstructing the original data can be expressed in the following matrix formulation:

D = P C^t Σ + u M^t    (2)

where D is the reconstructed data matrix of size n×k (n being the number of samples, k being the number of variables), P is the matrix of principal components of size n×q (q is the number of PCs, equal to or less than k), C is the matrix of correlation coefficients between the PCs and the variables, of size k×q, t denotes the matrix transpose, Σ is the diagonal matrix that contains the standard deviations of the variables, of size k×k, u is a unit vector of size n, and M is the vector that contains the mean values of the variables, of size k.

When data are highly correlated, a small number of PCs out of all the PCs can reconstruct the data quite well. PCA is highly efficient in removing redundancy, which is highlighted by the following seismic amplitude versus offset (AVO) example (Fig. 2a). Consider different offsets as variables and common mid-points as observations or samples. The first principal component (Fig. 2b) from PCA represents more than 99.6% of the variance explained and can be used to reconstruct the original data. This is done simply by 1D vector multiplication of PC1 (Fig. 2b) and its correlation coefficients to each offset, normalized by the respective standard deviation and mean of each offset (Fig. 2c). The result is very similar to the original AVO data (compare Figs. 2a and 2d). In this example, q is set to 1, as PC1 represents more than 99% of the information in the data. This explains the surprising reconstructed 2D map (Fig. 2d) obtained simply by vector multiplication of two 1D functions of different size (Figs.
2b and 2c) and normalizations by the respective standard deviations and means.

Fig. 2 PCA of AVO data and reconstruction of the AVO data using one PC. (a) Original AVO data. (b) PC1 (as a function of common mid-point, or CMP). (c) Correlations between PC1 and each offset. (d) The reconstructed AVO data using PC1, i.e., vector multiplication of (b) and (c) normalized by the respective standard deviation and mean of each offset (see Eqn. 2).

References

Abdi H, Williams LJ (2010) Principal component analysis. Statistics & Data Mining Series, Vol. 2, John Wiley & Sons, p. 433-459.
Basilevsky A (1994) Statistical Factor Analysis and Related Methods: Theory and Applications. Wiley Series in Probability and Mathematical Statistics.
Benjamini Y (1988) Opening the box of a boxplot. The American Statistician 42(4):257-262.
Everitt BS, Dunn G (2002) Applied Multivariate Data Analysis. 2nd Edition, Arnold Publisher, London.
Ma YZ, Gomez E, et al. (forthcoming) Mixture decomposition and lithofacies clustering using wireline logs. Under review, J. of Applied Geophysics.
Ma YZ, Gomez E (forthcoming) Uses and abuses in applying neural networks for predicting reservoir properties. Under review, J. of Petroleum Sci. & Eng.
Ma YZ (2011) Lithofacies clustering using principal component analysis and neural network: applications to wireline logs. Math. Geosciences 43(4):401-419.
Ma YZ, Gomez E, Young TL, Cox DL, Luneau B, Iwere F (2011) Integrated reservoir modeling of a Pinedale tight-gas reservoir in the Greater Green River Basin, Wyoming. In Y. Z. Ma and P. LaPointe (Eds), Uncertainty Analysis and Reservoir Modeling, AAPG Memoir 96, Tulsa.
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine 2(11):559-572.
Tukey JW (1977) Exploratory Data Analysis. Addison-Wesley.
Correlation

Introduction

Correlation is a statistical measure that determines the degree to which two variables are related to each other. It is an important concept in many fields, including statistics, economics, social sciences, and healthcare. In this document, we will explore the concept of correlation, its types, and its significance in various applications.

What is Correlation?

Correlation quantifies the statistical relationship between two variables. It measures how changes in one variable correspond to changes in another variable. Correlation is typically represented by the correlation coefficient, which ranges from -1 to +1. A positive correlation indicates a direct relationship, while a negative correlation indicates an inverse relationship. A correlation coefficient close to zero indicates a weak or no relationship between the variables.

Types of Correlation

There are three main types of correlation: positive correlation, negative correlation, and zero correlation.

1. Positive Correlation: When two variables increase or decrease together, they are said to have a positive correlation. For example, there is a positive correlation between the amount of study time and test scores. As study time increases, test scores also tend to increase. The correlation coefficient for a positive correlation ranges from 0 to +1.

2. Negative Correlation: In contrast to a positive correlation, a negative correlation exists when one variable increases while the other decreases. For instance, there is a negative correlation between the number of hours spent watching TV and academic performance. As the hours spent watching TV increase, academic performance tends to decrease. The correlation coefficient for a negative correlation ranges from 0 to -1.

3. Zero Correlation: Zero correlation, as the name suggests, implies no relationship between the variables. The changes in one variable do not correspond to any changes in the other variable.
When the correlation coefficient is close to zero, it indicates a weak or no correlation.

Significance of Correlation

Correlation has several practical applications in different fields.

1. Statistics: Correlation analysis is used to determine the strength and direction of the relationship between variables. It helps statisticians understand the patterns and trends in data. Correlation coefficients are widely used in regression analysis and predictive modeling.

2. Economics: In economics, correlation analysis helps to identify relationships between different economic variables, such as inflation and unemployment rates, interest rates and investment, or GDP and consumer spending. Understanding these relationships is essential for making informed economic decisions.

3. Social Sciences: Correlation is used in the social sciences to study various phenomena, such as the relationship between education and income, crime rates and poverty, or health behaviors and disease outcomes. Correlation can provide insights into social trends and patterns.

4. Healthcare: Correlation plays a crucial role in healthcare research. It helps to identify risk factors, assess treatment effectiveness, and understand the relationship between lifestyle choices and health outcomes. For example, studying the correlation between smoking and lung cancer can help healthcare professionals develop effective prevention strategies.

Conclusion

Correlation is a powerful statistical tool that measures the relationship between two variables. It helps us understand how changes in one variable relate to changes in another variable. By analyzing correlation coefficients, we can determine the strength and direction of the relationship. Correlation has wide-ranging applications in statistics, economics, social sciences, healthcare, and other fields. Understanding correlation is essential for making informed decisions and drawing meaningful conclusions from data.
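The three types described above can be checked numerically with a small, self-contained sketch; the data and the `pearson_r` helper are illustrative, not from any library:

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

study_time = [1, 2, 3, 4, 5]
scores     = [55, 60, 64, 71, 75]    # positive: rises with study time
tv_hours   = [5, 4, 3, 2, 1]         # negative: exactly the opposite trend
sym_x      = [-2, -1, 0, 1, 2]
sym_y      = [4, 1, 0, 1, 4]         # zero: y = x**2 is symmetric about 0

r_pos  = pearson_r(study_time, scores)    # between 0 and +1
r_neg  = pearson_r(study_time, tv_hours)  # ~ -1 (perfect negative)
r_zero = pearson_r(sym_x, sym_y)          # 0, despite a clear nonlinear pattern
```

The last case also illustrates the standard caveat: a correlation of zero rules out a linear relationship, not a relationship altogether.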
Intellectual capital reporting in sustainability reports

Lídia Oliveira and Lúcia Lima Rodrigues
School of Economics and Management, University of Minho, Braga, Portugal, and
Russell Craig
College of Business and Economics, University of Canterbury, Christchurch, New Zealand

Abstract

Purpose – The purpose of this paper is to analyse voluntary disclosures of intellectual capital (IC) items in the sustainability reports of Portuguese companies. The paper aims to highlight the level, pattern and determinants of IC disclosures in those sustainability reports, and the potential for sustainability reports to be a medium for IC disclosures.

Design/methodology/approach – An index of voluntary disclosure of intangibles is constructed and deployed to analyse IC disclosures in the sustainability reports for 2006 of Portuguese firms, published on the web site of Portugal's Business Council for Sustainable Development. Four hypotheses are tested about associations between that disclosure index and firm-specific variables.

Findings – Disclosure of information about IC is more likely in sustainability reports of firms that have a higher level of application of the Global Reporting Initiative framework, and are listed companies.

Research limitations/implications – This study is cross-sectional. Subjective judgment is involved in constructing the disclosure index.

Practical implications – The observed level and pattern of disclosure of IC information suggests that the preparation of a sustainability report is an opportune starting point for the development of IC reporting.

Originality/value – The study highlights the determinants of IC disclosures in sustainability reports and the high incidence of such disclosures, and points to the enhancement of legitimacy and reputation as potential incentives for firms to engage in such practice.

Keywords Disclosure, Financial reporting, Intellectual capital, Sustainable development, Portugal
Paper type Research paper

1. Introduction

This study analyses the voluntary disclosure
of intellectual capital (IC) items in the sustainability reports of Portuguese companies. The level, pattern and determinants of IC disclosures in those sustainability reports are highlighted, and attention is drawn to the potential for sustainability reports to be a medium for IC disclosures. A major premise is that firms disclose IC in sustainability reports to improve transparency, legitimise status and enhance reputation. Such a premise accords with the contention by McPhail (2009, p. 804) that knowledge-based firms have strong reasons to improve transparency by disclosing IC information to stakeholders.

The authors are grateful to the Portuguese Foundation for Science and Technology for financial support (project PTDC/GES/64453/2006).

Journal of Intellectual Capital, Vol. 11 No. 4, 2010, pp. 575-594. © Emerald Group Publishing Limited, 1469-1930. DOI 10.1108/14691931011085696

Consistent with the definition provided by Meritum (2002), IC is conceived as the "value-creating" combination of a company's human capital (skills, experience, competence and innovation ability of personnel), structural capital (organisational processes and systems, software and databases, and business processes), and relational capital (all resources linked to the external relationships of the firm with stakeholders, such as customers, creditors, investors, suppliers, etc.). The term "intellectual capital" is used synonymously with "intangibles": both are non-physical sources of future economic benefits that may or may not appear in corporate financial reports (Meritum, 2002).

Most empirical studies of IC disclosures have focused on annual reports (Bozzolan et al., 2003; Guthrie et al., 2006; Oliveira et al., 2006). However, sustainability reports have become an increasingly important medium for company disclosure. Such a communication channel is particularly important in Portugal, since intellectual capital reporting is not
popular. Zambon (2003) and Cordazzo (2005) have drawn attention to the potential for an overlap between IC reporting and sustainability reporting. With this potential in mind, the objectives of this paper are to enquire whether sustainability reports are a mechanism for external disclosure of information on IC, and to ascertain the characteristics of firms that are more likely to provide IC information in sustainability reports.

The analytical framework applied here conceives legitimacy theory and resource-based perspectives as subsidiary elements of a stakeholder meta-narrative in which firms disclose IC information to convey a wider understanding of their performance. They want to increase transparency to satisfy stakeholder expectations, and they seek to generate valuable reputation-related IC by developing and maintaining good relations with stakeholders (Branco and Rodrigues, 2006b). Voluntary disclosure of IC items will help firms enhance their legitimacy and survive (Dowling and Pfeffer, 1975; Woodward et al., 2001).

An analysis of IC disclosures in sustainability reports for 2006, published by Portuguese firms on the web site of Portugal's Business Council for Sustainable Development (BCSD) (Conselho Empresarial para o Desenvolvimento Sustentável), reveals a high incidence of disclosure of IC information. Portuguese firms appear to be more likely to disclose information about IC in their sustainability reports if they have a higher level of application of the Global Reporting Initiative (GRI) framework, and are listed companies. Bivariate analysis also shows that firm size is correlated positively with the level of IC disclosures. Consistent with Pedrini (2007), the authors believe that sustainability reports offer a good and synergistic starting point for the development of IC reporting.

The remainder of this paper is structured as follows. Section 2 presents the theoretical framework, Section 3 develops hypotheses, Section 4 explains the research design, Section 5 presents the
results, and Section 6 draws conclusions and engages in discussion.

2. Theoretical framework
Firms are prompted to increase transparency voluntarily to meet stakeholder expectations. They are motivated to adopt a stakeholder perspective self-interestedly, to benefit from associated reputational effects (a resource) (e.g., Branco and Rodrigues, 2006b). Those reasons seem likely also to influence decisions to issue sustainability reports and to use such reports to disclose IC items. Disclosures of IC information have the potential to be a good benchmark indicator of a firm's capacity to employ the type of resources, systems and technology perceived as conducive to environmentally sustainable operations. Thus, it is plausible to expect sustainability reports to appeal to firms as a medium for disclosure of IC items.

A stakeholder is any group or individual who can affect, or be affected by, the achievement of a firm's objectives (Freeman, 1984). Stakeholders include shareholders, employees, customers, suppliers, lenders, government and communities, as well as groups representing environmentalists, the media, and consumer advocates (Clarkson, 1995).
The managerial branch of stakeholder theory posits that corporate disclosure is a mechanism for negotiating the relationship between a firm and its stakeholders (Gray et al., 1995) and a "strategy for managing, or perhaps manipulating, the demands of particular groups" (Deegan and Blomquist, 2006, p. 349). Stakeholder management is important: companies have a strong incentive to convince stakeholders that their activities are congruent with stakeholder expectations (Branco and Rodrigues, 2008a, b). Thus, disclosure of information on IC to stakeholders is helpful in avoiding information asymmetries and litigation risks. In a similar vein, Guthrie et al. (2004) have argued that legitimacy theory is tied closely to the reporting of IC, and that firms are more likely to report information on intangibles if they cannot legitimise their status via the "hard" assets that traditionally have symbolised corporate success.

Companies with a resource-based perspective believe good relations with stakeholders will increase financial returns and help in the acquisition of a competitive advantage through the development of valuable intangible assets related to reputation (Branco and Rodrigues, 2006b, 2008b). They also consider stakeholders to be gatekeepers to needed resources, and catalysts for increases or decreases in the cost and speed of access to those resources (Svendsen et al., 2001). Thus, relationships between a firm and its stakeholders are critical sources of a firm's wealth, and the ability to establish and maintain such relationships determines a firm's long-term survival and success (Post et al., 2002). So, from a resource-based perspective, there is an inherently high risk to a firm from failing to establish and nurture stakeholder relationships. IC reporting will help firms to build a positive relationship with stakeholders and help them to acquire an important intangible element: a good reputation. Because corporate reputation is based on perceptions, any strategy implemented by management should be accompanied by
disclosure (Deegan, 2002). Although reputation is influenced by many other factors, it is created and managed through the disclosure process (Toms, 2002). Indicative of this is Stansfield's (2006) description of corporate practice at the AXA Group. At AXA, specific initiatives are taken to shape, promote, measure and safeguard the company's reputation and image in the global marketplace. These initiatives include a multi-faceted global communications and sustainable development policy that integrates corporate social responsibility and sustainable development into business strategy and practices. Additionally, AXA has created a Group Compliance and Ethics Guide to facilitate continuous commitment to transparent disclosure practices. This Guide defines several important group-wide policies and procedures, including those relating to trading in AXA Group securities and prohibiting insider trading. According to Stansfield (2006, p. 478), "we [AXA] believe that maintaining our reputation for transparency in governance and disclosure has never been more important".

Firms have conveyed their extended performance to stakeholders using different communication media. The role of the annual report in corporate reporting strategies appears to be changing. Other types of corporate reports are becoming important and popular (Striukova et al., 2008). Discrete reports, reporting on the internet, one-to-one meetings, presentations and conference calls to financial analysts and institutional investors have been used to disseminate voluntary disclosures of corporate information, including intangibles information (García-Meca et al., 2005; Guthrie et al., 2008). Consequently, although previous empirical studies on IC reporting have predominantly focused on IC disclosures within the annual report (see Striukova et al., 2008), other communication media have been analysed to overcome the incomplete view of IC disclosures given by the examination of annual reports alone. These other media have
included prospectuses of initial public offerings (Bukh et al., 2005; Singh and Van der Zahn, 2008), conference calls to financial analysts (García-Meca et al., 2005), environmental and social reports (Cordazzo, 2005) and web sites (Striukova et al., 2008; Gerpott et al., 2008; Guthrie et al., 2008).

Studies by Gerpott et al. (2008) and Guthrie et al. (2008) have compared IC disclosures between different communication media. These two papers, each focusing on a single industry (respectively, the telecommunications industry and the Australian food and beverage industry), concluded that disclosure of intangible items in annual reports tended to be better than on web sites. According to Gerpott et al. (2008), telecommunications network operators use annual reports and web sites in a complementary manner to disclose intangible information. In annual reports they found that the highest disclosure quality values were for the intangible categories "customer", "supplier", and "investor capital", whereas in web site disclosures, they were for "investor", "customer", and "human capital".

Additionally, many forms of stakeholder-oriented corporate reporting have developed, including financial, triple bottom line, sustainability, social and environmental responsibility, and intellectual capital reporting (see the InCaS project). Guidelines for the disclosure of information on IC in such reports have been produced by the Nordic Industrial Fund (2001), Meritum (2002), and the Danish Ministry of Science, Technology and Innovation (DMSTI) (2003) (see also European Commission, 2006). Companies and stakeholders would benefit if efforts to guide the voluntary disclosure of corporate information on intangibles were integrated, and a consistent blueprint for the voluntary disclosure of relevant and reliable information on IC was provided (García-Ayuso, 2003).

Suggested guidelines regarding the content of an IC report by Meritum (2002) and DMSTI (2003), and of a sustainability report by GRI (2006), are similar. There is much common ground relating to purpose, elements to
include, and classifications: intangibles resources and activities (Meritum, 2002), knowledge resources (DMSTI, 2003), sustainability dimensions (GRI, 2006); and to target groups and expected benefits (see Table I). An IC report and a corporate responsibility report (such as a sustainability report) have compatible and amenable characteristics. This makes their integration or convergence feasible and sensible (e.g., Pedrini, 2007). To investigate this degree of integration, Pedrini (2007) analysed common elements between human capital accounting and the GRI Guidelines 2002, focusing on which indicators for employees (proposed in the GRI Guidelines) were used frequently in 20 international best practices for IC reports. The author found a large overlap of indicators around three issues: description of human capital, reporting on diversity and opportunity, and measurement of the quality and intensity of training. Cordazzo (2005) also conducted an empirical analysis of environmental and social reports in Italy, and analysed, in particular, whether some elements of an IC statement are present in environmental and social reports. She found a significant overlapping of data between these two sets of documents and a common relevant set of information between the environmental and social reports and the IC statement.

Based on prior literature, analysis of guidelines, and the increasing importance that sustainability reports have gained as a medium of corporate communication, the sustainability report appears to be a potentially important vehicle for disclosure of IC information.

Table I. Comparison of guidelines for an intellectual capital report and a sustainability report

Meritum (2002) – Intellectual capital report:
- Purpose: to communicate to stakeholders the firm's abilities, resources and commitments in relation to the fundamental determinant of firm value: intellectual capital.
- Elements to include: vision of the firm; summary of intangible resources and activities; system of indicators.
- Main classification: intangibles resources and activities (structural capital, relational capital, human capital).
- Target groups: broad range of stakeholders; both internal and external users.
- Expected benefits: both the Meritum and DMSTI documents recognise a double role for the IC report: contribution to the value creation process, and measurement of IC value.

DMSTI (2003) – Intellectual capital statement:
- Purpose: to explain the company's resource base and the activities that management implements to develop it.
- Elements to include: knowledge narrative; management challenges; initiatives; indicators.
- Main classification: knowledge resources (customers/users, employees, technology, processes).
- Target groups: broad range of stakeholders; both internal and external users.

GRI (2006) – Sustainability report:
- Purpose: to provide a balanced and reasonable representation of the company's sustainability performance, disclosing outcomes and results that occurred in the context of the organisation's commitments, strategy, and management approach.
- Elements to include: strategy and profile; management approach; performance indicators.
- Main classification: sustainability dimensions (economic, environmental, social).
- Target groups: broad range of stakeholders.
- Expected benefits: enable a robust assessment of an organisation's performance; support continuous improvement in performance over time; serve as a tool for engaging with stakeholders; secure useful input to organisational processes.

3. Hypotheses
We posit that the extent of voluntary disclosure of IC in sustainability reports is explained by four variables: adherence to GRI guidelines, industry differences, firm size, and stock market listing.

3.1 Adherence to GRI guidelines
Adherence to GRI guidelines can be interpreted as a way for a firm to foster legitimacy. Legitimacy theory and resource-based perspectives suggest that firms would regard adherence to GRI guidelines as a way to manage their stakeholders, and gain their support and approval. Therefore, a positive association is expected between the extent of IC information disclosed in a sustainability report and adherence to GRI guidelines. This expectation is supported by the argument that
there is an overlap between IC reporting and sustainability reporting (Zambon, 2003; Cordazzo, 2005). This leads to the following hypothesis:

H1. The higher the level of application of the GRI reporting framework, the more likely there will be voluntary disclosures of information about IC in a firm's sustainability report.

3.2 Industry differences
Firms are more likely to report IC information if they have a specific need to do so: for example, if they cannot legitimise their status via physical assets and need to highlight the extent of their holdings of intangibles (Guthrie et al., 2004). However, the current accounting treatment of intangibles is inadequate, especially in high-technology industries with large investments in IC (Collins et al., 1997; Francis and Schipper, 1999; Lev and Zarowin, 1999; Lev, 2001). The risk and uncertainty of the future economic benefits of intangible items is high, and many such items are unrecognised. Most accounting standards require expenditure on intangibles (particularly those internally generated) to be expensed, because this procedure is considered more reliable. International Accounting Standard 38, Intangible Assets (IASB, 2004), is conservative and does not provide an adequate conceptual basis for accounting for intangibles. As a result, mandatory financial reporting tends to be less informative in industries with large investments in intangibles (such as R&D). In such industries, it is plausible that firms will address this lack of information and potential misrepresentation by providing further disclosures voluntarily (Tasker, 1998):

H2. Firms in industries with high levels of intangibles are likely to disclose more information about IC in sustainability reports than firms in industries with low levels of intangibles.

3.3 Firm size
Most studies of accounting disclosure have found a positive relationship between firm size and the extent of discretionary disclosure (e.g., Botosan, 1997; Depoers, 2000). Larger firms are said to disclose more information because they are more
visible and more susceptible to scrutiny from stakeholder groups (Branco and Rodrigues, 2008b), and because they are more likely to have stronger financial, organisational and human resources to support voluntary disclosures. Drawing from studies of IC by Bukh et al. (2005), García-Meca et al. (2005) and Oliveira et al. (2006), the authors hypothesise that the extent of a firm's voluntary disclosure of IC is related positively to firm size:

H3. The larger a firm, the more likely there will be voluntary disclosure of information about IC in that firm's sustainability report.

3.4 Stock market listing
Generally, a listed firm will disclose more information than a non-listed firm because of the disclosure requirements of stock exchanges. Listing rules of stock exchanges are devised to help reduce litigation risk and to elicit transparency, equity and timeliness. Generally, in comparison with non-listed companies, listed companies are more visible, subject to greater public scrutiny, and attract more media coverage (Branco and Rodrigues, 2006a; Archel, 2003). A significant association between listing status and extent of disclosure has been found by Cooke (1992), Hossain et al. (1995), Wallace et al.
(1994), Giner (1997), García-Meca (2005), and Cerbioni and Parbonetti (2007):

H4. Listed firms are likely to disclose more information about IC in sustainability reports than non-listed firms.

4. Research design
4.1 Sample
The sample comprises the sustainability reports published by firms for the calendar year 2006 on BCSD Portugal's web site (as available on 30 November 2009). BCSD Portugal is a non-profit association affiliated with the World Business Council for Sustainable Development (WBCSD) that was created by three Portuguese firms (Sonae, Cimpor, and Soporcel) to promote corporate communication. The number of firms that published a sustainability report on the BCSD Portugal web site increased from five for the calendar year 2002 to 55 for the calendar year 2006. After excluding all sustainability reports from international non-Portuguese groups and one "outlier" Portuguese company, the sample comprised 42 sustainability reports. Table II presents the profile of the industries represented in the sample.

4.2 Variables
To explore the extent to which information on IC is disclosed voluntarily in sustainability reports, the dependent variable chosen was an IC disclosure index (ICI). This index was constructed using content analysis. Disclosure indexes (such as the ICI) calculate "the number of information-related items that a given report contains based on a predefined list of the possible items" (Bukh et al., 2005, p. 719). The success of a disclosure index depends on critical and cautious selection of items (Marston and Shrives, 1991; Bukh et al., 2005). Mindful of this, the selection of items comprising the ICI was influenced by example lists provided by Bukh et al. (2005), García-Meca et al.
(2005) and Singh and Van der Zahn (2008) (78 items, 71 items, and 81 items respectively). An interrogation protocol was used to pilot test five randomly chosen sustainability reports with a view to modifying the lists to better reflect the diverse nature of disclosed items. Manual coding was preferred because software-assisted searches for words, sentences or portions of pages are insufficiently robust to capture the nature of the IC information disclosed (Beattie and Thomson, 2007).

The final list included 88 IC items that firms could report in sustainability reports (Table III). Drawing on the methods of Bukh et al. (2005) and García-Meca et al. (2005), these items were classified into six categories: Strategy (ST) (n = 21), Processes (P) (n = 11), Technology (T) (n = 5), Innovation, Research and Development (IRD) (n = 8), Customers (C) (n = 14) and Human Capital (HC) (n = 29). Such classification included the most common components of IC: structural capital, relational capital and human capital (Stewart, 1997; Sveiby, 1997; Meritum, 2002). The total disclosure score was computed as the unweighted sum of the scores of each item (Cooke, 1989b; Raffournier, 1995; Giner, 1997; Chavent et al., 2006). All items were considered relevant to all firms. The total ICI score for a firm was calculated as:

    ICI = (1/m) Σ_{i=1}^{m} d_i

where:
    d_i = 0 if the disclosure item is not found;
    d_i = 1 if the disclosure item is found; and
    m = the maximum number of items a firm can disclose in the sustainability report (i.e. 88 items).

Table II. Industry membership of sample firms

    Sector                                  Number of firms
    Banks                                   3
    Beverages                               1
    Construction and building materials     6
    Containers and packaging                1
    Customer services/retail                1
    Electricity                             2
    Food producers and processors           1
    Forestry and paper                      2
    General retailers                       1
    Government, authorities and agencies    1
    Industrial machinery                    1
    Industrial transportation               3
    Information technology hardware         1
    Leisure, entertainment and hotels       2
    Oil and gas                             3
    Real estate                             1
    Support services                        3
    Telecommunication services              4
    Transport                               1
    Travel and leisure                      1
    Water                                   3
    Total                                   42

Table III. Selected items: No. of firms (%).

Strategy (21 items):
- New products/services and technology: 28 (67%)
- Investments in new business: 9 (21%)
- Strategic alliances or agreements: 39 (93%)
- Acquisitions and mergers: 11 (26%)
- Leadership: 18 (43%)
- Network of suppliers and distributors: 25 (60%)
- Supplier evaluation policy: 33 (79%)
- Image and brand: 33 (79%)
- Corporate culture: 36 (86%)
- Best practices: 26 (62%)
- Organisational structure: 30 (71%)
- Environmental investments: 32 (76%)
- Community involvement: 40 (95%)
- Corporate social responsibility and objective: 41 (98%)
- Shareholders' structure: 22 (52%)
- Price policy: 14 (33%)
- Business vision, objectives and consistency of strategy: 39 (93%)
- Quality of products/services: 29 (69%)
- Marketing activities: 16 (38%)
- Stakeholder relationships/engagement: 37 (88%)
- Risk management: 32 (76%)

Processes (11 items):
- Efforts related to the working environment: 23 (55%)
- Internal sharing of knowledge and information/internal communication: 30 (71%)
- External sharing of knowledge and information/external communication: 27 (64%)
- Measure of internal or external failures: 18 (43%)
- Environmental approvals and statements/policies: 40 (95%)
- Utilisation of energy, raw materials and other input goods: 41 (98%)
- Efficiency: 31 (74%)
- Business model: 15 (36%)
- Installed capacity: 14 (33%)
- Litigations/law suits/sanctions: 5 (12%)
- Quality approvals and statements/policies: 36 (86%)

Innovation, research and development (8 items):
- Policy, strategy and/or objectives of I&R&D activities: 35 (83%)
- I&R&D expenses: 8 (19%)
- I&R&D in basic research: 7 (17%)
- I&R&D in product design/development: 12 (29%)
- Future projects or projects in course regarding I&R&D: 10 (24%)
- Details of firm patents: 2 (5%)
- Patents, licences, papers, etc.: 14 (33%)
- Patents pending: 0 (0%)

Technology (5 items):
- Investments in information technology (description, reason and/or expenses): 13 (31%)
- Information technology systems and facilities: 39 (93%)
- Software assets: 6 (14%)
- Web transactions: 3 (7%)
- Number of visits to the web: 8 (19%)

(Table III continued below.) Table III.
Frequency of intellectual capital items disclosed by firms.

An increase in IC categories potentially decreases inter-coder reliability (Beattie and Thomson, 2007). To check for reliability, two researchers used the specified coding system on five sustainability reports in the sample (more than 10 per cent of the sample). The two codings were compared for inter-coder reliability using correlation coefficients.

Table III (continued). Selected items: No. of firms (%).

Customers (14 items):
- Number of customers: 19 (45%)
- Sales breakdown by customer: 5 (12%)
- Annual sales per segment or product: 16 (38%)
- Average customer size: 1 (2%)
- Customer relationships: 29 (69%)
- Customer satisfaction/survey: 27 (64%)
- Education/training of customers: 3 (7%)
- Customers by employee: 1 (2%)
- Value added per customer or segment: 2 (5%)
- Market share, breakdown by country/segment/product: 3 (7%)
- Relative market share to competitors: 16 (38%)
- Repurchase/customer seniority and loyalty: 4 (10%)
- New customers: 6 (14%)
- Production by customer or customer by product: 4 (10%)

Human capital (29 items):
- Staff breakdown by age: 37 (88%)
- Staff breakdown by seniority: 19 (45%)
- Staff breakdown by gender: 37 (88%)
- Staff breakdown by job function/business area: 23 (55%)
- Staff breakdown by level of education: 18 (43%)
- Staff breakdown by geographic area/by country: 17 (40%)
- Staff breakdown by type of contract: 19 (45%)
- Rate of staff turnover: 20 (48%)
- Changes in number of employees: 34 (81%)
- Staff health and safety: 38 (90%)
- Absence: 24 (57%)
- Staff interview/employee survey: 22 (52%)
- Policy on competence development: 35 (83%)
- Description of competence development program and activities: 19 (45%)
- Education and training policy: 39 (93%)
- Education and training expenses: 12 (29%)
- Education and training expenses/number of employees or .../sales turnover: 3 (7%)
- Employee expenses/number of employees: 4 (10%)
- Recruitment policies: 17 (40%)
- Job rotation opportunities: 11 (26%)
- Career opportunities: 12 (29%)
- Remuneration and evaluation systems: 30 (71%)
- Incentive systems and fringe benefits: 31 (74%)
- Pensions: 15 (36%)
- Insurance policies: 18 (43%)
- Income or assets by employee: 1 (2%)
- Value added/employee or production/employee: 4 (10%)
- Employee quality and experience: 9 (21%)
- Management quality and experience: 6 (14%)
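The unweighted index in Section 4.2 reduces to a checklist count. A minimal Python sketch follows; the item names and the example report below are hypothetical illustrations, not data from the study; only the formula ICI = (1/m) Σ d_i and m = 88 come from the paper.

```python
# Unweighted disclosure index: ICI = (1/m) * sum of d_i,
# where d_i = 1 if item i is found in the report and 0 otherwise.
def disclosure_index(disclosed, item_list):
    """Fraction of the m possible checklist items that a report disclosed."""
    m = len(item_list)
    score = sum(1 for item in item_list if item in disclosed)
    return score / m

# Hypothetical illustration with four items (the study used m = 88).
items = ["image and brand", "risk management", "staff turnover", "patents pending"]
report = {"risk management", "staff turnover"}
print(disclosure_index(report, items))  # prints 0.5
```

Because all items are weighted equally and treated as relevant to every firm, the index is simply the proportion of the 88 items found in a given report.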
Tutorial on Cluster Analysis
CSNA May 2006 – F. Murtagh

Topics
1. An in-depth look at hierarchical clustering, including:
   - Weighting observations
   - Nearest neighbor and reciprocal nearest neighbor algorithms
   - State of the art in complexity
   - Clustering of correspondence analysis factor projections, to bypass normalization problems
2. Graph methods and constrained clustering: these are mostly methods for clustering on graphs (as opposed to clustering graphs).
3. Partitioning, distribution mixture modeling with Bayes factors, and Kohonen self-organizing maps, all of which are based on the EM (expectation-maximization) optimization algorithm.

Introduction and an Example

Cluster Analysis: Some Terms
- Unsupervised classification, clustering, cluster analysis, automatic classification. Versus: supervised classification, discriminant analysis, trainable classifier, machine learning.
- For clustering we can consider: (i) partitioning methods, (ii) agglomerative hierarchical classification, (iii) graph methods, (iv) statistical methods, or distribution mixture models, (v) the Kohonen self-organizing feature map.
- Then there are combinatorial methods, statistical methods which assume a (data + noise) model, and so on.
- Note that principal components analysis, correspondence analysis, or indeed visualization display methods, can be used for clustering.

Example: analysis of globular clusters
- M. Capaccioli, S. Ortolani and G. Piotto, "Empirical correlation between globular cluster parameters and mass function morphology", AA, 244, 298-302, 1991.
- 14 globular clusters, 8 measurement variables.
- Data collected in earlier CCD (digital detector) photometry studies.
- Pairwise plots of the variables.
- PCA of the variables.
- PCA of the objects (globular clusters).

    Object      t_rlx      Rgc    Zg     log(M/M.)  c      [Fe/H]  x      x0
                (years)    (Kpc)  (Kpc)
    M15         1.03e+8    10.4   4.5    5.95       2.54   -2.15   2.5    1.4
    M68         2.59e+8    10.1   5.6    5.1        1.6    -2.09   2.0    1.0
    M13         2.91e+8    8.9    4.6    5.82       1.35   -1.65   1.5    0.7
    M3          3.22e+8    12.6   10.2   5.94       1.85   -1.66   1.5    0.8
    M5          2.21e+8    6.6    5.5    5.91       1.4    -1.4    1.5    0.7
    M4          1.12e+8    6.8    0.6    5.15       1.7    -1.28   -0.5   -0.7
    47 Tuc      1.02e+8    8.1    3.2    6.06       2.03   -0.71   0.2    -0.1
    M30         1.18e+7    7.2    5.3    5.18       2.5    -2.19   1.0    0.7
    NGC 6397    1.59e+7    6.9    0.5    4.77       1.63   -2.2    0.0    -0.2
    M92         7.79e+7    9.8    4.4    5.62       1.7    -2.24   0.5    0.5
    M12         3.26e+8    5.0    2.3    5.39       1.7    -1.61   -0.4   -0.4
    NGC 6752    8.86e+7    5.9    1.8    5.33       1.59   -1.54   0.9    0.5
    M10         1.50e+8    5.3    1.8    5.39       1.6    -1.6    0.5    0.4
    M71         8.14e+7    7.4    0.3    4.98       1.5    -0.58   -0.4   -0.4

[Figure: pairwise scatter plots of the eight variables log(t_rel), R_gc, Z_g, log(mass), c, [Fe/H], x, x_0.]

[Figure: hierarchical clustering (Ward's) of the 14 globular clusters, shown as a dendrogram.]

[Figure: principal plane (48%, 24% of variance), with the eight variables plotted on principal components 1 and 2.]

[Figure: principal plane (48%, 24% of variance), with the 14 globular clusters plotted on principal components 1 and 2.]

A
Formal Definition to Begin WithTutorial on Cluster Analysis –CSNA May 2006–F Murtagh12'&$%Hierarchical clustering•Hierarchical agglomeration on n observation vectors,i ∈I ,involves a series of 1,2,...,n −1pairwise agglomerations of observations or clusters,with the following properties.•A hierarchy H ={q |q ∈2I }such that:1.I ∈H 2.i ∈H ∀i3.for each q ∈H,q ∈H :q ∩q =∅=⇒q ⊂q or q ⊂q•An indexed hierarchy is the pair (H,ν)where the positive function defined on H ,i.e.,ν:H →I R +,satisfies:1.ν(i )=0if i ∈H is a singleton 2.q ⊂q =⇒ν(q )<ν(q )•Function νis the agglomeration level.&%•Take q ⊂q ,let q ⊂q and q ⊂q ,and let q be the lowest level cluster for which this is true.Then if we define D (q,q )=ν(q ),D is an ultrametric.•Recall:Distances satisfy the triangle inequality d (x,z )≤d (x,y )+d (y,z ).An ultrametric satisfies d (x,z )≤max(d (x,y ),d (y,z )).In an ultrametric space triangles formed by any three points are isosceles.An ultrametric is a special distance associated with rooted trees.Ultrametrics are used in other fields also –in quantum mechanics,numerical optimization,number theory,and algorithmic logic.•In practice,we start with a Euclidean distance or other dissimilarity,use some criterion such as minimizing the change in variance resulting from theagglomerations,and then define ν(q )as the dissimilarity associated with the agglomeration carried out.&%Distance,Similarity,Tree DistanceTutorial on Cluster Analysis –CSNA May 2006–F Murtagh15'&$%Metric and Ultrametric•Triangular inequality:Symmetry:d (a,b )=d (b,a )Positive semi-definiteness:d (a,b )>0,if a =b ;d (a,b )=0,if a =b Triangular inequality:d (a,b )≤d (a,c )+d (c,b )•Ultrametric inequality:d (a,b )≤max (d (a,c )+d (c,b ))•Minkowski metric:d p (a,b )=p q Pj |a j −b j |p p ≥1.•Particular cases of the Minkowski metric:p =2gives Euclidean,p =1givesHamming or city-block;and =∞gives d ∞(a,b )=max j |a j −b j |which is the “maximum coordinate”or Chebyshev distance.•Also termed L 2,L 1,and L 
∞distances.•Question:show that squared Euclidean and Hamming distances are the same for binary data.Tutorial on Cluster Analysis –CSNA May 2006–F Murtagh16'&$%Metrics•The notion of distance is crucial,since we want to investigate relationships between observations and/or variables.•Recall:x ={3,4,1,2},y ={1,3,0,1},then:scalar product x,y = y,x =x y =xy =3×1+4×3+1×0+2×1.•Euclidean norm: x 2=3×3+4×4+1×1+2×2.•Euclidean distance:d (x,y )= x −y .The squared Euclidean distance is:3−1+4−3+1−0+2−1•Orthogonality:x is orthogonal to y if x,y =0.•Distance is symmetric,d (x,y )=d (y,x );positive,d (x,y )≥0;and definite,d (x,y )=0=⇒x =y .&%Metrics (cont’d.)•Any symmetric,positive,definite matrix M defines a generalized Euclidean space.Scalar product is x,y M =x My ,norm is x 2=x Mx ,and Euclidean distance is d (x,y )= x −y M .•Classical case:M =I n ,the identity matrix.•Normalization to unit variance:M is diagonal matrix with i th diagonal term1/σ2i .•Mahalanobis distance:M is inverse variance-covariance matrix.•Next topic:Scalar product defines orthogonal projection.&%Metrics (cont’d.)•Projected value,projection,coordinate:x 1=(x Mu/u Mu )u .Here x 1and uare both vectors.•Norm of vector x 1=(x Mu/u Mu ) u =(x Mu )/ u .•The quantity (x Mu )/( x u )can be interpreted as the cosine of the angle a between vectors x and u .+x /|/|/|/|/a |+-----+-----uO x1Tutorial on Cluster Analysis –CSNA May 2006–F Murtagh19'&$%Least Squares Optimal Projection of Points•Plot of 3points in I R 2(see following slides).•PCA:determine best fitting axes.•Examples follow.•Note:optimization means either (i)closest axis to points,or (ii)maximumelongation of projections of points on the axis.•This follows from Pythagoras’s theorem:x 2+y 2=z 2.Call z the distance from the origin to a point.Let x be the distance of the projection of the point from the origin.Then y is the perpendicular distance from the axis to to the point.•Minimizing y is the same as maximizing x (because z is fixed).Tutorial on Cluster 
Tutorial on Cluster Analysis – CSNA May 2006 – F. Murtagh

[Figure: Examples of Optimal Projection. Point sets projected onto axes, panels (a), (b) and (c).]

Cosine Coefficient (cf. Principal Components Analysis)
• The projection of vector x onto axis u is y = (x′Mu / u′Mu) u.
• I.e. the coordinate of the projection on the axis is x′Mu / ‖u‖_M.
• This becomes x′Mu when the vector u is of unit length.
• The cosine of the angle between vectors x and y in the usual Euclidean space is x′y / (‖x‖ ‖y‖).
• That is to say, we make use of the triangle whose vertices are the origin, the projection of x onto y, and the vector x.
• The cosine of the angle between x and y is then the coordinate of the projection of x onto y, divided by the (hypotenuse) length of x.
• The correlation coefficient between two vectors is then simply the cosine of the angle between them, when the vectors have first been centred (i.e. x − g and y − g are used, where g is the overall centre of gravity).

Normalization ⇒ Scalar Product gives Correlation
• Let r_ij be the original measurements. Then define:
  x_ij = (r_ij − r̄_j) / (s_j √n), with r̄_j = (1/n) Σ_{i=1..n} r_ij and s_j² = (1/n) Σ_{i=1..n} (r_ij − r̄_j)².
• Then the matrix to be diagonalized in PCA, i.e. the matrix of all pairwise scalar products of observation vectors, has (j,k)th term:
  ρ_jk = Σ_{i=1..n} x_ij x_ik = (1/n) Σ_{i=1..n} (r_ij − r̄_j)(r_ik − r̄_k) / (s_j s_k).
• This is the correlation coefficient between variables j and k.
• We have the distance d²(j,k) = Σ_{i=1..n} (x_ij − x_ik)² = Σ_i x_ij² + Σ_i x_ik² − 2 Σ_i x_ij x_ik.
• The first two terms each equal 1. Hence:
  d²(j,k) = 2(1 − ρ_jk).
• Thus the distance between two variables is a simple decreasing function of the correlation between them: highly correlated variables are close.
• For row points (objects, observations):
  d²(i,h) = Σ_j (x_ij − x_hj)² = Σ_j ((r_ij − r_hj)/(s_j √n))² = (r_i − r_h)′ M (r_i − r_h),
  where r_i and r_h are column vectors (of dimension m × 1) and M is the m × m diagonal matrix with jth element 1/(n s_j²).
• Therefore d is a Euclidean distance associated with the matrix M.
• Note that the row points are now centred but the column points are not: therefore the latter may well appear in one quadrant on output listings.

Cosine and Correlation Coefficients Done. Now: Further Examples of Similarities
• Jaccard coefficient for binary vectors a and b, with N the counting operator:
  s(a,b) = N_j(a_j = b_j = 1) / [N_j(a_j = 1) + N_j(b_j = 1) − N_j(a_j = b_j = 1)].
• The Jaccard similarity coefficient of the vectors (10001001111) and (10101010111) is 5/(6 + 7 − 5) = 5/8.
• In vector notation: s(a,b) = a′b / (a′a + b′b − a′b).
• The Jaccard coefficient uses counts of presences/absences in the cross-tabulation of binary presence/absence vectors:

               | a/present   a/absent
    -----------+----------------------
    b/present  |    n1          n2
    b/absent   |    n3          n4

• A number of such measures have been used in information retrieval and in numerical taxonomy: Jaccard, Dice, Tanimoto, ...

Upstream of Distances or Similarities: Data Coding
Two records (x and y) with three variables (Seyfert type, magnitude, X-ray emission), shown with disjunctive coding. Record x: S1, 18.2, X. Record y: S1, 6.7, — (X-ray datum missing).

        Seyfert type spectrum | Integrated magnitude | X-ray data?
        S1   S2   S3   —      | ≤10    >10           | Yes
    x   1    0    0    0      | 0      1             | 1
    y   1    0    0    0      | 1      0             |

Concluding for the Present on Distances
• A distance, as seen, is defined on a set of objects x as a mapping d : x × x → R⁺, where the result (right-hand term) is a value in the set of positive reals.
• Alternatively expressed: for x_i, x_j ∈ x, we have d(x_i, x_j) ∈ R⁺.
• A Euclidean space is a particular metric space. If we allow for infinite dimension, it is termed a Hilbert space.
• Euclidean distance is defined from the scalar product. The scalar product gives the cosine of the angle between two vectors. If the vectors are suitably normalized, we have correlations between them. A more "global" normalization is involved when we modify the Euclidean distance to give the Mahalanobis distance.
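Two of the identities above can be checked numerically: d²(j,k) = 2(1 − ρ_jk) for variables standardized as on the slides, and the worked Jaccard value 5/8. This is a minimal sketch; the numeric data in the correlation check is illustrative, while the binary vectors are the ones given above.

```python
import math

def standardize(col):
    """x_ij = (r_ij - mean_j) / (s_j * sqrt(n)), with the 1/n variance convention."""
    n = len(col)
    m = sum(col) / n
    s = math.sqrt(sum((v - m) ** 2 for v in col) / n)
    return [(v - m) / (s * math.sqrt(n)) for v in col]

def rho_and_d2(a, b):
    xa, xb = standardize(a), standardize(b)
    rho = sum(p * q for p, q in zip(xa, xb))        # scalar product = correlation
    d2 = sum((p - q) ** 2 for p, q in zip(xa, xb))  # squared Euclidean distance
    return rho, d2

rho, d2 = rho_and_d2([1, 2, 3, 4], [2, 1, 4, 3])    # illustrative data
assert abs(d2 - 2 * (1 - rho)) < 1e-12              # d²(j,k) = 2(1 - ρ_jk)

def jaccard(a, b):
    both = sum(1 for x, y in zip(a, b) if x == y == 1)
    return both / (sum(a) + sum(b) - both)

a = [1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1]               # (10001001111)
b = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1]               # (10101010111)
assert jaccard(a, b) == 5 / (6 + 7 - 5)             # = 5/8, as in the worked example
```

The first identity holds for any pair of standardized columns, since each standardized column has unit squared norm.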
Some Properties of Ultrametrics
• An ultrametric distance is defined strictly on a tree.

  [Figure: three leaves i, j, k of a tree, with k on a separate branch from i and j.]

• Considering three points i, j, k, we have already considered the relationship d_xy ≤ max{d_xz, d_yz}, where x, y, z take on the different values i, j, k in any order.
• Furthermore: any triangle, formed from a triplet of points, must be equilateral, or isosceles with small base.
• Topologically, every open ball is also a closed ball. We term this a clopen ball.
• Every point in a (closed or open) ball can be taken as its center.
• The radius of a ball is identical to its diameter.
• If two balls (either both open or both closed) overlap, then one must be enclosed in the other.
• Conclusion: an ultrametric, or tree or hierarchic distance, is very peculiar!

A Worked Example of Hierarchical Agglomerative Clustering
Note: the agglomerative criterion used is very important.

Single Linkage Hierarchical Clustering
Dissimilarity matrix defined for 5 objects:

        1  2  3  4  5
    1 | 0  4  9  5  8
    2 | 4  0  6  3  6
    3 | 9  6  0  6  3
    4 | 5  3  6  0  5
    5 | 8  6  3  5  0

Agglomerate 2 and 4 at dissimilarity 3; agglomerate 3 and 5 at dissimilarity 3:

          1  2∪4  3
    1   | 0   4   9  8
    2∪4 | 4   0   6  5
    3   | 9   6   0  3
    5   | 8   5   3  0

Single Linkage Hierarchical Clustering – 2

          1  2∪4  3∪5            1∪2∪4  3∪5
    1   | 0   4    8     1∪2∪4 |   0     5
    2∪4 | 4   0    5      3∪5  |   5     0
    3∪5 | 8   5    0

Agglomerate 1 and 2∪4 at dissimilarity 4; finally agglomerate 1∪2∪4 and 3∪5 at dissimilarity 5.

Single Linkage Hierarchical Clustering – 3
Resulting dendrogram (schematically); r = ranks or levels (1 to 4), c = criterion values (linkage weights), here c = 3, 3, 4, 5:

    c
    5         +-----------------+
              |                 |
    4     +---+----+            |
          |        |            |
    3     |     +--+--+      +--+--+
          |     |     |      |     |
    0     1     2     4      3     5

Single Linkage Hierarchical Clustering – Algorithm
Input: an n(n−1)/2 set of dissimilarities.
Step 1: Determine the smallest dissimilarity, d_ik.
Step 2: Agglomerate objects i and k: i.e. replace them with a new object, i∪k; update the dissimilarities such that, for all objects j ≠ i, k:
  d_{i∪k, j} = min{d_ij, d_kj}.
Delete the dissimilarities d_ij and d_kj, for all j, as these are no longer
used.
Step 3: While at least two objects remain, return to Step 1.

Single Linkage Hierarchical Clustering – 4
• Precisely n − 1 levels for n objects. Ties are settled arbitrarily.
• Note the single linkage criterion.
• Disadvantage: chaining. "Friends of friends" end up in the same cluster.
• Lance-Williams cluster update formula:
  d(i∪j, k) = α_i d(i,k) + α_j d(j,k) + β d(i,j) + γ |d(i,k) − d(j,k)|,
  where the coefficients α_i, α_j, β, and γ define the agglomerative criterion.
• For single link: α_i = α_j = 0.5, β = 0 and γ = −0.5.
• These values always imply: d(i∪j, k) = min{d_ik, d_jk}.
• The ultrametric distance δ resulting from the single link method is such that δ(i,j) ≤ d(i,j) always. It is also unique (with the exception of ties). So single link is also termed the subdominant ultrametric method.

Remarks on Hierarchical Clustering Criteria
• Complete link: substitute max for min in single link.
• Complete link leads to compact clusters.
• Single link defines the cluster criterion from the closest object in the cluster. Complete link defines the cluster criterion from the furthest object in the cluster.
• Single link yields the maximal inferior ultrametric, or subdominant ultrametric.
• What this means is: let δ_ij be an ultrametric distance derived from the single link hierarchy, and let d_ij be the original corresponding distance. Then δ_ij ≤ d_ij, and δ_ij is the best such fit to d_ij "from below". This subdominant ultrametric is unique.
• Analogously, complete link yields a minimal superior ultrametric. However, this is not unique.
• Robin Sibson developed an O(n²) algorithm for single link:
  R. Sibson, "SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method", Computer Journal, 16, 30-34, 1973.
  Note here: optimal, i.e. O(n²). Robin Sibson was Professor of Statistics at the University of Bath, and later Vice-Chancellor of the University of Kent at Canterbury. In 2000, he became Chief Executive of the Higher Education Statistics Agency, HESA, in the UK.
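The Lance-Williams update and the single/complete link coefficient choices can be sketched directly; the numeric dissimilarities below are illustrative.

```python
def lance_williams(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
    """Dissimilarity of the merged cluster i U j to a third cluster k."""
    return (alpha_i * d_ik + alpha_j * d_jk
            + beta * d_ij + gamma * abs(d_ik - d_jk))

# Single link coefficients (0.5, 0.5, 0, -0.5) always give the minimum,
# since 0.5*a + 0.5*b - 0.5*|a - b| = min(a, b).
assert lance_williams(4, 7, 5, 0.5, 0.5, 0.0, -0.5) == min(4, 7)
# Complete link flips the sign of gamma and gives the maximum.
assert lance_williams(4, 7, 5, 0.5, 0.5, 0.0, 0.5) == max(4, 7)
```

Criteria with cardinality-dependent coefficients (group average, centroid, Ward) use the same update formula, with the α, β values computed from the cluster sizes at each merge.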
• Daniel Defays developed an O(n²) algorithm for a complete link method:
  D. Defays, "An efficient algorithm for a complete link method", Computer Journal, 20, 364-366, 1977.
  Daniel Defays went on to work also in official statistics, in Eurostat, the Statistical Office of the European Union.
• Other criteria define d(i∪j, k) from the distance between k and something closer to the mean or center of i and j. These criteria include the median, centroid and minimum variance methods.

Remarks on Hierarchical Clustering Criteria (Cont'd.)
• A problem that can arise: inversions in the hierarchy, i.e. the cluster criterion value is not monotonically increasing. That leads to cross-overs in the dendrogram.
• Of the above agglomerative methods, the single link, complete link, and minimum variance methods can be shown never to allow inversions. They satisfy the reducibility property, first formulated by Michel Bruynooghe, working in Benzécri's lab in the late 1970s. Bruynooghe now works in a university group on photonic systems in Strasbourg, France.
• We will return to this property (which guarantees no inversions, i.e. monotonic behavior in the sequence of agglomerations) later, when we discuss representation or display aspects of hierarchies.

Summary of Hierarchical Agglomerative Criteria
Note: we should distinguish clearly between a clustering method (implying a stepwise optimization criterion) and an algorithm.
N. Jardine and R. Sibson, Mathematical Taxonomy, Wiley, 1971, p. 42.

For each hierarchical clustering method (and its aliases), we list: the Lance-Williams update coefficients; where applicable, the coordinates of the centre g of the cluster that agglomerates clusters i and j; and the dissimilarity between the cluster centres g_i and g_j.

Single link (nearest neighbor):
  α_i = α_j = 0.5, β = 0, γ = −0.5
  (more simply: d(i∪j, k) = min{d_ik, d_jk})

Complete link (diameter):
  α_i = α_j = 0.5, β = 0, γ = 0.5
  (more simply: d(i∪j, k) = max{d_ik, d_jk})

Group average (average link, UPGMA):
  α_i = |i| / (|i| + |j|), β = 0, γ = 0
Median method (Gower's, WPGMC):
  α_i = α_j = 0.5, β = −0.25, γ = 0
  centre: g = (g_i + g_j)/2;  between-centre dissimilarity: ‖g_i − g_j‖²

Centroid (UPGMC):
  α_i = |i| / (|i| + |j|), β = −|i||j| / (|i| + |j|)², γ = 0
  centre: g = (|i| g_i + |j| g_j) / (|i| + |j|);  between-centre dissimilarity: ‖g_i − g_j‖²

Ward's method (minimum variance, error sum of squares):
  α_i = (|i| + |k|) / (|i| + |j| + |k|), β = −|k| / (|i| + |j| + |k|), γ = 0
  centre: g = (|i| g_i + |j| g_j) / (|i| + |j|);  between-centre dissimilarity: (|i||j| / (|i| + |j|)) ‖g_i − g_j‖²

Observation Weighting
• Note how the centroid and Ward's minimum variance methods allow a simple but satisfactory way to weight the observations.
• New cluster center: q = (m_q q + m_q′ q′) / (m_q + m_q′).
• Dissimilarity between the merged clusters: (m_q m_q′ / (m_q + m_q′)) ‖q − q′‖².
• Typically, m_q = m_q′ = 1/n to begin with, where we have n observations.
• To weight observations, just take these weights as other than identical and constant.
• Our software (in C, Java and R) supports observation weighting. (Of course there is no problem with identical, constant weights.)

Basic or Traditional Algorithms

Agglomerative Algorithm Based on Data
Step 1: Examine all interpoint dissimilarities, and form a cluster from the two closest points.
Step 2: Replace the two clustered points by a representative point (centre of gravity), or by the cluster fragment.
Step 3: Return to Step 1, treating clusters as well as remaining objects, until all objects are in one cluster.

Agglomerative Algorithm Based on Dissimilarities
Step 1: Form a cluster from the smallest dissimilarity.
Step 2: Define the cluster; remove the dissimilarity of the agglomerated pair. Update the dissimilarities from the cluster to all other clusters/singletons.
Step 3: Return to Step 1, treating clusters as well as remaining objects, until all objects are in one cluster.
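The dissimilarity-based algorithm just stated can be sketched with a pluggable Lance-Williams update; run with the single-link coefficients on the 5-object matrix of the worked example, it reproduces the merge levels 3, 3, 4, 5 found there. A minimal sketch: constant coefficients only (single/complete link); cardinality-dependent criteria would recompute them per merge.

```python
def agglomerate(diss, alpha_i=0.5, alpha_j=0.5, beta=0.0, gamma=-0.5):
    """diss: symmetric dict-of-dicts of dissimilarities between singletons.
    Returns the n-1 merges as (sorted members, criterion level)."""
    clusters = {frozenset([a]) for a in diss}
    d = {frozenset([frozenset([a]), frozenset([b])]): v
         for a in diss for b, v in diss[a].items()}
    merges = []
    while len(clusters) > 1:
        # Step 1: form a cluster from the smallest dissimilarity.
        pair, level = min(d.items(), key=lambda kv: kv[1])
        ci, cj = pair
        new = ci | cj
        merges.append((sorted(new), level))
        # Step 2: remove the pair's dissimilarity; update to all other clusters.
        clusters -= {ci, cj}
        del d[pair]
        for ck in clusters:
            dik = d.pop(frozenset([ci, ck]))
            djk = d.pop(frozenset([cj, ck]))
            d[frozenset([new, ck])] = (alpha_i * dik + alpha_j * djk
                                       + beta * level + gamma * abs(dik - djk))
        clusters.add(new)   # Step 3: repeat until one cluster remains.
    return merges

D = {1: {2: 4, 3: 9, 4: 5, 5: 8},
     2: {1: 4, 3: 6, 4: 3, 5: 6},
     3: {1: 9, 2: 6, 4: 6, 5: 3},
     4: {1: 5, 2: 3, 3: 6, 5: 5},
     5: {1: 8, 2: 6, 3: 3, 4: 5}}

# Single link (default coefficients) reproduces the worked example.
assert [lvl for _, lvl in agglomerate(D)] == [3, 3, 4, 5]
```

This is the O(n³) "basic" algorithm from the complexity discussion, not the O(n²) SLINK/NN-chain algorithms mentioned elsewhere in the tutorial.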
Computational Complexity
• To find the closest pair in order to carry out an agglomeration: take each observation and match it (Euclidean distance, etc.) against every other. We take n observations, and we carry out O(n) matchings each, so the complexity is O(n²). We repeat this for the n − 1 agglomerations. So the overall complexity is O(n³).
• Say we have dissimilarities instead. (These could well be distances; or, mutatis mutandis, similarities.) All pairwise dissimilarities are needed (not precluding an upper, or lower, half-matrix of dissimilarities). So the set-up complexity is O(n²). Now we find the minimum dissimilarity, taking O(n²) effort to scan all dissimilarities. We agglomerate and update our dissimilarity matrix (again O(n²) effort). So far, everything together is of O(n²) effort. We repeat this procedure n − 1 times. All told, the complexity is O(n³).

Minimum Variance Method

Minimum Variance Agglomeration
• For Euclidean distance inputs, the following definitions hold for the minimum variance, or Ward error sum of squares, agglomerative criterion.
• Coordinates of the new cluster center, following agglomeration of q and q′, where m_q is the mass of cluster q defined as the cluster cardinality, and where (using overloaded notation) the vector q denotes the center of the cluster q:
  q = (m_q q + m_q′ q′) / (m_q + m_q′).
• Following the agglomeration of q and q′, we define the following dissimilarity:
  (m_q m_q′ / (m_q + m_q′)) ‖q − q′‖².
• Hierarchical clustering is usually based on factor projections, if desired using a limited number of factors (e.g. 7), in order to filter out the most useful information in our data. (See the discussion later.)
• In such a case, hierarchical clustering can be seen as a mapping of Euclidean distances into ultrametric distances.

Minimum Variance Method: Properties
• We seek to agglomerate two clusters, c1 and c2, into a cluster c, such that the within-class variance of the partition thereby obtained is minimum.
• Alternatively, the between-class variance of the partition obtained is to be maximized.
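The within/between decomposition underlying this criterion can be checked on a toy 1-D example. Note the convention assumed here: variances carry the 1/n factor as in the slides' definitions, so the merge cost picks up a 1/n as well; the data values are illustrative.

```python
# Toy check: total variance = between-class + within-class (Huygens' theorem),
# and the Ward merge cost (1/n) * |c1||c2|/(|c1|+|c2|) * (mean1 - mean2)^2
# equals the between-class variance removed by merging the two classes.

points = [0.0, 2.0, 10.0, 12.0]
classes = [[0.0, 2.0], [10.0, 12.0]]
n = len(points)
g = sum(points) / n                                   # overall centre of gravity
mean = lambda xs: sum(xs) / len(xs)

total = sum((x - g) ** 2 for x in points) / n                         # V(I)
between = sum(len(c) / n * (mean(c) - g) ** 2 for c in classes)       # V(P)
within = sum(sum((x - mean(c)) ** 2 for x in c) for c in classes) / n

assert abs(total - (between + within)) < 1e-12        # Huygens decomposition

c1, c2 = classes
cost = (len(c1) * len(c2) / (len(c1) + len(c2))) * (mean(c1) - mean(c2)) ** 2 / n
# Merging c1 and c2 here yields the one-class partition, whose between-class
# variance is zero, so the criterion value V(P) - V(Q) equals V(P) itself:
assert abs(cost - between) < 1e-12
```

The same arithmetic in m dimensions replaces squared differences by squared Euclidean norms.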
• Let P and Q be the partitions prior to, and subsequent to, the agglomeration; let p1, p2, ... be the classes of the partitions:
  P = {p1, p2, ..., pk, c1, c2}
  Q = {p1, p2, ..., pk, c}.
• The total variance of the cloud of objects in m-dimensional space decomposes into the sum of within-class variance and between-class variance. This is Huygens' theorem in classical mechanics.
• Total variance, between-class variance, and within-class variance are, respectively:
  V(I) = (1/n) Σ_{i∈I} ‖i − g‖²,
  V(P) = Σ_{p∈P} (|p|/n) ‖p − g‖²,
  (1/n) Σ_{p∈P} Σ_{i∈p} ‖i − p‖².
• For the two partitions, before and after an agglomeration, we have respectively:
  V(I) = V(P) + Σ_{p∈P} V(p)
  V(I) = V(Q) + Σ_{p∈Q} V(p).
• From this, it can be shown that the criterion to be optimized in agglomerating c1 and c2 into the new class c is:
  V(P) − V(Q) = V(c) − V(c1) − V(c2) = (1/n) (|c1||c2| / (|c1| + |c2|)) ‖c1 − c2‖²,
  with c1, c2 on the right denoting the class centers (overloaded notation, as above).

Reciprocal Nearest Neighbors, NN-Chains

Efficient NN-Chain Algorithms

  [Figure: five points a, b, c, d, e forming a nearest-neighbor chain that ends in a reciprocal NN pair.]

• An NN-chain (nearest neighbor chain) consists of an arbitrary point, followed by its NN; followed by the NN, from among the remaining points, of this second point; and so on, until we necessarily reach a pair of points which can be termed reciprocal or mutual NNs. (Such a pair of RNNs may be the first two points in the chain; and we have assumed that no two dissimilarities are equal.)
• In constructing an NN-chain, irrespective of the starting point, we may agglomerate a pair of RNNs as soon as they are found.
• Exactness of the resulting hierarchy is guaranteed when the cluster agglomeration criterion respects the reducibility property.
• Inversion impossible if: d(i,j) < min{d(i,k), d(j,k)} ⇒ d(i,j) < d(i∪j, k).

NN-Chain Algorithm Complexity – for "Geometric" Methods
• Firstly, take the observation points in space, starting with an arbitrary point. Find its NN; and the latter's NN; and so on, until we have a reciprocal NN pair. Each such operation is
called a growth. Then agglomerate: such an operation is called a contraction. Restart the process from the last point of the NN-chain before the RNN pair. The number of contractions is necessarily n − 1. The number of growths cannot exceed 3n − 3. (Why? Because we have n points to begin with; we have n − 1 cluster points created; and we have n − 1 "stub" points to consider, which allow an RNN pair to be created from the final link in the NN-chain. The total is bounded by 3n − 3.)
• So the total number of growths and contractions is linear in n, i.e. O(n). Each growth is based on an NN search, hence O(n). Overall, the complexity is O(n²).
• Storage here is the original data and the cluster points, hence O(n).

NN-Chain Algorithm Complexity – for "Graph" Methods
• Start from the dissimilarity matrix, O(n²) to create. Storage here is bounded by the dissimilarity data, hence O(n²).
• After each agglomeration, keep the dissimilarity matrix updated. O(n) effort is required at each agglomeration, since we apply Lance-Williams to 2 rows and 2 columns of the dissimilarity matrix. Note that the numbers of rows and columns of the dissimilarity matrix decrease by 1 at each step.
• There are, in all, n − 1 agglomerations. So all updates to the dissimilarity matrix are O(n) in number. Each such update taking O(n) implies overall O(n²) effort.
• What about the growths? Just as before, the total number of NN-chain growths is O(n). Each such growth requires O(n) effort, because we just have to scan one row (or one column, since the dissimilarities are assumed symmetric).
• We see that the overall complexity is O(n²).

NN-Chain Algorithm Complexity – for "Graph" Methods (Cont'd.)
• There is enormous confusion in the literature about this result!
• The confusion is most often about the complete link method.
• Edward Fox, Virginia Tech, /~cs5604/f95/cs5604cnCL/CL-alg-details.html:
  "Complete link: Time: Voorhees alg. worst case is O(N**3). Implementations of the general algorithm:
  – Stored matrix approach: use the matrix, and then apply
Lance-Williams to recalculate dissimilarities between cluster centers. Storage is therefore O(N**2) and time is at least O(N**2), but will be O(N**3) if the matrix is scanned linearly.
  – Stored data approach: O(N) space for the data, but recompute the pairwise dissimilarities, so O(N**3) time is needed.
  – Sorted matrix approach: O(N**2) to calculate the dissimilarity matrix, O(N**2 log N**2) to sort it, O(N**2) to construct the hierarchy; but one need not store the data set, and the matrix can be processed linearly, which reduces disk accesses."
• Hinrich Schütze, Stuttgart, /~schuetze/completelink.html:
  "The worst case time complexity of complete-link clustering is at most O(n² log n). (My intuition is that complete link clustering is easier than sorting a set of n² numbers, so there should be a more efficient algorithm. Let me know if you know of one!)"
• Peter Scheuermann, Northwestern, /~peters/publications/euro par.pdf:
  M. Dash, S. Petrutiu and P. Scheuermann, "Efficient Parallel Hierarchical Clustering", Proc. 10th International Euro-Par Conference, Italy, September 2004, LNCS 3149, pp. 363-371.
  "Existing algorithms take O(N² log N) CPU time and require O(N²) memory."
• David Eppstein, UCI, /~eppstein/280/tree.html:
  "However, Neighbor-Joining seems more difficult, with the best known time bound being O(n³) (and some commonly available implementations taking even more than that)."
• Confusion reigns! But O(n²) time algorithms ("optimal", as termed by Sibson) have been known, and implemented (e.g. in David Wishart's CLUSTAN package, since 1984), since the early 1980s. There is no excuse for not knowing this!
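The NN-chain procedure described above can be sketched for the single-link criterion, which satisfies reducibility, so that merging each reciprocal-NN pair as soon as it is found yields the exact hierarchy. A minimal sketch, reusing the 5-object matrix of the worked example; the dictionary-based bookkeeping is for clarity, not for the O(n²) bound.

```python
def nn_chain_single_link(diss):
    """Agglomerate by growing NN-chains and contracting reciprocal-NN pairs."""
    d = {frozenset([frozenset([a]), frozenset([b])]): v
         for a in diss for b, v in diss[a].items()}
    active = {frozenset([a]) for a in diss}
    chain, merges = [], []
    while len(active) > 1:
        if not chain:
            chain = [min(active, key=sorted)]   # arbitrary restart point
        tip = chain[-1]
        # growth: nearest neighbour of the chain tip among the other clusters
        nn = min((c for c in active if c != tip),
                 key=lambda c: d[frozenset([tip, c])])
        if len(chain) > 1 and nn == chain[-2]:
            # reciprocal NNs found: contraction (agglomerate immediately)
            a, b = chain.pop(), chain.pop()
            level = d[frozenset([a, b])]
            new = a | b
            active -= {a, b}
            for c in active:
                d[frozenset([new, c])] = min(d[frozenset([a, c])],
                                             d[frozenset([b, c])])
            active.add(new)
            merges.append((sorted(new), level))
        else:
            chain.append(nn)
    return merges

D = {1: {2: 4, 3: 9, 4: 5, 5: 8},
     2: {1: 4, 3: 6, 4: 3, 5: 6},
     3: {1: 9, 2: 6, 4: 6, 5: 3},
     4: {1: 5, 2: 3, 3: 6, 5: 5},
     5: {1: 8, 2: 6, 3: 3, 4: 5}}

m = nn_chain_single_link(D)
# Same hierarchy as the stepwise worked example, merges found in RNN order.
assert sorted(lvl for _, lvl in m) == [3, 3, 4, 5]
```

The merge order may differ from the globally-smallest-first order, but with a reducible criterion the resulting set of clusters and levels is the same.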
arXiv:hep-ph/0212185 v1, 13 Dec 2002
February 1, 2008  20:0  WSPC/Trim Size: 9in x 6in for Proceedings

CORRELATIONS BETWEEN <p_T> AND MULTIPLICITY IN A SINGLE BFKL POMERON

M. A. BRAUN
Dept. of High-Energy Physics, St. Petersburg State University, 198504 St. Petersburg, Russia
E-mail: Braun1@pobox.spbu.ru

C. MERINO AND G. RODRIGUEZ
Department of Particle Physics, Facultade de Física, Universidade de Santiago de Compostela, Campus Universitario s/n, Santiago de Compostela, Galice, Spain
E-mail: Merino@c.es, Grod@c.es

Strong correlations are obtained between the number and the average transverse momentum of the jets emitted by the exchange of a single BFKL Pomeron.

1. Introduction
Strong correlations are observed experimentally between the average p_T and the multiplicities of particles produced in high-energy hadronic collisions [1]: the average p_T grows with multiplicity. To interpret this fact it is tacitly assumed that, with only one hard collision, there are no correlations between <p_T> and multiplicity. Theoretically this assumption can only be tested within the Balitskii-Fadin-Kuraev-Lipatov (BFKL) dynamics, which presents a detailed description of particle (actually jet) production at high energies under certain simplifying assumptions (a fixed small coupling constant). The present calculation aims to see whether there exist correlations between <p_T> and the number of produced jets in the hard Pomeron
described by the BFKL chain of interacting reggeized gluons [2]. We limit ourselves to the leading-order BFKL model.

2. The Formalism
The BFKL equation for the amputated BFKL amplitude f(y,k), where y is the rapidity and k is the two-dimensional transverse momentum of the virtual (reggeized) gluon, may be written in the form

  f(y,k) = f^(0)(y,k) + ᾱ_s ∫_0^y dy₁ ∫ (d²k₁ / (π q²)) [ f(y₁,k₁) − f(y₁,k) θ(k² − q²) ],   (1)

where ᾱ_s = 3α_s/π and q = k − k₁ is the transverse momentum of the emitted (real) gluon.
Defining as an observable jet a real gluon with q² ≥ µ², one splits the integration over momenta, and thus the integration kernel in (1), into two parts: a resolved one, K_R, corresponding to emitted gluons with q² > µ², and an unresolved one, K_UV, which combines the emission of gluons with q² < µ² and the subtraction term in (1). Exclusive probabilities to produce n jets are obtained by introducing n operators K_R between the Green functions of the BFKL equations with kernel K_UV [3]. If one presents the full gluon distribution f as a sum of contributions f_n from the production of n jets, one gets the recursive relation

  f_n(y) = ∫_0^y dy₁ K(y − y₁) f_{n−1}(y₁),   (2)

where K(y) is a y-dependent operator in the transverse momentum space:

  K(y) = e^{y K_UV} K_R.   (3)

Eq. (2) allows one to successively calculate the relative probabilities to produce n = 0, 1, 2, ... jets, starting from the no-jet contribution.
The exclusive physical probabilities to observe n jets are obtained by convoluting f_n with the gluon distribution in the projectile (the projectile impact factor). Both the impact factors of the target and of the projectile should vanish as k → 0.

3. The Calculation
We are interested in the average values <q>_n in the observed jets, provided their number n is fixed. The momentum k which serves as the argument of f(y,k) refers to the virtual gluon, and not to the emitted one, whose momentum q is hidden inside the kernel K_R. Therefore, to find the average of any quantity φ(q) depending on the emitted real jet momentum, one has to introduce the function φ(q) into the integral defining K_R, thus changing the kernel K_R to a kernel K_av:

  (K_av f)(k) = ᾱ_s ∫_{q² ≥ µ²} (d²k₁ / (π q²)) φ(q) f(k₁).   (4)

Denoting by g_n(y,k) the result of replacing one of the n kernels K_R in f_n by K_av, summed over the n possible positions, the average over the n observed jets is

  <φ(q)>_n = [ ∫ (dk²/k⁴) h(k) g_n(y,k) ] / [ n ∫ (dk²/k⁴) h(k) f_n(y,k) ],

where h(k) is the projectile impact factor.

4. The Results
We defined our jets by taking µ = 2 GeV/c. As for the cutoffs, we used

  1 GeV/c < k₁ < 100 GeV/c,   (9)

and we used a simplified expression for the virtual photon impact factor, independent of rapidity [2]. We have calculated the functions f_n and g_n from Eqs. (2) and (8) up to n = 5 and y = 15.
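The structure of the iteration in Eq. (2) can be illustrated with a toy model in which the operators K_UV and K_R are replaced by constants k_uv and k_r: this is an assumption made purely for illustration, and is not part of the paper's calculation, where both are operators in transverse momentum. With scalars, K(y) = e^{y k_uv} k_r and f_0(y) = e^{y k_uv} f^(0), and the recursion has the closed form f_n(y) = ((k_r y)^n / n!) e^{k_uv y} f^(0), a Poisson-like distribution over the number of resolved jets.

```python
import math

# Toy scalar version of f_n(y) = ∫_0^y dy1 K(y - y1) f_{n-1}(y1), iterated
# numerically (trapezoidal rule) and compared with the closed form above.
k_uv, k_r = -0.3, 0.5            # illustrative constants, f^(0) = 1
h, N = 0.01, 200                 # rapidity grid up to y = 2
ys = [j * h for j in range(N + 1)]

f = [math.exp(k_uv * y) for y in ys]            # f_0(y)
for n in range(1, 4):                           # compute f_1, f_2, f_3
    g = [0.0] * (N + 1)
    for j in range(1, N + 1):
        vals = [math.exp((ys[j] - ys[m]) * k_uv) * k_r * f[m]
                for m in range(j + 1)]
        g[j] = h * (0.5 * (vals[0] + vals[j]) + sum(vals[1:j]))
    f = g

closed = (k_r * ys[N]) ** 3 / math.factorial(3) * math.exp(k_uv * ys[N])
assert abs(f[N] - closed) / closed < 1e-2       # matches the closed form
```

In the actual calculation the kernels act on functions of k, so each step of the recursion is a two-dimensional integral rather than a product; the Chebyshev discretization mentioned below reduces it to matrix arithmetic.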
Following [3], we have used the expansion in N Chebyshev polynomials to discretize the kernels in a simple way.
In Figure 1 we present the averages <q>_n for n = 1-5 and x = e^{−y} = 3·10⁻⁷ - 0.1, for γ*-hadron collisions (DIS) at Q² = 100 (GeV/c)².

[Figure 1. Average <p_T>_n for a fixed number n of jets produced in γ*-hadron collisions, as a function of x at Q² = 100 (GeV/c)². Curves from bottom to top correspond to n = 1, 2, ..., 5.]

As one observes, <q>_n grows strongly with n at all rapidities, the growth being approximately linear.
As an interesting by-product of our study, we find that the averages <q>_n go down with rapidity for all n ≥ 2. This is quite unexpected, since in the BFKL approach the overall average <q> grows rapidly with y. Similar results are obtained for purely hadronic collisions [2].

5. Discussion
Emissions of high-p_T jets in DIS seem to be a suitable place to look for BFKL signatures. Our results show that in such emissions strong positive correlations are predicted between <p_T> and the number of jets, already for a single Pomeron exchange. This indicates that such correlations are in fact already present in the basic mechanism of jet production. The linear growth of <p_T> with n that has been obtained could be a random-walk effect, with <p_T> becoming larger and larger at each step (with each newly produced jet) [4]. The extension of our study to the case of the BFKL equation with a running coupling constant would be important in order to establish the stability of our results.
An unexpected result obtained in our calculation is that <q>_n at fixed n ≥ 2 falls with energy. Certainly this phenomenon deserves further investigation, including higher y and/or n. We hope that it can be tested experimentally as a possible signature of the BFKL Pomeron.

Acknowledgments
This work is supported by CICYT (Spain), FPA2002-01161, and by the RFFI grant 01-0-17137 (Russia).

References
1. UA1 Collaboration, C. Ciapetti, in The Quark Structure of Matter, edited by
M. Jacob and K. Winter (1986), p. 455; F. Ceradini, Proceedings of the International Europhysics Conference on High-Energy Physics, Bari, edited by L. Nitti and G. Preparata (1985), and references therein.
2. M. A. Braun, C. Merino, and G. Rodríguez, Phys. Rev. D 65, 114001 (2002).
3. J. Kwiecinski, C. A. M. Lewis, and A. D. Martin, Phys. Rev. D 54, 6664 (1996).
4. We thank A. B. Kaidalov for enlightening discussions on this point.