I.J. Information Technology and Computer Science, 2017, 5, 15-22
Published Online May 2017 in MECS (/)
DOI: 10.5815/ijitcs.2017.05.03

A Frequency Based Approach to Multi-Class Text Classification

Anurag Sarkar
Northeastern University, Boston, United States
E-mail: sarkar.an@

Debabrata Datta
St. Xavier's College (Autonomous), Kolkata, India
E-mail: debabrata.datta@

Abstract—Text classification is a method of managing and processing important information by assigning it to one of a set of predefined classes within a collection of text data. This method plays a vital role in the fields of information processing and information retrieval. Different approaches to text classification, particularly those based on machine learning algorithms, have been discussed and proposed in various research works. This paper discusses a classification approach based on the frequencies of some important text parameters and accordingly classifies a given text into one of multiple categories. Using a newly defined parameter called wf-icf, the classification accuracy obtained in a previous work was significantly improved upon.

Index Terms—Supervised learning, Multi-class classification, Text classification, Text mining, Text categorization, tf-idf.

I. INTRODUCTION

Text classification, also referred to as text categorization, may be defined as the process of classifying textual information by means of its content. Due to the ubiquity of textual information, text classification finds application in a diverse array of domains such as text summarization, information extraction, information retrieval, question answering and sentiment analysis, to name a few.

Text classification is a form of text mining, a more general term that denotes any process which derives information from textual data by analyzing patterns within it; text mining is in turn a subset of the larger domain of data mining. Since text classification uses a labeled dataset to train the classifier, it is a supervised learning technique and differs from unsupervised learning techniques such as clustering, where training is performed using unlabeled data instances.

In this research paper, we expand on the work done in [1], in which a binary text classifier was proposed and implemented. That classifier made use of an incremental approach to text mining wherein newly classified data instances and their predicted labels were added to the existing training data set, so that this enhanced data set could be used to train the classifier for predicting the labels of future unclassified data instances. This allowed for more thorough classifier training and increased classification accuracy each time the classifier was used for the same problem.

The motivation behind improving upon the previous work in [1] has been two-fold. First, the primary limitation of the previous classifier was that it could only perform binary classification. The classifier needed to be more dynamic and applicable to a wider variety of training sets. Additionally, a classifier should not constrain the data set to consist of a specific number of classes, nor should it inconvenience users by having them supply the number of classes present in the training data, since the users themselves may not be aware of this information.
Second, it seemed that higher classification accuracy could be achieved by making certain optimizations to the existing algorithm and implementing the tf-idf (term frequency-inverse document frequency) statistic [2][18] in place of the simple word frequency used in the binary classifier. In the following sections, it will be shown that both of the above-mentioned goals have been successfully achieved and a classifier has been developed accordingly. The new classifier is capable of automatically inferring the number of classes in the training data while achieving higher classification accuracy.

The rest of the paper is structured as follows. Section 2 contains a short overview of related research work in the field of text classification and cites a few examples of text classifiers implemented using different known techniques. Section 3 provides a detailed description of the theoretical principles and concepts behind the proposed text classifier. Section 4 offers a working description of the proposed classifier, stepping through the pseudocode and algorithms behind its implementation. Section 5 discusses the time complexity of the classifier and analyzes the results obtained using it. Section 6 concludes the paper with a summary of the work and its future scope.

II. RELATED WORK

As mentioned in the previous section, the work in this paper is an extension of our previous work in [1], where a simple binary text classifier was implemented. That classifier made use of incremental mining and word frequency for classifying new instances. Various other methods of text classification and categorization exist and have been implemented using a variety of different techniques in other research works.

The paper in [3] defines text categorization as "the task of automatically sorting a set of documents into categories (or classes, or topics) from a predefined set" and states that it falls under the domain of both machine learning and information retrieval. Thus, techniques derived from both of these domains find application in implementing text classifiers. Joachims [4] proposed the use of Support Vector Machines (SVMs) for text classification and demonstrated that SVMs could outperform other machine learning methods such as k-NN classifiers and Naïve Bayes classifiers. In [5], Cristianini discussed at great length the working principles and applications of SVMs in text classification. Another popular variant of text classifiers is based on Bayes' theorem and is referred to as the Naïve Bayes classifier. In [6], Leung defined Bayesian classifiers as statistical classifiers capable of predicting the probability that a specific test instance belongs to a specific class. Frank and Bouckaert [7] demonstrated the use of a variant of the Naïve Bayes classifier called the Multinomial Naïve Bayes (MNB) in solving text classification problems pertaining to unbalanced datasets. Another variant of the Naïve Bayes text classifier is offered by Dai, Xue, Yang and Yu [8], in which they solved the problem of categorizing documents across varied distributions.

The k-Nearest Neighbors (k-NN) technique, as stated above, has also been used in the field of text classification. The k-NN algorithm finds the k nearest neighbors of the data instance that needs to be classified in order to form its neighborhood. Then, a voting mechanism is used within the neighborhood to determine the class of the instance. Guo et al.
[9] developed a new method for text classification which combined the advantages of the k-NN algorithm with another type of classifier known as the Rocchio classifier and obtained performance levels in the range of those offered by SVM-based text classifiers. In [10], Toker and Kirmemis made use of the k-NN classifier to develop an application for organizing documents. Additionally, Li, Yu and Lu [11] developed a modified k-NN based method that utilized a suitable number of nearest neighbors in order to predict classes based on the distribution of a particular class in the test documents.

III. THEORY BEHIND THE PROPOSED WORK

The property that has been incorporated into the present research work to help boost its accuracy is the inverse class frequency. This is based on the tf-idf (term frequency-inverse document frequency) statistic that is widely used in information retrieval. The tf-idf method helps in determining the importance of a particular term with respect to a specific document within a collection of documents and finds use as a weighting factor [2][18]. It is directly proportional to the term frequency, which is the number of times the term appears in the document, and inversely proportional to the document frequency, which is the number of different documents within the collection in which the term appears. Thus, tf-idf essentially combines two weighting factors based on the following two principles:

1. The weight of a term within a document is proportional to the term frequency [16]
2. The specificity of a term is inversely proportional to the number of different documents in which the term occurs [17]

To suit the purpose of the present research work, a similar property has been defined in this paper for each test instance and has been termed the word frequency-inverse class frequency, or wf-icf. As the name suggests, wf-icf is directly proportional to the word frequency, which is the number of times a particular word in an instance appears across all the training instances belonging to a particular class, and inversely proportional to the class frequency, which is the number of different classes in whose instances the word appears.

The word frequency 'wf' is computed during the training phase by simply counting the number of occurrences of each word separately for each class. The 'icf' of each word 'w' in the instance is then computed during the classification phase using the following formula:

ICF(w) = log10(N/M),

where N is the total number of classes in the dataset and M is the number of classes whose instances contain the word 'w', i.e. the number of classes for which wf(w) > 0. This formula is adapted from the formula used in the calculation of the tf-idf statistic.
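As a minimal illustration of the two quantities just defined, the following Python sketch counts per-class word frequencies and computes ICF from them. The function names and the data layout (a dict mapping each class label to a Counter of word counts) are our own illustrative choices, not the paper's implementation; the convention of returning 0 for a word unseen in training is likewise an assumption, since the paper does not define ICF for that case.

```python
import math
from collections import Counter, defaultdict

def train_word_frequencies(instances, labels):
    """Count wf(word, class): occurrences of each word per class."""
    wf = defaultdict(Counter)  # class label -> Counter of word counts
    for text, label in zip(instances, labels):
        wf[label].update(text.lower().split())
    return wf

def icf(word, wf):
    """ICF(w) = log10(N / M), where N is the number of classes and
    M is the number of classes whose instances contain the word."""
    n = len(wf)
    m = sum(1 for counts in wf.values() if counts[word] > 0)
    return math.log10(n / m) if m > 0 else 0.0  # assumed: unseen word -> 0
```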
From the table, the ICF would be as follows using the previously mentioned formula:

ICF(introduction) = 0.0969
ICF(java) = 0.6989
ICF(programming) = 0.221

Table 1. Word Frequencies

Table 2. WF-ICF Values

Each frequency count shown in Table 1 is then multiplied by the ICF for the corresponding word to obtain the WF-ICF values shown in Table 2. The values in each column of Table 2 are then summed up, so each summation value shows the result corresponding to a particular class. With this, the following values are obtained:

Biology = 1.48
Chemistry = 0.29
Computer Science = 2.38
Mathematics = 0.63
Physics = 0.2

Since Computer Science has the highest wf-icf value, the instance 'introduction java programming' will be classified as a Computer Science textbook, which is of course the correct class. This result is a major improvement on the previous classifier proposed in [1], where only the word frequency had been used to classify the instances. For example, since the previous classifier only used word frequency and did not incorporate the ICF property, it would have classified the above instance as a Biology textbook, since the training dataset in the above example contains a larger number of Biology textbooks with the word 'introduction' in them. Being able to correctly classify such instances is what helped to boost the present classification accuracy while enhancing the classifier to be able to classify between any number of classes.

To develop a mathematical formulation for the present classifier, let the number of data classes in the training data be N, numbered from 0 to N-1. Let the current test data instance to be classified consist of W words, numbered from 0 to W-1. Also, let the frequency count of each word in the training instances of a particular class be represented by WF(X,Y), where X is the word number and Y is the corresponding class number, and let the previously defined inverse class frequency of each word X be given by ICF(X). Thus,

X = {0, 1, 2, …, W-1} and Y = {0, 1, 2, …, N-1}

The next step is to find a relation that specifies the conditions for a test data instance to be correctly classified. For this, let C be the class number of the current test data instance's actual class label. Thus, the classifier will be correct if it predicts that the class label of the current test instance is class number C.

Based on the previously defined notation, for word number I of the current test data instance, the frequency of occurrence of that word in training instances belonging to class C in the training data set is given by WF(I,C). Additionally, the corresponding inverse class frequency is ICF(I). Thus, the wf-icf property is given by WF-ICF(I,C), where

WF-ICF(I,C) = WF(I,C) * ICF(I)

Thus, a test instance will be correctly classified if the summation of WF-ICF(I,C), over all words I of the test instance, is greater than the summation of WF-ICF(I,Y) for each of the other N-1 classes, where Y is a class number ranging from 0 to N-1 but not equal to C. This can be mathematically represented as:

$$\sum_{I=0}^{W-1} \mathrm{WF\text{-}ICF}(I,C) \;>\; \sum_{I=0}^{W-1} \mathrm{WF\text{-}ICF}(I,Y) \qquad \forall\, Y \in \{0,1,\ldots,N-1\},\ Y \neq C$$

In the above expression, C is the class number of the actual class the test instance should be classified into, W is the number of words in the current test instance, Y is the class number of every class other than class number C, WF is the frequency count as defined previously and ICF is the inverse class frequency as defined previously.
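To make the scoring procedure concrete, below is a minimal Python sketch of the ICF computation and the wf-icf decision rule described above. The data layout is an assumption made for illustration: word frequencies are held in a dictionary that maps each word to a list of N per-class counts (so, in the book-title example, `word_freqs['java'][2]` would be the count of 'java' in Computer Science training titles).

```python
import math

def icf(word, word_freqs, num_classes):
    """ICF(w) = log10(N / M), where M is the number of classes whose
    training instances contain the word, i.e. classes with wf(w) > 0."""
    m = sum(1 for count in word_freqs[word] if count > 0)
    return math.log10(num_classes / m) if m > 0 else 0.0

def classify(instance_words, word_freqs, num_classes):
    """Sum WF-ICF(I, Y) over the instance's words for every class Y and
    return the index of the class with the maximum product sum."""
    sums = [0.0] * num_classes
    for word in instance_words:
        if word not in word_freqs:        # unseen words contribute nothing
            continue
        weight = icf(word, word_freqs, num_classes)
        for y in range(num_classes):
            sums[y] += word_freqs[word][y] * weight
    return max(range(num_classes), key=lambda y: sums[y])
```

With five classes, a word such as 'java' that occurs in only one class gets ICF = log10(5/1) ≈ 0.699, matching the value computed above, while a word spread over all five classes gets ICF = log10(5/5) = 0 and therefore contributes nothing to any class sum.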
IV. WORK DESCRIPTION

In this section, an overview of the algorithms used to implement the initialization, training and testing phases of the multi-class incremental text classifier is presented. A brief description of each phase is given, followed by the steps involved in implementing it.

In the initialization phase, the algorithm reads in the training data instances along with their corresponding class labels. It then counts the number of different classes contained in the training data set and also constructs the hash table which will store the mapping between the different words contained in the training data and the corresponding word frequencies. This mapping is performed in the training phase. The final step in the initialization phase involves pre-processing the training data for the training phase. The steps are outlined below.

A. Initialization Phase

∙ Read in the training data and store it in an array called 'trainData', with each element containing a training instance
∙ Read in the corresponding class labels and store them in an array called 'labels'
∙ Loop through 'labels' to determine the number of different labels (i.e. classes) and store it in a variable called 'N'
∙ Create a hash table called 'wordList' whose keys will be the words in the training data and whose values will be the word counts
∙ Create a string array called 'categories' that stores the different class labels
∙ Convert each instance in 'trainData' to lower case

The training phase involves populating the hash table created in the initialization phase with the words contained in the training data and their corresponding frequencies. Thus, in this phase, the algorithm iterates through the training data instances, scans each word contained in these instances and updates the frequency counts stored in the hash table accordingly, depending on the class label of the data instance in which the word is contained. The frequency of each word is stored as an array of N elements, where N is the number of class labels contained in the training data set; the ith element of this array stores the number of times the corresponding word appears in training instances belonging to the ith class. The steps for this phase, together with a code sketch, are given below.

B. Training Phase

∙ Iterate through each member of 'trainData'
∙ Store the corresponding label in a string called 'trainLabel'
∙ Remove punctuation and stop words from each instance
∙ Convert the resulting string of words into a string array
∙ For each word in the array, check if it is in 'wordList'
a) If not, add it as a key in 'wordList' and set the value of that key to a list of 'N' elements of 0, but with the ith element set to 1, where the ith class label is 'trainLabel'
b) Otherwise, retrieve the list stored in the value field of the corresponding word and increment the ith element by 1, where the ith class label is 'trainLabel'
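The following is a compact Python sketch of the initialization and training phases, using the names from the steps above ('trainData', 'labels', 'N', 'wordList', 'categories'). The tokenizer and the stop-word set shown are simplified assumptions, not the classifier's actual list.

```python
import string

STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in"}   # illustrative subset

def preprocess(instance):
    """Lower-case an instance, strip punctuation and stop words,
    and return the resulting word array."""
    text = instance.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

def initialize_and_train(train_data, labels):
    """Build 'wordList': each key is a word, each value a list of N
    per-class frequency counts (the ith element counts occurrences in
    training instances whose label is the ith category)."""
    categories = sorted(set(labels))       # the N distinct class labels
    n = len(categories)
    label_index = {label: i for i, label in enumerate(categories)}
    word_list = {}
    for instance, train_label in zip(train_data, labels):
        i = label_index[train_label]
        for word in preprocess(instance):
            if word not in word_list:
                word_list[word] = [0] * n  # new key: all-zero count list
            word_list[word][i] += 1        # increment the ith element
    return word_list, categories
```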
The final phase is the testing phase, in which the accuracy of the classifier is determined. In this phase, the algorithm reads in the test data instances and their corresponding class labels and pre-processes them in a manner similar to what was done in the training phase. It then predicts the class label of each test instance. In order to do so, for each test instance, the algorithm iterates through each of its constituent words and computes the wf-icf values for the word and each class label as defined in the previous section. It then sums the wf-icf values for each class label and selects the class label that maximizes this sum as the predicted class label. If this predicted label is the same as the actual class label, then the count of correct predictions is incremented. The process is continued for all test instances, and the prediction accuracy of the classifier is determined using the final count of correct predictions. The steps, followed by a code sketch, are given below.

C. Testing Phase

∙ Read in the test data and store it in an array called 'testData', similar to how the training data was stored
∙ Read in the corresponding class labels and store them in an array called 'testLabels'
∙ Initialize a variable called 'correct' to 0 - this will store the total number of correct predictions
∙ Iterate through each member of 'testData'
∙ Store the corresponding label in a string called 'testLabel'
∙ For each instance, convert to lower case, remove punctuation and stop words, and convert to an array
∙ Create an integer array called 'sums' consisting of N elements; the ith element of 'sums' will store the sum of the number of times the words in the current test instance appear in training instances with the ith class label
∙ Initialize a variable called 'userCat' to 0 - this will store the predicted label
∙ For each word in the array, check if the word is in the 'wordList' hash table which was constructed during the training phase
a) If not, nothing is to be done
b) Otherwise, retrieve the array stored in the corresponding value field and calculate the inverse class frequency using the formula defined in the previous section
c) Update 'sums' accordingly, i.e. calculate the product of the number in the ith element of the value array and the inverse class frequency and add the product to the ith element of 'sums'. This product is the 'wf-icf' value that has been defined previously.
∙ Set the variable 'userCat' to the index in 'sums' that corresponds to the maximum value
∙ Increment 'correct' if the label corresponding to 'userCat' is the same as the corresponding label in 'testLabels'
∙ Add the words of the current test instance along with the predicted label to 'wordList'. This step implements incremental classification.
∙ Repeat the process for all remaining test instances
∙ Calculate the classifier accuracy using the 'correct' variable, which stores the number of correct predictions
∙ Finally, output the total number of test data instances, the number of correct predictions and the resulting prediction accuracy of the classifier
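Continuing the sketch above (and reusing its `preprocess` helper), the testing phase could look roughly as follows; the final loop implements the incremental step that folds each classified test instance back into 'wordList' under its predicted label.

```python
import math

def test(test_data, test_labels, word_list, categories):
    """Predict each test instance with wf-icf, update 'wordList'
    incrementally, and return the prediction accuracy."""
    n = len(categories)
    correct = 0
    for instance, test_label in zip(test_data, test_labels):
        words = preprocess(instance)
        sums = [0.0] * n
        for word in words:
            if word not in word_list:      # unseen word: nothing to be done
                continue
            counts = word_list[word]
            m = sum(1 for c in counts if c > 0)
            weight = math.log10(n / m)     # inverse class frequency
            for i in range(n):
                sums[i] += counts[i] * weight   # accumulate wf-icf
        user_cat = max(range(n), key=lambda i: sums[i])
        if categories[user_cat] == test_label:
            correct += 1
        for word in words:                 # incremental classification
            if word not in word_list:
                word_list[word] = [0] * n
            word_list[word][user_cat] += 1
    return correct / len(test_data)
```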
V. RESULT AND ANALYSIS

In order to test the classifier, k-fold cross validation has been performed with k = 6, using a dataset consisting of a total of 300 book titles spanning 5 different subjects, viz., Biology, Chemistry, Computer Science, Mathematics and Physics, with 60 titles per subject. In each step of cross validation, a hash table has been constructed, giving a total of 6 hash tables. The order of the subjects used in constructing these tables is as given above: Biology comes first, followed by Chemistry, Computer Science and Mathematics in that order, and finally, the last entry corresponds to Physics.

Table 3. The first five entries of the first hash table
Table 4. The first five entries of the second hash table
Table 5. The first five entries of the third hash table
Table 6. The first five entries of the fourth hash table
Table 7. The first five entries of the fifth hash table
Table 8. The first five entries of the sixth hash table

The titles in each subject are numbered from 1 to 60. For each test, 10 titles from each subject have been used as the testing set and the remaining 50 titles from each subject have been used as the training set. Thus, each test involved a training set of 250 book titles and a testing set of 50 book titles.

Table 9. Test Results using Word Frequency

The results of the initial testing using only the word frequency methodology from the previous work in [1] are shown in Table 9. It is clear from Table 9 that less than satisfactory results were obtained, with an average prediction accuracy of 82% (246 correct class predictions out of 300 total test instances). This was below the 83.6% accuracy that was obtained using the same methodology for the binary classifier in the previous work [1]. A slight dip in accuracy is explainable by the smaller training dataset (300 instances instead of the 500 that were used in [1]) and the increased number of classes (5 instead of 2), but nevertheless, a higher level of accuracy was desired.

Thus, the inverse class frequency property was incorporated into the classifier, as has been explained in previous sections. This helped to rectify many incorrect predictions that previously happened because of generic words like 'introduction', 'techniques', 'methods', etc., which had a high frequency of occurrence but weren't intrinsic to any specific subject. Using the wf-icf property, the classifier has been re-tested on the same dataset, and the results obtained are shown in Table 10 below.

Table 10. Test Results using WF-ICF

As is evident, a much higher level of accuracy has been obtained, with an average accuracy of 87% (261 correct predictions out of 300 total data instances). Though accuracy below 90% does not seem exceptionally high, the training dataset only contains 300 instances, which is quite small for training a text classifier. It can be expected that using a larger training dataset would allow the classifier to achieve an accuracy above 90%, since the larger the number of training instances, the higher the level of accuracy achieved by the classifier.

Table 11 lists all 39 incorrect predictions made by the classifier along with their actual class and the class predicted by the classifier. An important fact revealed by this table is that nearly half (17 out of 39) of the incorrect predictions mistakenly identified Biology as the correct class. Most of these are cases in which none of the words in the test instance could be found in the training set (or, more specifically, in the hash table constructed from the training set), and thus the classifier did not know how to deal with these words, leading to the instance getting a wf-icf score of 0 for all classes. In the cases when the wf-icf scores of two or more classes are the same, the predicted class will be the one which appears first in the order (Biology, Chemistry, Computer Science, Mathematics, Physics). Thus, when the wf-icf score is 0 for all classes, the predicted class label is Biology. Clearly, this issue can be resolved by using a larger number of training instances so that more concepts related to the different classes can be captured in the dataset used to train the classifier.

This table also reveals that incorporating word stemming would be very useful in improving accuracy, as presently the classifier treats words like 'Math', 'Mathematics', 'Mathematician', 'Mathematical', etc.
as completely different from each other, and so a book title such as 'Mathematical Intro to Logic' cannot be identified by the classifier as a mathematics book even though it knows that a title with the word 'mathematics' is a mathematics book. Plural and singular forms of the same word also reduce classifier accuracy for the same reason.

Table 11. Incorrect Predictions

Regardless of all these issues, a considerably satisfactory level of prediction accuracy has been obtained with the proposed classifier. One disadvantage of incremental classification is that if an instance is incorrectly classified, the classifier's accuracy is lower for future classifications. This problem is reduced as larger datasets are used in the training phase.

To determine the runtime complexity of the training and classification processes, let the number of training instances be N (i.e. the dataset used to train the classifier has N instances). Additionally, let the number of words in the largest training instance be M, and let the number of different classes (i.e. the number of unique class labels) contained in the training dataset be C.

The initialization phase simply involves looping through the training dataset in order to load the instances into the program along with their corresponding class labels and then running another loop through the labels to determine the number of unique classes. Thus, this phase is bounded by O(N).

In order to train the classifier, the algorithm loops through each of the N training instances, and for each such instance, it must loop through each of its constituent words. For each of these words, the index of the corresponding class label of the current training instance is determined and the related word count array is updated accordingly. Since the largest training instance consists of M words, the time complexity of the training phase is bounded by O(N×M). Each instance must also be pre-processed before being used to train the classifier, by converting to lower case and then removing stop words and punctuation. To do this pre-processing, the algorithm must loop through each instance and compare each word in the instance with the stop words. Let the number of stop words that the classifier can detect be s. Then, the stop word removal process has complexity O(s×M). However, s is a constant, as the number of stop words is fixed for the classifier, so this complexity reduces to O(M). Similarly, punctuation removal and lower case conversion require only a loop through each word in each instance. Thus, for each individual instance, the pre-processing time complexity is O(M), and the overall complexity of the training phase for each instance remains bounded by O(M).

The classification phase works similarly to the training phase but has an additional level of complexity for each instance. It requires iterating through each word in the instance to be classified and, for each word, iterating through the different classes in order to determine the wf-icf values for each class. Thus, the complexity of classifying each instance is O(M×C). Hence, the classifier has a quadratic runtime complexity for the overall training process as well as for the classification of a single instance.

VI. CONCLUSION AND FUTURE SCOPE

In this paper, the proposed work has successfully improved upon the work that was done in [1] by converting the incremental binary text classifier into an incremental N-class text classifier. Further, the classifier's prediction accuracy was improved by incorporating the wf-icf property (based upon the well-known tf-idf statistic of information retrieval).

As discussed in the previous section, an important and useful extension that can be made to the classifier in the future is incorporating word stemming into the training and classification processes. This would allow the classifier to treat different forms of the same word as the same concept and thereby improve the classification accuracy; several of the incorrect predictions in Table 11 were caused by the lack of this feature. A sketch of how stemming could be added is given below.
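As an indication of how little code this extension would require, the pre-processing step could be wrapped with a stemmer; the sketch below uses NLTK's PorterStemmer, which is one possible choice rather than a part of the proposed classifier, and reuses the `preprocess` helper from the earlier sketch.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess_with_stemming(instance):
    """Pre-process as before, then reduce every word to its stem so that
    forms such as 'mathematics' and 'mathematical' share one key in
    'wordList'."""
    return [stemmer.stem(word) for word in preprocess(instance)]
```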
REFERENCES

[1] A. Sarkar, S. Chatterjee, W. Das, D. Datta, "Text Classification using Support Vector Machine", International Journal of Engineering Science Invention, Vol. 4, Issue 11, November 2015, pp. 33-37.
[2] M. Ikonomakis, S. Kotsiantis, V. Tampakas, "Text Classification Using Machine Learning Techniques", WSEAS Transactions on Computers, Vol. 4, Issue 8, August 2005, pp. 966-974.
[3] F. Sebastiani, "Text Categorization", The Encyclopedia of Database Technologies and Applications, 2005, pp. 683-687.
[4] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Technical Report 23, Universitat Dortmund, LS VIII, 1997.
[5] N. Cristianini, "Support Vector and Kernel Machines", Tutorial at the 18th International Conference on Machine Learning, June 28, 2001.
[6] K. Ming Leung, "Naive Bayesian Classifier", Polytechnic University, Department of Computer Science / Finance and Risk Engineering, 2007.
[7] E. Frank and R. R. Bouckaert, "Naive Bayes for Text Classification with Unbalanced Classes", Knowledge Discovery in Databases: PKDD 2006, pp. 503-510.
[8] W. Dai, G. Xue, Q. Yang and Y. Yu, "Transferring Naive Bayes Classifiers for Text Classification", Proceedings of the 22nd National Conference on Artificial Intelligence, Vol. 1, 2007, pp. 540-545.
[9] G. Guo, H. Wang, D. Bell, Y. Bi and K. Greer, "Using kNN Model for Automatic Text Categorization", Soft Computing, Vol. 10, Issue 5, 2006, pp. 423-430.
[10] G. Toker and O. Kirmemis, "Text Categorization using k Nearest Neighbor Classification", Survey Paper, Middle East Technical University.
[11] B. Li, S. Yu and Q. Lu, "An Improved k-Nearest Neighbor Algorithm for Text Categorization", arXiv preprint cs/0306099, 2003.
[12] D. D. Lewis and W. A. Gale, "A Sequential Algorithm for Training Text Classifiers", Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag New York, Inc., 1994.
[13] K. Nigam, A. McCallum, S. Thrun and T. Mitchell, "Learning to Classify Text from Labeled and Unlabeled Documents", AAAI/IAAI 792, 1998.
[14] P. Soucy and G. Mineau, "Feature Selection Strategies for Text Categorization", AI 2003, LNAI 2671, 2003, pp. 505-509.
[15] A. Kehagias, V. Petridis, V. Kaburlasos and P. Fragkou, "A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms", JIIS, Vol. 21, Issue 3, 2003, pp. 227-247.
[16] H. P. Luhn, "A Statistical Approach to Mechanized Encoding and Searching of Literary Information", IBM Journal of Research and Development, 1957.
CHAPTER 2
GRAPHICAL AND TABULAR DESCRIPTIVE TECHNIQUES

SECTION 1

MULTIPLE CHOICE QUESTIONS

1. Which of the following statements is false?
a. All calculations are permitted on interval data
b. All calculations are permitted on nominal data
c. The most important aspect of ordinal data is the order of the data values
d. The only permissible calculations on ordinal data are ones involving a ranking process

2. The average number of units earned per semester by college students is suspected to be rising. A researcher at Boston College wishes to estimate the number of units earned by students during the spring semester of 2004 at Boston. To do so, he randomly selects 250 student transcripts and records the number of units each student earned in the spring term of 2004. The variable of interest to the researcher is the
a. number of students enrolled at Boston College during the spring term of 2004
b. average indebtedness of Boston College students enrolled in the spring
c. age of Boston College students enrolled in the spring
d. number of units earned by Boston College students during the spring term of 2004

3. The classification of student major (accounting, economics, management, marketing, other) is an example of
a. a categorical random variable.
b. a discrete random variable.
c. a continuous random variable.
d. a parameter.

4. A study is under way in a national forest to determine the adult height of pine trees. Specifically, the study is attempting to determine what factors aid a tree in reaching heights greater than 50 feet tall. It is estimated that the forest contains 32,000 adult pines. The study involves collecting heights from 500 randomly selected adult pine trees and analyzing the results. The variable of interest in the study is the
a. age of a pine tree in the national forest.
b. height of a pine tree in the national forest.
c. number of pine trees in the national forest.
d. species of trees in the national forest.

5. The classification of student class designation (freshman, sophomore, junior, senior) is an example of
a. a categorical random variable.
b. a discrete random variable.
c. a continuous random variable.
d. a parameter.

6. Most analysts focus on the cost of tuition as the way to measure the cost of a college education. But incidentals, such as textbook costs, are rarely considered. A researcher at Ferris State University wishes to estimate the textbook costs of first-year students at Ferris. To do so, he monitored the textbook cost of 200 first-year students and found that their average textbook cost was $275 per semester. The variable of interest to the researcher is the
a. textbook cost of first-year Ferris State University students.
b. year in school of Ferris State University students.
c. age of Ferris State University students.
d. cost of incidental expenses of Ferris State University students.

7. The manager of the customer service division of a major consumer electronics company is interested in determining whether the customers who have purchased a videocassette recorder made by the company over the past 12 months are satisfied with their products. The possible responses to the question "Are you happy, indifferent, or unhappy with the performance per dollar spent on the videocassette recorder?" are values from a
a. discrete numerical random variable.
b. continuous numerical random variable.
c. categorical random variable.
d. parameter.

TRUE / FALSE QUESTIONS

8. There are actually four types of data: nominal, ordinal, interval, and ratio.
However, for statistical purposes, there is no difference between interval and ratio data, and the authors of your book combine the two types. √

9. Quantitative variables usually represent membership in groups or categories.

10. Interval data, such as heights, weights, and incomes, are also referred to as quantitative or numerical data. √

11. Nominal data are also called qualitative or categorical data. √

12. ATP singles rankings for tennis players are an example of an interval scale.

13. Interval data may be treated as ordinal or nominal. √

14. Nominal data may be treated as ordinal or interval.

15. Professor Hogg graduated from the University of Iowa with a code value = 2 while Professor Maas graduated from Michigan State University with a code value = 1. The scale of measurement likely represented by this information is interval.

16. Ordinal data may be treated as interval but not as nominal.

17. A variable is some characteristic of a population, while data are the observed values of a variable based on a sample.

18. An automobile insurance agent believes that company "A" is more reliable than company "B". The scale of measurement that this information represents is the ordinal scale. √

STATISTICAL CONCEPTS & APPLIED QUESTIONS

19. The Dean of Students conducted a survey on campus. SAT score in mathematics is an example of a __________, or __________ variable.
ANSWER: quantitative, numerical

20. Provide one example each for nominal, ordinal, and interval data.
ANSWER:
Nominal data example: political party affiliation for voters recorded using the code: 1 = Democrat, 2 = Republican, and 3 = Independent.
Ordinal data example: response to a market research survey measured on the Likert scale using the code: 1 = Strongly agree, 2 = Agree, 3 = Neutral, 4 = Disagree, and 5 = Strongly disagree.
Interval data example: temperature on tennis courts during the US Open.

21. The Dean of Students conducted a survey on campus. The gender of the student is an example of a __________, or __________ variable.
ANSWER: categorical, qualitative

22. For each of the following examples, identify the data type as nominal, ordinal, or interval.
a. The letter grades received by students in a computer science class
b. The number of students in a statistics course
c. The starting salaries of newly graduated Ph.D.s from a statistics program
d. The size of fries (small, medium, large) ordered by a sample of Burger King customers
e. The college you are enrolled in (Arts and Sciences, Business, Education, etc.)
ANSWER:
a. Ordinal
b. Interval
c. Interval
d. Ordinal
e. Nominal

23. The Dean of Students conducted a survey on campus. Class designation (Freshman, Sophomore, Junior, and Senior) is an example of a __________, or __________ variable.
ANSWER: categorical, qualitative

24. Most colleges admit students based on their achievements in a number of different areas. The grade obtained in a senior level English course (A, B, C, D, or F) is an example of a __________, or __________ variable.
ANSWER: categorical, qualitative

25. At the end of an escorted motor coach vacation, the tour operator asks the vacationers to respond to the questions listed below.
For each question, determine whether the possible responses are interval, nominal, or ordinal.
a. How many escorted vacations have you taken prior to this one?
b. Do you feel that the stay in New York was sufficiently long?
c. Which of the following features of the hotel in New York did you find most attractive: location, facilities, room size, or price?
d. What is the maximum number of hours per day that you would like to spend traveling?
e. Would your overall rating of this tour be excellent, good, fair, or poor?
ANSWER:
a. Interval
b. Nominal
c. Nominal
d. Interval
e. Ordinal

26. For each of the following, indicate whether the variable of interest would be nominal or interval.
a. Whether you are a US citizen
b. Your marital status
c. Number of cars in a parking lot
d. Amount of time you spend per week on your homework
e. Lily's travel time from her dorm to the student union at the University of Iowa
f. Heidi's favorite brand of tennis balls
ANSWER:
a. Nominal
b. Nominal
c. Interval
d. Interval
e. Interval
f. Nominal

27. In purchasing a used automobile, there are a number of variables to consider. The age of the car is an example of a __________, or __________ variable.
ANSWER: quantitative, numerical

28. In purchasing an automobile, there are a number of variables to consider. The body style of the car (sedan, coupe, wagon, etc.) is an example of a __________, or __________ variable.
ANSWER: categorical, qualitative

29. Before leaving a particular restaurant, customers are asked to respond to the questions listed below. For each question, determine whether the possible responses are interval, nominal, or ordinal.
a. What is the approximate distance of the restaurant from your residence?
b. Have you eaten at the restaurant previously?
c. On how many occasions have you eaten at the restaurant previously?
d. Which of the following attributes of the restaurant do you find most attractive: service, prices, quality of the food, or varied menu?
e. Would your overall rating of the restaurant be excellent, good, fair, or poor?
ANSWER:
a. Interval
b. Nominal
c. Interval
d. Nominal
e. Ordinal

SECTION 2

MULTIPLE CHOICE QUESTIONS

In the following multiple-choice questions, please circle the correct answer.

30. The best type of chart for comparing two sets of categorical data is a
a. line chart
b. pie chart
c. histogram
d. bar chart

31. Which of the following statements about pie charts is false?
a. Pie charts are graphical representations of the relative frequency distribution
b. Pie charts are usually used to display the relative sizes of categories for interval data
c. Pie charts always have the shape of a circle
d. The area of each slice of a pie chart is the proportion of the corresponding category of the frequency distribution of a categorical variable

32. The two graphical techniques we usually use to present nominal data are
a. bar chart and histogram
b. pie chart and ogive
c. bar chart and pie chart
d. histogram and ogive

33. Which of the following statements is false?
a. A bar chart is similar to a histogram
b. A pie chart is a circle subdivided into slices whose areas are proportional to the frequencies
c. Pie charts emphasize the frequency of occurrences of each category in a frequency distribution
d. None of the above

34.
Which of the following statements is true?
a. Bar charts focus the attention on the frequency of the occurrences of the categories
b. A bar chart is created by drawing a rectangle representing each category
c. The height of each rectangle in a bar chart represents the frequency for a particular category
d. All of the above

TRUE / FALSE QUESTIONS

35. A bar chart is used to represent interval data.

36. One of the advantages of a pie chart is that it clearly shows that the total of all the categories of the pie adds to 100%. √

37. The bar chart is preferred to the pie chart because the human eye can more accurately judge length comparisons against a fixed scale (as in a bar chart) than angular measures (as in a pie chart). √

38. Bar and pie charts are graphical techniques for nominal data. The former focus the attention on the frequency of the occurrences of the categories, and the latter emphasize the proportion of occurrences of each category. √

39. Bar and pie charts are two graphical techniques that can be used to represent nominal data. √

40. A bar chart is similar to a histogram in the sense that the bases of the rectangles are arbitrary intervals whose centers are the midpoints of the intervals.

41. If we wish to emphasize the relative frequencies for nominal data, we draw a histogram instead of drawing a bar chart.

42. Pie and bar charts are used widely in newspapers, magazines, and business and government reports. √

43. The size of each slice in a pie chart is proportional to the percentage corresponding to that category. √

44. A category that contains 30% of the observations is represented by a slice of a pie chart that contains 100 degrees.

STATISTICAL CONCEPTS & APPLIED QUESTIONS

45. Identify the type of data for which each of the following graphs is appropriate.
a. Pie chart
b. Bar chart
ANSWER:
a. Nominal
b. Nominal

46. Voters participating in a recent election exit poll in Minnesota were asked to state their political party affiliation. Coding the data 1 for Republican, 2 for Democrat, and 3 for Independent, the data collected were as follows: 3, 1, 2, 3, 1, 3, 3, 2, 1, 3, 3, 2, 1, 1, 3, 2, 3, 1, 3, 2, 3, 2, 1, 1, and 3. Construct a frequency bar graph.
ANSWER:

FOR QUESTIONS 47 AND 48, USE THE FOLLOWING NARRATIVE:
Narrative: Car Dealers
Car buyers were asked to indicate the car dealer they believed offered the best overall service. The four choices were Carriage Motors (C), Marco Chrysler (M), Triangle Auto (T), and University Chevrolet (U). The following data were obtained:
T C C C U C M T C U
U M C M T C M M C M
T C C T U M M C C T
T U C U T M M C U T

47. {Car Dealers Narrative} Construct a frequency bar chart.
ANSWER:

48. {Car Dealers Narrative} Construct a pie chart. Which car dealer offered the best overall service?
ANSWER:
It seems that Carriage Motors offered the best overall service.

49. Given the following five categories and the number of times each occurs, draw a pie chart and a bar chart.
ANSWER:

FOR QUESTIONS 50 AND 51, USE THE FOLLOWING NARRATIVE:
Narrative: Business School Graduates
The frequency distribution for a sample of 200 business school graduates is shown in the following table.

50. {Business School Graduates Narrative} Draw a pie chart of the number of graduates.
ANSWER:

51. {Business School Graduates Narrative} Draw a frequency bar chart.
ANSWER:
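A frequency bar graph such as the one asked for in question 46 can be generated directly from the coded data. Below is an illustrative Python/matplotlib sketch; the frequency counts in the comment follow from simply tallying the 25 responses.

```python
import matplotlib.pyplot as plt

# Exit-poll responses from question 46: 1 = Republican, 2 = Democrat, 3 = Independent
votes = [3, 1, 2, 3, 1, 3, 3, 2, 1, 3, 3, 2, 1,
         1, 3, 2, 3, 1, 3, 2, 3, 2, 1, 1, 3]
labels = ["Republican", "Democrat", "Independent"]
freqs = [votes.count(code) for code in (1, 2, 3)]   # tallies to 8, 6, 11

plt.bar(labels, freqs)                    # one bar per party
plt.ylabel("Frequency")
plt.title("Party Affiliation of Exit-Poll Respondents")
plt.show()
```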
SECTION 3

MULTIPLE CHOICE QUESTIONS

In the following multiple-choice questions, please circle the correct answer.

52. The most appropriate type of chart for determining the number of observations at or below a specific value is
a. a histogram
b. a pie chart
c. a time-series chart
d. a cumulative frequency ogive

53. In general, incomes of employees in large firms tend to be
A. positively skewed
B. negatively skewed
C. symmetric
D. All of the above

54. The total area of the bars in a relative frequency histogram
A. depends on the sample size
B. depends on the number of bars
C. depends on the width of each bar
D. depends on the height of each bar

55. Which of the following statements is false?
A. A frequency distribution counts the number of observations that fall into each of a series of intervals, called classes, that cover the complete range of observations.
B. The intervals in a frequency distribution may overlap to ensure that each observation is assigned to an interval.
C. Although the frequency distribution provides information about how the numbers in the data set are distributed, the information is more easily understood and imparted by drawing a histogram.
D. The number of class intervals we select in a frequency distribution depends entirely on the number of observations in the data set.

56. The total area of the five bars in a relative frequency histogram for which the width of each bar is four units is
A. 5
B. 4
C. 9
D. 1

57. The relative frequency of a class is computed by
A. dividing the frequency of the class by the number of classes
B. dividing the frequency of the class by the class width
C. dividing the frequency of the class by the total number of observations in the data set
D. subtracting the lower limit of the class from the upper limit and multiplying the difference by the number of classes

58. A modal class is the class that includes
A. the largest number of observations
B. the smallest number of observations
C. the largest observation in the data set
D. the smallest observation in the data set

59. The sum of the relative frequencies for all classes will always equal
A. the number of classes
B. the class width
C. the total number of observations in the data set
D. one
ANSWER: d

60. When ogives or histograms are constructed, which axis must show the true zero or "origin"?
A. The horizontal axis.
B. The vertical axis.
C. Both the horizontal and vertical axes.
D. Neither the horizontal nor the vertical axis.

61. The width of each bar in a histogram corresponds to the
A. differences between the lower and upper limits of the class.
B. number of observations in each class.
C. midpoint of each class.
D. frequency of observations in each class.

62. The most important and commonly used graphical presentation of interval data is a
A. bar chart
B. histogram
C. pie chart
D. cumulative frequency distribution

63. According to Sturges' rule, the ideal number of class intervals in a frequency distribution of n = 150 data values equals about
A. 8
B. 15
C. 20
D. 28

64. According to Sturges' rule, the ideal number of class intervals in a frequency distribution equals
A. 5
B. 15
C. 3.3 + log(n), where n is the size of the data set.
D. 1 + 3.3 log(n), where n is the size of the data set.

65. How many classes should a histogram contain if the number of observations is 250?
A. 5, 6, or 7
B. 7, 8, or 9
C. 9 or 10
D. 10 or 11

66. How many classes should a frequency distribution contain if the number of observations is 45?
A. 5, 6, or 7
B. 7, 8, or 9
C. 9 or 10
D. 10 or 11

67. Sturges' formula recommends that the number of class intervals used to construct a frequency distribution or draw a histogram for a data set with n observations is determined by
A. log(n)
B. 3.3 log(n)
C. 1 + 3.3 log(n)
D. 2 - 3.3 log(n)
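Questions 63 through 67 all turn on Sturges' rule; the following is a short Python sketch of the arithmetic (rounding to the nearest integer is an assumption about convention).

```python
import math

def sturges(n):
    """Sturges' rule: ideal number of class intervals = 1 + 3.3 * log10(n)."""
    return round(1 + 3.3 * math.log10(n))

print(sturges(150))   # 8, consistent with question 63
print(sturges(45))    # 6, consistent with question 66
print(sturges(250))   # 9, consistent with question 65
```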
68. Which of the following statements about the number of modal classes is false?
A. A unimodal histogram is one with a single peak
B. A bimodal histogram is one with two peaks, not necessarily equal in height
C. A bimodal histogram is one with two peaks equal in height
D. None of the above

69. Which of the following statements about shapes of histograms is true?
A. A histogram is said to be symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size
B. A positively skewed histogram is one with a long tail extending to the right
C. A negatively skewed histogram is one with a long tail extending to the left
D. All of the above

TRUE / FALSE QUESTIONS

70. A relative frequency distribution describes the proportion of data values that fall within each class, and may be presented in histogram form. √

71. A relative frequency distribution describes the proportion of data values that fall within each category. √

72. The stem-and-leaf display reveals far more information about individual values than does the histogram.

73. Individual observations within each class may be found in a frequency distribution.

74. The following stem-and-leaf output has been generated by statistical software. The median of this data is 26.

Stem-and-leaf of C2  N = 75
Leaf Unit = 10
 9   0  000112333
14   0  56899
21   1  0000123
26   1  66699
33   2  3334445
(8)  2  66677888
34   3  0023344
27   3  56669999
19   4  000122233
10   4  5556667799

75. A cumulative frequency distribution lists the number of observations that are within or below each of the classes. √

76. The following stem-and-leaf output has been generated by statistical software. This data has a negative mode. √

Stem-and-leaf of C2  N = 75
Leaf Unit = 0.01
 1   -2  6
 2   -2  0
 5   -1  555
 8   -1  420
22   -0  99999887777665
36   -0  44322111111000
(14)  0  01122233333344
25    0  66678889999
14    1  0022222334
 4    1  56
 2    2  03

77. Compared to the frequency distribution, the stem-and-leaf display provides more detail, since it can describe the individual data values as well as show how many are in each group, or stem. √

78. A histogram represents nominal data.

79. In the term "frequency distribution," frequency refers to the number of data values falling within each class. √

80. The class interval in a frequency distribution is the number of data values falling within each class.

81. The largest value in a set of data is 140, and the lowest value is 70. If the resulting frequency distribution is to have five classes of equal width, the class width will be 14. √

82. A stem-and-leaf display describes two-digit integers between 20 and 70. For one of the classes displayed, the row appears as 4|256. The numerical values being described are 24, 54, and 64.

83. The following "character histogram" has been generated by statistical software. The median class is 150. √

Histogram of C1  N = 75
Midpoint  Count
 -150       1  *
 -100       1  *
  -50       3  ***
    0       2  **
   50       7  *******
  100      12  ************
  150      18  ******************
  200      20  ********************
  250       5  *****
  300       5  *****
  350       1  *

84. The following stem-and-leaf output has been generated by statistical software. This data set has a mean that is negative, and there is no modal class. √

Stem-and-leaf of C2  N = 10
Leaf Unit = 0.10
 2   -1  53
 4   -0  97
(2)  -0  65
 4    0  3
 3    0  6
 2    1  3
 1    1  8

85. A frequency distribution is a listing of the individual observations arranged in ascending or descending order.

86. When a distribution has more values to the left and tails to the right, it is skewed negatively.

87. A histogram is said to be symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size. √

88. A skewed histogram is one with a long tail extending either to the right or left.
The former is called negatively skewed, and the latter is called positively skewed.

89. A bimodal histogram is one with two or more peaks equal in height.

90. A cumulative frequency distribution, when presented in graphic form, is called an ogive. √

91. When a distribution has more values to the right and tails to the left, we say it is skewed positively.

92. The sum of relative frequencies in a distribution always equals 1. √

93. The stem-and-leaf display is often superior to the frequency distribution in that it maintains the original values for further analysis. √

94. The sum of cumulative frequencies in a distribution always equals 1.

95. If the values of the sixth and seventh class in a cumulative frequency distribution are the same, we know that there are no observations in the seventh class. √

96. The larger the number of observations in a numerical data set, the larger the number of class intervals needed for a frequency distribution. √

97. The original data values cannot be assessed once they are grouped into a frequency distribution. √

98. A research analyst was directed to arrange raw data collected on the yield of wheat, ranging from 40 to 90 bushels per acre, in a frequency distribution. He should choose 40 as the class interval width.

99. The relative frequency of a class is the frequency of that class divided by the total number of classes.

100. Ogives are plotted at the midpoints of the class intervals.

101. Sturges' formula recommends that the number of class intervals needed to draw a histogram using a data set with 200 observations is 12.79, which we round to 13.

102. A modal class is the class with the largest number of observations. √

103. Incomes of employees in large firms tend to be negatively skewed, because there is a large number of relatively low-paid workers and a small number of well-paid executives.

104. The time taken by students to write exams is frequently positively skewed because few students hand in their exams early; most prefer to reread their papers and hand them in near the end of the scheduled test period.

105. A frequency distribution counts the number of observations that fall into each of a series of intervals, called classes, that cover the range of observations. √

106. One of the drawbacks of the histogram is that we lose potentially useful information by classifying the observations, sacrificing whatever information was contained in the actual observations. √

107. The histogram is usually preferred over the stem-and-leaf display.

108. The stem-and-leaf display's advantage over the histogram is that we can see the actual observations rather than observations classified into different classes. √

STATISTICAL CONCEPTS & APPLIED QUESTIONS

109. Identify the type of data for which a histogram is appropriate.
ANSWER: Interval

110. The total area under a relative frequency histogram for which the width of each class is ten units is _________.
ANSWER: 10

111. Voters participating in a recent election exit poll in Minnesota were asked to state their political party affiliation. Coding the data 1 for Republican, 2 for Democrat, and 3 for Independent, the data collected were as follows: 3, 1, 2, 3, 1, 3, 3, 2, 1, 3, 3, 2, 1, 1, 3, 2, 3, 1, 3, 2, 3, 2, 1, 1, and 3. Develop a frequency distribution and a proportion distribution for the data.
What does the data suggest about the strength of the political parties in Minnesota?
ANSWER:
The Independent party in Minnesota is stronger than the Republican and Democratic parties.

FOR QUESTIONS 112 THROUGH 118, USE THE FOLLOWING NARRATIVE:
Narrative: Salespersons' Ages
The ages of a sample of 25 salespersons are as follows:
47 21 37 53 28
40 30 32 34 26
34 24 24 35 45
38 35 28 43 45
30 45 31 41 56

112. {Salespersons' Ages Narrative} Draw a histogram with four classes.
ANSWER:

113. {Salespersons' Ages Narrative} Draw a histogram with six classes.
ANSWER:

114. {Salespersons' Ages Narrative} Draw a stem-and-leaf display.
ANSWER:

115. {Salespersons' Ages Narrative} Construct an ogive for the data.
ANSWER:

116. {Salespersons' Ages Narrative} Estimate the proportion of salespersons who are less than 30 years of age.
ANSWER: 0.24

117. {Salespersons' Ages Narrative} Estimate the proportion of salespersons who are more than 40 years of age.
ANSWER: 1 - 0.64 = 0.36

118. {Salespersons' Ages Narrative} Estimate the proportion of salespersons who are between 40 and 50 years of age.
ANSWER: 0.92 - 0.64 = 0.28

FOR QUESTIONS 119 THROUGH 121, USE THE FOLLOWING NARRATIVE:
Narrative: Defective Items
The number of defective items produced by a machine and recorded for the last 25 days are as follows: 19, 6, 15, 20, 17, 16, 17, 12, 15, 29, 23, 17, 7, 10, 14, 14, 27, 22, 8, 5, 23, 19, 9, 28, and 5.

119. {Defective Items Narrative} What is the relationship between the total area under the histogram you have constructed and the relative frequencies of observations?
ANSWER:
*A class contains observations up to but not including its upper limit. The other classes are defined similarly. This notation is used throughout the chapter.

120. {Defective Items Narrative} Construct a relative frequency histogram for these data.
ANSWER:
Note that the numbers that appear along the horizontal axis represent the upper limits of the class intervals even though they appear in the center of the classes.

121. {Defective Items Narrative} Construct a frequency distribution and relative frequency distribution for these data. Use five class intervals, with the lower boundary of the first class being five items.
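Both distributions asked for in question 121 can be computed directly from the narrative data. Here is a small Python sketch; the printed frequencies (6, 4, 8, 4, 3) were checked by hand against the 25 recorded values.

```python
# Daily defect counts from the Defective Items narrative
data = [19, 6, 15, 20, 17, 16, 17, 12, 15, 29, 23, 17, 7,
        10, 14, 14, 27, 22, 8, 5, 23, 19, 9, 28, 5]

# Five classes of width 5, with the lower boundary of the first class at 5
boundaries = [5, 10, 15, 20, 25, 30]
for lo, hi in zip(boundaries, boundaries[1:]):
    freq = sum(lo <= x < hi for x in data)   # "up to but not including" hi
    print(f"{lo} to <{hi}: frequency {freq}, "
          f"relative frequency {freq / len(data):.2f}")
# Output: 6 (0.24), 4 (0.16), 8 (0.32), 4 (0.16), 3 (0.12)
```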
Sample English Essay on Bar Charts

Bar Charts: A Comprehensive Guide to Data Visualization

Bar charts, a versatile form of data visualization, effectively represent categorical data and provide valuable insights into the distribution and comparison of different categories. This guide will delve into the intricacies of bar charts, exploring their construction, interpretation, and applications across various domains.

Components of a Bar Chart

A typical bar chart comprises several key components:

Bars: The vertical or horizontal rectangular elements that represent each category. The height or length of the bars corresponds to the associated data values.

Labels: Text labels that identify the categories along the x-axis (horizontal axis) and the data values along the y-axis (vertical axis).

Scales: The axes provide a reference frame for interpreting the data values. The x-axis represents the categories, while the y-axis represents the numerical values.

Axis Titles: Provide clear descriptions of the axes, indicating the type of data represented on each axis.

Legend: Optional; provides further information about the data or the different categories included in the chart.

Types of Bar Charts

There are two primary types of bar charts, each with its own unique characteristics:

Vertical Bar Chart: The bars are oriented vertically, with the x-axis representing the categories and the y-axis representing the data values. Vertical bar charts are commonly used to compare the values of different categories or to show changes over time.

Horizontal Bar Chart: The bars are oriented horizontally, with the y-axis representing the categories and the x-axis representing the data values. Horizontal bar charts are often used when the categories have long or complex names, or when it is more convenient to read the data horizontally.

Data Considerations

To create an effective bar chart, it is essential to consider the following data characteristics:

Numerical Values: Bar charts are most suitable for representing quantitative data, where the values can be measured on a numerical scale.

Categorical Data: The categories represented on the x-axis should be discrete and mutually exclusive.

Data Distribution: The distribution of the data values should be considered when choosing the scale of the y-axis.
For skewed distributions, a logarithmic scale may be necessary to prevent misleading visual representations.

Interpretation of Bar Charts

Bar charts provide several important insights into data:

Comparison of Categories: The height or length of the bars allows for easy comparison of the values across different categories.

Distribution of Values: The distribution of the bars reveals the frequency or proportion of each category.

Changes Over Time: In time-series bar charts, the bars represent data values at different time points, enabling the visualization of trends and patterns over time.

Applications of Bar Charts

Bar charts have wide-ranging applications across various fields:

Business Analysis: Bar charts are commonly used in business presentations and reports to depict sales figures, revenue distribution, and market share analysis.

Scientific Research: In scientific studies, bar charts are employed to visualize experimental data, compare treatments, and present results in a clear and concise manner.

Marketing and Advertising: Bar charts are used to track campaign performance, analyze customer demographics, and evaluate product preferences.

Education: Bar charts are useful for presenting survey results, comparing student performance, and visualizing data in educational settings.

Best Practices for Bar Chart Creation

To create an effective bar chart, follow these best practices:

Choose the Right Chart Type: Select the appropriate type of bar chart based on the data and the desired insights.

Use Clear Labels: Ensure that the categories and data values are labeled accurately and consistently.

Select an Appropriate Scale: The scale of the y-axis should be chosen carefully to avoid distorting the data or obscuring patterns.

Consider Color and Shading: Use colors and shading sparingly to enhance visual clarity and distinguish between categories.

Provide Context: Include a title and axis labels that provide sufficient context for the data being presented.

In conclusion, bar charts are a powerful tool for visualizing and interpreting categorical data. By understanding the components, types, data considerations, interpretation, and best practices of bar charts, users can effectively convey insights and make informed decisions based on data.
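To tie the two chart types together, here is a brief, self-contained matplotlib sketch that draws the same illustrative (made-up) data both ways; note how `bar` puts the categories on the x-axis while `barh` puts them on the y-axis.

```python
import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]   # illustrative data only
values = [42, 35, 50, 28]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.bar(categories, values)            # vertical: categories along the x-axis
ax1.set_title("Vertical Bar Chart")
ax1.set_ylabel("Value")
ax2.barh(categories, values)           # horizontal: categories along the y-axis
ax2.set_title("Horizontal Bar Chart")
ax2.set_xlabel("Value")
plt.tight_layout()
plt.show()
```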
Categorical Document Frequency Based Feature Selection for Text Categorization

Zhilong Zhen1,2, Haijuan Wang1, Lixin Han1, Zhan Shi1
1 College of Computer and Information, Hohai University, Nanjing, 210098, China
2 Department of Computer Science, Tonghua Normal University, Tonghua, 134002, China
E-mail: zhenzhilong@

Abstract—Effective feature selection methods are essential for improving the accuracy and efficiency of text categorization. Motivated by document frequency, we propose a new filter-based feature selection approach, called categorical document frequency. The categorical document frequency displays the distribution of a term over each category. Mathematically, the variance of a term reflects the contribution of the term to categorization. Finally, experiments are carried out on the Reuters-21578 standard text corpus. The results show that the categorization performance of the proposed approach is similar to or better than that of information gain and the chi-square statistic. In addition, the computational cost of this approach is lower than that of information gain and chi-square, so it is also well suited to processing large-scale text data.

Keywords—text categorization; feature selection; categorical document frequency; filter

I. INTRODUCTION

With the rapid development of science and technology, the Internet provides great convenience to people in real applications, and the amount of available textual information in electronic media, such as electronic books and digital libraries, has increased. It is therefore difficult but significant to organize and handle these resources, and automatic text categorization is widely used to accomplish this task [1]. Text categorization aims to automatically classify a new document into one or more predefined categories based on its contents [2][3]. In general, documents are usually represented by employing a bag-of-words approach: a document is treated as a point in the feature space spanned by the tens or hundreds of thousands of terms in the document collection [4]. However, most of these terms are irrelevant or noise, which leads to a high cost of computation and low accuracy of text categorization, and traditional classification methods are infeasible for text categorization due to the high dimensionality of the feature space. It is obviously crucial to address this curse of dimensionality, because effective dimensionality reduction will enhance performance, decrease complexity and save storage space for text categorization. Feature selection is a popular technique for simplifying or accelerating computations in text categorization by selecting a subset of the original feature set to represent the data before applying a learning algorithm. The key problem is to propose appropriate feature selection approaches that construct an optimal feature subset with little information loss. According to John et al. [5], feature selection methods can typically be grouped into two categories: wrapper approaches and filter approaches. In wrapper approaches, the performance of the classifier influences the construction of the feature subset; that is, the feature vector combination with the minimum classification error probability is chosen by estimating the error probability of the classifier for each feature vector combination. Wrapper approaches not only consider the relationship between feature subset search and model selection but also take into account the correlation of features without the assumption of independence.
Filter approaches, on the other hand, rank the features by the scores obtained from a certain criterion and select a predetermined number of top-scoring features to comprise the feature subset. From the perspective of classification performance, wrapper approaches are better than filter approaches. A main drawback of wrapper approaches is that their computational complexity is expensive, so they are unsuitable for processing high-dimensional text data. Filter techniques can easily be scaled to high-dimensional datasets, and they are relatively simple and fast. Several filter-based measures have been successfully used in the field of text categorization, such as document frequency (DF), mutual information (MI), information gain (IG), the χ² statistic (CHI) and term strength (TS) [6][7]. Yang and Pedersen compared the performance of the criteria mentioned above [7], and the experimental results indicated that information gain and the χ² statistic performed very well without loss of classification accuracy, and that document frequency was comparable to information gain and the χ² statistic due to its efficiency and reliability.

Document frequency is an unsupervised term-goodness criterion, and its computational complexity is approximately linear. This paper proposes a new method called categorical document frequency. The criterion takes advantage of category information, built on document frequency, to further improve classification performance. We believe that terms are not useful for text categorization if they occur uniformly in each category, and that terms are informative and should be preserved if they occur many times in one certain category while hardly appearing in the other categories. In fact, the experiments also demonstrate that the proposed method performs equal to or better than the classical information gain and χ² statistic criteria.

The remainder of this paper is organized as follows: first, we give a brief description of information gain and the χ² statistic for text feature selection. The new feature evaluation criterion for text categorization is introduced in Section III. The experimental settings, including data preprocessing and classifier choice, together with the experimental results of our comparisons with these two popular methods, are reported in Section IV. Finally, the conclusion is presented in Section V.

II. RELATED WORK

In the context of text categorization, information gain and the χ² statistic are frequently used feature selection methods which have been proved to be effective [7], so we consider these two methods as our baseline. In this section, we give a brief introduction to these two popular feature selection algorithms.

A. Information Gain (IG)

Information gain is an important computational approach in the domain of information theory. This feature ranking criterion estimates the amount of information brought by a term based on the presence and absence of the term in a category.
B. χ² Statistic (CHI)

In text categorization, the χ² statistic is usually used to measure the degree of dependency between a term t and a specific category c_i via a two-way contingency table of both, and is defined as follows:

\chi^2(t, c_i) = \frac{N \cdot (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}    (2)

where A is the number of documents in which term t and category c_i co-occur, B is the number of documents in which term t occurs without category c_i, C is the number of documents in which category c_i occurs without term t, D is the number of documents in which neither term t nor category c_i occurs, and N = A + B + C + D is the total number of documents.

For each category, the χ² statistic is calculated between a term and that category. In this paper, we take the category-weighted average over all categories as the measure:

\chi^2_{avg}(t) = \sum_{i=1}^{k} \Pr(c_i) \cdot \chi^2(t, c_i)    (3)

III. PROPOSED FEATURE SELECTION MEASURE

Document frequency is the simplest filter measure: terms whose document frequency falls below a predetermined threshold are removed from the feature space, on the assumption that low-frequency terms are non-informative. From our point of view, the class labels of documents provide useful information for selecting features. We therefore propose a feature selection approach based on categorical document frequency. Document frequency is the number of documents in which a term occurs; categorical document frequency is the number of documents belonging to category c_i (i = 1, 2, …, k, where k is the number of categories) in which a term occurs.

In this paper, we attempt to find the terms that are relevant to a category. For a term t, we compute the categorical document frequency over each category in the training set; the document frequency of term t over category c_i is denoted by df(t, c_i). Term t is informative for category c_i if df(t, c_i) is high while df(t, c_j) is low for every j ≠ i, that is, if term t occurs in many documents of category c_i and seldom occurs in the other categories. Conversely, term t is regarded as non-informative if the number of documents in which it occurs is almost the same in every category, i.e., if the term is scattered uniformly over the categories. Mathematically, the variance of the categorical document frequencies of a term reflects how unevenly the term is distributed over the categories: the bigger the difference in distribution, the larger the variance. Motivated by this intuition, we propose a novel evaluation function based on categorical document frequency (CDF). The evaluation criterion is defined as

score(t) = \frac{1}{k}\sum_{i=1}^{k} \big( df(t, c_i) - \mu \big)^2,  where  \mu = \frac{1}{k}\sum_{i=1}^{k} df(t, c_i).    (4)

Let T = {t_1, t_2, …, t_m} denote the set of terms. We calculate the scores of all terms by formula (4) and rank the terms in decreasing order of score; the top p (p ≪ m) terms are then selected to form the subset of the original features. Note that in this procedure we select only the terms that are strongly relevant to categorization.
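As an illustration of how formula (4) turns into a selection procedure, here is a minimal sketch that tallies df(t, c_i) on a labeled training set and returns the p highest-variance terms. The naming (cdf_select, docs, p as a parameter) is our own; the scoring itself follows (4) directly.

    from collections import defaultdict

    def cdf_select(docs, p):
        """Rank terms by the CDF score of Eq. (4) and keep the top p.
        `docs` is a list of (set_of_terms, category_label) pairs."""
        categories = sorted({label for _, label in docs})
        k = len(categories)
        # df[t][c] = number of documents of category c containing term t
        df = defaultdict(lambda: defaultdict(int))
        for terms, label in docs:
            for t in terms:
                df[t][label] += 1

        def score(t):
            freqs = [df[t][c] for c in categories]
            mu = sum(freqs) / k
            return sum((f - mu) ** 2 for f in freqs) / k   # variance over categories

        return sorted(df, key=score, reverse=True)[:p]

For example, cdf_select(train_docs, 500) would return a 500-term feature subset of the kind evaluated in the experiments below.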
The satisfactory results of our criterion are displayed in the next section.

IV. EXPERIMENTS

We conducted experiments on the Reuters-21578 text corpus [8] to validate the effectiveness of categorical document frequency. All experiments are based on the k-nearest neighbor classifier, which achieved strong performance in previous comparative studies [7][9]. The results obtained with categorical document frequency were compared to those obtained with two feature selection methods commonly found in the literature: information gain and chi-square. The empirical results show that CDF is an effective feature selection method according to the Macro-averaged F1 (Macro-F1) and Micro-averaged F1 (Micro-F1) measures.

A. Dataset and Preprocessing

In this paper, a subset of the Reuters-21578 standard document corpus is used for all experiments. We chose the top eight categories according to the ‘ModApte’ split [10], which fixes the division into training and test sets; the training set contains 6800 documents and the test set 2660 documents. We removed words according to an English stop list containing 571 stop words [11] and applied the Porter stemming algorithm, which maps words with the same meaning to one morphological form by removing suffixes [12][13].

B. K-Nearest Neighbor Classifier

The k-nearest neighbor classifier is commonly employed in text categorization because it is a non-parametric, non-linear classifier that makes few assumptions about the input data [7]. To assess the effectiveness of the proposed feature selection method, the k-NN classifier is used as the text classifier in this paper. The key idea of the k-NN classifier for text categorization is easily understood: a given test document is compared with the already-categorized documents in the training set, the k closest training documents are selected according to a document-document similarity or distance, and the category of the test document is then determined from the categories of those k documents.

C. Performance Measures

The performance of a feature selection algorithm for text categorization is evaluated by precision, recall, and the F1 measure, which are widely used for classification evaluation [14][15]. Precision and recall are defined as

precision = \frac{\#\ \text{of correct positive predictions}}{\#\ \text{of positive predictions}}, \qquad recall = \frac{\#\ \text{of correct positive predictions}}{\#\ \text{of positive examples}}

and the F1 measure is expressed in terms of precision and recall as

F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.

In general, we compare text categorization performance from two aspects, Macro-F1 and Micro-F1 [16]: Macro-F1 averages the per-category F1 values, while Micro-F1 is computed from counts pooled across all categories.
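As a concrete reading of these two measures, the sketch below computes Macro-F1 and Micro-F1 from per-category contingency counts. It is our own minimal illustration, not the experimental code of this paper; the names (macro_micro_f1, y_true, y_pred) are ours, and it assumes each document receives exactly one predicted category.

    from collections import Counter

    def macro_micro_f1(y_true, y_pred):
        """Macro- and Micro-averaged F1 for single-label predictions."""
        tp, fp, fn = Counter(), Counter(), Counter()
        for t, p in zip(y_true, y_pred):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1   # predicted p, but the true label was t
                fn[t] += 1   # missed an example of category t

        def f1(tp_, fp_, fn_):
            prec = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
            rec = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
            return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

        cats = set(y_true) | set(y_pred)
        macro = sum(f1(tp[c], fp[c], fn[c]) for c in cats) / len(cats)
        micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
        return macro, micro

Note that with exactly one prediction per document, the pooled false positives and false negatives coincide, so Micro-F1 reduces to overall accuracy.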
D. Experimental Results and Analysis

To evaluate text categorization performance, we calculated the macro-averaged and micro-averaged measures for the three feature evaluation algorithms with the number of selected features ranging from 500 to 3000. Figures 1 and 2 present the Macro-F1 and Micro-F1 results, respectively, of the three feature selection methods on the top 8 categories of Reuters-21578. In our experiments, the highest value of Macro-F1 is achieved by categorical document frequency.

Fig. 1. Macro-F1 on Reuters-21578 for the three methods.
Fig. 2. Micro-F1 on Reuters-21578 for the three methods.

Fig. 1 shows the Macro-F1 curves of the k-NN classifier as the number of features selected by information gain, the chi-square statistic, and categorical document frequency ranges from 500 to 3000. As can be seen in this figure, the Macro-F1 of CHI and IG reaches its highest value at 500 dimensions, and the best Macro-F1 of CDF, 69.16%, is also obtained at 500 dimensions. The macro-averaged results show that all three methods can eliminate up to 96.5% of the unique terms without losing, or while even improving, categorization performance, and that CDF performs best.

Fig. 2 displays the Micro-F1 curves of the k-NN classifier. In Fig. 2, the highest Micro-F1 is 90.57%, obtained by CHI when the number of selected features is again 500, while CDF achieves 90.52%; the performance of IG is inferior to both. Judged by how slowly Micro-F1 degrades as the number of dimensions increases, however, CDF is relatively better than IG and CHI.

In short, both the macro-averaged and the micro-averaged results demonstrate the effectiveness of the proposed approach for text categorization. The experimental results show that the new method is similar to or better than information gain and the chi-square statistic. Furthermore, the computational complexity of CDF is lower than that of IG and CHI, so it is suitable for processing large-scale text data.

V. CONCLUSIONS

Feature selection methods for large-scale text data have attracted more and more attention in the past decades. This paper presented a novel feature selection approach for text categorization, called categorical document frequency. As a filter method, categorical document frequency is independent of the performance of any classifier. Document frequency treats the training set as a whole and ignores document categories; categorical document frequency builds on it by counting, for each category, the number of documents of that category in which a term occurs. Categorical document frequency thus reflects how a term is scattered over the categories, and accordingly we employ the variance of a term's categorical document frequencies to measure whether the term is useful. Experimental results demonstrate that the proposed method achieves good performance for text categorization, similar to or better than the classical information gain and chi-square statistic, and that its computational cost is lower than that of information gain and chi-square. It is therefore a feasible and reliable option for processing large-scale text data when information gain and chi-square are too expensive to compute.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China under Grant No. 60673186 and by the Research Innovation Program for College Postgraduates of Jiangsu Province under Grant No. CXZZ11_0426. Sponsored by the Qing Lan Project.

REFERENCES

[1] E. F. Combarro, E. Montanes, I. Diaz, J. Ranilla, and R. Mones, “Introducing a family of linear measures for feature selection in text categorization,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, 2005, pp. 1223-1232.
[2] F. Sebastiani, “Machine learning in automated text categorization,” ACM Computing Surveys, vol. 34, no. 1, 2002, pp. 1-47.
[3] S. Godbole and S. Sarawagi, “Discriminative methods for multi-labeled classification,” in Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’04), LNAI 3056, 2004, pp. 22-30.
[4] X. F. He, D. Cai, H. F. Liu, and W. Y. Ma, “Locality preserving indexing for document representation,” in Proceedings of the 27th ACM International Conference on Research and Development in Information Retrieval (SIGIR’04), 2004.
[5] G. H. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset selection problem,” in Proceedings of the 11th International Conference on Machine Learning (ICML’94), 1994, pp. 121-129.
[6] G. Forman, “An experimental study of feature selection metrics for text categorization,” Journal of Machine Learning Research, vol. 3, 2003, pp. 1289-1305.
[7] Y. M. Yang and J. O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of the 14th International Conference on Machine Learning (ICML’97), 1997, pp. 412-420.
[8] D. D. Lewis, “Reuters-21578 text categorization collection,” 1997. Available: /databases/reuters21578.html.
[9] Y. M. Yang and X. Liu, “A re-examination of text categorization methods,” in Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR’99), 1999, pp. 42-49.
[10] C. Apte, F. Damerau, and S. Weiss, “Automated learning of decision rules for text categorization,” Information Systems, vol. 12, no. 3, 1994, pp. 233-251.
[11] /projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/
[12] M. F. Porter, “An algorithm for suffix stripping,” Program: Electronic Library and Information Systems, vol. 14, no. 3, 1980, pp. 130-137.
[13] E. Montanes, I. Diaz, J. Ranilla, E. Combarro, and J. Fernandez, “Scoring and selecting terms for text categorization,” IEEE Intelligent Systems, vol. 20, no. 3, 2005, pp. 40-47.
[14] W. Q. Shang, H. K. Huang, H. B. Zhu, Y. M. Lin, Y. L. Qu, and Z. H. Wang, “A novel feature selection algorithm for text categorization,” Expert Systems with Applications, vol. 33, 2007, pp. 1-5.
[15] X. B. Xue and Z. H. Zhou, “Distributional features for text categorization,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 3, 2009, pp. 428-442.
[16] M. Rogati and Y. M. Yang, “High-performing feature selection for text classification,” in Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM’02), 2002, pp. 659-661.