An Analysis of a Degree-Course Examination for Non-English-Major Graduate Students Using Classical Test Theory and Many-Facet Rasch Measurement Theory
A Study of Essay Rating in the Preparatory Chinese Examination for International Students in China: Based on Generalizability Theory and the Many-Facet Rasch Model
Kong Fuyu (孔傅钰)
Journal: Examinations Research (《考试研究》)
Year (volume), issue: 2022, 18(4)
Abstract: To examine the rating reliability of the essay section of the unified comprehensive Chinese examination given at the end of preparatory Chinese programs for international students in China, this study applied generalizability theory and the many-facet Rasch model to the scores that five raters assigned to 120 operational essay samples.
The generalizability analysis shows that examinee ability is the largest source of total score variance; a single rater already yields an acceptable generalizability coefficient, and adding a second rater produces the largest single gain in the reliability coefficient, so the current double-rating arrangement should be retained.
The many-facet Rasch analysis shows that the rating scale can, on the whole, separate examinees by ability; that raters differ significantly in severity, with a tendency to be harsher toward high-ability examinees and more lenient toward low-ability ones; and that a few raters are poorly self-consistent.
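The decision-study reasoning summarized in the abstract (a single rater already acceptable, the largest reliability gain coming from a second rater) can be illustrated with a short sketch of a one-facet person-by-rater design. The variance components below are invented for illustration only; they are not the components estimated in the study.

```python
# Illustrative D-study for a crossed person x rater (p x r) design.
# Variance components are hypothetical, not taken from the study.
var_person = 1.00      # sigma^2(p): examinee ability variance
var_rater = 0.05       # sigma^2(r): rater severity variance
var_residual = 0.60    # sigma^2(pr,e): person-by-rater interaction plus residual

for n_raters in (1, 2, 3, 4):
    # Relative (generalizability) coefficient for norm-referenced decisions.
    g_coef = var_person / (var_person + var_residual / n_raters)
    # Absolute (dependability) coefficient for criterion-referenced decisions.
    phi = var_person / (var_person + (var_rater + var_residual) / n_raters)
    print(f"{n_raters} rater(s): E(rho^2) = {g_coef:.3f}, phi = {phi:.3f}")
```

With these made-up components the relative coefficient rises from about 0.63 with one rater to about 0.77 with two, and only marginally thereafter, which is the shape of result the abstract describes.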
Pages: 11 (pp. 41-51)
Author: Kong Fuyu (孔傅钰)
Affiliation: Institute of International Student Education Policy and Evaluation (国际学生教育政策与评价研究院), Beijing Language and Culture University
Language of text: Chinese
CLC classification: G424.74
Southwest University, School of Foreign Languages: course examination in "Educational Measurement and Evaluation (Subject English)" (Master of Education program, 2014)
I. Short-answer questions (50 points)
1. What are the basic elements of measurement? (1 point) Answer: the measuring instrument, the unit of measurement, and the reference point are the three basic elements of measurement.
2. What is educational measurement? (1 point) Answer: educational measurement is the process of determining and describing, chiefly in quantitative terms, students' development in various respects under the influence of school education.
3. What types of scales are used in educational measurement? (1 point) Answer: nominal, ordinal, interval, and ratio scales.
4. What is educational evaluation? (1 point) Answer: educational evaluation is the systematic investigation of educational objects or phenomena against given standards and, on the basis of sufficient factual material (qualitative and quantitative), the making of value analyses and value judgments.
5. How do educational measurement and educational evaluation differ? (1 point) Answer: educational measurement is a factual judgment characterized mainly by quantification, whereas educational evaluation is a systematic investigation against given standards that ends in value analysis and value judgment.
The defining feature of educational evaluation is therefore the value judgment; educational measurement, by contrast, is complete once it has produced a quantitative description and need not issue in a value judgment.
6. How are educational evaluation and educational assessment alike and different? (1 point) Answer: they are near-synonyms; their content overlaps in part and differs in part.
Research on Applications of the Rasch Model in Chinese Education
Overview of the Rasch model. The Rasch model is a measurement model proposed by the Danish statistician Georg Rasch in the 1950s for assessing a particular trait or ability of a person.
Its core idea is to model the probability of an individual's response given his or her standing on the trait, rather than simply to sum raw scores or grades.
The model separates item difficulty from person ability and expresses both on a common scale, so that items and examinees can be measured and compared directly.
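As a concrete sketch of the response-probability idea described above: under the dichotomous Rasch model the probability of a correct answer is a logistic function of the difference between person ability and item difficulty, both expressed in logits. The function below is generic and the example values are arbitrary.

```python
import math

def rasch_probability(theta, b):
    """P(correct response) under the dichotomous Rasch model,
    for person ability theta and item difficulty b, both in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(rasch_probability(0.0, 0.0))   # 0.5 when ability equals difficulty
print(rasch_probability(1.0, -0.5))  # ~0.82 for an able person on an easy item
print(rasch_probability(-1.0, 0.5))  # ~0.18 for a weak person on a hard item
```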
Current applications of the Rasch model in China. Chinese scholars began to study and apply the Rasch model in the 1980s.
As education reform has deepened, the model's applications in China have broadened steadily, mainly in educational evaluation, ability testing, and research on student development.
Many institutions of higher education, in China and abroad, use the Rasch model in examinations and admissions to estimate and rank candidates' abilities.
Educational evaluation. Educational evaluation is one of the model's principal fields of application.
Educational evaluation refers to the comprehensive appraisal of schools, educational institutions, teachers, and students, aimed at improving teaching quality, raising learning outcomes, and advancing educational development.
By measuring students' knowledge, skills, and achievement, the Rasch model can supply objective and accurate data for such evaluation.
In recent years, many studies have indicated that Rasch-based evaluation makes the results less dependent on the particular group of test takers and improves the reliability and validity of the evaluation, thereby supporting better assessment and promotion of student learning.
Ability testing. In Chinese education, ability testing is another important area of application for the Rasch model.
As selective school admission and professional qualification examinations have become widespread, accurate measurement and evaluation of candidates' abilities have become increasingly important.
By jointly calibrating test items and examinee ability, the Rasch model yields more accurate estimates of ability levels and permits objective comparisons.
Unlike ranking by raw scores, it adjusts for differences in test difficulty, so the resulting ability estimates are more objective and fair.
Research on student development. Beyond educational evaluation and ability testing, the Rasch model is also widely used in research on student development in China.
Student development research tracks and assesses students' learning processes and academic attainment in order to understand their developmental trajectories and to support individualized learning and growth.
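To make the idea of comparing examinees on a common logit scale concrete, the sketch below estimates one person's ability from a dichotomous response pattern, treating a set of item difficulties as already known. It is a minimal Newton-Raphson illustration with made-up difficulties, not the joint calibration a program such as WINSTEPS or FACETS performs, and it breaks down for all-correct or all-incorrect patterns.

```python
import math

def estimate_ability(responses, difficulties, iterations=25):
    """Maximum-likelihood Rasch ability estimate (in logits) for a mixed
    pattern of 0/1 responses, given known item difficulties in logits."""
    theta = 0.0
    for _ in range(iterations):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information
    return theta

# Hypothetical six-item test with difficulties spread along the scale.
item_difficulties = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.6]
print(round(estimate_ability([1, 1, 1, 1, 0, 0], item_difficulties), 2))
print(round(estimate_ability([1, 1, 0, 0, 0, 0], item_difficulties), 2))
```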
[Bilin Academy (比邻学堂)] Psychometrics Exercises — Chapter 2: Classical Test Theory (Chapter Practice)
I. True or false
1. Systematic error is stable, and therefore some systematic errors can be avoided.
2. Random error is difficult to control and cannot be avoided in measurement.
3. CTT assumes that the true score is constant, so the task of measurement is to estimate the true score and, by improving the measuring instrument and other means, to make the observed score equal the true score.
4. The true score includes both random error and systematic error.
5. If the trait measured is unidimensional, homogeneity reliability will be high; conversely, high homogeneity reliability implies that the trait measured is unidimensional.
II. Fill in the blanks
13. … 2.2 points (the item is worth 5 points in total); the discrimination index of this item is therefore approximately __________.
14. The quantitative relationship between the reliability coefficient and the validity coefficient can be expressed as __________.
15. If the reliability of a measure approaches 1, the standard error of measurement of that measure approaches __________.
16. The mathematical expression defining validity is __________.
17. If the pass rate of an item is 0.5, the maximum possible value of the discrimination index D is __________.
18. If the reliability coefficient of a test is 0.9, then the proportion of score variance attributable to true scores is __________ and the proportion attributable to error scores is __________.
III. Multiple choice (one answer)
1. The error caused by uncontrollable chance factors in the measurement process is called A. systematic error B. constant error C. measurement error D. random error
2. In an examination, the teacher carelessly supplied a wrong answer key for one item; the resulting error in students' scores is A. systematic error B. constant error C. measurement error D. random error
3. Which of the following is NOT among the assumptions of the CTT mathematical model? A. The correlation between true scores (T) and error scores (E) is zero B. The error scores (E) on parallel tests are uncorrelated C. The correlation between observed scores (X) and error scores (E) is zero D. If a person's trait is measured repeatedly with parallel tests, the mean of the error scores (E) approaches zero
4. Under the classical true-score model, in the formula , the variance attributable to systematic error is A. B. C. D.
5. The view that Thorndike advanced for classical measurement theory is A. "Psychological traits exist objectively" B. "Whatever exists at all exists in some amount" C. "Anything that exists in some amount can be measured" D. "Psychological traits can be measured"
6. The property describing the stability of a test, or the consistency of its results across repeated measurements, is A. reliability B. standardization C. validity D. norms
7. Regarding the various reliability coefficients and their principal sources of error variance, which description is INCORRECT? A. The main source of error variance for test-retest reliability is time sampling B. The main source for split-half reliability is content sampling C. The main source for homogeneity reliability is content heterogeneity D. The main source for test-retest alternate-forms reliability is differences among raters
8. Which statement about test-retest reliability is WRONG? A. It requires that the trait measured be stable B. An appropriate interval should separate the two administrations C. It indicates whether test results change over time D. It is suitable for traits that change over time
9. If the corrected (Spearman-Brown) split-half reliability is 0.89, the reliability coefficient of one half of the test is A. 0.84 B. 0.94 C. 0.80 D. 1.06
10. For a creativity test or a projective test, the most appropriate method of estimating reliability is A. inter-rater reliability B. homogeneity reliability C. split-half reliability D. test-retest reliability
11. Three teachers, 迷死他赵, 凉音 and 眼泪, scored the papers of two students, 君君侠 and 度, as shown below; find the reliability of the scores.
Rater         君君侠   度
迷死他赵        1       2
凉音            2       1
眼泪            2       1
A. 1/9 B. 1/16 C. 3 D. 27/16
12. Homogeneity reliability mainly represents A. the consistency between the two halves of a test B. the consistency among all items C. the consistency between all items and the subtests D. the consistency among subtests
13. After a test has been administered, it is divided into two equivalent halves according to some criterion and the correlation between the halves is computed.
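Several of the items above rest on standard CTT identities: the decomposition of observed-score variance into true-score and error variance, the Spearman-Brown formula, and the standard error of measurement. The sketch below checks them numerically; the simulated scores and the standard deviation of 10 are invented for illustration.

```python
import random

# Observed score = true score + random error; reliability = var(T) / var(X).
random.seed(0)
true_scores = [random.gauss(70, 10) for _ in range(5000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

print(round(variance(true_scores) / variance(observed), 2))  # close to 100/125 = 0.80

# Item 9: half-test reliability implied by a Spearman-Brown corrected value of 0.89.
r_full = 0.89
print(round(r_full / (2 - r_full), 2))                       # about 0.80

# Item 15: the standard error of measurement SD * sqrt(1 - r) shrinks to 0 as r -> 1.
for r in (0.80, 0.90, 0.99, 1.00):
    print(r, round(10 * (1 - r) ** 0.5, 2))

# Item 18: with reliability 0.90, 90% of observed variance is true-score variance, 10% error.
print(0.90, round(1 - 0.90, 2))
```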
Psychological and Educational Measurement (Dai Haiqi 戴海琦, 3rd edition): Key Review Questions and Answers
Chapter 1: Overview of Psychological Measurement
1. The meaning and characteristics of psychological measurement; the meaning, elements, and scales of measurement.
(1) Meaning: psychological measurement is the process of describing a person's psychological traits quantitatively according to certain rules.
(2) Characteristics. Indirectness: unlike direct physical measurement, psychological attributes are inferred indirectly from observable behavior. Relativity: results are interpreted relative to the group to which the person belongs.
Objectivity: that is, standardization of the test, which is a basic requirement of all measurement.
(3) What is measurement: measurement is the assignment of numerals to objects or events according to rules (S. S. Stevens). Objects: what is measured; in psychological measurement these are mental abilities and personality characteristics. Numerals: quantities representing an object or one of its attributes. Rules: the rules and methods by which the measurement is carried out.
(4) Elements of measurement. Reference point: ① the fixed origin from which the quantity of the measured object is counted; ② an absolute reference point uses an absolute zero as the starting point, as with length or height; ③ a relative reference point uses a relative zero, as with temperature (the freezing point of water) or altitude (sea level). Unit: ① an ideal unit must have a definite meaning that admits no alternative interpretations; ② it must also have equal value, so that the distance between any two adjacent unit points is the same.
(5) Scales of measurement. A scale is any graded series of values or quantities by which things can be quantified.
① Nominal scale: numbers merely label categories and carry no quantitative meaning, so values cannot be compared (e.g., male/female). ② Ordinal scale: values can be ordered, but there is no common unit and no zero point, so they cannot be added or subtracted (e.g., rank order). ③ Interval scale: values can be ordered and share a common unit but there is no absolute zero; addition and subtraction are meaningful, multiplication and division are not; many statistics apply, such as the mean and standard deviation (e.g., temperature). ④ Ratio scale: the ideal scale, with equal units and an absolute zero, so ratios are meaningful (e.g., age).
2. What is a psychological test, and how should it be understood? Types of psychological tests; functions of psychological measurement.
(1) A psychological test is, in essence, an objective and standardized measurement of a sample of behavior.
① Behavior sample: a representative set of items. ② Standardization: uniformity in the construction, administration, scoring, and interpretation of the test.
③ Objective determination of difficulty: standardization reduces arbitrariness on the part of examiner and examinee during construction and administration, and the difficulty level of the test should be fixed objectively. ④ Reliability: the consistency of test results. ⑤ Validity: the accuracy and correctness of test results.
Many-Facet Rasch Model Theory and Its Application in Structured Interviews
To address the various sources of error that threaten interview validity, this paper introduces a relatively new approach to processing interview results: the many-facet Rasch model.
Applying the model to structured interviews not only supports valid measurement of candidates' ability but also offers new ways to identify problematic panel raters, refine scoring rules, and equate interview forms.
Building on a review of research on the reliability and validity of structured interviews, the paper presents the theory of the many-facet Rasch model and a framework for applying it to structured interviews.
Keywords: structured interview; MFRM; item response theory
1 Overview of interview research
In recent years, personnel assessment has played an increasingly important role in recruitment, and its scientific rigor and practical value have gained growing recognition.
The main techniques include biodata analysis, psychological testing, situational simulations (such as leaderless group discussions and in-basket exercises), and structured interviews.
Among these methods, the structured interview is used in almost every recruitment context and has become one of the most widely used selection methods [1].
Since R. Wagner (1949), many meta-analyses of the interview have appeared: Mayfield (1964), Ulrich & Trumbo (1965), Wright (1969), Schmitt (1976), Arvey & J. Campion (1982), Wiesner & Cronshaw (1988), Harris (1989), Huffcutt & Arthur (1994), McDaniel, Whetzel, Schmidt & Maurer (1994), Schmidt & Rader (1999), and Salgado & Moscoso (2002).
The questions examined range from the validity of the interview and the effectiveness of different interview questions to the construct validity of what interviews measure and the incremental validity of interviews over cognitive ability tests [2,3].
Across this body of research, low inter-rater agreement has long been regarded as a major weakness of the employment interview.
From Wagner onward, reviews of interview research (e.g., Arvey & Campion, 1982; Schmitt, 1976; Schneider & Schmitt, 1986; Salgado & Moscoso, 2002) have repeatedly noted the following rating errors: contrast effects [4], similar-to-me effects, first-impression error [4,5], halo effects, primacy and recency effects [4,5], interviewer stereotypes, order effects, raters' personal feelings toward candidates, information preferences, and so on.
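The model this excerpt refers to, and which the paper below applies, is commonly written in Linacre's rating-scale formulation as follows; this is the standard textbook form rather than an equation quoted from either source:

$$\log\frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k$$

where θ_n is the ability of examinee n, δ_i the difficulty of task or item i, α_j the severity of rater j, τ_k the step difficulty of awarding category k rather than k-1, and P_nijk the probability that rater j gives examinee n category k on task i. A bias (interaction) analysis adds a term for a particular pairing of elements, for example an examinee-by-task term φ_ni, and asks whether its estimate departs significantly from zero.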
An Application of Classical T est Theory and Many-facet Rasch Measurement in Analyzing the Reliability of an English T est for Non-English Major Graduates S UN HaiyangBeijing Foreign Studies UniversityAbstractTaking classical test theory (CTT) and many-facet Rasch measurement (MFRM) model as its theoretical basis, this study investigates the reliability of an English test for non-English major graduates by using SPSS and FACETS. The results of the CTT reliability study show that the candidates’ scores of the objective test were not significantly correlated with their scores of the subjective tasks, and the internal consistency of the three subjective tasks was not satisfying, either. The results of the MFRM analysis indicate that it was the two raters’ severity difference in their rating, the varying difficulty levels of the test tasks, and the bias interaction between some students and certain tasks that caused most of the variance in the scores. This demonstrates the necessity of training raters to be not only self-consistent but also coherent and consistent with each other. It also requires systematic study of measurement theory as well as test item writing or test task design techniques on the part of the teachers as item writers or task designers. In addition, it calls upon English teachers’ attention to enhancing students’comprehensive language skills.Key words: classical test theory; many-facet Rasch measurement; reliability; bias analysis 1. IntroductionThe Ministry of Education initiated Non-English Major Graduate Student English Qualifying Test (GET1) from 1999 and cancelled it in 2005. Graduate schools of many universities in China took scores from this test as reference to determine whether a student’s English proficiency was good enough to deserve a master’s degree. After the cancellation of the test, graduate schools of most universities began to organize their ownAn Application of Classical Test Theory and Many-facet Rasch Measurement…English teachers to write test items by following the test framework of the Education Ministry. Due to their small-scale and informal status, these tests have rarely been analyzed for validity and reliability. The present study was carried out to fill in the gap. Based on classical test theory (CTT) and many-facet Rasch measurement (MFRM) model, this research aims to use FACETS and SPSS to analyze the reliability of a test constructed by English teachers of a key university in Hebei province.1.1 Classical test theoryClassical test theory provides several ways of estimating reliability mainly by distinguishing true scores from error scores. The true score of a person can be obtained by taking the average of the scores that the person would get on the same test if he or she took the test an infinite number of times. Because it is impossible to obtain an infinite number of test scores, true score is hypothetical (Kline, 2005). Sources of error scores might be random sampling error, internal inconsistencies among items or tasks within the test, inconsistencies over time, inconsistencies across different forms of the test, or inconsistencies within and across raters. Under CTT, reliability can be estimated by calculating the correlation between two sets of scores, or by calculating Cronbach’s alpha, which is based on the variance of different sets of scores (Bachman, 1990). The higher the value of Cronbach’s alpha is, the better the consistency level of the test will be. 
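For illustration, the CTT estimates described above can be computed with a few lines of NumPy/SciPy; the scores below are randomly generated placeholders that only mimic the study's layout (56 examinees, 10-point task ratings), not its actual data.

```python
import numpy as np
from scipy.stats import spearmanr

def cronbach_alpha(score_matrix):
    """Cronbach's alpha for an examinees-by-tasks (or items) score matrix."""
    x = np.asarray(score_matrix, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
tasks = rng.integers(1, 11, size=(56, 3))   # hypothetical ratings on three subjective tasks
print(round(cronbach_alpha(tasks), 3))

# Hypothetical double rating of one task by two raters; Spearman's rho as inter-rater agreement.
rater1 = rng.integers(1, 11, size=56)
rater2 = np.clip(rater1 + rng.integers(-1, 2, size=56), 1, 10)
rho, p_value = spearmanr(rater1, rater2)
print(round(rho, 3), round(p_value, 4))
```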
As a rule of thumb, under CTT the internal consistency reliability is usually measured by calculating Cronbach’s alpha, while the inter-rater reliability is assessed by calculating Cohen’s kappa if the data is interval scale or Spearman correlation coefficient if the data is rank ordered scale.CTT estimates of reliability are useful in detecting the general quality of the test scores in question. However, these estimates have several limitations. Firstly, each CTT estimate can only address one source of measurement error at a time, and thus cannot provide information about the effects of multiple sources of error and how these differ. Secondly, CTT treats all errors to be random or “unidimensional” (see Baker, 1997), and thus CTT reliability estimates do not distinguish systematic measurement error from random measurement error. Lastly, CTT has a single estimate of standard error of measurement for all candidates (Weir, 2005). These limitations of CTT are addressed by item response theory (IRT).1.2 Item response theory and many-facet Rasch measurementSince the 1960s there has been a growing interest in item response theory, a term which includes a range of probabilistic models that allow us to describe the relationship between a test taker’s ability level and the probability of his or her correct response to any individual item (Lord, 1980; Shultz & Whitney, 2005). Early IRT models were developed to examine dichotomous data2. By the 1980s, IRT models were being developed to examine polytomous data. IRT has found tremendous use in computer adaptive testing and in designing and developing performance tests.Three currently used IRT models are respectively one-parameter logistic, two-parameter logistic and three-parameter logistic model (The one-parameter model is alsoS UN Haiyangreferred to as the Rasch model3). All three models have an item difficulty parameter b, which is the point of inflection of the ability θscale. Both b and θ are scaled using a distribution with a mean of 0 and a standard deviation of 1.0. Therefore, items with higher b values are “more difficult”, the respondent must have a higher level of θ to pass or endorse them. The three- and two-parameter models also have a discrimination parameter a, which allows items to differentially discriminate among examinees. T echnically a is defined as the slope of the item characteristic curve4 at the point of inflection (see Baker, 1985: 21). The three-parameter model also has a lower asymptote parameter c, which is sometimes referred to as pseudochance5 (see Harris, 1989). This parameter allows for examinees, even ones with low ability, to have perhaps substantial probability of correctly answering even moderate or hard items. Theoretically c ranges from 0 to 1.0, but is typically lower than 0.3.IRT rests on the premise that a test taker’s performance on a given item is determined by two factors: one is the test taker’s level of ability; the other is the characteristics of the item. It assumes that an observed score is indicative of a person’s ability (Fulcher & Davidson, 2007). All models under IRT assume that when their ability corresponds to item difficulty, the probability of a test taker getting the item correct will be 0.5. On a scale, therefore, some items would have lower values (be easier), and the probability of more able students passing the item would be very high. 
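In symbols, the three logistic models just described take the following standard forms (the parameter letters match those in the text; the scaling constant D ≈ 1.7 that some authors insert in the exponent is omitted here):

$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}} \qquad \text{(three-parameter model)}$$

Setting c_i = 0 gives the two-parameter model, and additionally fixing a_i = 1 gives the one-parameter (Rasch) model:

$$P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}} \qquad \text{(one-parameter / Rasch model)}$$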
As the difficulty value of the item rises, a test taker must be more able to ensure a higher probability of getting the item correct.Many-facet Rasch measurement (Linacre 1989) is an extension of one-parameter Rasch (Rasch, 1980) model, which is a special case of IRT model, namely one-parameter logistic model. It enables us to include multiple aspects, or facets, of the measurement procedure in the test results analysis. A facet of measurement is an aspect of the measurement procedure which the test developer believes may affect test scores and hence needs to be investigated as part of the test development. Examples of facets are task or item difficulty, rater severity, rating condition, etc. All estimates of the facets are expressed on a common measurement scale, which expresses the relative status of elements within a facet, together with interactions between the various facets, as probabilities; the units of probability used on the scale are known as logits6. This analysis allows us to compensate for differences across the facets.MFRM also provides information about how well the performance of each individual person, rater, or task matches the expected values predicted by the model generated in the analysis (Sudweeks et al, 2005). MFRM allows us to identify particular elements within a facet that are problematic, or “misfitting”7, which may be a rater who is unsystematically inconsistent in his or her ratings, a task that is unsystematically difficult, or a person whose responses appear inconsistent. These “fit statistics” are reflected by Infit and Outfit Mean Square in MFRM analysis. According to many-facet Rasch model, the Infit and Outfit have an expected value of 1.0, with a standard error of 0. Many researchers (e.g. Lunz & Stahl, 1990; Wright & Linacre, 1994; etc.) hold that a reasonable range of Infit and Outfit values is between 0.5 and 1.5.The MFRM analysis also reports the reliability of separation index and the separation ratio. These two statistics describe the amount of variability in the measures estimated by the MFRM model for the various elements in the specified facet relative to the precisionAn Application of Classical Test Theory and Many-facet Rasch Measurement…by which those measures are estimated. The reliability of separation index for each facet is a proportion ranging between 0 and 1.0, while the separation ratio ranges from 1.0 to infinity. However, the interpretation of these two statistics is different for various facets. For the person facet, low values in either of these two statistics may be indicative of central tendency error in the ratings, meaning that the raters were unable to distinguish the performance of the test takers (Myford & Wolfe, 2003, 2004). Low values of these two statistics for facets (e. g. rater, task, etc.) other than persons indicate a high degree of consistency in the measures for various elements of that facet.In addition, MFRM model allows for analysis of “bias”, or instances of interactions between the facets. Through bias analysis, we can identify specific combinations of facet elements, such as particular rater by person or person by task combinations. Bias analysis can indicate whether one rater tended to rate an individual differently than all others, or whether a person systematically performed differently on one task than he or she did on another. Each interaction between the facets specified in the analysis is given a bias score on the logit scale, and its significance is designated by a standard z-score. 
The z-score with an absolute value equal to or greater than 2.0 indicates significant interactions.1.3 Previous studiesAlthough item response theory and related models have been applied and studied for over 40 years, classical test theory and related models have been researched and applied continuously and successfully for well over 80 years because of its simpler mathematical computation and straightforwardness. Hambleton & Jones (1993) described the theoretical possibility of combining the two measurement frameworks in test development. Bechger et al. (2003) discussed at length the use of CTT in combination with IRT in item selection process of a real high-stakes test development. They suggested the use of CTT reliability estimation when an appropriate IRT model was found in test construction process. Fan (1998) did an empirical study to compare the item and person statistics from the two measurement frameworks, and he found that the results of the two were quite comparable.Most of the reliability indices in second language performance test were still estimated through Cronbach’s alpha or correlation coefficients (e.g. Saif, 2002; Shameem, 1998; Shohamy et al., 1993; Weir & Wu, 2006; etc.) under CTT framework, whereas the issue of intra-rater variability and task variability has been extensively studied by employing MFRM analysis (e.g. Kondo-Brown, 2002; Lumley & McNamara, 1995; McNamara, 1996; Weigle, 1998; Weir & Wu, 2006; etc.). However, few if any empirical studies in language testing field address the issue of comparing CTT and MFRM reliability analysis. This study was carried out to compare the results of the reliability study by the two frameworks.1.4 The present studyStandardized test developers usually take rigorous steps to design, administer, and analyze their tests. However, small-scale nonstandardized tests have rarely been studied for their validity and reliability. The test items of these tests were usually written by the courseS UN Haiyangteachers relying merely on their intuition. Many teachers know little about the test theories and they scarcely consider whether their test items or tasks are valid enough to measure what they intend to test and whether the scores from the test paper are reliable enough to evaluate a person’s language proficiency.Based on CTT and MFRM, the present study was carried out to analyze the reliability of a nonstandardized test, a qualifying English test for non-English major graduates, by employing SPSS and FACETS. The second purpose of this study is to demonstrate the complementary roles CTT and MFRM play in test score analysis.2. Methodology2.1 ParticipantsFifty-six students of one normal class, with fourteen boys and forty-two girls, took the test. They are all non-English major graduate students. Some of them major in information technology, some are economics majors, and others major in engineering. Their ages vary from twenty-three to thirties. No one is above thirty-five years old. At the time of the test they had finished their first year of graduate study. One year before the test all of them took the national entrance examinations for graduates, and their English scores were above 55 out of a total score of 100.2.2 The instrumentThe test items used in this study were developed by English teachers at a national key university in Hebei province. The test consists of five parts, including listening comprehension, vocabulary, reading comprehension, translation and writing. 
In these five parts, listening comprehension, vocabulary and reading comprehension all take the format of multiple choice questions and are objective in nature, whereas translation, which is composed by English-to-Chinese (E-to-C) and Chinese-to-English (C-to-E) translation, and writing are to be evaluated subjectively. The weighted scores of the five parts are respectively 20, 20, 30, 20 (E-to-C and C-to-E translation each takes up 10), and 10, which add up to a total score of 100.The five parts were developed separately by five groups, with each group being made up of two teachers. Some of the items (such as those in the vocabulary part) were adapted from the exercises in the textbook, and others were created by the teachers who assumed that the test takers were at the higher-intermediate level of English proficiency.Listening comprehension consists of ten questions based on ten conversations and ten items based on three passages. The vocabulary part includes ten questions measuring the synonyms of the underlined words or phrases, and ten items testing sentence comprehension. The reading comprehension encompasses five passages with three or more questions following each, totaling up to 30 multiple choice questions. In this part, three reading texts, covering different aspects of daily life such as reflections on social problems, news reports, etc., are familiar to the students, and two texts on science reports are unfamiliar. A paragraph of about 200 English words and a paragraph of approximately 100 Chinese characters areAn Application of Classical Test Theory and Many-facet Rasch Measurement…given as the translated materials. The writing is controlled with a given title. Students are expected to write an argumentative composition with no less than 150 words.2.3 Data collectionThe participants were asked to finish the test paper within two and a half hours, and were asked to write their answers to the objective test on the answer cards, which were marked by the computer, and to write the translations and the compositions on the answer sheets, which were evaluated by two raters independently later on.The objective part has a total score of 70. Because the computer produced only the total score of the objective part, the final analysis did not address the internal consistency of the three components of the objective part.All three tasks in the subjective test were rated on a ten-point holistic rating scale. What should be noted is that the researcher intentionally treated E-to-C and C-to-E translation as two different tasks, making the number of the subjective tasks three in total. The reason for doing this is that the more tasks there were the more significant and reliable the MFRM analysis would be. A rating of 6 was considered to be the cut-off score, indicating the minimum required competence. Scoring categories above or below 6 were grouped by two or three, with 1 to 3 signifying little or no success, 4 to 5 inadequate, 7 to 8 adequate, and 9 to 10 excellent. Two raters were introduced to the detailed rubric of rating and practiced rating several papers of varying qualities before the scoring began.2.4 Data analysisThe CTT inter-rater reliability was estimated by calculating Spearman’s rho (ρ) of the two raters’ rating of each task, and the internal consistency reliability was assessed by calculating Cronbach’s alpha (α) of the tasks in question. 
The reason for using Spearman rank correlation to estimate the inter-rater reliability is that the data were non-interval in a strict sense and it was inappropriate to use Cohen’s kappa or Pearson’s correlation. SPSS version 15.0 was used to do the analysis.The MFRM analysis of the three subjective tests scores was completed using Minifac, student version of FACETS. In the present study, three facets were analyzed, including persons, raters, and tasks. The three facets had fifty-six (the number of the test takers), two (the number of raters), and three (the number of tasks) elements in them respectively.Bias analyses, which are also called interaction analyses, were performed for all two-way interactions between the facets. Thus, the final output of MFRM analysis will report ability measures and fit statistics for persons; difficulty estimates and fit statistics for tasks; severity estimates and fit statistics for raters; and separation ratio and reliability index for each facet; bias analyses for rater by person, rater by task, and person by task interactions.3. Results3.1 CTT reliability studyThe results of the reliability coefficients for raters and tasks are summarized in Table 1.S UN HaiyangAs shown in Table 1, the inter-rater reliability for all three subjective tasks was moderately high. The Spearman correlation coefficient rhos for the three tasks were respectively 0.853, 0.774 and 0.678, which were all significant at p < .000, indicating a high level of consistency between the two raters. However, the overall low alpha values of the inter-task correlations indicate that the internal consistency of the test tasks was not good. The alpha value of the objective and three subjective tasks was 0.201, suggesting the candidates’scores varied significantly over the objective and the subjective parts of the test. The Cronbach’s alpha value of the three subjective tasks was 0.366, indicating a low consistency among the subjective tasks. Table 1 also reveals that the alpha value of the two translation tasks was much higher than other alphas (α = 0.712), suggesting the students’ scores over these two translation tasks were adequately consistent.T able 1. Reliability coefficients for raters and tasksReliability StatisticsInter-rater C-to-E translationρ= 0.853, p < .000 (No. of raters = 2) E-to-C translationρ= 0.774, p < .000 (No. of raters = 2) Writingρ= 0.678, p < .000 (No. of raters = 2)Inter-task Three subjective tasksα= 0.366 (No. of tasks = 3) Two translation tasksα= 0.712 (No. of tasks = 2) Objective to subjective tasksα= 0.201 (No. of tasks = 4)Under CTT, the intra-rater reliability cannot be obtained for a single rating. In order to know the intra-rater consistency, researchers have to ask raters to rate the same test twice with some intervals in between, this might cause some unwanted errors in the rating. However, MFRM analysis can estimate the intra-rater consistency without double ratings. This is done by assessing the Infit or Outfit Mean Squares of each rater. The misfitting rater identified according to the Infit Mean Squares is the one who is not self-consistent.As was mentioned earlier, the variance of the scores might be caused by inconsistent rating, inconsistent test items or the test taker’s inconsistent performance over different test tasks. A good test should minimize the influence of inconsistent test items and raters. The results of CTT reliability analysis give us only a general picture of the internal consistency of the test and raters. 
As for what caused the inconsistency, we expect FACETS analysis would give us the answer.3.2 MFRM analysis3.2.1 PersonsTable 2 provides a summary of selected statistics on the ability scale for the 56 test candidates. The mean ability of examinees was 1.31 logits, with a standard deviation of 0.82. The range was from –0.13 to 3.79 logits. The person separation reliability index (the proportion of the observed variance in measurements of ability which is not due to measurement error) was 0.77, which suggests that central tendency error was not a bigAn Application of Classical Test Theory and Many-facet Rasch Measurement…problem in the ratings of the examinees (Myford & Wolfe, 2004) and the analysis was moderately reliable to separate examinees into different levels of ability. The separation index 1.82 indicates that the dispersion of language ability estimates was 1.82 times greater than the precision of those measures. The chi-square of 208.5 was significant at p < .00, therefore, the null hypothesis that all students were equally able must be rejected.T able 2. Summary of statistics on examinee facet (N = 56)Mean ability 1.31Standard deviation0.82Root Mean Square standard error0.40Separation index 1.82Separation reliability index0.77Fixed (all same) chi-square208.5 (df = 55, p < .00)In order to identify students who exhibited unusual ability variances among the three tasks, fit statistics were examined. As was noted in 1.3, although there are no hard-and-fast rules for determining what degree of fit is acceptable, many researchers believe that the lower and upper limits of fit values are 0.5 and 1.5 respectively for mean squares to be useful for practical purposes. Fit statistics 1.5 or greater indicate too much unpredictability in the examinee’s scores, while fit statistics of 0.5 or less indicate overfit, or not enough variation in scores. Linacare suggested that “values less than 1.5 are productive of measurement. Between 1.5 and 2.0 are not productive but not deleterious. Above 2.0 are distorting” (cited in Myfold & Wolfe, 2003). Based on this rule, the Infit Mean Squares of 5 out of the 56 candidates (representing 8.9%) in this study were found to be above 2.0, and thus misfitting the model. In other words, these five students’ performance showed significant variability from the expected model. Besides, 20 out of 56 (35.7%) examinees were found to have an Infit Mean Square value lower than 0.5, meaning there was less variation in their scores. The lack of variation in the scores of a large percentage of students might be attributed to the small number of the tasks in question. The number of misfitting examinees is a problem, given that Pollitt & Hutchinson (1987) point out that we would normally expect around 2% of misfitting examinees. This would suggest revisions in the test structure by deleting or modifying misfitting tasks or training the raters if there exist some misfitting tasks, misfitting raters or bias interaction between the facets.3.2.2 T asksThe results for the task facet analysis are presented in Table 3. The task fit statistics document whether the tasks were graded in a consistent manner among the raters. The Infit Mean Squares of the three tasks were respectively 0.85, 1.16, and 1.12, which were all within the acceptable range, suggesting the tasks were graded in a consistent manner. 
This implies that the raters gave more difficult tasks lower ratings than easier tasks in a consistent manner.S UN HaiyangAs mentioned earlier, low values of reliability statistics for the task facet indicate high degree of consistency or equal difficulty level among the tasks. In contrast, high values suggest inconsistency or separated difficulty levels. The logit scores of –0.57, –0.08 and 0.65 of the task facet in the current study, together with a very high separation ratio (6.01) and separation reliability coefficient (0.97), demonstrate that task difficulties were clearly and reliably separated along the continuum. That is to say, the three tasks were not equally challenging to the students, with writing as the most challenging and E-to-C translation as the least challenging. This separation was statistically significant (p < .00), with a chi-square of 117.9 and 2 degrees of freedom.T able 3. Results of task facet analysisTasks Measure logit Model error Infit Mean Square Difficulty level E-to-C translation–0.570.090.85The easiest C-to-E translation–0.080.08 1.16Less difficult Writing 0.650.07 1.12Most difficultroot mean square error = 0.08; adjusted SD = 0.49; separation = 6.01; reliability = 0.97; fixed (all same) chi square = 117.9, df = 2, p < .00As expected, writing was found to be the most difficult task, with a logit measure of 0.65. This might be ascribed to the fact that writing as productive language ability is much more difficult than any other skills. Compared to translation, which measures test takers’ability to find the corresponding expressions among two languages, writing assessment is definitely more challenging. Within the two translation tasks, it turned out that E-to-C translation (logit difficulty = –0.57) was easier than C-to-E translation (logit difficulty = –0.08). It is easier and more convenient for the students to get their Chinese translation sentences cohesively and coherently organized, even though they did not fully understand the English prompt passage. However, with limited proficiency in English, students might have trouble in finding the appropriate English counterparts of Chinese vocabulary and in structuring the sentences when completing C-to-E task.3.2.3 RatersThe results of the rater behavior analysis are displayed in Table 4. For raters, a small value of reliability index is desirable, since ideally different raters would be equally severe or lenient (McNamara, 1996; Myford & Wolfe, 2004; etc.). According to McNamara (1996), the label “reliability index” for this statistic is “a rather misleading term as it is not an indication of the extent of the agreement between raters but the extent to which they really differ in their levels of severity”(p. 140). In other words, the reliability of rater separation index indicates to what extent the raters are reliably different rather than to what degree they are reliably similar. In the present case, the reliability was 0.91 for the two raters, indicating the analysis was reliably separating the raters into different levels of severity. The rater separation ratio of 3.16 indicates that the differences between the severity/leniency estimates for the two raters were not likely to be due to sampling error because it was 3.16An Application of Classical Test Theory and Many-facet Rasch Measurement…times greater than the estimation error with which these measures were estimated. 
The chi-square of 22.0 (df = 1) was significant at p < .00, therefore, the null hypothesis that all rater were equally severe must be rejected. The range of severity difference was 0.44 logits. These indicators of the magnitude of severity differences between the two raters indicate that significant harshness did exist: rater 2 was harsher than rater 1 in their rating of the students’ translation and composition. This result is in contrast with the higher inter-rater consistency level by the CTT analysis.T able 4. Rater characteristicsRater number Severity logit Model error Infit Mean Square 1–0.220.7 1.192 0.220.60.96Infit Mean Square mean = 1.08, SD = 0.12; separation = 3.16; reliability = 0.91; fixed (all same) chi square = 22.0, df = 1, p < .00 There is another interpretation of Infit Mean Square, that is, to interpret the Infit Mean Square value against the mean and the standard deviation of the set of Infit Mean Square values for the facet concerned. A value greater than the mean plus twice the standard deviation would be considered as misfitting for these data (McNamara, 1996: 173). In this specific case, the Infit Mean Square mean was 1.08, with a standard deviation of 0.12, so a value greater than 1.32 would be misfitting. Since the Infit Mean Square values for the two raters were 1.19 and 0.96, both of which were smaller than 1.32, neither of the raters was misfitting. In other words, both raters were self-consistent in their own scoring.3.2.4 Bias analysisMFRM uses the term “bias” differently from its more familiar meaning in education and culture contexts. An MFRM bias analysis can allow us to identify patterns in relation to different interaction effects of the facets in the data matrix; these patterns suggest a consistent deviation from what we would expect. For the three facets involved in the model, three two-way combinations can be generated, namely person by task, rater by person, and rater by task.3.2.4.1 Rater by person biasA bias analysis was carried out for rater-person interaction. This identifies consistent subpatterns of ratings to help us to find out whether particular raters are behaving in similar ways for all test takers, or whether some examinees are receiving overgenerous or harsh treatment from given raters.Z-scores over an absolute value of 2.0 are held to demonstrate significant bias. In this data set, there were 112 rater-person interactions (2 raters ×56 examinees). Of these, no interaction showed significant bias. When there are interactions between particular raters and particular examinees, double ratings might be necessary.。