Effective and efficient document ranking without using a large lexicon
- Format: PDF
- Size: 138.57 KB
- Pages: 11
OGAWA Yasushi
ogawa@ic.rdc.ricoh.co.jp
Information and Communication R&D Center, RICOH Co., Ltd.
3-2-3 Shin-yokohama, Kouhoku-ku, Yokohama, 222 JAPAN

Abstract

Although a word-based method is commonly used in document retrieval, it cannot be directly applied to languages that have no obvious word separator. Given a lexicon, it is possible to identify words in documents, but a large lexicon is troublesome to maintain and makes retrieval systems large and complicated. This paper proposes an effective and efficient ranking method that does not use a large lexicon; words need not be identified during document registration, because a character-based signature file is used for the access structure. During document retrieval, a user request is statistically analyzed to generate an appropriate query, and the query is evaluated efficiently in a word-based manner using the character-based index. We also propose two optimizing techniques to accelerate retrieval.

1 Introduction

Best match retrieval, which ranks retrieved documents in order of their relevance to the user request, is more effective than conventional exact match retrieval [5][8][30]. Most retrieval models, such as the vector space and probabilistic models, compute the relevance values of documents by using three term frequencies: the term's document frequency, the number of documents containing the term; the in-query frequency, the number of times the term occurs in the query; and the in-document frequency, the number of times the term occurs in the target document.

As the size of document databases has increased, the efficient implementation of ranking models has been extensively studied [5][30]. Most implementations use

Footnotes: General issues concerning Japanese IR are summarized by Fujii [6]. N-gram-based indexing uses n-grams, overlapping series of successive characters, as indexing units. Because n-gram-based indexing is considered an extension of character-based indexing, we include, in
the following, the former in the latter.

[Figure 1: Overview of processing flow (registration and retrieval through the signature file).]

accesses by relaxing a condition that determines the top-ranked documents. The other accelerates retrieval by limiting the number of query terms used to determine ranking candidates.

This paper is organized as follows. The next section briefly describes the character-based signature file used in our system. Section 3 explains query generation without using a large lexicon. Sections 4 and 5 detail a query evaluation method and its optimization techniques. Section 6 presents the results of an experiment that evaluated our proposals with regard to retrieval effectiveness and efficiency.

2 Character-based Signature File

Character-based indexing is preferred in kanji-based Asian languages, since there is no need to identify words [6][16][28][29]. Because the in-document frequencies of characters are much larger than those of words, a character-based index tends to become large. A signature file is a kind of indexing method in which a document is described by a fixed-length bit string, or signature, which has less space overhead than an inverted file [5]. Thus character-based signature files are especially widely used in kanji-based languages [3][7][13][17].

In a character-based signature file, a signature is computed not from words but from characters. For example, a signature for the text "半導体の製造" (the production of semiconductors) is computed from all the characters included in the text, instead of from the words "半導体" (semiconductors), "の" (of), and "製造" (production), as shown in Figure 2.

[Figure 2: Character-based signature; a text is mapped directly to a signature bit string.]

The signature file used in our retrieval system has the following features:

- A document/query signature is obtained from the bitwise OR of signatures computed from single characters and from character pairs [20].
- To reduce false drops, we use more than one hash function, each of which is used for a certain character class [11] and makes the document counts among the signature bits nearly
the same [21].

- To attain faster retrieval, bitmap data is organized in a bit-sliced manner and each bit slice is compressed using the Exp-Golomb algorithm [11].

3 Query Generation

3.1 Processing Flow

Our retrieval system allows users to write a request in natural language. The main subject of query generation is how to identify appropriate words in a user request without a large lexicon.

Japanese has two word forms: simple words, and compound words composed of several simple words [6][20]. Japanese also has several classes of characters (kanji, katakana and hiragana), each of which has different functions and sometimes indicates word boundaries [6][24]. Some types of words can therefore be identified by using a closed lexicon that consists of a small number of functional words [12][24]. Component simple words in a compound word, however, generally cannot be identified without a large lexicon. Since the meaning of a compound word can be expressed using other phrases or passages made up of its component simple words, simple words need to be identified in retrieval. We have therefore developed a statistical method of segmenting compound words, which is explained below.

A user request is at first segmented into compound words by using a closed-lexicon parser called QJP [12].
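Returning to the signature file of Section 2: its construction can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the 256-bit width, the MD5-based hash, the use of two hash seeds, and all function names are assumptions (the paper additionally varies the hash function per character class and balances the bit document counts).

```python
import hashlib

SIG_BITS = 256  # assumed signature width; the paper does not fix one here

def _bit(token: str, seed: int) -> int:
    """Hash a token (a character or a character pair) to one bit position."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16) % SIG_BITS

def signature(text: str, n_hashes: int = 2) -> int:
    """Bitwise OR of bits set for every single character and adjacent pair."""
    units = list(text) + [text[i:i + 2] for i in range(len(text) - 1)]
    sig = 0
    for unit in units:
        for seed in range(n_hashes):
            sig |= 1 << _bit(unit, seed)
    return sig

def may_contain(doc_sig: int, query_sig: int) -> bool:
    """Signature test: every query bit must be set in the document signature.
    False drops are possible; misses are not."""
    return doc_sig & query_sig == query_sig
```

Because every character and character pair of a query string is also a unit of any text containing it, `may_contain` never misses a true match; it can, however, report a false drop, which is why balanced bit document counts matter.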
The size of the lexicon used in QJP is about 5000 words (50 KB), mainly composed of functional words such as particles. Words which are not particles or auxiliary verbs are then further segmented into their component words.

3.2 Statistical Compound Word Segmentation

To segment a compound word into component words, the breaks between the component words have to be found. For a given word, the probability that a given character pair is a break between two component words is computed for every adjacent character pair in it. The word is then segmented at the points where the probabilities are greater than a segmentation threshold.

To compute the segmentation probability of a character pair, it is assumed that the segmentation probability is the product of the tail probability of the former character and the head probability of the latter character. Here, a character's head or tail probability is the probability that the character appears at the head or tail of a word. These values can be compiled automatically by counting the number of times a character occurs in a large corpus. It should be noted that the data size is small, i.e., proportional to the size of the character set.

For example, if the tail probability of "平" is 0.20 and the head probability of "和" is 0.09, the segmentation probability of the character pair "平和", which forms the word "peace," becomes 0.20 × 0.09 = 0.018. Given the compound word "平和維持活動" (peace keeping operation), the segmentation probabilities between all character pairs are computed in the same way. If the result is as shown below, the word is correctly segmented into "平和" (peace), "維持" (keeping), and "活動" (operation) by setting the threshold between 0.047 and 0.104:

    平-和: 0.018   和-維: 0.104   維-持: 0.047   持-活: 0.2265   活-動: 0.0290

As you may understand from this example, the threshold value controls segmentation and therefore greatly influences both the effectiveness and efficiency of retrieval.
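The segmentation rule just described, splitting a compound wherever tail(former) × head(latter) exceeds the threshold, can be sketched as follows. The probability tables below are toy values invented for illustration (real values would be counted from a corpus), the kanji spelling of the "peace keeping operation" example is reconstructed from its English glosses, and the function and variable names are assumptions.

```python
# Toy head/tail probabilities; real values come from corpus counts.
TAIL = {"平": 0.20, "和": 0.50, "維": 0.10, "持": 0.60, "活": 0.05}
HEAD = {"和": 0.09, "維": 0.30, "持": 0.20, "活": 0.40, "動": 0.10}

def segment(word: str, tail_p: dict, head_p: dict, threshold: float) -> list:
    """Split `word` after position i whenever tail(word[i]) * head(word[i+1])
    exceeds the segmentation threshold."""
    parts, start = [], 0
    for i in range(len(word) - 1):
        p_break = tail_p.get(word[i], 0.0) * head_p.get(word[i + 1], 0.0)
        if p_break > threshold:
            parts.append(word[start:i + 1])
            start = i + 1
    parts.append(word[start:])
    return parts

# "peace keeping operation" splits into its three component simple words.
print(segment("平和維持活動", TAIL, HEAD, threshold=0.1))
```

Raising the threshold yields fewer, longer segments; lowering it yields many short ones, which is exactly the effectiveness/efficiency trade-off discussed in the text.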
That is, when the value is too small, words are divided into many small parts. Effectiveness thus increases because fewer documents are missed, but retrieval takes longer because a query includes more words, which generate many more documents to be ranked.

4 Query Evaluation

4.1 Evaluation Procedure using Upper Bounds

When one uses a word-based signature file, retrieval effectiveness deteriorates because there is no room to store the in-document frequencies [25]. One attempt to solve this replaces the term's in-document frequency by the number of logical blocks, in a given document, that contain the term [4]. Another uses several signature files, each of which corresponds to a certain in-document frequency [31]. These methods are not true solutions, however, because the former is even less effective and the latter entails a large space overhead.

Yet another method uses the signature file to determine the documents to be ranked, and the final relevance values are computed using a non-inverted description file that records the frequencies of terms [15][26]. The introduction of the description file, however, requires another disk access, which might decrease the retrieval speed. Considering observations that users generally assess only a limited number of top-ranked documents in retrieval results [4][32], Knaus and Schäuble focused on speeding up the identification of the top-ranking documents. In their method, the upper bound of the relevance value is computed for all retrieved documents using the signature file. These upper bounds are, as shown below, used to determine the order of computing the final relevance values, and to judge whether the top-ranked documents have been identified.

Let ū(d) be the upper bound of the relevance value of a document d, and let d_i be the identifier of the i-th pre-ranked document, so that ū(d_i) ≥ ū(d_j) holds for documents d_i and d_j with i < j. Given the upper-bound list for the ranking candidates, when the top j documents in the upper-bound list have been evaluated, the (j+1)-st document has the largest upper bound among the documents
that have not been evaluated. Thus, one can determine the final ranks of the documents belonging to the document set

    D_j = { d_i | i ≤ j, r(d_i) ≥ ū(d_{j+1}) }    (1)

where r(d) denotes the exact relevance value of document d. Therefore, to determine the top k documents of the final ranking, it is enough to evaluate documents until |D_j| ≥ k, where |·| denotes the number of items in a set.

The following procedure implements the above idea in two distinct phases. One first has to prepare a formula that computes an upper bound without using in-document frequencies. In the pre-ranking phase, candidate documents that contain at least one query term are identified using the signature file, and their upper bounds are computed. In the reevaluation phase, the candidate documents are, in the order of their upper bounds, reevaluated one by one until the top k documents are fixed. In each iteration, the exact relevance value is computed using the query terms' in-document frequencies obtained from the description file. The iteration terminates immediately when |D_j| ≥ k.

Figure 3 illustrates the reevaluation phase. In this figure, each document is identified by a letter of the alphabet, and the dark hatched bars represent documents whose final ranks are determined. In the j-th iteration, the j-th document is evaluated and all the exact relevance values of the evaluated documents are compared with the upper bound of the (j+1)-st document. At the 5th step, for example, the final ranking is determined for three documents.
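The two-phase procedure just described can be sketched as follows: candidates from the pre-ranking phase are sorted by upper bound, exact relevance values are computed lazily, and the loop stops as soon as k evaluated documents (k being the number of documents requested) score at least the upper bound of the next unevaluated document. The interface is an assumption: `upper` and `exact` are caller-supplied scoring callbacks, and correctness requires exact(d) ≤ upper(d) for every document.

```python
def top_k(candidates, upper, exact, k):
    """Return the top-k (doc, score) pairs, evaluating exact scores lazily
    in decreasing order of their upper bounds."""
    order = sorted(candidates, key=upper, reverse=True)
    scored = []  # (doc, exact score) pairs for evaluated documents
    for j, doc in enumerate(order):
        scored.append((doc, exact(doc)))
        # Largest upper bound among still-unevaluated documents (0 if none).
        next_ub = upper(order[j + 1]) if j + 1 < len(order) else 0.0
        fixed = [pair for pair in scored if pair[1] >= next_ub]
        if len(fixed) >= k:  # enough final ranks are fixed; stop early
            fixed.sort(key=lambda pair: pair[1], reverse=True)
            return fixed[:k]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With toy upper bounds {a: 10, b: 8, c: 6, d: 4} and exact scores {a: 9, b: 5, c: 6, d: 1}, asking for the top 2 returns a and c after exactly evaluating only a, b and c; document d is never scored, which is where the savings come from.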
If k ≤ 3, the iteration can be stopped at the 5th step.

It should be noted that the above procedure is quite different from other query evaluation methods using upper bounds [2][27][32] in its complete separation of the upper bound and relevance value computations. This separation comes from the fact that in-document frequencies cannot be obtained from the signature file index.

4.2 Modifications for Character-based Indexing

The above procedure was developed for word-based signature files and cannot be directly applied to character-based ones. We have therefore made the following modifications to the upper bound method.

[Figure 3: Reevaluation process (Step 1 through Step 5).]

The first modification is in the way the document frequencies are obtained. In the character-based organization, because there is no word table (lexicon), there is no space to store the document frequencies in the signature file. To solve this problem, the term's document frequency is replaced by the number of retrieved documents that are judged, using the signature file, to contain the term. Although this document frequency is greater than its real value because of false drops, the effect on the ranked results is negligible, as the logarithm of the document frequency is taken in calculating the relevance value.

The second modification is that the non-inverted description file is not used. The reason is that, because terms are not identified at document registration in our implementation, in-document frequencies are not established and thus the description file cannot be created. Instead, the original method is modified to directly access documents in order to establish the term frequencies at retrieval time. This modification is also useful in reducing space overhead, because the description file is no longer needed.

The final modification is that the processing strategy in the pre-ranking phase is changed. In the original method [26], the evaluation is carried out in the "document at a time" way [27], where the upper bounds
are computed in sequential order of document identifier, because the signature file is organized sequentially to increase updating efficiency. But because our file is organized in a bit-sliced manner in order to maximize retrieval performance, the evaluation is processed in the "term at a time" way [27], where terms are picked up from the query one by one and the upper bounds of the documents containing each term are updated simultaneously.

4.3 Ranking Model

We used Robertson's model, a probabilistic ranking model which assumes that a word's frequency follows a Poisson distribution [23]. We adopted this model for two reasons: (1) it achieved good performance in TREC, a large-scale IR system evaluation contest for English document collections [8]; (2) most retrieval models require establishing the in-document frequencies of all the terms in a target document in order to normalize them, but Robertson's model does not. This is a very attractive feature for our method, in which no word is identified at document registration.

To compute the relevance value, Robertson presented a basic model, BM15, and another model, BM11, that incorporates the effect of the document length in normalizing a term's in-document frequency. In our experiment, the relevance value of a document $d$ is computed by the following formula, which merges both models:

$R(d) = \sum_{t \in q} \log\frac{N}{n_t} \cdot \frac{qtf_t}{k_q + qtf_t} \cdot \frac{tf_t}{K + tf_t}, \qquad K = k_d \left( (1-\lambda) + \lambda \frac{l_d}{l_{avg}} \right)$   (2)

where $n_t$ is the document frequency of term $t$, $qtf_t$ is $t$'s in-query frequency, and $tf_t$ is $t$'s in-document frequency. $k_q$ and $k_d$ are constants for normalizing $qtf_t$ and $tf_t$, and $N$ is the number of documents in the collection.
$l_d$ and $l_{avg}$ are the length of the target document and the average document length in the collection, and $\lambda$ is a constant that controls the effect of the document length. Note that $\lambda = 0$ and $\lambda = 1$ correspond to the BM15 and BM11 models, respectively.

To compute the upper bounds of the relevance values, we have established the formula

$U(d) = \sum_{t \in q} \log\frac{N}{n_t} \cdot \frac{qtf_t}{k_q + qtf_t} \cdot \delta_t(d)$   (3)

where $\delta_t(d)$ is 1 if the signature file judges that term $t$ exists in document $d$, and 0 otherwise. Because $tf_t/(K + tf_t) < 1$, Formula (3) gives an upper bound of the relevance value given by Formula (2).

5 Optimizing Techniques

5.1 Relaxing the Stop Condition

In the above query evaluation method, the rank of a given document $d_j$ is determined when the condition $R(d_j) \ge U(d_{i+1})$ is satisfied. However, this condition seems too strict, because there sometimes exists a document whose final ranking can in fact be determined but which does not fulfill the stop condition.

Let us consider a relaxed stop condition for the $i$-th iteration, with a constant $\alpha$ ($0 < \alpha \le 1$), and denote the document set that satisfies the new condition by

$D_i^{\alpha} = \{\, d_j \mid j \le i,\ R(d_j) \ge \alpha\, U(d_{i+1}) \,\}$   (4)

Because $D_i \subseteq D_i^{\alpha}$, we can expect an $i'$ for which $h \le |D_{i'}^{\alpha}|$ that is smaller than the $i$ that fulfills $h \le |D_i|$. This means that fewer documents need to be reevaluated to determine the top-ranked documents, and the modified condition therefore speeds up the reevaluation phase.

Relaxing the stop condition can affect the ranking results, because there may exist a document $d_j$ such that $R(d_j) \ge \alpha\, U(d_{i+1})$ but $R(d_j) < U(d_{i+1})$. However, since a term's in-document frequency is usually not large, the contribution of the component $tf_t/(K + tf_t)$ in Formula (2) is smaller than that of $\delta_t(d)$ in Formula (3). Thus, by controlling $\alpha$, the effect on the ranking results and the decrease in retrieval effectiveness can be made negligible.

5.2 Selecting Query Terms

5.2.1 Methodology

One way to speed up ranking retrieval is to limit the number of query terms actually used, by selecting only those terms that have a large impact on the relevance values [9]. A term's impact is usually measured by its idf (the inverse of the term's document frequency). This term selection reduces not only the amount of index access but also the number of candidate documents, resulting in a reduction of the relevance value computation and of the memory required to store
intermediate results [1][18][27]. If terms with low idfs are simply discarded, however, their contribution to the relevance values is lost and retrieval becomes less effective. The selected query terms should therefore be used only to determine the ranking candidates, and the discarded terms should still be used to compute the relevance values of the candidate documents [9][18].

One can apply this idea to the proposed query evaluation method as follows. The system at first selects the query terms whose idfs are greater than $\beta \cdot idf_{max}$, where $\beta$ is a constant between 0 and 1 and $idf_{max}$ is the maximum idf among all the query terms. Note that the idf in our ranking model is given by the first component of Formula (2). In the pre-ranking phase, only these selected terms are used to access the signature file to determine the candidate documents, and the upper bounds are computed. In the reevaluation phase, all the query terms, including the discarded ones, are used to compute the relevance values of the candidate documents.

This simple application, however, causes another problem. Since our signature file is organized in a bit-sliced manner, which of the non-selected terms a candidate document contains is unknown in the pre-ranking phase. Because the estimated value needs to be an upper bound of the relevance value, all the remaining terms must be assumed to occur in every candidate document, and thus $\delta_t$ for these terms must be set to 1. This assumption, however, causes performance degradation. That is, some $\delta_t$s are set to 1 even though their real values are 0, so the estimated value in this case is greater than the real upper bound computed using all the terms. This increase in the upper bounds causes more reevaluation iterations before the stop condition is satisfied.
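The idf-based selection step can be sketched as follows. This is an illustrative fragment, not the paper's code: the function name and the `doc_freq` mapping are assumptions, and log(N/n_t) is used as the idf.

```python
import math

def select_query_terms(terms, doc_freq, n_docs, beta):
    """Split query terms by the idf threshold beta * idf_max (0 < beta <= 1)."""
    idf = {t: math.log(n_docs / doc_freq[t]) for t in terms}
    threshold = beta * max(idf.values())
    selected = [t for t in terms if idf[t] >= threshold]   # used in pre-ranking
    discarded = [t for t in terms if idf[t] < threshold]   # used only at reevaluation
    return selected, discarded
```

Only `selected` would drive the signature-file accesses that determine the candidate set; both lists contribute when the exact relevance values of the candidates are computed.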
In the worst case, the lengthening of the reevaluation phase outweighs the shortening of the pre-ranking phase, and the total response time increases. To avoid this problem, $\delta_t$ is set to a constant $\gamma$ ($0 < \gamma \le 1$) for the remaining terms. As $\gamma$ decreases, the upper bounds decrease, and thus the reevaluation phase terminates earlier. Of course, retrieval effectiveness is affected by this modification, since it violates the upper bound condition and the estimated values sometimes become lower than the corresponding relevance values. But because the probability that a remaining term appears in a given document is usually small, the decrease in effectiveness can be kept small by controlling $\gamma$ appropriately.

5.2.2 Estimation of Document Frequency

When the query term selection optimization is used, the character-based signature file raises another problem. To select query terms, the document frequencies of the terms must be established so that their idfs can be computed before the signature file is accessed. Since document frequencies are obtained by accessing the signature file in our implementation, however, it is impossible to simply apply the optimized query term selection: the document frequencies need to be obtained without accessing the signature file.

We have therefore developed a method of estimating the document frequency from character-level statistical information. Let $p_t$ be term $t$'s occurrence probability, obtained by dividing the number of occurrences of term $t$ by the total size of the collection, and let $l$ be the length of a document. Since the document frequency is roughly equal to $N\,l\,p_t$, what we have to do is to estimate $p_t$. For this estimation, it is assumed that a word is generated in such a way that the occurrence probability of a character is determined by the character class of its preceding character. Let term $t$ consist of $m$ characters $c_1 c_2 \cdots c_m$. When $G_k$ denotes the character class of the $k$-th character, $p(c_k)$ denotes the occurrence probability of the character, and $p(c_k \mid G_{k-1})$ denotes the conditional occurrence probability of the character after a character belonging to the
character class $G_{k-1}$, the assumption is expressed as

$p_t = p(c_1) \prod_{k=2}^{m} p(c_k \mid G_{k-1})$   (5)

When the relationships $p(c_k \mid G_{k-1}) = p(c_k, G_{k-1})/p(G_{k-1})$ and $p(c_k, G_{k-1}) = p(G_{k-1} \mid c_k)\,p(c_k)$ given by Bayesian inference are used, the above equation becomes

$p_t = p(c_1) \prod_{k=2}^{m} \frac{p(G_{k-1} \mid c_k)\, p(c_k)}{p(G_{k-1})}$   (6)

6 Evaluation

6.1 Test Conditions

The proposed ranking method was evaluated from the viewpoints of retrieval effectiveness and efficiency.

Effectiveness was measured using recall, the ratio of the number of relevant documents retrieved to the number of relevant documents in the entire collection, and precision, the ratio of the number of relevant documents retrieved to the total number of retrieved documents [30]. Measured results are shown as interpolated recall vs. precision graphs [8][30]. The BMIR-J1 collection, whose statistics are shown in Table 1, was used. It should be noted that in the baseline evaluation without the optimizations, recall and precision are computed using the entire ranking list, but for the optimized cases they are computed from the top 20 documents; the optimization parameters $\alpha$ and $\gamma$ (in the case of $\beta = 1$) only control the termination of the reevaluation phase, so they do not affect performance results measured from the entire ranking list.

Efficiency was measured by the response time needed to obtain the top $h$ documents. The BMIR-J1 is too small to evaluate

Table 1: Statistics of the document collections (BMIR-J1 and the 1993 N.K. articles): number of items, minimum document length (chr.), and total size.

Figure 4: Effect of P (recall-precision, and response time vs. number of top-ranked documents; P = 1.00, 0.10, 0.05, 0.00).

Figure 5: Effect of Kd (recall-precision, and response time vs. number of top-ranked documents; Kd = 0.00, 0.50, 1.00, 2.00).

the performance reached the best at P = 0.05, where precision is 22% higher than that without segmentation. However, precision decreased from P = 0.05 to P = 0.00, because a compound word is divided into almost single characters at P = 0.00, so the possibility that irrelevant documents receive higher relevance values by chance increases. From these results, the proposed word segmentation method is confirmed to be effective.

The response times for various P values are plotted on the right side
of the figure. The response time increased as P decreased, because the number of query terms and candidate documents increased accordingly. Actually, for P = 1.00, 0.10, 0.05, and 0.00, the average numbers of query terms generated were 2.74, 3.60, 4.40, and 7.72, and the average numbers of candidate documents were 23291, 45356, 75686, and 120712. For large $h$, however, the response time of P = 1.00 was worse than that of P = 0.10. At least $h$ documents have to receive relevance values higher than the upper bound of a certain document to finish a ranking, so the range of upper bound values needs to be wide in order to terminate the reevaluation iteration quickly. However, only a few query terms are generated, and the range of upper bounds becomes very small, when P = 1.00. As a result, all of the candidate documents sometimes needed to be evaluated, and the response time increased.

6.2.2 Effect of Kd

The effect of Kd is illustrated in Figure 5. The effectiveness was at its minimum at Kd = 0, which corresponds to the case in which the in-document frequency is not used. This result means that the in-document frequency plays an important role in ranking. Although the best performance was achieved at a moderate Kd, Kd has less impact than P.

As for the response time, the ranking result was established more quickly as Kd became smaller. That is because the difference between the final score and the upper bound becomes smaller as Kd decreases. We noticed that while the processing time for the pre-ranking phase stayed at the same level for all Kd s, the response time increased greatly as Kd increased for larger $h$. This is because Kd only affects the number of reevaluation iterations, through the decrease in the third component of Formula (2), and does not affect the pre-ranking phase.

Figure 6: Effect of $\lambda$ (recall-precision, and response time vs. number of top-ranked documents; $\lambda$ = 0.00, 0.20, 0.40, 1.00).

Figure 7: Effect of $\alpha$ (recall-precision, and response time vs. number of top-ranked documents; $\alpha$ = 1.00, 0.75, 0.50, 0.25).

6.2.3 Effect of $\lambda$

The
effect of the document length factor $\lambda$ is shown in Figure 6. The retrieval effectiveness attained its best performance at $\lambda = 0.2$, and after that decreased to a minimum at $\lambda = 1$. This result is incompatible with other experimental results, which showed that the document length factor had a quite positive effect; the above-mentioned BM11 model, which corresponds to $\lambda = 1$, attained higher performance than the BM15 model, which corresponds to $\lambda = 0$ [23]. Although the BM11 model requires that, to get the same relevance value, a term's occurrence frequency be proportional to the document length, this requirement is hard to maintain because a longer document frequently has more than one topic [10]. Thus, we believe that our result, in which the best performance was obtained at a point between the two extreme cases, coincides with intuition.

As for retrieval efficiency, increasing $\lambda$ decreased the performance. This is because larger in-document frequencies are in general obtained in longer documents, so the normalization makes the effective frequencies small and the processing thus takes much more time.

In summary, the above experiments indicate the best settings of Kd and $\lambda$; this combination of parameters (with $\lambda = 0.2$) was therefore used in the following experiments.

6.3 Effect of Stop Condition Relaxation

Measurement results are shown in Figure 7, in which $\alpha = 1.00$ corresponds to the baseline method. Although it may be difficult to see in the figure, the retrieval effectiveness for $\alpha < 1$ was slightly lower than the full-ranking result given in Figure 6. This is because the effectiveness here was measured from the top 20 documents, as described in Section 6.1, so some documents that should have been listed in the top 20 were missed. Precision was decreased by this optimization, but the decrease was small. On the other hand, the optimization was quite effective in speeding up retrieval, as shown in the right graphs of Figure 7.

Figure 8: Effect of $\beta$ (recall-precision, and response time vs. number of top-ranked documents; $\beta$ = 1.00, 0.75, 0.50, 0.25).

Figure 9: Effect of $\gamma$ (recall-precision, and response time vs. number of top-ranked documents; $\gamma$ = 1.00, 0.50, 0.10, 0.01).

The top 10 or 20 documents were, for example, identified 6 or 7 times more quickly when $\alpha = 0.50$. Because the improvement was almost the same for $\alpha = 0.50$ and 0.25, it seems better to set $\alpha$ at 0.50.

6.4 Effect of Query Term Selection

The query term selection optimization was evaluated by changing $\beta$ and $\gamma$.

First, the effect of $\beta$ is shown in Figure 8. The recall-precision graphs show that precision decreased as $\beta$ increased, and became considerably worse at $\beta = 1.00$. The reason for this decrease is that candidate documents containing only terms with smaller idfs were missed when the number of selected query terms became low.

The effect on the response time was complicated, as shown in the right graph. Increasing $\beta$ serves to speed up the pre-ranking phase by limiting the query words used in the pre-ranking and the number of candidate documents, which was indicated by a decrease in the response time for small $h$. Actually, the number of query words decreased as follows: 4.40 ($\beta = 0.25$), 4.09 ($\beta = 0.50$), 2.57 ($\beta = 0.75$), and 1.81 ($\beta = 1.00$). However, as mentioned in Section 5.2.1, the response time for larger $h$ went up as the number of query words became smaller, because the number of accessed documents in the reevaluation phase increased. In consequence, no performance gain was achieved by simply setting $\gamma$ to 1; $\gamma$ must be made smaller.

Figure 9 illustrates the effect of $\gamma$ when $\beta$ was fixed at 0.5. As we expected, the system responded faster as $\gamma$ decreased, throughout the range for all $h$s, but the speed-up saturated at $\gamma = 0.10$. Note that the response time for small $h$ did not change, since $\gamma$ does not change the number of query words and is therefore not effective in accelerating the pre-ranking phase. The effect on recall and precision is also shown in the same figure, and we found that $\gamma$ had a smaller effect. In conclusion, $\beta = 0.5$ and $\gamma = 0.1$ seem to be the best parameter settings for the query term selection optimization.

Finally, the performance was measured with the two optimizing methods combined. In Figure 10, RSC and QTS stand for the stop condition relaxation and the query term selection optimizations, and the parameters were set to $\alpha = 0.5$ for RSC, and $\beta = 0.5$ and $\gamma = 0.1$ for QTS.