Introduction to Using the Corpus of Contemporary American English (COCA)

Corpus of Contemporary American English (COCA): User Guide (PPT courseware)
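POS LIST
pron.INDF: indefinite pronoun
pron.PERS: personal pronoun
pron.WH: interrogative pronoun
pron.REFL: reflexive pronoun
adj.CMP: comparative adjective
adj.SPRL: superlative adjective
adv.particle: adverbial particle
adv.WH: interrogative adverb
The COCA interface
• Corpus section area (the five genres comprise 42 sub-corpora in total).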
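Main COCA search functions (1)
• Search for words, phrases, lemmas (all forms of a word), wildcards, and more complex strings.
Concordance display
• Blue: nouns • Purple: verbs • Green: adjectives • Brown: adverbs • Grey: pronouns • Yellow: prepositions
Main COCA search functions
• Example: a search for "white + noun" returns the nouns that follow white. The query is: white [n*].
POS LIST
verb base: base form of the verb
verb.INF: infinitive
verb MODAL: modal verb
verb 3SG: third person singular
verb ED: past tense
verb EN: past participle
verb ING: present participle
verb.LEX: lexical verb
verb.[BE]: be (copula)
verb.[DO]: do
verb.[HAVE]: have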

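Guide to Query Input in the COCA Interface (Revised)

The Materials for Guiding Students to Use COCA
1. Example entries for the string search box (WORD(S)) in the COCA interface
Columns: example input / purpose / notes and tips

Input: Jumbo or soft landing
Purpose: look up a specific word or phrase
Notes: longer strings (9 words or fewer) can also be entered

Input: borrow/lend
Purpose: quick comparison of the frequency of two words

Input: … SECTION 2=FIC
Purpose: see how synonyms of smart are used in newspapers and in fiction

Input: …
Purpose: see how synonyms of beautiful combine with synonyms of flower

Input: small, little; [nn*]; span 0/3; RELEVANCE
Purpose: compare the nouns used within 3 words after small and after little

Input: ground.[n*], floor.[n*]; [j*]; span 3/0; RELEVANCE
Purpose: compare the adjectives used within 3 words before ground and before floor when each is a noun

Input: …
Purpose: find sentence patterns in which is is contracted to 's
Notes: in this corpus 's can be searched as a separate word, that is, typed as 's with a space between it and the preceding word; other contractions are searched the same way

Input: it is [v*] that, or we [vv*] that
Purpose: find sentence patterns
Notes: the CHART display shows that the first is an academic pattern, 8.5 times as frequent as in speech; the second is most common in speech

Input: to [v*] or not to [v*]

Input: dis* [v?d]
Purpose: find a first word beginning with dis followed by a past tense form (note the difference from the entry above)
Notes: returns district had, disease was, disease had, etc.

Input: *ly.[j*]
Purpose: find adjectives ending in ly
Notes: returns only words ending in ly that are used as adjectives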
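How such slot-by-slot queries behave can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentence and its tags are hand-assigned, `fnmatch` stands in for the corpus's own matcher (which is not reproduced here), and `TAGGED`, `slot_matches`, and `search` are hypothetical names, not part of COCA.

```python
from fnmatch import fnmatchcase

# Toy sentence as (word, tag) pairs. Tags are hand-assigned for illustration
# and only imitate the n*/v*/j* style used in the table above.
TAGGED = [
    ("it", "pp"), ("is", "vbz"), ("believed", "vvn"), ("that", "cst"),
    ("the", "at"), ("disease", "nn1"), ("had", "vhd"), ("spread", "vvn"),
    ("in", "ii"), ("early", "jj"), ("spring", "nn1"),
]

def slot_matches(slot, word, tag):
    """Return True if one query slot matches one (word, tag) token."""
    if slot.startswith("[") and slot.endswith("]"):
        # [v*]: the slot constrains only the tag
        return fnmatchcase(tag, slot[1:-1])
    if ".[" in slot and slot.endswith("]"):
        # *ly.[j*]: the slot constrains both the word form and the tag
        word_pat, tag_pat = slot[:-1].split(".[")
        return fnmatchcase(word, word_pat) and fnmatchcase(tag, tag_pat)
    # plain word, possibly with * or ? wildcards
    return fnmatchcase(word, slot)

def search(query, tagged):
    """Slide the space-separated query slots over the tagged tokens."""
    slots = query.split()
    hits = []
    for i in range(len(tagged) - len(slots) + 1):
        window = tagged[i:i + len(slots)]
        if all(slot_matches(s, w, t) for s, (w, t) in zip(slots, window)):
            hits.append(" ".join(w for w, _ in window))
    return hits

print(search("it is [v*] that", TAGGED))
print(search("dis* [v?d]", TAGGED))
print(search("*ly.[j*]", TAGGED))
```

The space between slots is what separates one word position from the next, which is why a query such as `dis* [v?d]` describes two words while `*ly.[j*]` describes one.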

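The Value of the Corpus of Contemporary American English (COCA) in Vocabulary Teaching

Zhang Renxia

[Abstract] This study describes how the Corpus of Contemporary American English (COCA) can be used in English vocabulary teaching: enriching word meanings and building schemata; learning collocations and generalizing semantic preferences; developing students' awareness of register so that they use words appropriately; finding synonyms and near-synonyms; and acquiring vocabulary from authentic texts and contexts while practising observation and induction. COCA is a valuable corpus resource and tool for students' autonomous online vocabulary learning.

[Journal] Journal of Qiqihar University (Philosophy and Social Science Edition) [Year (Volume), Issue] 2015(000)004 [Pages] 4 [Keywords] corpus; COCA; vocabulary teaching

In recent years the rapid development of computer and network technology has created new conditions for English teaching and greatly improved its efficiency. Bringing online corpora into the classroom greatly enriches the means available to teachers.

COCA (Corpus of Contemporary American English) is the most recent corpus of contemporary American English and the largest balanced corpus of English in the world today. For a systematic introduction, see "The Corpus of Contemporary American English (COCA): A Good Platform for English Teaching and Research" [1].

Specialist corpora require expensive software or registration fees, and busy teachers have no time to build corpora of their own. As a result many English teachers shy away from corpora, and since most of them are also wary of technology, they regard corpora as out of reach.

For English teachers and learners who want to observe how American English is used and how it is changing, COCA offers a good platform that is free to use online. It was developed by Professor Mark Davies of Brigham Young University and contains as many as 450 million words.

Its interface is designed mainly to let linguists and language learners find the frequency of words, phrases, and sentence structures and compare related information. It meets the three most basic requirements of a good corpus: size, speed, and part-of-speech annotation. [2]

The data cover 22 years (1990 to 2012) of American spoken language, fiction, popular magazines, newspapers, and academic journals, and the five genres are represented in roughly equal proportions.

COCA also has a notable advantage over other corpora: it is a dynamic resource with no final version. It is continually updated, growing by about 20 million words a year, and is to be updated at least twice a year in future.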

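COCA Vocabulary Levels

Contents
1. Introduction: COCA vocabulary levels
2. Definition and division of COCA vocabulary levels
3. Areas of application
4. Importance for English learning
5. Conclusion: the value and significance of COCA vocabulary levels

Main text

1. Introduction
COCA (Corpus of Contemporary American English) is a large corpus of contemporary American English containing a great many English words and phrases. In COCA, vocabulary is divided into five levels: high-frequency, mid-frequency, low-frequency, rare, and extremely rare words. These levels are a useful reference for English learners.

2. Definition and division of COCA vocabulary levels
(1) High-frequency words: the words that occur most often in COCA, such as "the", "is", and "and". They are the foundation of the language, and mastering them makes reading and writing more efficient.

(2) Mid-frequency words: words that occur fairly often in COCA, such as "education" and "technology". They extend the learner's vocabulary and improve reading comprehension.

(3) Low-frequency words: words that occur with moderate frequency in COCA, such as "empanada" and "antics". They are uncommon in everyday conversation but appear in particular settings, and mastering them makes expression more precise.

(4) Rare words: words that occur with low frequency in COCA, such as "plethora" and "ephemeral". They seldom appear in everyday conversation but do occur in literature and specialist fields, and mastering them adds depth to reading and writing.

(5) Extremely rare words: words that occur with very low frequency in COCA, such as "supercalifragilisticexpialidocious". Learners will almost never need them, but they are of some interest to language researchers and word enthusiasts.

3. Areas of application
COCA vocabulary levels are widely used in English teaching, research, and translation. Learners can use the levels to target their study and memorization and so raise their level of English.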

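Key Points for Using the Corpus of Contemporary American English (COCA)

Introduction to the COCA corpus
About COCA
– COCA covers American spoken language (spoken), fiction (fiction), popular magazines (popular magazines), newspapers (newspaper), and academic journals (academic) for this period, and the five genres are represented in roughly equal proportions.
– URL: /coca
• Example 1. Enter the word "mysterious" (Figure 2.1.1-1). The results (Figure 2.1.1-2) show its frequency in each sub-corpus and its frequency per million words.
• Clicking one of the bars in Figure 2.1.1-2 brings up the KWIC view, as in Figure 2.1.1-3 (here the Fiction bar was clicked):
Figure 2.1.1-1
Figure 2.1.1-2
CHART display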
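The per-million figure in the chart is a normalized frequency: the raw count divided by the size of the section, scaled to one million words, so that sections of different sizes can be compared. A minimal sketch follows; the counts and section sizes are invented for illustration and are not COCA data, and `per_million` is a hypothetical helper name.

```python
def per_million(raw_count, section_size):
    """Normalize a raw hit count to occurrences per million words."""
    return raw_count * 1_000_000 / section_size

# Hypothetical (raw count, section size in words) pairs, NOT real COCA figures.
sections = {
    "SPOKEN": (450, 90_000_000),
    "FICTION": (1_800, 90_000_000),
    "ACADEMIC": (400, 80_000_000),
}

for name, (count, size) in sections.items():
    print(f"{name}: {per_million(count, size):.2f} per million")
```

With these made-up numbers the ACADEMIC section has fewer raw hits than SPOKEN but the same normalized frequency, which is why the chart is read by the per-million value and not the raw count.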
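POS LIST
det.GEN: generic determiner
det.POS: possessive determiner
num.CARD: cardinal number
num.ORD: ordinal number
conj.CRD: coordinating conjunction
conj.SUB: subordinating conjunction
Interj.: interjection
PUNC: punctuation
Using the POS list
• 1) Look up a particular part of speech of a word with several senses • 2) Find collocates of a given part of speech before or after a word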
II. Main COCA search functions
• 2.1 Search for words, phrases, lemmas (all forms of a word), wildcards, and more complex strings.

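COCA Vocabulary Levels

Outline
I. Introduction
1. Background and purpose of COCA vocabulary levels
2. Why they matter to learners
II. Overview of COCA vocabulary levels
1. Definition and origin of COCA
2. Classification and characteristics of the levels
III. Applications
1. Role in English learning
2. How to use the levels effectively to improve one's English
IV. Comparison with other vocabulary systems
1. GSL (General Service List)
2. BNC (British National Corpus)
V. Conclusion
1. Summary of the importance of COCA vocabulary levels
2. Encouragement for learners to make active use of them

Main text

I. Introduction
COCA (The Corpus of Contemporary American English) vocabulary levels are an important tool for learners who want to improve their English. They help learners master the most common English words and also show how difficult and how important each word is, so that study can be planned better.

II. Overview of COCA vocabulary levels
COCA vocabulary levels are the result of research based on the COCA corpus. The corpus contains a large amount of American English text, including books, newspapers, magazines, and web articles, about 520 million words in all. By analysing this material, researchers classified words by their frequency of use and importance in English, producing the COCA vocabulary levels.

There are ten levels, from the most common Level 1 words to the relatively obscure Level 10 words. Each level has its own typical contexts and importance. Level 1 words, for example, are the most common words in English and must be mastered thoroughly; Level 10 words are used infrequently in daily life but matter for specialist knowledge in particular fields such as science, technology, and medicine.

III. Applications
COCA vocabulary levels have wide application in English learning. By mastering the words at each level, learners can raise their level of English.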

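How to Look Up Words in the Corpus of Contemporary American English (COCA)

Abstract: The Corpus of Contemporary American English (COCA) was developed by Professor Mark Davies of Brigham Young University. It currently contains 450 million words; it is the most recent corpus of contemporary American English and the largest balanced corpus of English in the world today. Its texts date from 1990 to 2012, it is updated every year, and its search functions are powerful, making it an excellent aid for learning English. Taking sorry as an example, this paper shows how to look up a word in COCA and reports the results of searching for and studying the word sorry.

Keywords: Corpus of Contemporary American English, balanced corpus, sorry

Abstract: The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. The corpus was created by Mark Davies of Brigham Young University, and it is used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created. The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words each year from 1990-2012.

Key words: the Corpus of Contemporary American English, balanced corpus, sorry

CLC number: H319.3 Document code: A Article ID: 1006-026X(2013)12-0000-02

I. Introduction
The Corpus of Contemporary American English (COCA) was developed by Professor Mark Davies of Brigham Young University. It currently contains more than 450 million words; it is the most recent corpus of contemporary American English and the largest balanced corpus of English in the world today, and it is linked to the other corpora built by the same developer.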

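The Value of the Corpus of Contemporary American English (COCA) in Vocabulary Teaching

Application Value of American COCA in Vocabulary Teaching
Author: Zhang Renxia
Affiliation: Department of College English, Guangdong Polytechnic Normal University, Guangzhou, Guangdong 501665
Journal: Journal of Qiqihar University (Philosophy and Social Science Edition)
Pages: 175-178
Year and issue: 2015, No. 4
Subject terms: corpus; COCA; vocabulary teaching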

Notes on Using COCA and Other Online Corpora

1. Who created these corpora?
The corpora were created by Mark Davies, Professor of Linguistics at Brigham Young University in Provo, Utah, USA. In most cases (though see #2 below) this involved designing the corpora, collecting the texts, editing and annotating them, creating the corpus architecture, and designing and programming the web interfaces. Even though I use the terms "we" and "us" on this and other pages, most activities related to the development of most of these corpora were actually carried out by just one person.

2. Who else contributed?

3. Could you use additional funding or support?
As noted above, we have received support from the US National Endowment for the Humanities and Brigham Young University for the development of several corpora. However, we are always in need of ongoing support for new hardware and software, to add new features, and especially to create new corpora. Because we do not charge for the use of the corpora (which are used by 80,000+ researchers, teachers, and language learners each month) and since the creation and maintenance of these corpora is essentially a "one person enterprise", any additional support would be very welcome. There might be graduate programs in linguistics, or ESL or linguistics publishers, who might want to make a contribution, and we would then "spotlight" them on the front page of the corpora. Also, if you have contacts at a funding source like the Mellon Foundation or the MacArthur grants, please let them know about us (and no, we're not kidding).

4. What's the history of these corpora?
The first large online corpus was the Corpus del Español in 2002, followed by the BYU-BNC in 2004, the Corpus do Português in 2006, TIME Corpus in 2007, the Corpus of Contemporary American English (COCA) in 2008, and the Corpus of Historical American English (COHA) in 2010.

5. What is the advantage of these corpora over other ones that are available?
For some languages and time periods, these are really the only corpora available. For example, in spite of earlier corpora like the American National Corpus and the Bank of English, our Corpus of Contemporary American English is the only large, balanced corpus of contemporary American English. In spite of the Brown family of corpora and the ARCHER corpus, the Corpus of Historical American English is the only large and balanced corpus of historical American English. And the Corpus del Español and the Corpus do Português are the only large, annotated corpora of these two languages. Beyond the "textual" corpora, however, the corpus architecture and interface that we have developed allows for speed, size, annotation, and a range of queries that we believe is unmatched with other architectures, and which makes it useful for corpora such as the British National Corpus, which does have other interfaces. Also, they're free -- a nice feature.

6. What software is used to index, search, and retrieve data from these corpora?
We have created our own corpus architecture, using Microsoft SQL Server as the backbone of the relational database approach. Our proprietary architecture allows for size, speed, and very good scalability that we believe are not available with any other architecture. Even complex queries of the more than 425 million word COCA corpus or the 400 million word COHA corpus typically only take one or two seconds. In addition, because of the relational database design, we can keep adding on more annotation "modules" with little or no performance hit. Finally, the relational database design allows for a range of queries that we believe is unmatched by any other architecture for large corpora.

7. How many people use the corpora?
As measured by Google Analytics, as of March 2011 the corpora are used by more than 80,000 unique people each month. (In other words, if the same person uses three different corpora a total of ten times that month, it counts as just one of the 80,000 unique users.) The most widely-used corpus is the Corpus of Contemporary American English -- with more than 40,000 unique users each month. And people don't just come in, look for one word, and move on -- average time at the site each visit is between 10-15 minutes.

8. What do they use the corpora for?
For lots of things. Linguists use the corpora to analyze variation and change in the different languages. Some are materials developers, who use the data to create teaching materials. A high number of users are language teachers and learners, who use the corpus data to model native speaker performance and intuition. Translators use the corpora to get precise data on the target languages. Some businesses purchase data from the corpora to use in natural language processing projects. And lots of people are just curious about language, and (believe it or not) just use the corpora for fun, to see what's going on with the languages currently. If you are a registered user, you can look at the profiles of other users (by country or by interest) after you log in.

9. Are there any published materials that are based on these corpora?
As of mid-2011, researchers have submitted entries for more than 260 books, articles and conference presentations that are based on the corpora, and this is probably only a small fraction of all of the publications that have actually been done. In addition, we ourselves have published three frequency dictionaries that are based on data from the corpora -- Spanish (2005), Portuguese (2007), and American English (2010).

10. How can I collaborate with other users?
You can search users' profiles to find researchers from your country, or to find researchers who have similar interests. In the near future, we may start a Google Group for those who want more interaction.

11. What about copyright?
Our corpora contain hundreds of millions of words of copyrighted material. The only way that their use is legal (under US Fair Use Law) is because of the limited "Keyword in Context" (KWIC) displays. It's kind of like the "snippet defense" used by Google. They retrieve and index billions of words of copyright material, but they only allow end users to access "snippets" of this data from their servers.

12. Can I get access to the full text of these corpora?
Unfortunately, no, for reasons of copyright discussed above. We would love to allow end users to have access to full-text, but we simply cannot. Even when "no one else will ever use it" and even when "it's only one article or one page" of text, we can't. We have to be 100% compliant with US Fair Use Law, and that means no full text for anyone under any circumstances -- ever. Sorry about that.

13. I want more data than what's available via the standard interface. What can I do?
Users can purchase derived data -- such as frequency lists, collocates lists, n-grams lists (e.g. all two or three word strings of words), or even blocks of sentences from the corpus. Basically anything, as long as it does not involve full-text access (e.g. paragraphs or pages of text), which would violate copyright restrictions.

14. Can my class have additional access to a corpus on a given day?
Yes. Sometimes your school will be blocked after an hour or so of heavy use from a classroom full of students. (This is a security mechanism, to prevent "bots" from running thousands of queries in a short time.) To avoid this, sign up ahead of time for "group access".

15. Can you create a corpus for us, based on our own materials?
Well, I probably could, but I'm not overly inclined to at this point. Creating and maintaining corpora is extremely time intensive, even when you give me the data "all ready" to import into the database. The one exception, I guess, would be if you get a large grant to create and maintain the corpus. Feel free to contact me with questions.

16. How do I cite the corpora in my published articles?
Please use the following information when you cite the corpus in academic publications or conference papers. And please remember to add an entry to the publication database (it takes only 30-40 seconds!). Thanks.
In the first reference to the corpus in your paper, please use the full name. For example, for COCA: "the Corpus of Contemporary American English" with the appropriate citation to the references section of the paper, e.g. (Davies 2008-). After that reference, feel free to use something shorter, like "COCA" (for example: "...and as seen in COCA, there are..."). Also, please do not refer to the corpus in the body of your paper as "Mark Davies' COCA corpus", "a corpus created by Mark Davies", etc. The bibliographic entry itself is enough to indicate who created the corpus.

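COCA

COCA: the Corpus of Contemporary American English (posted 2011-05-13 10:53:38)

Developed by Professor Mark Davies of Brigham Young University, COCA is the most recent corpus of contemporary American English, with as many as 360 million words, and is the largest balanced corpus of English in the world today.

Unlike other corpora, it is free for anyone to use online. It is a boon to English learners around the world, a rare treasury for learning English, and an excellent window on how American English is used and how it is changing.

COCA was officially launched on the Internet on 20 February 2008.

At the AACL 2008 (American Association for Corpus Linguistics) conference, attended by more than 140 scholars from 25 countries and regions, the conference organizer, Professor Mark Davies, presented the COCA corpus he had developed. Several other scholars at the conference also reported on research using the corpus, and the response was enthusiastic.

So that more English teachers and learners in China can benefit from large corpora, it is introduced here to readers in China.

1 Advantages
Many language learners favour ordinary web search engines, but only for want of a free corpus.

Using the web as a corpus has clear drawbacks: users cannot restrict the part of speech of the search term, compare one word with another, restrict the type of text searched, pin down usage in a given period, or limit the distance between the words searched for, so there is no way to determine the mutual information score of a word pair.

More specialized English corpora, meanwhile, require costly registration fees or software purchases, which puts them out of reach of ordinary users.

COCA is a large corpus that is online and free for everyone to use, and it gives English researchers and learners a good platform for sharing American English resources.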
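The mutual information score mentioned above measures how much more often two words occur together than chance would predict. The sketch below uses the standard pointwise definition, log2(observed / expected); the `span` factor, which scales the expected count by the width of the search window, is an assumption about how windowed counts are adjusted and is not taken from this text. All counts are invented for illustration.

```python
import math

def mutual_information(pair_count, node_count, collocate_count,
                       corpus_size, span=1):
    """Pointwise mutual information of a node word and a collocate.

    expected = how often the pair would co-occur by chance within
    `span` positions (span handling is an assumption of this sketch).
    """
    expected = node_count * collocate_count * span / corpus_size
    return math.log2(pair_count / expected)

# Hypothetical counts: each word occurs 1,000 times in a 1,000,000-word
# corpus, and the two are found together 8 times.
print(mutual_information(8, 1_000, 1_000, 1_000_000))  # 3.0
```

A score near 0 means the pair co-occurs about as often as chance predicts; higher scores indicate a stronger association, which is the information a plain web search cannot supply.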

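COCA Corpus Operation Demo (PPT tutorial)

Figure 2.1.4-1
Rule: to retrieve all forms of a word (singular and plural, all tenses), enclose the word in [ ] when entering it.
Figure 2.1.4-2
The base, comparative, and superlative forms of the adjective early are retrieved in a single search.
• 2.1.5 Combine a part of speech with partial spelling, for example all forms of all adjectives that begin with un- and end in -ed (Figure 2.1.5-1), or all phrases of the form verb + any word + ground (Figure 2.1.5-2):
• Rule: to retrieve all forms of all adjectives that begin with un- and end in -ed, enter: un*ed.[aj*]; to retrieve all phrases of the form verb + any word + ground, enter: [vv*] * [ground]. The first is for studying vocabulary, the second for finding collocations with a given part of speech.
Figure 2.3-1
Figure 2.3-2
• The frequency of the two can also be compared directly across sub-corpora. To do so, select the sub-corpora under SECTION 1 and SECTION 2, as shown below (Figure 2.3-3):
Figure 2.3-3
• 2.4 Comparing semantic tendencies • 2.4.1 Comparing near-synonyms • Example: how the nouns that follow the near-synonymous adjectives hot and warm differ (Figure 2.4.1):
Figure 2.4.1
Rule: first select the COMPARE display. Then enter hot and warm in the two WORDS boxes and [n*] in the COLLOCATES box, which stands for any noun that follows. The comparison can also be limited to a particular sub-corpus.
• 2.4.2 Comparing antonyms • Example: how the adjectives that precede woman and man differ (Figure 2.4.2)
Figure 2.4.2
Rule: enter woman and man in the two WORDS boxes, then enter [j*] in the COLLOCATES box and select 3 on the left, which stands for all adjectives within a span of 3 words before. The comparison can also be limited to a particular sub-corpus.
• 2.4.3 Searching for synonyms • Example: search for all synonyms of beautiful (Figure 2.4.3-1)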
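The COMPARE searches in 2.4.1 and 2.4.2 amount to counting collocates of a given part of speech inside a fixed window around each node word. A minimal sketch, assuming a tiny hand-tagged token list with one-letter tags (`TAGGED` and `collocates` are hypothetical names, and this is not how the COCA server is implemented):

```python
from collections import Counter

def collocates(tagged, node, tag_prefix, left=0, right=0):
    """Count words whose tag starts with tag_prefix within `left` words
    before and `right` words after each occurrence of `node`."""
    counts = Counter()
    for i, (word, _) in enumerate(tagged):
        if word != node:
            continue
        start = max(0, i - left)
        window = tagged[start:i] + tagged[i + 1:i + 1 + right]
        counts.update(w for w, t in window if t.startswith(tag_prefix))
    return counts

# Hand-tagged toy data: j = adjective, n = noun, c = conjunction, d = determiner.
TAGGED = [
    ("hot", "j"), ("water", "n"), ("and", "c"), ("warm", "j"),
    ("welcome", "n"), ("a", "d"), ("hot", "j"), ("summer", "n"), ("day", "n"),
]

hot_nouns = collocates(TAGGED, "hot", "n", right=1)
warm_nouns = collocates(TAGGED, "warm", "n", right=1)
print(hot_nouns, warm_nouns)

# Adjectives within 3 words to the left of a noun, as in the woman/man example.
print(collocates(TAGGED, "day", "j", left=3))
```

Comparing the two counters side by side is the toy equivalent of the COMPARE display: nouns that appear for one node word but not the other point to a difference in semantic tendency.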


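Even Professional Translators Love This Online Dictionary, and You Should Try It

Switching between Chinese and English is now an essential skill for students. Researchers have to write papers, and writing papers means running into translation problems: the usual task is to translate the title, abstract, and keywords into English. So how do you find the most precise word in a vast vocabulary? COCA, the online English corpus introduced here, addresses exactly this problem.

I. The COCA online English corpus
URL: /coca/
COCA is currently the largest free English corpus. It was created under the direction of Professor Mark Davies of Brigham Young University and went online in 2008. Besides its powerful text search, its texts are divided carefully by year, which lets researchers trace how the language has changed over time. It covers fiction, spoken language, magazines, newspapers, and academic writing.

It is very up to date, and some new words are included.

It can supplement an ordinary dictionary: put an expression you are unsure of into the corpus to check whether it is idiomatic or to find more information.

Note: it is advisable to register an account before using COCA (both use and registration are free); otherwise the number of queries is limited.

II. Key functions
1. Checking whether an expression is idiomatic English
On the site, the query box offers four kinds of search. Taking "Non-small cell lung cancer" as an example: click "List", type "Non-small cell lung cancer" into the search box, and click "Find matching strings". The results show 22 attested examples in the system, which indicates that the expression "Non-small cell lung cancer" is acceptable.

2. Searching for suitable collocations
On the site, click "List". The example here is "Regulatory mechanism of cell migration". (Note: typing a word directly returns results only for that exact form; to search all forms of a word, enclose it in [ ].) Enter [=cell] migration in the search box. This query looks for synonyms of "cell" that can combine with migration. In the results, besides cell, which combines well with migration, the system also suggests "group". Clicking "group migration" shows that COCA divides its texts into different registers, so users can compare which expression is most standard in each.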

Guide to Query Input in the COCA Interface (Revised)

Columns: example input / purpose / notes and tips

Notes: / means "or"; in LIST the results are arranged one above the other; results are easier to read with SHOW SECTIONS selected

Input: fairly *
Purpose: find what fairly combines with
Notes: * is a wildcard and here stands for any one word; note the space between fairly and *

Input: un*ly
Purpose: find words that begin with un and end in ly
Notes: returns words such as unlikely and unusually; here * stands for any number of letters

Input: [sing]
Purpose: find any form of sing
Notes: returns sing, singing, sang, etc., but not song

Input: [=publish]
Purpose: find synonyms of publish
Notes: = marks synonymy; the results include publish, circulate, announce, etc. Synonym search is a distinctive feature of the COCA corpus

Input: [[= publish]]
Purpose: find synonyms of publish and …

Input: [slip].[v*]
Purpose: find slip used as a verb
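The difference between a letter wildcard (un*ly) and a lemma query ([sing]) can be illustrated with a short Python sketch. Everything here is a stand-in: `fnmatch` imitates the wildcard only approximately, and the word list and the `LEMMAS` table are hand-written for illustration, not taken from COCA.

```python
from fnmatch import fnmatchcase

# Hypothetical word list and lemma table, written by hand for this sketch.
WORDS = ["unlikely", "unusually", "unhappy", "fairly", "sing", "sang", "song"]
LEMMAS = {"sing": ["sing", "sings", "singing", "sang", "sung"]}

def wildcard(pattern, words):
    """Return the words matching a pattern in which * is any run of letters."""
    return [w for w in words if fnmatchcase(w, pattern)]

def expand(term):
    """Expand [word] to all forms of the lemma; leave other terms unchanged."""
    if term.startswith("[") and term.endswith("]"):
        lemma = term[1:-1]
        return LEMMAS.get(lemma, [lemma])
    return [term]

print(wildcard("un*ly", WORDS))  # ['unlikely', 'unusually']
print(expand("[sing]"))          # ['sing', 'sings', 'singing', 'sang', 'sung']
print(expand("sing"))            # ['sing']
```

The wildcard works on spelling alone, whereas the lemma query relies on a table of related forms, which is why [sing] returns sang but not the differently derived noun song.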

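Introduction to English Corpora and Word Frequency Lists

An English corpus is a collection of language material gathered from written, spoken, and other kinds of English, covering every aspect of English vocabulary.

Three corpora are currently in mainstream use: GBC (Google Books Corpus), BNC (British National Corpus), and COCA (Corpus of Contemporary American English).

COCA was developed by Professor Mark Davies of Brigham Young University. It is the most recent corpus of contemporary American English and the largest balanced corpus of English in the world today.

It covers American spoken language, fiction, popular magazines, newspapers, and academic journals for this period, and the five genres are represented in roughly equal proportions.

The COCA word frequency list is based on the 500-million-word COCA corpus. The top 5,000 and top 20,000 high-frequency words are extracted algorithmically and annotated with collocations, which addresses the most practical question about a word: how it is actually used.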
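The idea behind a frequency list, counting every word form and keeping the N most frequent, can be sketched with Python's standard library. The one-line sample text and the function name `frequency_list` are illustrative assumptions; the published COCA lists are built from the full corpus with lemmatization and part-of-speech information that this sketch does not attempt.

```python
import re
from collections import Counter

def frequency_list(text, top_n):
    """Return the top_n most frequent word forms with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = "The cat and the dog and the bird"
print(frequency_list(sample, 2))  # [('the', 3), ('and', 2)]
```

On a real corpus the same ranking would be cut at 5,000 or 20,000 entries instead of 2.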

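The BNC and COCA Corpora

2. Phrases (adjective + noun): entering "white + noun" returns the noun phrases that follow white. The query is: white [n*]
Rule: use the correct tag for each part of speech: noun [n*]; verb [v*]; adjective [aj*]; adverb [av*] ...
Figure: phrases of white + noun
3. Enter un*ly and r?n*
Figure: words beginning with un and ending in ly
Figure: words beginning with r with n in the middle
4. Enter a lemma (all forms of a word: singular and plural, all tenses), taking sing as the example
Rule: to retrieve all forms of a word (singular and plural, all tenses), enclose the word in [ ] when entering it.
The base, comparative, and superlative forms of the adjective early are retrieved in a single search.
5. To retrieve a part of speech combined with partial spelling, for example all forms of all adjectives that begin with un- and end in -ed, enter: un*ed.[aj*]

User Guide to the Brigham Young University BNC Corpus
/bnc
The BYU-BNC interface
1. Example word: mysterious
LIST display
CHART display
KWIC (keyword in context) display
Rule: enter hot and warm in the two words boxes, then enter [n*] in the collocates box, which stands for any noun that follows.
11. Search for collocates and their frequency, for example the nouns that follow thick
Rule: enter [nn*] in the context box and select 4, which stands for any noun occurring within a span of 4 words after thick
we want to say
Translation is a gradual process. It takes accumulated experience, patience, and a constant effort to explore every side of a subject. Corpora make the translator's work easier. We hope to make full use of every resource available to keep improving our translation skills, and, under the guidance of our teacher Cen, to translate well and plan our future path in translation.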
