Chinese Segmentation with a Word-Based Perceptron Algorithm
The BERT-based Document Segmentation for Chinese model is a semantic segmentation model used mainly for automatic text segmentation, particularly in Chinese text processing.
It uses BERT (Bidirectional Encoder Representations from Transformers), a deep bidirectional encoder built on the Transformer architecture that is widely applied to natural language processing tasks such as text classification, named entity recognition, and sentiment analysis.
In the text segmentation task, the model takes a long text as input and automatically splits it into multiple meaningful paragraphs or sentences.
This segmentation can follow the semantic content of the text rather than relying only on fixed formats or rules.
The advantages of using BERT for semantic document segmentation include:
1. Deep bidirectional processing: BERT understands context and captures relationships between sentences, which allows the model to segment text more accurately.
2. Strong pre-training: BERT is pre-trained on large amounts of unlabeled data, which lets it adapt to a wide range of languages and tasks.
3. Scalability: thanks to BERT's architecture, performance can be improved by adding more layers or using more powerful hardware.
However, this kind of model also has limitations, such as high computational cost and the need for large amounts of training data.
In addition, some specific text segmentation tasks may require a more specialized model or extra training data.
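As a concrete illustration, the sketch below shows one way such a model might be called through the ModelScope pipeline API. The task constant, the model identifier (damo/nlp_bert_document-segmentation_chinese-base), the keyword argument and the output key are assumptions based on how ModelScope typically exposes this model and may differ across library versions; treat this as a minimal sketch rather than the definitive interface.

```python
# Minimal sketch: paragraph segmentation of a long Chinese text with a
# BERT-based document-segmentation model served through ModelScope.
# Assumption: the task name, model id, keyword argument and output key below
# match the published model; adjust them to whatever your ModelScope version exposes.
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

segmenter = pipeline(
    task=Tasks.document_segmentation,                          # assumed task constant
    model='damo/nlp_bert_document-segmentation_chinese-base')  # assumed model id

long_text = "BERT是一种预训练语言模型。它能够捕捉上下文信息。文档分割任务将长文本切分成有意义的段落。"
result = segmenter(documents=long_text)

# The pipeline is assumed to return the re-segmented text, with paragraph
# breaks inserted, under a 'text' key.
print(result.get('text', result))
```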
Chinese Natural Language Processing vs. English Natural Language Processing
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans using natural language. It is a field that has seen significant advancements in recent years, with researchers around the world working to improve the accuracy and effectiveness of NLP systems. In this article, we will compare and contrast the differences between NLP in Chinese and NLP in English.
Chinese NLP:
1. Character-based: One of the key differences between Chinese NLP and English NLP is that Chinese is a character-based language, whereas English is an alphabet-based language. This means that Chinese NLP systems need to be able to understand and process individual characters, as opposed to words in English.
2. Word segmentation: Chinese is also a language that does not use spaces between words, which means that word segmentation is a crucial step in Chinese NLP. This process involves identifying where one word ends and another begins, which can be challenging due to the lack of spaces.
3. Tonal differences: Another unique aspect of Chinese NLP is that Chinese is a tonal language, meaning that the tone in which a word is spoken can change its meaning. NLP systems need to be able to recognize and account for these tonal differences in order to accurately process and understand Chinese text.
English NLP:
1. Word-based: In contrast to Chinese, English is an alphabet-based language, which means that NLP systems can focus on processing words rather than individual characters. This can make certain tasks, such as named entity recognition, easier in English NLP.
2. Sentence structure: English has a more rigid sentence structure compared to Chinese, which can make tasks such as parsing and syntactic analysis more straightforward in English NLP. This is because English follows a specific subject-verb-object order in most sentences, whereas Chinese has a more flexible word order.
3. Verb conjugation: English is also a language that uses verb conjugation, meaning that verbs change form based on tense, person, and number. NLP systems need to be able to recognize and interpret these verb forms in order to accurately understand and generate English text.
In conclusion, while there are similarities between Chinese NLP and English NLP, such as the use of machine learning algorithms and linguistic resources, there are also key differences that researchers need to consider when developing NLP systems for these languages. By understanding these differences, researchers can continue to advance the field of NLP and improve the performance of NLP systems in both Chinese and English.
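The contrast between the two writing systems can be seen with a few lines of Python; the example strings are purely illustrative.

```python
# English text can be crudely tokenized by splitting on whitespace, but the same
# approach does nothing useful for Chinese, which has no spaces between words --
# hence the need for a dedicated word segmentation step.
english = "the discussion was very successful"
chinese = "讨论会很成功"

print(english.split())   # ['the', 'discussion', 'was', 'very', 'successful']
print(chinese.split())   # ['讨论会很成功'] -- one undivided chunk
print(list(chinese))     # ['讨', '论', '会', '很', '成', '功'] -- characters, not words
```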
Introduction to bert_document-segmentation_chinese-base
1. Introduction
1.1 Overview
In natural language processing, document segmentation is an important task whose goal is to split a long document according to its logical structure, so that subsequent semantic understanding and information extraction become easier.
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model based on the Transformer architecture that has achieved great success on natural language processing tasks.
This article introduces the relevant background of the BERT model and discusses Chinese document segmentation methods built on it.
As a general-purpose model for natural language processing tasks, BERT achieves leading results on a wide range of tasks through pre-training followed by fine-tuning.
Its distinctive feature is the Transformer architecture, which captures contextual information fully; its bidirectional encoding strategy effectively resolves ambiguity in language modeling.
BERT therefore has great potential for document segmentation tasks.
Document segmentation aims to split a long document into paragraphs, sentences, or other meaningful fragments in order to better understand the structure and semantic relationships of the text.
For Chinese, document segmentation is particularly challenging because of characteristics of the language, such as the lack of explicit delimiting cues.
This article therefore focuses on how the BERT model can be used to solve the Chinese document segmentation problem.
The article is organized as follows: first, we introduce the basic principles and characteristics of the BERT model, including the Transformer architecture and its pre-training techniques.
Then, we discuss document segmentation methods in detail and analyze their application to Chinese document segmentation.
Finally, we summarize the main content and give an outlook on possible future research directions.
By introducing the BERT model and document segmentation methods, we hope to provide a comprehensive view that gives readers a deeper understanding of how BERT can be applied to Chinese document segmentation.
We also hope to offer some useful insights and ideas for follow-up research.
The following sections introduce the BERT model and document segmentation methods step by step.
1.2 Article structure
The article structure is the overall organizational framework of a piece of writing and the hierarchical relationships among its parts.
A clear and well-organized structure is essential for readers to follow the ideas and reasoning of an article.
Postgres Jieba: Standard Word Segmentation
Postgres Jieba is a standard word segmentation tool for the Postgres database. It is based on the popular Chinese word segmentation library, Jieba, and its integration with Postgres enables efficient and accurate segmentation of Chinese text within the database. In this article, we will explore the features, installation process, and usage of Postgres Jieba to understand how it enhances text processing capabilities in Postgres.
I. Introduction to Postgres Jieba
Postgres Jieba is a powerful open-source Chinese word segmentation tool designed specifically for the Postgres database. It enables the extraction of meaningful words from Chinese text, facilitating subsequent analysis, search, and indexing operations. With its integration with Postgres, it becomes an indispensable tool for handling Chinese text efficiently within the database.
II. Installation of Postgres Jieba
1. Prerequisites
Before installing Postgres Jieba, ensure that you have the following dependencies:
- Postgres database installed and running
- Access to the database with administrative privileges
2. Installation Steps
To install Postgres Jieba, follow these steps:
Step 1: Download Postgres Jieba. Start by obtaining the Postgres Jieba extension from the official GitHub repository. You can either clone the repository or download the source code as a zip file.
Step 2: Compile and Install. Once you have the source code, navigate to the downloaded directory and execute the following commands:
make
sudo make install
These commands will compile the extension and install it into your Postgres database.
Step 3: Enable the Extension. After successful installation, connect to your Postgres database using a superuser account and enable the Postgres Jieba extension by running the following SQL command:
CREATE EXTENSION jieba;
III. Usage of Postgres Jieba
Postgres Jieba provides various functions that allow you to perform segmentation and analysis on Chinese text data stored in your Postgres database. Let's explore some of the key functionalities:
1. Segmentation Function
The primary function offered by Postgres Jieba is the word segmentation function, which splits Chinese text into meaningful words. To segment a text column in a table, use the following syntax:
SELECT jieba.cut('你好世界!') AS segmented_text;
This will return the segmented_text column with the segmented words. You can also segment a specific column of a table by substituting '你好世界!' with the column name.
2. Part-of-Speech Tagging
Postgres Jieba provides a function for part-of-speech tagging, which assigns a grammatical tag to each word in the segmented text. This feature enables more advanced analysis and understanding of the Chinese text. To perform part-of-speech tagging, use the following syntax:
SELECT jieba.posseg_cut('你好世界!') AS segmented_text_with_tags;
This will return the segmented_text_with_tags column, where each word is accompanied by its corresponding tag.
3. Custom Dictionary
Postgres Jieba allows the creation of a custom dictionary to include domain-specific terms or specialized vocabulary. To add words to the custom dictionary, use the following syntax:
SELECT jieba.insert_word('新冠病毒') AS added_word;
This will add '新冠病毒' (COVID-19) to the custom dictionary. The added_word column will display the word that was successfully added.
4. Stop Words
Postgres Jieba provides a mechanism to exclude certain words from the segmentation process. These words, known as stop words, are commonly used words with little semantic value. To define stop words, use the following syntax:
SELECT jieba.add_stop_word('的') AS added_stop_word;
This will exclude the word '的' (of) from the segmentation process. The added_stop_word column will display the stop word that was successfully added.
IV. Conclusion
Postgres Jieba is a valuable tool for handling Chinese text within the Postgres database. Its integration allows efficient segmentation, part-of-speech tagging, and customization options, enhancing the database's text processing capabilities. By using Postgres Jieba, developers and data analysts can extract meaningful information from Chinese text, enabling advanced analysis and search operations.
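To show how these SQL functions might be called from application code, here is a small Python sketch using psycopg2. The connection parameters are placeholders, and it assumes the extension has been installed and exposes the jieba.cut function exactly as described above.

```python
# Minimal sketch: calling the Postgres Jieba segmentation function from Python.
# Assumptions: a local database named 'textdb' with placeholder credentials, and a
# jieba.cut SQL function available as described in the article above.
import psycopg2

conn = psycopg2.connect(dbname="textdb", user="postgres",
                        password="secret", host="localhost")
try:
    with conn.cursor() as cur:
        # Segment a literal string; in practice the parameter would come from
        # user input or from another column in a query.
        cur.execute("SELECT jieba.cut(%s) AS segmented_text;",
                    ("你好世界！这是一个分词测试。",))
        segmented_text, = cur.fetchone()
        print(segmented_text)
finally:
    conn.close()
```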
Chinese Relation Extraction Based on Multi-Granularity and Semantic Information
CHEN Yu, ZHANG An-Qin, XU Chun-Hui (School of Computer Science and Technology, Shanghai University of Electric Power, Shanghai 201306, China). Corresponding author: ZHANG An-Qin.
Citation: CHEN Yu, ZHANG An-Qin, XU Chun-Hui. Chinese relation extraction based on multi-granularity and semantic information. Computer Systems & Applications, 2021, 30(3): 190-195. /1003-3254/7810.html [doi: 10.15888/ki.csa.007810]

Abstract: Chinese relation extraction adopts character-based or word-based neural networks. Most existing methods suffer from word segmentation errors and ambiguity, which inevitably introduce a lot of redundancy and noise and thus affect the results of relation extraction. To solve this problem, this study proposes a Chinese relation extraction model based on multi-granularity information combined with semantic information. In this model, we merge word-level information into character-level information to avoid errors caused by sentence segmentation, use external semantic information to model polysemous words to reduce the ambiguity they cause, and adopt a dual attention mechanism at the character level and the sentence level. The experimental results show that the proposed model can effectively increase the accuracy and recall of Chinese relation extraction and is superior to and more interpretable than the baseline models.
Key words: information processing; relation extraction; attention mechanism; word vector representation; Bi-LSTM

With the arrival of the big-data era, data keeps growing and information overload is becoming increasingly serious, so extracting key information quickly and accurately is of great importance. Relation extraction plays a central role in information extraction; its goal is to extract the semantic relation between a pair of entities in a natural-language sentence. As a basic task of natural language processing, entity relation extraction is a key component of knowledge graphs, question answering, machine translation, automatic summarization and other applications. With the development of deep learning, neural relation extraction (NRE) has attracted wide interest, and most current approaches use neural networks to learn semantic features automatically.

1 Related work
As a pioneer, Liu et al. proposed a CNN-based relation extraction model [1]. Building on it, Zeng et al. proposed a CNN model with a max-pooling layer [2], introduced position embeddings to represent positional information, and later designed the PCNN model [3]; PCNN, however, has problems with sentence selection, which Lin et al. [4] addressed by applying an attention mechanism. Although PCNN works well, it cannot mine contextual information the way RNN-style models do, so LSTM networks with attention have also been applied to relation extraction [5,6].
Although NRE needs no feature engineering, it ignores the influence of the input granularity, which matters especially for Chinese. Depending on the input granularity, existing Chinese relation extraction methods are either character-based or word-based. Character-based methods treat each input sentence as a character sequence; their drawback is that they cannot make full use of word-level information and capture fewer features than word-based methods. Word-based methods first segment the sentence into a word sequence and feed it into the neural model, but their performance is strongly affected by segmentation quality [7]. For example, the sentence "乔布斯设计部分苹果" contains the two entities "乔布斯" (Jobs) and "苹果" (Apple) with the relation "设计" (design), and the corresponding segmentation is "乔布斯/设计/部分/苹果". If the sentence is instead segmented as "乔布斯设计部/分/苹果", the entities become "乔布斯设计部" (Jobs' design department) and "苹果", and the relation becomes "分" (distribute). Hence neither character-based nor word-based methods alone can fully exploit the semantic information in the data, and discovering high-level entity relations from raw text requires combining information of different granularities. In addition, Chinese has many polysemous words, which limits a model's ability to mine deep semantic information: the word "苹果", for instance, can mean either a fruit or an electronics brand, and without additional semantic information this is hard to learn from raw text.
This paper proposes a network framework, PL-Lattice, that jointly exploits multi-granularity information within the sentence and external knowledge for Chinese relation extraction. (1) The model adopts a Lattice-LSTM style structure that dynamically integrates word-level features into character-level features, so multi-granularity information can be used without being affected by segmentation errors. (2) To handle polysemy, the model incorporates HowNet [8], an external knowledge base that annotates the multiple senses of a word, and combines sense information during training to improve the model's capacity.

2 Problem description and model
Given a Chinese sentence and two marked entities in it, Chinese relation extraction is the task of extracting the semantic relation between the two entities. The proposed PL-Lattice model (Figure 1 shows the overall structure) consists of:
(1) an input layer, which represents every word and character of the input sentence so that character-level and word-level information can be extracted and used together;
(2) the PL-Lattice encoder, based on the Lattice-LSTM network, with improved recurrent cells for the character-level and word-level inputs and with external semantic knowledge incorporated for word-sense disambiguation;
(3) an attention layer with word-level and sentence-level attention;
(4) a relation classification layer that outputs the relation type through a Softmax function.
Figure 1. Overall structure of the model.

2.1 Input layer
The input is a Chinese sentence with two marked entities; to exploit multi-granularity information, both character-level and word-level representations are used.
(1) Character-level representation. Each sentence is treated as a character sequence. Given a sentence s = {c1, ..., cM} of M characters, the Skip-gram model [9] maps each character to a d_c-dimensional vector. In addition, because the distance from a character to the two named entities affects relation extraction, position features are used: for the i-th character, the relative distances p_i^1 and p_i^2 to the first and second entity are computed from the entities' start and end indices [2] and mapped to d_p-dimensional vectors.
(2) Word-level representation. Besides character features, all potential words of the sentence, i.e. character subsequences that match a dictionary D built from large raw corpora, are extracted as word-level features. Let w_{b,e} denote the word starting at the b-th character and ending at the e-th. Most work uses Word2Vec [9] to turn w_{b,e} into a single embedding, which ignores polysemy. This paper instead uses the SAT model, an extension of Skip-gram that jointly learns words and their senses. Given a word w_{b,e}, its K senses Sense(w_{b,e}) are retrieved from HowNet, each sense k is mapped by SAT to a vector x_{b,e,k}^{sen} in R^{d_sen}, and the word is finally represented by the set x_{b,e}^{sen} = {x_{b,e,1}^{sen}, ..., x_{b,e,K}^{sen}}.

2.2 Encoder
The encoder builds on Lattice-LSTM, adds word-sense features, and modifies the character-level and word-level recurrent cells. The modifications reduce the weakening of character-level features caused by word-level feature extraction in the original Lattice-LSTM encoder, and a bidirectional structure lets the encoder capture forward and backward information simultaneously, which clearly improves accuracy.
(1) Lattice-LSTM encoder. The LSTM is a recurrent network variant whose adaptive gates let a cell keep its previous state while learning from the current input; it has an input gate i_j, a forget gate f_j and an output gate o_j. The character-based LSTM follows the standard formulation, with activation function σ, trainable weight matrices W and U, and bias b. For a word w_{b,e} in the sentence that matches the external dictionary D (b and e are its start and end indices), a word-level gated cell is built and combined with the character-level LSTM to form the Lattice-LSTM encoder: the word cell state c_{b,e}^w is computed with word-level input and forget gates i_{b,e}^w and f_{b,e}^w, and the cell state of the e-th character is obtained by merging the information of all words ending at index e through an additional gate per word, with normalization factors α_{b,e}^c and α_e^c that sum to 1; the hidden state h_j^c of each character is then computed as in the standard LSTM. This structure is the Lattice LSTM of [7].
(2) PL-Lattice encoder. Although Lattice-LSTM uses both character and word information, it does not account for polysemy: as Figure 1 shows, the word "苹果" (apple) has two senses but only one representation in Lattice-LSTM. Moreover, Lattice-LSTM over-extracts word information, which weakens or even drowns out character-level information and hurts performance. To address these two problems, a sense layer is added that incorporates external semantics, using x_{b,e,k}^{sen} (defined above) for the k-th sense of word w_{b,e}, and the word-level and character-level cells are redesigned to strengthen information flow between characters and weaken word-level extraction; a bidirectional structure is also adopted. In the modified character-level cell, the previous cell state c_{j-1}^c is merged into each gate of the current cell, strengthening its influence and thus the extraction of character-level information. In the modified word-level cell, the forget gate is merged into the input gate, so part of the information is already forgotten as it enters, which weakens word-level information. The cell states of all senses of a word are then combined into c_{b,e}^{sen}, which better represents polysemy, and, as in the Lattice-LSTM, the cell states of all words ending at index e are fused into the cell state of the e-th character; the hidden state h is computed as in the standard LSTM. Finally, the forward and backward hidden states of every cell are concatenated and passed to the attention layer.
(3) Attention and relation classification. Attention mechanisms, like human selective attention, pick out the information most relevant to the current goal; this model uses a dual attention mechanism. The output vectors of the bidirectional PL-Lattice network form a matrix h = [h1, h2, ..., hM], where M is the sentence length. Word-level attention produces a sentence representation as a weighted sum of the hidden states, with weights α computed from h through a trainable parameter vector ω. Sentence-level attention then feeds the sentence feature vector into a Softmax classifier with transformation matrix W in R^{Y×d_h} and bias b in R^Y, where Y is the number of relation types and y gives the probability of each type. Finally, given all training samples T = (S^{(i)}, y^{(i)}), cross-entropy is used as the objective function to measure the gap between the model outputs and the true distribution, where θ denotes all model parameters.

3 Experimental data and settings
3.1 Data. Because public Chinese relation extraction corpora are scarce and there is no widely used, authoritative distantly supervised Chinese dataset, the experiments use a Chinese prose dataset [10] of 837 articles with 9 relation types: 695 articles for training, 84 for testing and 58 for validation.
3.2 Metrics and parameters. Three metrics are used: recall, F1 and AUC. Recall measures how many positive examples are classified as positive, Recall = TP / (TP + FN), where TP is the number of positive instances predicted as positive and FN the number of positive instances predicted as negative. F1 is a classification metric between 0 and 1, F1 = 2TP / (2TP + FP + FN), where FP is the number of negative instances predicted as positive. The parameter settings are listed in Table 1.

Table 1. Parameter settings
Parameter                Value
Learning_rate            0.0005
dropout                  0.5
char_embedding_size      100
lattice_embedding_size   200
position_embedding_size  5
hidden_unit              200
epoch                    80
regulation               1.00e-08

4 Results and analysis
To verify the effect of PL-Lattice on Chinese entity relation extraction, five baselines are compared: (1) BLSTM [5], a bidirectional LSTM for relation extraction; (2) Att-BLSTM [6], a bidirectional LSTM with word-level attention; (3) PCNN [3], a piecewise CNN with multi-instance learning; (4) PCNN+Att [4], PCNN improved with an attention mechanism; (5) Lattice-LSTM [7], the basic Lattice-LSTM with attention. The results are shown in Table 2; Figure 2 plots the recall of each model against the number of training epochs, and Figure 3 plots the F1 and AUC of PL-Lattice against the number of training epochs.

Table 2. Experimental results
Model          Recall   F1       AUC
BLSTM          0.5178   0.6104   0.5021
Att-BLSTM      0.5273   0.5948   0.5042
PCNN           0.4923   0.61     0.4826
PCNN+Att       0.5232   0.6055   0.5041
Lattice-LSTM   0.5822   0.6388   0.5688
PL-Lattice     0.5974   0.6757   0.5794

Figure 2. Recall of each model over training epochs.
Figure 3. F1 and AUC of the PL-Lattice model over training epochs.

The results show that adding attention lets a model focus on the more important parts of a sentence and improves performance, and that LSTM models, which have a natural advantage over CNNs on sequence data, perform better. The proposed PL-Lattice model outperforms the other five models on every metric, mainly because it handles segmentation more accurately, uses multi-granularity information that makes the word-vector representation more reasonable, and adds a dual attention mechanism, all of which improve the model's capability and interpretability.

5 Conclusion and outlook
This paper proposed the PL-Lattice model for Chinese relation extraction. It uses character-level and word-level information together and introduces an external sense knowledge base for word vectors, giving them deeper semantic information and avoiding the problems caused by polysemy; dual attention at the word and sentence level attends to multiple aspects of words and sentences. Compared with five other models on the prose dataset, it performs better. In future work, information of more granularities could be incorporated into the model, which may help it mine deeper semantic features.

References
[1] Liu CY, Sun WB, Chao WH, et al. Convolution neural network for relation extraction. Proceedings of the 9th International Conference on Advanced Data Mining and Applications. Hangzhou, China. 2013. 231-242.
[2] Zeng DJ, Liu K, Lai SW, et al. Relation classification via convolutional deep neural network. Proceedings of the 25th International Conference on Computational Linguistics. Dublin, Ireland. 2014. 2335-2344.
[3] Zeng DJ, Liu K, Chen Y, et al. Distant supervision for relation extraction via piecewise convolutional neural networks. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal. 2015. 1753-1762.
[4] Lin YK, Shen SQ, Liu ZY, et al. Neural relation extraction with selective attention over instances. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany. 2016. 2124-2133.
[5] Zhang DX, Wang D. Relation classification via recurrent neural network. arXiv preprint arXiv: 1508.01006, 2015.
[6] Liu J, Zhang Y, Zhang Y. Research on Chinese relation extraction based on bidirectional LSTM and self-attention mechanism. Journal of Shanxi University (Natural Science Edition), 2020, 43(1): 8-13. (In Chinese)
[7] Zhang Y, Yang J. Chinese NER using lattice LSTM. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne, Australia. 2018. 1554-1564.
[8] Qi FC, Yang CH, Liu ZY, et al. OpenHowNet: An open sememe-based lexical knowledge base. arXiv preprint arXiv: 1901.09957, 2019.
[9] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems. Red Hook, NY, USA. 2013. 3111-3119.
[10] Xu JJ, Wen J, Sun X, et al. A discourse-level named entity recognition and relation extraction dataset for Chinese literature text. arXiv preprint arXiv: 1711.07010, 2019.
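To make the word-level attention step described in Section 2.2(3) concrete, the following NumPy sketch computes a sentence representation from BiLSTM hidden states using a standard formulation of that kind of attention (alpha = softmax(omega^T tanh(H)), s = H alpha). The dimensions and the random inputs are illustrative assumptions, not values or exact equations from the paper.

```python
# Illustrative sketch of word-level attention over BiLSTM outputs:
# alpha = softmax(omega^T tanh(H)),  s = H @ alpha
# Shapes and inputs are made up for illustration only.
import numpy as np

d_h, M = 8, 5                      # hidden size and sentence length (illustrative)
H = np.random.randn(d_h, M)        # columns are per-character hidden states h_1..h_M
omega = np.random.randn(d_h)       # trainable attention vector (random here)

scores = omega @ np.tanh(H)        # one score per position, shape (M,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # softmax over positions

s = H @ alpha                      # attention-weighted sentence representation, shape (d_h,)
print(alpha.round(3), s.shape)
```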
Saccade Target Selection in Chinese Reading: An Eye Movement Study Based on Word Length and Predictability

Abstract
In daily life, readers constantly make saccades so that their fixation points fall in the fovea, which improves reading efficiency. Yan et al. found a preferred viewing location (PVL) curve in Chinese reading: when a word receives only a single fixation, the PVL curve peaks near the word center, whereas when there are multiple fixations during first-pass reading, the peak occurs at the first fixation. Li et al. argued that when the fixation point is at the word center, all characters in a Chinese word are processed, and readers can quickly direct saccades into the word center without knowing the word boundaries. Liu et al. proposed the dynamic adjustment hypothesis, according to which saccades move the eyes to the viewing position that provides maximum processing efficiency.
This study examined whether saccade target selection is related to word segmentation in the parafovea. Low-level word length information and high-level predictability information were selected, parafoveal information was masked by lowering the spatial frequency (blurred presentation), and an incremental display paradigm was used to manipulate word length and the availability of parafoveal information. The study further asked whether the loss of visual ability that comes with normal aging affects older adults' ability to extract information from non-normally presented text during reading. The visual acuity of young and older adults was tested so that, once both groups could obtain visual information equally clearly, any differences in their eye-movement characteristics during reading could be examined.
The results show stable age, word length and predictability effects in Chinese reading; a preferred viewing location (PVL) effect exists for Chinese readers, though its pattern differs between older and younger adults; and Chinese readers process text flexibly, with older adults adopting a more cautious reading strategy. Word length and predictability affect lexical processing in Chinese reading, but saccade target selection is not related to parafoveal word segmentation.
Keywords: Chinese reading, saccade target selection, age, word length, predictability, word segmentation

Contents
Abstract
1 Introduction
1.1 Basic questions in eye-movement research on reading
1.2 The fixation position effect
1.3 The age effect
1.4 Saccade target selection and word length
1.4.1 Saccade target selection and word length in alphabetic reading
1.4.2 Saccade target selection and word length in Chinese reading
1.5 Saccade target selection and predictability
1.5.1 Saccade target selection and predictability in alphabetic reading
1.5.2 Saccade target selection and predictability in Chinese reading
2 Research questions and significance
2.1 Research questions
2.2 Significance
2.3 Research framework
3 Experiments
3.1 Experiment 1: the effect of word length on saccade target selection
3.1.1 Aims and hypotheses
3.1.2 Method
3.1.3 Results
3.1.4 Discussion
3.2 Experiment 2: the effect of predictability on saccade target selection
3.2.1 Aims and hypotheses
3.2.2 Method
3.2.3 Results
3.2.4 Discussion
4 General discussion
4.1 The effect of word length on saccade target selection
4.2 The effect of predictability on saccade target selection
5 Conclusions
References
Appendix
Acknowledgements

1 Introduction
1.1 Basic questions in eye-movement research on reading
The direction in which the eyes move during reading follows the arrangement of the text. Since Chinese characters are now generally arranged from left to right and from top to bottom, the eyes likewise move from left to right when reading.
NLP Word Segmentation Methods for Chinese and English
There are various methods for segmenting Chinese and English text in Natural Language Processing (NLP). For English, the most common method is to simply split the text based on spaces or punctuation marks. More advanced techniques include using Part-of-Speech (POS) tagging to identify word boundaries and compound words. For Chinese, the most common method is to use word segmentation algorithms such as the Maximum Match method, which involves matching the longest possible word from a dictionary, or the Bi-LSTM-CRF model, a neural network model specifically designed for Chinese word segmentation. Additionally, character-based word segmentation methods can also be used for Chinese text, where words are segmented based on constituent characters and language-specific rules.
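As an illustration of the Maximum Match method mentioned above, here is a small forward maximum matching sketch in Python. The toy dictionary and the maximum word length are assumptions chosen only for the example.

```python
# Forward maximum matching: at each position, greedily take the longest
# dictionary word that matches; fall back to a single character otherwise.
# The dictionary and max word length below are toy values for illustration.
def forward_max_match(text, dictionary, max_len=4):
    words, i = [], 0
    while i < len(text):
        match = text[i]                      # default: single character
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        words.append(match)
        i += len(match)
    return words

toy_dict = {"阿拉伯", "阿拉伯人", "乔布斯", "设计", "苹果", "这里", "面粉"}
print(forward_max_match("乔布斯设计苹果", toy_dict))   # ['乔布斯', '设计', '苹果']
```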
Research of Chinese Word Segmentation Based on Double-Array Trie
ZHAO Huan, ZHU Hong-quan (School of Computer and Communication, Hunan University, Changsha, Hunan 410082, China)
Journal of Hunan University (Natural Sciences), Vol. 36, No. 5, May 2009. Article ID: 1674-2974(2009)05-0077-04.
Received 2008-10-09. Supported by the Key Scientific and Technological Research Project of the Ministry of Education (106458).

Abstract: This paper proposes some improved strategies for the double-array trie algorithm. First, when building the double-array trie from a trie, the nodes with the most child nodes are processed first in order to reduce collisions; second, a list of empty states is constructed; finally, colliding nodes are added to a hash table, so that nodes do not need to be reallocated. A Chinese word segmentation system was then built with these methods and compared with several other segmentation approaches. The results show that the optimized double-array trie greatly improves insertion speed and space utilization, and that segmentation lookup efficiency is also improved.
Key words: natural language processing systems; double-array; trie; lexicon; word segmentation

Chinese information processing involves word segmentation, and segmentation needs a sufficiently large lexicon; lexicon technology has a large influence on search. Ideally the lexicon would contain every word, and any string found in the lexicon would be accepted as a word, but this raises serious storage and search problems for large amounts of data. Lexicons are usually implemented with index structures, including linear index tables, inverted tables, hash tables and search trees. Linear indexes and inverted tables are static structures that are hard to update and only support sequential or binary search. A hash table maps a key to a storage location through a hash function H(key) together with a collision-resolution method [1]; lookup only requires evaluating the function, so it is fast, but collisions can only be reduced, not eliminated, and space is wasted. Search trees include B-trees and tries; they are more complex, but a good design improves retrieval efficiency, and there are variants of both approaches [2-5]. This paper first introduces the principles of the double-array trie (DAT), then presents an optimized design, and finally compares it experimentally with other methods and draws conclusions.

1 Principles of the double-array trie
A trie supports fast lookup of entries [1]. For a given string a1, a2, a3, ..., an, a trie lookup requires at most n comparisons (worst case O(n)), independent of the number of entries in the lexicon; its drawback is a high proportion of unused space. It is a common lexicon implementation in Chinese matching-based segmentation algorithms. The double-array trie is a variant of the trie that keeps its lookup speed while improving space utilization. It is essentially a deterministic finite automaton (DFA): each node represents a state, transitions are made according to the input, and a query finishes when an accepting state is reached or no transition is possible. The DAT stores the trie in two linear arrays, base and check, which share the same indices, one per DFA state (i.e. per trie node); the check array is used to verify that a transition is valid, i.e. that the target state exists [6].
Following [6] and [7]: a transition from state s to state t on input c must satisfy t = base[s] + N(c) and check[t] = s, where N(c) is the code of character c, assigned in insertion order, and no two non-empty nodes may map through base[s] to the same position in the check array. A position s is empty when base[s] and check[s] are both 0, and base[s] is negative when s is a final (word-ending) state [7].

2 Optimizing the double-array trie
In a conventional double-array trie, the position of each node in the arrays is determined by the base value of its parent (the previous state), and a node's base value depends on the currently free positions and on its direct children. A node with many children therefore causes many collisions during insertion. To reduce collisions, [8] processes nodes with more children first, sorting all trie nodes by the number of children in descending order. This speeds things up to some extent, but there is still room for improvement, and the space utilization is not high.
To obtain high space utilization, this paper builds a list over all empty states: for the empty positions r1, r2, ..., rm of the double array, in increasing order, set check[ri] = -r(i+1) (1 <= i <= m-1) and check[rm] = -DA_SIZE, where r1 = E_HEAD is the index of the first empty state and DA_SIZE is the number of trie nodes. When the base value of a node is being determined, only this empty-state list has to be scanned, which greatly speeds up insertion when there are not too many empty states; detecting an empty element only requires checking whether its check value is negative [5].
[8] reduces collisions to some extent; this paper optimizes further. When a collision occurs, the node is not reallocated: instead the corresponding array index is put into a hash table, which greatly improves speed.

2.1 Building the optimized double-array trie
2.1.1 Core algorithms
1) Finding the base index of the current node, X_CHECK(A, c, limit), where A is the set of child characters of the current node, c is the smallest character code in A, and limit is a lower bound:
Step 1: initialize e_index to E_HEAD.
Step 2: let q = e_index - c. If check[q + N(ch)] < 0 for every character ch in A and q > limit, return q; otherwise set e_index = -check[e_index] and go to step 3.
Step 3: if e_index < DA_SIZE, go to step 2; if e_index >= DA_SIZE, return e_index - c.
2) Setting a check value, setCheck(idx, val):
Step 1: if idx >= DA_SIZE, return.
Step 2: if check[idx] < 0, go to step 3 and remove index idx from the empty-state list.
Step 3: if idx is not E_HEAD, go to step 4; if idx = E_HEAD, set E_HEAD = -check[idx] and check[idx] = val.
Step 4: let pre_index = E_HEAD, scan the empty-state list for the pre_index satisfying idx = -check[pre_index], then set check[pre_index] = check[idx] and check[idx] = val, and stop.
2.1.2 Overall construction procedure
Step 1: reserve the first n positions of the array for the direct children of the root. Set E_HEAD = 1 and set the check values of all direct children of the root to 0 by calling setCheck; since the indices of these nodes are their character codes, this reduces traversals of the empty list.
Step 2: after initialization, E_HEAD becomes the first empty non-root position; put all children of the root into a newly created queue.
Step 3: if the queue is not empty, take the node with the most children as the current node curNode; otherwise the algorithm terminates and the arrays are complete.
Step 4: visit curNode and use the X_CHECK algorithm to determine its array index s and the positions of its direct children (ensuring uniqueness). If, for a child reached on character ch, base[s] + N(ch) >= DA_SIZE, that child's array index is E_HEAD: set check[E_HEAD] = s, store this E_HEAD value in the hash table under the key hashCode = hash(s, ch), i.e. as the pair (hashCode, E_HEAD), and move E_HEAD along the empty list to the next free node. Otherwise set check[base[s] + N(ch)] = s.
Step 5: put all direct children of the current node into the queue and repeat from step 3.
2.1.3 Lookup algorithm used for segmentation
Step 1: let the current state be s and the input character c, with code N(c).
Step 2: loop: t = base[s] + N(c); if t >= DA_SIZE, obtain t from the hash table using the hash code hash(s, c) of s and c; if check[t] = s then s = t, else fail.
Step 3: if base[t] is not negative, repeat from step 1; otherwise t is a final state.

2.2 Applying the optimized double-array trie
Suppose the lexicon contains only the words "啊, 阿根廷, 阿胶, 阿拉伯, 阿拉伯人, 埃及"; the corresponding trie is shown in Figure 1.
Figure 1. Trie built from the word list.
First, the ten characters appearing in the lexicon are coded: 啊 = 1, 阿 = 2, 埃 = 3, 根 = 4, 胶 = 5, 拉 = 6, 及 = 7, 廷 = 8, 伯 = 9, 人 = 10. For each character a base value must be chosen such that all words beginning with that character fit into the double array. For example, to determine the base value of "阿", let the codes of the second characters of the words starting with "阿" be a1, a2, ..., an; we must find q such that check[q + a1], check[q + a2], ..., check[q + an] are all negative, and the base value of "阿" is then q. Building the double-array trie this way takes four traversals to place all words. A negative base value marks a word: if state q corresponds to a word and is a leaf node, base[q] = (-1) * q; if it is not a leaf, base[q] = (-1) * base[q]. The resulting arrays are:
base[]  = {0, -1, 1, 1, -4, 1, -6, 1, -8, -9, -1}
check[] = {0, 0, 0, 0, 10, 2, 2, 2, 3, 5, 7}
For example, to check whether "阿拉伯人" is a word in the lexicon: the code of "阿" is 2, so base[2] = 1. The next input is "拉" with code 6; base[2] + 6 = 7 and check[7] = 2, so "阿拉" is a state and we continue. base[7] = 1; the next input is "伯" with code 9; base[7] + 9 = 10 and check[10] = 7, so "阿拉伯" is a state and we continue. base[10] = -1; the next input is "人" with code 10; |base[10]| + 10 = 11 and 11 >= DA_SIZE, so the hash table saved during construction gives the index 4, and check[4] = 10 while base[4] = -4. Therefore "阿拉伯人" is judged to be a word in the lexicon.
From this lookup procedure, the time cost of the optimized double-array trie depends only on the word length: the time complexity is O(n) for a word of length n, and during construction the array length equals the number of trie nodes.

3 Experimental comparison and analysis
Three lexicon mechanisms were compared in the same environment: (1) an ordinary trie lexicon; (2) the double-array trie of [8]; (3) the optimized DAT of this paper. The environment was a 3.0 GHz Intel Pentium 4 CPU with 512 MB of RAM running Windows XP, and the lexicon contained 55,501 entries. Tables 1-3 give the results.

Table 1. Time to build the double-array trie
Method       t/s
[8]          3366.328
This paper   434.593

Building the optimized double-array trie takes far less time than building the double-array trie of [8].

Table 2. Space occupied
Method       Space
[8]          array length 186,375
This paper   array length 72,492, hash table length 18

The optimized double-array trie occupies less than half the space of [8], and the space utilization can be expected to improve further as the number of entries grows.

Table 3. Segmentation speed on a given corpus
Method          t/s
Ordinary trie   0.063
[8]             0.047
This paper      0.031

This experiment compares the speed of forward maximum-match segmentation. The corpus was an arbitrarily chosen text of 44 KB. The ordinary trie segments at about 700 KB/s on average, the double-array trie of [8] at about 936 KB/s, and the optimized double-array trie at about 1.3 MB/s.

4 Conclusion
Both the theoretical analysis and the experimental results show that the optimized double-array trie greatly improves lexicon construction speed, word lookup speed and space utilization, because the array length is always the total number of trie nodes [5]. The method also has a drawback: since the array length is fixed during construction and does not grow automatically, colliding nodes are stored at the head of the empty-state list while their actual array indices are kept in a hash table, so when a large number of words are inserted, resolving collisions in the hash table becomes the key issue for future work.

References
[1] YIN Ren-kun. Data structures (C++) [M]. Beijing: Tsinghua University Press, 1999. (In Chinese)
[2] YANG Wen-feng, CHEN Guang-ying, LI Xing. Patricia-tree based dictionary mechanism for Chinese word segmentation [J]. Journal of Chinese Information Processing, 2001, 15(3): 44-49. (In Chinese)
[3] WEN Tao, ZHU Qiao-ming. A fast algorithm for Chinese word segmentation [J]. Computer Engineering, 2004, 30(19): 119-182. (In Chinese)
[4] WU Sheng-yuan. A new Chinese phrase segmentation method [J]. Journal of Computer Research and Development, 1996, 33(4): 306-311. (In Chinese)
[5] KAZUHIRO M, EL-SAYED A, MASAO F. Fast and compact updating algorithms of a double-array structure [J]. Information Sciences, 2004, 159: 53-67.
[6] THEPPITAK K. An implementation of double-array trie [Z]. /~thep/datrie/datrie.html, 2006.
[7] JUN-ICHI A, SEIGO Y, TAKASHI S. An efficient digital search algorithm by using a double-array structure [J]. IEEE Transactions on Software Engineering, 1989, 15(9): 1066-1077.
[8] WANG Si-li, ZHANG Hua-ping, WANG Bin. Research of optimization on double-array trie and its application [J]. Journal of Chinese Information Processing, 2006, 20(5): 24-30. (In Chinese)
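The worked example of Section 2.2 can be reproduced almost directly in code. The sketch below illustrates the base/check lookup using the example arrays and character codes given in the paper; the overflow hash table is modelled as a plain dict keyed by (state, code), which is an illustrative simplification rather than the paper's implementation.

```python
# Sketch of the double-array trie lookup from Section 2.2, using the example
# base/check arrays and character codes from the paper. The overflow hash table
# is modelled as a plain dict {(state, code): index} for illustration only.
CODES = {"啊": 1, "阿": 2, "埃": 3, "根": 4, "胶": 5,
         "拉": 6, "及": 7, "廷": 8, "伯": 9, "人": 10}
BASE  = [0, -1, 1, 1, -4, 1, -6, 1, -8, -9, -1]
CHECK = [0, 0, 0, 0, 10, 2, 2, 2, 3, 5, 7]
DA_SIZE = len(BASE)
OVERFLOW = {(10, 10): 4}   # filled during construction: state 10 + '人' -> index 4

def is_word(word):
    # First character: its state index equals its code (root children).
    s = CODES[word[0]]
    for ch in word[1:]:
        t = abs(BASE[s]) + CODES[ch]
        if t >= DA_SIZE:                      # overflow: consult the hash table
            t = OVERFLOW.get((s, CODES[ch]), -1)
        if t < 0 or CHECK[t] != s:
            return False
        s = t
    return BASE[s] < 0                        # a negative base marks a word end

for w in ["阿拉伯人", "阿拉伯", "阿根", "埃及"]:
    print(w, is_word(w))                      # True, True, False, True
```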
Chinese Segmentation with a Word-Based Perceptron AlgorithmYue Zhang and Stephen ClarkOxford University Computing LaboratoryWolfson Building,Parks RoadOxford OX13QD,UK{yue.zhang,stephen.clark}@AbstractStandard approaches to Chinese word seg-mentation treat the problem as a taggingtask,assigning labels to the characters inthe sequence indicating whether the char-acter marks a word boundary.Discrimina-tively trained models based on local char-acter features are used to make the taggingdecisions,with Viterbi decodingfinding thehighest scoring segmentation.In this paperwe propose an alternative,word-based seg-mentor,which uses features based on com-plete words and word sequences.The gener-alized perceptron algorithm is used for dis-criminative training,and we use a beam-search decoder.Closed tests on thefirst andsecond SIGHAN bakeoffs show that our sys-tem is competitive with the best in the litera-ture,achieving the highest reported F-scoresfor a number of corpora.1IntroductionWords are the basic units to process for most NLP tasks.The problem of Chinese word segmentation (CWS)is tofind these basic units for a given sen-tence,which is written as a continuous sequence of characters.It is the initial step for most Chinese pro-cessing applications.Chinese character sequences are ambiguous,of-ten requiring knowledge from a variety of sources for disambiguation.Out-of-vocabulary(OOV)words are a major source of ambiguity.For example,a difficult case occurs when an OOV word consists of characters which have themselves been seen aswords;here an automatic segmentor may split the OOV word into individual single-character words. Typical examples of unseen words include Chinesenames,translated foreign names and idioms.The segmentation of known words can also beambiguous.For example,“”should be“(here)(flour)”in the sentence“”(flour and rice are expensive here)or“(here) (inside)”in the sentence“”(it’s cold inside here).The ambiguity can be resolved with information about the neighboring words.In comparison,for the sentences“”, possible segmentations include“(the discus-sion)(will)(very)(be successful)”and “(the discussion meeting)(very)(be successful)”.The ambiguity can only be resolved with contextual information outside the sentence. Human readers often use semantics,contextual in-formation about the document and world knowledge to resolve segmentation ambiguities.There is nofixed standard for Chinese word seg-mentation.Experiments have shown that there isonly about75%agreement among native speakersregarding the correct word segmentation(Sproat etal.,1996).Also,specific NLP tasks may require dif-ferent segmentation criteria.For example,“”could be treated as a single word(Bank of Bei-jing)for machine translation,while it is more natu-rally segmented into“(Beijing)(bank)”for tasks such as text-to-speech synthesis.There-fore,supervised learning with specifically defined training data has become the dominant approach. 
Following Xue(2003),the standard approach tosupervised learning for CWS is to treat it as a tagging task.Tags are assigned to each character in the sen-tence,indicating whether the character is a single-character word or the start,middle or end of a multi-character word.The features are usually confined to afive-character window with the current character in the middle.In this way,dynamic programming algorithms such as the Viterbi algorithm can be used for decoding.Several discriminatively trained models have re-cently been applied to the CWS problem.Exam-ples include Xue(2003),Peng et al.(2004)and Shi and Wang(2007);these use maximum entropy(ME) and conditional randomfield(CRF)models(Ratna-parkhi,1998;Lafferty et al.,2001).An advantage of these models is theirflexibility in allowing knowl-edge from various sources to be encoded as features. Contextual information plays an important role in word segmentation decisions;especially useful is in-formation about surrounding words.Consider the sentence“”,which can be from“(among which)(foreign)(companies)”, or“(in China)(foreign companies) (business)”.Note that thefive-character window surrounding“”is the same in both cases,making the tagging decision for that character difficult given the local window.However,the correct decision can be made by comparison of the two three-word win-dows containing this character.In order to explore the potential of word-based models,we adapt the perceptron discriminative learning algorithm to the CWS problem.Collins (2002)proposed the perceptron as an alternative to the CRF method for HMM-style taggers.However, our model does not map the segmentation problem to a tag sequence learning problem,but defines fea-tures on segmented sentences directly.Hence we use a beam-search decoder during training and test-ing;our idea is similar to that of Collins and Roark (2004)who used a beam-search decoder as part of a perceptron parsing model.Our work can also be seen as part of the recent move towards search-based learning methods which do not rely on dynamic pro-gramming and are thus able to exploit larger parts of the context for making decisions(Daume III,2006). We study several factors that influence the per-formance of the perceptron word segmentor,includ-ing the averaged perceptron method,the size of the beam and the importance of word-based features. We compare the accuracy of ourfinal system to the state-of-the-art CWS systems in the literature using thefirst and second SIGHAN bakeoff data.Our sys-tem is competitive with the best systems,obtaining the highest reported F-scores on a number of the bakeoff corpora.These results demonstrate the im-portance of word-based features for CWS.Further-more,our approach provides an example of the po-tential of search-based discriminative training meth-ods for NLP tasks.2The Perceptron Training AlgorithmWe formulate the CWS problem asfinding a mapping from an input sentence x∈X to an output sentence y∈Y,where X is the set of possible raw sentences and Y is the set of possible segmented sentences. 
Given an input sentence x,the correct output seg-mentation F(x)satisfies:F(x)=arg maxy∈GEN(x)Score(y)where GEN(x)denotes the set of possible segmen-tations for an input sentence x,consistent with nota-tion from Collins(2002).The score for a segmented sentence is computed byfirst mapping it into a set of features.A feature is an indicator of the occurrence of a certain pattern in a segmented sentence.For example,it can be the occurrence of“”as a single word,or the occur-rence of“”separated from“”in two adjacent words.By defining features,a segmented sentence is mapped into a global feature vector,in which each dimension represents the count of a particular fea-ture in the sentence.The term“global”feature vec-tor is used by Collins(2002)to distinguish between feature count vectors for whole sequences and the “local”feature vectors in ME tagging models,which are Boolean valued vectors containing the indicator features for one element in the sequence.Denote the global feature vector for segmented sentence y withΦ(y)∈R d,where d is the total number of features in the model;then Score(y)is computed by the dot product of vectorΦ(y)and a parameter vectorα∈R d,whereαi is the weight for the i th feature:Score(y)=Φ(y)·αInputs:training examples(x i,y i) Initialization:setα=0 Algorithm:for t=1..T,i=1..Ncalculate z i=arg max y∈GEN(xi)Φ(y)·αif z i=y iα=α+Φ(y i)−Φ(z i)Outputs:αFigure1:the perceptron learning algorithm,adapted from Collins(2002)The perceptron training algorithm is used to deter-mine the weight valuesα.The training algorithm initializes the parameter vector as all zeros,and updates the vector by decod-ing the training examples.Each training sentence is turned into the raw input form,and then decoded with the current parameter vector.The output seg-mented sentence is compared with the original train-ing example.If the output is incorrect,the parameter vector is updated by adding the global feature vector of the training example and subtracting the global feature vector of the decoder output.The algorithm can perform multiple passes over the same training sentences.Figure1gives the algorithm,where N is the number of training sentences and T is the num-ber of passes over the data.Note that the algorithm from Collins(2002)was designed for discriminatively training an HMM-style tagger.Features are extracted from an input se-quence x and its corresponding tag sequence y:Score(x,y)=Φ(x,y)·αOur algorithm is not based on an HMM.For a given input sequence x,even the length of different candi-dates y(the number of words)is notfixed.Because the output sequence y(the segmented sentence)con-tains all the information from the input sequence x (the raw sentence),the global feature vectorΦ(x,y) is replaced withΦ(y),which is extracted from the candidate segmented sentences directly.Despite the above differences,since the theorems of convergence and their proof(Collins,2002)are only dependent on the feature vectors,and not on the source of the feature definitions,the perceptron algorithm is applicable to the training of our CWS model.2.1The averaged perceptronThe averaged perceptron algorithm(Collins,2002)was proposed as a way of reducing overfitting on the training data.It was motivated by the voted-perceptron algorithm(Freund and Schapire,1999) and has been shown to give improved accuracy over the non-averaged perceptron on a number of tasks. 
Let N be the number of training sentences,T the number of training iterations,andαn,t the parame-ter vector immediately after the n th sentence in the t th iteration.The averaged parameter vectorγ∈R d is defined as:γ=1NTn=1..N,t=1..Tαn,tTo compute the averaged parametersγ,the train-ing algorithm in Figure1can be modified by keep-ing a total parameter vectorσn,t= αn,t,which is updated usingαafter each training example.After thefinal iteration,γis computed asσn,t/NT.In the averaged perceptron algorithm,γis used instead of αas thefinal parameter vector.With a large number of features,calculating the total parameter vectorσn,t after each training exam-ple is expensive.Since the number of changed di-mensions in the parameter vectorαafter each train-ing example is a small proportion of the total vec-tor,we use a lazy update optimization for the train-ing process.1Define an update vectorτto record the number of the training sentence n and iteration t when each dimension of the averaged parameter vector was last updated.Then after each training sentence is processed,only update the dimensions of the total parameter vector corresponding to the features in the sentence.(Except for the last exam-ple in the last iteration,when each dimension ofτis updated,no matter whether the decoder output is correct or not).Denote the s th dimension in each vector before processing the n th example in the t th iteration asαn−1,ts,σn−1,ts andτn−1,ts=(nτ,s,tτ,s).Suppose that the decoder output z n,t is different from thetraining example y n.Nowαn,t s,σn,t s andτn,ts can 1Daume III(2006)describes a similar algorithm.be updated in the following way:σn,t s=σn−1,ts +αn−1,ts×(tN+n−tτ,s N−nτ,s)αn,t s=αn−1,ts+Φ(y n)−Φ(z n,t)σn,t s=σn,t s+Φ(y n)−Φ(z n,t)τn,t s=(n,t)We found that this lazy update method was signif-icantly faster than the naive method.3The Beam-Search DecoderThe decoder reads characters from the input sen-tence one at a time,and generates candidate seg-mentations incrementally.At each stage,the next in-coming character is combined with an existing can-didate in two different ways to generate new candi-dates:it is either appended to the last word in the candidate,or taken as the start of a new word.This method guarantees exhaustive generation of possible segmentations for any input sentence.Two agendas are used:the source agenda and the target agenda.Initially the source agenda contains an empty sentence and the target agenda is empty. At each processing stage,the decoder reads in a character from the input sentence,combines it with each candidate in the source agenda and puts the generated candidates onto the target agenda.After each character is processed,the items in the target agenda are copied to the source agenda,and then the target agenda is cleaned,so that the newly generated candidates can be combined with the next incom-ing character to generate new candidates.After the last character is processed,the decoder returns the candidate with the best score in the source agenda. 
Figure2gives the decoding algorithm.For a sentence with length l,there are2l−1differ-ent possible segmentations.To guarantee reasonable running speed,the size of the target agenda is lim-ited,keeping only the B best candidates.4Feature templatesThe feature templates are shown in Table1.Features 1and2contain only word information,3to5con-tain character and length information,6and7con-tain only character information,8to12contain word and character information,while13and14contain Input:raw sentence sent–a list of characters Initialization:set agendas src=[[]],tgt=[] Variables:candidate sentence item–a list of words Algorithm:for index=0..sent.length−1:var char=sent[index]foreach item in src://append as a new word to the candidatevar item1=itemitem1.append(char.toWord())tgt.insert(item1)//append the character to the last wordif item.length>1:var item2=itemitem2[item2.length−1].append(char)tgt.insert(item2)src=tgttgt=[]Outputs:src.best itemFigure2:The decoding algorithmword and length information.Any segmented sen-tence is mapped to a global feature vector according to these templates.There are356,337features with non-zero values after6training iterations using the development data.For this particular feature set,the longest range features are word bigrams.Therefore,among partial candidates ending with the same bigram,the best one will also be in the bestfinal candidate.The decoder can be optimized accordingly:when an in-coming character is combined with candidate items as a new word,only the best candidate is kept among those having the same last word.5Comparison with Previous Work Among the character-tagging CWS models,Li et al. (2005)uses an uneven margin alteration of the tradi-tional perceptron classifier(Li et al.,2002).Each character is classified independently,using infor-mation in the neighboringfive-character window. 
Liang (2005) uses the discriminative perceptron algorithm (Collins, 2002) to score whole character tag sequences, finding the best candidate by the global score. It can be seen as an alternative to the ME and CRF models (Xue, 2003; Peng et al., 2004), which do not involve word information. Wang et al. (2006) incorporates an N-gram language model in ME tagging, making use of word information to improve the character tagging model. The key difference between our model and the above models is the word-based nature of our system.
One existing method that is based on sub-word information, Zhang et al. (2006), combines a CRF and a rule-based model. Unlike the character-tagging models, the CRF submodel assigns tags to sub-words, which include single-character words and the most frequent multiple-character words from the training corpus. Thus it can be seen as a step towards a word-based model. However, sub-words do not necessarily contain full word information. Moreover, sub-word extraction is performed separately from feature extraction. Another difference from our model is the rule-based submodel, which uses a dictionary-based forward maximum match method described by Sproat et al. (1996).

Table 1: feature templates
1  word w
2  word bigram w1 w2
3  single-character word w
4  a word starting with character c and having length l
5  a word ending with character c and having length l
6  space-separated characters c1 and c2
7  character bigram c1 c2 in any word
8  the first and last characters c1 and c2 of any word
9  word w immediately before character c
10 character c immediately before word w
11 the starting characters c1 and c2 of two consecutive words
12 the ending characters c1 and c2 of two consecutive words
13 a word of length l and the previous word w
14 a word of length l and the next word w

6 Experiments
Two sets of experiments were conducted. The first, used for development, was based on the part of Chinese Treebank 4 that is not in Chinese Treebank 3 (since CTB3 was used as part of the first bakeoff). This corpus contains 240K characters (150K words and 4798 sentences). 80% of the sentences (3813) were randomly chosen for training and the rest (985 sentences) were used as development testing data. The accuracies and learning curves for the non-averaged and averaged perceptron were compared. The influence of particular features and the agenda size were also studied.
The second set of experiments used training and testing sets from the first and second international Chinese word segmentation bakeoffs (Sproat and Emerson, 2003; Emerson, 2005). The accuracies are compared to other models in the literature.
F-measure is used as the accuracy measure. Define precision p as the percentage of words in the decoder output that are segmented correctly, and recall r as the percentage of gold standard output words that are correctly segmented by the decoder. The (balanced) F-measure is 2pr/(p+r).
CWS systems are evaluated by two types of tests. The closed tests require that the system is trained only with a designated training corpus. Any extra knowledge is not allowed, including common surnames, Chinese and Arabic numbers, European letters, lexicons, part-of-speech, semantics and so on. The open tests do not impose such restrictions. Open tests measure a model's capability to utilize extra information and domain knowledge, which can lead to improved performance, but since this extra information is not standardized, direct comparison between open test results is less informative. In this paper, we focus only on the closed test.
However, the perceptron model allows a wide range of features, and so future work will consider how to integrate open resources into our system.

6.1 Learning curve
In this experiment, the agenda size was set to 16, for both training and testing. Table 2 shows the precision, recall and F-measure for the development set after 1 to 10 training iterations, as well as the number of mistakes made in each iteration. The corresponding learning curves for both the non-averaged and averaged perceptron are given in Figure 3.

Table 2: accuracy using non-averaged and averaged perceptron. P - precision (%), R - recall (%), F - F-measure.
Iteration          1     2     3     4     5     6     7     8     9     10
P (non-avg)        89.0  91.6  92.0  92.3  92.5  92.5  92.5  92.7  92.6  92.6
R (non-avg)        88.3  91.4  92.2  92.6  92.7  92.8  93.0  93.0  93.1  93.2
F (non-avg)        88.6  91.5  92.1  92.5  92.6  92.6  92.7  92.8  92.8  92.9
P (avg)            91.7  92.8  93.1  93.2  93.1  93.2  93.2  93.2  93.2  93.2
R (avg)            91.6  92.9  93.3  93.4  93.4  93.5  93.5  93.5  93.6  93.6
F (avg)            91.6  92.9  93.2  93.3  93.3  93.4  93.3  93.3  93.4  93.4
#Wrong sentences   3401  1652  945   621   463   288   217   176   151   139

Figure 3: learning curves of the averaged and non-averaged perceptron algorithms

The table shows that the number of mistakes made in each iteration decreases, reflecting the convergence of the learning algorithm. The averaged perceptron algorithm improves the segmentation accuracy at each iteration, compared with the non-averaged perceptron. The learning curve was used to fix the number of training iterations at 6 for the remaining experiments.

6.2 The influence of agenda size
Reducing the agenda size increases the decoding speed, but it could cause loss of accuracy by eliminating potentially good candidates. The agenda size also affects the training time, and resulting model, since the perceptron training algorithm uses the decoder output to adjust the model parameters. Table 3 shows the accuracies with ten different agenda sizes, each used for both training and testing.

Table 3: the influence of agenda size. B - agenda size, Tr - training time (seconds), Seg - testing time (seconds), F - F-measure.
B     2      4      8      16     32     64     128    256     512     1024
Tr    660    610    683    830    1111   1645   2545   4922    9104    15598
Seg   18.65  18.18  28.85  26.52  36.58  56.45  95.45  173.38  325.99  559.87
F     86.90  92.95  93.33  93.38  93.25  93.29  93.19  93.07   93.24   93.34

Accuracy does not increase beyond B = 16. Moreover, the accuracy is quite competitive even with B as low as 4. This reflects the fact that the best segmentation is often within the current top few candidates in the agenda. (The optimization in Section 4, which has a pruning effect, was applied to this experiment. Similar observations were made in separate experiments without such optimization.) Since the training and testing time generally increases as N increases, the agenda size is fixed to 16 for the remaining experiments.

6.3 The influence of particular features
Our CWS model is highly dependent upon word information. Most of the features in Table 1 are related to words. Table 4 shows the accuracy with various features from the model removed. Among the features, vocabulary words (feature 1) and length prediction by characters (features 3 to 5) showed strong influence on the accuracy, while word bigrams (feature 2) and special characters in them (features 11 and 12) showed comparatively weak influence.

Table 4: the influence of features. (F: F-measure. Feature numbers are from Table 1.)
Features     F
All          93.38
w/o 1        92.88
w/o 2        93.36
w/o 3,4,5    92.72
w/o 6        93.13
w/o 7        93.13
w/o 8        93.14
w/o 9,10     93.31
w/o 11,12    93.38
w/o 13,14    93.23

6.4 Closed test on the SIGHAN bakeoffs
Four training and testing corpora were used in the first bakeoff (Sproat and Emerson, 2003), including the Academia Sinica Corpus (AS), the Penn Chinese Treebank Corpus (CTB), the Hong Kong City University Corpus (CU) and the Peking University Corpus (PU). However, because the testing data from the Penn Chinese Treebank Corpus is currently unavailable, we excluded this corpus. The corpora are encoded in GB (PU, CTB) and BIG5 (AS, CU). In order to test them consistently in our system, they are all converted to UTF8 without loss of information.
The results are shown in Table 5. We follow the format from Peng et al. (2004). Each row represents a CWS model. The first eight rows represent models from Sproat and Emerson (2003) that participated in at least one closed test from the table, row "Peng" represents the CRF model from Peng et al. (2004), and the last row represents our model. The first three columns represent tests with the AS, CU and PU corpora, respectively. The best score in each column is shown in bold. The last two columns represent the average accuracy of each model over the tests it participated in (SAV), and our average over the same tests (OAV), respectively. For each row the best average is shown in bold.
We achieved the best accuracy in two of the three corpora, and better overall accuracy than the majority of the other models. The average score of S10 is 0.7% higher than our model, but S10 only participated in the HK test.
Four training and testing corpora were used in the second bakeoff (Emerson, 2005), including the Academia Sinica corpus (AS), the Hong Kong City University Corpus (CU), the Peking University Corpus (PK) and the Microsoft Research Corpus (MR).

Table 5: the accuracies over the first SIGHAN bakeoff data.
            AS    CU    PU    SAV   OAV
S01         93.8  90.1  95.1  93.0  95.0
S04         -     -     93.9  93.9  94.0
S05         94.2  -     89.4  91.8  95.3
S06         94.5  92.4  92.4  93.1  95.0
S08         -     90.4  93.6  92.0  94.3
S09         96.1  -     94.6  95.4  95.3
S10         -     94.7  -     94.7  94.0
S12         95.9  91.6  -     93.8  95.6
Peng        95.6  92.8  94.1  94.2  95.0
Our model   96.5  94.6  94.0

Table 6: the accuracies over the second SIGHAN bakeoff data.
            AS    CU    PK    MR    SAV   OAV
S14         94.7  94.3  95.0  96.4  95.1  95.4
S15b        95.2  94.1  94.1  95.8  94.8  95.4
S27         94.5  94.0  95.0  96.0  94.9  95.4
Zh-a        94.7  94.6  94.5  96.4  95.1  95.4
Zh-b        95.1  95.1  95.1  97.1  95.6  95.4
Our model   94.6  95.1  94.5  97.2

Different encodings were provided, and the UTF8 data for all four corpora were used in this experiment. Following the format of Table 5, the results for this bakeoff are shown in Table 6. We chose the three models that achieved at least one best score in the closed tests from Emerson (2005), as well as the sub-word-based model of Zhang et al. (2006) for comparison. Rows "Zh-a" and "Zh-b" represent the pure sub-word CRF model and the confidence-based combination of the CRF and rule-based models, respectively.
Again, our model achieved better overall accuracy than the majority of the other models. One system to achieve comparable accuracy with our system is Zh-b, which improves upon the sub-word CRF model (Zh-a) by combining it with an independent dictionary-based submodel and improving the accuracy of known words. In comparison, our system is based on a single perceptron model.
In summary, closed tests for both the first and the second bakeoff showed competitive results for our system compared with the best results in the literature. Our word-based system achieved the best F-measures over the AS (96.5%) and CU (94.6%) corpora in the first bakeoff, and the CU (95.1%) and MR (97.2%) corpora in the second bakeoff.

7 Conclusions and Future Work
We proposed a word-based CWS model using the discriminative perceptron learning algorithm. This model is
an alternative to the existing character-based tagging models,and allows word information to be used as features.One attractive feature of the perceptron training algorithm is its simplicity,con-sisting of only a decoder and a trivial update process. We use a beam-search decoder,which places our work in the context of recent proposals for search-based discriminative learning algorithms.Closed tests using thefirst and second SIGHAN CWS bake-off data demonstrated our system to be competitive with the best in the literature.Open features,such as knowledge of numbers and European letters,and relationships from semantic networks(Shi and Wang,2007),have been reported to improve accuracy.Therefore,given theflexibility of the feature-based perceptron model,an obvious next step is the study of open features in the seg-mentor.Also,we wish to explore the possibility of in-corporating POS tagging and parsing features into the discriminative model,leading to joint decod-ing.The advantage is two-fold:higher level syn-tactic information can be used in word segmenta-tion,while joint decoding helps to prevent bottom-up error propagation among the different processing steps.AcknowledgementsThis work is supported by the ORS and Clarendon Fund.We thank the anonymous reviewers for their insightful comments.ReferencesMichael Collins and Brian Roark.2004.Incremental parsing with the perceptron algorithm.In Proceedings of ACL’04, pages111–118,Barcelona,Spain,July.Michael Collins.2002.Discriminative training methods for hidden markov models:Theory and experiments with per-ceptron algorithms.In Proceedings of EMNLP,pages1–8, Philadelphia,USA,July.Hal Daume III.2006.Practical Structured Learning for Natu-ral Language Processing.Ph.D.thesis,USC.Thomas Emerson.2005.The second international Chinese word segmentation bakeoff.In Proceedings of The Fourth SIGHAN Workshop,Jeju,Korea.Y.Freund and rge margin classification using the perceptron algorithm.In Machine Learning,pages 277–296.fferty,A.McCallum,and F.Pereira.2001.Conditional randomfields:Probabilistic models for segmenting and la-beling sequence data.In Proceedings of the18th ICML, pages282–289,Massachusetts,USA.Y.Li,Zaragoza,R.H.,Herbrich,J.Shawe-Taylor,and J.Kan-dola.2002.The perceptron algorithm with uneven margins.In Proceedings of the9th ICML,pages379–386,Sydney, Australia.Yaoyong Li,Chuanjiang Miao,Kalina Bontcheva,and Hamish Cunningham.2005.Perceptron learning for Chinese word segmentation.In Proceedings of the Fourth SIGHAN Work-shop,Jeju,Korea.Percy Liang.2005.Semi-supervised learning for natural lan-guage.Master’s thesis,MIT.F.Peng,F.Feng,,and A.McCallum.2004.Chinese segmenta-tion and new word detection using conditional randomfields.In Proceedings of COLING,Geneva,Switzerland.Adwait Ratnaparkhi.1998.Maximum Entropy Models for Nat-ural Language Ambiguity Resolution.Ph.D.thesis,UPenn. Yanxin Shi and Mengqiu Wang.2007.A dual-layer CRF based joint decoding method for cascade segmentation and labelling tasks.In Proceedings of IJCAI,Hyderabad,India. Richard Sproat and Thomas Emerson.2003.Thefirst interna-tional Chinese word segmentation bakeoff.In Proceedings of The Second SIGHAN Workshop,pages282–289,Sapporo, Japan,July.R.Sproat,C.Shih,W.Gail,and N.Chang.1996.A stochas-ticfinite-state word-segmentation algorithm for Chinese.In Computational Linguistics,volume22(3),pages377–404. 
Xinhao Wang,Xiaojun Lin,Dianhai Yu,Hao Tian,and Xihong Wu.2006.Chinese word segmentation with maximum en-tropy and n-gram language model.In Proceedings of the Fifth SIGHAN Workshop,pages138–141,Sydney,Australia, July.N.Xue.2003.Chinese word segmentation as character tag-ging.In International Journal of Computational Linguistics and Chinese Language Processing,volume8(1). Ruiqiang Zhang,Genichiro Kikui,and Eiichiro Sumita.2006.Subword-based tagging by conditional randomfields for Chinese word segmentation.In Proceedings of the Human Language Technology Conference of the NAACL,Compan-ion,volume Short Papers,pages193–196,New York City, USA,June.。