ENHANCED LANGUAGE MODELLING WITH PHONOLOGICALLY CONSTRAINED MORPHOLOGICAL ANALYSIS
Chapter 1 Introduction

1. Explain the following definition of linguistics: Linguistics is the scientific study of language.

Linguistics investigates not any particular language but languages in general. Linguistic study is scientific because it is based on the systematic investigation of authentic language data. No serious linguistic conclusion is reached until the linguist has done the following three things: observing the way language is actually used, formulating some hypotheses, and testing these hypotheses against linguistic facts to prove their validity.

2. What are the major branches of linguistics? What does each of them study?

Phonetics - how speech sounds are produced and classified
Phonology - how sounds form systems and function to convey meaning
Morphology - how morphemes are combined to form words
Syntax - how morphemes and words are combined to form sentences
Semantics - the study of meaning (in the abstract)
Pragmatics - the study of meaning in the context of use
Sociolinguistics - the study of language with reference to society
Psycholinguistics - the study of language with reference to the workings of the mind
Applied Linguistics - the application of linguistic principles and theories to language teaching and learning

3. What makes modern linguistics different from traditional grammar?

Modern linguistics is descriptive; its investigations are based on authentic, and mainly spoken, language data.
Diffusion Language Model

As human society develops and globalization advances, communication across languages has become increasingly important. Yet communication and understanding between different languages remains a major challenge. To address this problem, academia and industry continue to explore and study different language models. Among them, the diffusion language model has attracted wide attention and carries both theoretical and practical significance. This article introduces the basic concepts, history, and application areas of diffusion language models, and offers an outlook on their future development.

A diffusion language model is a cross-lingual model that aims to enable the transfer and exchange of information between different languages. Grounded in syntax, semantics, and pragmatics, and drawing on machine learning and natural language processing techniques, it learns from and analyzes multilingual data to build a language model shared across languages. In this model, the various languages share the same linguistic features and, through mutual interaction and influence, promote cross-lingual communication and understanding. Notably, a diffusion language model does not simply translate one language literally into another; rather, by analyzing and modeling the relationships among multiple languages, it develops a comprehensive account of language variation and diffusion.

In the course of research on diffusion language models, scholars have actively explored the commonalities and differences among languages, deepening and advancing linguistic theory. By mining and analyzing multilingual data, researchers have not only uncovered extensive connections between languages but also revealed regularities in language evolution and diffusion. This has provided fresh perspectives for linguistic theory and practice and fostered the development of multilingual research and cross-cultural exchange.

In practical terms, diffusion language models hold broad potential value. First, they can provide technical support for multilingual information processing and management in multinational enterprises and international organizations, facilitating the expansion of international business. Second, they can offer convenient and efficient solutions for language exchange in immigrant communities and border regions, enabling harmonious coexistence across languages and cultures. In addition, they can lend strong support to international tourism and cultural exchange, promoting the blending and coexistence of diverse cultures. It is foreseeable that diffusion language models will play an important role in international cooperation, cross-cultural communication, and pluralistic coexistence, lending new momentum to language communication and understanding in the era of globalization.
A Detailed Overview of the General Language Model

1. Introduction

1.1 Overview

The introduction opens an article and presents its topic and purpose to the reader. In this article's introduction, we give an overview of the general language model. A general language model is a deep-learning-based natural language processing model with a wide range of applications. It is trained on large-scale corpora to learn the latent structure, semantics, and contextual dependencies of language. Concretely, a general language model uses a probabilistic model to predict the next word or character given a context, and thereby supports both language understanding and generation.
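The core idea above, predicting the next token from context with a probabilistic model, can be sketched with a toy bigram model. This is illustrative only; a real general language model uses deep networks rather than counts:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count bigram transitions and convert them to conditional probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {prev: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for prev, nxts in counts.items()}

def predict_next(model, context_word):
    """Return the most probable next token given the previous token."""
    dist = model.get(context_word, {})
    return max(dist, key=dist.get) if dist else None

corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # 'cat': it follows 'the' twice, 'dog' only once
```

Deep models replace the count table with a learned function, but the training objective is the same conditional next-token probability.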
In recent years, the general language model has achieved remarkable results and shown great potential across many fields. It can be applied widely to tasks such as machine translation, language generation, question answering, semantic analysis, sentiment analysis, and text classification. By applying general language models to these tasks, we can improve the performance of natural language processing systems and enhance the experience of human-computer interaction.

This article discusses the principles of the general language model, its application areas, and its future development in detail. We examine successful use cases in different fields and analyze the model's strengths and limitations. We also look ahead to its further development and to the applications and challenges that may lie ahead.

After reading this article, readers will have a comprehensive understanding of the concept, principles, and application areas of the general language model, as well as a sense of the field's future trends and the challenges and opportunities it faces. Now let us turn to the main text for a closer look at the general language model.

1.2 Article Structure

This article unfolds its analysis of the general language model in the following structure: the introduction outlines the basic concepts and application scenarios of the general language model and states the purpose of this article.
SHANGHAI UNIVERSITY
UNDERGRADUATE PROJECT (THESIS)

Title: Speech Recognition Based on Continuous Hidden Markov Models
School: Mechatronic Engineering and Automation
Major: Automation
Student ID: 03122669
Student: Jin Wei
Supervisor: Li Xin
Duration: 20 March – 6 June 2007

Contents
Abstract
Introduction
Chapter 1  Fundamentals of Speech
  Section 1  Basic content of speech recognition
  Section 2  Difficulties in implementing speech recognition
Chapter 2  Theoretical Foundations of the HMM
  Section 1  Definition of the HMM
  Section 2  Mathematical description of the hidden Markov model
  Section 3  Types of HMM
  Section 4  The three basic problems of the HMM and their solutions
Chapter 3  Issues in Implementing HMM Algorithms
  Section 1  Choice of HMM state types and the parameter B
  Section 2  Problems to be solved in HMM training
Chapter 4  Design of the Speech Recognition System
  Section 1  Development environment of the speech recognition system
  Section 2  Design of the HMM-based speech recognition system
  Section 3  Experimental results
Chapter 5  Conclusion
Acknowledgements
References

Abstract: The most important part of a speech recognition system is the construction of the acoustic model. As a statistical model of the speech signal, the hidden Markov model describes the non-stationary and time-varying nature of speech well, and is therefore widely used in the field of speech recognition.
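The "three basic problems" of the HMM referred to in Chapter 2 are usually stated as evaluation, decoding, and training. A minimal sketch of the forward algorithm, which solves the evaluation problem, follows; the two-state model and all its probabilities are invented for illustration and are not taken from the thesis:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: P(observation sequence | HMM) by dynamic programming."""
    # alpha[s] = probability of emitting the observations so far and ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy two-state model (hypothetical numbers, for illustration only)
states = ["voiced", "unvoiced"]
start_p = {"voiced": 0.6, "unvoiced": 0.4}
trans_p = {"voiced": {"voiced": 0.7, "unvoiced": 0.3},
           "unvoiced": {"voiced": 0.4, "unvoiced": 0.6}}
emit_p = {"voiced": {"hi": 0.8, "lo": 0.2},
          "unvoiced": {"hi": 0.1, "lo": 0.9}}
p = forward(["hi", "lo"], states, start_p, trans_p, emit_p)
```

The decoding problem is solved by replacing the sum with a max (the Viterbi algorithm), and training by Baum-Welch re-estimation over these same quantities.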
Chapter 2 Speech Sounds

2.1 Speech production and perception
Phonetics is the study of speech sounds. It includes three main areas:
1. Articulatory phonetics - the study of the production of speech sounds
2. Acoustic phonetics - the study of the physical properties of the sounds produced in speech
3. Auditory phonetics - the study of the perception of speech sounds
Most phoneticians are interested in articulatory phonetics.

2.2 Speech organs
Speech organs are those parts of the human body involved in the production of speech. The speech organs can be considered as consisting of three parts: the initiator of the air stream, the producer of voice, and the resonating cavities.

2.3 Segments, divergences, and phonetic transcription
2.3.1 Segments and divergences
As there are more sounds in English than letters, each letter must represent more than one sound.
2.3.2 Phonetic transcription
International Phonetic Alphabet (IPA): the system of symbols for representing the pronunciation of words in any language according to the principles of the International Phonetic Association. The symbols consist of letters and diacritics. Some letters are taken from the Roman alphabet; some are special symbols.

2.4 Consonants
2.4.1 Consonants and vowels
A consonant is produced by constricting or obstructing the vocal tract at some place to divert, impede, or completely shut off the flow of air in the oral cavity.
A vowel is produced without obstruction, so no turbulence or total stopping of the air can be perceived.
2.4.2 Consonants
The categories of consonant are established on the basis of several factors. The most important of these factors are:
1. the actual relationship between the articulators and thus the way in which the air passes through certain parts of the vocal tract (manner of articulation);
2. where in the vocal tract there is approximation, narrowing, or obstruction of the air (place of articulation).
2.4.3 Manners of articulation
1. Stop/plosive: a speech sound which is produced by stopping the airstream from the lungs and then suddenly releasing it. In English, [p, b, t, d, k, g] are stops and [m, n, ŋ] are nasal stops.
2. Fricative: a speech sound which is produced by allowing the airstream from the lungs to escape with friction. This is caused by bringing the two articulators, e.g. the upper teeth and the lower lip, close together but not close enough to stop the airstream completely. In English, [f, v, θ, ð, s, z, ʃ, ʒ, h] are fricatives.
3. (Median) approximant: an articulation in which one articulator is close to another, but without the vocal tract being narrowed to such an extent that a turbulent airstream is produced. In English this class of sounds includes [w, ɹ, j].
4. Lateral (approximant): a speech sound which is produced by partially blocking the airstream from the lungs, usually by the tongue, but letting it escape at one or both sides of the blockage. [l] is the only lateral in English.
Other consonantal articulations include trill, tap or flap, and affricate.
2.4.4 Places of articulation
1. Bilabial: a speech sound which is made with the two lips.
2. Labiodental: a speech sound which is made with the lower lip and the upper front teeth.
3. Dental: a speech sound which is made by the tongue tip or blade and the upper front teeth.
4. Alveolar: a speech sound which is made with the tongue tip or blade and the alveolar ridge.
5. Postalveolar: a speech sound which is made with the tongue tip and the back of the alveolar ridge.
6. Retroflex: a speech sound which is made with the tongue tip or blade curled back so that the underside of the tongue tip or blade forms a stricture with the back of the alveolar ridge or the hard palate.
7. Palatal: a speech sound which is made with the front of the tongue and the hard palate.
8. Velar: a speech sound which is made with the back of the tongue and the soft palate.
9. Uvular: a speech sound which is made with the back of the tongue and the uvula, the short projection of soft tissue and muscle at the posterior end of the velum.
10. Pharyngeal: a speech sound which is made with the root of the tongue and the walls of the pharynx.
11. Glottal: a speech sound which is made with the two pieces of vocal folds pushed towards each other.
2.4.5 The consonants of English
Received Pronunciation (RP): the type of British Standard English pronunciation which has been regarded as the prestige variety and which shows no regional variation. It has often been popularly referred to as "BBC English" or "Oxford English" because it is widely used in the private sector of the education system and spoken by most newsreaders of the BBC network.
English consonants can be arranged by place and manner of articulation. Pairs of consonants sharing the same place and manner are distinguished by voicing: the one appearing on the left is voiceless and the one on the right is voiced. Therefore, the consonants of English can be described in the following way:
[p] voiceless bilabial stop
[b] voiced bilabial stop
[s] voiceless alveolar fricative
[z] voiced alveolar fricative
[m] bilabial nasal
[n] alveolar nasal
[l] alveolar lateral
[j] palatal approximant
[h] glottal fricative
[r] alveolar approximant

2.5 Vowels
2.5.1 The criteria of vowel description
1. The part of the tongue that is raised - front, center, or back.
2. The extent to which the tongue rises in the direction of the palate. Normally, three or four degrees are recognized: high, mid (often divided into mid-high and mid-low), and low.
3. The kind of opening made at the lips - various degrees of lip rounding or spreading.
4. The position of the soft palate - raised for oral vowels, and lowered for vowels which have been nasalized.
2.5.2 The theory of cardinal vowels
Cardinal vowels are a set of vowel qualities arbitrarily defined, fixed and unchanging, intended to provide a frame of reference for the description of the actual vowels of existing languages.
By convention, the eight primary cardinal vowels are numbered from one to eight as follows: CV1 [i], CV2 [e], CV3 [ɛ], CV4 [a], CV5 [ɑ], CV6 [ɔ], CV7 [o], CV8 [u].
A set of secondary cardinal vowels (CV9 - CV16) is obtained by reversing the lip-rounding for a given position.
2.5.3 Vowel glides
Pure (monophthong) vowels: vowels which are produced without any noticeable change in vowel quality.
Vowel glides: vowels where there is an audible change of quality.
Diphthong: a vowel which is usually considered as one distinctive vowel of a particular language but really involves two vowels, with one vowel gliding to the other.
2.5.4 The vowels of RP
[iː] high front tense unrounded vowel
[ʊ] high back lax rounded vowel
[ə] central lax unrounded vowel
[ɒ] low back lax rounded vowel

2.6 Coarticulation and phonetic transcription
2.6.1 Coarticulation
Coarticulation: the simultaneous or overlapping articulation of two successive phonological units.
Anticipatory coarticulation: if a sound becomes more like the following sound, as in the case of lamp, it is known as anticipatory coarticulation.
Perseverative coarticulation: if a sound displays the influence of the preceding sound, as in the case of map, it is perseverative coarticulation.
Nasalization: a change or process by which vowels or consonants become nasal.
Diacritics: any mark in writing additional to a letter or other basic element.
2.6.2 Broad and narrow transcriptions
The use of a simple set of symbols in our transcription is called a broad transcription.
The use of more specific symbols to show more phonetic detail is referred to as a narrow transcription. The former is meant to indicate only those sounds capable of distinguishing one word from another in a given language, while the latter is meant to symbolize all the possible speech sounds, including even the minutest shades of pronunciation.

2.7 Phonological analysis
Phonetics is the study of speech sounds; it includes articulatory, acoustic, and auditory phonetics. Phonology, on the other hand, studies the rules governing the structure, distribution, and sequencing of speech sounds and the shape of syllables. There is a fair degree of overlap between the two subjects, so it is sometimes hard to draw a boundary between them. Phonetics is the study of all possible speech sounds, while phonology studies the way in which speakers of a language systematically use a selection of these sounds in order to express meaning. That is to say, phonology is concerned with the linguistic patterning of sounds in human languages; its primary aim is to discover the principles that govern the way sounds are organized in languages, and to explain the variations that occur.

2.8 Phonemes and allophones
2.8.1 Minimal pairs
Minimal pairs are two words in a language which differ from each other by only one distinctive sound and which also differ in meaning. E.g. the English words tie and die are a minimal pair, as they differ in meaning and in their initial phonemes /t/ and /d/. By identifying the minimal pairs of a language, a phonologist can find out which sound substitutions cause differences of meaning.
2.8.2 The phoneme theory
2.8.3 Allophones
A phoneme is the smallest linguistic unit of sound that can signal a difference in meaning. Any of the different forms of a phoneme is called an allophone. E.g. in English, when the phoneme /p/ occurs at the beginning of a word like peak /piːk/, it is said with a little puff of air: it is aspirated. But when /p/ occurs in a word like speak /spiːk/, it is said without the puff of air: it is unaspirated. Both the aspirated [pʰ] in peak and the unaspirated [p] in speak have the same phonemic function, i.e. they are both heard and identified as /p/ and not as /b/; they are both allophones of the phoneme /p/.

2.9 Phonological processes
2.9.1 Assimilation
Assimilation: a process by which one sound takes on some or all the characteristics of a neighboring sound.
Regressive assimilation: if a following sound is influencing a preceding sound, we call it regressive assimilation.
Progressive assimilation: if a preceding sound is influencing a following sound, we call it progressive assimilation.
Devoicing: a process by which voiced sounds become voiceless. Devoicing of voiced consonants often occurs in English when they are at the end of a word.
2.9.2 Phonological processes and phonological rules
The changes in assimilation, nasalization, dentalization, and velarization are all phonological processes in which a target or affected segment undergoes a structural change in certain environments or contexts. In each process the change is conditioned or triggered by a following sound or, in the case of progressive assimilation, a preceding sound. Consequently, we can say that any phonological process must have three aspects to it: a set of sounds that undergo the process; a set of sounds produced by the process; and a set of situations in which the process applies. We can represent the process by means of an arrow: voiced fricative → voiceless / __________ voiceless. This is a phonological rule. The slash (/) specifies the environment in which the change takes place. The bar (called the focus bar) indicates the position of the target segment. So the rule reads: a voiced fricative is transformed into the corresponding voiceless sound when it appears before a voiceless sound.
2.9.3 Rule ordering

2.10 Distinctive features
Distinctive feature: a particular characteristic which distinguishes one distinctive sound unit of a language from another, or one group of sounds from another group.
Binary feature: a property of a phoneme or a word which can be used to describe the phoneme or word. A binary feature is either present or absent. Binary features are also used to describe the semantic properties of words.

2.11 Syllables
Suprasegmental features: those aspects of speech that involve more than single sound segments. The principal suprasegmental features are syllables, stress, tone, and intonation.
Syllable: a unit in speech which is often longer than one sound and smaller than a whole word.
Open syllable: a syllable which ends in a vowel.
Closed syllable: a syllable which ends in a consonant.
Maximal onset principle: the principle which states that when there is a choice as to where to place a consonant, it is put into the onset rather than the coda. E.g. the correct syllabification of the word country is /ˈkʌn.tri/; it should not be /ˈkʌnt.ri/ or /ˈkʌ.ntri/ according to this principle.

2.12 Stress
Stress refers to the degree of force used in producing a syllable. In transcription, a raised vertical line [ˈ] is used just before the syllable it relates to.
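The rule notation above (target → change / environment) is mechanical enough to implement directly. A toy sketch of the devoicing rule, where the voiced/voiceless pairings are standard English fricatives but the program itself is an invented illustration:

```python
# Minimal illustration of applying the rule:
#   voiced fricative -> voiceless / ___ voiceless
VOICED_TO_VOICELESS = {"v": "f", "z": "s", "ð": "θ", "ʒ": "ʃ"}  # fricative pairs
VOICELESS = {"p", "t", "k", "f", "s", "θ", "ʃ", "h"}

def apply_devoicing(segments):
    """Devoice a voiced fricative when the next segment is voiceless."""
    out = list(segments)
    for i in range(len(out) - 1):
        if out[i] in VOICED_TO_VOICELESS and out[i + 1] in VOICELESS:
            out[i] = VOICED_TO_VOICELESS[out[i]]
    return out

print(apply_devoicing(["z", "t"]))  # ['s', 't']: /z/ devoices before voiceless /t/
print(apply_devoicing(["z", "d"]))  # ['z', 'd']: no change before voiced /d/
```

The three aspects of a phonological process map directly onto the code: the rule's targets (the dictionary keys), its outputs (the values), and its environment (the following-segment check).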
Prompt-based Language Models: A Short Survey of Template-Enhanced Language Models

© PaperWeekly original · Author | Li Luoqiu
Affiliation | Master's student, Zhejiang University
Research interests | Natural language processing, knowledge graphs

The NLP community has recently seen a surge of interest in using prompts (templates) to enhance model prediction: from Su Jianlin's recent posts "Must it be GPT-3? No, BERT's MLM can do few-shot learning too" and "P-tuning: automatically building templates to unlock the potential of language models", to Yang Zhilin's report on a new paradigm for pre-training and fine-tuning at the BAAI WuDao 1.0 launch and large-scale pre-trained model forum on March 20 [1], all point to the large gains prompts bring to model performance in few-shot learning and similar settings.

Drawing on these materials and the related papers, this article attempts to trace the past and present of the prompt family of methods.

Contents:
1. Back to the source: from GPT and MLM to Pattern-Exploiting Training
   1. Pattern-Exploiting Training
2. Hands-free: automatically constructing prompts
   1. LM Prompt And Query Archive
   2. AUTOPROMPT
   3. Better Few-shot Fine-tuning of Language Models
3. A bolder idea: constructing continuous prompts
   1. P-tuning
4. Summary

Back to the source: from GPT and MLM to Pattern-Exploiting Training

To explain what a prompt is, the story starts with OpenAI's GPT models. GPT is a family of generative models; its third generation, GPT-3, was released in May 2020. With 175 billion parameters, it can generate all kinds of text without fine-tuning (admittedly, almost no one can train it casually), from routine tasks such as dialogue and summarization to some rather exotic scenarios (generating UI or SQL code?), and so on.
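Pattern-Exploiting Training, listed in the contents above, recasts classification as a cloze task: a pattern turns the input into a sentence with a mask, and a verbalizer maps each label to a word the masked LM could fill in. A self-contained toy sketch; the template, the label words, and the hand-written score table standing in for a real masked LM (e.g. BERT) are all invented for illustration:

```python
# Toy PET-style cloze classification. A real system would query a masked LM
# for P(word | masked position); a fixed score table stands in for it here
# so the example is self-contained.
def build_prompt(review):
    return f"{review} It was [MASK]."

# Verbalizer: map each class label to a word that can fill the mask.
VERBALIZER = {"positive": "great", "negative": "terrible"}

# Stand-in for the masked LM's fill-in scores, keyed by prompt text.
FAKE_MLM_SCORES = {
    build_prompt("A wonderful film."): {"great": 0.7, "terrible": 0.1},
}

def classify(review):
    scores = FAKE_MLM_SCORES[build_prompt(review)]
    return max(VERBALIZER, key=lambda label: scores[VERBALIZER[label]])

print(classify("A wonderful film."))  # positive
```

The design point is that no new classification head is trained: the task is rephrased so the pre-trained masked-LM head itself does the scoring, which is why the approach works with few labeled examples.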
Speech Enhanced Multi-span Language Model

A. Nayeemulla Khan and B. Yegnanarayana
Speech and Vision Laboratory
Department of Computer Science and Engineering
Indian Institute of Technology Madras, Chennai-600036, India
Email: nayeem, yegna@cs.iitm.ernet.in

Abstract

To capture local and global constraints in a language, statistical n-grams are used in combination with multi-span language models for improved language modelling. Use of latent semantic analysis (LSA) to capture the global semantic constraints, and bigram models to capture local constraints, is shown to reduce the perplexity of the model. In this paper we propose a method in which the multi-span LSA language model can be developed based on the speech signal. Reference pattern vectors are derived from the speech signal for each word in the vocabulary. Based on the normalised distance between the reference word pattern vector and the pattern vector of a word in the training data, the LSA model is developed. We show that this model in combination with a standard bigram model performs better than the conventional bigram + LSA model. The results are demonstrated for a limited vocabulary on a database for the Indian language Tamil.

1. Introduction

In every language there exist dependencies in the usage of words, which could be syntactic, semantic or pragmatic. Local constraints are captured by means of statistical n-gram models. n-gram models are unable to predict long range dependencies, as this requires a large value of n, making the parameter estimates of the model unreliable due to the limited training data available. To model long range dependencies, equivalence classes on the n-gram history [1] and structured language models [2] are useful for limited domains. In less constrained domains they are not as useful. Trigger-based language models [3] are also potential ways in which long range dependencies can be captured. But trigger pair selection is a complex task, with different pairs displaying different behaviors.
Use of latent semantic analysis to capture long range dependencies has been shown to be effective. In combination with n-gram models it results in a substantial reduction in perplexity [4][5]. In conventional language models no knowledge of the language is used; the data being modelled could as well be a sequence of arbitrary symbols. It is essential to use available knowledge sources to enhance the performance of statistical language models.

One application of statistical language models is in speech recognition. The use of speech knowledge, prosodic constraints, and large span semantic and local syntactic constraints, when integrated with the speech recogniser, would improve the performance of the recogniser. In this paper we propose a method in which the semantic constraints, in terms of the co-occurrence of words in a document, are captured indirectly from the speech signal in the latent semantic analysis (LSA) framework. We show that the speech enhanced LSA language model performs better than the n-gram and the hybrid n-gram + LSA model.
The reduction in perplexity for a test set is used to measure the performance of the model. The paper is organised as follows: the next section briefly illustrates the technique of LSA. Section 3 describes the development of the speech enhanced multi-span language model. In Section 4 the database used is described. Section 5 details the evaluation of the model, followed by discussion of the results in Section 6. We summarise the study in Section 7.

2. Latent semantic analysis

A brief overview of related work on LSA relevant to this study, as described in [4][5][6], is presented here. LSA is an algebraic technique that can be used to infer the relationship among words by means of the co-occurrence of the words in identical contexts. Given a set of N documents from a text corpus, with a vocabulary of M words, it specifies a mapping between these discrete sets and a continuous vector space. A document could arbitrarily be a sentence, a paragraph or a larger unit of text. A matrix W containing the co-occurrence statistics between words and documents is constructed. Here word order is ignored, unlike in conventional n-gram modelling. Each element of W is weighted by the normalised word entropy and scaled for the document length. The element (i, j) of W is given by

    w_ij = (1 - e_i) * c_ij / n_j

where c_ij is the number of times word w_i occurs in document d_j, n_j is the total number of words present in d_j, and e_i is the normalised entropy of w_i in the entire corpus. The matrix W can be approximated by its order-R singular value decomposition (SVD). This results in three matrices U, S and V.
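The entropy-weighted matrix construction just described can be sketched directly in code (pure Python; the variable names follow the formula, the toy corpus is invented, and N > 1 documents are assumed so the entropy normalisation is defined):

```python
import math

def weighted_matrix(docs, vocab):
    """Build W with w_ij = (1 - e_i) * c_ij / n_j, where e_i is the
    normalised entropy of word i over the corpus (requires more than one doc)."""
    N = len(docs)
    counts = [[doc.count(w) for doc in docs] for w in vocab]   # c_ij
    totals = [len(doc) for doc in docs]                        # n_j
    W = []
    for i, _ in enumerate(vocab):
        t_i = sum(counts[i])                                   # corpus count of word i
        # normalised entropy: 0 if the word is concentrated in one document,
        # approaching 1 if it is spread evenly over all documents
        e_i = -sum((c / t_i) * math.log(c / t_i)
                   for c in counts[i] if c) / math.log(N)
        W.append([(1 - e_i) * counts[i][j] / totals[j] for j in range(N)])
    return W

docs = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "sat"]]
W = weighted_matrix(docs, ["cat", "sat", "ran", "dog"])
```

A word occurring in a single document gets weight (1 - e_i) = 1, while a word spread evenly across the corpus is down-weighted, which is the intended effect: evenly distributed words carry little semantic information.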
U and V are column orthonormal and S is a diagonal matrix. This transformation to the lower dimensional space captures the major structural association between the words and the documents and removes noise. It also provides an R-dimensional representation for both the words and the documents. Based on information retrieval and language modelling studies [5], values of R in the range of 100 to 300 seem to work reasonably. The R-dimensional scaled representations of the word and document vectors are given by u_i S and v_j S, where u_i and v_j are the corresponding rows of U and V. Any new (test) document can be considered as an additional column d of the matrix W; its representation in the reduced dimensional space is given by d' U S^(-1). For language modelling, given such a representation and a distance metric in the R-dimensional space, it is possible to combine the standard n-grams and the LSA model to derive a hybrid n-gram + LSA language model, as detailed in [4][5]. In the following sections we detail the construction of the matrix W from the acoustic signal. Using this speech-based matrix we develop the speech enhanced hybrid n-gram + LSA model.

3. Speech enhanced multi-span language model

The block diagram of the proposed system for construction of the matrix W is shown in Figure 1. Availability of a database segmented in terms of words is assumed. The duration of a word segment is variable. To find the closeness of a pattern vector representing a word to the other words of the vocabulary, it is desirable to have fixed dimensional pattern vectors for all the words in the vocabulary. From the speech signal corresponding to a word segment, for every frame of 15 msec with a frame shift of 10 msec, we derive 13-dimensional mel frequency cepstral coefficients. The euclidean distance between adjacent pairs of feature vectors is computed for all the frames corresponding to the word segment. Depending on the number of frames needed to construct the desired pattern vector, frames are added or dropped. If the number of frames in the word segment is less than desired, then the frame with the minimum euclidean distance is replicated. For a word segment with more frames than desired, a frame is dropped if its euclidean distance to its neighbor is minimum among the distances computed. This is repeated until the desired number of frames is obtained. It is assumed that there is minimal distortion/loss in adding/dropping the above frames. The selected frames are concatenated to form the fixed dimensional pattern vector representing the word. The resulting pattern vector is large (390 to 572 dimensions). Comparing pattern vectors in such a high dimensional space is not preferable. It has been shown that non-linear compression of large dimensional pattern vectors of speech using AANN models does not degrade speech recognition performance [7]. We use AANN models to compress the large dimensional pattern vector into 40 to 100 dimensions. Thus a reduced dimension pattern vector is derived for each word segment in the entire training data. Pattern vectors corresponding to a word in the training set are used to derive a mean pattern vector, which serves as the reference pattern vector for that word in the vocabulary. One such reference pattern vector is derived for each word in the vocabulary.

For every word segment in a training document (speech file), a compressed pattern vector is derived as explained. The euclidean distance between this pattern vector and all the reference pattern vectors in the vocabulary is determined. The resulting distances are normalised between zero and one. The membership, defined as (1 - normalised distance), indicates how close the current pattern vector is to each of the reference pattern vectors in the vocabulary.
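The frame add/drop procedure for producing a fixed number of frames per word can be sketched as follows. This is a simplified reading of the description above; the exact tie-breaking and the one-at-a-time greedy loop are assumptions, and the toy one-dimensional frames are invented:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fix_frame_count(frames, target):
    """Replicate or drop a frame at the point of minimum adjacent distance,
    repeating until exactly `target` frames remain."""
    frames = [list(f) for f in frames]
    while len(frames) != target:
        dists = [euclidean(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
        i = dists.index(min(dists))          # most redundant neighbouring pair
        if len(frames) < target:
            frames.insert(i + 1, list(frames[i]))  # replicate the redundant frame
        else:
            frames.pop(i)                          # drop it
    return frames

frames = [[0.0], [0.1], [5.0], [5.05]]
print(len(fix_frame_count(frames, 6)))  # 6
print(len(fix_frame_count(frames, 3)))  # 3
```

Operating at the minimum-distance point keeps the distortion from adding or dropping frames small, matching the paper's assumption of minimal loss.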
If this membership is above a certain threshold then the appropriate element of W is incremented by the membership value. The elements of W are also scaled for the length of the document (number of words) and weighted by the entropy of the term. Thus the matrix W is derived from the acoustic signal.

4. Database

The database used for the study is the Indian language speech corpus [8]. TV news bulletins from Doordarshan for the Tamil language were collected. Speech pertaining to the news reader was manually transcribed and segmented into words, representing around 4 hours of speech. Among these bulletins 23 are spoken by females and 10 by males. For this task the database was partitioned manually into news stories belonging to 8 different categories. The details of the database in terms of news stories are shown in Tables 1 and 2. There are no standard text corpora of news bulletins or newswire corpora in the Tamil language. As it is preferable to use bigram models trained on data pertaining to the domain of use, we used a bigram model derived from the limited training data for integration with the LSA model.

Figure 1: Block diagram for construction of W in the proposed speech enhanced LSA language model

Table 1: Description of the database in terms of stories
Story           No. of documents  Test set
Economics       4                 4
Events          10                4
Others          12                4
Politics        10                4
Sports          3                 4
War             16                3
Weather         6
World politics  6                 4
                                  92

Table 2: Database statistics
Training set    92    3,706    8136    40

5. Experimental evaluation

From the transcription of the training data we chose a limited vocabulary of 1,278 words (inclusive of the unknown word tag UNK) that had at least 4 occurrences in the training data. For the 643 training documents a matrix W of size 1278 x 643 is created. The average duration of these words in the database is 431 msec. Assuming a frame shift of 10 msec, 44 frames are chosen using the procedure mentioned in Section 3 to represent the word.
speech enhanced hybrid bigram +LSA model is shown in Table3for different thresholds of membership values,and a SVD order of75(optimal order balancing reconstruction error and noise suppres-Table3:Perplexity of the speech enhanced bigram+LSA model for different pattern representationand membership threshold,for a SVD order of75 Word pattern MembershipCompressed0.98from0.97572to600.960.92199196195197 Table4:Comparison of performance of three dif-ferent language models.LSA models use SVD oforder250Model Perplexityover bigramsBigram234Bigram+LSA199Speech enhanced21%sion).If the threshold is high(0.98)the matrix is similar to its text based counterpart in its sparseness.As the threshold is lowered,more elements of the ma-trix arefilled,which is like smoothing.The performance of the model improves marginally.For lower thresholds the performance is likely to deteriorate.This behaviour is observed for both the representations of the word pat-tern vectors.The performance of the model using pat-terns vectors compressed from390to60dimension is better than the model using pattern vectors compressed from572to60dimension.The performance comparison of the three different language models is shown in Table4.The perplexity of the speech enhanced hybrid-gram model is better than the standard bigram model by21%and shows an improvement of6%over the conventional text based bi-gram+LSA model for a SVD order of250.Order250is chosen due to its better performance over SVD order75.7.SummaryIn this study we proposed an approach for developing a speech enhanced multi-span language model.We have shown that the performance of the system is better than the text based bigram+LSA model for the limited vo-cabulary of words.No use of word level and document level smoothing[5]is made,which would further reduce the perplexity of the model.Different parameters of the system like word pattern vector representation,order of compression of the pattern vector,membership thresh-old,SVD order and scaling 
factor for the LSA probabilities, are not optimised; doing so may improve the performance of the language model. This method of indirect incorporation of the speech information may be a small step towards using speech level constraints in language models for better speech recognition performance. One limitation in extending the study is the lack of a large speech corpus of the size required for language modelling, segmented in terms of words, for Indian languages.

8. References

[1] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer, "Class-based n-gram models of natural language," Computational Linguistics, vol. 18, no. 4, pp. 467-479, 1992.

[2] C. Chelba and F. Jelinek, "Recognition performance of a structured language model," in Proc. 6th Eur. Conf. Speech Commun. Technol., Budapest, Hungary, Sept. 1999, vol. 4, pp. 1567-1570.

[3] R. Lau, R. Rosenfeld, and S. Roukos, "Trigger-based language models: A maximum entropy approach," in Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, Minneapolis, USA, Apr. 1993, vol. 2, pp. 45-48.

[4] N. Coccaro and D. Jurafsky, "Toward better integration of semantic predictors in statistical language modeling," in Proc. Int. Conf. Spoken Language Processing, Sydney, Australia, Dec. 1998, pp. 2403-2406.

[5] J. R. Bellegarda, "Exploiting latent semantic information in statistical language modeling," Proc. IEEE, vol. 88, no. 8, pp. 1279-1296, Aug. 2000.

[6] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse Processes, vol. 25, pp. 259-284, 1998.

[7] S. V. Gangashetty, C. Chandra Sekhar, and B. Yegnanarayana, "Dimension reduction using autoassociative neural network models for recognition of consonant-vowel units of speech," in Proc. Fifth Int. Conf. Advances in Pattern Recognition, ISI Calcutta, India, Dec. 2003, pp. 156-159.

[8] A. Nayeemulla Khan, Suryakanth V. Gangashetty, and S. Rajendran, "Speech database for Indian languages - A preliminary study," in Proc. Int. Conf. Natural Language Processing, NCST, Mumbai, India, Dec. 2002, pp. 295-301.
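The AANN-based dimensionality reduction used above can be sketched as follows. This is a simplified, hypothetical configuration (small illustrative layer sizes, tanh hidden units, plain batch gradient descent on reconstruction error), not the paper's 572L 858N kN 858N 572L network or its exact training regime; the compressed vectors are read off the bottleneck layer, as described in the text.

```python
import numpy as np

class AANN:
    """Five-layer autoassociative net: linear input/output units, tanh
    hidden units, and a narrow bottleneck whose activations serve as the
    compressed pattern vector (cf. the paper's 572L 858N kN 858N 572L;
    all dimensions below are illustrative stand-ins)."""

    def __init__(self, d_in, d_hidden, d_code, seed=0):
        rng = np.random.default_rng(seed)
        dims = [d_in, d_hidden, d_code, d_hidden, d_in]
        self.W = [rng.normal(0.0, 0.1, (a, b)) for a, b in zip(dims, dims[1:])]
        self.b = [np.zeros(b) for b in dims[1:]]

    def forward(self, X):
        acts = [X]
        for i, (W, b) in enumerate(zip(self.W, self.b)):
            Z = acts[-1] @ W + b
            # tanh on the hidden layers, linear on the output layer
            acts.append(Z if i == len(self.W) - 1 else np.tanh(Z))
        return acts

    def compress(self, X):
        # activations of the bottleneck (second hidden) layer
        return self.forward(X)[2]

    def train_step(self, X, lr=0.05):
        """One epoch of batch gradient descent on the reconstruction MSE."""
        acts = self.forward(X)
        delta = 2.0 * (acts[-1] - X) / X.shape[0]
        for i in range(len(self.W) - 1, -1, -1):
            gW, gb = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:  # propagate the error through the tanh non-linearity
                delta = (delta @ self.W[i].T) * (1.0 - acts[i] ** 2)
            self.W[i] -= lr * gW
            self.b[i] -= lr * gb
        return float(np.mean((acts[-1] - X) ** 2))

# Toy "word pattern vectors": 200 patterns of dimension 20, squashed to 4.
rng = np.random.default_rng(1)
X = np.tanh(rng.normal(size=(200, 20)))
net = AANN(d_in=20, d_hidden=30, d_code=4)
losses = [net.train_step(X) for _ in range(200)]
codes = net.compress(X)   # compressed feature vectors, shape (200, 4)
```

The compressed vectors (here `codes`) play the role of the word representations from which the word-document matrix is built.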
To appear in Proc. ICASSP-2000, Istanbul, Turkey

ENHANCED LANGUAGE MODELLING WITH PHONOLOGICALLY CONSTRAINED MORPHOLOGICAL ANALYSIS

A.C. Fang and M. Huckvale
Department of Phonetics and Linguistics
University College London
Gower Street WC1E 6BT, London, England

ABSTRACT

Phonologically constrained morphological analysis (PCMA) is the decomposition of words into their component morphemes conditioned by both orthography and pronunciation. This article describes PCMA and its application in large-vocabulary continuous speech recognition to enhance recognition performance in some tasks. Our experiments, based on the British National Corpus and the LOB Corpus for training data and WSJCAM0 for test data, show clearly that PCMA leads to smaller lexicon size, smaller language models, superior word lattices and a decrease in word error rates. PCMA seems to show most benefit in open-vocabulary tasks, where the productivity of a morph unit lexicon makes a substantial reduction in out-of-vocabulary rates.

1. INTRODUCTION

In this paper we present a novel approach towards the enhancement of language modelling that is achieved through phonologically constrained morphological analysis (PCMA). PCMA is the decomposition of word tokens into their component affixes and stems constrained by both orthography and pronunciation. In its simplest form, PCMA involves the analysis of words into a sequence of sub-word units which express the morphological structure of the word, subject to the constraint that the pronunciation of the whole is derivable simply from the concatenation of the pronunciations of the parts. As an example, PCMA accepts the decomposition of abandoned into abandon+ed since the pronunciation of the whole string may be concatenated from the parts. 
On the other hand, academician is not decomposed into academic+ian for the reason that the parts do not allow direct derivation of the pronunciation.

This paper describes our investigations into the use of PCMA for speech recognition, based mostly on the 100-million-word British National Corpus (BNC; [1]) for training and test material. In the following sections we describe the preparation of the training and test data and present baseline statistics obtained with the Abbot connectionist/HMM continuous speech recognition system ([2]; henceforward referred to as Abbot) from conventional word-based models. We then describe the PCMA approach towards language modelling in detail, present statistics obtained from PCMA models, and discuss comparisons with word-based language models in terms of lexicons, lattice scoring, perplexity measures, and finally word accuracy rates.

2. DATA AND BASELINE STATISTICS

2.1 Text and speech data

The BNC was used as a basis on which both training and test data sets were selected. There are 4,124 files in the corpus, 3,209 written and 815 transcribed speech. The written texts were randomised and 10 chunks of 10 million words were selected for use as training sets. The remainder of the written texts were divided into 10 chunks of one million words each for use as test sets.

To establish baseline statistics, two sets of language models were trained from the first 10m-word chunk in the training set (train-01-raw) and from the first and second chunks in the training set totalling about 20 million words (train-01-02-raw). The CMU-Cambridge Toolkit [3] was used for this purpose with linear discounting. The vocabulary sizes were set at 20k, 40k, and 65k. The pronunciations were mapped from a dictionary of British English Example Pronunciations (BEEP) [4]. 
A text-to-speech system was used to generate pronunciations for lexical items from the training sets that did not have a corresponding entry in BEEP.

For the first set of recognition experiments, 100 sentences (1,786 words) were randomly selected from the first test data set (BNC1). In addition, a further 100 sentences (2,002 words) were randomly chosen from the Lancaster-Oslo-Bergen Corpus (LOB1). These sentences were read by a single male speaker of British English in anechoic conditions and the recording was digitally acquired at 16 kHz.

2.2 Lexicons and OOV rates

Table 1 summarises the coverage of the pronunciation lexicons constructed from the training sets. As can be noted, BNC1 has a higher out-of-vocabulary (OOV) rate than LOB1, especially at the 65k level.

                                       OOV (%)
  Lexicon            Size      LOB1    BNC1
  train-01-raw-20k   19,998    9.23    9.28
  train-01-raw-40k   39,994    7.23    7.48
  train-01-raw-65k   64,978    5.85    6.79

Table 1: A summary of pronunciation lexicons

2.3 Perplexity measures

Perplexities were measured for the word models trained with the 10- and 20-million-word training sets. These were calculated using the CMU toolkit, which by default ignores OOV words. From Table 2, we can see that LOB1 has produced higher perplexities than BNC1. On the other hand, our small test samples seem to have significantly higher perplexities than the LOB corpus taken as a whole.

                 10m                   20m
         LOB1   BNC1   LOB     LOB1   BNC1   LOB
  20k    434    448    277     377    404    248
  40k    512    514    324     441    457    289
  65k    562    538    350     480    481    311

Table 2: Perplexities for word models

2.4 Lattice scores

The Abbot recognition system (version 0.76) was used to obtain our baseline lattice and word accuracy scores. Word lattices were generated with parameters which increased the default number of hypotheses to 100. The maximal scores for the lattices were calculated by finding the best path matching the correct transcription using a dynamic programming search. The overall performance could then be computed according to the number of incorrect matches, deletions, and insertions in the best path. 
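The dynamic-programming scoring just described can be illustrated at the string level. The sketch below aligns a hypothesis against a reference transcription by minimum edit distance and tallies substitutions, deletions, and insertions; it is a simplified stand-in for the lattice best-path search (searching a word lattice applies the same recurrence over lattice arcs rather than a single string), and the example sentences are invented.

```python
def align_counts(ref, hyp):
    """Minimum-edit-distance alignment of a hypothesis word sequence
    against the reference transcription, returning the tuple
    (substitutions, deletions, insertions) on the best path."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        e, s, d, ins = cost[i - 1][0]
        cost[i][0] = (e + 1, s, d + 1, ins)          # delete a reference word
    for j in range(1, m + 1):
        e, s, d, ins = cost[0][j - 1]
        cost[0][j] = (e + 1, s, d, ins + 1)          # insert a hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # match/substitute, delete, or insert: keep the cheapest path
            e, s, d, ins = cost[i - 1][j - 1]
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            best = (e + sub, s + sub, d, ins)
            e, s, d, ins = cost[i - 1][j]
            if e + 1 < best[0]:
                best = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            if e + 1 < best[0]:
                best = (e + 1, s, d, ins + 1)
            cost[i][j] = best
    return cost[n][m][1:]

def word_accuracy(ref, hyp):
    """Accuracy (%) = (N - S - D - I) / N over N reference words."""
    S, D, I = align_counts(ref, hyp)
    return 100.0 * (len(ref) - S - D - I) / len(ref)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat mat".split()   # one substitution, one insertion
acc = word_accuracy(ref, hyp)
```

Scoring a lattice's best path against the transcription in this way yields the maximal (oracle) scores reported in the tables that follow.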
Table 3 summarises the word lattice scores for the two test sets with lexicons of various sizes. It is noticeable that the overall performance increases with lexicon size, showing that coverage is an issue.

         LOB1 (%)   BNC1 (%)
  20k    86.4       86.6
  40k    88.4       88.6
  65k    89.1       88.7

Table 3: Word lattice scores

2.5 Word accuracy rates

Table 4 summarises the performance of Abbot with language models trained with the 10m-word training data set (train-01). Three vocabulary sets of respective sizes 20k, 40k, and 65k were used in the training of the language models.

         LOB1 (%)   BNC1 (%)
  20k    54.7       52.9
  40k    56.3       54.6
  65k    55.8       55.0

Table 4: Word recognition scores (10m-word models)

Table 5 summarises the performance of Abbot with a language model trained with the 20m-word training set (train-01-02).

         LOB1 (%)   BNC1 (%)
  20k    56.4       54.5
  40k    57.4       55.2
  65k    57.6       55.3

Table 5: Word recognition scores (20m-word models)

Across the various vocabulary sizes, Abbot performed consistently better with the larger language models. The better performance on LOB1 is probably related to its smaller perplexity and smaller OOV rate.

3. PCMA LANGUAGE MODELS

The use of morphological analyses in the construction of language models is motivated by the benefits that stem from a reduction in the size of the lexicon. In addition, a morph-based pronunciation dictionary has fewer minimally different pronunciation pairs than a wordform dictionary. However, in our approach, the morphological analysis is not simply a process whereby words are decomposed into various parts according to their prefixes and suffixes. In order that morphological words or word parts may be reconstructed back into their corresponding orthographic forms, the decomposition itself has to be conditioned by phonological constraints. This ensures that a legal pronunciation may be directly generated from the decomposed parts and that the decoder used in speech recognition need not be affected. 
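The phonological constraint can be sketched as a simple check: a rule-proposed decomposition is retained only when concatenating the parts' pronunciations reproduces the whole word's pronunciation. The phone strings below are those of the paper's abandoned/academician example; the dictionary here is a tiny hypothetical stand-in for BEEP.

```python
def pcma_check(word, parts, pron):
    """Accept a rule-proposed morphological decomposition only if the
    concatenated part pronunciations equal the whole-word pronunciation
    (the paper's phonological constraint).  `pron` maps orthographic
    units to phone sequences; returns the parts, or None if rejected."""
    try:
        concat = [ph for part in parts for ph in pron[part]]
    except KeyError:
        return None  # a part has no pronunciation in the lexicon
    return parts if concat == pron[word] else None

# Phone sequences taken from the paper's example (BEEP-style phones).
pron = {
    "abandon":     "ax b ae n d ax n".split(),
    "-ed":         ["d"],
    "abandoned":   "ax b ae n d ax n d".split(),
    "academic":    "ae k ax d eh m ih k".split(),
    "-ian":        "ia n".split(),
    "academician": "ax k ae d ax m ih sh n".split(),
}

ok = pcma_check("abandoned", ["abandon", "-ed"], pron)        # accepted
bad = pcma_check("academician", ["academic", "-ian"], pron)   # rejected
```

The rule-based decomposition step (the first of the two operations) is assumed to have already proposed the candidate parts; only the second, constraining step is shown here.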
As an example, in our approach, the decomposition of abandoned into abandon + -ed is allowable because the pronunciation may be constructed from the parts:

  ABANDON   = ax b ae n d ax n
  -ED       = d
  ABANDONED = ax b ae n d ax n d

On the other hand, the decomposition of academician is not allowed since its pronunciation cannot be reconstructed from its parts, i.e., academic and -ian:

  ACADEMIC    = ae k ax d eh m ih k
  -IAN        = ia n
  ACADEMICIAN = ax k ae d ax m ih sh n

Once the decomposition is successful, the word is represented as a sequence of its component parts, with a trailing hash sign (#) indicating the presence of a prefix and a leading hyphen (-) indicating a suffix. As an example, disregarded is decomposed into three parts: dis# regard -ed. PCMA is therefore a process with two sequential operations. Firstly, the word in question is decomposed into its corresponding morphological parts according to rules. Secondly, the decomposition is constrained by a pronunciation lexicon (BEEP, in our experiment) so that the system only retains those component parts that allow for the direct derivation of the pronunciation of the original word.

Based on [5], a total of 115 prefixes and suffixes were built into the morphological analyser. Table 6 lists the 30 most frequent affixes, together with frequency counts, extracted from the LOB corpus for the 52,703 word types resulting from the morphological analysis.

  -s      4353    -ing    2527    co#     1909
  -d      1854    -es     1693    -ed     1599
  -er     1546    re#     1481    -ly     1275
  in#     1036    de#      889    -al      849
  di#      786    un#      767    -ion     694
  be#      471    ex#      432    pro#     418
  pre#     405    dis#     396    -ies     369
  -en      350    -or      331    -ation   318
  en#      316    -ment    278    -ions    268
  -ity     264    -able    263    -ness    253

Table 6: The 30 most frequent affixes and their frequencies in the LOB corpus

The use of morph units increases the number of tokens used in language modelling by about 14%.

4. COMPARISON OF PCMA MODELS WITH WORD MODELS

Comparisons were made between PCMA and word-based models in terms of lexicon size, perplexity, model size, lattice scores, and word recognition rates.

4.1 Lexicon size

Our experiments show that morphological analysis substantially reduces the lexicon size. 
Take the 65k lexicon as an example. Of its 64,978 items, 32,323 (49.7%) can be analysed by the morphological units listed in Table 6. Phonological constraints reduce this number slightly to 21,663 items (33.3%), which results in a reduction of 29.2% for the lexicon as a whole. Table 7 summarises the sizes and OOV rates of the PCMA lexicons.

                              OOV (%)
        Lexicon   Red.
        Size      (%)     LOB1    BNC1
  20k   13,370    33.2    7.03    7.08
  40k   25,158    37.1    6.09    5.90
  65k   46,000    29.2    5.02    5.45

Table 7: Sizes and OOV rates of the PCMA lexicons

As well as reducing the size of the lexicon, PCMA also reduces the OOV rate, since many OOV words are simply different morphological inflexions of units in the lexicon. OOV rates for LOB1 and BNC1 are now comparable across the different vocabulary sizes.

4.2 Perplexity scores

Perplexity scores for the PCMA models are listed in Table 8. Morph-sequence perplexities are about 55% of the word-sequence perplexities. For instance, when trained with the 20m-word set, the 65k PCMA model has a perplexity of 218 with LOB1 and 207 with BNC1, a reduction of respectively 54.5% and 56.8% compared with Table 2.

             LOB1          BNC1          LOB
        10m    20m    10m    20m    10m    20m
  20k   207    187    200    183    144    131
  40k   224    201    220    200    158    144
  65k   244    218    227    207    169    153

Table 8: Morph perplexities for PCMA models

However, to compare morph-sequence perplexities directly with word-sequence perplexities, it is necessary to compensate for the fact that there are about 14% more morph-unit tokens in the test data than there are word-unit tokens. Scaling the log-probabilities accordingly shows that the morph-mapped word perplexities are actually slightly higher than the word-based perplexities. Any improvements in word error rate are therefore probably not due to decreased perplexity alone.

To confirm that the reductions in perplexity were not simply due to the reduction in lexicon size, three additional lexicons were constructed containing full 20k, 40k, and 65k PCMA items selected according to frequency of use. With models trained with 20m words, the LOB corpus yielded 173, 187, and 193 as perplexity scores. 
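The token-ratio compensation described earlier in this section can be made concrete. Perplexity is the inverse probability of the test data normalised per token, so scaling the total log-probability by the morph/word token ratio amounts to raising the morph perplexity to the power N_morph/N_word. The 1.14 ratio below is the paper's approximate average; the exact ratio varies by test set, so the result is indicative only.

```python
def morph_to_word_perplexity(pp_morph, token_ratio=1.14):
    """Rescale a morph-sequence perplexity to a word-sequence scale.

    Perplexity is P(test) ** (-1/N) for N tokens, so scaling the total
    log-probability by the morph/word token ratio gives
        PP_word_equiv = PP_morph ** (N_morph / N_word).
    The default 1.14 reflects the paper's ~14% morph-token inflation."""
    return pp_morph ** token_ratio

# 65k PCMA model, 20m-word training, LOB1 test set (Table 8): morph PP 218.
pp_word_equiv = morph_to_word_perplexity(218)   # roughly 460
```

As the text notes, once mapped this way the morph-unit perplexities land in the same region as (and slightly above) their word-based counterparts, so the word error rate gains cannot be attributed to perplexity alone.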
When compared to Table 2, the reductions are respectively 30.2%, 35.3%, and 50.8%, suggesting that the reductions are not merely due to the reduction in lexicon size.

4.3 Language model size

As a result of the lexicon size reduction (cf. Table 7), the language model size is correspondingly reduced. Table 9 shows that, for language models trained with 10 million words at different vocabulary sizes, the reduction rate is about 25% for bigrams and 10% for trigrams.

            No. of bigrams                 No. of trigrams
  Size    Word      PCMA      Red.(%)   Word      PCMA      Red.(%)
  20k     1333754   1002651   24.8      4489914   4028923   10.3
  40k     1542090   1129039   26.8      4798293   4243928    9.1
  65k     1748975   1309768   25.1      5127667   4568349   10.9

Table 9: Reduction of model size (10m-word models)

When the training data size increases to 20 million words, the reduction rate becomes correspondingly higher. At the 40k level, according to Table 10, the reduction in the number of bigrams is as high as 29.2% and that of trigrams is 14.6%.

            No. of bigrams                 No. of trigrams
  Size    Raw       Morph     Red.(%)   Raw       Morph     Red.(%)
  20k     2049486   1488535   27.4      7831775   6791797   13.3
  40k     2409319   1705237   29.2      8470465   7232267   14.6
  65k     2708455   1961257   27.6      9017522   7753435   14.0

Table 10: Reduction of model size (20m-word models)

4.4 Lattice scores

The results for maximal morph accuracy are listed in Table 11. It is significant that lattices generated through PCMA lexicons have achieved the maximal possible performance, i.e., 100% minus the OOV rates. In Table 3, by contrast, the word error rate is nearly twice the OOV rate.

         LOB1 (%)   BNC1 (%)
  20k    92.2       93.4
  40k    94.3       95.2
  65k    94.9       95.2

Table 11: Maximal morph lattice accuracy

Direct comparison between morph-unit lattices and word lattices is difficult because the average length of the units is shorter in the morph-unit lattice.

4.5 Word accuracy rates

Finally, word recognition rates were obtained with the PCMA models. 
According to Table 12, models trained with the 10m-word training set scored nearly 4% better than the conventional word models (shown in Table 4).

         LOB1 (%)   BNC1 (%)
  20k    58.6       55.7
  40k    60.2       56.4
  65k    59.9       57.6

Table 12: Word recognition scores for PCMA models (10m-word training set)

With the BNC1 test set, word recognition improvements over the conventional models varied slightly across the different vocabulary sets: 2.8%, 1.8%, and 2.6%. Table 13 summarises the performance of PCMA models trained from 20 million words from the BNC. Compared to Table 5, the improvements across the different vocabulary sets were respectively 2.7%, 3.6%, and 3.8% for LOB1, and 1.0%, 1.2%, and 2.0% for BNC1.

         LOB1 (%)   BNC1 (%)
  20k    59.1       55.5
  40k    61.0       56.4
  65k    61.4       57.3

Table 13: Word recognition scores for PCMA models (20m-word training set)

While the increase in training data size has resulted in better word accuracy rates for LOB1, the performance seems to have deteriorated for the other test set, BNC1. This deterioration is mainly due to an increase in the number of insertions.

4.6 Closed-vocabulary tests

The recognition tasks described in Section 4.5 are characterised by relatively large vocabularies and OOV rates: the materials were taken from general English corpora. To compare these results with more conventional recognition materials, PCMA modelling was also applied to the test materials of the WSJCAM0 database [6, 7]. The 1,105 test sentences were divided into two groups according to whether they arose from the 5k or the 20k WSJ lexicons. Table 14 summarises the recognition results for a 20k-word, a 65k-word, a 20k-equivalent morph-unit and a 65k-equivalent morph-unit lexicon.

  Model              Word                     PCMA
  Voc. size     20k         65k         20k         65k
  Test size   5k    20k   5k    20k   5k    20k   5k    20k
  Results     58.8  61.5  64.9  66.2  60.3  63.2  63.4  65.2
  Overall       60.3        65.6        61.9        64.3

Table 14: Results from closed-vocabulary tests

At the 20k level, PCMA models were marginally better than the word-based models (61.9% vs 60.3%), whereas at the 65k level the performance of PCMA fell below that of the word-based models. This was probably due to the fact that the words in the test material were adequately covered by the larger word lexicon.

5. 
CONCLUSION

As described in this article, PCMA has shown a number of benefits through empirical tests:

• Reduced lexicon size: PCMA generates a much smaller lexicon for the same coverage, a reduction of about 30% relative to the conventional pronunciation lexicon.

• Enhanced lattices: a larger proportion of correct readings is found in morph lattices than in word lattices. In fact, the morph lattices are at near-maximum performance.

• Reduced perplexities: morph-sequence perplexities are only about 50% of equivalent word-sequence perplexities.

• Reduced language model size: PCMA reduces the number of word bigrams by about 25% and word trigrams by about 10%.

• Increased word accuracy rates: PCMA reduced the word error rate by about 2% absolute and about 5% relative, although this improvement was observed only in open-vocabulary recognition tasks.

We conclude that PCMA obtains most of its effect through the increased productivity of a fixed-size lexicon. In tasks with high OOV rates, such as those derived from the BNC, the increase in coverage compensates for deficiencies arising from the use of fewer, smaller units. There seems to be no benefit with regard to language model perplexity, which might be expected since the trigram morph-unit model operates on a smaller 'window' of the sentence. The increase in morph lattice rates could be due both to a decrease in lexicon size and to a decrease in the number of minimally different pronunciations in the lexicon.

It is possible that some of the deficiencies of the morph-unit model could be addressed by further work, in particular by adding phonological constraints on morph-unit combinations in a recognition post-processor, or by interpolating word and morph-unit language models.

ACKNOWLEDGEMENTS

The work was supported in part by the Engineering and Physical Sciences Research Council, UK, Grant No. GR/L81406. We thank Tony Robinson and Steve Renals for assistance with the Abbot recognition system.

REFERENCES

[1] Burnard L. 
Users Reference Guide for the British National Corpus. Oxford University Computing Services, Oxford, 1995.

[2] Hochberg M., Renals S., and Robinson A. "ABBOT: The CUED Hybrid Connectionist-HMM Large Vocabulary Recognition System". Proceedings of Language Technology Workshop, Austin, Texas, Jan. 1995.

[3] Clarkson P. and Rosenfeld R. "Statistical Language Modeling using the CMU-Cambridge Toolkit". Proceedings of Eurospeech, 1997.

[4] Robinson T., Fransen J., Pye D., Foote J. and Renals S. "WSJCAM0: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition". Proceedings of ICASSP, 1995, pages 81-84.

[5] Quirk R., Greenbaum S., Leech G., and Svartvik J. A Grammar of Contemporary English. Longman, London, 1972.

[6] Fransen J., Pye D., Robinson T., Woodland P., and Young S. WSJCAM0 Corpus and Recording Description. Linguistic Data Consortium, 1994.

[7] Paul D.B. and Baker J.M. "The design for the Wall Street Journal-based CSR corpus". Proceedings of Fifth DARPA Speech and Natural Language Workshop, 1992, pages 357-362.