PROSODY MODELS FOR CONVERSATIONAL SPEECH RECOGNITION
Teacher Certification English: Intensive Theory Course - Linguistics

[Remarks before class]
1. Today we begin the three linguistics sessions. In class the emphasis is on understanding, so follow the teacher's line of reasoning first; vocabulary can be memorized after class, and please do not keep flooding the chat with vocabulary questions during the lecture. The words to memorize for each part are summarized on the slides; review the slides after class.
2. Try not to let your attention drift during the linguistics sessions; if it does, you can listen to the replay afterwards. Linguistics requires a good deal of comprehension; if you miss a point, keep following along. The topics are not strongly sequential, so do not interrupt the flow of the class.
3. Some topics are covered in detail and others only briefly. The briefly covered parts are not tested, and questions about non-tested material will not be answered in class. The parts checked off in class are the ones to memorize.
4. Linguistics: the exam does not test the definitions themselves. Linguistics is the study of language, e.g., its structure, origin, history, and development. Number of questions: 1 to 4.

[Notes]
1. The overview of linguistics is divided into three parts, proceeding from the general to the specific: we first survey linguistics as a whole and then work through the parts one by one.
2. Linguistics is usually tested with 1 to 4 questions, i.e., 2 to 8 points; take every point you can. Unlike grammar, as long as you understand the lectures and memorize the required words, the full 8 points are within reach.
3. Syntax is not tested, so it is not covered further here. Rhetoric and second language acquisition are tested less often; we will pick out their key concepts to explain. Content marked in red is key material.

Part 1: Overview of Linguistics
1. The classification of linguistics
2. The design features of language
3. The main functions of language
[Notes] The classification of linguistics only needs a quick read-through; it is not tested. The two items marked in red have appeared on past exams.

01 The classification of linguistics
[Notes] Linguistics divides into microlinguistics and macrolinguistics; the exam focus is the material marked in red.
1. Microlinguistics: from phonetics to pragmatics, studying language from four angles: sound, form, meaning, and use.
(1) Phonetics: the study of vowels and consonants.
(2) Phonology: the study of the internal structure and patterning of speech sounds.
(3) Morphology: the study of word forms; this part centers on the word.
(4) Semantics: the study of linguistic meaning.
(5) Pragmatics: the study of language in use.
2. Macrolinguistics: the study of language in combination with other disciplines. The exam will not go into that much complexity.

02 Design features of language
The design features of language are the properties, unique to humans, that distinguish human language from any other animal communication system.
interlanguage: the approximate language system that a second language learner constructs, which represents his or her transitional competence in the target language.
fossilization: a process that sometimes occurs in second language learning in which incorrect linguistic features (such as an accent or a grammatical pattern) become a permanent part of the way a person speaks or writes in the target language.
holophrase: a single word that appears in children's early speech and functions as a complex idea or sentence.
holophrastic sentences: children's one-word utterances. They are called holophrastic sentences because they can be used to express a concept or predication that would be associated with an entire sentence in adult speech.
telegraphic speech: the early speech of children, so called because it lacks the same sorts of words which adults typically leave out of telegrams.
input: the language which a learner hears or receives and from which he or she can learn.
caretaker speech: simple, modified speech used by parents, baby-sitters, etc. when they talk to young children who are acquiring their native language.
behaviorist learning theory: a theory of psychology which, when applied to first language acquisition, suggests that the learner's verbal behavior is conditioned or reinforced through association between stimulus and response.
language transfer: the effect of first language knowledge on the learning of a second language.
interference: the use of a first-language rule which leads to an error or inappropriate form in the target language, because the L1 pattern differs from its counterpart in the target language.
contrastive analysis: a comparative procedure used to establish linguistic differences between two languages so as to predict learning difficulties caused by interference from the learner's first language, and to prepare the type of teaching materials that will reduce the effects of interference.
linguistic determinism: a theory put forward by the American anthropological linguists Sapir and Whorf, which states that the way people view the world is determined by the structure of their native language.
linguistic relativism: Whorf believed that speakers of different languages perceive and experience the world differently, that is, relative to their linguistic background; hence the notion of linguistic relativism.
overt thought: a term used to refer to speech when language and thought are identical or closely parallel to each other; we may regard speech as "overt thought."
subvocal speech: a term used to refer to thought when thought and language are identical or closely parallel to each other.
linguistic lateralization: hemispheric specialization or dominance for language.
dichotic listening: a research technique which has been used to study how the brain controls hearing and language. The subjects wear earphones, simultaneously receive different sounds in the right and left ear, and are then asked to repeat what they hear.
lingua franca: a variety of language that serves as a common speech for social contact among groups of people who speak different native languages or dialects.
pidgin: a marginal contact language with a limited vocabulary and reduced grammatical structures, used by native speakers of other languages as a means of business communication.
creole: a creole language is originally a pidgin that has become established as a native language in some speech community. When a pidgin comes to be adopted by a population as its primary language, and children learn it as their first language, the pidgin is called a creole.
diglossia: a sociolinguistic situation in which two very different varieties of language co-exist in a speech community, each serving a particular social function and used in a particular situation.
bilingualism: a linguistic situation in which two standard languages are used either by an individual or by a group of speakers, such as the inhabitants of a particular region or a nation.
ethnic dialect: a social dialect of a language, often cutting across regional differences. An ethnic dialect is spoken mainly by a less privileged population that has experienced some form of social isolation, such as racial discrimination or segregation.
slang: a casual use of language that consists of expressive but non-standard vocabulary, typically of arbitrary, flashy and often ephemeral coinages and figures of speech characterized by spontaneity and sometimes by raciness.
linguistic taboo: an obscene, profane, or swear word or expression that is prohibited from general use by the educated and "polite" society.
euphemism: a word or expression that is thought to be mild, indirect, or less offensive, and used as a polite substitute for the supposedly harsh and unpleasant word or expression.
idiolect: a personal dialect of an individual speaker that combines aspects of all the elements regarding regional, social, and stylistic variation, in one form or another.
register: a functional speech or language variety that involves degrees of formality depending on the speech situation concerned.
protolanguage: the original (or ancestral) form of a language family which has ceased to exist.
haplology: the loss of one of two phonetically similar syllables in sequence.
cognate: a word in one language which is similar in form and meaning to a word in another language because both languages have descended from a common source.
acronym: a word created by combining the initials of a number of words.
apocope: the deletion of a word-final vowel segment.
epenthesis: the insertion of a consonant or vowel sound into the middle of a word.
metathesis: sound change as a result of sound movement; it involves a reversal in position of two neighbouring sound segments.
error analysis: an approach to the study and analysis of the errors made by second language learners which suggests that many learner errors are not due to the learner's mother-tongue interference but reflect universal learning strategies such as overgeneralization and simplification of rules.
diacritics: a set of symbols which can be added to the letter-symbols to make finer distinctions than the letters alone make possible.
voiceless: when the vocal cords are drawn wide apart, letting air pass through without causing vibration, the sounds produced in this condition are called voiceless sounds.
voiced: sounds produced while the vocal cords are vibrating are called voiced sounds.
vowel: the sounds in the production of which no articulators come very close together and the air stream passes through the vocal tract without obstruction.
consonant: the sounds in the production of which there is an obstruction of the air stream at some point of the vocal tract.
phone: phones are the speech sounds we use when speaking a language. A phone is a phonetic unit or segment; it does not necessarily distinguish meaning.
phoneme: a collection of abstract phonetic features; a basic unit in phonology. It is represented or realized as a certain phone in a certain phonetic context.
allophone: the different phones which can represent a phoneme in different phonetic environments are called the allophones of that phoneme, for example the clear [l] and the dark [ɫ] in English.
phonemic contrast: the relation between two phonemes. If two phonemes can occur in the same environment and distinguish meaning, they are in phonemic contrast.
complementary distribution: the relation between two similar phones which are allophones of the same phoneme and occur in different environments.
minimal pair: when two different forms are identical in every way except for one sound segment which occurs in the same place in the strings, the two words form a minimal pair, for example bin and pin.
affix: morphemes manifesting various grammatical relations or grammatical categories such as number, tense, degree and case.
inflection: the manifestation of various grammatical relationships, such as number, tense, degree and case, through the addition of inflectional affixes.
derivation: a process of word formation by which derivational affixes are added to an existing form to create a word.
linguistic competence: universally found in the grammars of all human languages, syntactic rules comprise the system of internalized linguistic knowledge of a language speaker, known as linguistic competence.
finite clause: a clause that takes a subject and a finite verb and at the same time stands structurally alone. (A simple sentence satisfies the structural requirements of a finite clause.)
hierarchical structure: the sentence structure that groups words into structural constituents and shows the syntactic category of each structural constituent, such as NP and VP.
grammatical relations: the structural and logical functional relations of constituents.
X-bar theory: a general and highly abstract schema that collapses all phrase structure rules into a single format: XP → (Specifier) X′; X′ → X (Complement).
transformational rules: rules that transform one sentence type into another type.
Move α: a general movement rule accounting for the syntactic behavior of any constituent movement.
Universal Grammar: a system of linguistic knowledge which consists of some general principles and parameters about natural languages.
hyponymy: the sense relation between a more general, more inclusive word and a more specific word. The more general word is called a superordinate, and the more specific words are called its hyponyms.
antonymy: the relation of oppositeness of meaning (on different dimensions).
argument: a logical participant in a predication, largely identical with the nominal element(s) in a sentence.
grammatical meaning: the grammatical meaning of a sentence refers to its grammaticality, i.e., its grammatical well-formedness. The grammaticality of a sentence is governed by the grammatical rules of the language.
two-place predication: a predication which contains two arguments. The predication is the abstraction of the meaning of a sentence.
constative: constatives are statements that either state or describe, and are verifiable.
performative: performatives are sentences that do not state a fact or describe a state, and are not verifiable; their function is to perform a particular speech act.
locutionary act: the act of uttering words, phrases and clauses; the act of conveying literal meaning by means of syntax, lexicon and phonology.
illocutionary act: the act of expressing the speaker's intention; the act performed in saying something.
perlocutionary act: the act performed by, or resulting from, saying something; the consequence of, or the change brought about by, the utterance.
conversational implicature: most violations of the cooperative principle give rise to what Paul Grice calls "conversational implicatures." When we violate any of the maxims, our language becomes indirect and implies an extra meaning.
clipping: a kind of abbreviation of otherwise longer words or phrases.
tone: tones are pitch variations, caused by the differing rates of vibration of the vocal cords.
intonation: when pitch, stress and sound length are tied to the sentence rather than to the word in isolation, they are collectively known as intonation.
root: a root is often seen as part of a word; it can never stand by itself although it bears clear, definite meaning; it must be combined with another root or an affix to form a word.
prefix: prefixes occur at the beginning of a word. Prefixes modify the meaning of the stem, but they usually do not change the part of speech of the original word.
suffix: suffixes are added to the end of stems; they modify the meaning of the original word and in many cases change its part of speech.
sentence: a structurally independent unit that usually comprises a number of words to form a complete statement, question or command.
embedded clause and matrix clause: the incorporated, or subordinate, clause is normally called an embedded clause, and the clause into which it is embedded is called a matrix clause.
syntactic category: apart from sentences and clauses, a syntactic category usually refers to a word (called a lexical category) or a phrase (called a phrasal category) that performs a particular grammatical function.
speech variety: any distinguishable form of speech used by a speaker or group of speakers. A speech variety may be lexical, phonological, morphological, syntactic, or a combination of linguistic features.
sequential rules: the rules that govern the combination of sounds in a particular language.
assimilation rule: the assimilation rule assimilates one sound to another by "copying" a feature of a sequential phoneme, thus making the two phones similar.
synchronic and diachronic study: the description of a language at some point in time is a synchronic study; the description of a language as it changes through time is a diachronic study.
langue and parole: langue refers to the abstract linguistic system shared by all the members of a speech community, and parole refers to the realization of langue in actual use.
competence and performance: competence is the ideal user's knowledge of the rules of his language, and performance is the actual realization of this knowledge in linguistic communication.
case condition: as required by the case condition principle, a noun phrase must have case, and case is assigned by V (verb) or P (preposition) to the object position, or by AUX (auxiliary) to the subject position.
adjacency condition: a condition on case assignment which states that a case assignor and a case recipient should stay adjacent to each other.
Great Vowel Shift: a series of systematic sound changes at the end of the Middle English period, approximately between 1400 and 1600, in the history of English, that involved seven long vowels and consequently led to one of the major discrepancies between English pronunciation and its spelling system.
sound assimilation: the physiological effect of one sound on another. In an assimilative process, successive sounds are made identical, or more similar, to one another in terms of place or manner of articulation, or of haplology.
domain: the phenomenon that most bilingual communities have one thing in common, namely a fairly clear functional differentiation of the two languages in respect of speech situations, for example the Home Domain, the Employment Domain, etc.
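Two of the phonology terms above, minimal pair and phonemic contrast, amount to a small decision procedure, so a toy sketch may help fix the definition. The Python function below is my own illustration (segment lists are a simplification of real phonetic transcription), not part of any textbook.

```python
def is_minimal_pair(a, b):
    """True if the two forms are identical except for exactly one
    segment occurring at the same position (the definition above)."""
    if len(a) != len(b):
        return False
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences == 1

# Segments are given as lists so multi-character symbols stay intact.
print(is_minimal_pair(["b", "i", "n"], ["p", "i", "n"]))  # True: /b/ vs /p/ contrast
print(is_minimal_pair(["b", "i", "n"], ["b", "i", "n"]))  # False: no differing segment
```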
Coordination and context-dependence in the generation of embodied conversation

Justine Cassell*, Matthew Stone†, Hao Yan*
*Media Laboratory, MIT, E15-315, 20 Ames, Cambridge MA, {justine,yanhao}@
†Department of Computer Science & Center for Cognitive Science, Rutgers University, 110 Frelinghuysen, Piscataway NJ 08854-8019, mdstone@

Abstract

We describe the generation of communicative actions in an implemented embodied conversational agent. Our agent plans each utterance so that multiple communicative goals may be realized opportunistically by a composite action including not only speech but also coverbal gesture that fits the context and the ongoing speech in ways representative of natural human conversation. We accomplish this by reasoning from a grammar which describes gesture declaratively in terms of its discourse function, semantics and synchrony with speech.

1 Introduction

When we are face-to-face with another human, no matter what our language, cultural background, or age, we virtually all use our faces and hands as an integral part of our dialogue with others. Research on embodied conversational agents aims to imbue interactive dialogue systems with the same nonverbal skills and behaviors (Cassell, 2000a).

There is good reason to think that nonverbal behavior will play an important role in evoking from users the kinds of communicative dialogue behaviors they use with other humans, and thus allow them to use the computer with the same kind of efficiency and smoothness that characterizes their dialogues with other people. For example, (Cassell and Thórisson, 1999) show that humans are more likely to consider computers lifelike, and to rate their language skills more highly, when those computers display not only speech but appropriate nonverbal communicative behavior. This argument takes on particular importance given that users repeat themselves needlessly, mistake when it is their turn to speak, and so forth when interacting with voice dialogue systems (Oviatt, 1995). In life, noisy situations like these provoke the non-verbal modalities to come into play (Rogers, 1978).

In this paper, we describe the generation of communicative actions in an implemented embodied conversational agent. Our generation framework adopts a goal-directed view of generation and casts knowledge about communicative action in the form of a grammar that specifies how forms combine, what interpretive effects they impart and in what contexts they are appropriate (Appelt, 1985; Moore, 1994; Dale, 1992; Stone and Doran, 1997). We expand this framework to take into account findings, by ourselves and others, on the relationship between spontaneous coverbal hand gestures and speech. In particular, our agent plans each utterance so that multiple communicative goals may be realized opportunistically by a composite action including not only speech but also coverbal gesture. By describing gesture declaratively in terms of its discourse function, semantics and synchrony with speech, we ensure that coverbal gesture fits the context and the ongoing speech in ways representative of natural human conversation. The result is a streamlined implementation that instantiates important theoretical insights into the relationship between speech and gesture in human-human conversation.

2 Exploring the relationship between speech and gesture

To generate embodied communicative action requires an architecture for embodied conversation; ours is provided by the agent REA ("Real Estate Agent"), a computer-generated humanoid that has an articulated graphical body, can sense the user passively through cameras and audio input, and supports communicative actions realized in speech with intonation, facial display, and animated gesture. REA currently offers the reasoning and display capabilities to act as a real estate agent showing users the features of various models of houses that appear on-screen behind her. We use existing features of REA here as a research platform for implementing models of the relationship between speech and spontaneous hand gestures during conversation. For more details about the functionality of REA see (Cassell, 2000a).

Evidence from many sources suggests that this relationship is a close one. About three-quarters of all clauses in narrative discourse are accompanied by gestures of one kind or another (McNeill, 1992), and within those clauses, the most effortful part of gestures tends to co-occur with or just before the phonologically most prominent syllable of the accompanying speech (Kendon, 1974).

Of course, communication is still possible without gesture. But it has been shown that when speech is ambiguous (Thompson and Massaro, 1986) or in a speech situation with some noise (Rogers, 1978), listeners do rely on gestural cues (and, the higher the noise-to-signal ratio, the more facilitation by gesture). Similarly, Cassell et al. (1999) established that listeners rely on information conveyed only in gesture as they try to comprehend a story.

Most interesting in terms of building interactive dialogue systems is the semantic and pragmatic relationship between gesture and speech. The two channels do not always manifest the same information, but what they convey is virtually always compatible. Semantically, speech and gesture give a consistent view of an overall situation. For example, gesture may depict the way in which an action was carried out when this aspect of meaning is not depicted in speech. Pragmatically, speech and gesture mark information about this meaning as advancing the purposes of the conversation in a consistent way. Indeed, gesture often emphasizes information that is also focused pragmatically by mechanisms like prosody in speech (Cassell, 2000b). The semantic and pragmatic compatibility seen in the gesture-speech relationship recalls the interaction of words and graphics in multimodal presentations (Feiner and McKeown, 1991; Green et al., 1998; Wahlster et al., 1991). In fact, some suggest (McNeill, 1992) that gesture and speech arise together from an underlying representation that has both visual and linguistic aspects, and so the relationship between gesture and speech is essential to the production of meaning and to its comprehension.

This theoretical perspective on speech and gesture involves two key claims with computational import: that gesture and speech reflect a common conceptual source; and that the content and form of a gesture is tuned to the communicative context and the actor's communicative intentions. We believe that these characteristics of the use of gesture are universal, and see the key contribution of this work as providing a general framework for building dialogue systems in accord with them. However, a concrete implementation requires more than just generalities behind its operation; we also need an understanding of the precise ways gesture and speech are used together in a particular task and setting.

To this end, we collected a sample of real-estate descriptions in line with what REA might be asked to provide. To elicit each description, we asked one subject to study a video and floor plan of a particular house, and then to describe the house to a second subject (who did not know the house and had not seen the video). During the conversation, the video and floor plan were not available to either subject; the listener was free to interrupt and ask questions.

The collected conversations were transcribed, yielding 328 utterances and 134 referential gestures, and coded to describe the general communicative goals of the speaker and the kinds of semantic features realized in speech and gesture.

Analysis of the data revealed that for roughly 50% of the gesture-accompanied utterances, gestural content was redundant with speech; for the other 50%, gesture contributed content that was different, but complementary, to that contributed by speech. In addition, the relationship between content of gesture, content of speech and general communicative functions in house descriptions could be captured by a small number of rules; these rules are informed by and accord with our two key claims about speech and gesture. For example, one rule describes dialogue contributions whose general function was what we call presentation: to advance the description of the house by introducing a single new object. These contributions tended to be made up of a sentence that asserted the existence of an object of some type, accompanied by a non-redundant gesture that elaborated the shape or location of the object. Our approach casts this extended description of a new entity, mediated by two compatible modalities, as the speaker's expression of one overall function of presentation. (1) is a representative example.

(1) It has [a nice garden]. (right hand, held flat, traces a circle, indicating location of the garden surrounding the house)

[Figure 1: Interacting with REA]

Six rules account for 60% of the gestures in the transcriptions (recall) and apply with an accuracy of 96% (precision). These patterns provide a concrete specification for the main communicative strategies and communicative resources required for REA. A full discussion of the experimental methods and analysis, and the resulting rules, can be found in (Yan, 2000).

3 Framing the generation problem

In REA, requests for the generation of speech and gesture are formulated within the dialogue management module. REA's utterances reflect a coordination of multiple kinds of processing in the dialogue manager: the system recognizes that it has the floor, derives the appropriate communicative context for a response and an appropriate set of communicative goals, triggers the generation process, and realizes the resulting speech and gesture. The dialogue manager is only one component in a multithreaded architecture that carries out hardwired reactions to input as well as deliberative processing. The diversity is required in order to exhibit appropriate interactional and propositional conversational behaviors at a range of time scales, from tracking the user's movements with gaze and providing nods and other feedback as the user speaks, to participating in routine exchanges and generating principled responses to user's queries. See (Cassell, 2000a) for description and motivation of the architecture, as well as the conversational functions and behaviors it supports.
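The split the paper describes between hardwired reactions and deliberative processing can be pictured with a small sketch. This is not REA's implementation; it is a minimal Python illustration with invented event names and handlers, showing a dispatcher that routes events either to fast reactive behaviors or to a slower deliberative responder.

```python
REACTIVE_BEHAVIORS = {
    # Hardwired, low-latency behaviors at the fastest time scale.
    "user_moves":    lambda: print("shift gaze toward user"),
    "user_speaking": lambda: print("nod while the user holds the floor"),
}

def deliberate(utterance):
    # Slow path: derive context and goals, then plan a full multimodal
    # response (in REA, this is where the generator of Section 4 runs).
    print(f"planning a principled response to {utterance!r}")

def dispatch(event):
    """Route an event to the reactive path or the deliberative path."""
    if event in REACTIVE_BEHAVIORS:
        REACTIVE_BEHAVIORS[event]()
    else:
        deliberate(event)

for event in ["user_moves", "user_speaking", "Tell me more about the house."]:
    dispatch(event)
```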
REA's design and capabilities reflect our research focus on allying conversational content with conversation management, and allying nonverbal modalities with speech: how can an embodied agent use all its communicative modalities to contribute new content when needed (propositional function), to signal the state of the dialogue, and to regulate the overall process of conversation (interactional function)? Within this focus, REA's talk is firmly delimited. REA's utterances take a question-answer format, in which the user asks about (and REA describes) a single house at a time. REA's sentences are short; generally, they contribute just a few new semantic features about particular rooms or features of the house (in speech and gesture), and flesh this contribution out with a handful of meaningful elements (in speech and gesture) that ground the contribution in the shared context of the conversation.

Despite the apparent simplicity, the dialogue manager must contribute a wealth of information about the domain and the conversation to represent the communicative context. This detail is needed for REA to achieve a theoretically-motivated realization of the common patterns of speech and gesture we observed in human conversation. For example, a variety of changing features determine whether marked forms in speech and gesture are appropriate in the context. REA's dialogue manager tracks the changing status of such features as:

- Attentional prominence, represented (as usual in natural language generation) by setting up a context set for each entity (Dale, 1992). Our model of prominence is a simple local one similar to (Strube, 1998).
- Cognitive status, including whether an entity is hearer-old or hearer-new (Prince, 1992), and whether an entity is in-focus or not (Gundel et al., 1993). We can assume that houses and their rooms are hearer-new until REA describes them, and that just those entities mentioned in the prior sentence are in-focus.
- Information structure, including the open propositions or, following (Steedman, 1991), themes, which describe the salient questions currently at issue in the discourse (Prince, 1986). In REA's dialogue, open questions are always general questions about some entity raised by a recent turn; although in principle such an open question ought to be formalized as theme(λP.Pe), REA can use the simpler theme(e).

In fact, both speech and gesture depend on the same kinds of features, and access them in the same way; this specification of the dialogue state crosscuts distinctions of communicative modality.

Another component of context is provided by a domain knowledge base, consisting of facts explicitly labeled with the kind of information they represent. This defines the common ground in the conversation in terms of sources of information that speaker and hearer share. Modeling the discourse as a shared source of information means that new semantic features REA imparts are added to the common ground as the dialogue proceeds. Following results from (Kelly et al., 1999), which show that information from both speech and gesture is used to provide context for ongoing talk, our common ground may be updated by both speech and gesture.

The structured domain knowledge also provides a resource for specifying communicative strategies.
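As a reading aid, here is a minimal Python sketch of the kind of dialogue-state record the passage describes, with the three feature families tracked per entity. All names and values are invented for illustration; this is not the authors' data structure.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EntityStatus:
    """Context features the dialogue manager tracks for one entity."""
    context_set: set = field(default_factory=set)   # attentional prominence: competing entities (Dale, 1992)
    hearer_old: bool = False                        # cognitive status (Prince, 1992)
    in_focus: bool = False                          # mentioned in the prior sentence (Gundel et al., 1993)

@dataclass
class DialogueState:
    """The slice of dialogue state that conditions generation."""
    entities: dict = field(default_factory=dict)    # entity id -> EntityStatus
    theme: Optional[str] = None                     # information structure: theme(e), the open question
    common_ground: set = field(default_factory=set) # shared facts, updated by speech and gesture alike

# State after "It has a nice garden" plus the circling gesture:
state = DialogueState(theme="house1")
state.entities["garden1"] = EntityStatus(hearer_old=True, in_focus=True)
state.common_ground.update({("have", "house1", "garden1"),
                            ("surround", "garden1", "house1")})
```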
Recall that REA's communicative strategies are formulated in terms of functions which are common in naturally-occurring dialogues (such as "presentation") and which lead to distinctive bundles of content in gesture and speech. The knowledge base's kinds of information provide a mechanism for specifying and reasoning about such functions. The knowledge base is structured to describe the relationship between the system's private information and the questions of interest that that information can be used to settle. Once the user's words have been interpreted, a layer of production rules constructs obligations for response (Traum and Allen, 1994); then, a second layer plans to meet these obligations by deciding to present a specified kind of information about a specified object. This determines some concrete communicative goals: facts of this kind that a contribution to dialogue could make. Both speech and gesture can access the whole structured database in realizing these concrete communicative goals. For example, a variety of facts that bear on where a residence is (which city, which neighborhood or, if appropriate, where in a building) all provide the same kind of information, and would therefore fit the obligation to specify the location of a residence. Or, to implement the rule for presentation described in connection with (1), we can associate an obligation of presentation with a cluster of facts describing an object's type, its location in a house, and its size, shape or quality.

The communicative context and concrete communicative goals provide a common source for generating speech and gesture in REA. The utterance generation problem in REA, then, is to construct a complex communicative action, made up of speech and coverbal gesture, that achieves a given constellation of goals and tightly fits the context specified by the dialogue manager.

4 Generation and linguistic representation

We model REA's communicative actions as composed of a collection of atomic elements, including both lexical items in speech and clusters of semantic features expressed as gestures; since we assume that any such item usually conveys a specific piece of content, we refer to these elements generally as lexicalized descriptors. The generation task in REA thus involves selecting a number of such lexicalized descriptors and organizing them into a grammatical whole that manifests the right semantic and pragmatic coordination between speech and gesture.
The information conveyed must be enough that the hearer can identify the entity in each domain reference from among its context set. Moreover, the descriptors must provide a source which allows the hearer to recover any needed new domain proposition, either explicitly or by inference.

We use the SPUD generator ("Sentence Planning Using Description"), introduced in (Stone and Doran, 1997), to carry out this task for REA. SPUD builds the utterance element by element; at each stage of construction, SPUD's representation of the current, incomplete utterance specifies its syntax, semantics, interpretation and fit to context. This representation both allows SPUD to determine which lexicalized descriptors are available at each stage to extend the utterance, and to assess the progress towards its communicative goals which each extension would bring about. At each stage, then, SPUD selects the available option that offers the best immediate advance toward completing the utterance successfully. (We have developed a suite of guidelines for the design of syntactic structures, semantic and pragmatic representations, and the interface between them, so that SPUD's greedy search, which is necessary for real-time performance, succeeds in finding concise and effective utterances described by the grammar (Stone et al., 2000).)

As part of the development of REA, we have constructed a new inventory of lexicalized descriptors. REA's descriptors consist of entries that contribute to coverbal gestures, as well as revised entries for spoken words that allow for their coordination with gesture under appropriate discourse conditions. The organization of these entries assures that, using the same mechanism as with speech, REA's gestures draw on the single available conceptual representation, and that both REA's gesture and the relationship between gesture and speech vary as a function of pragmatic context in the same way as natural gestures and speech do. More abstractly, these entries enable SPUD to realize the concrete goals tied to common communicative functions with the same distribution of speech and gesture observed in natural conversations.

To explain how these entries work, we need to consider SPUD's representation of lexicalized descriptors in more detail. Each entry is specified in three parts. The first part, the syntax of the element, sets out what words or other actions the element contributes to its utterance. The syntax is a hierarchical structure, formalized using Feature-Based Lexicalized Tree Adjoining Grammar (LTAG) (Joshi et al., 1975; Schabes, 1990).
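Before the entry format is detailed further below, a rough sketch may make the element-by-element greedy loop concrete. This is only an illustration under invented names, with a toy scoring function; it is not the SPUD system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    """A toy stand-in for a lexicalized descriptor: a surface form, the
    facts it conveys, and a pragmatic condition on the context."""
    form: str
    semantics: frozenset
    pragmatics: str = ""

    def applicable(self, context):
        return (not self.pragmatics) or (self.pragmatics in context)

def spud_generate(goals, context, lexicon):
    """Element-by-element greedy planning in the SPUD style: at each
    stage, add the applicable entry that best advances the goals."""
    utterance, remaining = [], set(goals)
    while remaining:
        options = [e for e in lexicon if e.applicable(context)]
        if not options:
            break
        best = max(options, key=lambda e: len(remaining & e.semantics))
        if not remaining & best.semantics:
            break                      # nothing left can advance the goals
        utterance.append(best.form)
        remaining -= best.semantics
    return utterance

lexicon = [
    Entry("it has a garden", frozenset({("have", "h", "g")}), "theme:h"),
    Entry("GESTURE: circle, flat right hand",
          frozenset({("surround", "g", "h")}), "hearer-new:g"),
]
goals = {("have", "h", "g"), ("surround", "g", "h")}
print(spud_generate(goals, {"theme:h", "hearer-new:g"}, lexicon))
```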
Syntactic structures are also associated with referential indices that specify the entities in the discourse that the entry refers to. For the entry to apply at a particular stage, its syntactic structure must combine by LTAG operations with the syntax of the ongoing utterance.

REA's syntactic entries combine typical phrase-structure analyses of linguistic constructions with annotations that describe the occurrence of gestures in coordination with linguistic phrases. Our device for this is a construction SYNC which pairs a description of a gesture G with the syntactic structure of a spoken constituent C:

(2) SYNC G C

The temporal interpretation of (2) mirrors the rules for surface synchrony between speech and gesture presented in (Cassell et al., 1994). That is, the preparatory phase of gesture G is set to begin before the time constituent C begins; the stroke of gesture G (the most effortful part) co-occurs with the most phonologically prominent syllable in C; and, except in cases of coarticulation between successive gestures, by the time the constituent C is complete, the speaker must be relaxing and bringing the hands out of gesture space (while the generator specifies synchrony as described, in practice the synchronization of synthesized speech with graphics is an ongoing challenge in the REA project). In sum, the production of gesture G is synchronized with the production of speech C. (Our representation of synchrony in a single tree conveniently allows modules downstream to describe embodied communicative actions as marked-up text.)

The syntactic description of the gesture itself indicates the choices the generator must make to produce a gesture, but does not analyze a gesture literally as a hierarchy of separate movements. Instead, these choices specify independent semantic features which we can associate with aspects of a gesture (such as handshape and trajectory through space). Our current grammar does not undertake the final step of associating semantic features to the choice of particular handshapes and movements, or gesture morphology; we reserve this problem for later in the research program. We allow gesture to accompany alternative constituents by introducing alternative syntactic entries; these entries take on different pragmatic requirements (as described below) to capture their respective discourse functions.

So much for syntax. The second part, the semantics of the element, is a formula that specifies the content that the element carries. Before the entry can be used, SPUD must establish that the semantics holds of the entities the entry describes. If the semantics already follows from the common ground, SPUD assumes that the hearer can use it to help identify the entities described. If the semantics is merely part of the system's private knowledge, SPUD treats it as new information for the hearer.

Finally, the third part, the pragmatics of the element, is also a formula that SPUD looks to prove before using the entry. Unlike the semantics, however, the pragmatics does not achieve specific communicative goals like identifying referents. Instead, the pragmatics establishes a general fit between the entry and the context.

The entry schematized in (3) illustrates these three components; the entry also suggests how these components can define coordinated actions of speech and gesture that respond coherently to the context.

(3) a. syntax: [S [NP:o ↓] [VP [V have] [SYNC [G:x ↓] [NP:x ↓]]]]
    b. semantics: have(o, x)
    c. pragmatics: hearer-new(x) ∧ theme(o)

(3) describes the use of have to introduce a new feature of (a house) o. The feature, indicated throughout the entry by the variable x, is realized as the object NP of the verb have, but x can also form the basis of a gesture G coordinated with the noun phrase (as indicated by the SYNC constituent). The entry asserts that o has x.

(3) is a presentational construction; in other words, it coordinates non-redundant paired speech and gesture in the same way as demonstrated by our house description data. To represent this constraint on its use, the entry carries two pragmatic requirements: first, x must be new to the hearer; moreover, o must link up with the open question in the discourse that the sentence responds to.

The pragmatic conditions of (3) help support our theory of the discourse function of gesture and speech. A similar kind of sentence could be used to address other open questions in the discourse, for example, to answer which house has a garden? This would not be a presentational function, and (3) would be infelicitous here. In that case, gesture would naturally coordinate with and elaborate on the answering information, in this case the house. So the different information structure would activate a different entry, where the gesture would coordinate with the subject and describe o.

Meanwhile, alternative entries like (4a) and (4b), two entries that both convey (4c) and that both could combine with (3) by LTAG operations, underlie our claim that our implementation allows gesture and speech to draw on a single conceptual source and fulfill similar communicative intentions.

(4) a. syntax: [G:x circular-trajectory [RS:x ↓]]
    b. syntax: [NP [NP*:x] [VP [V surrounding] [NP:p ↓]]]
    c. semantics: surround(x, p)

(4a) provides a structure that could substitute for the G node in (3) to produce semantically and pragmatically coordinated speech and gesture. (4a) specifies a right hand gesture in which the hand traces out a circular trajectory; a further decision must determine the correct handshape (node RS, as a function of the entity x that the gesture describes). We pair (4a) with the semantics in (4c), and thereby model that the gesture indicates that one object, x, surrounds another, p. Since p cannot be further described, p must be identified by an additional presupposition of the gesture which picks up a reference frame from the shared context.

Similarly, (4b) describes how we could modify the VP introduced by (3) (using the LTAG operation of adjunction) to produce an utterance such as It has a garden surrounding it. By pairing (4b) with the same semantics (4c), we ensure that SPUD will treat the communicative contributions of the alternative constructions of (4) in a parallel fashion.

5 Solving the generation problem

We now sketch how entries such as these combine together to account for REA's utterances. Our example is the dialogue in (5):

(5) a. User: Tell me more about the house.
    b. REA: It has [a nice garden]. (right hand, held flat, traces a circle)

REA's response indicates both that the house has a nice garden and that it surrounds the house. As we have seen, (5b) represents a common pattern of description; this particular example is motivated by an exchange two human subjects had in our study, cf. (1). (5b) represents a solution to a generation problem that arises as follows within REA's overall architecture. The user's directive is interpreted and classified as a directive requiring a deliberative response. The dialogue manager recognizes an obligation to respond to the directive, and concludes that to fulfill the function of presenting the garden would discharge this obligation. The presentational function grounds out in the communicative goal to convey a collection of facts about the garden (type, quality, location relative to the house). Along with these goals, the dialogue manager supplies its communicative context, which represents the centrality of the house in attentional prominence, cognitive status and information structure.

In producing (5b) in response to this NLG problem, SPUD both calculates the applicability of and determines a preference for the lexicalized descriptors involved. Initially, (3) is applicable; the system knows the house has the garden, and represents the garden as new and the house as questioned. The entry can be selected over potential alternatives based on its interpretation: it achieves a communicative goal, refers to a prominent entity, and makes a relatively specific connection to facts in the context. Similarly, in the second stage, SPUD evaluates and selects (4a) because it communicates a needed fact in a way that helps flesh out a concise, balanced communicative act, by supplying a gesture that, by using (3), SPUD has already realized belongs here. Choices of remaining elements (the words garden and nice, the semantic features to represent the garden in the gesture) proceed similarly. Thus SPUD arrives at the response in (5b) just by reasoning from the declarative specification of the meaning and context of communicative actions.

6 Related Work

The interpretation of speech and gesture has been investigated since the pioneering work of (Bolt, 1980) on deictic gesture; recent work includes (Koons et al., 1993; Bolt and Herranz, 1992). Systems have also attempted generation of gesture in conjunction with speech. Lester et al. (1998) generate deictic gestures and choose referring expressions as a function of the potential ambiguity of objects referred to, and their proximity to the animated agent.
Rickel and Johnson (1999)'s pedagogical agent produces a deictic gesture at the beginning of explanations about objects in the virtual world. André et al. (1999) generate pointing gestures as a sub-action of the rhetorical action of labeling, in turn a sub-action of elaborating.

Missing from these prior systems, however, is a representation of communicative action that treats the different modalities on a par. Such representations have been explored in research on combining linguistic and graphical interaction. For example, multimodal managers have been described to allocate an underlying content representation for generation of text and graphics (Wahlster et al., 1991; Green et al., 1998). Meanwhile, (Johnston et al., 1997; Johnston, 1998) describe a formalism for tightly-coupled interpretation which uses a grammar and semantic constraints to analyze input from speech and pen. While many insights from these formalisms are relevant in embodied conversation, spontaneous gesture requires a distinct analysis with different emphasis. For example, we need some notion of discourse pragmatics that would allow us to predict where gesture occurs with respect to speech, and what its role might be. Likewise, we need a model of the communicative effects of spontaneous coverbal gesture, one that allows us to reason naturally about the multiple goals speakers have in producing each utterance.

7 Conclusion

Research on the robustness of human conversation suggests that a dialogue agent capable of acting as a conversational partner would provide for efficient and natural collaborative dialogue. But human conversational partners display gestures that derive from the same underlying conceptual source as their speech, and which relate appropriately to their communicative intent. In this paper, we have summarized the evidence for this view of human conversation, and shown how it informs the generation of communicative action in our artificial embodied conversational agent, REA. REA has a working implementation, which includes the modules described in this paper, and can engage in a variety of interactions including that in (5). Experiments are underway to investigate the extent to which REA's conversational capacities share the strengths of the human capacities they are modeled on.

Acknowledgments

The research reported here was supported by NSF (award IIS-9618939), Deutsche Telekom, AT&T, and the other generous sponsors of the MIT Media Lab, and a postdoctoral fellowship from RUCCS. Hannes Vilhjálmsson assisted with the implementation of REA's discourse manager. We thank Nancy Green, James Lester, Jeff Rickel, Candy Sidner, and anonymous reviewers for comments on this and earlier drafts.

References

Elisabeth André, Thomas Rist, and Jochen Müller. 1999. Employing AI methods to control the behavior of animated interface agents. Applied Artificial Intelligence, 13:415-448.
Douglas Appelt. 1985. Planning English Sentences. Cambridge University Press, Cambridge, England.
R. A. Bolt and E. Herranz. 1992. Two-handed gesture in multi-modal natural dialog. In UIST 92: Fifth Annual Symposium on User Interface Software and Technology.
R. A. Bolt. 1980. Put-that-there: voice and gesture at the graphics interface. Computer Graphics, 14(3):262-270.
J. Cassell and K. Thórisson. 1999. The power of a nod and a glance: Envelope vs. emotional feedback in animated conversational agents. Applied Artificial Intelligence, 13(3).
Conversation analysis

Conversation Analysis (CA), also known as the analysis of talk transcripts, is a branch of pragmatic research that focuses on how language works interactively in everyday communication. It helps people understand speakers' orientations better, and it can improve communicative efficiency and reduce misunderstanding. Conversation analysis concentrates on how language functions in everyday exchanges. It aims to establish how participants perform, acknowledge, interpret and gloss each action in an exchange, together with the consequences that follow from it. It is not a general theory built on concepts, models or frameworks; rather, it is a concrete, self-contained technique that stays close to practice, analyzing each step of an exchange through the actual talk in order to read off particular linguistic actions and their effects. Conversation analysis attends to how the two parties monitor and manage each other's behavior; the technique helps uncover participants' differing intentions, and it can be used to examine how participants pursue their own ends by manipulating language or patterns of expression. In this way it provides a method for revealing the mechanisms behind participants' behavior, while also making language research more practical and applicable. Conversation analysis is not only an effective way of studying language; it can be applied in many other fields of research, such as sociology, psychology, cultural studies, education, and social work. And because it lets researchers analyze language in greater depth, it can help them form a fuller picture of the key events that occur in the course of communication.
The basic method of conversation analysis has two components: transcription and analysis. Transcription extracts the talk recorded in field observation so that it can be analyzed; analysis then examines the transcribed talk objectively in order to understand each party's purposes and intentions. The process can draw on a range of techniques, such as building a corpus, exploring a particular context, probing the language itself, or probing the participants' behavior, as in the sketch below.
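To make the transcription-then-analysis workflow concrete, here is a small Python sketch with invented data and field names: a transcript as a list of timed turns, plus one trivial analysis pass that counts questions per speaker as a crude index of who is steering the exchange.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str
    start: float  # seconds into the recording
    end: float

# Step 1: transcription, reduced here to hand-made example turns.
transcript = [
    Turn("A", "So how did the meeting go?", 0.0, 2.1),
    Turn("B", "Honestly, better than expected.", 2.3, 4.0),
    Turn("A", "Did they accept the proposal?", 4.2, 6.0),
]

# Step 2: analysis, here a toy pass counting question turns per speaker.
questions = {}
for turn in transcript:
    if turn.text.rstrip().endswith("?"):
        questions[turn.speaker] = questions.get(turn.speaker, 0) + 1
print(questions)  # {'A': 2}
```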
As a technique, conversation analysis helps people understand the actions of everyday talk more deeply, along with their effects on communication. It has inspired many researchers to explore a wider range of topics, such as multicultural exchange, cross-cultural communication, dependence and trust in communication, and latent language barriers. It also offers an effective means of improving communication, raising the efficiency of social interaction and resolving communicative problems.
Address-term choice in The Diary of Anne Frank under the "SPEAKING" model

The Diary of Anne Frank is the diary Anne Frank kept to record her time in hiding with her family from Nazi persecution during the Second World War. In it we can watch the struggles Anne went through as she grew up. Within linguistics, we can apply the "SPEAKING" model to the choice of address terms in the diary, in order to understand how Anne expressed herself in different situations.

The SPEAKING model, proposed by Hymes in 1972, is a framework for analyzing speech events. Taking verbal behavior as its center and approaching it from a pragmatic angle, it breaks a speech event into eight components: Setting, Participants, Ends, Act sequence, Key, Instrumentalities, Norms, and Genre. We will use this model to analyze the choice of address terms in the diary, so as to understand Anne's ways of speaking in different situations; a small sketch of the model as a data record follows.
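Here is a minimal Python sketch, mentioned above as a reading aid, that records one speech event under Hymes's eight components. The field values are invented for illustration, not quoted from the essay or the diary.

```python
from dataclasses import dataclass

@dataclass
class SpeakingEvent:
    """One speech event coded with Hymes's SPEAKING components."""
    setting: str            # S: physical and temporal scene
    participants: str       # P: speaker, hearer, audience
    ends: str               # E: goals and outcomes
    act_sequence: str       # A: form and order of the speech acts
    key: str                # K: tone or manner
    instrumentalities: str  # I: channel and code
    norms: str              # N: norms of interaction and interpretation
    genre: str              # G: kind of speech event

diary_entry = SpeakingEvent(
    setting="the secret annex, wartime Amsterdam",
    participants="Anne writing; family and friends as topics; future readers",
    ends="record daily life and feelings; reflect on adversity",
    act_sequence="dated entry, salutation, narration",
    key="candid, intimate",
    instrumentalities="written diary",
    norms="private writing, free of outside judgment",
    genre="diary entry",
)
```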
First, the Setting. The setting of the diary is the attic where Anne hid together with her family. In this confined space, the way the family got along and the way they spoke to one another were both affected. In this environment Anne could express her thoughts and feelings more freely, because the space was relatively private and she did not need to fear outside judgment or prejudice. In this setting, Anne's choice of address terms in the diary is more genuine and candid: she is more willing to use intimate terms for family and friends, such as "Daddy", "Mummy", or "my friends".

Next, the Participants. In the diary, the participants are mainly Anne's family and friends, and Anne's wording varies with them. When addressing her parents, she tends to choose address terms more respectfully and carefully, such as "Father" and "Mother"; when addressing her friends, she is more casual and intimate, using terms such as "my friends" or "my dear".

Then the Ends. Anne's writing in the diary generally serves to record her life and feelings, and her reflections in adversity. Her words are addressed to herself and to future readers, so in choosing address terms she gives priority to emotional expression and authenticity, in order to convey her true inner feelings and thoughts.
Prompt-based Language Models: a brief survey of template-augmented language models

© PaperWeekly original · Author: Li Luoqiu · Affiliation: master's student, Zhejiang University · Research interests: natural language processing, knowledge graphs

A trend has recently taken hold in the NLP community: using prompts (templates) to strengthen model predictions. From Su Jianlin's recent articles "Do we really need GPT-3? No, BERT's MLM model can do few-shot learning too" and "P-tuning: automatically building templates to unlock the potential of language models", to Yang Zhilin's report on "a new paradigm for pre-training and fine-tuning" at the BAAI "WuDao 1.0 AI research launch and forum on large-scale pre-trained models" held on March 20 [1], all point to the large gains prompts bring to model performance in few-shot learning and similar settings. Based on these materials and the related papers, this article attempts to trace the origins and development of the prompt family of methods.

Contents:
1. Back to the source: from GPT and MLM to Pattern-Exploiting Training
   1. Pattern-Exploiting Training
2. Freeing our hands: automatically constructing prompts
   1. LM Prompt And Query Archive
   2. AUTOPROMPT
   3. Better Few-shot Fine-tuning of Language Models
3. Thinking outside the box: constructing continuous prompts
   1. P-tuning
4. Summary

Back to the source: from GPT and MLM to Pattern-Exploiting Training

To explain what a prompt is, we have to start with OpenAI's GPT models. GPT is a series of generative models; the third generation, GPT-3, was released in May 2020. With 175 billion parameters, it can generate all kinds of text without fine-tuning (of course, almost no one can train it easily), from routine tasks such as dialogue and summarization to some exotic scenarios (generating UI or SQL code?), and so on.
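The MLM-as-few-shot-learner idea that the Pattern-Exploiting Training line of work builds on can be sketched in a few lines. The following illustration assumes the HuggingFace transformers library and its fill-mask pipeline; the model choice, template, and verbalizer words are my own, not from the article. It recasts sentiment classification as a cloze question that a masked language model answers.

```python
from transformers import pipeline

# A template ("prompt") recasts classification as masked-token prediction.
fill = pipeline("fill-mask", model="bert-base-uncased")

def classify(review: str) -> str:
    """PET-style zero-shot sentiment: ask the MLM to fill the template,
    then compare the scores of two 'verbalizer' words standing for labels."""
    prompt = f"{review} It was [MASK]."
    scores = {pred["token_str"]: pred["score"]
              for pred in fill(prompt, targets=["great", "terrible"])}
    return "positive" if scores["great"] >= scores["terrible"] else "negative"

print(classify("The plot dragged and the acting was wooden."))  # expected: negative
```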
PROSODY MODELS FOR CONVERSATIONAL SPEECH RECOGNITION

Mari Ostendorf, Izhak Shafran, Rebecca Bates*
EE Dept., University of Washington, Seattle, WA
AT&T Labs - Research, Florham Park, NJ
*CIS Dept., Minnesota State University, Mankato, MN
mo@ee.washington.edu, bates@mnsu.edu, zak@research.att.com
ABSTRACT

This paper describes a formal model for incorporating prosody in the speech recognition process, both for improving word recognition directly and for jointly recognizing words and underlying structure. The model includes the possibility of using an intermediate symbolic representation as well as direct conditioning on acoustic correlates. Alternatives for feature extraction are described, together with implications for statistical modeling. Examples of prosody conditioning in spontaneous speech recognition include acoustic model clustering and dynamic pronunciation modeling.
1. INTRODUCTION

Prosody is of interest to automatic speech recognition (ASR) researchers because it plays an important role in the comprehension of spoken language by human listeners: it helps in recognizing spoken words, in resolving global and local ambiguities, and in processing discourse structure (see [1] for a review). The role of prosody is particularly important in spontaneous speech. For example, acoustic differences between stressed and unstressed syllables are greater in spontaneous speech than in read speech, and detection of word-initial target phonemes is faster for lexically stressed than for unstressed syllables in spontaneous speech but not in read speech [2, 3]. Conversational speech contains a large amount of prosodic variation, which seems to co-occur with greater acoustic variability more generally [4]. Thus, researchers have long hypothesized that prosody could be useful in improving computer recognition of speech. However, prosody has been used to only a small extent, though successful applications in ASR are growing. Some promising work includes the use of prosody for improved duration modeling [5, 6], to control the search space and cross-word context models [7], for improving noise robustness [8], to help drive dynamic pronunciation models [9, 10], and in language modeling [11, 12].

Use of prosody in speech understanding applications has been more extensive, and there are a growing number of applications that are being explored. Examples include detection of disfluencies and sentence boundaries [13, 14, 15], topic segmentation [16, 17, 18], dialog acts [19, 20], errors and corrections in spoken dialog [21, 22], and improved speech understanding in dialog systems [23]. Prosody has also proved to be useful in recent work on speaker recognition [24].

Important reasons for the increasing number of successes using prosody are the use of more formal modeling frameworks involving combinations of knowledge sources, and feature extraction methods based on improved signal processing and a wider array of acoustic cues to prosodic structure. In this paper, we will describe a particular approach to these two core themes (feature extraction and statistical modeling) that focuses on the word recognition problem, but also incorporates the idea of using recognition of linguistic structure to improve word recognition accuracy, as in [25, 26]. Examples are included to show how prosody can be integrated in ASR systems at several different levels of the recognition process.

The remainder of the paper is structured as follows. Section 2 reviews the structure of typical speech recognition systems to provide context for the models described here. In Section 3, we outline key issues that we believe are critical to successful use of prosody in speech recognition and understanding, relating to feature extraction and statistical modeling. Next, in Section 4, we describe methods and some experimental results for integrating prosody into different aspects of the recognition process, including acoustic modeling and pronunciation modeling. We conclude in Section 5 with a summary and discussion of open questions.
2. SPEECH RECOGNITION OVERVIEW

Using a probabilistic approach, the speech recognition problem involves searching for the word sequence that maximizes the posterior probability of the word sequence given acoustic observations:

$$\hat{W} = \arg\max_{W} P(W \mid A),$$

or equivalently

$$\hat{W} = \arg\max_{W} p(A \mid W)\,P(W).$$

[...]

...algorithm using multiple decision trees [35] or by principal components analysis [33].

Fundamental frequency has been notoriously difficult in computational modeling. Though it appears to be highly important in perceptual studies, it is usually the least important cue in computational models except in topic segmentation work. Many studies have found that it is not useful at all. We hypothesize that the reason for this is simply difficulties in feature extraction and not lack of importance. In addition to exploration of normalization techniques, recent advances in F0 processing methods may change this trend. One example is [36], which uses a Gaussian mixture model to detect halving and doubling followed by a piecewise linear model to generate a stylized curve. The slopes of the resulting piecewise segments lead to more robust measures than simple derivatives of the raw F0 sequence.

The need to capture local time (e.g. associated with word boundaries for phrase-final tone types) and to normalize for segmental content has the problematic consequence that the hypothesized word and its timing are used in feature extraction. This is not a problem for speech understanding problems where the word recognition output is fixed, but it can be a problem in word recognition itself or for processing word lattices, as discussed in the next section.
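The stylization step attributed to [36] lends itself to a small illustration. The sketch below is my own simplification (it omits the Gaussian-mixture halving/doubling detector and uses fixed equal-length spans rather than adaptive breakpoints): it fits piecewise linear segments to a voiced F0 track with numpy and returns the segment slopes as features.

```python
import numpy as np

def f0_slope_features(f0, n_segments=4):
    """Stylize an F0 contour by least-squares fitting a line to each of
    n_segments equal-length spans of voiced frames; return the slopes,
    a more robust cue than raw frame-to-frame deltas."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0                       # simple voicing mask: 0 marks unvoiced frames
    slopes = []
    for span in np.array_split(np.arange(len(f0)), n_segments):
        idx = span[voiced[span]]          # voiced frame indices within this span
        if len(idx) < 2:
            slopes.append(0.0)            # too little voiced data to fit a line
            continue
        slope, _ = np.polyfit(idx, f0[idx], deg=1)
        slopes.append(slope)
    return slopes

# A toy rising-then-falling contour: 100 frames, Hz values.
contour = np.concatenate([np.linspace(120, 180, 50), np.linspace(180, 110, 50)])
print(f0_slope_features(contour))  # roughly [+, +, -, -] Hz per frame
```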