2003 Speech Emotion Research: A Review of Research Paradigms
Continuous Emotion Computation for Mandarin Speech Based on Envelope-Spectrum Modulation Patterns (Part One)
I. Introduction
With the continued development of artificial intelligence, speech recognition and processing technologies have been widely applied in human-computer interaction, intelligent voice assistants, affective computing, and related fields. In Mandarin speech processing, continuous emotion computation based on envelope-spectrum modulation patterns has become an important research direction. This article examines methods for continuous emotion computation from Mandarin speech based on envelope-spectrum modulation patterns and analyzes their applications in affective computing.
II. Envelope-Spectrum Modulation Patterns of Mandarin Speech
The envelope-spectrum modulation pattern of Mandarin speech refers to the envelope formed as the frequency components of the speech signal vary over time. This modulation pattern carries rich phonetic information, such as pitch, intensity, and duration, and it also carries emotional information. Analyzing the envelope-spectrum modulation pattern therefore allows emotional features to be extracted from speech, providing the basis for continuous emotion computation.
III. Implementing Continuous Emotion Computation
Continuous emotion computation based on envelope-spectrum modulation patterns involves the following steps (a minimal end-to-end sketch follows this list):
1. Speech signal preprocessing: denoise and normalize the raw speech signal to prepare it for subsequent analysis.
2. Envelope-spectrum extraction: extract the envelope spectrum of the speech signal with a suitable algorithm. This step is the core of continuous emotion computation and requires careful choice of algorithm and parameters.
3. Emotional feature extraction: derive emotional features such as pitch and intensity from the modulation pattern of the envelope spectrum. These features feed the subsequent emotion classification and computation.
4. Emotion classification and computation: apply machine learning algorithms to the extracted features to produce continuous emotion labels or affect scores.
5. Output and feedback: return the computed emotion labels or affect scores to the user or system, and refine the pipeline based on feedback.
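A minimal sketch of steps 1–4 in Python, assuming librosa, SciPy, and scikit-learn are available; `paths` and `arousal` stand in for an annotated corpus, and the 0.5–20 Hz modulation bands and ridge regression are illustrative choices rather than the method of the original article:

```python
import numpy as np
import librosa
from scipy.signal import hilbert
from sklearn.linear_model import Ridge

def envelope_modulation_features(path, n_bands=8, sr=16000):
    """Illustrative envelope-modulation descriptor for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    y = y / (np.max(np.abs(y)) + 1e-9)           # simple normalization
    envelope = np.abs(hilbert(y))                 # amplitude envelope
    # Modulation spectrum of the envelope; low bands (< ~20 Hz) carry
    # syllable- and prosody-rate energy often linked to arousal.
    spec = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / sr)
    edges = np.linspace(0.5, 20, n_bands + 1)     # 0.5-20 Hz bands, DC excluded
    return np.array([spec[(freqs >= lo) & (freqs < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])

# Continuous emotion computation treated as regression (e.g., arousal scores).
# `paths` and `arousal` are assumed to come from an annotated corpus.
X = np.vstack([envelope_modulation_features(p) for p in paths])
model = Ridge(alpha=1.0).fit(X, arousal)
predicted_arousal = model.predict(X)
```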
IV. Applications in Affective Computing
Continuous emotion computation based on envelope-spectrum modulation patterns has broad applications in affective computing. Typical scenarios include:
1. Human-computer interaction: analyzing a user's speech to recognize their emotional state and provide more intelligent, attentive service. For example, an intelligent voice assistant can adjust its tone and content according to the user's emotion to improve the user experience.
2. Mental health assessment: analyzing a patient's speech can help estimate their emotional state and mental health, which is valuable for the diagnosis and treatment of patients with mental disorders.
Research on Deep Learning-Based Speech Emotion Recognition and Emotion Analysis
Introduction: Speech is one of the most basic and natural forms of human communication and conveys rich emotional information. Enabling machines to recognize and analyze emotion in speech accurately is therefore a task of considerable importance. This article reviews progress in deep learning-based speech emotion recognition and emotion analysis and introduces applications in different fields.
I. Development of Speech Emotion Recognition
Speech emotion recognition infers a speaker's emotional state by analyzing audio features of the speech signal together with cues such as intonation, speaking rate, and loudness. From traditional feature-engineering approaches to the recent rise of deep learning, the field has developed substantially.
1. Traditional methods: conventional approaches rely on feature engineering, manually selecting and extracting hand-designed features such as fundamental frequency, energy, and zero-crossing rate, and then classifying them with machine learning algorithms. The hand-crafted features, however, often provide weak representations, which limits accuracy.
2. Deep learning methods: deep learning has attracted attention for its ability to learn feature representations automatically. Deep neural network models such as convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and self-attention networks (Transformers) have become the mainstream models for speech emotion recognition. They can extract high-level feature representations directly from the raw speech signal, substantially improving recognition accuracy and robustness.
II. Research Directions in Deep Learning-Based Speech Emotion Recognition
Research in this area covers several aspects, including feature extraction, model design, and dataset construction.
1. Feature extraction: extracting effective features from the raw speech signal is critical for emotion recognition. In recent years, deep learning-based feature extraction methods such as neural vocoder back-ends and autoencoders have been widely used. These methods learn more informative speech representations and improve recognition performance.
2. Model design: the design of the deep learning model directly affects recognition accuracy and robustness. Besides the common CNN, LSTM, and Transformer models, models that incorporate cross-modal information have also been studied. For example, speech and facial-expression data can be fed into a network together and trained jointly to improve emotion recognition performance (a small fusion sketch follows).
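One simple way to realize such joint training is a two-branch late-fusion network. The PyTorch sketch below is illustrative only; the feature dimensions, layer sizes, and four-class setup are placeholders, not values from the original text:

```python
import torch
import torch.nn as nn

class AudioVisualEmotionNet(nn.Module):
    """Illustrative late-fusion model: acoustic and facial features are
    encoded separately, concatenated, and classified jointly."""
    def __init__(self, n_audio_feats=40, n_face_feats=136, n_classes=4):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(n_audio_feats, 64), nn.ReLU())
        self.face_branch = nn.Sequential(nn.Linear(n_face_feats, 64), nn.ReLU())
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, audio, face):
        fused = torch.cat([self.audio_branch(audio),
                           self.face_branch(face)], dim=-1)
        return self.classifier(fused)

# Joint training on paired audio/face batches (dummy tensors shown).
model = AudioVisualEmotionNet()
logits = model(torch.randn(8, 40), torch.randn(8, 136))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (8,)))
loss.backward()
```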
Research on Emotional Voice Cloning Based on Two-Stage Speaker-Feature Transfer Learning (Part One)
I. Introduction
With the continued development of artificial intelligence, emotional voice cloning has become an active research topic. Emotional voice cloning uses computational techniques to reproduce the emotional voice of a particular individual and transfer it to another voice, imitating and reproducing human emotional expression. To this end, this article studies an emotional voice cloning approach based on two-stage transfer learning of speaker features. It first introduces the background and significance of the work and briefly surveys the research status and literature at home and abroad.
II. Background and Significance
Emotional voice cloning is an important research direction in artificial intelligence, with broad applications such as virtual characters, game interaction, and voice assistants. Extracting and cloning a speaker's emotional characteristics makes human-computer interaction more natural and realistic. Traditional emotional voice cloning methods, however, tend to focus on the physical properties of the voice while overlooking the speaker's emotional characteristics and language habits. Studying two-stage transfer learning based on speaker features addresses this problem and has both theoretical and practical value.
III. Research Status and Literature Review
In recent years, researchers at home and abroad have studied emotional voice cloning extensively. Traditional approaches mainly use statistical parametric generation models or deep learning models to synthesize a target speaker's emotional voice. These methods suffer from two problems: they express the speaker's individual characteristics insufficiently, and they require large amounts of data and computation. With the development of transfer learning, some researchers have begun to apply it to emotional voice cloning. For example, deep transfer learning models can be pretrained on large datasets to obtain general feature representations, which then help extract the speaker's individual and emotional characteristics. Two-stage transfer learning has also been applied in this area, transferring knowledge between two different tasks to achieve better performance.
IV. The Proposed Approach
This work uses two-stage transfer learning to extract and clone a speaker's individual and emotional characteristics. In the first stage, a deep learning model is pretrained on a large amount of data to obtain general feature representations. The pretraining data include various kinds of audio, such as read speech and conversational speech.
Research on Speech Emotion Recognition Based on Acoustic Features
In recent years, with the spread of smart speakers and voice assistants, speech recognition technology has become increasingly important, and emotion recognition extends it into a new domain. By analyzing the waveform of a recording and processing the acoustic signal, a deep learning model can infer the speaker's emotional state. Such systems are widely used in natural interaction, screening for mental illness, advertising, and personalized multimedia. Speech emotion recognition must capture not only the semantic content of what is said but also the emotional content carried by the acoustic signal. That emotional content is typically expressed through pitch, loudness, speaking rate, intonation, tone of voice, and pauses, all of which can be extracted from the acoustic signal and provide strong evidence for emotion models.
Acoustic feature extraction is one of the core technologies in speech emotion recognition. Acoustic features are numerical descriptors extracted from a stretch of speech that relate to emotion, such as timing measures (e.g., STA), speaking rate, and energy. They can be extracted directly with Python speech-processing libraries or with languages such as C++. Under Python, for example, PyAudio can capture audio and the open-source library librosa can compute many acoustic features, such as the STFT (short-time Fourier transform) and MFCCs (mel-frequency cepstral coefficients).
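For instance, a short librosa-based extraction sketch might look like this; the file name is a placeholder, and the frame sizes and pooling scheme are illustrative choices:

```python
import numpy as np
import librosa

# Illustrative feature extraction with librosa; "speech.wav" is a placeholder.
y, sr = librosa.load("speech.wav", sr=16000)

stft = np.abs(librosa.stft(y, n_fft=512, hop_length=160))   # STFT magnitude
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)           # MFCCs
rms = librosa.feature.rms(y=y)                               # frame energy
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)

# A simple fixed-length utterance descriptor: per-feature means and std devs.
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [rms.mean(), rms.std(), np.nanmean(f0), np.nanstd(f0)],
])
```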
One of the most commonly used acoustic features is the fundamental frequency (F0), the base rate of vocal-fold vibration and an important factor in the perception of emotion. F0 is especially important for emotion recognition because it relates not only to pitch and intonation but also to emotional emphasis in speech.
Spectral shape measures such as the spectral centroid (SC) are another popular class of acoustic features, summarizing how energy is distributed across the speech spectrum, with windowing choices intended to suppress outliers. Because most emotional states affect speaking tempo, speaking rate (PS) is also a common acoustic feature, reflecting how quickly frequency and amplitude patterns change in the speech signal.
Recent studies show that learning emotional states from several acoustic features at once works better than using any single feature alone. Support vector machines (SVMs) and neural networks (NNs) are common classifiers for such multi-feature emotion classification.
For example, a model proposed by Jean-Philippe Lachaud and colleagues, based on LPC (linear predictive coding) features and confidence intervals, has been evaluated on the widely used IEMOCAP emotion corpus with a large combined feature set.
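A minimal sketch of such a multi-feature classifier with scikit-learn; the feature matrix `X` and emotion labels `y` are assumed to come from a labelled corpus, and the kernel and regularization settings are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row of pooled acoustic features per utterance (e.g., MFCC/F0/energy
# statistics as computed above); y: integer emotion labels. Placeholders here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```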
Research on Speech Emotion Recognition Algorithms Based on Unsupervised Learning
Unsupervised speech emotion recognition is a research area with broad application prospects. With the rapid development of artificial intelligence and speech recognition, emotion recognition matters in human-computer interaction, intelligent customer service, mental health monitoring, and other fields. This article examines unsupervised approaches to speech emotion recognition, analyzes the relevant techniques and methods, and looks ahead to future directions.
I. Introduction
With the spread of social media and smart devices, the demand for emotion recognition algorithms keeps growing. Traditional supervised methods require large amounts of labelled training data, whereas unsupervised methods can analyze and mine large amounts of unlabelled data to discover the patterns hidden in them. Unsupervised speech emotion recognition is therefore of real importance.
II. Relevant Techniques
1. Feature extraction
Feature extraction is a crucial step in any speech emotion recognition algorithm. Commonly used features include mel-frequency cepstral coefficients (MFCCs) and linear predictive cepstral coefficients (LPCCs), obtained through time-frequency analysis and frequency-domain filtering of the speech signal. Extracting such features turns the speech signal into numeric vectors that subsequent emotion classification can work with.
2. Clustering
Clustering is one of the most common tools in unsupervised learning. Clustering the feature vectors groups speech samples with similar emotional expression into the same cluster. Common algorithms include k-means and hierarchical clustering. They group samples by similarity and thereby provide an unsupervised partition of the emotional samples.
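A minimal k-means sketch with scikit-learn; the feature matrix `X` is assumed to come from the feature-extraction step above, and the cluster count is illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# X: utterance-level acoustic feature vectors (placeholder for real data).
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # e.g., 4 emotion groups
cluster_ids = kmeans.fit_predict(X_scaled)

# Clusters are unlabelled; in practice each cluster is interpreted afterwards,
# e.g., by listening to a few utterances or inspecting its mean F0/energy.
for k in range(4):
    print(f"cluster {k}: {np.sum(cluster_ids == k)} utterances")
```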
3. Dimensionality reduction
Dimensionality reduction is another common tool, used to reduce the dimensionality and redundancy of the feature vectors. Common techniques include principal component analysis (PCA) and linear discriminant analysis (LDA); PCA is unsupervised, while LDA additionally requires class labels. These techniques project the high-dimensional feature space down to a low-dimensional one that retains the most discriminative and informative directions.
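A short PCA sketch along the same lines; `X` is again a placeholder feature matrix, and the 95% variance threshold is an illustrative choice, not from the original text:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: high-dimensional utterance features (placeholder). Keep enough components
# to explain ~95% of the variance, then cluster or visualize in that space.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_low = pca.fit_transform(X_scaled)

print("kept", pca.n_components_, "components,",
      round(pca.explained_variance_ratio_.sum(), 3), "variance explained")
```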
III. Method and Experiments
In practice, an unsupervised speech emotion recognition algorithm proceeds through several steps. First, a large number of speech samples are collected and preprocessed, including denoising, segmentation, and normalization. Then a feature extraction algorithm converts the speech signals into feature vectors. Next, a clustering algorithm groups the feature vectors and assigns the samples to emotion clusters. Finally, dimensionality reduction is applied to the feature vectors to extract the most discriminative features.
Research on Deep Learning-Based Speech Emotion Recognition and Analysis
Overview: Emotion is an important form of information in human communication and affects interpersonal relationships, decision making, and physical health. Accurately recognizing and analyzing emotion in speech has therefore become one of the active topics in deep learning research. This article focuses on deep learning-based methods for speech emotion recognition and analysis and discusses the state of the art and future directions.
I. Deep Learning in Speech Emotion Recognition
As a machine learning approach, deep learning has produced many breakthroughs in speech emotion recognition and analysis. Its main applications include acoustic feature extraction, emotional feature representation, and emotion classification.
1.1 Acoustic feature extraction
Acoustic features are features extracted from the speech signal that carry emotional information. Traditional methods typically use statistically motivated feature extraction such as MFCCs and LPCCs. Deep learning-based methods, by contrast, learn representations directly from the signal and often prove more accurate. For example, convolutional neural network (CNN) and recurrent neural network (RNN) architectures can improve the representational power of acoustic features.
1.2 Emotional feature representation
Emotional feature representation converts the raw speech signal into features that express emotional information. Deep learning methods can automatically learn higher-level emotional representations and map effectively between speech features and emotion. For example, autoencoders, deep belief networks, and generative adversarial networks can turn speech data into more abstract emotional features.
1.3 Emotion classification
Emotion classification assigns the emotion in speech to categories, commonly positive, negative, and neutral. Deep learning methods have achieved notable results here. For example, deep architectures such as deep belief networks and long short-term memory networks can improve classification accuracy and generalization.
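As an illustration, a small bidirectional LSTM classifier over frame-level features might look like the following sketch (PyTorch); the feature dimension, layer sizes, and three-class setup are placeholders:

```python
import torch
import torch.nn as nn

class LSTMEmotionClassifier(nn.Module):
    """Illustrative utterance-level classifier over frame features
    (e.g., 40-dim MFCC/filterbank frames); sizes are placeholders."""
    def __init__(self, n_feats=40, hidden=128, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                  # x: (batch, frames, n_feats)
        seq, _ = self.lstm(x)
        return self.out(seq.mean(dim=1))   # mean-pool over time, then classify

model = LSTMEmotionClassifier()
frames = torch.randn(8, 300, 40)           # dummy batch: 8 utterances, 300 frames
loss = nn.CrossEntropyLoss()(model(frames), torch.randint(0, 3, (8,)))
loss.backward()
```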
II. Strengths and Challenges of Deep Learning Methods
Compared with traditional methods, deep learning offers several clear advantages for speech emotion recognition. First, it can automatically learn higher-level feature representations from large-scale data. Second, deep models generalize well and adapt to different scenarios and tasks. Third, deep learning can handle multiple feature types, including acoustic features, emotional features, and non-verbal information. Nevertheless, deep learning-based speech emotion recognition also faces challenges. First among them, labelling datasets is time-consuming and laborious and requires trained annotators for the emotional labels.
Experimental Paradigms in Affective Priming Research
方平 1, 陈满琪 1, 姜媛 2
(1 Department of Psychology, College of Education, Capital Normal University, Beijing 100037) (2 Academy of Psychology and Behavior, Tianjin Normal University, Tianjin 300074)
Abstract: Affective priming is an effective method for studying the automatic activation of emotion and attitudes. Its research paradigms have developed considerably in recent years, giving rise to several influential paradigms: the classic affective priming experiment, subliminal affective priming, backward affective priming, and experiments that combine affective priming with other paradigms. This article reviews each of these paradigms and their findings, summarizes the paradigms in use, and outlines likely directions for future research.
Keywords: emotion; affective priming; experimental paradigm
Evolutionary theory holds that, in a constantly changing environment, opportunities and threats can appear without warning, so the ability to distinguish "good" from "bad" quickly and without conscious involvement is a precondition for survival. A large body of research confirms [1] that people possess an automatic evaluation mechanism that scans surrounding stimuli rapidly and without deliberate effort, evaluates them as positive or negative, and thereby prepares subsequent behavior. One of the most important pieces of evidence for automatic evaluation is affective priming research [2]. This line of research matters for two reasons. First, extending the priming paradigm to the emotional domain both deepens the traditional priming paradigm and advances the experimental study of emotion. Second, its automatic and unconscious character makes it applicable to attitude measurement; it has become the method of choice for studying the automatic activation of attitudes and enriches existing measurement methods. Because of this significance, affective priming has attracted many researchers and produced a variety of experimental paradigms, but since researchers have focused on different questions, a systematic review of these paradigms has been lacking. This article introduces the most influential paradigms and their findings: the classic affective priming experiment, subliminal affective priming, backward affective priming, and combinations of affective priming with other paradigms. On that basis, it summarizes the existing paradigms and looks ahead to future developments.
1 Experimental paradigms and findings
1.1 The classic affective priming paradigm
The earliest empirical study of affective priming, by Fazio et al. [3], has become the classic affective priming paradigm. In that study, a prime of positive, negative, or neutral valence (an attitude object) is presented for 200 ms; after an interval of 100 ms or 800 ms, a target of positive, negative, or neutral valence (an adjective) is presented, yielding evaluatively congruent, evaluatively incongruent, and neutral conditions. The SOA is thereby controlled at 300 ms or 1000 ms (SOA being the interval from prime onset to target onset); a small scripted sketch of this trial structure follows.
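The sketch below enumerates trials with this timing in plain Python; the word lists and repeat counts are placeholders, and actual stimulus presentation would be handled by experiment software such as PsychoPy or E-Prime:

```python
import itertools
import random

# Prime shown 200 ms, blank interval of 100 or 800 ms (SOA = 300 or 1000 ms),
# then the target adjective until response. Word lists are placeholders.
PRIMES = {"positive": ["gift"], "negative": ["spider"], "neutral": ["chair"]}
TARGETS = {"positive": ["pleasant"], "negative": ["awful"], "neutral": ["wooden"]}

def build_trials(n_repeats=2):
    trials = []
    for (pv, tv, isi), _ in itertools.product(
            itertools.product(PRIMES, TARGETS, (100, 800)), range(n_repeats)):
        trials.append({
            "prime": random.choice(PRIMES[pv]), "prime_valence": pv,
            "target": random.choice(TARGETS[tv]), "target_valence": tv,
            "prime_ms": 200, "isi_ms": isi, "soa_ms": 200 + isi,
            "congruent": (pv == tv) and pv != "neutral",
        })
    random.shuffle(trials)
    return trials

for t in build_trials()[:3]:
    print(t)
```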
Research on Neural Network-Based Speech Emotion Recognition
With progress in artificial intelligence, demand for speech emotion recognition keeps growing. Compared with traditional emotion recognition methods, neural network-based approaches offer higher accuracy and broader applicability.
I. Basic Principles of Neural Networks
An artificial neural network is modelled on biological neural networks and can train and optimize itself in a manner loosely analogous to the brain. A neural network consists of many neurons; each neuron receives input signals and produces an output signal. Neurons are connected by weights, which represent the strength with which information passes between them. Learning consists of repeatedly adjusting these weights so that the network maps inputs to the desired outputs (a tiny worked example follows).
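A tiny worked example of that weight-adjustment idea, using a single logistic neuron trained by gradient descent on toy data (NumPy only; all values are illustrative):

```python
import numpy as np

# One logistic neuron trained by gradient descent on a toy 2-dim problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)        # toy target

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # neuron output (sigmoid)
    grad_w = X.T @ (p - y) / len(y)               # cross-entropy gradient
    grad_b = np.mean(p - y)
    w -= lr * grad_w                              # weight update = "learning"
    b -= lr * grad_b

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print("learned weights:", w, "accuracy:", np.mean((p > 0.5) == y))
```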
II. Development of Speech Emotion Recognition
Traditional speech emotion recognition relies mainly on signal processing and speech feature extraction, judging the speaker's emotional state from acoustic features such as fundamental frequency, non-harmonic energy, and tone. With the rapid development of artificial intelligence, however, neural network-based speech emotion recognition has become a research hotspot. Neural approaches fall into two broad classes: recognition based on standard speech features and recognition based on adaptively learned speech features. The former mainly uses models such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs); the latter uses models such as autoencoders (AEs) and variational autoencoders (VAEs).
III. Applications of Neural Networks in Speech Emotion Recognition
Neural speech emotion recognition is already used widely in voice assistants, in-car speech interfaces, conversational systems, psychological assessment, and other areas. For voice assistants and in-car speech recognition, accuracy and speed are the critical factors; for applications such as psychological assessment, model stability and interpretability matter more. Among current approaches, i-vector and x-vector techniques are some of the most representative: they characterize the speaker in an embedding space and use that representation to distinguish emotional states. Compared with traditional methods based directly on acoustic features, they offer higher recognition accuracy and broader applicability.
IV. Future Trends
Future work on neural speech emotion recognition will place more emphasis on model stability and interpretability.
Research on Anger Detection in Continuous Dialogue Speech (Part One)
I. Introduction
With the continued development of artificial intelligence, speech recognition and sentiment analysis have become active research areas. Within them, algorithms for detecting anger in continuous dialogue speech have important practical value: analyzing anger in continuous dialogue can provide technical support to fields such as counselling, intelligent customer service, and social media monitoring. This article introduces the background, significance, and current state of anger detection and states the aims, methods, and contributions of the present work.
II. Current State of Anger Detection
Current anger detection algorithms rely mainly on speech signal processing and machine learning. On the signal processing side, researchers focus on acoustic features, prosodic features, and emotion-related speech features; on the machine learning side, classifiers are applied to these features to recognize emotion. Detecting anger in continuous dialogue, however, still faces many challenges, including background noise, variation in speaking rate, and differences in accent.
III. Aims and Significance
This work aims to develop an effective algorithm for detecting anger in continuous dialogue speech, one that automatically recognizes and analyzes anger in continuous dialogue and provides technical support to the relevant fields. Its significance is threefold:
1. Supporting counselling, helping clinicians understand a patient's emotional state and improve treatment.
2. Giving intelligent customer service an emotion recognition capability, so that machines better understand user needs and feelings and improve service quality and satisfaction.
3. Providing a technical means for social media monitoring, helping organizations detect and handle public-opinion issues promptly.
IV. Methods
The study proceeds as follows:
1. Collect a large corpus of continuous dialogue speech labelled as angry or non-angry.
2. Preprocess the speech data, including denoising and normalization, to improve data quality.
3. Extract acoustic and prosodic features, such as fundamental frequency and energy.
4. Train classifiers on these features, for example support vector machines or deep learning models.
5. Design experiments to validate the algorithm's effectiveness and performance.
V. Algorithm Design and Implementation
This article proposes a deep learning-based algorithm for detecting anger in continuous dialogue speech, comprising the following steps:
1. Data preprocessing: denoise and normalize the continuous dialogue recordings.
2. Feature extraction: extract acoustic and prosodic features, such as fundamental frequency and energy, from the speech signal (a windowed sketch follows).
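A windowed sketch of this step for long dialogue recordings; the window and hop lengths, F0 search range, and the pretrained classifier `clf` are assumptions, not details from the original text:

```python
import numpy as np
import librosa

def windowed_anger_scores(path, clf, win_s=2.0, hop_s=1.0, sr=16000):
    """Illustrative sketch: slide a window over a long dialogue recording,
    pool F0/energy statistics per window, and score each window with an
    already-trained binary classifier `clf` (angry vs. non-angry)."""
    y, sr = librosa.load(path, sr=sr)
    win, hop = int(win_s * sr), int(hop_s * sr)
    times, feats = [], []
    for start in range(0, max(len(y) - win, 1), hop):
        seg = y[start:start + win]
        f0, _, _ = librosa.pyin(seg, fmin=70, fmax=400, sr=sr)
        rms = librosa.feature.rms(y=seg)
        feats.append([np.nanmean(f0), np.nanstd(f0), rms.mean(), rms.std()])
        times.append(start / sr)
    feats = np.nan_to_num(np.array(feats))        # unvoiced windows -> zeros
    return list(zip(times, clf.predict_proba(feats)[:, 1]))
```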
Vocal communication of emotion: A review of research paradigms
Klaus R. Scherer*
Department of Psychology, University of Geneva, 40 Boulevard du Pont d'Arve, CH-1205 Geneva, Switzerland

Abstract
The current state of research on emotion effects on voice and speech is reviewed and issues for future research efforts are discussed. In particular, it is suggested to use the Brunswikian lens model as a base for research on the vocal communication of emotion. This approach allows one to model the complete process, including both encoding (expression), transmission, and decoding (impression) of vocal emotion communication. Special emphasis is placed on the conceptualization and operationalization of the major elements of the model (i.e., the speaker's emotional state, the listener's attribution, and the mediating acoustic cues). In addition, the advantages and disadvantages of research paradigms for the induction or observation of emotional expression in voice and speech and the experimental manipulation of vocal cues are discussed, using pertinent examples drawn from past and present research.
© 2002 Elsevier Science B.V. All rights reserved.

* Tel.: +41-22-705-9211/9215; fax: +41-22-705-9219. E-mail address: klaus.scherer@pse.unige.ch (K.R. Scherer).
Speech Communication 40 (2003) 227–256
technology;Theories of emotion; Evaluation of emotion effects on voice and speech;Acoustic markers of emotion;Emotion induction;Emotion simulation;Stress effects on voice;Perception/decoding1.Introduction:Modeling the vocal communication of emotionThe importance of emotional expression in speech communication and its powerful impact on the listener has been recognized throughout his-tory.Systematic treatises of the topic,together with concrete suggestions for the strategic use of emo-tionally expressive speech,can be found in early Greek and Roman manuals on rhetoric(e.g.,by Aristotle,Cicero,Quintilian),informing all later treatments of rhetoric in Western philosophy (Kennedy,1972).Renewed interest in the expres-sion of emotion in face and voice was sparked in the19th century by the emergence of modern evolutionary biology,due to the contributions by Spencer,Bell,and particularly Darwin(1872, 1998).The empirical investigation of the effect of emotion on the voice started at the beginning of the 20th century,with psychiatrists trying to diagnose emotional disturbances through the newly devel-oped methods of electroacoustic analysis(e.g., Isserlin,1925;Scripture,1921;Skinner,1935).The invention and rapid dissemination of the telephone and the radio also led to increasing sci-entific concern with the communication of speaker attributes and states via vocal cues in speech (Allport and Cantril,1934;Herzog,1933;Pear, 1931).However,systematic research programs started in the1960s when psychiatrists renewed their interest in diagnosing affective states via vocal expression(Alpert et al.,1963;Moses,1954; Ostwald,1964;Hargreaves et al.,1965;Stark-weather,1956),non-verbal communication re-searchers explored the capacity of different bodily channels to carry signals of emotion(Feldman and Rim e,1991;Harper et al.,1978;Knapp,1972; Scherer,1982b),emotion psychologists charted the expression of emotion in different modalities (Tomkins,1962;Ekman,1972,1992;Izard,1971, 1977),linguists and particularly phoneticians dis-covered the importance of pragmatic information in speech(Mahl and Schulze,1964;Trager,1958; Pittenger et al.,1960;Caffiand Janney,1994),and engineers and phoneticians specializing in acoustic signal processing started to make use of ever more sophisticated technology to study the effects of emotion on the voice(Lieberman and Michaels, 1962;Williams and Stevens,1969,1972).In recent years,speech scientists and engineers,who had tended to disregard pragmatic and paralinguistic aspects of speech in their effort to develop models of speech communication for speech technology applications,have started to devote more atten-tion to speaker attitudes and emotions––often in the interest to increase the acceptability of speech technology for human users.The confer-ence which has motivated the current special issue of this journal(ISCA Workshop on Voice and Emotion,Newcastle,Northern Ireland,2000) and a number of recent publications(Amir and Ron,1998;Bachorowski,1999;Bachorowski and Owren,1995;Banse and Scherer,1996;Cowie and Douglas-Cowie,1996;Erickson et al.,1998; Iida et al.,1998;Kienast et al.,1999;Klasmeyer, 1999;Morris et al.,1999;Murray and Arnott, 1993;Mozziconacci,1998;Pereira and Watson, 1998;Picard,1997;Rank and Pirker,1998;Sobin and Alpert,1999)testifies to the lively research activity that has been sprung up in this domain. 
This paper attempts to review some of the central issues in empirical research on the vocal com-munication of emotion and to chart some of the promising approaches for interdisciplinary research in this area.I have repeatedly suggested(Scherer,1978, 1982a)to base theory and research in this area on a modified version of BrunswikÕs functional lens228K.R.Scherer/Speech Communication40(2003)227–256model of perception (Brunswik,1956;Gifford,1994;Hammond and Stewart,2001).Since the detailed argument can be found elsewhere (see Kappas et al.,1991;Scherer et al.,in press),I will only briefly outline the model (shown in the upper part of Fig.1,which represents the conceptual level).The process begins with the encoding,or expression,of emotional speaker states by certain voice and speech characteristics amenable to ob-jective measurement in the signal.Concretely,the assumption is that the emotional arousal of the speaker is accompanied by physiological changes that will affect respiration,phonation,and articu-lation in such a way as to produce emotion-specific patterns of acoustic parameters (see (Scherer,1986)for a detailed description).Using Brunswik Õs terminology,one can call the degree to which such characteristics actually correlate with the under-lying speaker state ecological validity .As these acoustic changes can serve as cues to speaker affect for an observer,they are called distal cues (distal in the sense of remote or distant from the observer).They are transmitted,as part of the speech signal,to the ears of the listener and perceived via the auditory perceptual system.In the model,these perceived cues are called proximal cues (proximal in the sense of close to the observer).There is some uncertainty among Brunswikians exactly how to define and operationalize proximal cues in different perceptual domains (see (Ham-mond and Stewart,2001)for the wide variety of uses and definitions of the model).While Brunswik apparently saw the proximal stimulus as close but still outside of the organism (Hammond and Stewart,2001),the classic example given for the distal–proximal relationship in visual percep-tion ––the juxtaposition between object size (distal)and retinal size (proximal)––suggests that the proximal cue,the pattern of light on the retina,is already inside the organism.Similarly,in auditory perception,the fundamental frequency of a speech wave constitutes the distal characteristic that gives rise to the pattern of vibration along the basilar membrane,and,in turn,the pattern of excitation along the inner hair cells,the consequent excita-tion of the auditory neurons,and,finally,its rep-resentation in the auditory cortex.Either phase in this input,transduction,and coding process could be considered a proximal representation of the distal stimulus.I believe that it makes sense to extend this term to the neural representation of the stimulus information as coded by the respective neural structures.The reason is twofold:(1)It is difficult to measure the raw input (e.g.,vibrationFig.1.A Brunswikian lens model of the vocal communication of emotion.K.R.Scherer /Speech Communication 40(2003)227–256229of the basilar membrane)and thus one could not systematically study this aspect of the model in relation to others;(2)The immediate input into the inference process,which Brunswik called cue uti-lization,is arguably the neural representation in the respective sensory cortex.Thus,the proximal cue for fundamental frequency would be perceived pitch.While fraught with many problems,we do have an access to 
measuring at least the conscious part of this representation via self-report(see B€a nziger and Scherer,2001).One of the most important advantages of the model is to highlight the fact that objectively measured distal characteristics are not necessarily equivalent to the proximal cues they produce in the observer.While the proximal cues are based on(or mimick)distal characteristics,the latter may be modified or distorted by(1)the transmission channel(e.g.,distance,noise)and(2)the structural characteristics of the perceptual organ and the transduction and coding process(e.g.,selective enhancement of certain frequency bands).These issues are discussed in somewhat greater detail below.The decoding process consists of the inference of speaker attitudes and emotions based on inter-nalized representations of emotional speech mod-ifications,the proximal cues.Thefit of the model can be ascertained by operationalizing and mea-suring each of its elements(see operational level in Fig.1),based on the definition of afinite number of cues.If the attribution obtained through listener judgments corresponds(with better than chance recognition accuracy)to the criterion for speaker state(e.g.,intensity of a certain emotion),the model describes a functionally valid communica-tion process.However,the model is also extremely useful in cases in which attributions and criteria do not match since it permits determination of the missing or faulty link of the chain.Thus,it is possible that the respective emotional state does not produce reliable externalizations in the form of specific distal cues in the voice.Alternatively,valid distal cues might be degraded or modified during transmission and perception in such a fashion that they no longer carry the essential information when they are proximally represented in the lis-tener.Finally,it is possible that the proximal cues reliably map the valid distal cues but that the in-ference mechanism,i.e.,the cognitive representa-tion of the underlying relationships,isflawed in the respective listener(e.g.,due to lack of sufficient exposure or inaccurate stereotypes).To my knowl-edge,there is no other paradigm that allows to examine the process of vocal communication in as comprehensive and systematic fashion.This is why I keep arguing for the utility of basing research in this area explicitly on a Brunswikian lens model.Few empirical studies have sought to model a specific communication process by using the com-plete lens model,mostly due to considerations in-volving the investment of the time and money required(but see Gifford,1994;Juslin,2000).In an early study on the vocal communication of speaker personality,trying to determine which personality traits are reliably indexed by vocal cues and cor-rectly inferred by listeners,I obtained natural speech samples in simulated jury discussions with adults(German and American men)for whom detailed personality assessments(self-and peer-ratings)had been obtained(see Scherer,1978, 1982a).Voice quality was measured via expert (phoneticians)ratings(distal cues)and personality inferences(attributions)by lay listenersÕratings (based on listening to content-masked speech sam-ples).In addition,a different group of listeners was asked to rate the voice quality with the help of a rating scale with natural language labels for vocal characteristics(proximal cues).Path ana-lyses were performed to test the complete lens model.This technique consists of a systematic series of regression analyses to test the causal assumptions in a model 
containing mediating variables(see Bryman and Cramer,1990,pp.246–251).The results for one of the personality traits studied are illustrated in Fig.2.The double arrows correspond to the theoretically specified causal paths.Simple and dashed arrows correspond to non-predicted direct and indirect effects that ex-plain additional variance.The graph shows why extroversion was correctly recognized from the voice:(1)extroversion is indexed by objectively defined vocal cues(ecologically valid in the Bruns-wikian sense),(2)these cues are not too drastically modified in the transmission process,and(3)the listenersÕinference structure in decoding mirrors230K.R.Scherer/Speech Communication40(2003)227–256the encoding structure.These conditions were not met in the case of emotional stability .While there were strong inference structures,shared by most listeners (attributing emotional stability to speak-ers with resonant,warm,low-pitched voices),these do not correspond to an equivalent encoding struc-ture (i.e.,there was no relationship between voice frequency and habitual emotional stability in the sample of speakers studied;see (Scherer,1978)for the data and further discussion).Clearly,a similar approach could be used with emotions rather than personality traits as speaker characteristics.Unfortunately,so far no complete lens model has been tested in this domain.Yet this type of model is useful as a heuristic device to design experimental work in this area even if only parts of the model are investigated.The curved arrows in Fig.1indicate some of the major issues that can be identified with the help of the model.These issues,and the evidence available to date,are presented below.2.A review of the literature 2.1.Encoding studiesThe basis of any functionally valid communi-cation of emotion via vocal expression is thatdifferent types of emotion are actually character-ized by unique patterns or configurations of acoustic cues.In the context of a Brunswikian lens model that means that the identifiable emotional states of the sender are in fact externalised by a specific set of distal cues.Without such distin-guishable acoustic patterns for different emotions,the nature of the underlying speaker state could not be communicated reliably.Not surprisingly,then,there have been a relatively large number of empirical encoding studies conducted over the last six decades,attempting to determine whether elic-itation of emotional speaker states will produce corresponding acoustic changes.These studies can be classified into three major categories:natural vocal expression,induced emotional expression,and simulated emotional expression.2.1.1.Natural vocal expressionWork in this area has made use of material that was recorded during naturally occurring emotional states of various sorts,such as dangerous flight situations for pilots,journalists reporting emotion-eliciting events,affectively loaded therapy sessions,or talk and game shows on TV (Johannes et al.,2000;Cowie and Douglas-Cowie,1996;Duncan et al.,1983;Eldred and Price,1958;Hargreaves et al.,1965;Frolov et al.,1999;Huttar,1968;Kuroda et al.,1979;Niwa,1971;RoesslerandFig.2.Two-dimensional path analysis model for the inference of extroversion from the voice (reproduced from Scherer,1978).The dashed line represents the direct path,the double line the postulated indirect paths,and the single lines the indirect paths compatible with the model.Coefficients shown are standardized coefficients except for r CD1and r CD2,which are Pearson r s (the direction shown is 
theoretically postulated).R 2s based on all predictors from which paths lead to the variable.Ãp <0:05;ÃÃp <0:01.K.R.Scherer /Speech Communication 40(2003)227–256231Lester,1976,1979;Simonov and Frolov,1973; Sulc,1977;Utsuki and Okamura,1976;Williams and Stevens,1969,1972;Zuberbier,1957;Zwirner, 1930).The use of naturally occurring voice chan-ges in emotionally charged situations seems the ideal research paradigm since it has very high ecological validity.However,there are some seri-ous methodological problems.Voice samples ob-tained in natural situations,often only for a single or a very small number of speakers,are generally very brief,not infrequently suffering from bad re-cording quality.In addition,there are problems in determining the preise nature of the underlying emotion and the effect of regulation(see below).2.1.2.Induced emotionsAnother way to study the vocal effects of emotion is to experimentally induce specific emo-tional states in groups of speakers and to record speech samples.A direct way of inducing affective arousal and studying the effects of the voice is the use of psychoactive drugs.Thus,Helfrich et al. (1984)studied the effects of antidepressive drugs on several vocal parameters(compared to placebo) over a period of several hours.Most induction studies have used indirect paradigms that include stress induction via difficult tasks to be completed under time pressure,the presentation of emotion-inducingfilms or slides,or imagery methods(Al-pert et al.,1963;Bachorowski and Owren,1995; Bonner,1943;Havrdova and Moravek,1979; Hicks,1979;Karlsson et al.,1998;Markel et al., 1973;Plaikner,1970;Roessler and Lester,1979; Scherer,1977,1979;Scherer et al.,1985;Skinner, 1935;Tolkmitt and Scherer,1986).While this ap-proach,generally favoured by experimental psy-chologists because of the degree of control it affords,does result in comparable voice samples for all participants,there are a number of serious drawbacks.Most importantly,these procedures often produce only relatively weak affect.Fur-thermore,in spite of using the same procedure for all participants,one cannot necessarily assume that similar emotional states are produced in all individuals,precisely because of the individual differences in event appraisal mentioned above. 
Recently,Scherer and his collaborators,in the context of a large scale study on emotion effects in automatic speaker verification,have attempted to remedy some of these shortcomings by developing a computerized induction battery for a variety of different states(Scherer et al.,1998).2.1.3.Simulated(portrayed)vocal expressionsThis has been the preferred way of obtaining emotional voice samples in thisfield.Professional or lay actors are asked to produce vocal expres-sions of emotion(often using standard verbal content)as based on emotion labels and/or typical scenarios(Banse and Scherer,1996;Bortz,1966; Coleman and Williams,1979;Costanzo et al., 1969;Cosmides,1983;Davitz,1964;Fairbanks and Pronovost,1939;Fairbanks and Hoaglin, 1941;Fonagy,1978,1983;Fonagy and Magdics, 1963;Green and Cliff,1975;H€offe,1960;Kaiser, 1962;Kienast et al.,1999;Klasmeyer and Send-lmeier,1997;Klasmeyer and Meier,1999;Klas-meyer and Sendlmeier,1999;Klasmeyer,1999; Kotlyar and Morozov,1976;Levin and Lord, 1975;Paeschke et al.,1999;Scherer et al.,1972a,b, 1973;Sedlacek and Sychra,1963;Sobin and Al-pert,1999;Tischer,1993;van Bezooijen,1984; Wallbott and Scherer,1986;Whiteside,1999; Williams and Stevens,1972).There can be little doubt that simulated vocal portrayal of emotions yields much more intense,prototypical expressions than are found for induced states or even natural emotions(especially when they are likely to be highly controlled;see(Scherer,1986,p.159)). However,it cannot be excluded that actors over-emphasize relatively obvious cues and miss more subtle ones that might appear in natural expres-sion of emotion(Scherer,1986,p.144).It has of-ten been argued that emotion portrayals reflect sociocultural norms or expectations more than the psychophysiological effects on the voice as they occur under natural conditions.However,it can be argued that all publicly observable expres-sions are to some extent‘‘portrayals’’(given the social constraints on expression and unconscious tendencies toward self-presentation;see(Banse and Scherer,1996)).Furthermore,since vocal por-trayals are reliably recognized by listener-judges (see below)it can be assumed that they reflect at least in part‘‘normal’’expression patterns(if the two were to diverge too much,the acted version232K.R.Scherer/Speech Communication40(2003)227–256would lose its credibility).However,there can be little doubt that actorsÕportrayals are influenced by conventionalised stereotypes of vocal expres-sion.2.1.4.Advantages and disadvantages of methodsAs shown above,all of the methods that have been used to obtain vocal emotion expression samples have both advantages and disadvantages. 
In the long run,it is probably the best strategy to look for convergences between all three ap-proaches in the results.Johnstone and Scherer (2000,pp.226–227)have reviewed the converging evidence with respect to the acoustic patterns that characterize the vocal expression of major modal emotions.Table1summarizes their conclusions in synthetic form.Much of the consistency in thefindings is linked to differential levels of arousal or activation for the target emotions.Indeed,in the past,it has often been assumed that contrary to the face,capable of communicating qualitative differences between emotions,the voice could only mark physiological arousal(see(Scherer,1979,1986)for a detailed discussion).That this conclusion is erroneous is shown by the fact that judges are almost as accu-rate in inferring different emotions from vocal as from facial expression(see below).There are two main reasons why it has been difficult to demon-strate qualitative differentiation of emotions in acoustic patterns apart from arousal:(1)only a limited number of acoustic cues has been studied, and(2)arousal differences within emotion families have been neglected.As to(1),most studies in thefield have limited the scope of acoustic measurement to F0,energy,and speech rate.Only very few studies have looked at the frequency distribution in the spectrum or formant parameters.As argued by Scherer(1986), it is possible that F0,energy,and rate may be most indicative of arousal whereas qualitative,valence differences may have a stronger impact on source and articulation characteristics.As to(2),most investigators have tended to view a few basic emotions as rather uniform and homogeneous,as suggested by the writings of discrete emotion theory(see Scherer,2000a).More recently,the concept of emotion families has been proposed(e.g.,Ekman,1992),acknowledging the fact that there are many different kinds of anger,of joy,or of fear.For example,the important vocal differences between the expression as hot anger (explosive rage)and cold,subdued or controlled, anger may explain some of the lack of replication for results concerning the acoustic patterns of anger expression in the literature.The interpreta-tion of such discrepancies is rendered particularly difficult by the fact that researchers generally do not specify which kind of anger has been produced or portrayed(see Scherer,1986).Thus,different instantiations or variants of specific emotions, even though members of the same family,may vary appreciably with respect to their acoustic expression patterns.Banse and Scherer(1996)have conducted an encoding study,using actor portrayal,which has attempted to remedy both of these problems.They used empirically derived scenarios for14emotions, ten of which consisted of pairs of two members of a same emotion family forfive modal emotions, varying in arousal.In addition to the classical set of acoustic parameters,they measured a numberTable1Synthetic compilation of the review of empirical data on acoustic patterning of basic emotions(based on Johnstone and Scherer,2000)Stress Anger/rage Fear/panic Sadness Joy/elation Boredom IntensityF0floor/meanF0variabilityF0range()Sentence contoursHigh frequency energy()Speech and articulation rate()K.R.Scherer/Speech Communication40(2003)227–256233of indicators of energy distribution in the spec-trum.Banse and Scherer interpret their data as clearly demonstrating that these states are acous-tically differentiated with respect to quality(e.g., valence)as well as arousal.They underline the need for 
future work that addresses the issue of subtle differences between the members of the same emotion family and that uses an extensive set of different acoustic parameters,pertinent to all as-pects of the voice and speech production process (respiration,phonation,and articulation).One of the major shortcomings of the encoding studies conducted to date has been the lack of theoretical grounding,most studies being moti-vated exclusively by the empirical detection of acoustic differences between emotions.As in most other areas of scientific inquiry,an atheoretical approach has serious shortcomings,particularly in not being able to account for lack of replication and in not allowing the identification of the mechanism underlying the effects found.It is dif-ficult to blame past researchers for the lack of theory,since emotion psychology and physiology, the two areas most directly concerned,have not been very helpful in providing tools for theorizing. For example,discrete emotion theory(Ekman, 1972,1992;Izard,1971,1977;see Scherer,2000a), which has been the most influential source of emo-tion conceptualization in this research domain,has specified the existence of emotion-specific neural motor patterns as underlying the expression.How-ever,apart from this general statement,no detailed descriptions of the underlying mechanisms or concrete hypotheses have been forthcoming(ex-cept to specify patterns of facial expression––mostly based on observation;see(Wehrle et al., 2000)).It is only with the advent of componential models,linked to the development of apprai-sal theories,that there have been attempts at a detailed hypothetical description for the pattern of motor expressions to be expected for specific emotions(see Scherer,1984,1992a;Roseman and Smith,2001).For the vocal domain,I have pro-posed a comprehensive theoretical grounding of emotion effects on vocal expression,as based on my component process model of emotion,yielding a large number of concrete predictions as to the changes in major acoustic parameters to be ex-pected for14modal emotions(Scherer,1986).The results from studies that have provided thefirst experimental tests of these predictions are de-scribed below.In concluding this section,it is appropriate to point out the dearth of comparative approaches that examine the relative convergence between empirical evidence on the acoustic characteristics of animal,particularly primate,vocalizations(see (Fichtel et al.,2001)for an impressive recent ex-ample)and human vocalizations.Such compari-sons would be all the more appropriate since there is much evidence on the phylogenetic continuity of affect vocalization,at least for mammals(see Hau-ser,2000;Scherer,1985),and since we can assume animal vocalizations to be direct expressions of the underlying affect,largely free of control or self-presentation constraints.2.2.Decoding studiesWork in this area examines to what extent lis-teners are able to infer actual or portrayed speaker emotions from content-free speech samples.Re-search in this tradition has a long history and has produced a large body of empirical results.In most of these studies actors have been asked to portray a number of different emotions by producing speech utterances with standardized or nonsense content.Groups of listeners are given the task to recognize the portrayed emotions.They are gen-erally required to indicate the perceived emotion on rating sheets with standard lists of emotion labels,allowing to compute the percentage of stimuli per emotion that were correctly 
recognized.A review of approximately30studies conducted up to the early1980s yielded an average accuracy percentage of about60%,which is aboutfive times higher than what would be expected if listeners had guessed in a random fashion(Scherer,1989). Later studies reported similar levels of average recognition accuracy across different emotions, e.g.,65%in a study by van Bezooijen(1984)and 56%in a study by Scherer et al.(1991).One of the major drawbacks of all of these studies is that they tend to study discrimination (deciding between alternatives)rather than recog-234K.R.Scherer/Speech Communication40(2003)227–256。