ASR-dependent techniques for speaker identification
Appendix (English original): Chinese Journal of Electronics, Vol.15, No.3, July 2006

A Speaker-Independent Continuous Speech Recognition System Using Biomimetic Pattern Recognition

WANG Shoujue and QIN Hong (Laboratory of Artificial Neural Networks, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China)

Abstract: In speaker-independent speech recognition, the disadvantage of the most widely used technology (HMMs, or Hidden Markov models) is not only the need for many more training samples, but also the long training time required. This paper describes the use of Biomimetic pattern recognition (BPR) for recognizing Mandarin continuous speech in a speaker-independent manner. A speech database was developed for this study. The vocabulary of the database consists of 15 Chinese dish names; the length of each name is 4 Chinese words. Neural networks (NNs) based on the Multi-weight neuron (MWN) model are used to train on and recognize the speech sounds. The number of MWNs was investigated to achieve the optimal performance of the NN-based BPR. This system, which is based on BPR and can carry out real-time recognition, reaches a recognition rate of 98.14% for the first option and 99.81% for the first two options for persons from different provinces of China speaking standard Chinese. Experiments were also carried out to compare Continuous density hidden Markov models (CDHMM), Dynamic time warping (DTW) and BPR for speech recognition. The experimental results show that BPR outperforms CDHMM and DTW, especially in the case of training sets of finite size.

Key words: Biomimetic pattern recognition, Speech recognition, Hidden Markov models (HMMs), Dynamic time warping (DTW).

I. Introduction

The main goal of Automatic speech recognition (ASR) is to produce a system which will accurately recognize normal human speech from any speaker. Recognition systems may be classified as speaker-dependent or speaker-independent. Speaker dependence requires that the system be personally trained with the speech of the person who will operate it in order to achieve a high recognition rate. For applications in public facilities, on the other hand, the system must be capable of recognizing speech uttered by many different people, of different gender, age, accent, etc.; speaker independence therefore has many more applications, primarily in the general area of public facilities. The most widely used technology in speaker-independent speech recognition is Hidden Markov Models; its disadvantage is not only the need for many more training samples, but also the long training time required. Since Biomimetic pattern recognition (BPR) was first proposed by Wang Shoujue, it has already been applied to object recognition, face identification and face recognition, etc., and has achieved much better performance. With some adaptation, such modeling techniques can easily be used for speech recognition as well. In this paper, a real-time Mandarin speech recognition system based on BPR is proposed, which outperforms HMMs especially in the case of training sets of finite size. The system is a small-vocabulary, speaker-independent, continuous speech recognition system. The whole system is implemented on a PC under a Windows 98/2000/XP environment with the CASSANN-II neurocomputer. It supports a standard 16-bit sound card.
II. Introduction of Biomimetic Pattern Recognition and Multi-Weights Neuron Networks

1. Biomimetic pattern recognition

Traditional pattern recognition aims at the optimal classification of different classes of samples in the feature space. BPR, in contrast, intends to find the optimal coverage of the samples of the same type. It follows from the Principle of Homology-Continuity: if two samples belong to the same class, the difference between them must change gradually, so a gradually changing sequence must exist between the two samples. In BPR theory, the construction of the sample subspace of each type of sample depends only on the type itself. More specifically, the construction of the subspace of a certain type of sample depends on analyzing the relations between the trained samples of that type and utilizing methods of "coverage of objects with complicated geometrical forms in the multidimensional space".

2. Multi-weights neuron and multi-weights neuron networks

A multi-weights neuron can be described as follows:

Y = f[Φ(W_1, W_2, ..., W_m, X) - θ]

where W_1, W_2, ..., W_m are m weight vectors, X is the input vector, Φ is the neuron's computation function, θ is the threshold, and f is the activation function. According to dimension theory, in the feature space R^n with X ∈ R^n, the equation Φ(W_1, W_2, ..., W_m, X) = θ defines an (n-1)-dimensional hypersurface in the n-dimensional space, determined by the weights W_1, W_2, ..., W_m; it divides the n-dimensional space into two parts. If Φ(W_1, W_2, ..., W_m, X) = θ is a closed hypersurface, it encloses a finite subspace. According to the principle of BPR, the subspace of a certain type of sample is determined from that type of sample itself. If we can find a set of multi-weights neurons (a multi-weights neuron network) covering all the training samples, the subspace of the neural network represents the sample subspace. When an unknown sample falls in this subspace, it can be assigned to the same type as the training samples. Moreover, if a new type of sample is added, it is not necessary to retrain any of the previously trained types; the training of one type of sample has nothing to do with the others.

III. System Description

The speech recognition system is divided into two main blocks. The first is the signal pre-processing and speech feature extraction block; the other is the multi-weights neuron network, which performs the task of BPR.

1. Speech feature extraction

Mel-frequency cepstral coefficients (MFCC) are used as speech features. They are calculated as follows: A/D conversion; endpoint detection using short-time energy and zero-crossing rate (ZCR); pre-emphasis and Hamming windowing; fast Fourier transform; DCT transform. The number of features extracted for each frame is 16, and 32 frames are chosen for every utterance, so a 512-dimensional Mel-cepstral feature vector (16 × 32 numerical values) represents the pronunciation of every word.

2. Multi-weights neuron network architecture

As a new general-purpose theoretical model of pattern recognition, BPR is realized here by multi-weights neuron networks. To train a certain class of samples, a multi-weights neuron subnetwork is established. The subnetwork consists of one input layer, one multi-weights neuron hidden layer and one output layer. Such a subnetwork can be considered as a mapping F: R^512 → R,

F(X) = min(Y_1, Y_2, ..., Y_m)

where Y_i is the output of the i-th multi-weights neuron, there are m hidden multi-weights neurons, i = 1, 2, ..., m, and X ∈ R^512 is the input vector.
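As an illustration of the subnetwork just described, the following is a minimal sketch of evaluating F(X) = min(Y_1, ..., Y_m) and of the coverage-based decision. The specific computation function Φ used here (distance from the input to the polyline through three weight vectors) is an assumed stand-in; the paper's actual Φ follows its Ref.[4].

```python
import numpy as np

def _point_to_segment(x, a, b):
    """Distance from point x to the line segment a-b."""
    d = b - a
    denom = float(np.dot(d, d))
    t = 0.0 if denom == 0.0 else float(np.clip(np.dot(x - a, d) / denom, 0.0, 1.0))
    return float(np.linalg.norm(x - (a + t * d)))

def phi(w1, w2, w3, x):
    """One possible computation function Phi for a 3-weight neuron: the distance
    from x to the polyline w1-w2-w3 (a coverage body of 'sausage' shape).
    This choice is an assumption, not the paper's exact definition."""
    return min(_point_to_segment(x, w1, w2), _point_to_segment(x, w2, w3))

def subnetwork_output(x, neurons):
    """F(X) = min(Y_1, ..., Y_m) over the hidden multi-weights neurons of one class."""
    return min(phi(w1, w2, w3, x) for (w1, w2, w3) in neurons)

def classify(x, subnetworks, theta):
    """Assign x to the class whose subnetwork covers it (output below the
    threshold theta); return None when no class subspace covers x."""
    scores = {cls: subnetwork_output(x, ns) for cls, ns in subnetworks.items()}
    cls, val = min(scores.items(), key=lambda kv: kv[1])
    return cls if val < theta else None
```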
IV. Training for MWN Networks

1. Basics of MWN network training

Training one multi-weights neuron subnetwork requires calculating the weights of the multi-weights neuron layer. The multi-weights neuron and the training algorithm used are those of Ref.[4]. In this algorithm, if the number of training samples of each class is N, we can use N - 2 neurons. In this paper, N = 30.

Y_i = f[Φ(s_i, s_{i+1}, s_{i+2}, x)]

is a function with a multi-vector input and a single scalar output.

2. Optimization method

According to IV.1, if there are many training samples the number of neurons will be very large, which reduces the recognition speed. When learning several classes of samples, knowledge of the class membership of the training samples is available. We use this information in a supervised training algorithm to reduce the network scale. When training class A, we treat the remaining training samples of the other 14 classes as class B. So there are 30 training samples in set A: {a_1, a_2, ..., a_30} and 420 training samples in set B: {b_1, b_2, ..., b_420}. First select 3 samples from A, giving a neuron:

Y_1 = f[Φ(a_{k1}, a_{k2}, a_{k3}, x)].

Let Y^A_{1,i} = f[Φ(a_{k1}, a_{k2}, a_{k3}, a_i)], where i = 1, 2, ..., 30, and Y^B_{1,j} = f[Φ(a_{k1}, a_{k2}, a_{k3}, b_j)], where j = 1, 2, ..., 420; let V = min_j(Y^B_{1,j}). We specify a value r, 0 < r < 1. If Y^A_{1,i} < r·V, remove a_i from set A, thus obtaining a new set A^(1). We continue in this way until the set A^(k) is empty, A^(k) = ∅; then the training ends, and the hidden layer of the class-A subnetwork consists of the neurons constructed in this process.

V. Experiment Results

A speech database consisting of 15 Chinese dish names was developed for this study. The length of each name is 4 Chinese words; that is, each speech sample is a continuous string of 4 words, such as "yu xiang rou si", "gong bao ji ding", etc. It was organized into two sets: a training set and a test set. The speech signal is sampled at 16 kHz with 16-bit resolution.

Table 1. Experimental results at different values of r.

450 utterances constitute the training set used to train the multi-weights neuron networks. These 450 utterances belong to 10 speakers (5 males and 5 females) from different Chinese provinces; each speaker uttered each of the names 3 times. The test set had a total of 539 utterances from another 4 speakers who uttered the 15 names arbitrarily. The tests evaluating the recognition system were carried out for different r from 0.5 to 0.95 with a step of 0.05. The experimental results at the different values of r are shown in Table 1. Obviously, the network was able to achieve full recognition of the training set at any r. From the experiments it was found that r = 0.5 already achieved nearly the same recognition rate as the basic algorithm, while the number of MWNs used in the networks is much smaller than in the basic algorithm.

Table 2. Experimental results of the BPR basic algorithm.

Experiments were also carried out to compare Continuous density hidden Markov models (CDHMM), Dynamic time warping (DTW) and Biomimetic pattern recognition (BPR) for speech recognition, emphasizing the performance of each method across decreasing amounts of training samples as well as the required training time. The CDHMM system was implemented with 5 states per word; the Viterbi algorithm and Baum-Welch re-estimation are used for recognition and training, respectively. The reference templates for the DTW system are the training samples themselves.
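Returning to the training procedure of Section IV.2 above, the pruning loop can be written compactly. This is a hedged sketch, not the authors' implementation: it reuses a Φ with the three-weight-vector signature sketched earlier, and the way the defining triple of samples is chosen (simply the first three remaining samples) is an assumption the paper does not spell out.

```python
def train_class_subnetwork(A, B, r, phi):
    """Greedy construction of one class's hidden layer: build a neuron from
    three class-A samples, measure its smallest response V on the other
    classes' samples (set B), and drop every class-A sample whose response
    is below r*V, repeating until class A is exhausted."""
    remaining = list(A)
    neurons = []
    while len(remaining) >= 3:
        w1, w2, w3 = remaining[0], remaining[1], remaining[2]
        neurons.append((w1, w2, w3))
        v = min(phi(w1, w2, w3, b) for b in B)      # closest competing-class sample
        # keep only the class-A samples this neuron does not yet cover safely
        remaining = [a for a in remaining if phi(w1, w2, w3, a) >= r * v]
    # a final leftover of one or two samples would need a degenerate neuron;
    # that corner case is omitted in this sketch
    return neurons
```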
Both the CDHMM and the DTW technique are implemented using the programs in Ref.[11]. Table 2 gives the comparison of the experimental results of the BPR basic algorithm, Dynamic time warping (DTW) and the Hidden Markov model (HMM) method. The HMM system was based on Continuous density hidden Markov models (CDHMMs) and was implemented with 5 states per name.

VI. Conclusions and Acknowledgments

In this paper, a Mandarin continuous speech recognition system based on BPR is established. In addition, a training-sample selection method is used to reduce the network scale. As a new general-purpose theoretical model of pattern recognition, BPR can be used in speech recognition too, and the experimental results show that it achieves higher performance than HMMs and DTW.

References
[1] Wang Shoujue, "Biomimetic (Topological) pattern recognition - a new model of pattern recognition theory and its application", Acta Electronica Sinica (in Chinese), Vol.30, No.10, pp.1417-1420, 2002.
[2] Wang Shoujue, Chen Xu, "Biomimetic (Topological) pattern recognition - a new model of pattern recognition theory and its application", Proceedings of the International Joint Conference on Neural Networks, Vol.3, pp.2258-2262, July 20-24, 2003.
[3] Wang Shoujue, Zhao Xingtao, "Biomimetic pattern recognition theory and its applications", Chinese Journal of Electronics, Vol.13, No.3, pp.373-377, 2004.
[4] Xu Jian, Li Weijun et al., "Architecture research and hardware implementation on simplified neural computing system for face identification", Proceedings of the International Joint Conference on Neural Networks, Vol.2, pp.948-952, July 20-24, 2003.
[5] Wang Zhihai, Mo Huayi et al., "A method of biomimetic pattern recognition for face recognition", Proceedings of the International Joint Conference on Neural Networks, Vol.3, pp.2216-2221, July 20-24, 2003.
[6] Wang Shoujue, Wang Liyan et al., "A general purpose neuron processor with digital-analog processing", Chinese Journal of Electronics, Vol.3, No.4, pp.73-75, 1994.
[7] Wang Shoujue, Li Zhaozhou et al., "Discussion on the basic mathematical models of neurons in general purpose neuro-computers", Acta Electronica Sinica (in Chinese), Vol.29, No.5, pp.577-580, 2001.
[8] Wang Shoujue, Wang Bainan, "Analysis and theory of high-dimension space geometry of artificial neural networks", Acta Electronica Sinica (in Chinese), Vol.30, No.1, pp.1-4, 2001.
[9] Wang Shoujue, Xu Jian et al., "Multi-camera human-face personal identification system based on the biomimetic pattern recognition", Acta Electronica Sinica (in Chinese), Vol.31, No.1, pp.1-3, 2003.
[10] Ryszard Engelking, Dimension Theory, PWN - Polish Scientific Publishers, Warszawa, 1978.
[11] Qiang He, Ying He, Matlab Programming, Tsinghua University Press, 2002.
Modern Electronics Technique, 1 November 2023, Vol.46, No.21

0 Introduction

Speech Emotion Recognition (SER) is an important direction in the development of human-computer interaction. It has three major aspects: construction of speech emotion databases, extraction of speech emotion features, and classification models [1].
Many factors influence speech emotion recognition, and different languages in particular have a large influence on how emotion is expressed, which makes speech emotion feature extraction an important research direction.
The development of deep learning has made feature extraction easier, but only when the handcrafted features that best characterize speech emotion are used as input can a deep learning model extract the best deep features from them and achieve better results.
To improve the Tibetan speech emotion recognition rate, this paper proposes a speech emotion feature extraction method for Tibetan: a 312-dimensional Tibetan speech emotion feature set (TPEFS) is handcrafted from the linguistic characteristics of Tibetan itself, deep features are then extracted with a Long Short-Term Memory network (LSTM), and finally these features are classified.
The structure of the Tibetan speech emotion recognition system is shown in Fig.1.
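As an illustration of the pipeline just described (handcrafted TPEFS features fed to an LSTM for deep-feature extraction and classification), here is a minimal sketch. The sequence layout of the 312-dimensional features, the hidden size and the number of emotion classes are assumptions, since the excerpt does not specify them.

```python
import torch
import torch.nn as nn

class TpefsLstmClassifier(nn.Module):
    """Hypothetical sketch: LSTM deep-feature extractor over handcrafted
    emotion features, followed by a linear emotion classifier.
    Input size, hidden size and the number of emotion classes are assumptions."""
    def __init__(self, feat_dim=312, hidden=128, num_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_emotions)

    def forward(self, x):              # x: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(x)     # deep feature = last hidden state
        return self.head(h_n[-1])      # emotion logits

# usage sketch: 4 utterances, each as a short sequence of TPEFS blocks
model = TpefsLstmClassifier()
dummy = torch.randn(4, 10, 312)
logits = model(dummy)                  # shape (4, 6)
```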
Tibetan Speech Emotion Recognition Based on Multi-Feature Fusion

GU Zeyue (1), BIANBA Wangdui (1, 2), QI Jindong (1)
(1. School of Information Science and Technology, Tibet University, Lhasa 850000, China; 2. National Demonstration Center for Experimental Information Technology Education, Lhasa 850000, China)

Abstract: Tibetan speech emotion recognition is an application of speech emotion recognition to minority-language speech processing. Speech emotion recognition is an important research direction in human-computer interaction; extracting the features that best characterize speech emotion and building acoustic models with strong robustness and generalization are its key research topics.
On this basis, in order to build an efficient and targeted Tibetan speech emotion recognition model, this paper constructs a Tibetan speech emotion dataset (TBSEC001) and proposes a handcrafted speech emotion feature set suited to Tibetan (TPEFS), extracted manually on the basis of the commonalities and particularities of Tibetan with respect to other languages. The TPEFS feature set achieves good results with classic models such as the support vector machine (SVM), multilayer perceptron (MLP), convolutional neural network (CNN), and long short-term memory network (LSTM).
WHISPERY SPEECH RECOGNITION USING ADAPTED ARTICULATORY FEATURES

Szu-Chen Jou, Tanja Schultz, and Alex Waibel
Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA
scjou, tanja, ahw@

ABSTRACT
This paper describes our research on adaptation methods applied to articulatory feature detection on soft whispery speech recorded with a throat microphone. Since the amount of adaptation data is small and the testing data is very different from the training data, a series of adaptation methods is necessary. The adaptation methods include: maximum likelihood linear regression, feature-space adaptation, and re-training with downsampling, a sigmoidal low-pass filter, and linear multivariate regression. Adapted articulatory feature detectors are used in parallel to standard senone-based HMM models in a stream architecture for decoding. With these adaptation methods, articulatory feature detection accuracy improves from 87.82% to 90.52%, with the corresponding F-measure improving from 0.504 to 0.617, while the final word error rate improves from 33.8% to 31.2%.

1. INTRODUCTION
Today's real-world applications are driven by ubiquitous mobile devices which lack keyboard functionality. These applications demand new spoken input methods that do not disturb the environment and preserve the privacy of the user. Verification systems for banking applications or private phone calls in a quiet environment are only a few examples. As a consequence, recent developments in the area of processing whispered speech or non-audible murmur draw a lot of attention. Automatic speech recognition (ASR) has been proven to be a successful interface for spoken input, but so far, microphones have been used that apply the principle of air transmission to carry the sound from the speaker's mouth to the input device. When transmitting soft whisper, those microphones tend to fail, causing the performance of ASR to deteriorate.

Contact microphones, on the other hand, pick up speech signals through skin vibrations rather than by air transmission. As a result, processing of whispered speech is possible. Research related to contact microphones includes using a stethoscopic microphone for non-audible murmur recognition [1] and speech detection and enhancement with a bone-conductive microphone [2].

In our previous work, we have demonstrated how to use a throat microphone, one of many kinds of contact microphones, for automatic soft whisper recognition [3]. Based on that, this paper discusses how we incorporate articulatory features (AFs) as an additional information source to improve recognition results. Articulatory features, e.g. voicing or tongue position, have shown great potential for robust speech recognition [4]. Since whispery speech …

… speakers from those of the BN data, and our sentences are different from the BN ones but in the same domain.

Table 1. Data for training, adaptation, and testing
Training: 66.48 hr; Adaptation: 712.8 s; Testing: 153.1 s

[Tables 2-3 residue, AF detection accuracy (%) / F-measure: 89.30/0.585, 89.04/0.579; Downsample 88.46/0.551, 89.56/0.592, 89.26/0.583; log Mel-spec 87.12/0.493, 88.95/0.573, 88.52/0.560; CMN-MFCC 87.53/0.513; FSA, FSA+G.FSA 90.27/0.610, 89.19/0.585]

… also make performance worse, in contrast to the improvements made for senone models [3]. Since sigmoidal low-pass filtering with … is the only adaptation method that improves performance, the following experiments are conducted in addition to it. We then apply additional FSA, group FSA, group MLLR, and iterative MLLR methods with … . As shown in Table 3, group FSA performs best, so further iterative MLLR is conducted in addition to group FSA. Compared to its effects on senone models, iterative MLLR saturates faster, in about 20 iterations, and peaks at 34 iterations
with a performance of 90.52%/0.617. Fig.2 shows a comparison of the F-measure of the individual AFs, including the baseline AFs tested on the BN eval98/F0 test set and on the throat-whisper test set, and the best adapted AFs on the throat-whisper test set. The AFs are listed in the order of F-score improvement from adaptation; e.g. the leftmost, AFFRICATE, has the largest improvement by adaptation. Performance degradation from BN to throat-whisper had been expected. However, some AFs such as AFFRICATE and GLOTTAL degrade drastically, as the acoustic variation of these features is among the largest. Since there is no vocal cord vibration in whispery speech, GLOTTAL would not be useful for such a task. For the same reason, vowel-related AFs, such as CLOSE and CENTRAL, suffer from the mismatch. Most AFs improve by adaptation; NASAL, for example, is one of the best AFs on BN data but degrades a lot on throat-whisper, as can be inferred from Fig.1. After adaptation, its F-measure doubles, but there is still a gap to the performance level on BN data.

5.2. Stream Decoding
In the stream architecture, we put together our best senone model and the best AF detectors. The first experiments combine the senone model with each single AF detector to see how well the AF detectors can help the senone model.

Table 4. Four-best single-AF WERs on different weight ratios (flattened values): baseline 33.8; ASPIRATED 31.4; RETROFLEX 31.7; CLOSE 32.6; ALVEOLAR 33.1; DENTAL 33.3; PALATAL 33.1

Table 4 shows the WERs of different combination weights and the four best single AF detectors. As shown in the table, the combination of 90% of the weight on the senone models and 10% on the AF detectors results in the best performance, which can be regarded as a global minimum with respect to the different weights. In other words, the single AFs can help only with a carefully selected weight.
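A minimal sketch of the stream-decoding score combination discussed above: the acoustic score is a weighted sum of the senone-model log-likelihood and the AF-detector log-likelihoods, here with the 90:10 weighting that performed best. The equal split of the AF weight across multiple detectors is an assumption; the paper tunes per-stream weights explicitly.

```python
import numpy as np

def combined_stream_score(senone_logprob, af_logprobs, senone_weight=0.9):
    """Weighted combination of one senone stream and several AF streams.
    senone_logprob: acoustic log-likelihood from the senone-based HMM;
    af_logprobs: log-likelihoods from the articulatory feature detectors."""
    af_logprobs = np.asarray(af_logprobs, dtype=float)
    af_weight = (1.0 - senone_weight) / max(len(af_logprobs), 1)
    return senone_weight * senone_logprob + af_weight * af_logprobs.sum()

# usage: one senone stream at weight 0.9 and one AF stream at weight 0.1,
# mirroring the best 90:10 setting reported above
score = combined_stream_score(-42.7, [-3.1], senone_weight=0.9)
```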
In the next experiments, we incrementally add from one up to ten AF detectors to the streams. We use simple rules to select the AF detectors. The AF selection criteria include one-best WER (WER), accuracy (acc), and F-measure (F). According to each criterion, AF selection starts in greedy fashion from the AF detector having the best performance, then picks the second best one, and so on. There is also a set of weighting rules for adding more AFs. The first weighting rule always assigns 0.05 to the weight of every AF (w5). The second rule distributes a total weight of 0.1 uniformly over the AFs (unif). The last one puts more weight on the better performing AFs, using a rank-based formula in which the weight depends on the total number of AF detectors used and on the performance rank (scaled).

Fig. 3. WERs versus the number of AF detectors used in the stream.

Fig.3 shows the WERs with AF selection using *-WER, which showed a better result than the other two criteria; this result is consistent with [5]. On the other hand, the fixed weight (w5-*) suffers from insufficient weight on the senone models as the number of AFs increases. With one exception, where the WER improves to 31.2% in scaled-F with ALVEOLAR and FRICATIVE, incorporating more than one AF does not improve the WER. We suspect the reason is that the mismatched training and testing data are quite different acoustically, while the adaptation data is not enough to reliably estimate the AFs. Therefore we cannot achieve the level of improvement reported in [5].

6. CONCLUSIONS
We have developed a series of adaptation methods applied to articulatory feature detection, which improve the performance of a standard senone-based HMM throat-whisper recognizer using a stream decoder. We have also shown that AF adaptation improves detection accuracy and F-measure. With a t-test value of 0.046, the best stream decoding performance (WER = 31.2%) is statistically significant; however, on such a small test set, some other, smaller improvements are not. We therefore plan to collect more data. Further work could apply discriminative model combination (DMC) to the stream architecture for better weights [12].

7. ACKNOWLEDGEMENTS
The authors wish to thank Dr. Yoshitaka Nakajima for the invitation to his lab, the chance to gain hands-on experience using the stethoscopic microphones developed at his lab, and his hospitality. Many thanks to Hua Yu for providing the BN baseline system, and to Florian Metze and Sebastian Stüker for the AF and stream scripts.
Thanks also go to the reviewers for their valuable comments.

8. REFERENCES
[1] Y. Nakajima, H. Kashioka, K. Shikano, and N. Campbell, "Non-audible murmur recognition input interface using stethoscopic microphone attached to the skin," in Proc. ICASSP, Hong Kong, 2003.
[2] Y. Zheng, Z. Liu, Z. Zhang, M. Sinclair, J. Droppo, L. Deng, A. Acero, and X. Huang, "Air- and bone-conductive integrated microphones for robust speech detection and enhancement," in Proc. ASRU, St. Thomas, U.S. Virgin Islands, Dec 2003.
[3] S.-C. Jou, T. Schultz, and A. Waibel, "Adaptation for soft whisper recognition using a throat microphone," in Proc. ICSLP, Jeju Island, Korea, Oct 2004.
[4] K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, Ph.D. thesis, University of Bielefeld, Germany, July 1999.
[5] F. Metze and A. Waibel, "A flexible stream architecture for ASR using articulatory features," in Proc. ICSLP, Denver, CO, Sep 2002.
[6] "/english/prod01.htm".
[7] H. Yu and A. Waibel, "Streaming the front-end of a speech recognizer," in Proc. ICSLP, Beijing, China, 2000.
[8] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, pp. 171-185, 1995.
[9] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," Speech Communication, vol. 11, pp. 175-187, 1992.
[10] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75-98, 1998.
[11] P. Heracleous, Y. Nakajima, A. Lee, H. Saruwatari, and K. Shikano, "Accurate hidden Markov models for non-audible murmur (NAM) recognition based on iterative supervised adaptation," in Proc. ASRU, St. Thomas, U.S. Virgin Islands, Dec 2003.
[12] S. Stüker, F. Metze, T. Schultz, and A. Waibel, "Integrating multilingual articulatory features into speech recognition," in Proc. Eurospeech, Geneva, Switzerland, Sep 2003.
… crossing rate are used to determine the endpoints of a speech segment.

Within a segment of recording, once speech begins its energy becomes relatively large, so a relatively high threshold is first set on the energy contour to determine a rough beginning and end of the speech, and then a lower threshold is used to determine the true starting point and end point. The zero-crossing rate is used to distinguish speech from silence; here a relatively low threshold Q is chosen, since the low-threshold zero-crossing rate of the background noise is clearly lower than that of speech. The window length is usually 10 ms to 15 ms, and the frame shift 5 ms to 10 ms.
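The double-threshold endpoint detection just described can be sketched in a few lines. This is a minimal illustration, not the book's code; the threshold ratios, frame sizes and ZCR threshold below are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    return np.sum(frames.astype(float) ** 2, axis=1)

def zero_crossing_rate(frames):
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def double_threshold_endpoints(x, fs, frame_ms=12, hop_ms=8,
                               high_ratio=0.25, low_ratio=0.05, zcr_thresh=0.1):
    """High energy threshold locates the rough speech region, the low energy
    threshold extends it outward, and a low ZCR threshold extends it further
    to catch unvoiced onsets and endings."""
    frames = frame_signal(x, int(fs * frame_ms / 1000), int(fs * hop_ms / 1000))
    e, z = short_time_energy(frames), zero_crossing_rate(frames)
    high, low = high_ratio * e.max(), low_ratio * e.max()
    idx = np.where(e > high)[0]
    if idx.size == 0:
        return None                                  # no speech detected
    start, end = idx[0], idx[-1]
    while start > 0 and (e[start - 1] > low or z[start - 1] > zcr_thresh):
        start -= 1
    while end < len(e) - 1 and (e[end + 1] > low or z[end + 1] > zcr_thresh):
        end += 1
    return start, end                                # frame indices of the speech segment
```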
The following shows the energy and zero-crossing rate analysis and the double-threshold method for the Chinese digit "0", sampled at 8 kHz with a duration of 1.815 s.
Fig. 2-8. Short-time energy in the double-threshold method.

2.3 Frequency-domain analysis of speech signals

Through time-domain analysis of a speech signal we can obtain its short-time energy and short-time average zero-crossing rate, and thereby carry out endpoint detection on the signal.
Theoretically, speech is a dynamic signal, so much richer information about speech resides in the frequency domain.
The short-time Fourier transform is a commonly used speech analysis method, based on the assumption that the speech signal is quasi-stationary over short intervals.
Let {x(n)} be the speech signal; its short-time Fourier transform is defined as

X_n(e^{jω}) = Σ_{m=-∞}^{∞} x(m) w(n-m) e^{-jωm}    (2-10)

where w(n) is the window sequence and x(m)w(n-m) is the windowed speech signal. This expression reflects the dynamic character of the speech spectrum. From the short-time Fourier transform, the short-time power spectrum can be obtained:

P_n(e^{jω}) = |X_n(e^{jω})|^2 = Σ_{k=-∞}^{∞} R_n(k) e^{-jωk}    (2-11)

R_n(k) = Σ_m x(m) · w(n-m) · x(m+k) · w(n-m-k)    (2-12)

The spectrogram is a grey-level representation of the power spectrum.
Besides the spectrum and the short-time power spectrum, there are also the log power spectrum and the cepstrum. Fig. 2-9 shows the relations among these short-time Fourier transform spectra.
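As a worked illustration of Eqs. (2-10) to (2-12) and of the relations just mentioned, the following sketch computes the short-time power spectrum, the log power spectrum (whose grey-level image is the spectrogram) and a cepstrum derived from it. The frame length, hop and Hamming window choice are assumptions, not values fixed by the text.

```python
import numpy as np

def stft_power(x, frame_len=256, hop=128):
    """Short-time power spectrum P_n = |X_n|^2, computed frame by frame
    with a Hamming window and the FFT."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum) ** 2                 # shape (n_frames, frame_len//2 + 1)

def log_power_and_cepstrum(power):
    """Log power spectrum and a real cepstrum obtained by inverse-transforming it."""
    log_power = np.log(power + 1e-10)            # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_power, axis=1)   # inverse transform of the log spectrum
    return log_power, cepstrum

# usage sketch on a synthetic 8 kHz signal
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
P = stft_power(x)
logP, cep = log_power_and_cepstrum(P)
```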
The figure below shows the spectrum, wide-band spectrogram, narrow-band spectrogram and cepstrum of the clean Chinese digit "0".
Fig. 2-10. Spectrum (a), cepstrum (b), wide-band spectrogram (c) and narrow-band spectrogram (d) of the clean Chinese digit "0".

4. Stationary additive white Gaussian noise appears as an additive bias that applies equally to all speech frequencies.
5. Additive pink noise, by contrast, has a different additive bias for different frequency components.
Following the above description, the generation of a speech signal corrupted by both convolutional and additive noise can be modelled as follows (Fig. 4-3: generation of a noise-corrupted speech signal). Let x(t) be the speech signal before noise is added; after the transmission channel, y(t) contains convolutional noise:

y(t) = x(t) * h(t)    (4-1)

and z(t) is the superposition of y(t) and the additive noise n(t):

z(t) = y(t) + n(t) = x(t) * h(t) + n(t)    (4-2)

The purpose of this work is to study feature extraction methods based on the physiological characteristics of the human auditory system; here only white Gaussian noise is considered as the noise in speech.
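A minimal sketch of the corruption model of Eqs. (4-1) and (4-2): clean speech is convolved with a channel impulse response and white Gaussian noise is added at a chosen signal-to-noise ratio. The channel taps and the SNR below are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def corrupt_speech(x, h, snr_db=10.0, rng=None):
    """Simulate z(t) = x(t) * h(t) + n(t) for a clean signal x, channel h,
    and white Gaussian noise scaled to the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.convolve(x, h, mode="full")[: len(x)]      # y(t) = x(t) * h(t)
    noise = rng.standard_normal(len(y))
    # scale the noise so that the chosen signal-to-noise ratio holds
    noise *= np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return y + noise                                  # z(t) = y(t) + n(t)

# usage sketch: a toy channel and a 1-second synthetic 8 kHz signal
fs = 8000
x = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)
h = np.array([1.0, 0.4, 0.1])                         # hypothetical channel impulse response
z = corrupt_speech(x, h, snr_db=10.0)
```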
Why is Speech Recognition Difficult?
Markus Forsberg
Department of Computing Science, Chalmers University of Technology
markus@cs.chalmers.se
February 24, 2003

Abstract
In this paper we will elaborate on some of the difficulties with Automatic Speech Recognition (ASR). We will argue that the main motivation for ASR is efficient interfaces to computers, and for the interfaces to be truly useful, they should provide coverage for a large group of users. We will discuss some of the issues that make the recognition of a single speaker difficult and then extend the discussion with problems that occur when we target more than a single user.

1 Introduction
The problem of automatically recognizing speech with the help of a computer is a difficult problem, and the reason for this is the complexity of the human language. We will in this article try to sketch some of the issues that make ASR difficult. We will not give any solutions, or present others' solutions; we will instead try to give you a panorama of some of the potential difficulties. We start by presenting the general setting (what we mean by ASR, why we are interested in performing ASR, and what speech actually is), then we will list some of the problems we may encounter, and finally end with a discussion.

2 What is speech recognition?
Speech recognition, or more commonly automatic speech recognition (ASR), is the process of interpreting human speech in a computer. A more technical definition is given by Jurafsky [2], who defines ASR as the building of systems for mapping acoustic signals to a string of words. He continues by defining automatic speech understanding (ASU) as extending the goal to producing some sort of understanding of the sentence. We will consider speaker-independent ASR, i.e. systems that have not been adapted to a single speaker, but in some sense to all speakers of a particular language.

3 Why do we want speech recognition?
The main goal of speech recognition is to get efficient ways for humans to communicate with computers. However, as Ben Shneiderman points out in [6], human-human communication is rarely a good model for designing efficient user interfaces. He also points out that verbal communication demands more mental resources than typing on a keyboard. So do we really want to communicate with computers via spoken language? Mats Blomberg [5] enumerates some of the applications of ASR, and some of the advantages that can be achieved. For example, he mentions personal computers that can be voice-controlled and used for dictation. This can be an important application for the physically disabled, lawyers, etc. Another application he mentions is environmental control, such as turning on the light, controlling the TV, etc. We feel that speech recognition is important, not because it is 'natural' for us to communicate via speech, but because in some cases it is the most efficient way to interface to a computer. Consider, for example, people who have jobs that occupy their hands; they would greatly benefit from an ASR-controlled environment.

4 What is speech?
When we as humans speak, we let air pass from our lungs through our mouth and nasal cavity, and this air stream is restricted and changed with our tongue and lips. This produces contractions and expansions of the air, an acoustic wave, a sound. The sounds we form, the vowels and consonants, are usually called phones. The phones are combined together into words. How a phone is realized in speech depends on its context, i.e. which phone precedes it and which phone directly follows it (the term triphone is used for a phone in
context). This phenomenon is studied within the area of phonology. However, speech is more than sequences of phones that form words and sentences. There are contents of speech that carry information, e.g. the prosody of the speech indicates grammatical structures, and the stress of a word signals its importance/topicality. This information is sometimes called the paralinguistic content of speech. The term speech signal within ASR refers to the analog electrical representation of the contractions and expansions of air. The analog signal is then converted into a digital representation by sampling the analog continuous signal. A high sampling rate in the A/D conversion gives a more accurate description of the analog signal, but also leads to a higher degree of space consumption.

5 Difficulties with ASR

5.1 Human comprehension of speech compared to ASR
Humans use more than their ears when listening; they use the knowledge they have about the speaker and the subject. Words are not arbitrarily sequenced together; there is a grammatical structure and redundancy that humans use to predict words not yet spoken. Furthermore, idioms and how we 'usually' say things make prediction even easier. In ASR we only have the speech signal. We can of course construct a model for the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge, the knowledge of the speaker and encyclopedic knowledge. We can, of course, not model world knowledge exhaustively, but an interesting question is how much we actually need in the ASR to measure up to human comprehension.

5.2 Body language
A human speaker does not only communicate with speech, but also with body signals: hand waving, eye movements, postures etc. This information is completely missed by ASR. This problem is addressed within the research area of multimodality, where studies are conducted on how to incorporate body language to improve human-computer communication.

5.3 Noise
Speech is uttered in an environment of sounds: a clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background etc. This is usually called noise, i.e., unwanted information in the speech signal. In ASR we have to identify and filter out these noises from the speech signal. Another kind of noise is the echo effect, which is the speech signal bounced off some surrounding object, arriving at the microphone a few milliseconds later. If the place in which the speech signal is produced is strongly echoing, this may give rise to a phenomenon called reverberation, which may last even as long as seconds.

5.4 Spoken language is not written language
Spoken language has for many years been viewed just as a less complicated version of written language, with the main difference that spoken language is grammatically less complex and that humans make more performance errors while speaking. However, it has become clear in the last few years that spoken language is essentially different from written language. In ASR, we have to identify and address these differences. Written communication is usually one-way communication, but speech is dialogue-oriented. In a dialogue, we give feedback to signal that we understand, we negotiate about the meaning of words, we adapt to the receiver, etc. Another important issue is disfluencies in speech; e.g. normal speech is filled with hesitations, repetitions, changes of subject in the middle of an utterance, slips of the tongue etc. A human listener does usually not even notice the
disfluencies, and this kind of behavior has to be modeled by the ASR system. Another issue that has to be identified is that the grammaticality of spoken language is quite different from written language at many different levels. In [4], some differences are pointed out:
- In spoken language, there is often a radical reduction of morphemes and words in pronunciation.
- The frequencies of words, collocations and grammatical constructions are highly different between spoken and written language.
- The grammar and semantics of spoken language are also significantly different from those of written language; 30-40% of all utterances consist of short utterances of 1, 2 or 3 words with no predicative verb.
This list can be made even longer. The important point is that we cannot view speech as the written language turned into a speech signal; it is fundamentally different, and must be treated as such.

5.5 Continuous speech
Speech has no natural pauses at the word boundaries; the pauses mainly appear on a syntactic level, such as after a phrase or a sentence. This introduces a difficult problem for speech recognition: how should we translate a waveform into a sequence of words? After a first stage of recognition into phones and phone categories, we have to group them into words. Even if we disregard word boundary ambiguity (see section 5.9.2), this is still a difficult problem. One way to simplify this process is to give clear pauses between the words. This works for short command-like communication, but as the possible length of utterances increases, clear pauses get cumbersome and inefficient.

5.6 Channel variability
One aspect of variability is the context where the acoustic wave is uttered. Here we have the problem of noise that changes over time, and different kinds of microphones and everything else that affects the content of the acoustic wave from the speaker to the discrete representation in a computer.
This phenomenon is called channel variability.

5.7 Speaker variability
All speakers have their special voices, due to their unique physical body and personality. The voice is not only different between speakers; there are also wide variations within one specific speaker. We will in the subsections below list some of these variations.

5.7.1 Realization
If the same words were pronounced over and over again, the resulting speech signal would never look exactly the same. Even if the speaker tries to sound exactly the same, there will always be some small differences in the acoustic wave produced. The realization of speech changes over time.

5.7.2 Speaking style
All humans speak differently; it is a way of expressing their personality. Not only do they use a personal vocabulary, they have a unique way to pronounce and emphasize. The speaking style also varies in different situations: we do not speak in the same way in the bank as with our parents, or with our friends. Humans also communicate their emotions via speech. We speak differently when we are happy, sad, frustrated, stressed, disappointed, defensive etc. If we are sad, we may drop our voice and speak more slowly, and if we are frustrated we may speak with a more strained voice.

5.7.3 The sex of the speaker
Men and women have different voices, and the main reason for this is that women in general have a shorter vocal tract than men. The fundamental tone of women's voices is roughly two times higher than men's because of this difference.

5.7.4 Anatomy of the vocal tract
Every speaker has his/her unique physical attributes, and this affects his/her speech: the shape and length of the vocal cords, the formation of the cavities, the size of the lungs etc. These attributes change over time, e.g. depending on the health or the age of the speaker.

5.7.5 Speed of speech
We speak in different modes of speed at different times. If we are stressed, we tend to speak faster, and if we are tired, the speed tends to decrease. We also speak at different speeds when we talk about something known or something unknown.

5.7.6 Regional and social dialects
Dialects are group-related variation within a language. Janet Holmes [3] defines regional and social dialects as follows:
Regional dialect: Regional dialects involve features of pronunciation, vocabulary and grammar which differ according to the geographical area the speaker comes from.
Social dialect: Social dialects are distinguished by features of pronunciation, vocabulary and grammar according to the social group of the speaker.
In many cases, we may be forced to consider dialects as 'another language' in ASR, due to the large differences between two dialects.

5.8 Amount of data and search space
Communication with a computer via a microphone induces a large amount of speech data every second. This has to be matched to groups of phones (monophones/diphones/triphones), the sounds, the words and the sentences.
Groups of phones build up words, and words build up sentences. The number of possible sentences is enormous. The quality of the input, and thereby the amount of input data, can be regulated by the number of samples of the input signal, but the quality of the speech signal will, of course, decrease with a lower sampling rate, resulting in incorrect analysis. We can also minimize our lexicon, i.e. the set of words. This introduces another problem, called out-of-vocabulary, which means that the intended word is not in the lexicon. An ASR system has to handle out-of-vocabulary words in a robust way.

5.9 Ambiguity
Natural language has an inherent ambiguity, i.e. we cannot always decide which of a set of words is actually intended. This is, of course, a problem in every computer-related language application, but we will here discuss the kinds of ambiguity that typically arise within speech recognition. There are two ambiguities that are particular to ASR: homophones and word boundary ambiguity.

5.9.1 Homophones
The concept homophones refers to words that sound the same but have different orthography. They are two unrelated words that just happen to sound the same. In the table below, we give some examples of homophones:

one analysis          | alternative analysis
the tail of a dog     | the tale of the dog
the sail of a boat    | the sale of a boat

Figure 1: Examples of homophones

How can we distinguish between homophones? It is impossible on the word level in ASR; we need a larger context to decide which is intended. However, as is demonstrated in the example, even within a larger context it is not certain that we can choose the right word.

5.9.2 Word boundary ambiguity
When a sequence of groups of phones is turned into a sequence of words, we sometimes encounter word boundary ambiguity. Word boundary ambiguity occurs when there are multiple ways of grouping phones into words. An example, taken from [1], illustrates this difficulty:

It's not easy to wreck a nice beach.
It's not easy to recognize speech.
It's not easy to wreck an ice beach.

This example has been artificially constructed, but there are other examples that occur naturally in the world. This can be viewed as a specific case of handling continuous speech, where even humans can have problems with finding the word boundaries.

6 Discussion
In this paper, we have addressed some of the difficulties of speech recognition, but not all of them. But one thing is certain: ASR is a challenging task.
The most problematic issues are the large search space and the strong variability. We think that the problems are especially serious because of our low tolerance to errors in the speech recognition process. Think how long you would try to communicate verbally with a computer if it understood you wrongly a couple of times in a row. You would probably say something nasty, and start looking for the keyboard of the computer. So, there are many problems, but does this mean that it is too hard, that we actually should stop trying? Of course not: there have been significant improvements within ASR, and ASR will continue to improve. It seems quite unlikely that we will ever succeed in doing perfect ASR, but we will surely do well enough. One thing that should be investigated further is whether humans speak differently to computers. Maybe it isn't natural for a human to communicate in the same way with a computer as with a human. A human may strive to be unambiguous and speak in a hyper-correct style to get the computer to understand him/her. Under the assumption that the training data is also given in this hyper-correct style, this would simplify the ASR. However, if not, it may be the case that hyper-correct speech even makes the ASR harder. And if this is not the case, we may investigate how we as human speakers can adapt to the computer to increase the quality of the speech recognition. As pointed out before, our goal is not a 'natural' verbal communication; we want efficient user interfaces.

References
[1] B. Gold and N. Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, Inc., 2000.
[2] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, New Jersey 07458, 2000.
[3] J. Holmes. An Introduction to Sociolinguistics. Longman Group UK Limited, 1992.
[4] J. Allwood et al. Corpus-based research on spoken language. 2001.
[5] M. Blomberg et al. Automatisk igenkänning av tal. 1997.
[6] B. Shneiderman. The limits of speech recognition. Communications of the ACM, 43:63-65, 2000.