ASR-dependent techniques for speaker identification
Appendix (English original): Chinese Journal of Electronics, Vol.15, No.3, July 2006

A Speaker-Independent Continuous Speech Recognition System Using Biomimetic Pattern Recognition

WANG Shoujue and QIN Hong (Laboratory of Artificial Neural Networks, Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083, China)

Abstract: In speaker-independent speech recognition, the disadvantage of the most widely used technology (HMMs, or Hidden Markov models) is not only the need for many more training samples, but also the long training time required. This paper describes the use of Biomimetic pattern recognition (BPR) for recognizing Mandarin continuous speech in a speaker-independent manner. A speech database was developed for this study. The vocabulary of the database consists of 15 Chinese dish names; the length of each name is 4 Chinese words. Neural networks (NNs) based on the Multi-weight neuron (MWN) model are used to train on and recognize the speech sounds. The number of MWNs was investigated to achieve the optimal performance of the NN-based BPR. This system, which is based on BPR and can carry out real-time recognition, reaches a recognition rate of 98.14% for the first option and 99.81% for the first two options for persons from different provinces of China speaking standard Chinese. Experiments were also carried out to compare Continuous density hidden Markov models (CDHMM), Dynamic time warping (DTW) and BPR for speech recognition. The experimental results show that BPR outperforms CDHMM and DTW, especially in the case of training sets of finite size.

Key words: Biomimetic pattern recognition, Speech recognition, Hidden Markov models (HMMs), Dynamic time warping (DTW).

I. Introduction

The main goal of Automatic speech recognition (ASR) is to produce a system which will accurately recognize normal human speech from any speaker. Recognition systems may be classified as speaker-dependent or speaker-independent. Speaker dependence requires that the system be personally trained with the speech of the person who will operate it in order to achieve a high recognition rate. For applications in public facilities, on the other hand, the system must be capable of recognizing speech uttered by many different people, of different gender, age, accent, etc.; speaker independence therefore has many more applications, primarily in the general area of public facilities. The most widely used technology in speaker-independent speech recognition is Hidden Markov Models; its disadvantage is not only the need for many more training samples, but also the long training time required. Since Biomimetic pattern recognition (BPR) was first proposed by Wang Shoujue, it has already been applied to object recognition, face identification and face recognition, etc., and has achieved much better performance. With some adaptation, such modeling techniques can easily be used for speech recognition as well. In this paper, a real-time Mandarin speech recognition system based on BPR is proposed, which outperforms HMMs especially in the case of training sets of finite size. The system is a small-vocabulary, speaker-independent, continuous speech recognition system. The whole system is implemented on a PC under a Windows 98/2000/XP environment with the CASSANN-II neurocomputer. It supports a standard 16-bit sound card.
II. Introduction of Biomimetic Pattern Recognition and Multi-Weights Neuron Networks

1. Biomimetic pattern recognition

Traditional pattern recognition aims at the optimal classification of different classes of samples in the feature space. BPR, in contrast, intends to find the optimal coverage of the samples of the same type. It follows from the Principle of Homology-Continuity: if two samples belong to the same class, the difference between them must change gradually, so a gradually changing sequence must exist between the two samples. In BPR theory, the construction of the sample subspace of each type of sample depends only on the type itself. More specifically, the construction of the subspace of a certain type of sample depends on analyzing the relations between the trained samples of that type and utilizing methods of "coverage of objects with complicated geometrical forms in the multidimensional space".

2. Multi-weights neuron and multi-weights neuron networks

A multi-weights neuron can be described as follows:

Y = f[Φ(W_1, W_2, ..., W_m, X) - θ]

where W_1, W_2, ..., W_m are m weight vectors, X is the input vector, Φ is the neuron's computation function, θ is the threshold, and f is the activation function. According to dimension theory, in the feature space R^n with X ∈ R^n, the equation Φ(W_1, W_2, ..., W_m, X) = θ defines an (n-1)-dimensional hypersurface in the n-dimensional space, determined by the weights W_1, W_2, ..., W_m; it divides the n-dimensional space into two parts. If Φ(W_1, W_2, ..., W_m, X) = θ is a closed hypersurface, it encloses a finite subspace. According to the principle of BPR, the subspace of a certain type of sample is determined from that type of sample itself. If we can find a set of multi-weights neurons (a multi-weights neuron network) covering all the training samples, the subspace of the neural network represents the sample subspace. When an unknown sample falls in this subspace, it can be assigned to the same type as the training samples. Moreover, if a new type of sample is added, it is not necessary to retrain any of the previously trained types; the training of one type of sample has nothing to do with the others.

III. System Description

The speech recognition system is divided into two main blocks. The first is the signal pre-processing and speech feature extraction block; the other is the multi-weights neuron network, which performs the task of BPR.

1. Speech feature extraction

Mel-frequency cepstral coefficients (MFCC) are used as speech features. They are calculated as follows: A/D conversion; endpoint detection using short-time energy and zero-crossing rate (ZCR); pre-emphasis and Hamming windowing; fast Fourier transform; DCT transform. The number of features extracted for each frame is 16, and 32 frames are chosen for every utterance, so a 512-dimensional Mel-cepstral feature vector (16 × 32 numerical values) represents the pronunciation of every word.

2. Multi-weights neuron network architecture

As a new general-purpose theoretical model of pattern recognition, BPR is realized here by multi-weights neuron networks. To train a certain class of samples, a multi-weights neuron subnetwork is established. The subnetwork consists of one input layer, one multi-weights neuron hidden layer and one output layer. Such a subnetwork can be considered as a mapping F: R^512 → R,

F(X) = min(Y_1, Y_2, ..., Y_m)

where Y_i is the output of the i-th multi-weights neuron, there are m hidden multi-weights neurons, i = 1, 2, ..., m, and X ∈ R^512 is the input vector.
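As an illustration of the subnetwork just described, the following is a minimal sketch of evaluating F(X) = min(Y_1, ..., Y_m) and of the coverage-based decision. The specific computation function Φ used here (distance from the input to the polyline through three weight vectors) is an assumed stand-in; the paper's actual Φ follows its Ref.[4].

```python
import numpy as np

def _point_to_segment(x, a, b):
    """Distance from point x to the line segment a-b."""
    d = b - a
    denom = float(np.dot(d, d))
    t = 0.0 if denom == 0.0 else float(np.clip(np.dot(x - a, d) / denom, 0.0, 1.0))
    return float(np.linalg.norm(x - (a + t * d)))

def phi(w1, w2, w3, x):
    """One possible computation function Phi for a 3-weight neuron: the distance
    from x to the polyline w1-w2-w3 (a coverage body of 'sausage' shape).
    This choice is an assumption, not the paper's exact definition."""
    return min(_point_to_segment(x, w1, w2), _point_to_segment(x, w2, w3))

def subnetwork_output(x, neurons):
    """F(X) = min(Y_1, ..., Y_m) over the hidden multi-weights neurons of one class."""
    return min(phi(w1, w2, w3, x) for (w1, w2, w3) in neurons)

def classify(x, subnetworks, theta):
    """Assign x to the class whose subnetwork covers it (output below the
    threshold theta); return None when no class subspace covers x."""
    scores = {cls: subnetwork_output(x, ns) for cls, ns in subnetworks.items()}
    cls, val = min(scores.items(), key=lambda kv: kv[1])
    return cls if val < theta else None
```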
IV. Training for MWN Networks

1. Basics of MWN network training

Training one multi-weights neuron subnetwork requires calculating the weights of the multi-weights neuron layer. The multi-weights neuron and the training algorithm used are those of Ref.[4]. In this algorithm, if the number of training samples of each class is N, we can use N - 2 neurons. In this paper, N = 30.

Y_i = f[Φ(s_i, s_{i+1}, s_{i+2}, x)]

is a function with a multi-vector input and a single scalar output.

2. Optimization method

According to IV.1, if there are many training samples the number of neurons will be very large, which reduces the recognition speed. When learning several classes of samples, knowledge of the class membership of the training samples is available. We use this information in a supervised training algorithm to reduce the network scale. When training class A, we treat the remaining training samples of the other 14 classes as class B. So there are 30 training samples in set A: {a_1, a_2, ..., a_30} and 420 training samples in set B: {b_1, b_2, ..., b_420}. First select 3 samples from A, giving a neuron:

Y_1 = f[Φ(a_{k1}, a_{k2}, a_{k3}, x)].

Let Y^A_{1,i} = f[Φ(a_{k1}, a_{k2}, a_{k3}, a_i)], where i = 1, 2, ..., 30, and Y^B_{1,j} = f[Φ(a_{k1}, a_{k2}, a_{k3}, b_j)], where j = 1, 2, ..., 420; let V = min_j(Y^B_{1,j}). We specify a value r, 0 < r < 1. If Y^A_{1,i} < r·V, remove a_i from set A, thus obtaining a new set A^(1). We continue in this way until the set A^(k) is empty, A^(k) = ∅; then the training ends, and the hidden layer of the class-A subnetwork consists of the neurons constructed in this process.

V. Experiment Results

A speech database consisting of 15 Chinese dish names was developed for this study. The length of each name is 4 Chinese words; that is, each speech sample is a continuous string of 4 words, such as "yu xiang rou si", "gong bao ji ding", etc. It was organized into two sets: a training set and a test set. The speech signal is sampled at 16 kHz with 16-bit resolution.

Table 1. Experimental results at different values of r.

450 utterances constitute the training set used to train the multi-weights neuron networks. These 450 utterances belong to 10 speakers (5 males and 5 females) from different Chinese provinces; each speaker uttered each of the names 3 times. The test set had a total of 539 utterances from another 4 speakers who uttered the 15 names arbitrarily. The tests evaluating the recognition system were carried out for different r from 0.5 to 0.95 with a step of 0.05. The experimental results at the different values of r are shown in Table 1. Obviously, the network was able to achieve full recognition of the training set at any r. From the experiments it was found that r = 0.5 already achieved nearly the same recognition rate as the basic algorithm, while the number of MWNs used in the networks is much smaller than in the basic algorithm.

Table 2. Experimental results of the BPR basic algorithm.

Experiments were also carried out to compare Continuous density hidden Markov models (CDHMM), Dynamic time warping (DTW) and Biomimetic pattern recognition (BPR) for speech recognition, emphasizing the performance of each method across decreasing amounts of training samples as well as the required training time. The CDHMM system was implemented with 5 states per word; the Viterbi algorithm and Baum-Welch re-estimation are used for recognition and training, respectively. The reference templates for the DTW system are the training samples themselves.
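Returning to the training procedure of Section IV.2 above, the pruning loop can be written compactly. This is a hedged sketch, not the authors' implementation: it reuses a Φ with the three-weight-vector signature sketched earlier, and the way the defining triple of samples is chosen (simply the first three remaining samples) is an assumption the paper does not spell out.

```python
def train_class_subnetwork(A, B, r, phi):
    """Greedy construction of one class's hidden layer: build a neuron from
    three class-A samples, measure its smallest response V on the other
    classes' samples (set B), and drop every class-A sample whose response
    is below r*V, repeating until class A is exhausted."""
    remaining = list(A)
    neurons = []
    while len(remaining) >= 3:
        w1, w2, w3 = remaining[0], remaining[1], remaining[2]
        neurons.append((w1, w2, w3))
        v = min(phi(w1, w2, w3, b) for b in B)      # closest competing-class sample
        # keep only the class-A samples this neuron does not yet cover safely
        remaining = [a for a in remaining if phi(w1, w2, w3, a) >= r * v]
    # a final leftover of one or two samples would need a degenerate neuron;
    # that corner case is omitted in this sketch
    return neurons
```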
Both the CDHMM and the DTW technique are implemented using the programs in Ref.[11]. Table 2 gives the comparison of the experimental results of the BPR basic algorithm, Dynamic time warping (DTW) and the Hidden Markov model (HMM) method. The HMM system was based on Continuous density hidden Markov models (CDHMMs) and was implemented with 5 states per name.

VI. Conclusions and Acknowledgments

In this paper, a Mandarin continuous speech recognition system based on BPR is established. In addition, a training-sample selection method is used to reduce the network scale. As a new general-purpose theoretical model of pattern recognition, BPR can be used in speech recognition too, and the experimental results show that it achieves higher performance than HMMs and DTW.

References
[1] Wang Shoujue, "Biomimetic (Topological) pattern recognition - a new model of pattern recognition theory and its application", Acta Electronica Sinica (in Chinese), Vol.30, No.10, pp.1417-1420, 2002.
[2] Wang Shoujue, Chen Xu, "Biomimetic (Topological) pattern recognition - a new model of pattern recognition theory and its application", Proceedings of the International Joint Conference on Neural Networks, Vol.3, pp.2258-2262, July 20-24, 2003.
[3] Wang Shoujue, Zhao Xingtao, "Biomimetic pattern recognition theory and its applications", Chinese Journal of Electronics, Vol.13, No.3, pp.373-377, 2004.
[4] Xu Jian, Li Weijun et al., "Architecture research and hardware implementation on simplified neural computing system for face identification", Proceedings of the International Joint Conference on Neural Networks, Vol.2, pp.948-952, July 20-24, 2003.
[5] Wang Zhihai, Mo Huayi et al., "A method of biomimetic pattern recognition for face recognition", Proceedings of the International Joint Conference on Neural Networks, Vol.3, pp.2216-2221, July 20-24, 2003.
[6] Wang Shoujue, Wang Liyan et al., "A general purpose neuron processor with digital-analog processing", Chinese Journal of Electronics, Vol.3, No.4, pp.73-75, 1994.
[7] Wang Shoujue, Li Zhaozhou et al., "Discussion on the basic mathematical models of neurons in general purpose neuro-computers", Acta Electronica Sinica (in Chinese), Vol.29, No.5, pp.577-580, 2001.
[8] Wang Shoujue, Wang Bainan, "Analysis and theory of high-dimension space geometry of artificial neural networks", Acta Electronica Sinica (in Chinese), Vol.30, No.1, pp.1-4, 2001.
[9] Wang Shoujue, Xu Jian et al., "Multi-camera human-face personal identification system based on the biomimetic pattern recognition", Acta Electronica Sinica (in Chinese), Vol.31, No.1, pp.1-3, 2003.
[10] Ryszard Engelking, Dimension Theory, PWN - Polish Scientific Publishers, Warszawa, 1978.
[11] Qiang He, Ying He, Matlab Programming, Tsinghua University Press, 2002.
Modern Electronics Technique, 1 November 2023, Vol.46, No.21

0 Introduction

Speech Emotion Recognition (SER) is an important direction in the development of human-computer interaction. It has three major aspects: construction of speech emotion databases, extraction of speech emotion features, and classification models [1].
Many factors influence speech emotion recognition, and different languages in particular have a large influence on how emotion is expressed, which makes speech emotion feature extraction an important research direction.
The development of deep learning has made feature extraction easier, but only when the handcrafted features that best characterize speech emotion are used as input can a deep learning model extract the best deep features from them and achieve better results.
To improve the Tibetan speech emotion recognition rate, this paper proposes a speech emotion feature extraction method for Tibetan: a 312-dimensional Tibetan speech emotion feature set (TPEFS) is handcrafted from the linguistic characteristics of Tibetan itself, deep features are then extracted with a Long Short-Term Memory network (LSTM), and finally these features are classified.
The structure of the Tibetan speech emotion recognition system is shown in Fig.1.
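As an illustration of the pipeline just described (handcrafted TPEFS features fed to an LSTM for deep-feature extraction and classification), here is a minimal sketch. The sequence layout of the 312-dimensional features, the hidden size and the number of emotion classes are assumptions, since the excerpt does not specify them.

```python
import torch
import torch.nn as nn

class TpefsLstmClassifier(nn.Module):
    """Hypothetical sketch: LSTM deep-feature extractor over handcrafted
    emotion features, followed by a linear emotion classifier.
    Input size, hidden size and the number of emotion classes are assumptions."""
    def __init__(self, feat_dim=312, hidden=128, num_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_emotions)

    def forward(self, x):              # x: (batch, time, feat_dim)
        _, (h_n, _) = self.lstm(x)     # deep feature = last hidden state
        return self.head(h_n[-1])      # emotion logits

# usage sketch: 4 utterances, each as a short sequence of TPEFS blocks
model = TpefsLstmClassifier()
dummy = torch.randn(4, 10, 312)
logits = model(dummy)                  # shape (4, 6)
```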
Tibetan Speech Emotion Recognition Based on Multi-Feature Fusion

GU Zeyue (1), BIANBA Wangdui (1, 2), QI Jindong (1)
(1. School of Information Science and Technology, Tibet University, Lhasa 850000, China; 2. National Demonstration Center for Experimental Information Technology Education, Lhasa 850000, China)

Abstract: Tibetan speech emotion recognition is an application of speech emotion recognition to minority-language speech processing. Speech emotion recognition is an important research direction in human-computer interaction; extracting the features that best characterize speech emotion and building acoustic models with strong robustness and generalization are its key research topics.
On this basis, in order to build an efficient and targeted Tibetan speech emotion recognition model, this paper constructs a Tibetan speech emotion dataset (TBSEC001) and proposes a handcrafted speech emotion feature set suited to Tibetan (TPEFS), extracted manually on the basis of the commonalities and particularities of Tibetan with respect to other languages. The TPEFS feature set achieves good results with classic models such as the support vector machine (SVM), multilayer perceptron (MLP), convolutional neural network (CNN), and long short-term memory network (LSTM).
WHISPERY SPEECH RECOGNITION USING ADAPTED ARTICULATORY FEATURES

Szu-Chen Jou, Tanja Schultz, and Alex Waibel
Interactive Systems Laboratories, Carnegie Mellon University, Pittsburgh, PA
scjou, tanja, ahw@

ABSTRACT
This paper describes our research on adaptation methods applied to articulatory feature detection on soft whispery speech recorded with a throat microphone. Since the amount of adaptation data is small and the testing data is very different from the training data, a series of adaptation methods is necessary. The adaptation methods include: maximum likelihood linear regression, feature-space adaptation, and re-training with downsampling, a sigmoidal low-pass filter, and linear multivariate regression. Adapted articulatory feature detectors are used in parallel to standard senone-based HMM models in a stream architecture for decoding. With these adaptation methods, articulatory feature detection accuracy improves from 87.82% to 90.52%, with the corresponding F-measure improving from 0.504 to 0.617, while the final word error rate improves from 33.8% to 31.2%.

1. INTRODUCTION
Today's real-world applications are driven by ubiquitous mobile devices which lack keyboard functionality. These applications demand new spoken input methods that do not disturb the environment and preserve the privacy of the user. Verification systems for banking applications or private phone calls in a quiet environment are only a few examples. As a consequence, recent developments in the area of processing whispered speech or non-audible murmur draw a lot of attention. Automatic speech recognition (ASR) has been proven to be a successful interface for spoken input, but so far, microphones have been used that apply the principle of air transmission to carry the sound from the speaker's mouth to the input device. When transmitting soft whisper, those microphones tend to fail, causing the performance of ASR to deteriorate.

Contact microphones, on the other hand, pick up speech signals through skin vibrations rather than by air transmission. As a result, processing of whispered speech is possible. Research related to contact microphones includes using a stethoscopic microphone for non-audible murmur recognition [1] and speech detection and enhancement with a bone-conductive microphone [2].

In our previous work, we have demonstrated how to use a throat microphone, one of many kinds of contact microphones, for automatic soft whisper recognition [3]. Based on that, this paper discusses how we incorporate articulatory features (AFs) as an additional information source to improve recognition results. Articulatory features, e.g. voicing or tongue position, have shown great potential for robust speech recognition [4]. Since whispery speech …

… speakers from those of the BN data, and our sentences are different from the BN ones but in the same domain.

Table 1. Data for training, adaptation, and testing
Training: 66.48 hr; Adaptation: 712.8 s; Testing: 153.1 s

[Tables 2-3 residue, AF detection accuracy (%) / F-measure: 89.30/0.585, 89.04/0.579; Downsample 88.46/0.551, 89.56/0.592, 89.26/0.583; log Mel-spec 87.12/0.493, 88.95/0.573, 88.52/0.560; CMN-MFCC 87.53/0.513; FSA, FSA+G.FSA 90.27/0.610, 89.19/0.585]

… also make performance worse, in contrast to the improvements made for senone models [3]. Since sigmoidal low-pass filtering with … is the only adaptation method that improves performance, the following experiments are conducted in addition to it. We then apply additional FSA, group FSA, group MLLR, and iterative MLLR methods with … . As shown in Table 3, group FSA performs best, so further iterative MLLR is conducted in addition to group FSA. Compared to its effects on senone models, iterative MLLR saturates faster, in about 20 iterations, and peaks at 34 iterations
with a performance of 90.52%/0.617. Fig.2 shows a comparison of the F-measure of the individual AFs, including the baseline AFs tested on the BN eval98/F0 test set and on the throat-whisper test set, and the best adapted AFs on the throat-whisper test set. The AFs are listed in the order of F-score improvement from adaptation; e.g. the leftmost, AFFRICATE, has the largest improvement by adaptation. Performance degradation from BN to throat-whisper had been expected. However, some AFs such as AFFRICATE and GLOTTAL degrade drastically, as the acoustic variation of these features is among the largest. Since there is no vocal cord vibration in whispery speech, GLOTTAL would not be useful for such a task. For the same reason, vowel-related AFs, such as CLOSE and CENTRAL, suffer from the mismatch. Most AFs improve by adaptation; NASAL, for example, is one of the best AFs on BN data but degrades a lot on throat-whisper, as can be inferred from Fig.1. After adaptation, its F-measure doubles, but there is still a gap to the performance level on BN data.

5.2. Stream Decoding
In the stream architecture, we put together our best senone model and the best AF detectors. The first experiments combine the senone model with each single AF detector to see how well the AF detectors can help the senone model.

Table 4. Four-best single-AF WERs on different weight ratios (flattened values): baseline 33.8; ASPIRATED 31.4; RETROFLEX 31.7; CLOSE 32.6; ALVEOLAR 33.1; DENTAL 33.3; PALATAL 33.1

Table 4 shows the WERs of different combination weights and the four best single AF detectors. As shown in the table, the combination of 90% of the weight on the senone models and 10% on the AF detectors results in the best performance, which can be regarded as a global minimum with respect to the different weights. In other words, the single AFs can help only with a carefully selected weight.
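A minimal sketch of the stream-decoding score combination discussed above: the acoustic score is a weighted sum of the senone-model log-likelihood and the AF-detector log-likelihoods, here with the 90:10 weighting that performed best. The equal split of the AF weight across multiple detectors is an assumption; the paper tunes per-stream weights explicitly.

```python
import numpy as np

def combined_stream_score(senone_logprob, af_logprobs, senone_weight=0.9):
    """Weighted combination of one senone stream and several AF streams.
    senone_logprob: acoustic log-likelihood from the senone-based HMM;
    af_logprobs: log-likelihoods from the articulatory feature detectors."""
    af_logprobs = np.asarray(af_logprobs, dtype=float)
    af_weight = (1.0 - senone_weight) / max(len(af_logprobs), 1)
    return senone_weight * senone_logprob + af_weight * af_logprobs.sum()

# usage: one senone stream at weight 0.9 and one AF stream at weight 0.1,
# mirroring the best 90:10 setting reported above
score = combined_stream_score(-42.7, [-3.1], senone_weight=0.9)
```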
In the next experiments, we incrementally add from one up to ten AF detectors to the streams. We use simple rules to select the AF detectors. The AF selection criteria include one-best WER (WER), accuracy (acc), and F-measure (F). According to each criterion, AF selection starts in greedy fashion from the AF detector having the best performance, then picks the second best one, and so on. There is also a set of weighting rules for adding more AFs. The first weighting rule always assigns 0.05 to the weight of every AF (w5). The second rule distributes a total weight of 0.1 uniformly over the AFs (unif). The last one puts more weight on the better performing AFs, using a rank-based formula in which the weight depends on the total number of AF detectors used and on the performance rank (scaled).

Fig. 3. WERs versus the number of AF detectors used in the stream.

Fig.3 shows the WERs with AF selection using *-WER, which showed a better result than the other two criteria; this result is consistent with [5]. On the other hand, the fixed weight (w5-*) suffers from insufficient weight on the senone models as the number of AFs increases. With one exception, where the WER improves to 31.2% in scaled-F with ALVEOLAR and FRICATIVE, incorporating more than one AF does not improve the WER. We suspect the reason is that the mismatched training and testing data are quite different acoustically, while the adaptation data is not enough to reliably estimate the AFs. Therefore we cannot achieve the level of improvement reported in [5].

6. CONCLUSIONS
We have developed a series of adaptation methods applied to articulatory feature detection, which improve the performance of a standard senone-based HMM throat-whisper recognizer using a stream decoder. We have also shown that AF adaptation improves detection accuracy and F-measure. With a t-test value of 0.046, the best stream decoding performance (WER = 31.2%) is statistically significant; however, on such a small test set, some other, smaller improvements are not. We therefore plan to collect more data. Further work could apply discriminative model combination (DMC) to the stream architecture for better weights [12].

7. ACKNOWLEDGEMENTS
The authors wish to thank Dr. Yoshitaka Nakajima for the invitation to his lab, the chance to gain hands-on experience using the stethoscopic microphones developed at his lab, and his hospitality. Many thanks to Hua Yu for providing the BN baseline system, and to Florian Metze and Sebastian Stüker for the AF and stream scripts.
Thanks also go to the reviewers for their valuable comments.

8. REFERENCES
[1] Y. Nakajima, H. Kashioka, K. Shikano, and N. Campbell, "Non-audible murmur recognition input interface using stethoscopic microphone attached to the skin," in Proc. ICASSP, Hong Kong, 2003.
[2] Y. Zheng, Z. Liu, Z. Zhang, M. Sinclair, J. Droppo, L. Deng, A. Acero, and X. Huang, "Air- and bone-conductive integrated microphones for robust speech detection and enhancement," in Proc. ASRU, St. Thomas, U.S. Virgin Islands, Dec 2003.
[3] S.-C. Jou, T. Schultz, and A. Waibel, "Adaptation for soft whisper recognition using a throat microphone," in Proc. ICSLP, Jeju Island, Korea, Oct 2004.
[4] K. Kirchhoff, Robust Speech Recognition Using Articulatory Information, Ph.D. thesis, University of Bielefeld, Germany, July 1999.
[5] F. Metze and A. Waibel, "A flexible stream architecture for ASR using articulatory features," in Proc. ICSLP, Denver, CO, Sep 2002.
[6] "/english/prod01.htm".
[7] H. Yu and A. Waibel, "Streaming the front-end of a speech recognizer," in Proc. ICSLP, Beijing, China, 2000.
[8] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, pp. 171-185, 1995.
[9] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," Speech Communication, vol. 11, pp. 175-187, 1992.
[10] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, pp. 75-98, 1998.
[11] P. Heracleous, Y. Nakajima, A. Lee, H. Saruwatari, and K. Shikano, "Accurate hidden Markov models for non-audible murmur (NAM) recognition based on iterative supervised adaptation," in Proc. ASRU, St. Thomas, U.S. Virgin Islands, Dec 2003.
[12] S. Stüker, F. Metze, T. Schultz, and A. Waibel, "Integrating multilingual articulatory features into speech recognition," in Proc. Eurospeech, Geneva, Switzerland, Sep 2003.
… crossing rate are used to determine the endpoints of a speech segment.

Within a segment of recording, once speech begins its energy becomes relatively large, so a relatively high threshold is first set on the energy contour to determine a rough beginning and end of the speech, and then a lower threshold is used to determine the true starting point and end point. The zero-crossing rate is used to distinguish speech from silence; here a relatively low threshold Q is chosen, since the low-threshold zero-crossing rate of the background noise is clearly lower than that of speech. The window length is usually 10 ms to 15 ms, and the frame shift 5 ms to 10 ms.
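The double-threshold endpoint detection just described can be sketched in a few lines. This is a minimal illustration, not the book's code; the threshold ratios, frame sizes and ZCR threshold below are illustrative assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def short_time_energy(frames):
    return np.sum(frames.astype(float) ** 2, axis=1)

def zero_crossing_rate(frames):
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

def double_threshold_endpoints(x, fs, frame_ms=12, hop_ms=8,
                               high_ratio=0.25, low_ratio=0.05, zcr_thresh=0.1):
    """High energy threshold locates the rough speech region, the low energy
    threshold extends it outward, and a low ZCR threshold extends it further
    to catch unvoiced onsets and endings."""
    frames = frame_signal(x, int(fs * frame_ms / 1000), int(fs * hop_ms / 1000))
    e, z = short_time_energy(frames), zero_crossing_rate(frames)
    high, low = high_ratio * e.max(), low_ratio * e.max()
    idx = np.where(e > high)[0]
    if idx.size == 0:
        return None                                  # no speech detected
    start, end = idx[0], idx[-1]
    while start > 0 and (e[start - 1] > low or z[start - 1] > zcr_thresh):
        start -= 1
    while end < len(e) - 1 and (e[end + 1] > low or z[end + 1] > zcr_thresh):
        end += 1
    return start, end                                # frame indices of the speech segment
```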
The following shows the energy and zero-crossing rate analysis and the double-threshold method for the Chinese digit "0", sampled at 8 kHz with a duration of 1.815 s.
Fig. 2-8. Short-time energy in the double-threshold method.

2.3 Frequency-domain analysis of speech signals

Through time-domain analysis of a speech signal we can obtain its short-time energy and short-time average zero-crossing rate, and thereby carry out endpoint detection on the signal.
Theoretically, speech is a dynamic signal, so much richer information about speech resides in the frequency domain.
The short-time Fourier transform is a commonly used speech analysis method, based on the assumption that the speech signal is quasi-stationary over short intervals.
Let {x(n)} be the speech signal; its short-time Fourier transform is defined as

X_n(e^{jω}) = Σ_{m=-∞}^{∞} x(m) w(n-m) e^{-jωm}    (2-10)

where w(n) is the window sequence and x(m)w(n-m) is the windowed speech signal. This expression reflects the dynamic character of the speech spectrum. From the short-time Fourier transform, the short-time power spectrum can be obtained:

P_n(e^{jω}) = |X_n(e^{jω})|^2 = Σ_{k=-∞}^{∞} R_n(k) e^{-jωk}    (2-11)

R_n(k) = Σ_m x(m) · w(n-m) · x(m+k) · w(n-m-k)    (2-12)

The spectrogram is a grey-level representation of the power spectrum.
Besides the spectrum and the short-time power spectrum, there are also the log power spectrum and the cepstrum. Fig. 2-9 shows the relations among these short-time Fourier transform spectra.
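As a worked illustration of Eqs. (2-10) to (2-12) and of the relations just mentioned, the following sketch computes the short-time power spectrum, the log power spectrum (whose grey-level image is the spectrogram) and a cepstrum derived from it. The frame length, hop and Hamming window choice are assumptions, not values fixed by the text.

```python
import numpy as np

def stft_power(x, frame_len=256, hop=128):
    """Short-time power spectrum P_n = |X_n|^2, computed frame by frame
    with a Hamming window and the FFT."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * win for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum) ** 2                 # shape (n_frames, frame_len//2 + 1)

def log_power_and_cepstrum(power):
    """Log power spectrum and a real cepstrum obtained by inverse-transforming it."""
    log_power = np.log(power + 1e-10)            # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_power, axis=1)   # inverse transform of the log spectrum
    return log_power, cepstrum

# usage sketch on a synthetic 8 kHz signal
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
P = stft_power(x)
logP, cep = log_power_and_cepstrum(P)
```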
The figure below shows the spectrum, wide-band spectrogram, narrow-band spectrogram and cepstrum of the clean Chinese digit "0".
Fig. 2-10. Spectrum (a), cepstrum (b), wide-band spectrogram (c) and narrow-band spectrogram (d) of the clean Chinese digit "0".

4. Stationary additive white Gaussian noise appears as an additive bias that applies equally to all speech frequencies.
5. Additive pink noise, by contrast, has a different additive bias for different frequency components.
Following the above description, the generation of a speech signal corrupted by both convolutional and additive noise can be modelled as follows (Fig. 4-3: generation of a noise-corrupted speech signal). Let x(t) be the speech signal before noise is added; after the transmission channel, y(t) contains convolutional noise:

y(t) = x(t) * h(t)    (4-1)

and z(t) is the superposition of y(t) and the additive noise n(t):

z(t) = y(t) + n(t) = x(t) * h(t) + n(t)    (4-2)

The purpose of this work is to study feature extraction methods based on the physiological characteristics of the human auditory system; here only white Gaussian noise is considered as the noise in speech.
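A minimal sketch of the corruption model of Eqs. (4-1) and (4-2): clean speech is convolved with a channel impulse response and white Gaussian noise is added at a chosen signal-to-noise ratio. The channel taps and the SNR below are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def corrupt_speech(x, h, snr_db=10.0, rng=None):
    """Simulate z(t) = x(t) * h(t) + n(t) for a clean signal x, channel h,
    and white Gaussian noise scaled to the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.convolve(x, h, mode="full")[: len(x)]      # y(t) = x(t) * h(t)
    noise = rng.standard_normal(len(y))
    # scale the noise so that the chosen signal-to-noise ratio holds
    noise *= np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return y + noise                                  # z(t) = y(t) + n(t)

# usage sketch: a toy channel and a 1-second synthetic 8 kHz signal
fs = 8000
x = np.sin(2 * np.pi * 300 * np.arange(fs) / fs)
h = np.array([1.0, 0.4, 0.1])                         # hypothetical channel impulse response
z = corrupt_speech(x, h, snr_db=10.0)
```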
Why is Speech Recognition Difficult?
Markus Forsberg
Department of Computing Science, Chalmers University of Technology
markus@cs.chalmers.se
February 24, 2003

Abstract
In this paper we will elaborate on some of the difficulties with Automatic Speech Recognition (ASR). We will argue that the main motivation for ASR is efficient interfaces to computers, and for the interfaces to be truly useful, they should provide coverage for a large group of users. We will discuss some of the issues that make the recognition of a single speaker difficult and then extend the discussion with problems that occur when we target more than a single user.

1 Introduction
The problem of automatically recognizing speech with the help of a computer is a difficult problem, and the reason for this is the complexity of the human language. We will in this article try to sketch some of the issues that make ASR difficult. We will not give any solutions, or present others' solutions; we will instead try to give you a panorama of some of the potential difficulties. We start by presenting the general setting (what we mean by ASR, why we are interested in performing ASR, and what speech actually is), then we will list some of the problems we may encounter, and finally end with a discussion.

2 What is speech recognition?
Speech recognition, or more commonly automatic speech recognition (ASR), is the process of interpreting human speech in a computer. A more technical definition is given by Jurafsky [2], who defines ASR as the building of systems for mapping acoustic signals to a string of words. He continues by defining automatic speech understanding (ASU) as extending the goal to producing some sort of understanding of the sentence. We will consider speaker-independent ASR, i.e. systems that have not been adapted to a single speaker, but in some sense to all speakers of a particular language.

3 Why do we want speech recognition?
The main goal of speech recognition is to get efficient ways for humans to communicate with computers. However, as Ben Shneiderman points out in [6], human-human communication is rarely a good model for designing efficient user interfaces. He also points out that verbal communication demands more mental resources than typing on a keyboard. So do we really want to communicate with computers via spoken language? Mats Blomberg [5] enumerates some of the applications of ASR, and some of the advantages that can be achieved. For example, he mentions personal computers that can be voice-controlled and used for dictation. This can be an important application for the physically disabled, lawyers, etc. Another application he mentions is environmental control, such as turning on the light, controlling the TV, etc. We feel that speech recognition is important, not because it is 'natural' for us to communicate via speech, but because in some cases it is the most efficient way to interface to a computer. Consider, for example, people who have jobs that occupy their hands; they would greatly benefit from an ASR-controlled environment.

4 What is speech?
When we as humans speak, we let air pass from our lungs through our mouth and nasal cavity, and this air stream is restricted and changed with our tongue and lips. This produces contractions and expansions of the air, an acoustic wave, a sound. The sounds we form, the vowels and consonants, are usually called phones. The phones are combined together into words. How a phone is realized in speech depends on its context, i.e. which phone precedes it and which phone directly follows it (the term triphone is used for a phone in
context). This phenomenon is studied within the area of phonology. However, speech is more than sequences of phones that form words and sentences. There are contents of speech that carry information, e.g. the prosody of the speech indicates grammatical structures, and the stress of a word signals its importance/topicality. This information is sometimes called the paralinguistic content of speech. The term speech signal within ASR refers to the analog electrical representation of the contractions and expansions of air. The analog signal is then converted into a digital representation by sampling the analog continuous signal. A high sampling rate in the A/D conversion gives a more accurate description of the analog signal, but also leads to a higher degree of space consumption.

5 Difficulties with ASR

5.1 Human comprehension of speech compared to ASR
Humans use more than their ears when listening; they use the knowledge they have about the speaker and the subject. Words are not arbitrarily sequenced together; there is a grammatical structure and redundancy that humans use to predict words not yet spoken. Furthermore, idioms and how we 'usually' say things make prediction even easier. In ASR we only have the speech signal. We can of course construct a model for the grammatical structure and use some kind of statistical model to improve prediction, but there is still the problem of how to model world knowledge, the knowledge of the speaker and encyclopedic knowledge. We can, of course, not model world knowledge exhaustively, but an interesting question is how much we actually need in the ASR to measure up to human comprehension.

5.2 Body language
A human speaker does not only communicate with speech, but also with body signals: hand waving, eye movements, postures etc. This information is completely missed by ASR. This problem is addressed within the research area of multimodality, where studies are conducted on how to incorporate body language to improve human-computer communication.

5.3 Noise
Speech is uttered in an environment of sounds: a clock ticking, a computer humming, a radio playing somewhere down the corridor, another human speaker in the background etc. This is usually called noise, i.e., unwanted information in the speech signal. In ASR we have to identify and filter out these noises from the speech signal. Another kind of noise is the echo effect, which is the speech signal bounced off some surrounding object, arriving at the microphone a few milliseconds later. If the place in which the speech signal is produced is strongly echoing, this may give rise to a phenomenon called reverberation, which may last even as long as seconds.

5.4 Spoken language is not written language
Spoken language has for many years been viewed just as a less complicated version of written language, with the main difference that spoken language is grammatically less complex and that humans make more performance errors while speaking. However, it has become clear in the last few years that spoken language is essentially different from written language. In ASR, we have to identify and address these differences. Written communication is usually one-way communication, but speech is dialogue-oriented. In a dialogue, we give feedback to signal that we understand, we negotiate about the meaning of words, we adapt to the receiver, etc. Another important issue is disfluencies in speech; e.g. normal speech is filled with hesitations, repetitions, changes of subject in the middle of an utterance, slips of the tongue etc. A human listener does usually not even notice the
disfluencies, and this kind of behavior has to be modeled by the ASR system. Another issue that has to be identified is that the grammaticality of spoken language is quite different from written language at many different levels. In [4], some differences are pointed out:
- In spoken language, there is often a radical reduction of morphemes and words in pronunciation.
- The frequencies of words, collocations and grammatical constructions are highly different between spoken and written language.
- The grammar and semantics of spoken language are also significantly different from those of written language; 30-40% of all utterances consist of short utterances of 1, 2 or 3 words with no predicative verb.
This list can be made even longer. The important point is that we cannot view speech as the written language turned into a speech signal; it is fundamentally different, and must be treated as such.

5.5 Continuous speech
Speech has no natural pauses at the word boundaries; the pauses mainly appear on a syntactic level, such as after a phrase or a sentence. This introduces a difficult problem for speech recognition: how should we translate a waveform into a sequence of words? After a first stage of recognition into phones and phone categories, we have to group them into words. Even if we disregard word boundary ambiguity (see section 5.9.2), this is still a difficult problem. One way to simplify this process is to give clear pauses between the words. This works for short command-like communication, but as the possible length of utterances increases, clear pauses get cumbersome and inefficient.

5.6 Channel variability
One aspect of variability is the context where the acoustic wave is uttered. Here we have the problem of noise that changes over time, and different kinds of microphones and everything else that affects the content of the acoustic wave from the speaker to the discrete representation in a computer.
This phenomenon is called channel variability.

5.7 Speaker variability
All speakers have their special voices, due to their unique physical body and personality. The voice is not only different between speakers; there are also wide variations within one specific speaker. We will in the subsections below list some of these variations.

5.7.1 Realization
If the same words were pronounced over and over again, the resulting speech signal would never look exactly the same. Even if the speaker tries to sound exactly the same, there will always be some small differences in the acoustic wave produced. The realization of speech changes over time.

5.7.2 Speaking style
All humans speak differently; it is a way of expressing their personality. Not only do they use a personal vocabulary, they have a unique way to pronounce and emphasize. The speaking style also varies in different situations: we do not speak in the same way in the bank as with our parents, or with our friends. Humans also communicate their emotions via speech. We speak differently when we are happy, sad, frustrated, stressed, disappointed, defensive etc. If we are sad, we may drop our voice and speak more slowly, and if we are frustrated we may speak with a more strained voice.

5.7.3 The sex of the speaker
Men and women have different voices, and the main reason for this is that women in general have a shorter vocal tract than men. The fundamental tone of women's voices is roughly two times higher than men's because of this difference.

5.7.4 Anatomy of the vocal tract
Every speaker has his/her unique physical attributes, and this affects his/her speech: the shape and length of the vocal cords, the formation of the cavities, the size of the lungs etc. These attributes change over time, e.g. depending on the health or the age of the speaker.

5.7.5 Speed of speech
We speak in different modes of speed at different times. If we are stressed, we tend to speak faster, and if we are tired, the speed tends to decrease. We also speak at different speeds when we talk about something known or something unknown.

5.7.6 Regional and social dialects
Dialects are group-related variation within a language. Janet Holmes [3] defines regional and social dialects as follows:
Regional dialect: Regional dialects involve features of pronunciation, vocabulary and grammar which differ according to the geographical area the speaker comes from.
Social dialect: Social dialects are distinguished by features of pronunciation, vocabulary and grammar according to the social group of the speaker.
In many cases, we may be forced to consider dialects as 'another language' in ASR, due to the large differences between two dialects.

5.8 Amount of data and search space
Communication with a computer via a microphone induces a large amount of speech data every second. This has to be matched to groups of phones (monophones/diphones/triphones), the sounds, the words and the sentences.
Groups of phones build up words, and words build up sentences. The number of possible sentences is enormous. The quality of the input, and thereby the amount of input data, can be regulated by the number of samples of the input signal, but the quality of the speech signal will, of course, decrease with a lower sampling rate, resulting in incorrect analysis. We can also minimize our lexicon, i.e. the set of words. This introduces another problem, called out-of-vocabulary, which means that the intended word is not in the lexicon. An ASR system has to handle out-of-vocabulary words in a robust way.

5.9 Ambiguity
Natural language has an inherent ambiguity, i.e. we cannot always decide which of a set of words is actually intended. This is, of course, a problem in every computer-related language application, but we will here discuss the kinds of ambiguity that typically arise within speech recognition. There are two ambiguities that are particular to ASR: homophones and word boundary ambiguity.

5.9.1 Homophones
The concept homophones refers to words that sound the same but have different orthography. They are two unrelated words that just happen to sound the same. In the table below, we give some examples of homophones:

one analysis          | alternative analysis
the tail of a dog     | the tale of the dog
the sail of a boat    | the sale of a boat

Figure 1: Examples of homophones

How can we distinguish between homophones? It is impossible on the word level in ASR; we need a larger context to decide which is intended. However, as is demonstrated in the example, even within a larger context it is not certain that we can choose the right word.

5.9.2 Word boundary ambiguity
When a sequence of groups of phones is turned into a sequence of words, we sometimes encounter word boundary ambiguity. Word boundary ambiguity occurs when there are multiple ways of grouping phones into words. An example, taken from [1], illustrates this difficulty:

It's not easy to wreck a nice beach.
It's not easy to recognize speech.
It's not easy to wreck an ice beach.

This example has been artificially constructed, but there are other examples that occur naturally in the world. This can be viewed as a specific case of handling continuous speech, where even humans can have problems with finding the word boundaries.

6 Discussion
In this paper, we have addressed some of the difficulties of speech recognition, but not all of them. But one thing is certain: ASR is a challenging task.
The most problematic issues are the large search space and the strong variability. We think that the problems are especially serious because of our low tolerance to errors in the speech recognition process. Think how long you would try to communicate verbally with a computer if it understood you wrongly a couple of times in a row. You would probably say something nasty, and start looking for the keyboard of the computer. So, there are many problems, but does this mean that it is too hard, that we actually should stop trying? Of course not: there have been significant improvements within ASR, and ASR will continue to improve. It seems quite unlikely that we will ever succeed in doing perfect ASR, but we will surely do well enough. One thing that should be investigated further is whether humans speak differently to computers. Maybe it isn't natural for a human to communicate in the same way with a computer as with a human. A human may strive to be unambiguous and speak in a hyper-correct style to get the computer to understand him/her. Under the assumption that the training data is also given in this hyper-correct style, this would simplify the ASR. However, if not, it may be the case that hyper-correct speech even makes the ASR harder. And if this is not the case, we may investigate how we as human speakers can adapt to the computer to increase the quality of the speech recognition. As pointed out before, our goal is not a 'natural' verbal communication; we want efficient user interfaces.

References
[1] B. Gold and N. Morgan. Speech and Audio Signal Processing: Processing and Perception of Speech and Music. John Wiley & Sons, Inc., 2000.
[2] D. Jurafsky and J. H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, New Jersey 07458, 2000.
[3] J. Holmes. An Introduction to Sociolinguistics. Longman Group UK Limited, 1992.
[4] J. Allwood et al. Corpus-based research on spoken language. 2001.
[5] M. Blomberg et al. Automatisk igenkänning av tal. 1997.
[6] B. Shneiderman. The limits of speech recognition. Communications of the ACM, 43:63-65, 2000.