Fast Adaptation for Robust Speech Recognition in Reverberant Environments
Research on Multi-Channel Speech Enhancement Based on Deep Learning (Part One)

I. Introduction
With the rapid development of artificial intelligence, speech signal processing plays an increasingly important role in many fields. However, owing to environmental noise, channel distortion, interfering sound sources and other factors, speech signals captured in real environments often suffer from severe quality degradation. Multi-channel speech enhancement emerged to remedy this situation and to improve the accuracy and intelligibility of speech recognition. This paper focuses on multi-channel speech enhancement based on deep learning, aiming to use deep learning to improve the signal-to-noise ratio (SNR) and clarity of speech signals.

II. Overview of Multi-Channel Speech Enhancement
Multi-channel speech enhancement uses multiple sensors in the spatial and temporal domains to collect speech information arriving from different directions. With this technique, noise and interfering sound sources can be suppressed effectively, improving the SNR and clarity of the speech signal. Traditional multi-channel enhancement methods rely mainly on signal processing techniques such as filtering and beamforming. However, these methods often struggle with complex noise environments and dynamically changing sound sources.
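The beamforming approach mentioned above can be made concrete with a delay-and-sum beamformer: each microphone signal is time-aligned toward the target direction and the signals are averaged, so coherent speech adds up while diffuse noise partially cancels. The sketch below is a minimal, hypothetical example; the function name, array geometry, sign convention, and integer-sample delays are illustrative assumptions, not details from the text:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Delay-and-sum beamformer (far-field, integer-sample delays).

    signals:       (n_mics, n_samples) array of microphone signals
    mic_positions: (n_mics, 3) array of microphone coordinates in metres
    direction:     vector pointing from the array toward the speech source
    fs:            sample rate in Hz
    c:             speed of sound in m/s
    """
    n_mics, n_samples = signals.shape
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    # Relative arrival delay of each microphone, in samples
    # (sign convention assumed: mics farther along `direction` hear later).
    delays = mic_positions @ direction / c * fs
    delays -= delays.min()  # make all delays non-negative
    out = np.zeros(n_samples)
    for m in range(n_mics):
        d = int(round(delays[m]))
        # Advance each signal by its delay so the target direction aligns;
        # fractional delays would need interpolation, omitted for brevity.
        out[: n_samples - d] += signals[m, d:]
    return out / n_mics
```

With all microphones co-located the delays vanish and the beamformer reduces to a plain average, which is a convenient sanity check.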
III. Deep Learning in Multi-Channel Speech Enhancement
Deep learning offers a new solution for multi-channel speech enhancement. By building deep neural network models, effective features of the speech signal can be learned and extracted automatically, enabling more effective suppression of noise and interfering sources. In addition, deep learning can jointly process temporal and spatial information during the fusion and denoising of multi-channel signals, further improving the enhancement quality.

IV. A Deep-Learning-Based Multi-Channel Speech Enhancement Method
This paper proposes a multi-channel speech enhancement method based on deep learning. The method first collects speech information from different directions with multiple sensors, and then uses a deep neural network to extract features from the collected information and denoise it. Specifically, we adopt a combined convolutional neural network (CNN) and recurrent neural network (RNN) model to achieve joint processing in the temporal and spatial domains. During training, we use a large amount of real recordings and simulated noise data, so that the model can better adapt to diverse noise environments and dynamically changing sound sources.

V. Experiments and Analysis
To validate the performance of the proposed multi-channel speech enhancement method, we conducted extensive experiments. The results show that the method significantly improves the SNR and clarity of speech signals in a variety of noise environments. Compared with traditional multi-channel enhancement methods, the deep-learning-based approach is more accurate and more robust.
A Deep-Learning-Based English Speech Recognition Technique

I. Introduction
With technological progress and the growing demand for speech technology, speech recognition has gradually entered the public eye. It has broad application scenarios, such as smart speakers, in-car voice control, and voice assistants. English is widely used in these scenarios, so this article introduces an English speech recognition technique based on deep learning.

II. Overview of English Speech Recognition
English speech recognition analyzes English speech with a computer and converts it into the corresponding text. Its implementation comprises three stages: feature extraction, model training, and recognition.

(1) Feature extraction
A speech signal varies over time; it must be converted into a digital signal the computer can process, from which effective speech features are extracted. A commonly used method is Mel-frequency cepstral coefficient (MFCC) extraction based on the short-time Fourier transform.
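The MFCC extraction just described can be sketched end to end: frame the signal, window it, take the power spectrum, pool it with a triangular mel filterbank, take logarithms, and decorrelate with a DCT. All sizes below (frame length, hop, number of filters, number of coefficients) are common illustrative defaults, not values given in the text:

```python
import numpy as np

def mfcc(signal, fs, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Compute an (n_frames x n_ceps) MFCC matrix from a mono signal."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # 1. Frame and window the signal (short-time analysis).
    window = np.hamming(n_fft)
    n_frames = 1 + max(0, (len(signal) - n_fft) // hop)
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 3. Triangular mel filterbank between 0 Hz and fs/2.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for j in range(n_mels):
        left, centre, right = bins[j], bins[j + 1], bins[j + 2]
        for k in range(left, centre):
            fbank[j, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[j, k] = (right - k) / max(right - centre, 1)
    # 4. Log filterbank energies (small offset avoids log(0)).
    logmel = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II to decorrelate; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ basis.T
```

Production systems normally use a tuned library implementation; this sketch only mirrors the pipeline described in the text.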
(2) Model training
Model training means letting the computer learn a mapping from a known data set, so that new speech signals can be recognized. In English speech recognition, deep learning models are seeing increasingly wide use.

(3) Recognition
Once the model is trained, it can be used to recognize new speech signals. The process has two basic steps: preprocessing of the speech signal and model inference. In preprocessing, MFCC features are extracted from the incoming speech signal and normalized; in inference, the trained model processes the preprocessed features to produce the corresponding text output.
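The normalization step mentioned above is often realized as per-utterance cepstral mean and variance normalization (CMVN), which removes channel bias from the features before inference. A small sketch; the zero-mean/unit-variance convention is a common choice, not something the text specifies:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Normalize a (frames x dims) feature matrix to zero mean and
    unit variance per dimension, computed over the whole utterance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```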
III. Deep-Learning Models for English Speech Recognition
Common deep-learning models include recurrent neural networks (RNN), convolutional neural networks (CNN), and deep neural networks (DNN). All of them are widely applied in English speech recognition.

(1) RNN
An RNN is a neural network with memory: its forward pass combines the current input with the output of the previous time step. In speech signal processing, an RNN can use the previous step's output when making predictions from the current input. In particular, variants such as the long short-term memory network (LSTM) and the gated recurrent unit (GRU) effectively address the memory problem in speech recognition.
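To make the gating idea concrete, here is a single GRU time step in plain NumPy: an update gate z decides how much of the previous hidden state to keep, and a reset gate r controls how much of it feeds the candidate state. The weight names, shapes, and initialization are illustrative assumptions, not from the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step.

    x:      input vector at the current time step
    h_prev: hidden state from the previous time step
    params: dict with weights Wz, Wr, Wh acting on [h_prev; x]
            and bias vectors bz, br, bh
    """
    hx = np.concatenate([h_prev, x])
    z = sigmoid(params["Wz"] @ hx + params["bz"])   # update gate
    r = sigmoid(params["Wr"] @ hx + params["br"])   # reset gate
    hx_reset = np.concatenate([r * h_prev, x])
    h_cand = np.tanh(params["Wh"] @ hx_reset + params["bh"])  # candidate
    # Interpolate between the old state and the candidate state.
    return (1.0 - z) * h_cand + z * h_prev

def init_gru(n_in, n_hidden, seed=0):
    rng = np.random.RandomState(seed)
    d = n_hidden + n_in
    params = {k: rng.randn(n_hidden, d) * 0.1 for k in ("Wz", "Wr", "Wh")}
    params.update({k: np.zeros(n_hidden) for k in ("bz", "br", "bh")})
    return params
```

Because the new state is a convex combination of the previous state and a bounded candidate, the hidden activations stay in (-1, 1), which is part of what makes gated units easier to train over long sequences.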
(2) CNN
A CNN is a convolutional neural network that was originally used mainly in image processing, but it is also widely applied in speech signal processing. Through its convolution kernels, a CNN can extract local features of the speech signal, improving recognition accuracy.
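A toy example of what a convolution kernel does to a feature sequence: sliding a small kernel along the time axis yields a local-feature map, e.g. an edge-like kernel responds exactly where the signal changes. The kernel values here are illustrative:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid' 1-D convolution (strictly speaking cross-correlation,
    as in most neural-network libraries) of sequence x with a kernel."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])
```

Sliding the difference kernel [-1, 0, 1] over a step signal produces a response only around the step, illustrating local feature extraction.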
The Prospects of English Speech Recognition Technology in Natural Language Processing

Introduction
In recent years, the field of natural language processing (NLP) has witnessed significant advancements, thanks to the rapid development of English speech recognition technology. As a subfield of artificial intelligence, NLP focuses on enabling computers to understand, interpret, and generate human language. English speech recognition technology plays a crucial role in enhancing NLP applications by converting spoken English into written text. This article aims to explore the potential applications and future prospects of English speech recognition technology in NLP.

Improved Voice Assistants and Chatbots
Voice assistants, like Siri and Alexa, have become ubiquitous in our daily lives, performing various tasks such as setting reminders, answering questions, and providing recommendations. English speech recognition technology can enhance the accuracy and efficiency of these voice assistants by offering robust speech-to-text conversion. Users can communicate with their devices more naturally, enabling voice assistants to understand and respond accurately. Additionally, chatbots, which rely on NLP algorithms to engage in natural language conversations, can significantly benefit from improved speech recognition technology. The ability to understand spoken language opens up a whole new level of interactivity, making chatbots more user-friendly and effective.

Transcription and Voice Search
One of the primary applications of English speech recognition technology is in transcription services. Transcribing interviews, meetings, or lectures can be a time-consuming task, but with the help of speech recognition, the process can be automated. English speech recognition technology accurately converts spoken language into text, saving valuable time and resources for professionals and researchers alike. Furthermore, voice search functionality is a growing trend in internet searches. Enhanced speech recognition technology ensures more accurate and relevant search results, providing users with faster access to information. This technology can revolutionize the way we navigate the internet and retrieve information.

Language Learning and Accessibility
English speech recognition technology has immense potential in language learning and accessibility. Language learners can utilize real-time speech recognition to improve their pronunciation and enhance their communication skills. By providing instant feedback on errors, learners can correct their mistakes and progress more effectively. Additionally, individuals with disabilities that affect their ability to type or write can benefit from speech recognition technology. People with motor disabilities or conditions like dyslexia can express themselves more easily using spoken language, which is then converted into written text for communication purposes.

Enhanced Sentiment Analysis and Voice Analytics
Sentiment analysis, or opinion mining, is a valuable tool used to determine the sentiment expressed in text. English speech recognition technology can dramatically improve sentiment analysis by incorporating the tone and intonation of spoken language, adding a new dimension to the analysis. Voice analytics, which measures patterns and characteristics in speech, can also benefit from advanced speech recognition. It enables businesses to gather meaningful insights from customer calls and interactions, helping improve customer service and product development.

Virtual Assistants in Healthcare
The healthcare industry stands to benefit significantly from the integration of English speech recognition technology into virtual assistants or chatbot applications. Patients can engage with virtual assistants to schedule appointments, access healthcare information, and even receive personalized medical advice. Healthcare professionals can dictate patient notes and medical records, allowing for more efficient documentation and reducing administrative burden. This technology opens up new possibilities for telemedicine and remote patient monitoring, where accurate speech recognition plays a vital role in transforming spoken medical instructions into written text.

Challenges and Future Developments
While English speech recognition technology has shown remarkable progress, challenges still exist. Accents, dialects, and background noise can pose difficulties for accurate recognition. Ongoing research aims to address these challenges, developing algorithms that can adapt to various speaking styles and environmental conditions. Moreover, advancements in machine learning and deep neural networks hold promise for further improving speech recognition accuracy and efficiency.

In conclusion, English speech recognition technology holds immense potential in enhancing NLP applications. From improving voice assistants and chatbots to aiding transcription services, language learning, sentiment analysis, and healthcare, the prospects are vast. As researchers continue to advance the technology and overcome existing challenges, we can expect even more innovative applications and improved user experiences in the future. English speech recognition is set to revolutionize how we interact with technology and harness the power of spoken language.
Patent title: Noise-robust speech processing
Inventor: Neti, Chalapathy V.
Application number: EP96308906.5 (filed 1996-12-09)
Publication number: EP0781833A2 (published 1997-07-02)
Applicant: International Business Machines Corporation, Old Orchard Road, Armonk, N.Y. 10504, US
Agent: Ling, Christopher John
Abstract: A method for noise-robust speech processing with cochlea filters within a computer system is disclosed. This invention provides a method for producing feature vectors from a segment of speech that is more robust to variations in the environment due to additive noise. A first output is produced by convolving (50) a speech signal input with spatially dependent impulse responses that resemble cochlea filters. The temporal transient and the spatial transient of the first output are then enhanced by taking a time derivative (52) and a spatial derivative (54), respectively, of the first output to produce a second output. Next, all the negative values of the second output are replaced (56) with zeros. A feature vector is then obtained (58) from each frame of the second output by a multiple resolution extraction. The parameters for the cochlea filters are finally optimized by minimizing the difference between a feature vector generated from a relatively noise-free speech signal input and a feature vector generated from a noisy speech signal input.
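The abstract's processing chain can be sketched end to end: filter the signal with a bank of channel-dependent impulse responses, enhance temporal and spatial (across-channel) transients with derivatives, half-wave rectify, and pool each frame into a feature vector. The damped-sinusoid filters and mean-energy pooling below are illustrative stand-ins for the patent's cochlea filters and multiple-resolution extraction, which the abstract does not specify in detail:

```python
import numpy as np

def cochlea_features(signal, fs, n_channels=8, frame=400, hop=200):
    """Sketch of the patented pipeline: filterbank -> time derivative ->
    spatial derivative -> half-wave rectification -> per-frame pooling."""
    # 1. Convolve with a bank of damped-sinusoid filters (stand-in for
    #    the spatially dependent cochlea-like impulse responses).
    freqs = np.linspace(200.0, fs / 4.0, n_channels)
    t = np.arange(int(0.01 * fs)) / fs
    outputs = []
    for f in freqs:
        h = (t ** 2) * np.exp(-2 * np.pi * 0.2 * f * t) * np.cos(2 * np.pi * f * t)
        outputs.append(np.convolve(signal, h, mode="same"))
    y = np.stack(outputs)            # (channels, samples)
    # 2. Enhance transients: time derivative, then spatial derivative.
    y = np.diff(y, axis=1)           # temporal transient
    y = np.diff(y, axis=0)           # spatial (across-channel) transient
    # 3. Replace all negative values with zeros (half-wave rectification).
    y = np.maximum(y, 0.0)
    # 4. One feature vector per frame: mean per channel (a simple
    #    stand-in for the multiple-resolution extraction).
    n_frames = 1 + max(0, (y.shape[1] - frame) // hop)
    feats = np.stack([y[:, i * hop:i * hop + frame].mean(axis=1)
                      for i in range(n_frames)])
    return feats                     # (frames, n_channels - 1)
```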
Patent title: Fast speech recognition method for mandarin words
Inventor: Chung-Mou Pengwu
Application number: US08/685733 (filed 1996-07-24)
Publication number: US05764851A (published 1998-06-09)
Applicant: INDUSTRIAL TECHNOLOGY RESEARCH INSTITUTE
Agent: W. Wayne Liauh
Abstract: A method for fast speech recognition of Mandarin words is accomplished by obtaining a first database which is a vocabulary of N Mandarin phrases. The vocabulary is described by an acoustic model which is formed by concatenating together word models. Each of the so concatenated word models is a concatenation of an initial model and a final model, wherein the initial model may be a null element, and both the initial and final models are represented by a probability model. A second database which contains initial models is determined. A preliminary logarithmic probability is subsequently calculated. A subset of the vocabulary, comprising the acoustic models having the highest probability of occurrence, is established using the preliminary logarithmic probabilities. This facilitates recognizing Mandarin phrases, which are then output to a user.
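The pruning idea in this abstract, scoring only the initial models first and keeping the candidates with the highest preliminary log-probabilities before full phrase scoring, can be sketched as follows. The toy vocabulary, scores, and function names are hypothetical, not taken from the patent:

```python
import math

def prune_vocabulary(initial_logprobs, vocabulary, top_k):
    """First pass of a two-pass Mandarin recognizer: rank phrases by the
    preliminary log-probability of their initial models and keep only
    the top_k candidates for full (initial + final) scoring.

    initial_logprobs: dict mapping an initial (e.g. 'zh') to its log-prob
                      against the observed speech
    vocabulary:       dict mapping a phrase to its list of initials,
                      where None stands for the null initial model
    """
    unseen = math.log(1e-6)  # back-off score for unscored initials

    def phrase_score(initials):
        return sum(0.0 if i is None else initial_logprobs.get(i, unseen)
                   for i in initials)

    ranked = sorted(vocabulary,
                    key=lambda p: phrase_score(vocabulary[p]),
                    reverse=True)
    return ranked[:top_k]
```

Full decoding then runs only over the returned subset rather than all N phrases, which is what makes the method "fast".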
DUTCH HLT RESOURCES: FROM BLARK TO PRIORITY LISTS
H. Strik1, W. Daelemans2, D. Binnenpoorte3, J. Sturm1, F. de Vriend1, C. Cucchiarini1,4
1 Department of Language and Speech, University of Nijmegen, The Netherlands, {Strik, D.Binnenpoorte, Janienke.Sturm, F.deVriend, C.Cucchiarini}@let.kun.nl
2 Department of CNTS Language Technology, University of Antwerp, Belgium, Walter.Daelemans@uia.ua.ac.be
3 Speech Processing Expertise Centre (SPEX), Nijmegen, The Netherlands
4 Nederlandse Taalunie, The Hague, The Netherlands

ABSTRACT
In this paper we report on a project about Dutch Human Language Technologies (HLT) resources. In this project we first defined a so-called BLARK (Basic LAnguage Resources Kit). Subsequently, a survey was carried out to make an inventory and evaluation of existing Dutch HLT resources. Based on the information collected in the survey, a priority list was drawn up of materials that need to be developed to complete the Dutch BLARK. Although the current project only concerns the Dutch language, the method employed and some of the results are also relevant for other languages.

1. INTRODUCTION
With information and communication technology becoming more and more important, the need for HLT also increases. HLT enables people to use natural language in their communication with computers, and for many reasons it is desirable that this natural language be the user's mother tongue. In order for people to use their native language in these applications, a set of basic provisions (such as tools, corpora, and lexicons) is required. However, since the costs of developing HLT resources are high, it is important that all parties involved, both in industry and academia, co-operate so as to maximise the outcome of efforts in the field of HLT.
This particularly applies to languages that are commercially less interesting than English, such as Dutch. For this reason, the Dutch Language Union (Nederlandse Taalunie, abbreviated NTU), which is a Dutch/Flemish intergovernmental organisation responsible for strengthening the position of the Dutch language (for further details on the NTU, see [1]), launched an initiative, the Dutch HLT Platform. This platform aims at stimulating co-operation between industry and scientific institutes and at providing an infrastructure that will make it possible to develop, maintain and distribute HLT resources for Dutch.

The work to be carried out in this project was organised along four action lines, which are described in more detail in [2]. In the present paper, action lines B and C are further outlined. Action line A is about constructing a 'broking and linking' function, and the goal of action line D is to define a blueprint for management, maintenance and distribution.

The aims of action line B are to define a set of basic HLT resources for Dutch that should be available for both academia and industry, the so-called BLARK (Basic LAnguage Resources Kit), and to carry out a survey to determine what is needed to complete this BLARK and what costs are associated with the development of the materials needed. These efforts should result in a priority list with cost estimates, which can serve as a policy guideline. Action line C is aimed at drawing up a set of standards and criteria for the evaluation of the basic materials contained in the BLARK and for the assessment of project results. Obviously, the work done in action lines B and C is closely related, for determining whether materials are available cannot be done without a quality evaluation. For this reason, action lines B and C have been carried out in an integrated way.

The work in action lines B and C was carried out in three stages, which are described in more detail below:
1. defining the BLARK,
2. carrying out a field survey to make an inventory and evaluation of existing HLT resources, and
3. defining the priority list.

The project was co-ordinated by a steering committee consisting of Dutch and Flemish HLT experts.

2. DEFINING THE BLARK
The first step towards defining the BLARK was to reach consensus on the components and the instruments to be distinguished in the survey. A distinction was made between applications, modules, and data (see Table 1). 'Applications' refers to classes of applications that make use of HLT. 'Modules' are the basic software components that are essential for developing HLT applications, while 'data' refers to data sets and electronic descriptions that are used to build, improve, or evaluate modules.

In order to guarantee that the survey is complete, unbiased and uniform, a matrix was drawn up by the steering committee describing (1) which modules are required for which applications, (2) which data are required for which modules, and (3) what the relative importance is of the modules and data. This matrix (subdivided in language and speech technology) is depicted in Table 1, where "+" means important and "++" means very important.

This matrix serves as the basis for defining the BLARK. Table 1 shows for instance that monolingual lexicons and annotated corpora are required for the development of a wide range of modules; these should therefore be included in the BLARK.
Furthermore, semantic analysis, syntactic analysis, and text pre-processing (for language technology) and speech recognition, speech synthesis, and prosody prediction (for speech technology) serve a large number of applications and should therefore be part of the BLARK as well.

Based on the data in the matrix and the additional prerequisite that the technology with which to construct the modules be available, a BLARK is proposed consisting of the following components:

For language technology:
Modules:
• Robust modular text pre-processing (tokenisation and named entity recognition)
• Morphological analysis and morpho-syntactic disambiguation
• Syntactic analysis
• Semantic analysis
Data:
• Mono-lingual lexicon
• Annotated corpus of text (a treebank with syntactic, morphological, and semantic structures)
• Benchmarks for evaluation

For speech technology:
Modules:
• Automatic speech recognition (including tools for robust speech recognition, recognition of non-natives, adaptation, and prosody recognition)
• Speech synthesis (including tools for unit selection)
• Tools for calculating confidence measures
• Tools for identification (speaker identification as well as language and dialect identification)
• Tools for (semi-)automatic annotation of speech corpora
Data:
• Speech corpora for specific applications, such as Computer Assisted Language Learning (CALL), directory assistance, etc.
• Multi-modal speech corpora
• Multi-media speech corpora
• Multi-lingual speech corpora
• Benchmarks for evaluation

3. SURVEY: INVENTORY & EVALUATION
In the second stage, a survey was carried out to establish which of the components that make up the BLARK are already available, i.e. which modules and data can be bought or are freely obtainable, for example through open source. Besides being available, the components should also be (re-)usable. Note that only language-specific modules and data were considered in this survey. Obviously, components can only be considered usable if they are of sufficient quality.
Therefore, a formal evaluation of the quality of all modules and data is indispensable. Evaluation of the components can be carried out on two levels: a descriptive level and a content level. Evaluation on a content level would comprise validation of data and performance validation of modules, whereas evaluation on a descriptive level would mean checking the modules and data against a list of evaluation criteria. Since there was only a limited amount of time, it was decided that only the checklist approach would be feasible. A checklist was drawn up consisting of the following items:• Availability:• public domain, freeware, shareware, etc.• legal aspects, IPR• Programming code:• language: Fortran, Pascal, C, C++, etc.• makefile• stand-alone or part of a larger module?• Platform: Unix, Linux, Windows 95/98/NT, etc.• Documentation• Compatibility with standards: (S)API, SABLE• Compatibility with standard packages: MATLAB, Praat, etc.• Reusability / adaptability / extendibility:• to other tasks and applications• to other platforms• StandardsAs a first step in the inventory, the experts in the steering committee made an overview of the availability of components. Then the steering committee appointed four field researchers to carry out the survey. The field researchers then extended and completed this overview on the basis of information found on the internet and in the literature, and personal communication with experts.4. PRIORIT LISTSThe survey of Dutch and Flemish HLT resources resulted in an extensive overview of the present state of HLT for the Dutch language. We then combined the BLARK with the inventory of components that were available and of sufficient quality, and drew up priority lists of the components that need to be developed to complete the BLARK. 
The prioritisation proposed was based on the following requirements:
• the components should be relevant (either directly or indirectly) for a large number of applications,
• the components should currently be either unavailable, inaccessible, or of insufficient quality, and
• developing the components should be feasible in the short term.

At this point, we incorporated all information gathered in a report containing the BLARK, the availability figures together with a detailed inventory of available HLT resources for Dutch, priority lists of components that need to be developed, and a number of recommendations [3]. This report was given a provisional status, as feedback on this version from a lot of actors in the field was considered desirable, since reaching consensus on the analysis and recommendations for the Dutch and Flemish HLT field is one of the main objectives.

Therefore, we consulted the whole HLT field. Using the address list compiled in Action Line A of the Platform, a first version of the priority lists, the recommendations, and a link to a pre-final version of the inventory [3] were sent to all known actors in the Dutch HLT field: a total of about 2000 researchers, commercial developers and users of commercial systems. We asked all actors to comment on the report, the priority lists, and the recommendations. Relevant comments were incorporated in the report.

Simultaneously, the same group of people was invited to a workshop that was organised to discuss the BLARK, the priority list and the recommendations. Some of the actors that had sent their comments were asked to give a presentation to make their ideas publicly known. The presentations served as an onset for a concluding discussion between the audience and a panel consisting of five experts (all members of the steering committee).
A number of conclusions that could be drawn from the workshop are:
• Cooperation between universities, research institutes and companies should be stimulated.
• It should be clear for all components in the BLARK how they can be integrated with off-the-shelf software packages. Furthermore, documentation and information about performance should be readily available.
• Control and maintenance of all modules and data sets in the BLARK should be guaranteed.
• Feedback of users on the components (regarding quality and usefulness of the components) should be processed in a structured way.
• The question as to what open source / license policy should be used needs some further discussion.

On the basis of the feedback received from the Dutch HLT field, some adjustments were made to the first version of the report. The final priority lists are as follows:

For language technology:
1. Annotated corpus of written Dutch: a treebank with syntactic and morphological structures
2. Syntactic analysis: robust recognition of sentence structure in texts
3. Robust text pre-processing: tokenisation and named entity recognition
4. Semantic annotations for the treebank mentioned above
5. Translation equivalents
6. Benchmarks for evaluation

For speech technology:
1. Automatic speech recognition (including modules for non-native speech recognition, robust speech recognition, adaptation, and prosody recognition)
2. Speech corpora for specific applications (e.g. directory assistance, CALL)
3. Multi-media speech corpora (speech corpora that also contain information from other media, i.e. speech together with text, html, figures, movies, etc.)
4. Tools for (semi-)automatic transcription of speech data
5. Speech synthesis (including tools for unit selection)
6. Benchmarks for evaluation

From the inventory and the reactions from the field, it can be concluded that the current HLT infrastructure is scattered, incomplete, and not sufficiently accessible. Often the available modules and applications are poorly documented.
Moreover, there is a great need for objective and methodologically sound comparisons and benchmarking of the materials. The components that constitute the BLARK should be available at low cost or for free.

To overcome the problems in the development of HLT resources for Dutch, the following can be recommended:
• existing parts of the BLARK should be collected, documented and maintained by some sort of HLT agency,
• the BLARK should be completed by encouraging funding bodies to finance the development of the prioritised resources,
• the BLARK should be available to academia and the HLT industry under the conditions of some sort of open source / open license development,
• benchmarks, test corpora, and a methodology for objective comparison, evaluation, and validation of parts of the BLARK should be developed.

Furthermore, it can be concluded that there is a need for well-trained HLT researchers, as this was one of the issues discussed at the workshop. Finally, enough funding should be assigned to fundamental research.

The results of the survey will be disseminated to the HLT field. The priority lists and the recommendations will be made available to funding bodies and policy institutions by the NTU. A summary of the report, containing the priority lists, the recommendations, and the BLARK, will be translated into English to reach a broader public. More information can be found at [4, 5, 6].

5. ACKNOWLEDGEMENT
The following people participated in the steering committee (at various stages of the project): J. Beeken, G. Bouma, C. Cucchiarini, E. D'Halleweyn, W. Daelemans, E. Dewallef, A. Dirksen, A. Dijkstra, D. Heijlen, F. de Jong, J.P. Martens, A. Nijholt, H. Strik, L. Teunissen, D. van Compernolle, F. van Eynde, and R. Veldhuis. The four field researchers were: D. Binnenpoorte, J. Sturm, F. de Vriend, and M. Kempen. We would like to thank all of them, and all others who contributed to the work presented in this paper.
Furthermore, we would like to thank an anonymous reviewer for constructive remarks on a previous version of this paper.

6. REFERENCES
[1] Beeken, J., Dewallef, E., D'Halleweyn, E. (2000), A Platform for Dutch in Human Language Technologies. Proceedings of LREC2000, Athens, Greece.
[2] Cucchiarini, C., D'Halleweyn, E. and Teunissen, L. (2002), A Human Language Technologies Platform for the Dutch language: awareness, management, maintenance and distribution. Proceedings of LREC2002, Canary Islands, Spain.
[3] Daelemans, W., Strik, H. (Eds.) (2001), Het Nederlands in de taal- en spraaktechnologie: prioriteiten voor basisvoorzieningen (versie 1), 27 sept. 2001. See /tst/actieplan/batavo-v1.pdf or http://lands.let.kun.nl/TSpublic/strik/publications/a82-batavo-v1.pdf
[4] /tst/
[5] http://www.ntu.nl/_/werkt/technologie.html
[6] http://lands.let.kun.nl/TSpublic/strik/taalunie/

Table 1. Overview of the importance of data for modules, and modules for applications. [The matrix itself did not survive extraction. Its columns cover nine data types (monolingual lexicon, multilingual lexicon, thesauri, annotated corpora, unannotated corpora, speech corpora, multilingual corpora, multimodal corpora, multimedia corpora) and eight application classes (CALL, access control, speech input, speech output, dialog systems, document production, information access, translation). Its rows are the language technology modules (grapheme-phoneme conversion, token detection, sentence boundary detection, name recognition, spelling correction, lemmatising, morphological analysis, morphological synthesis, word sort disambiguation, parsers and grammars, shallow parsing, constituent recognition, semantic analysis, referent resolution, word meaning disambiguation, pragmatic analysis, text generation, language-dependent translation) and the speech technology modules (complete speech recognition, acoustic models, language models, pronunciation lexicon, robust speech recognition, non-native speech recognition, speaker adaptation, lexicon adaptation, prosody recognition, complete speech synthesis, allophone synthesis, di-phone synthesis, unit selection, prosody prediction for text-to-speech, automatic phonetic transcription, automatic phonetic segmentation, phoneme alignment, phoneme distance calculation, speaker identification, speaker verification, speaker tracking, language identification, dialect identification, confidence measures, utterance verification). Cells mark importance with "+" (important) and "++" (very important); the cell-level values are not recoverable.]
Received: October 12, 2020; revised: November 22, 2020. Funding: supported by a research project of Shanxi Institute of Technology (No. 2020004).
About the author: Liu Peng, male, M.S., lecturer and engineer; research interests: pattern recognition and machine learning.
1. Introduction
Traditional speech enhancement algorithms (such as the subspace method, spectral subtraction, and Wiener filtering) are unsupervised methods that mostly rely on the complex statistical properties of the speech and noise signals to achieve denoising. In the denoising process, however, they inevitably produce "musical noise," which distorts the speech [1].
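Spectral subtraction, one of the traditional methods just mentioned, estimates the noise magnitude spectrum from noise-only frames and subtracts it from every frame of the noisy spectrum; the flooring step below is the usual source of the residual "musical noise". A minimal single-channel sketch with illustrative parameters (frame size, floor factor, and the assumption that the leading frames are noise only):

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames, n_fft=256, hop=128, floor=0.01):
    """Basic magnitude spectral subtraction with overlap-add resynthesis.

    noisy:        noisy speech signal (1-D array)
    noise_frames: number of leading frames assumed to contain noise only
    """
    window = np.hanning(n_fft)
    n = 1 + (len(noisy) - n_fft) // hop
    frames = np.stack([noisy[i * hop:i * hop + n_fft] * window
                       for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)  # noise spectrum estimate
    # Subtract the noise magnitude; flooring lets isolated spectral peaks
    # survive, which is heard as "musical noise".
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), axis=1)
    out = np.zeros(len(noisy))
    for i in range(n):                            # overlap-add
        out[i * hop:i * hop + n_fft] += clean[i]
    return out
```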
Given the complex way in which noise affects clean speech, building a nonlinear mapping between noisy and clean speech signals with a neural network has become a hot topic in current speech enhancement research. Xugang Lu, Yu Tsao and colleagues built a deep denoising autoencoder (DDAE) by layer-wise pre-training a stacked autoencoder and then fine-tuning the whole network; they used it to denoise noisy speech and verified that increasing the depth of the denoising autoencoder helps improve the enhancement quality [2]. However, because a deep denoising autoencoder represents a statistical average over the noisy/clean speech pairs in the training set, training on a data set without a sufficient number of samples easily produces co-adaptations among neurons and hence overfitting. To address this, reference [3] proposed an ensemble model of DDAEs: the training data are clustered, a separate DDAE is trained for each cluster, and a combination function over the DDAEs is then fitted by regression on the training set. However, an ensemble model requires training and evaluating multiple DDAE models, which costs considerable running time and memory. Studies have shown that ensemble models usually can only

A Speech Enhancement Algorithm Based on a Deep Denoising Autoencoder
Liu Peng (Department of Information Engineering and Big Data Science, Shanxi Institute of Technology, Yangquan 045000)
Abstract: Based on the observation that different types of speech segments in noisy speech contribute differently to the overall intelligibility, this paper proposes a speech enhancement algorithm that trains deep denoising autoencoders (DDAE) separately per class of speech segment.
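A one-hidden-layer denoising autoencoder trained by gradient descent, as a toy version of the DDAE building block described above: the network maps a noisy feature vector to its clean counterpart by minimizing the squared reconstruction error; a real DDAE stacks several such layers with layer-wise pre-training and fine-tuning. The layer size, learning rate, and epoch count below are illustrative choices, not values from the paper:

```python
import numpy as np

def train_dae(noisy, clean, n_hidden=32, lr=0.05, epochs=300, seed=0):
    """Train a one-hidden-layer denoising autoencoder with plain
    batch gradient descent on (noisy, clean) feature pairs."""
    rng = np.random.RandomState(seed)
    n_in = noisy.shape[1]
    W1 = rng.randn(n_in, n_hidden) * 0.1   # encoder weights
    b1 = np.zeros(n_hidden)
    W2 = rng.randn(n_hidden, n_in) * 0.1   # decoder weights
    b2 = np.zeros(n_in)
    n = len(noisy)
    for _ in range(epochs):
        h = np.tanh(noisy @ W1 + b1)       # encoder
        y = h @ W2 + b2                    # linear decoder (reconstruction)
        err = y - clean                    # dE/dy for squared error
        # Back-propagate the squared reconstruction error.
        gW2 = h.T @ err / n
        gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1.0 - h ** 2) # tanh derivative
        gW1 = noisy.T @ dh / n
        gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    def denoise(x):
        return np.tanh(x @ W1 + b1) @ W2 + b2
    return denoise
```

Trained on synthetic (clean, clean + noise) pairs, the returned mapping should reconstruct features closer to the clean ones than the noisy inputs are.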
Fast Adaptation for Robust Speech Recognition in Reverberant Environments
L. Couvreur, S. Dupont, C. Ris, J.-M. Boite and C. Couvreur
Faculté Polytechnique de Mons, Belgium / Lernout & Hauspie Speech Products, Belgium
{lcouv, dupont, ris, boite}@tcts.fpms.ac.be, christophe.couvreur@lhs.be

Abstract
We present a fast method, i.e. requiring little data, for adapting a hybrid Hidden Markov Model / Multi Layer Perceptron speech recognizer to reverberant environments. Adaptation is performed by a linear transformation of the acoustic feature space. A dimensionality reduction technique similar to the eigenvoice approach is also investigated. A pool of adaptation transformations is estimated a priori for various reverberant environments. Then, the principal directions of the pool are extracted, the so-called eigenrooms. The adaptation transformation for every new reverberant environment is constrained to lie on the subspace spanned by the most significant eigenrooms. Consequently, the adaptation procedure involves estimating only the projection coefficients on the selected eigenrooms, which requires less data than direct estimation of the adaptation transformation. Supervised adaptation experiments for recognition of connected digit sequences (AURORA database) in reverberant environments are carried out. Standard adaptation demonstrates improvements in word error rate higher than 30% for typical reverberation levels. The eigenroom-based adaptation technique implemented so far allows at most a 50% reduction of adaptation data for the same improvement.

1. Introduction
In many real applications, automatic speech recognition (ASR) systems have to deal with noise and room reverberation. Since these systems, or more exactly their acoustic models, are commonly trained on clean speech material, i.e. noise-free and echo-free speech, they perform poorly during operation because of the mismatch between the training conditions and the operating conditions. In this work, we are primarily concerned by the mismatch due to room reverberation. Two approaches come out naturally for reducing this mismatch. One can suggest to train the acoustic models on reverberated speech. Such training material can be obtained by convolving clean speech with room impulse responses which are either measured in reverberant enclosures [1] or artificially generated [2]. Alternatively, one can suggest to recover (partially) echo-free acoustic features and keep on using the acoustic models trained on echo-free speech.

We propose here adaptation methods in the framework of connectionist speech recognition [3] to compensate for room reverberation by linear transformation of the acoustic features. The standard adaptation procedure consists in estimating the coefficients of the linear mapping from data recorded in the target operating reverberant environment. Recently, the eigenvoice concept has been introduced [4, 5] for reducing the dimensionality problem inherent to such adaptation procedures. The eigenvoice method increases the reliability and the efficiency of the adaptation procedure by limiting the amount of parameters that must be estimated. The method was originally developed for fast speaker adaptation of recognizers based on Hidden Markov Models / Gaussian Mixture Models (HMM/GMM) [6, 7]. In [8], the method was extended to hybrid Hidden Markov Models / Multi Layer Perceptron (HMM/MLP) recognizers. We generalize here the latter approach to room reverberation adaptation by introducing the eigenroom concept.

In the next section, we first review the standard technique for adaptation of a HMM/MLP recognizer by linear transformation of the acoustic features. We then describe how the eigenvoice concept can be generalized for adaptation to room reverberation, and we propose a fast version of the standard adaptation technique. In Section 3, results for recognition of connected digit sequences are reported. Conclusions are drawn in Section 4.

2. Adaptation Procedure
In this work, we use a Multi Layer Perceptron (MLP) as acoustic model for speech
recognition [3]. More precisely, a single-hidden-layer MLP is used. The MLP inputs are acoustic feature vectors computed for successive frames of speech along the utterance to be recognized. The MLP outputs are estimates of a posteriori phone probabilities. The resulting lattice of probabilities is then searched for the most likely word sequence given a lexicon of word phonetic transcriptions. Such an acoustic model is commonly trained on a large database consisting of a sequence of acoustic feature vectors and the corresponding sequence of phone labels. The training procedure aims at minimizing the squared error between the actual outputs and the expected ones (1 for the output of the desired phone and 0 otherwise). This supervised training of the MLP coefficients can be efficiently implemented via a gradient descent procedure using the popular back-propagation (BP) algorithm [9].

2.1. Standard Adaptation

Unfortunately, the performance of an MLP-based speech recognizer degrades severely when the training acoustic conditions differ from the operating acoustic conditions [2]. In order to recover satisfactory performance, the acoustic model has to be adapted. A usual technique for adapting an MLP consists in transforming the input acoustic feature vectors linearly [10]:

    x̂ = A x + b    (1)

where x, x̂ and (A, b) denote the current acoustic feature vector (possibly augmented with left- and right-context acoustic feature vectors), its compensated version, and the adaptation parameters, respectively. The transformed feature vector then serves as input to the unchanged existing MLP. Hence, the adaptation procedure consists in estimating the adaptation parameters (A, b).

Figure 1: Adaptation scheme for hands-free speech recognition in reverberant environments.

The linear transformation can be seen as
an extra linear input layer appended to the existing MLP. Initializing this layer with the identity matrix (A = I) and zero biases (b = 0), it can be estimated by resuming the supervised training of the augmented MLP on the available adaptation data, keeping all the other layers frozen [10]. In this work, we apply this procedure for adapting an existing speaker-independent MLP trained on echo-free speech to a reverberant environment (see Figure 1).

2.2. Fast Adaptation

As shown in Section 3, a significant amount of data is necessary to adapt the echo-free MLP efficiently and to obtain a room-dependent but still speaker-independent acoustic model. We propose to apply a method similar to the eigenvoice approach [5] in order to reduce the amount of adaptation data. Let us define θ as the D-dimensional adaptation vector gathering the adaptation parameters (A, b):

    θ = vec(A, b)    (2)

Assume that such a vector can be computed for R reverberant environments with different reverberation levels, that is, a set of adaptation vectors {θ_1, ..., θ_R} is computed a priori. Next, a Principal Component Analysis (PCA) [11] is performed on this set. The principal directions e_k, the so-called eigenrooms, are extracted by eigendecomposition of the covariance matrix

    C = (1/R) Σ_{r=1}^{R} (θ_r − θ̄)(θ_r − θ̄)^T    (3)

with θ̄ = (1/R) Σ_{r=1}^{R} θ_r the mean adaptation vector. The adaptation vector of a new reverberant environment is then constrained to lie in the subspace spanned by the first K eigenrooms, i.e. it is modeled as θ = θ̄ + Σ_{k=1}^{K} w_k e_k, so that only the projection coefficients w_1, ..., w_K have to be estimated. Denoting by (Ā, b̄) and (A_k, b_k) the matrix and biases gathered in θ̄ and in the k-th eigenroom e_k, respectively, the compensated feature vector becomes

    x̂ = Ā x + b̄ + Σ_{k=1}^{K} w_k (A_k x + b_k)    (6)

The coefficients are estimated by gradient descent on the adaptation data,

    w_k ← w_k − η ∂E/∂w_k    (7)

where η denotes the learning rate coefficient. The gradient term decomposes as

    ∂E/∂w_k = (∂E/∂x̂)^T (∂x̂/∂w_k)    (8)

where the first term is obtained by the BP algorithm and the second term is easily derived from equation (6), i.e. it is equal to the output of the k-th eigenroom transformation.

Figure 2: Eigenroom-based adaptation: the adaptation transformation is assumed to be modeled by only 3 parameters (D = 3) and constrained to lie on the first eigenroom (K = 1).

Table 1: Word error rate (WER) as the sum of substitution error rate (SUB), deletion error rate (DEL) and insertion error rate (INS) for the baseline speaker-independent echo-free MLP.

    SUB [%]   DEL [%]   INS [%]
    0.7       0.5       0.5

Table 2: Word error rate (WER [%]) for various reverberant environments (T60 [ms]) and for various amounts of adaptation data in the case of a full adaptation matrix.

    Adapt. data [frame]   T60 = 200   400    600    800    1000
    No adaptation             8.2     20.4   33.4   46.7   48.5
    25k                       4.9      7.4   12.0   20.0   22.1
    75k                       3.4      6.2    9.7   17.1   18.5
    125k                      3.4      5.7    8.8   15.8   17.1
    175k                      3.9      5.4    8.5   15.1   16.1

3. Experimental Results

The speech material used in this work comes from the clean part of the AURORA
database [12] and consists of English connected digit sequences. The corpus is divided into a training set of 8840 utterances and a test set of 1001 utterances, pronounced by 110 speakers and 104 other speakers, respectively.

First, we train an MLP with a 600-node hidden layer on the echo-free training set. The resulting model is assumed to be speaker-independent. For every speech frame, it estimates the a posteriori probabilities of a 33-phoneme set given the acoustic vectors of the current frame augmented with 7-frame left-context and 7-frame right-context acoustic vectors. Each acoustic vector is composed of 12 Mel-warped frequency cepstral coefficients (MFCC) and the energy. The performance of the resulting MLP for recognition of the echo-free test set is given in Table 1. Speech decoding is done by Viterbi search, with neither pruning nor grammar constraints.

Then, we try to adapt the echo-free MLP to various reverberant environments. Following the notation of Section 2, the compensated vector x̂ is obtained by linear transformation (see equation (1)) of the vector x formed by the current acoustic vector together with its context, i.e. 15 frames of 13 coefficients. As described in Section 2.2, no reverberated data are collected in the reverberant environment to which we want to adapt. Only the reverberation time T60 has to be known. Given T60, one can generate adaptation material by convolving the echo-free training set with artificial room impulse responses matching that reverberation time [2]. Once the echo-free MLP has been adapted, it is used to recognize a reverberated test set.

Table 3: Word error rate (WER [%]) for various reverberant environments (T60 [ms]) and for various amounts of adaptation data in the case of a block diagonal adaptation matrix.

    Adapt. data [frame]   T60 = 200   400    600    800    1000
    No adaptation             6.1     12.9   22.0   34.6   37.0
    50k                       5.5     11.3   18.1   28.6   32.5
    100k                      5.5     10.4   16.6   27.1   30.2
    150k                      5.2     10.2   16.1   26.7   28.8
    200k                       –        –      –      –      –

Figure 3: Block diagonal adaptation matrix.

In this work, the
reverberated test sets are obtained by acoustic room simulation (the image method [13]), which allows us to specify any room configuration and to control the reverberation time. Table 2 shows the WER for various T60 as a function of the number of adaptation frames. The first line corresponds to the performance of the echo-free system, i.e. with no adaptation. As expected, adapting the acoustic model significantly improves the performance of the speech recognizer.

Though the standard adaptation technique provides large WER improvements, especially with large amounts of adaptation data, the number of parameters defining the adaptation transformation is too large for using it in a fast adaptation framework. Indeed, the computation of the eigenrooms would be highly memory-demanding, computationally prohibitive and prone to round-off errors. First, we observe that the biases do not help the adaptation. Besides, we observe that high-value coefficients of the adaptation matrix are mostly located along the main diagonal. Hence, we decide to use a block diagonal adaptation matrix instead of a full matrix; that is, the elements which are off the main block diagonal are forced to zero (see Figure 3). The number of adaptation parameters is thereby reduced drastically. For the sake of comparison, Table 3 reports the WER for various T60 as a function of the number of adaptation frames. As expected, the adaptation procedure with a block diagonal matrix provides less WER improvement than with a full matrix.

Next, we test the eigenroom-based adaptation technique.
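The eigenroom machinery of Section 2.2 can be sketched in a few lines of code. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the pool of a-priori adaptation vectors is reduced by PCA (here only the leading principal direction, found by power iteration so that the sketch needs no linear-algebra library), and the adaptation vector of a new environment is then modeled as the mean vector plus a weighted sum of eigenrooms, so that only the few weights remain to be estimated from adaptation data. All function names are invented for the example.

```python
import random

def mean_vector(thetas):
    """Mean of a pool of adaptation vectors (one per known environment)."""
    n = len(thetas)
    return [sum(t[d] for t in thetas) / n for d in range(len(thetas[0]))]

def leading_eigenroom(thetas, iters=100):
    """Leading principal direction of the centered pool, via power
    iteration on the covariance matrix (applied without forming it)."""
    mu = mean_vector(thetas)
    centered = [[t[d] - mu[d] for d in range(len(mu))] for t in thetas]
    random.seed(0)                        # reproducible start vector
    v = [random.random() + 0.1 for _ in mu]
    for _ in range(iters):
        # w = C v  with  C = (1/R) sum_r c_r c_r^T
        w = [0.0] * len(mu)
        for c in centered:
            dot = sum(ci * vi for ci, vi in zip(c, v))
            for d in range(len(mu)):
                w[d] += dot * c[d] / len(centered)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]         # re-normalize each iteration
    return mu, v

def constrained_theta(mu, eigenrooms, weights):
    """Adaptation vector constrained to the eigenroom subspace:
    theta = mean + sum_k w_k * e_k; only the w_k are free parameters."""
    theta = list(mu)
    for w_k, e_k in zip(weights, eigenrooms):
        for d in range(len(theta)):
            theta[d] += w_k * e_k[d]
    return theta

# Toy pool: adaptation vectors that vary along one direction (1, 2, 0),
# mimicking a pool indexed by increasing reverberation time.
pool = [[1.0 * t, 2.0 * t, 0.0] for t in (-2, -1, 0, 1, 2)]
mu, e1 = leading_eigenroom(pool)
theta_new = constrained_theta(mu, [e1], [0.5])  # one weight, not 3 params
```

In the paper the pool entries are vectorized block diagonal matrices and several eigenrooms are kept; the single-direction power iteration above is a deliberate simplification of the eigendecomposition of the covariance matrix of the pool.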
As a first step, block diagonal adaptation matrices are generated for T60 varying from 100 ms to 1200 ms. For each T60, 200000 frames of adaptation data are obtained by using artificially generated room impulse responses (see Section 2.2). Then, the corresponding adaptation vectors are formed and the principal directions of the resulting vector set are extracted.

Figure 4: Scree plot (normalized eigenvalue λ_k versus eigenvalue rank k) for PCA applied to block diagonal matrices for adaptation to room reverberation (limited to the first 20 eigenvalues).

Figure 4 gives the scree plot of the resulting eigenvalues. It clearly shows that the first few principal directions account for most of the variability within the adaptation vector set. Finally, we apply the eigenroom-based adaptation approach with various numbers of eigenrooms. Figure 5 compares WER improvements relative to the performance of the unadapted MLP: (a) for a fixed reverberation time, with the number of adaptation frames varying from 25000 to 200000, and (b) for 50000 adaptation frames, with T60 varying from 200 ms to 1000 ms. The eigenroom-based adaptation procedure performs significantly better than the standard adaptation procedure, especially for low amounts of adaptation data.

4. Conclusion and Future Work

We have shown that HMM/MLP recognizers can be efficiently adapted to room reverberation by linear transformation of the input acoustic vectors. In addition, the eigenroom concept has been proposed. Similarly to the eigenvoice approach for fast speaker adaptation, the eigenroom-based approach requires less data for room adaptation than the standard approach. Unfortunately, the adaptation matrix has to be limited to a block diagonal matrix for computational reasons. Future work will focus on relaxing this constraint; for example, a structured full matrix such as a FIR matrix might be used. Furthermore, the promising results obtained for supervised adaptation have to be confirmed in unsupervised adaptation mode.

5. References

[1] D. Giuliani, M. Matassoni, M. Omologo and P. Svaizer, "Training of HMM with
Filtered Speech Material for Hands-free Recognition", Proc. ICASSP'99, vol. 1, pp. 449–452, Phoenix, USA, Mar. 1999.

[2] L. Couvreur, C. Couvreur and C. Ris, "A Corpus-Based Approach for Robust ASR in Reverberant Environments", Proc. ICSLP'2000, vol. 1, pp. 397–400, Beijing, China, Oct. 2000.

[3] H. Bourlard and N. Morgan, "Connectionist Speech Recognition – A Hybrid Approach", Kluwer Academic Publishers, 1994.

[4] P. Nguyen, C. Wellekens and J.-C. Junqua, "Maximum Likelihood Eigenspace and MLLR for Speech Recognition in Noisy Environments", Proc. EUROSPEECH'99, vol. 6, pp. 2519–2522, Budapest, Hungary, Sep. 1999.

Figure 5: WER improvements for eigenroom-based adapted MLPs with various numbers of eigenrooms, (a) as a function of the number of adaptation frames and (b) as a function of the reverberation time.

[5] R. Kuhn, J.-C. Junqua, P. Nguyen and N. Niedzielski, "Rapid Speaker Adaptation in Eigenvoice Space", IEEE Trans. on Speech and Audio Processing, vol. 8, no. 6, pp. 695–707, Nov. 2000.

[6] R. Kuhn, P. Nguyen, J.-C. Junqua, L. Goldwasser, N. Niedzielski, S. Fincke, K. Field and M. Contolini, "Eigenvoices for Speaker Adaptation", Proc. ICSLP'98, vol. 5, pp. 1771–1774, Sydney, Australia, Dec. 1998.

[7] P. Nguyen, "Fast Speaker Adaptation", Technical Report, Eurécom Institute, Jun. 1998.

[8] S. Dupont and L. Cheboub, "Fast Speaker Adaptation of Artificial Neural Networks for Automatic Speech Recognition", Proc. ICASSP'2000, vol. 3, pp. 1795–1798, Istanbul, Turkey, Jun. 2000.

[9] S. Haykin, "Neural Networks: A Comprehensive Foundation", Macmillan, 1994.

[10] J. Neto, C. Martins and L. Almeida, "Speaker-Adaptation in a Hybrid HMM-MLP Recognizer", Proc. ICASSP'96, vol. 6, pp. 3383–3386, Atlanta, USA, May 1996.

[11] K. Fukunaga, "Introduction to Statistical Pattern Recognition", Academic Press, 1990.

[12] AURORA database – http://www.elda.fr/aurora2.html.

[13] J. B. Allen and D. A. Berkley, "Image Method for Efficiently Simulating Small-Room Acoustics", J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943–950, Apr. 1979.