Speech enhancement based on temporal processing
Patent title: ENHANCEMENT OF NOISY SPEECH BASED ON STATISTICAL SPEECH AND NOISE MODELS
Inventor: Jensen, Jesper
Application no.: EP16175779.4 (filed 2016-06-22); publication no.: EP3118851A1 (published 2017-01-18)
Applicant: Oticon A/S, Kongebakken 9, 2765 Smørum, Denmark (DK); agent: William Demant
Abstract: The disclosure relates to a method and a system for enhancement of noisy speech. The system comprises an analysis filterbank (24) configured to subdivide the spectrum of the input signal into a plurality of frequency sub-bands and to provide time-frequency coefficients for a sequence of observable noisy signal samples in each of said sub-bands. The system further comprises an enhancement processing unit configured to receive the time-frequency coefficients and to provide enhanced time-frequency coefficients, and a synthesis filterbank configured to receive the enhanced time-frequency coefficients and to provide an enhanced time-domain signal. The system further comprises a covariance estimator, configured to estimate and provide the covariance matrix of the time-frequency coefficients for successive samples in each respective sub-band, and an optimal linear combination optimiser, configured to receive that covariance matrix together with the covariance matrix of the noise-free signal (derived from a statistical speech model) and the covariance matrix of the noise signal (derived from a statistical noise model), and to provide a linear combination of the noise-free and noise covariance matrices that explains the noisy observation as closely as required. The system also comprises storage for the statistical model(s) of speech and of noise.
The optimal linear combination optimiser passes this linear combination to the enhancement processing unit, thereby enabling it to determine the enhanced time-frequency coefficients from the time-frequency coefficients of each frequency sub-band.
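The covariance-fitting step described above can be sketched numerically. This is a minimal sketch, not the patented algorithm: the function name, the Frobenius-norm least-squares criterion, and the toy matrices are assumptions made only to illustrate "a linear combination of the speech and noise covariance matrices that explains the noisy observation".

```python
import numpy as np

def fit_covariance_combination(C_x, C_s, C_n):
    """Least-squares fit of scalars (a, b) >= 0 so that a*C_s + b*C_n
    approximates the observed noisy covariance C_x in the Frobenius norm
    (vectorize the matrices and solve a two-column least-squares problem)."""
    A = np.stack([C_s.ravel(), C_n.ravel()], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, C_x.ravel(), rcond=None)
    return np.clip(coeffs, 0.0, None)  # keep the combination non-negative

# Toy check: a C_x built as an exact combination is recovered.
C_s = np.array([[2.0, 0.5], [0.5, 1.0]])
C_n = np.eye(2)
C_x = 0.7 * C_s + 1.3 * C_n
a, b = fit_covariance_combination(C_x, C_s, C_n)  # recovers a ≈ 0.7, b ≈ 1.3
```

Since the fit is exact for the toy input, the least-squares residual is zero; in practice the sample covariance is noisy and the residual measures how well the two models explain the observation.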
Thesis for the degree of Doctor of Philosophy
Model Based Speech Enhancement and Coding
David Yuheng Zhao (赵羽珩)
Sound and Image Processing Laboratory, School of Electrical Engineering, KTH (Royal Institute of Technology), SE-100 44 Stockholm, Sweden, 2007
Copyright © 2007 David Y. Zhao except where otherwise stated. All rights reserved. ISBN 978-91-628-7157-4, TRITA-EE 2007:018, ISSN 1653-5146.

Abstract
In mobile speech communication, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the received speech quality, and increase the listening effort. This thesis focuses on countermeasures based on statistical signal processing techniques. The main body of the thesis consists of three research articles, targeting two specific problems: speech enhancement for noise reduction and flexible source coder design for unreliable networks.
Papers A and B consider speech enhancement for noise reduction. New schemes based on an extension to the auto-regressive (AR) hidden Markov model (HMM) for speech and noise are proposed. Stochastic models for speech and noise gains (excitation variance from an AR model) are integrated into the HMM framework in order to improve the modeling of energy variation. The extended model is referred to as a stochastic-gain hidden Markov model (SG-HMM). The speech gain describes the energy variations of the speech phones, typically due to differences in pronunciation and/or different vocalizations of individual speakers. The noise gain improves the tracking of the time-varying energy of non-stationary noise, e.g., due to movement of the noise source. In Paper A, it is assumed that prior knowledge on the noise environment is available, so that a pre-trained noise model is used. In Paper B, the noise model is adaptive and the
model parameters are estimated on-line from the noisy observations using a recursive estimation algorithm. Based on the speech and noise models, a novel Bayesian estimator of the clean speech is developed in Paper A, and an estimator of the noise power spectral density (PSD) in Paper B. It is demonstrated that the proposed schemes achieve more accurate models of speech and noise than traditional techniques, and as part of a speech enhancement system provide improved speech quality, particularly for non-stationary noise sources.
In Paper C, a flexible entropy-constrained vector quantization scheme based on Gaussian mixture models (GMM), lattice quantization, and arithmetic coding is proposed. The method allows for changing the average rate in real-time, and facilitates adaptation to the currently available bandwidth of the network. A practical solution to the classical issue of indexing and entropy-coding the quantized code vectors is given. The proposed scheme has a computational complexity that is independent of rate, and quadratic with respect to vector dimension. Hence, the scheme can be applied to the quantization of source vectors in a high-dimensional space. The theoretical performance of the scheme is analyzed under a high-rate assumption. It is shown that, at high rate, the scheme approaches the theoretically optimal performance if the mixture components are located far apart. The practical performance of the scheme is confirmed through simulations on both synthetic and speech-derived source vectors.
Keywords: statistical model, Gaussian mixture model (GMM), hidden Markov model (HMM), noise reduction, speech enhancement, vector quantization.

List of Papers
The thesis is based on the following papers:
[A] D. Y. Zhao and W. B. Kleijn, "HMM-based gain-modeling for enhancement of speech in noise," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 3, pp. 882–892, Mar. 2007.
[B] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, "On-line noise estimation using stochastic-gain HMM for speech enhancement," IEEE
Trans. Audio, Speech and Language Processing, submitted, Apr. 2007.
[C] D. Y. Zhao, J. Samuelsson, and M. Nilsson, "Entropy-constrained vector quantization using Gaussian mixture models," IEEE Trans. Communications, submitted, Apr. 2007.

In addition to papers A–C, the following papers have also been produced in part by the author of the thesis:
[1] D. Y. Zhao, J. Samuelsson, and M. Nilsson, "GMM-based entropy-constrained vector quantization," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Apr. 2007, pp. 1097–1100.
[2] V. Grancharov, D. Y. Zhao, J. Lindblom, and W. B. Kleijn, "Low complexity, non-intrusive speech quality assessment," IEEE Trans. Audio, Speech and Language Processing, vol. 14, pp. 1948–1956, Nov. 2006.
[3] ——, "Non-intrusive speech quality assessment with low computational complexity," in Interspeech–ICSLP, September 2006.
[4] D. Y. Zhao and W. B. Kleijn, "HMM-based speech enhancement using explicit gain modeling," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, May 2006, pp. 161–164.
[5] D. Y. Zhao, W. B. Kleijn, A. Ypma, and B. de Vries, "Method and apparatus for improved estimation of non-stationary noise for speech enhancement," 2005, provisional patent application filed by GN ReSound.
[6] D. Y. Zhao and W. B. Kleijn, "On noise gain estimation for HMM-based speech enhancement," in Proc. Interspeech, Sep. 2005, pp. 2113–2116.
[7] ——, "Multiple-description vector quantization using translated lattices with local optimization," in Proceedings IEEE Global Telecommunications Conference, vol. 1, 2004, pp. 41–45.
[8] J. Plasberg, D. Y. Zhao, and W. B. Kleijn, "The sensitivity matrix for a spectro-temporal auditory model," in Proc. 12th European Signal Processing Conf. (EUSIPCO), 2004, pp. 1673–1676.

Acknowledgements
Approaching the end of my Ph.D. study, I would like to thank my supervisor, Prof. Bastiaan Kleijn, for introducing me to the world of academic research.
Your creativity, dedication, professionalism and hardworking nature have greatly inspired and influenced me. I also appreciate your patience with me, and that you spent many hours of your valuable time correcting my English mistakes. I am grateful to all my current and past colleagues at the Sound and Image Processing lab. Unfortunately, the list of names is too long to be given here. Thank you for creating a pleasant and stimulating work environment. I particularly enjoyed the cultural diversity and the feeling of an international "family" atmosphere. I am thankful to my co-authors Dr. Jonas Samuelsson, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Jan Plasberg, Dr. Jonas Lindblom, Dr. Alexander Ypma, and Dr. Bert de Vries. Thank you for our fruitful collaborations. Furthermore, I would like to express my gratitude to Prof. Bastiaan Kleijn, Dr. Mattias Nilsson, Dr. Volodya Grancharov, Tiago Falk and Anders Ekman for proofreading the introduction of the thesis. During my Ph.D. study, I was involved in projects financially supported by GN ReSound and the Foundation for Strategic Research. I would like to thank Dr. Alexander Ypma, Dr. Bert de Vries, Prof. Mikael Skoglund, and Prof. Gunnar Karlsson for your encouragement and friendly discussions during the projects. Thanks to my Chinese fellow students and friends at KTH, particularly Xi, Jing, Lei, and Jinfeng. I feel very lucky to get to know you. The fun we have had together is unforgettable. I would also like to express my deep gratitude to my family: my wife Lili, for your encouragement, understanding and patience; my beloved kids, Andrew and the one yet to be named, for the joy and hope you bring. I thank my parents, for your sacrifice and your endless support; my brother, for all the venture and joy we have together; and my mother-in-law, for taking care of […]. Last but not least, I am grateful to our host family, Harley and Eva, for your kind hearts and unconditional friendship.
David Zhao, Stockholm, May 2007

Contents
Abstract
List of Papers
Acknowledgements
Contents
Acronyms
I Introduction
 1 Statistical models
  1.1 Short-term models
  1.2 Long-term models
  1.3 Model-parameter estimation
 2 Speech enhancement for noise reduction
  2.1 Overview
  2.2 Short-term model based approach
  2.3 Long-term model based approach
 3 Flexible source coding
  3.1 Source coding
  3.2 High-rate theory
  3.3 High-rate quantizer design
 4 Summary of contributions
 References
II Included papers
 A HMM-based Gain-Modeling for Enhancement of Speech in Noise
  1 Introduction
  2 Signal model
   2.1 Speech model
   2.2 Noise model
   2.3 Noisy signal model
  3 Speech estimation
  4 Off-line parameter estimation
  5 On-line parameter estimation
  6 Experiments and results
   6.1 System implementation
   6.2 Experimental setup
   6.3 Evaluation of the modeling accuracy
   6.4 Objective evaluation of MMSE waveform estimators
   6.5 Objective evaluation of sample spectrum estimators
   6.6 Perceptual quality evaluation
  7 Conclusions
  Appendix I. EM based solution to (12)
  Appendix II. Approximation of f̄_s(ḡ_n | x_n)
  References
 B On-line Noise Estimation Using Stochastic-Gain HMM for Speech Enhancement
  1 Introduction
  2 Noise model parameter estimation
   2.1 Signal model
   2.2 On-line parameter estimation
   2.3 Safety-net state strategy
   2.4 Discussion
  3 Noise power spectrum estimation
  4 Experiments and results
   4.1 System implementation
   4.2 Experimental setup
   4.3 Evaluation of noise model accuracy
   4.4 Evaluation of noise PSD
estimate
   4.5 Evaluation of speech enhancement performance
   4.6 Evaluation of safety-net state strategy
  5 Conclusions
  Appendix I. Derivation of (13–14)
  References
 C Entropy-Constrained Vector Quantization Using Gaussian Mixture Models
  1 Introduction
  2 Preliminaries
   2.1 Problem formulation
   2.2 High-rate ECVQ
   2.3 Gaussian mixture modeling
  3 Quantizer design
  4 Theoretical analysis
   4.1 Distortion analysis
   4.2 Rate analysis
   4.3 Performance bounds
   4.4 Rate adjustment
  5 Implementation
   5.1 Arithmetic coding
   5.2 Encoding for the Z-lattice
   5.3 Decoding for the Z-lattice
   5.4 Encoding for arbitrary lattices
   5.5 Decoding for arbitrary lattices
   5.6 Lattice truncation
  6 The algorithm
  7 Complexity
  8 Experiments and results
   8.1 Artificial source
   8.2 Speech derived source
  9 Summary
  References

Acronyms
3GPP: The Third Generation Partnership Project
AMR: Adaptive Multi-Rate
AR: Auto-Regressive
AR-HMM: Auto-Regressive Hidden Markov Model
CCR: Comparison Category Rating
CDF: Cumulative Distribution Function
DFT: Discrete Fourier Transform
ECVQ: Entropy-Constrained Vector Quantizer
EM: Expectation Maximization
EVRC: Enhanced Variable Rate Codec
FFT: Fast Fourier Transform
GMM: Gaussian Mixture Model
HMM: Hidden Markov Model
IEEE: Institute of Electrical and Electronics Engineers
I.I.D.: Independent and Identically-Distributed
IP: Internet Protocol
KLT: Karhunen-Loève Transform
LL: Log-Likelihood
LP: Linear Prediction
LPC: Linear Prediction Coefficient
LSD: Log-Spectral Distortion
LSF: Line Spectral Frequency
MAP: Maximum A-Posteriori
ML: Maximum Likelihood
MMSE: Minimum Mean Square Error
MOS: Mean Opinion Score
MSE: Mean Square Error
MVU: Minimum Variance Unbiased
PCM: Pulse-Code Modulation
PDF: Probability Density Function
PESQ: Perceptual Evaluation of Speech Quality
PMF: Probability Mass Function
PSD: Power Spectral Density
RCVQ: Resolution-Constrained Vector Quantizer
REM: Recursive Expectation Maximization
SG-HMM: Stochastic-Gain Hidden Markov Model
SMV: Selectable Mode Vocoder
SNR: Signal-to-Noise Ratio
SSNR: Segmental Signal-to-Noise Ratio
SS: Spectral Subtraction
STSA: Short-Time Spectral Amplitude
VAD: Voice Activity Detector
VoIP: Voice over IP
VQ: Vector Quantizer
WF: Wiener Filtering
WGN: White Gaussian Noise

Part I: Introduction

Introduction
The advance of modern information technology is revolutionizing the way we communicate. Global deployment of mobile communication systems has enabled speech communication from and to nearly every corner of the world.
Speech communication systems based on the Internet protocol (IP), the so-called voice over IP (VoIP), are rapidly emerging. The new technologies allow for communication over the wide range of transport media and protocols that is associated with the increasingly heterogeneous communication network environment. In such a complex communication infrastructure, adverse conditions, such as noisy acoustic environments and unreliable network connections, may severely degrade the intelligibility and naturalness of the perceived speech quality, and increase the required listening effort for speech communication. The demand for better quality in such situations has given rise to new problems and challenges. In this thesis, two specific problems are considered: enhancement of speech in noisy environments and source coder design for unreliable network conditions.

[Figure 1: A one-direction example of speech communication in a noisy environment.]

Figure 1 illustrates an adverse speech communication scenario that is typical of mobile communications. The speech signal is recorded in an environment with a high level of ambient noise, such as on a street, in a restaurant, or on a train. In such a scenario, the recorded speech signal is contaminated by additive noise and has degraded quality and intelligibility compared to the clean speech. Speech quality is a subjective measure of how pleasant the signal sounds to the listeners. It can be related to the amount of effort required by listeners to understand the speech. Speech intelligibility can be measured objectively based on the percentage of sounds that are correctly understood by listeners. Naturally, both speech quality and intelligibility are important factors in the subjective rating of a speech signal by a listener. The level of background noise may be suppressed by a speech enhancement system. However, suppression of the background noise may lead to reduced speech intelligibility, due to possible speech distortion associated with
excessive noise suppression. Therefore, most practical speech enhancement systems for noise reduction are designed to improve the speech quality while preserving the speech intelligibility [6,60,61]. Besides speech communication, speech enhancement for noise reduction may be applied to a wide range of other speech processing applications in adverse situations, including speech recognition, hearing aids and speaker identification.
Another limiting factor for the quality of speech communication is the network condition. It is a particularly relevant issue for wireless communication, as the network capacity may vary depending on interference from the physical environment. The same issue occurs in heterogeneous networks with a large variety of different communication links. Traditional source coder design is often optimized for a particular network scenario, and may not perform well in another network. To allow communication with the best possible quality in the new network, it is necessary to adapt the coding strategy, e.g., the coding rate, depending on the network condition.
This thesis concerns statistical signal processing techniques that facilitate better solutions to the aforementioned problems. Formulated in a statistical framework, speech enhancement is an estimation problem. Source coding can be seen as a constrained optimization problem that optimizes the coder under certain rate constraints. Statistical modeling of speech (and noise) is a key element for solving both problems. The techniques proposed in this thesis are based on a hidden Markov model (HMM) or a Gaussian mixture model (GMM). Both are generic models capable of modeling complex data sources such as speech. In Papers A and B, we propose an extension to an HMM that allows for more accurate modeling of speech and noise. We show that the improved models also lead to improved speech enhancement performance. In Paper C, we propose a flexible source coder design for entropy-constrained vector quantization. The proposed coder
design is based on Gaussian mixture modeling of the source.
The introduction is organized as follows. In Section 1, an overview of statistical modeling for speech is given. We discuss short-term models for modeling the local statistics of a speech frame, and long-term models (such as an HMM or a GMM) that describe the long-term statistics spanning multiple frames. Parameter estimation using the expectation-maximization (EM) algorithm is discussed. In Section 2, we introduce speech enhancement techniques for noise reduction, with a focus on statistical model-based methods. Different approaches using short-term and long-term models are reviewed and discussed. Finally, in Section 3, an overview of theory and practice for flexible source coder design is given. We focus on the high-rate theory and coders derived from the theory.

1 Statistical models

[Figure 2: Histogram of time-domain speech waveform samples. Generated using 100 utterances from the TIMIT database, down-sampled to 8 kHz, normalized between −1 and 1, with 2048 bins.]

Statistical modeling of speech is a relevant topic for nearly all speech processing applications, including coding, enhancement, and recognition. A statistical model can be seen as a parameterized set of probability density functions (PDFs). Once the parameters are determined, the model provides a PDF of the speech signal that describes the stochastic behavior of the signal. A PDF is useful for designing optimal quantizers [64,91,128], estimators [35,37–39,90], and recognizers [11,105,108] in a statistical framework. One can make statistical models of speech in different representations.
The simplest one applies directly to the digital speech samples in the waveform domain (analog speech is out of the scope of this thesis). For instance, a histogram of speech waveform samples, as shown in Figure 2, provides the long-term averaged distribution of the amplitudes of speech samples. A similar figure is shown in [135]. The distribution is useful for designing an entropy code for a pulse-code modulation (PCM) system. However, in such a model, statistical dependencies among the speech samples are not explicitly modeled. It is more common to model the speech signal as a stochastic process that is locally stationary within 20–30 ms frames. The speech signal is then processed on a frame-by-frame basis. Statistical models in speech processing can be roughly divided into two classes, short-term and long-term signal models, depending on their scope of operation. A short-term model describes the statistics of the vector for a particular frame. As the statistics change over the frames, the parameters of a short-term model are to be determined for each frame. On the other hand, a model that describes the statistics over multiple signal frames is referred to as a long-term model.

1.1 Short-term models
It is essential to model the sample dependencies within a speech frame of 20–30 ms. This short-term dependency determines the formants of the speech frame, which are essential for recognizing the associated phoneme. Modeling of this dependency is therefore of interest to many speech applications.
Auto-regressive model
An efficient tool for modeling the short-term dependencies in the time domain is the auto-regressive (AR) model. Let x[i] denote the discrete-time speech signal; the AR model of x[i] is defined as

x[i] = -\sum_{j=1}^{p} \alpha[j] x[i-j] + e[i],    (1)

where α are the AR model parameters, also called the linear prediction coefficients, p is the model order, and e[i] is the excitation signal, modeled as white Gaussian noise (WGN). An AR process is then modeled as the output of filtering WGN through an all-pole AR model filter. For a signal frame of K samples, the vector x = [x[1], ..., x[K]] can be seen as a linear transformation of a WGN vector e = [e[1], ..., e[K]], by considering the filtering process as a convolution formulated as a matrix transform. Consequently, the PDF of x can be modeled as a multivariate Gaussian PDF with zero mean and a covariance matrix parameterized using the AR model parameters. The AR Gaussian density function is defined as

f(x) = \frac{1}{(2\pi\sigma^2)^{K/2}} \exp\left( -\tfrac{1}{2} x^T D^{-1} x \right),    (2)

with the covariance matrix D = \sigma^2 (A^H A)^{-1}, where A is a K×K lower-triangular Toeplitz matrix whose first column has the AR coefficients [\alpha[0], \alpha[1], \alpha[2], \ldots, \alpha[p]]^T, \alpha[0] = 1, as its first p+1 elements and zeros elsewhere, ^H denotes the Hermitian transpose, and \sigma^2 denotes the excitation variance. Due to the structure of A, the covariance matrix D has determinant \sigma^{2K}.

[Figure 3: Schematic diagram of a source-filter speech production model, with pitch-period and gain inputs to the excitation and the speech signal as output.]

Applying the AR model for speech processing is often motivated and supported by the physical properties of speech production [43]. A commonly used model of speech production is the source-filter model shown in Figure 3. The excitation signal, modeled using a pulse train for a voiced sound or Gaussian noise for an unvoiced sound, is filtered through a time-varying all-pole filter that models the vocal tract. Due to the absence of a pitch model, the AR Gaussian model is more accurate for unvoiced sounds.
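The density (2) can be evaluated without forming D explicitly, since x^T D^{-1} x = ||Ax||^2 / \sigma^2 and det D = \sigma^{2K}. A minimal numerical sketch of the AR Gaussian log-density (our own illustration, not thesis code; the function name is an assumption):

```python
import numpy as np

def ar_gaussian_logpdf(x, alpha, sigma2):
    """Log of the AR Gaussian density of Eq. (2).

    Builds the K x K lower-triangular Toeplitz whitening matrix A from the
    AR coefficients alpha (alpha[0] == 1), so that x^T D^{-1} x equals
    ||A x||^2 / sigma2 and det(D) = sigma2^K."""
    K = len(x)
    A = np.zeros((K, K))
    for j, a in enumerate(alpha):
        A += a * np.eye(K, k=-j)   # j-th subdiagonal carries alpha[j]
    e = A @ x                       # whitened signal: the excitation estimate
    return -0.5 * K * np.log(2.0 * np.pi * sigma2) - 0.5 * (e @ e) / sigma2
```

For alpha = [1] the model reduces to K independent N(0, sigma2) samples, which gives a convenient sanity check on the normalization constant.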
Frequency-domain models
The spectral properties of a speech signal can be analyzed in the frequency domain through the short-time discrete Fourier transform (DFT). On a frame-by-frame basis, each speech segment is transformed into the frequency domain by applying a sliding window to the samples and a DFT on the windowed signal.
A commonly used model is the complex Gaussian model, motivated by the asymptotic properties of the DFT coefficients. Assuming that the DFT length approaches infinity and that the span of correlation of the signal frame is short compared to the DFT length, the DFT coefficients may be considered as a weighted sum of weakly dependent random samples [51,94,99,100]. Using the central limit theorem for weakly dependent variables [12,130], the DFT coefficients across frequency can be considered as independent zero-mean Gaussian random variables. For k ∉ {1, K/2+1}, the PDF of the DFT coefficient of the k'th frequency bin X[k] is given by

f(X[k]) = \frac{1}{\pi \lambda^2[k]} \exp\left( -\frac{|X[k]|^2}{\lambda^2[k]} \right),    (3)

where \lambda^2[k] = E[|X[k]|^2] is the power spectrum. For k ∈ {1, K/2+1}, X[k] is real and has a Gaussian distribution with zero mean and variance E[|X[k]|^2].
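Under the independence assumption, the per-bin model (3) yields a simple log-likelihood over the complex bins, which is useful as a sanity check. A minimal sketch (the function name and interface are our own assumptions):

```python
import numpy as np

def complex_gaussian_loglik(X, lam2):
    """Log-likelihood of complex DFT coefficients X under the zero-mean
    complex Gaussian model of Eq. (3), with independent bins:
    log f = sum_k [ -log(pi * lam2[k]) - |X[k]|^2 / lam2[k] ]."""
    X = np.asarray(X)
    lam2 = np.asarray(lam2, dtype=float)
    return float(np.sum(-np.log(np.pi * lam2) - np.abs(X) ** 2 / lam2))
```

A single bin with X = 1 and lam2 = 1 gives -log(pi) - 1, directly from the formula.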
Due to the conjugate symmetry of the DFT for real signals, only half of the DFT coefficients need to be considered. One consequence of the model is that each frequency bin may be analyzed and processed independently. Although the independence assumption does not always hold for typical speech-processing DFT lengths, it is widely used, since it significantly simplifies the algorithm design and reduces the associated computational complexity. For instance, the complex Gaussian model has been successfully applied in speech enhancement [38,39,92]. The frequency-domain model of an AR process can be derived under the asymptotic assumption. Assuming that the frame length approaches infinity, A is well approximated by a circulant matrix (neglecting the frame boundary), and is approximately diagonalized by the discrete Fourier transform. Therefore, the frequency-domain AR model is of the form of (3), and the power spectrum \lambda^2[k] is given by

\lambda^2_{AR}[k] = \frac{\sigma^2}{\left| \sum_{n=0}^{p} \alpha[n] e^{-j 2\pi n k / K} \right|^2}.    (4)

It has been argued [89,103,118] that supergaussian density functions, such as the Laplace density and the two-sided Gamma density, fit speech DFT coefficients better. However, the experimental studies were based on the long-term averaged behavior of speech, and may not necessarily hold for short-time DFT coefficients. Speech enhancement methods based on supergaussian density functions have been proposed in [19,85,86,89,90].
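Equation (4) is a one-liner with the FFT, since the denominator is the squared magnitude of the zero-padded DFT of the coefficient vector. A hedged sketch (our own illustration; the function name and example coefficients are assumptions):

```python
import numpy as np

def ar_power_spectrum(alpha, sigma2, K):
    """Frequency-domain AR model of Eq. (4): power spectrum of an AR process
    with coefficients alpha (alpha[0] == 1) and excitation variance sigma2,
    on K DFT bins: sigma2 / |sum_n alpha[n] e^{-j 2 pi n k / K}|^2."""
    A = np.fft.fft(alpha, n=K)   # zero-pads alpha to length K
    return sigma2 / np.abs(A) ** 2

# First-order lowpass example: x[i] = 0.9 x[i-1] + e[i]  ->  alpha = [1, -0.9]
spec = ar_power_spectrum(np.array([1.0, -0.9]), sigma2=1.0, K=512)
```

For this first-order process the spectrum peaks at DC, where the denominator |1 - 0.9|^2 = 0.01 gives a value of sigma2 / 0.01.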
The comprehensive evaluations in [90] suggest that supergaussian methods achieve consistent but small improvements in segmental SNR relative to the Wiener filtering approach (which is based on the Gaussian model). However, in terms of perceptual quality, the Ephraim-Malah short-time spectral amplitude estimators [38,39] based on the Gaussian model were found to provide more pleasant residual noise in the processed speech [90].

1.2 Long-term models
By a long-term model, we mean a model that describes the long-term statistics spanning multiple signal frames of 20–30 ms. The short-term models in Section 1.1 apply to one signal frame only, and the model
The model consists of afinite number of states that are not observable, and the transition from one state to another is controlled by afirst-order Markov chain.The observation variables are statistically dependent of the underlying states,and may be considered as the states observed through a noisy channel.Due to the Markov assumption of the hidden states,the observations are considered to be statistically interdependent to each other. For a realization of state sequence,let s=[s1,...,s N],s n denote the stateof frame n,a sn−1s n denote the transition probability from state s n−1to s nwith a s0s1being the initial state probability,and let f sn(x n)denote theoutput probability of x n for a given state s n,the PDF of the observation sequence x1,...,x N of an HMM is given byf(x1,...,x N)= s∈S N n=1a s n−1s n f s n(x n),(5)where the summation is over the set of all possible state sequences S. The joint probability of a state sequence s and the observation sequence x1,...,x N consists of the product of the transition probabilities and the output probabilities.Finally,the PDF of the observation sequence is the sum of the joint probabilities over all possible state sequences.Applied to speech modeling,each sub-model of a state represents the short-term statistics of a ing the frequency domain Gaussian model(3),each state consists of a power spectral density representing a particular class of speech sounds(with similar spectral properties).The Markov chain of states models the temporal evolution of short-term speech power spectra.。
A CASA-based single-channel speech enhancement method
Yu Shijing (余世经), Li Dongmei (李冬梅), Liu Runsheng (刘润生)
Institute of Circuits and Systems, Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Journal: Audio Engineering (电声技术), 2014, vol. 38, no. 2, pp. 50–54
Keywords: speech enhancement; computational auditory scene analysis; cues; masking
Abstract: Building on the ideas of computational auditory scene analysis (CASA), a single-channel speech enhancement method is proposed. By analyzing the spectral characteristics of three typical noise types (white noise, wind noise, and periodic noise) and of general speech signals, suitable signal features are constructed as cues for identifying the dominant component of each time-frequency unit; each time-frequency unit is then multiplied by a corresponding attenuation coefficient to mask the noise components. Objective tests and informal listening tests on simulation results show that, compared with the commonly used multi-band spectral subtraction and Wiener filtering methods, the proposed algorithm suppresses background noise such as white noise, wind noise, and periodic noise more effectively.

1 Introduction
Single-channel speech enhancement can be used to suppress background noise in real-time communication, and can also serve as a front end for systems such as speech compression and speech recognition.
The most commonly used single-channel speech enhancement algorithms today include spectral subtraction, multi-band spectral subtraction, and Wiener filtering.
Spectral subtraction [1] first estimates the noise power spectrum from the noisy speech power spectrum by some method and then subtracts it from the noisy speech power spectrum. Multi-band spectral subtraction [2-3] improves on traditional spectral subtraction: it splits the frequency-domain signal into several sub-bands and performs the subtraction per band, which yields a larger SNR improvement.
A main problem with spectral subtraction and multi-band spectral subtraction is residual "musical noise"; at low SNR (below 5 dB) in particular, the residual noise severely degrades the listening experience.
Wiener filtering [4-5] can suppress "musical noise" to some extent, but at low SNR it still leaves considerable residual noise, and the SNR improvement it achieves is unsatisfactory.
Both spectral subtraction and Wiener filtering require an estimate of the noise power spectrum, and their enhancement performance depends largely on the accuracy of the estimated noise spectral amplitude. When the interference is non-stationary noise such as wind noise, the noise power spectrum is hard to estimate accurately, so spectral subtraction and Wiener filtering suppress non-stationary noise poorly.
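For reference, the spectral-subtraction baseline discussed above can be sketched in a few lines. This is an illustrative textbook version, not the paper's CASA method; the frame length, spectral floor, and windowing choices are our own assumptions:

```python
import numpy as np

def spectral_subtraction(noisy, noise_psd, frame_len=256, floor=0.01):
    """Textbook power spectral subtraction with Hann windows and 50% overlap.

    noise_psd: pre-estimated noise power spectrum on the rfft bins of a
    windowed frame. The spectral floor keeps a small fraction of the noisy
    power in each bin, which limits (but does not remove) musical noise."""
    window = np.hanning(frame_len)
    hop = frame_len // 2
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[start:start + frame_len] * window)
        power = np.abs(spec) ** 2
        clean_power = np.maximum(power - noise_psd, floor * power)
        gain = np.sqrt(clean_power / np.maximum(power, 1e-12))
        out[start:start + frame_len] += np.fft.irfft(gain * spec, n=frame_len)
    return out
```

When the noise is non-stationary, the fixed noise_psd estimate is wrong frame by frame, which is exactly the failure mode the paragraph above describes.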
A speech enhancement algorithm for a dual miniature microphone array
Mao Wei (毛维), Zeng Qingning (曾庆宁), Long Chao (龙超)
School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China
Journal: Science Technology and Engineering (科学技术与工程), 2018, vol. 18, no. 10, pp. 245–249
Keywords: dual miniature microphone array; speech enhancement; logarithmic minimum mean square error; minimum variance distortionless response; improved minimum controlled recursive averaging; Wiener filtering
Abstract: Considering the limitations of traditional single-channel speech enhancement algorithms for noise suppression, a dual-mini microphone array consisting of two microphone arrays is employed. The noisy speech is processed using the temporal and spatial characteristics of the array's spatial structure, and a speech enhancement algorithm based on the dual-mini microphone array is proposed. The algorithm first uses the logarithmic minimum mean square error (LogMMSE) estimator to raise the SNR of the noisy speech signals collected by each channel. Then, a frequency-domain wideband minimum variance distortionless response (MVDR) beamformer obtains and retains the signal in the target source direction while suppressing interfering signals from other directions. Finally, a Wiener filter combining an improved-intelligibility design with improved minimum controlled recursive averaging (IMCRA) noise estimation removes residual noise to improve speech quality. Simulation results show that, compared with traditional single-channel speech enhancement algorithms, the proposed algorithm has good noise suppression performance.
In recent years, artificial intelligence has developed rapidly, with revolutionary breakthroughs in fields such as computer vision, big data analytics, autonomous driving, machine translation, and spam filtering. As the interface for human-machine interaction, speech plays an important role in improving and optimizing these systems.
Journal of Xidian University, Feb. 2020, Vol. 47, No. 1. doi: 10.19665/j.issn1001-2400.2020.01.015
Speech enhancement method based on the multi-head self-attention mechanism
CHANG Xinxu, ZHANG Yang, YANG Lin, KOU Jinqiao, WANG Xin, XU Dongdong (Beijing Institute of Computer Technology and Application, Beijing 100854, China)
Abstract: Because of the masking effect in human auditory perception, a high-energy signal masks other signals of lower energy. Inspired by this phenomenon, this paper combines self-attention and multi-head attention to propose a speech enhancement method based on the multi-head self-attention mechanism. By applying multi-head self-attention to the input noisy speech features, the clean-speech part and the noise part of the features can be clearly distinguished, enabling subsequent processing to suppress noise more effectively. Experimental results show that the proposed method significantly outperforms a method based on gated recurrent neural networks in terms of both speech quality and intelligibility.
Key words: speech enhancement; deep neural network; self-attention; multi-head attention; gated recurrent unit
CLC number: TN912.35  Document code: A  Article ID: 1001-2400(2020)01-0104-07
Speech enhancement is an important research direction in signal processing. Its main goal is to improve the quality and intelligibility of noise-corrupted speech [1], and it has broad application prospects in speech recognition, mobile communication, artificial hearing, and many other fields [2]. In recent years, with the rise of deep learning, supervised speech enhancement based on deep neural networks (DNNs) has achieved great success; especially at low SNR and in non-stationary noise, such methods show clear advantages over traditional ones.
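The multi-head self-attention computation described above can be sketched in NumPy as scaled dot-product attention over a sequence of feature frames. This is a generic illustration, not the paper's exact network: the projection matrices, head count, and dimensions are arbitrary assumptions.

```python
import numpy as np

def multi_head_self_attention(x, wq, wk, wv, wo, n_heads):
    """Minimal multi-head scaled dot-product self-attention (illustrative).

    x              : (T, d) sequence of feature frames
    wq, wk, wv, wo : (d, d) projection matrices (hypothetical, untrained)
    """
    T, d = x.shape
    dh = d // n_heads
    # Project and split into heads: (T, d) -> (n_heads, T, dh)
    q = (x @ wq).reshape(T, n_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(T, n_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(T, n_heads, dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (n_heads, T, T)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over key frames
    out = (attn @ v).transpose(1, 0, 2).reshape(T, d) # concatenate heads
    return out @ wo
```

Each output frame is a weighted mixture of all input frames, which is the mechanism the paper exploits to separate the clean-speech and noise parts of the features.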
A Modified Speech Enhancement Algorithm Based on a Generalized Gamma Speech Model
Zhao Gaihua; Zhou Bin; Zhang Xiongwei
Abstract: The generalized Gamma model is a recently proposed speech distribution model that is more general and flexible than the traditional Gaussian or super-Gaussian models. This paper presents a speech enhancement algorithm based on a generalized Gamma speech model, modified by the speech presence probability. Assuming that the spectral amplitude coefficients of speech and noise follow generalized Gamma and Gaussian distributions respectively, a minimum mean-square error (MMSE) estimator of the log-spectral amplitude of the speech signal is derived. Under the same model, the speech presence probability is then derived, estimated for each frequency bin and each frame, and used to modify the MMSE estimate. Simulation results show that, compared with conventional short-time spectral amplitude estimators based on Gaussian and super-Gaussian speech models, the proposed algorithm not only further improves the SNR of the enhanced speech but also effectively reduces speech distortion and improves perceived subjective quality.
Journal: Computer Engineering and Applications
Year (volume), issue: 2014, (18)
Pages: 6 (P230-235)
Keywords: speech enhancement; speech presence probability; generalized Gamma distribution; minimum mean-square error; log spectrum
Authors: Zhao Gaihua; Zhou Bin; Zhang Xiongwei
Affiliation: College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007
Language: Chinese
CLC number: TP912.3
1 Introduction
In speech communication, speech signals are inevitably corrupted by noise, which degrades communication quality and downstream processing. Speech enhancement, which extracts the original clean speech from noisy speech as far as possible, is an important means of improving speech intelligibility and communication quality.
I.J. Image, Graphics and Signal Processing, 2019, 9, 44-55. Published Online September 2019 in MECS (/). DOI: 10.5815/ijigsp.2019.09.05
Speech Enhancement based on Wavelet Thresholding the Multitaper Spectrum Combined with Noise Estimation Algorithm
P. Sunitha, Research Scholar, Dept. of ECE, JNTUK, India. Email: Sunitha4949@,
Dr. K. Satya Prasad, Retd. Professor, Dept. of ECE, JNTUK, India. Email: sprasad.kodati@
Received: 26 May 2019; Accepted: 26 June 2019; Published: 08 September 2019
Abstract—This paper presents a method to reduce the musical noise encountered with most frequency-domain speech enhancement algorithms. Musical noise is a phenomenon of random spectral peaks in each speech frame, caused by the large variance and inaccurate estimates of the spectra of the noisy speech and noise signals. To obtain a low-variance spectral estimate, this paper uses wavelet thresholding of the multitaper spectrum combined with a noise estimation algorithm that estimates the noise spectrum as an average of past and present spectra according to a predetermined weighting factor, thereby reducing musical noise. To evaluate the performance of this method, sine multitapers were used and the spectral coefficients were thresholded by wavelet thresholding to obtain a low-variance spectrum. Both scale-dependent and scale-independent thresholding, with soft and hard thresholding using Daubechies wavelets, were used to evaluate the proposed method in terms of objective quality measures under eight different types of real-world noise at three levels of input SNR.
To predict speech quality in the presence of noise, objective quality measures such as segmental SNR, Weighted Spectral Slope distance, Log Likelihood Ratio, Perceptual Evaluation of Speech Quality (PESQ), and composite measures are compared against wavelet de-noising techniques, Spectral Subtraction, and Multiband Spectral Subtraction; the proposed method provides consistent performance across all eight noise types in most of the cases considered.
Index Terms—Speech Enhancement, Wavelet thresholding, Multitaper Power Spectrum, Noise power estimation, smoothing parameter, SNR, threshold.
I. INTRODUCTION
Speech is a basic way of communicating ideas from one person to another. Speech is degraded by background noise, and numerous speech enhancement algorithms are available to reduce it; among them, spectral subtractive algorithms are the most popular because of their simple implementation and effectiveness. In these algorithms the noise power spectrum is subtracted from the noisy power spectrum, assuming the noise spectrum is available. Such methods introduce musical noise due to inaccurate noise estimates; they work well in stationary noise but fail in non-stationary noise. This led to the use of low-variance spectral estimation methods, because spectral estimation plays a key role in speech enhancement. To reduce the variance, an average of estimates can be computed across all frequencies. To improve speech quality and intelligibility in highly non-stationary noise, a speech enhancement algorithm requires noise estimators that update the noise spectrum continuously, and most applications in non-stationary scenarios use such continuously tracking estimators. Researchers now focus on improving speech quality and intelligibility through efficient noise estimation algorithms.
The noise estimate depends strongly on the smoothing parameter. If its value is too large (i.e., close to one), the noise level is overestimated. Generally, the smoothing parameter is kept small during speech activity to track the non-stationarity of the speech; this makes the smoothing parameter time- and frequency-dependent, taking the speech presence or absence probability into account. Numerous noise estimation algorithms are available in the literature. One of them, the minimum statistics algorithm proposed in [1], estimates the noise by considering the instantaneous SNR of speech using a smoothing parameter and a bias correction factor. It tracks the minimum over a fixed window and updates the noise PSD. Tested under non-stationary noise, this method yields large errors and cannot respond to fast increases in noise power. Martin, R. implemented spectral subtraction with minimum statistics and evaluated its performance in terms of both objective and subjective measures; compared against a spectral subtraction method that uses voice activity detection, it improved speech intelligibility measures [2]. Another variant of minimum statistics, suggested in [3], estimates noise by continuous spectral minimum tracking in sub-bands. This method obtains the spectral minimum differently, by smoothing the noisy speech power spectra continuously with a non-linear smoothing rule. This non-linear tracking smooths the PSD continuously without distinguishing between speech-presence and speech-absence segments. Its shortcoming is that the noise estimate follows any increase in the noisy power spectrum irrespective of whether the noise power itself has changed, and likewise follows any decrease; this results in overestimation of noise during speech-presence regions, i.e., clipping of speech.
This method was evaluated in terms of objective and subjective quality measures and showed superior performance over the minimum statistics algorithm. The non-uniform effect of noise on the speech spectrum affects some frequency components more severely than others. This led to a time-recursive noise estimation algorithm that updates the noise spectrum when the effective SNR in a particular band is too small [4]. In this method the noise spectrum is estimated as a weighted average of past and present estimates of the noisy power spectrum, depending on the effective SNR in each frequency bin. This algorithm tracks non-stationary noise well in the case of multi-talker babble. Another recursive algorithm uses a fixed smoothing factor but updates the noise spectrum based on comparison of the estimated a-posteriori SNR with a threshold [5]. If the a-posteriori SNR is larger than the threshold, speech is present and no update of the noise spectrum is required; otherwise the segment is treated as speech absence and the noise estimate is updated. This method is known as weighted spectral averaging. The threshold value has a significant effect on the noise spectrum estimate: if it is too small, the noise spectrum is underestimated; conversely, if it is too high, the spectrum is overestimated. Improvements to minimum statistics were suggested in [6] using optimal smoothing for noise power spectral density estimation. Cohen proposed a noise estimation algorithm that uses a time-frequency dependent smoothing factor, continuously updated according to the speech presence probability in each frequency bin. The speech presence probability was calculated as the ratio of the noisy power spectrum to its local minimum [6], where the local minimum is computed from the smoothed noisy PSD over a fixed window by sample-wise comparison.
A shortcoming of this approach is that it may lag when the noise power rises above the true noise PSD. To address this, a different approach suggested in [8,9] uses continuous spectral minimum tracking with a frequency-dependent threshold to identify speech-presence segments. It was evaluated in subjective preference tests against other noise estimation algorithms such as MS and MCRA, and showed better performance. A further refinement was reported in [10]: noise power spectrum estimation in adverse environments by Improved Minima Controlled Recursive Averaging (IMCRA). This method involves two steps, smoothing and minimum tracking; minimum tracking provides voice activity detection in each frame, while smoothing excludes strong speech components. The speech presence probability is calculated using the a-posteriori and a-priori SNRs. This method yields lower error values for the different noise types considered.
The structure of the paper is as follows: Section II provides the literature review; multitaper spectral estimation and spectral refinement are given in Section III; noise estimation by the weighted spectral averaging technique is presented in Section IV; Section V presents the proposed speech enhancement method; results and discussion appear in Section VI; and Section VII concludes.
II. LITERATURE REVIEW
This section reviews spectral subtractive algorithms for single-channel enhancement. A number of researchers have proposed different speech enhancement methods, most based on Spectral Subtraction (SS), statistical-model-based methods, subspace algorithms, and transform-based methods.
One popular noise reduction method for single-channel speech enhancement, computationally efficient and of low complexity, is spectral subtraction, proposed by Boll, S.F. for both magnitude and power spectral subtraction; it creates a by-product known as synthetic noise [17]. A significant improvement, non-linear spectral subtraction with an over-subtraction factor and a spectral floor parameter to reduce musical noise, was given by Berouti [19]. Multi-Band Spectral Subtraction (MBSS), with separate subtraction factors in non-overlapping frequency bands, was proposed by S.D. Kamath [18]. Ephraim and Malah proposed spectral subtraction with MMSE, using a gain function based on the a-priori and a-posteriori SNRs [20]. Spectral subtraction based on perceptual properties, using the masking properties of the human auditory system, was proposed by Virag [21]. Extended spectral subtraction, which uses a Wiener filter to estimate the noise spectrum, was proposed by Sovka [22]. A two-band selective spectral subtraction algorithm was described by He, C. and Zweig, G. [23]. Spectral subtraction with adaptive gain averaging, which reduces overall processing delay, was given by Gustafsson et al. [24]. A frequency-dependent non-linear spectral subtraction (NSS) method was presented by Lockwood and Boudy [25]. Spectral subtractive algorithms work well for additive noise but fail for colored noise. To overcome this, Hu and Loizou proposed a speech enhancement technique based on wavelet thresholding the multitaper spectrum [11], whose performance was evaluated in terms of objective quality measures.
III. MULTITAPER SPECTRAL ESTIMATION AND SPECTRAL REFINEMENT
Because of sudden changes and sporadic behavior, the speech signal is modeled as non-stationary: its statistics, such as the mean, variance, covariance, and higher-order moments, change over time.
Spectral analysis plays a major role in speech enhancement, since accurate noise estimation depends on it. The FFT is widely used for power spectrum estimation in most speech enhancement algorithms, especially spectral subtractive methods, but the FFT-based estimate suffers from high variance and from energy leakage across frequencies, which creates bias. To reduce leakage, the time-domain signal is multiplied by a suitable window with low side-lobe energy. The window type affects the noise estimate, so selecting a window that yields an accurate noise estimate plays a significant role in the enhancement process. The Hamming window is generally preferred for its low side-lobe energy, but windowing reduces leakage, not variance: in most speech enhancement algorithms, the noise estimate is obtained with windows that reduce bias but not variance. The variance can be reduced by taking multiple estimates from the same sample, which is achieved using tapers. Hu and Loizou [11] used such multitapers to obtain a low-variance spectral estimate, refined the spectrum using wavelet thresholding, and used it to improve speech quality under highly non-stationary noise. Results show superior performance in terms of quality measures, with high correlation between subjective listening tests and objective quality measures. Speech enhancement techniques find a wide range of applications, from hearing aids to personal communication, teleconferencing, Automatic Speech Recognition (ASR), speaker authentication, and voice-operated systems.
The multitaper spectrum estimator is given by

$$\hat{S}^{mt}(\omega) = \frac{1}{L}\sum_{p=0}^{L-1}\hat{S}_p^{mt}(\omega) \qquad (1)$$

with

$$\hat{S}_p^{mt}(\omega) = \Big|\sum_{m=0}^{N-1} b_p(m)\,x(m)\,e^{-j\omega m}\Big|^2 \qquad (2)$$

Here N is the data length and $b_p$ is the p-th sine taper used for the spectral estimate [12], given by

$$b_p(m) = \sqrt{\frac{2}{N+1}}\,\sin\frac{\pi p (m+1)}{N+1}, \qquad m = 0, \ldots, N-1 \qquad (3)$$

Further refinement of the spectrum is obtained by applying wavelet thresholding. Let

$$v(\omega) = \frac{\hat{S}^{mt}(\omega)}{S(\omega)} \sim \frac{\chi^2_{2L}}{2L}, \qquad 0 < \omega < \pi \qquad (4)$$

where $v(\omega)$ is the ratio of the estimated multitaper spectrum to the true power spectrum. Taking logarithms on both sides gives

$$\log\hat{S}^{mt}(\omega) = \log S(\omega) + \log v(\omega) \qquad (5)$$

so the log multitaper spectrum can be treated as the sum of the true log spectrum and a noise term. If L is at least 5, then $\log v(\omega)$ is close to normally distributed, and the random variable $n(\omega)$ is given by

$$n(\omega) = \log v(\omega) - \phi(L) + \log L \qquad (6)$$

$Z(\omega)$ is defined as

$$Z(\omega) = \log\hat{S}^{mt}(\omega) - \phi(L) + \log L \qquad (7)$$

The idea behind multitaper spectral refinement [11] can be summarized as:
1. Obtain the multitaper spectrum of the noisy speech using orthogonal sine tapers (equation 1).
2. Apply the Daubechies Discrete Wavelet Transform to get the DWT coefficients.
3. Perform thresholding on the DWT coefficients.
4. Apply the Inverse Discrete Wavelet Transform to get the refined log spectrum.
Fig. 1. Multiple window method for spectrum estimation by individual windows.
Fig. 2. Speech signal (mtlb.wav).
Fig. 3. (a), (b), (c): Spectrum obtained for N = 1, 2, 3, where 'N' is the number of tapers.
Fig. 4. Final spectrum obtained by averaging.
IV. NOISE ESTIMATION BY WEIGHTED SPECTRAL AVERAGING
Noise estimation algorithms work on the assumption that the analysis segment is long enough to contain both low-energy segments and speech pauses, and that the noise present in the segment is more stationary than the speech.
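Equations (1)-(3) can be sketched directly in NumPy. One assumption is flagged in the code: the taper index is taken as p = 1, …, L (the usual sine-taper convention), since a literal p = 0 in equation (3) would give an identically zero taper.

```python
import numpy as np

def sine_tapers(N, L):
    """L orthonormal sine tapers of length N (cf. equation 3).

    Indexed p = 1..L here so that no taper is identically zero; the section's
    0-based indexing is assumed to denote the same family.
    """
    m = np.arange(N)
    return np.array([np.sqrt(2.0 / (N + 1)) * np.sin(np.pi * p * (m + 1) / (N + 1))
                     for p in range(1, L + 1)])

def multitaper_spectrum(x, L):
    """Average of L single-taper power spectra (equations 1-2), on the rfft grid."""
    N = len(x)
    tapers = sine_tapers(N, L)
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2  # one spectrum per taper
    return spectra.mean(axis=0)
```

Averaging the L nearly independent single-taper spectra is what lowers the variance of the estimate relative to a plain windowed periodogram.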
This paper uses noise estimation based on the variance of the spectrum suggested in [12]. The noise spectrum is updated whenever the magnitude spectrum of the noisy speech falls within a variance band of the noise estimate, i.e., when

$$|\hat{S}^{mt}(\lambda,k)| - \sigma_d(\lambda,k) < \epsilon\,\sqrt{\mathrm{Var}_d(\lambda,k)} \qquad (8)$$

where $\mathrm{Var}_d(\lambda,k)$ is the instantaneous variance of the noise spectrum, $\epsilon$ is an adjustable parameter, $|\hat{S}^{mt}(\lambda,k)|$ is the multitaper magnitude spectrum, and $\sigma_d(\lambda,k)$ is the estimate of the noise PSD; $\lambda$ is the frame index and $k$ the frequency bin. The noise estimation algorithm can be summarized as follows:

If $|\hat{S}^{mt}(\lambda,k)| - \sigma_d(\lambda-1,k) < \epsilon\,\sqrt{\mathrm{Var}_d(\lambda-1,k)}$:

$$\sigma_d(\lambda,k) = \alpha\,\sigma_d(\lambda-1,k) + (1-\alpha)\,|\hat{S}^{mt}(\lambda,k)| \qquad (9)$$

$$\mathrm{Var}_d(\lambda,k) = \delta\,\mathrm{Var}_d(\lambda-1,k) + (1-\delta)\,\big[\,|\hat{S}^{mt}(\lambda,k)| - \sigma_d(\lambda,k)\,\big]^2 \qquad (10)$$

else

$$\hat{\sigma}_d(\lambda,k) = \hat{\sigma}_d(\lambda-1,k) \qquad (11)$$

where $\delta$ is a smoothing parameter. This paper uses this weighted spectral averaging method for noise estimation from the noisy power spectrum with $\delta = \alpha = 0.9$ and $\epsilon = 2.5$.
V. PROPOSED METHOD
The implementation details of the speech enhancement method are as follows:
1. Obtain the multitaper estimate of the noisy speech using sine tapers (equation 1).
2. Perform spectral refinement by wavelet thresholding, which involves the Forward Discrete Wavelet Transform (FDWT), thresholding, and the Inverse Discrete Wavelet Transform (IDWT). Daubechies wavelets are used at level-5 decomposition with both soft and hard thresholding.
3. Compute $Z(\omega)$ from equation (7), apply the Discrete Wavelet Transform to $Z(\omega)$, and threshold the multitaper spectrum coefficients to obtain the refined log spectrum.
4. Estimate the noise using the weighted spectral recursive averaging algorithm discussed in Section IV.
5.
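Equations (8)-(11) amount to a per-bin if/else recursion. A vectorized NumPy sketch (the function and variable names are mine; the update order of equations (9)-(10), magnitude first, then variance using the new magnitude estimate, is preserved):

```python
import numpy as np

def update_noise_estimate(mag, sigma_d, var_d, alpha=0.9, delta=0.9, eps=2.5):
    """One frame of the weighted-spectral-averaging noise tracker (eqs. 8-11).

    mag     : |S^mt(lambda, k)|, magnitude spectrum of the current frame
    sigma_d : previous noise magnitude estimate, per bin
    var_d   : previous variance of the noise spectrum, per bin
    Returns the updated (sigma_d, var_d); bins classified as speech keep the
    previous estimate (equation 11).
    """
    is_noise = (mag - sigma_d) < eps * np.sqrt(var_d)       # condition (8)
    new_sigma = np.where(is_noise,
                         alpha * sigma_d + (1 - alpha) * mag,   # eq. (9)
                         sigma_d)
    new_var = np.where(is_noise,
                       delta * var_d + (1 - delta) * (mag - new_sigma) ** 2,  # eq. (10)
                       var_d)
    return new_sigma, new_var
```

Calling this once per frame over the utterance reproduces the tracker with the paper's settings δ = α = 0.9 and ε = 2.5.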
Perform multitaper spectral subtraction between the refined log spectrum of the noisy speech and the noise spectrum to obtain an estimate of the clean speech spectrum:

$$\hat{S}_x^{mt}(\omega) = \hat{S}_y^{mt}(\omega) - \hat{S}_n^{mt}(\omega) \qquad (12)$$

This can yield negative values, which are floored as

$$\hat{S}_x^{mt} = \begin{cases} \hat{S}_y^{mt} - \hat{S}_n^{mt}, & \text{if } \hat{S}_y^{mt} > \hat{S}_n^{mt} \\ \beta\,\hat{S}_n^{mt}, & \text{if } \hat{S}_y^{mt} \le \hat{S}_n^{mt} \end{cases} \qquad (13)$$

where $\beta$ is the spectral floor parameter.
6. Finally, the enhanced speech signal is reconstructed using the Inverse Discrete Fourier Transform and the overlap-add method.
Fig. 5. Block diagram of the proposed method.
VI. RESULTS AND DISCUSSION
Speech enhancement techniques can be assessed either by objective quality measures or by subjective listening tests. A subjective listening test is a comparative assessment of the original and processed speech signals by a group of listeners, based on the human auditory system; it is a complex process, and it is difficult to identify listeners with good listening skills. Objective evaluation is based on mathematical comparison of the clean and enhanced signals. To calculate the objective measures, the speech signal is first divided into frames of 10-30 ms duration; a single measure is then obtained as the average of the distortion measures calculated over all processed frames. This section gives the performance analysis of the proposed method using four bands. Simulations were performed in MATLAB. The NOIZEUS speech corpus, available at [15] and used by most researchers, contains 30 sentences from six speakers (three male, three female), originally sampled at 25 kHz and downsampled to 8 kHz with 16-bit quantization. Clean speech is corrupted by eight different real-world noises (babble, airport, station, street, exhibition, restaurant, car, and train) at three input SNRs (0 dB, 5 dB, 10 dB).
In this algorithm the speech sample is taken from a male speaker; the English sentence is "we can find joy in the simplest things". The performance evaluation uses the following quality measures: segmental SNR, Weighted Slope Spectral Distance (WSSD) [13], Log Likelihood Ratio, Perceptual Evaluation of Speech Quality (PESQ) [14], and three composite measures [13].
A. Segmental SNR (seg-SNR)
To improve the correlation between the clean and processed speech signals, the SNR is computed per frame and averaged [13], giving the segmental signal-to-noise ratio in the time domain:

$$SNR_{seg} = \frac{10}{M}\sum_{m=0}^{M-1}\log_{10}\frac{\sum_{n=Nm}^{Nm+N-1}x^2(n)}{\sum_{n=Nm}^{Nm+N-1}\big(x(n)-\hat{x}(n)\big)^2} \qquad (14)$$

Here $x(n)$ is the original speech signal, $\hat{x}(n)$ is the processed speech signal, $N$ is the frame length, and $M$ is the number of frames. The seg-SNR is the geometric mean of the per-frame SNRs of the speech signal [10], with values limited to the range [-10, 35] dB.
B. Log Likelihood Ratio (LLR)
This measure is based on LPC analysis of the speech signal:

$$LLR(\vec{a}_x, \vec{a}_{\hat{x}}) = \log\left(\frac{\vec{a}_{\hat{x}}\,R_x\,\vec{a}_{\hat{x}}^{\,T}}{\vec{a}_x\,R_x\,\vec{a}_x^{\,T}}\right) \qquad (15)$$

where $\vec{a}_x$ and $\vec{a}_{\hat{x}}$ are the LPC coefficient vectors of the original and processed signals, and $R_x$ is the autocorrelation matrix of the original signal. The denominator term is always lower than the numerator, so the LLR is always positive [13]; LLR values lie in the range 0-2.
C.
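Equation (14) with the stated [-10, 35] dB clamping can be sketched as follows; the frame length and the small constants guarding against division by zero are my additions.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo=-10.0, hi=35.0):
    """Frame-averaged segmental SNR in dB (equation 14), clamped per frame."""
    n_frames = len(clean) // frame_len
    vals = []
    for m in range(n_frames):
        x = clean[m * frame_len:(m + 1) * frame_len]
        e = x - enhanced[m * frame_len:(m + 1) * frame_len]
        num = np.sum(x ** 2)
        den = np.sum(e ** 2) + 1e-12   # guard against a perfectly enhanced frame
        vals.append(np.clip(10 * np.log10(num / den + 1e-12), lo, hi))
    return float(np.mean(vals))
```

The per-frame clamping is what keeps silent or perfectly reconstructed frames from dominating the average.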
Weighted Slope Spectral Distance (WSSD)
This measure is the weighted difference between the spectral slopes in each band, computed with a first-order difference operation [13]. With spectral slopes $S_x$ and $S_{\hat{x}}$ of the original and processed signals in each band:

$$WSSD = \frac{1}{M}\sum_{m=0}^{M-1}\frac{\sum_{j=1}^{K}W(j,m)\,\big(S_x(j,m)-S_{\hat{x}}(j,m)\big)^2}{\sum_{j=1}^{K}W(j,m)} \qquad (16)$$

D. Perceptual Evaluation of Speech Quality (PESQ)
PESQ is an objective quality measure recommended by ITU-T [14] that provides accurate speech quality prediction at the cost of higher computational complexity. It is a linear combination of the average disturbance $D_{ind}$ and the average asymmetrical disturbance $A_{ind}$:

$$PESQ = 4.754 - 0.186\,D_{ind} - 0.008\,A_{ind} \qquad (17)$$

E. Composite Measures
A linear combination of existing objective quality measures yields a new measure [10], obtained by linear regression analysis. This paper uses multiple linear regression to obtain the following composite measures [13], each on a five-point scale.
(i) Signal distortion ($C_{sig}$): a linear combination of the PESQ, LLR, and WSSD measures [13]:

$$C_{sig} = 3.093 - 1.029\,LLR + 0.603\,PESQ - 0.009\,WSSD \qquad (18)$$

(ii) Noise intrusiveness ($C_{bak}$): a linear combination of the PESQ, seg-SNR, and WSSD measures [13]:

$$C_{bak} = 1.634 + 0.478\,PESQ - 0.007\,WSSD + 0.063\,segSNR \qquad (19)$$

(iii) Overall quality ($C_{ovl}$): a linear combination of the LLR, PESQ, and WSSD measures:

$$C_{ovl} = 1.594 + 0.805\,PESQ - 0.512\,LLR - 0.007\,WSSD \qquad (20)$$

The scales of signal distortion, background intrusiveness, and overall quality are shown in Tables 1-3.
Table 1. Scale of Signal Distortion
Table 2. Scale of Background Intrusiveness
Table 3.
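The regression formulas (18)-(20) are straightforward to compute. Note one assumption in this sketch: the WSSD coefficient in equation (19) is taken as −0.007, the sign in the published Hu-Loizou regression, where the extracted text was ambiguous.

```python
def composite_measures(llr, pesq, wssd, seg_snr):
    """Composite quality measures per equations (18)-(20).

    llr, pesq, wssd, seg_snr : scalar averages of the base objective measures.
    Returns (C_sig, C_bak, C_ovl) on their five-point scales.
    """
    c_sig = 3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wssd   # eq. (18)
    c_bak = 1.634 + 0.478 * pesq - 0.007 * wssd + 0.063 * seg_snr  # eq. (19)
    c_ovl = 1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wssd   # eq. (20)
    return c_sig, c_bak, c_ovl
```

For example, llr = 0.5, pesq = 2.5, wssd = 40, seg_snr = 5 gives C_sig ≈ 3.73, C_bak ≈ 2.86, C_ovl ≈ 3.07.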
Scale of Overall Quality
To obtain the objective quality measures for the proposed method, the multitaper spectrum is first computed using sine tapers; spectral refinement is then achieved by wavelet thresholding the multitaper spectrum, and the noise spectrum is estimated by weighted spectral averaging. The results were compared against wavelet de-noising with hard thresholding (WDH) and soft thresholding (WDS) suggested in [16], Spectral Subtraction (SS) [17], and Multi-Band Spectral Subtraction (MBSS) [18].
Table 4. Objective quality measures: Segmental SNR (seg-SNR), Log Likelihood Ratio (LLR), Weighted Slope Spectral Distance (WSSD), Perceptual Evaluation of Speech Quality (PESQ)
Table 5. Composite measures (C_sig, C_bak, C_ovl) for eight different types of noise
Fig. 6. (a) Segmental SNR, (b) Log Likelihood Ratio, (c) Weighted spectral slope distance, (d) PESQ, (e) Signal distortion (C_sig), (f) Background intrusiveness (C_bak), (g) Overall quality (C_ovl) against input SNR.
Fig. 7. Time-domain and spectrogram representations of clean, noisy, and enhanced speech for SS [17], MBSS [18], WDH, WDS [16], wavelet thresholding the multitaper spectrum [11], and the proposed method.
VII. CONCLUSION
From the results in Table 4, the performance of the wavelet de-noising techniques is poorest on all objective quality measures, i.e., lower segmental SNR and PESQ and higher LLR and WSSD, in all cases considered. The proposed method shows superior performance, i.e., higher segmental SNR and PESQ for all noise types at the three input SNR levels, against all methods considered; its performance decreases in terms of LLR and WSSD when compared with Multi-Band Spectral Subtraction.
The composite measures in Table 5 indicate that the proposed method improves all three composite measures compared with the four other methods considered. The same results are shown as graphs, averaged over all eight noises at three levels, in Fig. 6(a)-(g). It can be concluded that the proposed method yields higher segmental SNR, PESQ, and composite measures. Fig. 7 shows the time-domain and frequency-domain representations of the noisy, noise, and enhanced speech signals for Spectral Subtraction [17], Multi-Band Spectral Subtraction [18], wavelet de-noising with soft and hard thresholding [16], wavelet thresholding the multitaper spectrum [11], and the proposed method. Spectrograms, widely used in speech processing, plot the spectrum of frequencies as it varies with time; here each spectrogram is computed as a sequence of FFTs over windowed segments of 20 ms duration. In the time domain, the signal enhanced by the spectral subtractive algorithm shows musical noise, which is eliminated by Multi-Band Spectral Subtraction; the same can be observed in the spectrograms. The wavelet de-noising techniques also suppress noise, and the proposed method gives an enhanced signal, and a spectrogram, closest to those of the clean speech.
ACKNOWLEDGEMENT
I would like to take this opportunity to express my profound gratitude and deep regard to my research guide, Dr. K. Satya Prasad, for his exemplary guidance, valuable feedback, and constant encouragement throughout the research. His valuable suggestions were of immense help, and working under him was an extremely knowledgeable experience for me.
I would also like to give my sincere gratitude to the authors Hu and Loizou, for inspiring me with their research papers in the field of speech enhancement and objective quality measures.
REFERENCES
[1] R. Martin, "An efficient algorithm to estimate the instantaneous SNR of speech signals", Proceedings of Eurospeech, Berlin, pp. 1093-1096, 1993.
[2] R. Martin, "Spectral subtraction based on minimum statistics", Proceedings of European Signal Processing Conference, U.K., pp. 1182-1185, 1994.
[3] G. Doblinger, "Computationally efficient speech enhancement by spectral minima tracking in subbands", Proceedings of Eurospeech, Spain, pp. 1513-1516, 1995.
[4] H. Hirsch and C. Ehrlicher, "Noise estimation techniques for robust speech recognition", Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, MI, pp. 153-156, 1995.
[5] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics", IEEE Transactions on Speech and Audio Processing, pp. 504-512, 2001.
[6] I. Cohen, "Noise estimation by minima controlled recursive averaging for robust speech enhancement", IEEE Signal Processing Letters, pp. 12-15, 2002.
[7] I. Cohen, "Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging", IEEE Transactions on Speech and Audio Processing, pp. 466-475, 2003.
[8] L. Lin, W. Holmes and E. Ambikairajah, "Adaptive noise estimation algorithm for speech enhancement", Electronics Letters, pp. 754-755, 2003.
[9] P. Loizou, R. Sundarajan and Y. Hu, "Noise estimation algorithm with rapid adaptation for highly non-stationary environments", Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004.
[10] P. Loizou and R. Sundarajan, "A noise-estimation algorithm for highly non-stationary environments", Speech Communication, 48, pp. 220-231, 2006.
[11] Y. Hu and P. C. Loizou, "Speech enhancement based on wavelet thresholding the multitaper spectrum", IEEE Transactions on Speech and Audio Processing, pp. 59-67, 2004.
[12] C. Ris and S. Dupont, "Assessing local noise level estimation methods: application to noise robust ASR", Speech Communication, pp. 141-158, 2001.
[13] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement", IEEE Transactions on Audio, Speech and Language Processing, pp. 229-238, Jan. 2008.
[14] ITU-T Rec., "Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs", International Telecommunication Union, Geneva, Switzerland, February 2001.
[15] A noisy speech corpus for assessment of speech enhancement algorithms. https: // / Loizou /speech/noizeous.
[16] D. L. Donoho, "De-noising by soft thresholding", IEEE Transactions on Information Theory, 41(3), pp. 613-627, 1995.
[17] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2), pp. 113-120, 1979.
Computer Engineering and Design, Apr. 2021, Vol. 42, No. 4
Phase spectrum compensation optimization jointed amplitude spectrum masking speech enhancement
ZHANG Cun-yuan 1, MA Jian-fen 1+, ZHANG Zhao-xia 2 (1. College of Information and Computer, Taiyuan University of Technology, Jinzhong 030600, China; 2. College of Physics and Optoelectronics, Taiyuan University of Technology, Jinzhong 030600, China)
Abstract: To solve the problem that most supervised speech enhancement networks so far cannot deal with the highly unstructured phase spectrum, a speech enhancement method based on phase spectrum estimation with a multi-target deep neural network was proposed. The estimation of the unstructured phase was converted into estimation of the phase spectrum compensation (PSC), and the compensation factor in the PSC was taken as one of the training targets in the training stage so as to estimate the phase directly. On this basis, the signal-to-noise ratio (SNR) was incorporated to improve the PSC, raising the accuracy with which the PSC estimates the target speech phase in non-stationary noise. A multi-task learning model that takes simultaneous estimation of amplitude spectrum masking and phase spectrum compensation as its training targets was established to achieve joint estimation of amplitude and phase. Experimental results show that the enhanced model that jointly estimates amplitude and phase has better denoising ability, and the quality and intelligibility of the enhanced speech are significantly improved.
Key words: deep neural networks; speech enhancement; phase spectrum estimation; amplitude masking; joint optimization
CLC number: TP391  Document code: A  Article ID: 1000-7024(2021)04-0990-06  doi: 10.16208/j.issn1000-7024.2021.04.014
0 Introduction
The goal of speech enhancement is to separate the target speech from noise interference. Researchers have proposed supervised masking-based speech enhancement [1]; common DNN mask estimation targets include the ideal binary mask (IBM) and the ideal ratio
2020年4月计算机工程与设计Apr.2020第41卷第4期COMPUTER ENGINEERING AND DESIGN Vol.41No.4基于麦克风阵列的语音增强算法闵新宇,王清理,冉云飞(中国航天科工集团第二研究院706所,北京100854)摘要:为解决现有语音增强算法需要麦克风数量较多和受估计误差影响较大的问题!提出一种改进的声源定位和波束形成方法。
在现有声源定位算法利用时间延迟的基础上!引入能量衰减参数!实现利用双麦克风进行声源定位的目标;在波束形成算法中引入加载系数,在出现协方差矩阵统计失配时仍可对期望方向聚焦!提高波束形成算法的鲁棒性。
仿真结果表明!改进后的算法与传统算法相比具有更强的鲁棒性。
关键词:语音增强;麦克风阵列;声源定位;波束形成;时延估计中图法分类号:TN912.35文献标识号:A文章编号:1000-7024(2020)04-1074-06doi:10.16208/j.issnl000-7024.2020.04.029Speech enhancement algorithm based on microphone arrayMIN Xin-yu,WANG Qing-li?RAN Yun-fei(Institute706,Second Academy of China Aerospace Science and Industry Corporation,Beijing100854,China) Abstract:To solve the problem that the existing speech enhancement algorithms need a large number of microphones and they aregreatlya f ectedbytheestimationerror,animprovedsoundsourcelocalizationandbeamformingmethodwasproposed\Based ontimedelay,energya t enuationparameterwasintroducedtorealizethetargetofsoundsourcelocalizationusingdoublemicro-phones.Loading coefficient was introduced to beamforming algorithm&which focused on the desired direction when the cova-riancematrixstatisticmismatchoccurred,e f ectivelyimprovingtherobustnessofbeamformingalgorithm\Thesimulationresults showthattheimprovedalgorithmismorerobustthanthetraditionalalgorithm\Key words:speech enhancement;microphone array;sound localization;beamforming#time delay estimation2引言为提高生产生活质量和效率,人们期望通过语音来发送指令和控制设备,但在实际使用环境中存在着各种各样的干扰,因此需要使用语音增强方法来滤除噪音,增强有效语音。
文章编号:1007-757X(2021)03-0108-03基于神经网络的语音增强算法研究王金超(上海大学通信与信息工程学院,上海200444)摘要:利用神经网络提高语音增强模型的性能与泛化能力。
对语音信号做短时傅立叶变换并提取对数能量谱特征,使用卷积循环网络(CRN)进行拟合,理想比例掩膜(IRM)作为回归目标%在方法上与全连接层网络、RNNoise对比,在目标上将理想比例掩膜与直接映射(DM)对比%在未训练过的噪声各个信噪比(SNR)上平均提高主观质量评分0.55分%关键词:神经网络;语音增强;卷积循环网络;理想比例掩膜中图分类号:TP311文献标志码:AResearch and Comparison of Speech EnhancementAlgorithms Based on Neural NetworkWANG Jinchao(College of Communication j Information Engineering,Shanghai University,Shanghai200444,China) Abstract:The neural network is used to improve the performance and generalization ability of speech enhancement.We perform short-time Fourier transform on the speech signal and extract log-power spectral features.Convolutional recurrent network (CRN)model is used to predict the clean spectral features,and ideal ratio mask(IRM)is used as regression target.The CRN is compared to fully connected neural network and RNNoise.IRM is compared to direct mapping(DM).On each signal-to-noise ratio(SNR)of untrained noise,perceptual evaluation of speech quality score is improved by an average of0.55points. Key words:neural network;speech enhancement;convolutional recurrent network;ideal ratio mask0引言随着智的发展,人机交互的,语音质量高产品质量与体验的重。
SPEECH ENHANCEMENT BASED ON TEMPORAL PROCESSING

Hynek Hermansky, Eric A. Wan, and Carlos Avendano

Oregon Graduate Institute of Science & Technology, Department of Electrical Engineering and Applied Physics, P.O. Box 91000, Portland, OR 97291

Appeared in ICASSP 95, Detroit, MI, 1995. This work was sponsored in part by USWEST Advanced Technologies under contract number 9069-111.

ABSTRACT

Finite Impulse Response (FIR) Wiener-like filters are applied to time trajectories of the cubic-root compressed short-term power spectrum of noisy speech recorded over cellular telephone communications. Informal listenings indicate that the technique brings a noticeable improvement to the quality of processed noisy speech while not causing any significant degradation to clean speech. Alternative filter structures are being investigated, as well as other potential applications in cellular channel compensation and narrowband-to-wideband speech mapping.

1. INTRODUCTION

The need for enhancement of noisy speech in telecommunications increases with the spread of cellular telephony. Calls may originate from rather noisy environments such as moving cars or crowded public places. The corrupting noise is often relatively stationary, or at least changing rather slowly. This leads us to investigate the use of RelAtive SpecTrAl (RASTA) processing of speech (Hermansky, Morgan, et al., 1991, 1994; Hirsch et al., 1991), which was originally designed to alleviate effects of convolutional and additive noise in automatic speech recognition (ASR). RASTA does this by band-pass filtering time trajectories of parametric representations of speech in a domain in which the disturbing noisy components are additive. Recently, RASTA was also applied to direct enhancement of noisy speech (Hermansky, Morgan, and Hirsch, 1993). In that case, RASTA filtering was applied to the magnitude (or cubic-root compressed power) of the short-term spectrum of speech while keeping the phase of the original noisy speech. However, applying the rather aggressive fixed ARMA RASTA filters (designed for suppression of convolutional distortions in ASR) yields results similar to spectral subtraction (see e.g., Boll (1979)), i.e., enhanced speech often contains musical noise and the technique typically degrades clean speech.

2. CURRENT TECHNIQUE

RASTA involves nonlinear filtering of the trajectory of the short-term power spectrum (estimated using a 256-point FFT applied with a 64-point overlap Hamming window at an 8 kHz sampling rate). Currently, we use linear filters applied to the cubic root of the estimated power spectrum (Hermansky et al. (1994)). In this study, we have substituted the ad hoc designed and fixed RASTA filters by a bank of non-causal FIR Wiener-like filters. Each filter is designed to optimally map a time window of the noisy speech spectrum at a specific frequency to a single estimate of the short-term magnitude spectrum of clean speech. For a 256-point FFT we require 129 unique filters.

Let $p_i(k)$ be the cubic-root estimate of the short-term power spectrum of noisy speech in frequency bin $i$ ($i = 1$ to $129$; $k$ corresponds to an 8 ms step). The output of each filter is the following:

$$\hat{p}_i(k) = \sum_{j=-M}^{M} w_i(j)\, p_i(k - j), \qquad (1)$$

where $\hat{p}_i(k)$ is the estimate of the clean speech cubic-root spectrum. The FIR filter coefficients $w_i(j)$ are found such that $\hat{p}_i$ is the least-squares estimate of the clean signal $p_i$ for each frequency bin $i$. The current design has $M = 10$, corresponding to 21-tap non-causal filters. As in the spectral subtraction technique, any negative spectral values after RASTA filtering are substituted by zeros, and the noisy speech phase is used for the synthesis. The technique is illustrated in Fig. 1.

[Figure 1: Block diagram of the system. Noisy speech passes through short-term analysis and magnitude-and-phase extraction; the 129 compressed magnitude trajectories (compression exponent a = 2/3) are processed by the FIR filter bank, expanded (exponent b = 3/2), and recombined with the noisy phase in the synthesis stage to produce the estimate of clean speech.]
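As a concrete illustration, the per-bin least-squares design of eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function names (`design_rasta_fir`, `apply_rasta_fir`) are our own, and frame-aligned clean and noisy trajectories of the compressed spectrum are assumed to be available for training.

```python
import numpy as np

def design_rasta_fir(p_noisy, p_clean, M=10):
    """Least-squares design of one non-causal FIR filter (eq. 1).

    p_noisy, p_clean: 1-D trajectories of the cubic-root compressed
    power spectrum in a single frequency bin (noisy and clean versions
    of the same utterance, frame-aligned).
    Returns the 2M+1 coefficients w(-M), ..., w(M).
    """
    K = len(p_noisy)
    rows, targets = [], []
    for k in range(M, K - M):
        # a window of noisy frames k-M..k+M predicts the clean frame k
        rows.append(p_noisy[k - M:k + M + 1])
        targets.append(p_clean[k])
    A, b = np.asarray(rows), np.asarray(targets)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def apply_rasta_fir(p_noisy, w):
    """Filter a noisy trajectory with the designed coefficients.

    Negative outputs are clipped to zero, as in spectral subtraction.
    """
    # 'same'-mode convolution with the reversed taps implements the
    # non-causal sum of eq. (1) centered on each frame
    p_hat = np.convolve(p_noisy, w[::-1], mode='same')
    return np.maximum(p_hat, 0.0)
```

In the full system one such filter would be designed and applied independently for each of the 129 frequency bins, and the filtered compressed magnitudes recombined with the noisy phase for synthesis.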
Initial filters were designed on approximately 2 minutes of speech of one male talker recorded at 8 kHz sampling over a public analog cellular line from a relatively quiet laboratory. The speech was artificially corrupted by additive noise recorded over a second cellular channel from a) a car driving on a freeway with a window closed, b) a car driving on a freeway with a window open, and c) a busy shopping mall.

3. RESULTING OPTIMAL RASTA FILTERS

Filter frequency responses are shown in Fig. 2 (darker shades represent larger values). The filters are approximately symmetric (i.e., near zero phase). Filters for different frequency channels differ. The whole frequency band between 0 and 4 kHz appears to be subdivided into several regions, each characterized by its own RASTA processing.

[Figure 2: Frequency response of RASTA filters (modulation frequency vs. center frequency of analysis channel, 0–4 kHz; gray scale from −60 dB to 0 dB; slices A, B, and C marked).]

The highest gain RASTA filters are applied in the frequency bands of about 300-2300 Hz. Frequency responses of typical filters in this region (slice A in Fig. 2) are shown in Fig. 3(A). Typically, filters in this frequency band have a band-pass character, emphasizing modulation frequencies around 6-8 Hz. Compared to our original ad hoc designed RASTA filter (Hermansky, Morgan, and Hirsch, 1993), the low-frequency bandstop is much milder, being at most only 10 dB down from the maximum. Filters for very low frequencies (0-100 Hz) are high-gain filters with a rather flat frequency response (slice C in Fig. 2 and Fig. 3(C)). Filters in the 150-250 Hz and the 2700-4000 Hz regions are low-gain low-pass filters (slice B in Fig. 2) with at least 10 dB attenuation for modulation frequencies above 5 Hz. The low-frequency pass-bands of these filters are typically below the pass-band of the high-gain band-pass filters of Fig. 3(A).

[Figure 3: Frequency responses (gain in dB vs. modulation frequency in Hz) of typical filters from slices A, B, and C of Fig. 2.]

The frequency response of a 129-tap Wiener filter designed on noisy data is shown by the solid line in Fig. 4. For comparison, the center taps of the RASTA filters designed on the magnitude spectrum (i.e., a = b = 1 in Fig. 1) are shown in the figure by the dashed line. Informal comparisons between the quality of speech processed by the time-domain Wiener filter and the RASTA filters indicate some advantage of the additional filter taps and compressed spectrum used with the new RASTA filters.

[Figure 4: Magnitude response (dB) over 0–4 kHz of the time-domain Wiener filter (solid line) and the RASTA center taps (dashed line).]

4. EXPERIMENTS

The RASTA processing using the filters described above has been tested on several recordings of actual noisy cellular communications (out of the training set). We have informally observed that the processing brings a noticeable improvement in the subjective quality of the recordings. Noise seems to be less disturbing while the quality of speech is not impaired. Further, we have also observed that the processing does not seem to impair the quality of clean speech recordings. A visible reduction of the noise after processing is apparent from the spectrograms shown in Fig. 5.

[Figure 5: Spectrogram (0–4 kHz, 6.4 s) showing noisy speech followed by the same noisy segment after processing.]

5. WORK IN PROGRESS

5.1. Data Collection

A database that includes the effects of actual ambient noise on cellular speech communications is being collected. A training set of 330 sentences taken from the TIMIT database (complete dr8 set) is played through a portable PC over a public cellular line from different locations. A test set generated with 123 randomly chosen TIMIT sentences (dr1-dr7 only) is recorded under the same conditions for later evaluation of our algorithms.

5.2. Adjacent Channels

Adjacent channel information has been used to train each channel, resulting in a set of Multi-Input-Single-Output (MISO) filters. Initial results with 6 adjacent channels (3 at each side of the frequency of interest) show an improvement in the reduction of perceived noise as well as a reduction of the residual error at each frequency bin. However, the additional improvement obtained with this structure is noticeable only within the training set (the original 2-minute source), while the processing generalizes poorly over different speakers and noise types. The effects of training on large data sets and of different architectures need to be investigated before any conclusions can be drawn.

5.3. Cellular to Wideband Speech Recovery

The availability of the original 16 kHz sampled speech data has raised the question as to the possibility of recovering a good quality wideband signal from a degraded, 8 kHz sampled, cellular speech recording. Two steps have been taken towards the investigation of this problem.

5.3.1. Cellular Channel Equalization

The original 16 kHz sampled speech is downsampled to 8 kHz and used as the desired response in our training scheme. The input data correspond to the same speech segments recorded over the cellular channel. Additional white Gaussian noise is added to the input in order to avoid excessive gains in the band edges where cellular speech is not present. Several tests done on clean cellular communications have resulted in a noticeable improvement; however, formal comparisons with other techniques have yet to be done.

5.3.2. Wideband Speech Reconstruction

An approximate wideband-from-narrowband speech map has been produced by extending the adjacent channel concept.

[Figure 6: a) Spectrogram of the original speech, b) spectrogram of the downsampled signal, and c) recovered spectrogram; each panel spans 0–8 kHz over 1.8 s.]
Now the 129 trajectories derived from the 8 kHz sampled speech analysis are fed to MISO filters whose outputs correspond to frequencies from 4 kHz to 8 kHz (i.e., bins 130 to 257), and the signal reconstruction is done at twice the original rate. Training was performed with 9 TIMIT sentences (same speaker) with a time window of 40 ms (M = 2 in (1)). In Fig. 6, we show an out-of-training-set original wideband speech spectrogram, the downsampled version of it, and the recovered spectrogram. Within a given talker the high-frequency recovery appears to be viable. However, for time signal reconstruction an estimate of the phase of the recovered high frequencies needs to be performed. This is discussed in the following section.

5.4. Phase Reconstruction

Modification of the short-term magnitude spectrum of a signal and synthesis with the original phase do not, in general, yield a time signal with the desired magnitude spectrum (see e.g., Allen (1977)). To avoid distortion in the synthesized signal, especially when the power spectrum has been severely modified (see subsection 5.6), we are investigating an iterative algorithm (see e.g., Griffin and Lim (1984)) in which the mean squared error between the desired spectrum and the spectrum produced by the synthesized signal is minimized. In noise reduction applications, we have used the noisy signal's phase to perform the initial step in the reconstruction. In the case of wideband speech mapping, a linear map from the available low-frequency phase components to the higher frequencies is taken as a first approximation.

5.5. Postprocessing with Envelope Enhancement

First proposed for enhancement of ADPCM speech coding (see Ramamoorthy et al. (1988)), a prefiltering technique for spectral envelope enhancement has been applied in our system. Spectral-domain filtering is done by designing a filter at each 8 ms step with an untilted and smoothed 8th-order all-pole model spectrum of the particular frame, giving the effect of enhancing frequencies around formants and suppression elsewhere. Best results have been achieved by prefiltering the compressed power spectrum prior to applying the RASTA filters to its trajectories.

5.6. Non-linear Filters

Non-linear RASTA filters have also been implemented with 3-layer artificial neural networks. Preliminary results on small training sets show a more aggressive filtering and greater noise reduction. However, musical-like residual noise as well as greater degradation of clean speech is also observed. The iterative synthesis method described above has been used with some success to alleviate the phase distortion introduced by the large amount of spectral modification. Current efforts are focused on reducing the adverse effects introduced by the networks.

6. CONCLUSIONS

We have presented a new technique for enhancement of noisy speech in cellular communications, which utilizes Wiener-like RASTA filters on the cubic-root compressed short-term spectrum of speech. While no formal subjective perceptual tests were carried out, our initial experience indicates a possible advantage of the proposed technique over conventional speech enhancement techniques. Future work will investigate generalization of the technique for additional noise sources and larger training sets. In addition, more formal perceptual evaluations will be undertaken. Related work outlined in Section 5 will also be pursued.

7. ACKNOWLEDGEMENTS

We would like to thank Mahesh K. Kumashikar for the neural networks training, and CONACYT (Mexico) for partial support of the third author.

8. REFERENCES

[1] J.B. Allen: Short term spectral analysis, synthesis, and modification by discrete Fourier transform, IEEE ASSP-25, pp. 235-238, Jun. 1977.
[2] S.F. Boll: Suppression of acoustic noise in speech using spectral subtraction, IEEE ASSP-27, pp. 113-120, Apr. 1979.
[3] V. Ramamoorthy, N.S. Jayant, R.V. Cox and M.M. Sondhi: Enhancement of ADPCM speech coding with backward-adaptive algorithms for postfiltering and noise feedback, IEEE J. Selec. Areas in Comm., Vol. 6, No. 2, pp. 364-382, Feb. 1988.
[4] H. Hermansky: Perceptual linear predictive (PLP) analysis of speech, J. Acoust. Soc. Am. 87, pp. 1738-1752, Apr. 1990.
[5] H.G. Hirsch, P. Meyer, and H. Ruehl: Improved speech recognition using high-pass filtering of subband envelopes, Proc. EUROSPEECH '91, pp. 413-416, Genova, 1991.
[6] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn: Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP), Proc. EUROSPEECH '91, pp. 1367-1370, Genova, 1991.
[7] H. Hermansky, N. Morgan, and H.G. Hirsch: Recognition of speech in additive and convolutional noise based on RASTA spectral processing, Proc. ICASSP-93, pp. II-83–II-86, 1993.
[8] H. Hermansky, E. Wan, and C. Avendano: Noise suppression in cellular communications, Proc. IVTTA 94, pp. 85-88, Kyoto, 1994.