After an acoustic speech signal is converted to an electrical signal by a microphone, it may be desirable to analyze the electrical signal to estimate some time-varying parameters which provide information about a model of the speech production mechanism. Speech analysis is the process of estimating such parameters. Similarly, given some parametric model of speech production and a sequence of parameters for that model, speech synthesis is the process of creating an electrical signal which approximates speech. While analysis and synthesis techniques may be done either on the continuous signal or on a sampled version of the signal, most modern analysis and synthesis methods are based on digital signal processing.

A typical speech production model is shown in Fig. 15.6. In this model the output of the excitation function is scaled by the gain parameter and then filtered to produce speech. All of these functions are time-varying.

FIGURE 15.6 A general speech production model.

FIGURE 15.7 Waveform of a spoken phoneme /i/ as in beet.

For many models, the parameters are varied at a periodic rate, typically 50 to 100 times per second. Most speech information is contained in the portion of the signal below about 4 kHz.

The excitation is usually modeled as either a mixture or a choice of random noise and periodic waveform. For human speech, voiced excitation occurs when the vocal folds in the larynx vibrate; unvoiced excitation occurs at constrictions in the vocal tract which create turbulent air flow [Flanagan, 1965]. The relative mix of these two types of excitation is termed “voicing.” In addition, the periodic excitation is characterized by a fundamental frequency, termed pitch or F0. The excitation is scaled by a factor designed to produce the proper amplitude or level of the speech signal. The scaled excitation function is then filtered to produce the proper spectral characteristics. While the filter may be nonlinear, it is usually modeled as a linear function.
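The chain of excitation, gain, and filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `synthesize_frame`, the 8-kHz sample rate, the use of `voicing` as a simple mixing weight between 0 and 1, and the single-coefficient all-pole filter are all hypothetical choices, not part of the model of Fig. 15.6.

```python
import random

def synthesize_frame(num_samples, sample_rate, f0, voicing, gain, a):
    """One frame of the excitation -> gain -> filter chain with fixed parameters.

    voicing: assumed mixing weight, 1.0 = purely periodic, 0.0 = purely random.
    a:       assumed all-pole filter coefficients a(1)..a(p).
    """
    period = max(1, round(sample_rate / f0))   # pitch period in samples
    rng = random.Random(0)                     # fixed seed keeps the sketch repeatable
    out = []
    for n in range(num_samples):
        pulse = 1.0 if n % period == 0 else 0.0   # periodic (voiced) component
        noise = rng.uniform(-1.0, 1.0)            # random (unvoiced) component
        excitation = voicing * pulse + (1.0 - voicing) * noise
        # Linear filter: y(n) = gain * e(n) - sum_k a(k) * y(n - k)
        y = gain * excitation
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                y -= ak * out[n - k]
        out.append(y)
    return out

# A fully voiced 20-ms frame at an assumed 8-kHz sample rate.
frame = synthesize_frame(160, 8000, 141.0, voicing=1.0, gain=0.5, a=[-0.9])
```

In a complete synthesizer the parameters would be updated from frame to frame, at the 50 to 100 times per second mentioned earlier, rather than held fixed.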

Analysis of Excitation

In a simplified form, the excitation function may be considered to be purely periodic, for voiced speech, or purely random, for unvoiced. These two states correspond to voiced phonetic classes such as vowels and nasals and unvoiced sounds such as unvoiced fricatives. This binary voicing model is an oversimplification for sounds such as voiced fricatives, which consist of a mixture of periodic and random components. Figure 15.7 is an example of a time waveform of a spoken /i/ phoneme, which is well modeled by only periodic excitation.

Both time domain and frequency domain analysis techniques have been used to estimate the degree of voicing for a short segment or frame of speech. One time domain feature, termed the zero crossing rate, is the number of times the signal changes sign in a short interval. As shown in Fig. 15.7, the zero crossing rate for voiced sounds is relatively low. Since unvoiced speech typically has a larger proportion of high-frequency energy than voiced speech, the ratio of high-frequency to low-frequency energy is a frequency domain technique that provides information on voicing.
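The zero crossing measure is simple enough to show directly. The sketch below counts sign changes in one frame; the function name, the 8-kHz sample rate, and the two synthetic test tones are assumptions made for illustration only.

```python
import math

def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples of the frame."""
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0.0) != (cur >= 0.0):
            count += 1
    return count

# Two synthetic 64-ms frames at an assumed 8-kHz sample rate:
# a low-frequency tone stands in for voiced speech, a high-frequency
# tone for the high-frequency energy typical of unvoiced speech.
sample_rate = 8000
low = [math.sin(2 * math.pi * 141 * (i + 0.5) / sample_rate) for i in range(512)]
high = [math.sin(2 * math.pi * 3000 * (i + 0.5) / sample_rate) for i in range(512)]
low_count = zero_crossing_count(low)
high_count = zero_crossing_count(high)
```

Dividing the count by the frame duration gives a rate; a threshold on that rate is one possible, if crude, voicing decision.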

Another measure used to estimate the degree of voicing is the autocorrelation function, which is defined for a sampled speech segment, S, at a lag of τ samples as

$$\mathrm{ACF}(\tau) = \sum_{n=\tau}^{N-1} s(n)\, s(n-\tau)$$

where s(n) is the value of the nth sample within the segment of length N. Since the autocorrelation function of a periodic function is itself periodic, voicing can be estimated from the degree of periodicity of the autocorrelation function. Figure 15.8 is a graph of the nonnegative terms of the autocorrelation function for a 64-ms frame of the waveform of Fig. 15.7. Except for the decrease in amplitude with increasing lag, which results from the rectangular window function which delimits the segment, the autocorrelation function is seen to be quite periodic for this voiced utterance.

FIGURE 15.8 Autocorrelation function of one frame of /i/.

If an analysis of the voicing of the speech signal indicates a voiced or periodic component is present, another step in the analysis process may be to estimate the frequency (or period) of the voiced component. There are a number of ways in which this may be done. One is to measure the time lapse between peaks in the time domain signal. For example, in Fig. 15.7 the major peaks are separated by about 0.0071 s, for a fundamental frequency of about 141 Hz. Note that it would be quite possible to err in the estimate of fundamental frequency by mistaking the smaller peaks that occur between the major peaks for the major peaks. These smaller peaks are produced by resonance in the vocal tract which, in this example, happens to be at about twice the excitation frequency. This type of error would result in an estimate of pitch approximately twice the correct frequency.
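The arithmetic behind these figures is worth writing out. With T0 denoting the spacing between major peaks (a symbol introduced here for convenience), the correct estimate and the doubled estimate that results from counting the intermediate peaks are approximately:

```latex
F_0 = \frac{1}{T_0} \approx \frac{1}{0.0071\ \mathrm{s}} \approx 141\ \mathrm{Hz},
\qquad
\frac{1}{T_0/2} \approx \frac{1}{0.00355\ \mathrm{s}} \approx 282\ \mathrm{Hz}.
```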

The distance between major peaks of the autocorrelation function is a closely related feature that is frequently used to estimate the pitch period. In Fig. 15.8, the distance between the major peaks in the autocorrelation function is about 0.0071 s. Estimates of pitch from the autocorrelation function are also susceptible to mistaking the first vocal tract resonance for the glottal excitation frequency.
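A minimal sketch of pitch estimation from the autocorrelation function follows. It assumes a rectangular window, an 8-kHz sample rate, and a search range of 60 to 400 Hz; the function names, the search limits, and the normalized peak height used as a rough voicing indicator are illustrative assumptions rather than a prescribed method.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation of one rectangular-windowed frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]

def pitch_from_autocorrelation(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return (pitch in Hz, normalized peak height) for one frame."""
    lag_min = int(sample_rate / f_max)   # shortest period considered
    lag_max = int(sample_rate / f_min)   # longest period considered
    acf = autocorrelation(frame, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: acf[lag])
    # Peak height relative to the zero-lag energy: near 1 for strongly periodic frames.
    voicing = acf[best_lag] / acf[0] if acf[0] > 0.0 else 0.0
    return sample_rate / best_lag, voicing

# Synthetic voiced-like frame: a fundamental with a period of 57 samples
# (about 140 Hz at 8 kHz) plus a weaker second harmonic.
frame = [math.sin(2 * math.pi * i / 57) + 0.5 * math.sin(4 * math.pi * i / 57)
         for i in range(512)]
f0, voicing = pitch_from_autocorrelation(frame, 8000)
```

Restricting the lag search range is one simple guard against the resonance-related errors described above, although it does not eliminate them.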

The absolute magnitude difference function (AMDF), defined for a lag of τ samples as

$$\mathrm{AMDF}(\tau) = \sum_{n=\tau}^{N-1} \left| s(n) - s(n-\tau) \right|$$

is another function which is often used in estimating the pitch of voiced speech. An example of the AMDF is shown in Fig. 15.9 for the same 64-ms frame of the /i/ phoneme. Whereas the peaks of the autocorrelation function mark the pitch period, it is the minima of the AMDF that are used as indicators of the pitch period. The AMDF has been shown to be a good pitch period indicator [Ross et al., 1974] and does not require multiplications.
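The AMDF can be sketched in the same style. The function names and the lag limits below are assumptions; the per-lag normalization in the search step is an added design choice (it does use a division) intended to keep long lags, which sum fewer terms, from being favored automatically.

```python
import math

def amdf(frame, max_lag):
    """Sum of absolute differences between the frame and its delayed copy."""
    n = len(frame)
    return [sum(abs(frame[i] - frame[i - lag]) for i in range(lag, n))
            for lag in range(max_lag + 1)]

def pitch_period_from_amdf(frame, lag_min, lag_max):
    """Lag (in samples) of the deepest AMDF minimum within [lag_min, lag_max]."""
    values = amdf(frame, lag_max)
    n = len(frame)
    # Normalize by the number of summed terms (assumes lag_max < len(frame)).
    return min(range(lag_min, lag_max + 1), key=lambda lag: values[lag] / (n - lag))

# Exactly periodic synthetic frame with a period of 57 samples.
pattern = [math.sin(2 * math.pi * j / 57) for j in range(57)]
frame = [pattern[i % 57] for i in range(512)]
period = pitch_period_from_amdf(frame, 20, 133)
```

Only subtractions and absolute values are needed to form the AMDF itself, which is the computational advantage noted above.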

Fourier Analysis

One of the more common processes for estimating the spectrum of a segment of speech is the Fourier transform [Oppenheim and Schafer, 1975]. The Fourier transform of a sequence is mathematically defined as

$$S(e^{j\omega}) = \sum_{n=-\infty}^{\infty} s(n)\, e^{-j\omega n}$$

where s(n) represents the terms of the sequence. The short-time Fourier transform of a sequence is a time-dependent function, defined as

$$S_m(e^{j\omega}) = \sum_{n=-\infty}^{\infty} s(n)\, w(m-n)\, e^{-j\omega n}$$

FIGURE 15.9 Absolute magnitude difference function of one frame of /i/.

where the window function w(n) is usually zero except for some finite range, and the variable m is used to select the section of the sequence for analysis. The discrete Fourier transform (DFT) is obtained by uniformly sampling the short-time Fourier transform in the frequency dimension. Thus an N-point DFT is computed using Eq. (15.14),

$$S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi nk/N}, \qquad 0 \le k \le N-1 \tag{15.14}$$

where the set of N samples, s(n), may have first been multiplied by a window function. An example of the magnitude of a 512-point DFT of the waveform of the /i/ from Fig. 15.7 is shown in Fig. 15.10. Note that for this figure, the 512 points in the sequence have been multiplied by a Hamming window defined by

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

with w(n) = 0 otherwise.

FIGURE 15.10 Magnitude of 512-point FFT of Hamming windowed /i/.

Since the spectral characteristics of speech may change dramatically in a few milliseconds, the length, type, and location of the window function are important considerations. If the window is too long, changing spectral characteristics may cause a blurred result; if the window is too short, spectral inaccuracies result. A Hamming window of 16 to 32 ms duration is commonly used for speech analysis.
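The windowing and transform steps can be illustrated with a direct (non-fast) DFT in plain Python. The sketch assumes the common 0.54/0.46 form of the Hamming window, an 8-kHz sample rate, and a 32-ms (256-sample) frame; the function names are hypothetical, and a practical implementation would use an FFT routine instead of the O(N²) sum shown here.

```python
import cmath
import math

def hamming(n_points):
    """Hamming window of n_points samples (n_points > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def dft_magnitude(frame):
    """Magnitude of the DFT of a Hamming-windowed frame, bins 0..N/2."""
    n_points = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n_points))]
    mags = []
    for k in range(n_points // 2 + 1):
        acc = sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                  for n in range(n_points))
        mags.append(abs(acc))
    return mags

# A 32-ms frame of a 1-kHz tone at an assumed 8-kHz sample rate.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
mags = dft_magnitude(frame)
peak_bin = mags.index(max(mags))
peak_hz = peak_bin * sample_rate / len(frame)
```

Bin k corresponds to the frequency k times the sample rate divided by N, so a longer frame gives finer frequency spacing at the cost of the time blurring described above.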

Several characteristics of a speech utterance may be determined by examination of the DFT magnitude. In Fig. 15.10, the DFT of a voiced utterance contains a series of sharp peaks in the frequency domain. These peaks, caused by the periodic sampling action of the glottal excitation, are separated by the fundamental frequency, which is about 141 Hz in this example. In addition, broader peaks can be seen, for example at about 300 Hz and at about 2300 Hz. These broad peaks, called formants, result from resonances in the vocal tract.

Linear Predictive Analysis

Given a sampled (discrete-time) signal s(n), a powerful and general parametric model for time series analysis is

$$s(n) = -\sum_{k=1}^{p} a(k)\, s(n-k) + G \sum_{l=0}^{q} b(l)\, u(n-l)$$

where s(n) is the output and u(n) is the input (perhaps unknown). The model parameters are a(k) for k = 1, ..., p, b(l) for l = 1, ..., q, and G. b(0) is assumed to be unity. This model, described as an autoregressive moving average (ARMA) or pole-zero model, forms the foundation for the analysis method termed linear prediction. An autoregressive (AR) or all-pole model, for which all of the “b” coefficients except b(0) are zero, is frequently used for speech analysis [Markel and Gray, 1976].

In the standard AR formulation of linear prediction, the model parameters are selected to minimize the mean-squared error between the model and the speech data. In one of the variants of linear prediction, the autocorrelation method, the minimization is carried out for a windowed segment of data. In the autocorrelation method, minimizing the mean-square error of the time domain samples is equivalent to minimizing the integrated ratio of the signal spectrum to the spectrum of the all-pole model. Thus, linear predictive analysis is a good method for spectral analysis whenever the signal is produced by an all-pole system. Most speech sounds fit this model well.
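One common way to carry out the autocorrelation method is the Levinson-Durbin recursion, which the text above does not name; it is used here as an assumed solver. The sketch follows the sign convention s(n) = -Σ a(k) s(n-k) + G u(n), so the prediction error filter is A(z) = 1 + Σ a(k) z^(-k). The function names and the synthetic test signal are illustrative.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation of one (already windowed) frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]

def lpc_autocorrelation_method(frame, order):
    """Return (a, gain): a[0] = 1 and a[1..order] are the AR coefficients."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]                      # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:              # degenerate frame (e.g., all zeros)
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err              # reflection coefficient for this stage
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= (1.0 - k * k)
    return a, math.sqrt(max(err, 0.0))

# Impulse response of a known one-pole system s(n) = 0.9 s(n-1) + u(n),
# for which the model should recover a(1) close to -0.9 and a(2) close to 0.
signal = [0.9 ** n for n in range(200)]
coeffs, gain = lpc_autocorrelation_method(signal, 2)
```

Taking the gain as the square root of the final error energy is one conventional choice for matching the model energy to the signal energy.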

One key consideration for linear predictive analysis is the order of the model, p. For speech, if the order is too small, the formant structure is not well represented. If the order is too large, pitch pulses as well as formants begin to be represented. Tenth- or twelfth-order analysis is typical for speech. Figures 15.11 and 15.12 provide examples of the spectrum produced by eighth-order and sixteenth-order linear predictive analysis of the /i/ waveform of Fig. 15.7. Figure 15.11 shows there to be three formants at frequencies of about 300, 2300, and 3200 Hz, which are typical for an /i/.
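Given a set of all-pole coefficients, the model spectrum and its peaks can be evaluated directly. The sketch below assumes the convention A(z) = 1 + Σ a(k) z^(-k), an 8-kHz sample rate, and a simple local-maximum search for candidate formants; the function names and the two-pole test resonator are hypothetical.

```python
import cmath
import math

def all_pole_magnitude(a, gain, omega):
    """|G / A(e^{j omega})| for coefficients a(1)..a(p) given in the list a."""
    denom = 1.0 + sum(ak * cmath.exp(-1j * omega * k)
                      for k, ak in enumerate(a, start=1))
    return gain / abs(denom)

def spectrum_peaks(a, gain, sample_rate, n_points=256):
    """Frequencies (Hz) of local maxima of the model spectrum on a uniform grid."""
    mags = [all_pole_magnitude(a, gain, math.pi * i / n_points)
            for i in range(n_points + 1)]
    peaks = [i for i in range(1, n_points)
             if mags[i - 1] < mags[i] > mags[i + 1]]
    return [i * sample_rate / (2 * n_points) for i in peaks]

# Two-pole resonator with poles at radius 0.95 and angle corresponding
# to 1 kHz at an assumed 8-kHz sample rate.
radius, theta = 0.95, 2 * math.pi * 1000 / 8000
resonator = [-2 * radius * math.cos(theta), radius * radius]
peaks = spectrum_peaks(resonator, 1.0, 8000)
```

Applied to coefficients estimated from speech, each sufficiently sharp peak is a formant candidate; a higher model order produces more peaks, which is the trade-off noted above.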

Homomorphic (Cepstral) Analysis

For the speech model of Fig. 15.6, the excitation and filter impulse response are convolved to produce the speech. One of the problems of speech analysis is to separate or deconvolve the speech into these two components. One such technique is called homomorphic filtering [Oppenheim and Schafer, 1968]. The characteristic system for homomorphic deconvolution converts a convolution operation to an addition operation. The output of such a characteristic system is called the complex cepstrum. The complex cepstrum is defined as the inverse Fourier transform of the complex logarithm of the Fourier transform of the input. If the input sequence is minimum phase (i.e., the z-transform of the input sequence has no poles or zeros outside the unit circle), the sequence can be represented by the real portion of the transforms. Thus, the real cepstrum can be computed by calculating the inverse Fourier transform of the log-spectrum of the input.
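The real cepstrum computation can be sketched with a direct DFT. The function names are assumptions, the O(N²) transform stands in for an FFT, and the code assumes that no DFT bin is exactly zero (the logarithm would otherwise fail). The test input, a unit pulse followed by a half-amplitude echo eight samples later, is a hypothetical stand-in for periodic excitation.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct DFT (or inverse DFT, scaled by 1/N) of a real or complex list."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(frame):
    """Inverse DFT of the log magnitude of the DFT of the frame."""
    log_mag = [math.log(abs(v)) for v in dft(frame)]   # assumes no zero bins
    return [v.real for v in dft(log_mag, inverse=True)]

# Unit pulse plus a half-amplitude echo 8 samples later.
frame = [0.0] * 64
frame[0] = 1.0
frame[8] = 0.5
cepstrum = real_cepstrum(frame)
```

For this input the cepstrum is nonzero only at multiples of the 8-sample echo delay, the same structure that places pitch pulses at multiples of the pitch period in the cepstrum of voiced speech.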

FIGURE 15.11 Eighth-order linear predictive analysis of an “i”.

FIGURE 15.12 Sixteenth-order linear predictive analysis of an “i”.

Figure 15.13 shows an example of the cepstrum for the voiced /i/ utterance from Fig. 15.7. The cepstrum of such a voiced utterance is characterized by relatively large values in the first one or two milliseconds as well as by pulses of decaying amplitudes at multiples of the pitch period. The first two of these pulses can clearly be seen in Fig. 15.13 at time lags of 7.1 and 14.2 ms. The location and amplitudes of these pulses may be used to estimate pitch and voicing [Rabiner and Schafer, 1978].

In addition to pitch and voicing estimation, a smooth log magnitude function may be obtained by windowing or “liftering” the cepstrum to eliminate the terms which contain the pitch information. Figure 15.14 is one such smoothed spectrum. It was obtained from the DFT of the cepstrum of Fig. 15.13 after first setting all terms of the cepstrum to zero except for the first 16.
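The liftering step can be illustrated as follows. This is a sketch under assumptions: the function names are hypothetical, a direct DFT stands in for an FFT, and the mirror-image terms at the end of the DFT-based cepstrum are kept along with the first `keep` terms so that the smoothed log spectrum stays real and symmetric. The synthetic log spectrum, a constant level with a rapid ripple, stands in for a formant envelope with pitch harmonics.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct DFT (or inverse DFT, scaled by 1/N) of a real or complex list."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def liftered_log_spectrum(log_mag, keep):
    """Smooth a log magnitude spectrum by zeroing high-order cepstral terms."""
    n = len(log_mag)
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    # Keep terms 0..keep-1 and their mirror images n-keep+1..n-1 (an assumption).
    liftered = [c if (i < keep or i > n - keep) else 0.0
                for i, c in enumerate(cepstrum)]
    return [v.real for v in dft(liftered)]

# Constant level of 1.0 with a rapid ripple (20 cycles across 64 bins).
log_mag = [1.0 + 0.5 * math.cos(2 * math.pi * 20 * k / 64) for k in range(64)]
smoothed = liftered_log_spectrum(log_mag, keep=16)
```

The rapid ripple maps to cepstral terms beyond the retained range and is removed, while slowly varying structure, which maps to the low-order terms, passes through unchanged.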

FIGURE 15.13 Real cepstrum of /i/.

FIGURE 15.14 Smoothed spectrum of /i/ from 16 points of cepstrum.

Speech Synthesis

Speech synthesis is the creation of speech-like waveforms from textual words or symbols. In general, the speech synthesis process may be divided into three levels of processing [Klatt, 1982]. The first level transforms the text into a series of acoustic phonetic symbols, the second transforms those symbols to


