Novel Time Domain Multi-class SVMs for Landmark Detection
Rahul Chitturi, Mark Hasegawa-Johnson*
University of Texas at Dallas, USA
rahul.ch@crss.utdallas.edu
* University of Illinois at Urbana Champaign, USA
jhasegaw@uiuc.edu
Abstract
The training of precise speech recognition models depends
on accurate segmentation of the phonemes in a training
corpus. Segmentation is typically performed using HMMs,
but recent speech recognition work suggests that the
transient acoustic features characteristic of manner-class
phoneme boundaries (landmarks) may be more precisely
localized using acoustic classifiers specifically designed for
the task of landmark detection. This paper presents an empirical exploration of new features suited to landmark detection, together with the application of multi-class SVMs capable of improving the time alignment of the phoneme boundaries proposed by binary SVMs and an HMM-based speech recognizer. On a standard benchmark data set (a database of Telugu, an official Indian language spoken by 75 million people), we achieve new state-of-the-art performance, reducing the RMS phone boundary alignment error from 32 ms to 22 ms.
Index Terms: landmark, multi-class SVM, time-domain flatness measure, segmentation
1 Introduction
Accurately segmented training data improves the
precision of both speech recognition and speech synthesis
models. Phonetic segmentation has also been proposed as
part of a lattice rescoring algorithm for multipass speech
recognition [1]. Since manual segmentation of speech is time-consuming and impractical in most conditions, various approaches to automatic speech segmentation have been proposed [2, 3], most typically involving forced alignment of an HMM-based sentence model [4, 5] observing standard speech recognition features such as energy, LPCC, MFCC, and PLP.
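For concreteness, the sketch below shows the kind of frame-based observation such an HMM forced aligner consumes. It is illustrative only: the file name is hypothetical, librosa is an assumed tool, and the 13 coefficients, 25 ms window, and 10 ms hop are assumed typical values rather than settings taken from the cited systems.

```python
# A minimal sketch (assumed values, not the cited systems' settings):
# 13 MFCCs extracted once every 10 ms, the kind of frame-based
# observation an HMM forced aligner typically consumes.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file name
mfcc = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=13,                    # number of cepstral coefficients (assumed)
    n_fft=int(0.025 * sr),        # 25 ms analysis window (assumed)
    hop_length=int(0.010 * sr),   # 10 ms hop: one observation per 10 ms
)
print(mfcc.shape)                 # (13, number_of_frames)
```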
In 1998, Niyogi proposed the use of support vector
machines (SVMs) to detect stop consonant bursts [6].
Instead of observing MFCCs once per 10ms, Niyogi's SVMs
observed a much smaller feature vector (energy, lowpass
energy, and spectral flatness) once per millisecond. Niyogi's
SVMs significantly outperformed an HMM-based speech
recognizer in this task [7]. Other authors have since used
Niyogi's method to develop complete first-pass [8] and
second-pass [1] speech recognition engines. Since we propose time-domain landmark detection techniques, we chose our features along lines similar to Niyogi's. We also introduce a new feature, which we call the Time Domain Flatness Measure, that gave the best performance for this problem.
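As an illustration of this style of front end, the following sketch computes a Niyogi-style feature vector (frame energy, low-pass energy, and spectral flatness) once per millisecond. The 5 ms frame length, fourth-order Butterworth filter, and 400 Hz cutoff are assumed values chosen for the example rather than the settings used in our experiments, and the Time Domain Flatness Measure itself is not shown.

```python
import numpy as np
from scipy.signal import butter, lfilter

def frame_features(x, fs=16000, frame_ms=5, hop_ms=1, lp_cutoff_hz=400):
    """Niyogi-style features -- log energy, low-pass log energy, and
    spectral flatness -- computed once per hop_ms milliseconds over
    short frames of frame_ms milliseconds (all parameter values are
    assumptions for illustration)."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    # Low-pass filtered copy of the signal (cutoff frequency is assumed).
    b, a = butter(4, lp_cutoff_hz / (fs / 2), btype="low")
    x_lp = lfilter(b, a, x)
    feats = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame]
        seg_lp = x_lp[start:start + frame]
        energy = np.log(np.sum(seg ** 2) + 1e-10)
        lp_energy = np.log(np.sum(seg_lp ** 2) + 1e-10)
        # Spectral flatness: geometric mean over arithmetic mean of the power spectrum.
        psd = np.abs(np.fft.rfft(seg)) ** 2 + 1e-10
        flatness = np.exp(np.mean(np.log(psd))) / np.mean(psd)
        feats.append([energy, lp_energy, flatness])
    return np.array(feats)
```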
Multi-class SVMs are usually implemented by
combining several two-class SVMs. The one-versus-all method with a winner-takes-all strategy and the one-versus-one method implemented by max-wins voting are commonly used for this purpose. In this paper we give empirical evidence that these methods are inferior to another one-versus-one method. The binary SVMs were implemented as standard one-versus-one SVMs and so are not described in detail here due to space constraints.
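For reference, a minimal sketch of the standard one-versus-one construction with max-wins voting is given below. It assumes scikit-learn's SVC as the underlying binary classifier with default kernel parameters, and is not the exact configuration used in our experiments.

```python
from itertools import combinations

import numpy as np
from sklearn.svm import SVC

class OneVsOneMaxWins:
    """One-versus-one multi-class SVM with max-wins voting: train one
    binary SVM per class pair; at test time each pair votes for a class
    and the class with the most votes wins."""

    def __init__(self, **svc_kwargs):
        self.svc_kwargs = svc_kwargs
        self.models = {}

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        for a, b in combinations(self.classes_, 2):
            mask = np.isin(y, [a, b])
            clf = SVC(**self.svc_kwargs)      # assumed binary classifier
            clf.fit(X[mask], y[mask])
            self.models[(a, b)] = clf
        return self

    def predict(self, X):
        votes = np.zeros((len(X), len(self.classes_)), dtype=int)
        index = {c: i for i, c in enumerate(self.classes_)}
        for clf in self.models.values():
            for i, winner in enumerate(clf.predict(X)):
                votes[i, index[winner]] += 1
        # Ties are broken in favour of the lowest class index.
        return self.classes_[np.argmax(votes, axis=1)]
```

A classifier built this way is used like any other: fit on labelled frame-level feature vectors, then predict the class of each new frame.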
2 Related work
Multi-class SVMs were introduced to speech processing in 2002 by Salomon et al. for phoneme classification [9]. In 2005, Yin An-ron et al. used multi-class SVMs as an extension of binary classifiers using ECOC (Hadamard Error-Correcting Output Codes) [10]. Jorge Bernal-Chaves et al. used multi-class SVMs for isolated word recognition [11]. In 2006, Andrew Hatch et al. used the same technique for speaker recognition [12].
In this article, we extend the use of multi-class SVMs to phoneme segmentation, with an HMM segmentation system as the baseline. We also explore new features suited to landmark detection.
3 Database Description
All our experiments were conducted on a Telugu database (Telugu is an official Indian language spoken by 75 million people), which contains over 600 sentences collected from a native speaker and exhibits substantial coarticulatory effects. The database was recorded at a 16 kHz sampling rate on a single channel using a headset. It has well-defined transcriptions corresponding to the speech data, and the phone-level alignment was done manually. For exact evaluation of our algorithms we used the manual phone alignments, but we do not assume that these are necessary.
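Since our algorithms are evaluated against these manual alignments, the metric reported in this paper (RMS phone boundary alignment error) can be sketched as below. The function name is ours, and it assumes the hypothesized and reference boundary lists are already in one-to-one correspondence.

```python
import numpy as np

def rms_boundary_error(hypothesized_ms, reference_ms):
    """RMS phone boundary alignment error in milliseconds, assuming the
    hypothesized and reference boundary lists are in one-to-one
    correspondence (same phones, same order)."""
    hyp = np.asarray(hypothesized_ms, dtype=float)
    ref = np.asarray(reference_ms, dtype=float)
    return float(np.sqrt(np.mean((hyp - ref) ** 2)))
```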
4 Baseline System
The goal of the algorithms proposed in this paper is to refine
the phoneme boundary times proposed by the HMM, in
order to reduce their RMS error with reference to a “ground
truth” manual phonetic transcription. Though HMMs are known to work better with triphone models, since we are concerned only with the time boundaries of a particular phone, a monophone-based HMM segmentation system is