Book Volume 1
Page: i-i (1)
Author: Alex Acero
Page: ii-ii (1)
Author: Javier Ramirez and Juan Manuel Gorriz
Page: iii-vi (4)
Author: Javier Ramirez and Juan Manuel Gorriz
Integration of Statistical-Model-Based Voice Activity Detection and Noise Suppression for Noise Robust Speech Recognition
Page: 1-12 (12)
Author: Masakiyo Fujimoto
This chapter addresses robust front-end processing for automatic speech recognition in noisy environments. To recognize corrupted speech accurately, methods that are robust against various types of interference are required. Noise suppression is usually employed in the front-end processing of speech recognition in the presence of noise. Voice activity detection (VAD) is also used in front-end processing to eliminate redundant non-speech periods. VAD and noise suppression are typically combined in series. However, they should not be treated as separate techniques, because the output information of each is beneficial to the other. This chapter therefore introduces integrated front-end processing of VAD and noise suppression, in which the two components share each other's input-output information.
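As a minimal illustrative sketch (not the chapter's algorithm), the conventional series arrangement that the chapter improves upon can be shown as an energy-ratio VAD whose non-speech decisions feed a running noise estimate, which in turn drives a simple suppression gain. All thresholds and smoothing constants below are illustrative assumptions.

```python
import numpy as np

def vad_and_suppress(frames, alpha=0.98, threshold=2.0):
    """Toy series combination of VAD and noise suppression.

    frames: (n_frames, n_bins) magnitude-squared spectra.
    Non-speech frames update the noise estimate, which in turn
    drives a Wiener-like suppression gain (illustrative only)."""
    noise = frames[0].copy()            # bootstrap noise estimate
    decisions, enhanced = [], []
    for frame in frames:
        snr = np.mean(frame / (noise + 1e-12))
        is_speech = snr > threshold     # crude energy-ratio VAD
        if not is_speech:               # VAD output feeds the noise tracker
            noise = alpha * noise + (1 - alpha) * frame
        gain = np.maximum(1.0 - noise / (frame + 1e-12), 0.05)
        enhanced.append(gain * frame)
        decisions.append(is_speech)
    return np.array(decisions), np.array(enhanced)
```

The point of the chapter is that this one-way information flow (VAD to noise tracker) can be made bidirectional, with the suppressor's estimates also refining the VAD decision.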
Page: 13-29 (17)
Author: Rasool Tahmasbi
GARCH (Generalized Autoregressive Conditional Heteroscedasticity) models are statistical methods used especially for economic time series. There is a consensus that speech signals exhibit variance that changes over time, and GARCH models are a popular choice for modeling these changing variances. In this chapter, we propose three methods for VAD based on GARCH models. In the first method, heteroscedasticity is modeled by a GARCH process, and hard detection results from comparing a Multiple Observation Likelihood Ratio Test (MOLRT) with a threshold function. In the second method, no distinct probability functions are assumed for the speech and noise distributions and no LRT is employed. We show that VAD is related to the parameter constancy test in GARCH processes, and we describe the Cramer-von Mises (CVM) test algorithm for testing parameter constancy in GARCH models. In the last method, the process of outlier detection in GARCH models is presented, and the motivation for using it and its relation to VAD are discussed.
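To illustrate the modeling idea underlying these methods (this is the standard GARCH(1,1) variance recursion, not the chapter's estimator, and the parameter values are illustrative assumptions):

```python
import numpy as np

def garch11_variance(x, omega=0.1, alpha=0.2, beta=0.7):
    """Conditional variance of a GARCH(1,1) process:
        sigma2[t] = omega + alpha * x[t-1]**2 + beta * sigma2[t-1]
    Bursts of signal energy inflate sigma2, which a GARCH-based
    VAD can threshold or feed into a likelihood ratio test."""
    sigma2 = np.empty_like(x, dtype=float)
    sigma2[0] = omega / (1.0 - alpha - beta)   # unconditional variance
    for t in range(1, len(x)):
        sigma2[t] = omega + alpha * x[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```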
Page: 30-45 (16)
Author: J. Ramirez and J. M. Gorriz
Emerging applications in the field of speech processing demand increasing levels of performance in adverse noise environments. These systems often require a noise reduction scheme working in combination with a precise voice activity detector (VAD). This chapter presents an overview of the main challenges in robust speech detection and a review of the state of the art and its applications. Experimental results show the effectiveness of VADs that use contextual information when compared to recently reported algorithms.
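One representative way to use contextual information, sketched here only for illustration (a simplified long-term spectral divergence measure; the window size and scoring are assumptions, not the chapter's exact formulation), is to score each frame against a window of its neighbors rather than in isolation:

```python
import numpy as np

def long_term_divergence(spectra, noise_psd, order=3):
    """Long-term spectral divergence over a (2*order+1)-frame window.

    spectra: (n_frames, n_bins) magnitude spectra; noise_psd: (n_bins,).
    For each frame, take the per-bin maximum over the surrounding
    window (a long-term spectral envelope) and measure its average
    log-ratio to the noise spectrum. High values indicate speech."""
    n = len(spectra)
    ltsd = np.empty(n)
    for t in range(n):
        lo, hi = max(0, t - order), min(n, t + order + 1)
        envelope = spectra[lo:hi].max(axis=0)
        ltsd[t] = 10 * np.log10(np.mean(envelope ** 2 / (noise_psd + 1e-12)))
    return ltsd
```

Because the window spans several frames, isolated noise spikes are smoothed away and weak speech tails adjacent to strong frames are still flagged, which is the practical benefit of contextual VADs.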
Page: 46-59 (14)
Author: Juan M. Gorriz and Javier Ramirez
Nowadays, the accuracy of speech processing systems is strongly affected by acoustic noise. This is a serious obstacle regarding the demands of modern applications. Therefore, these systems often need a noise reduction algorithm working in combination with a precise voice activity detector (VAD). The computation needed to achieve denoising and speech detection must not exceed the limitations imposed by real-time speech processing systems. This chapter presents a novel VAD for improving speech detection robustness in noisy environments and the performance of speech recognition systems in real-time applications. The algorithm is based on a Multivariate Complex Gaussian (MCG) observation model and defines an optimal likelihood ratio test (LRT) involving Multiple and Correlated Observations (MCO) based on a jointly Gaussian probability distribution (jGpdf) and a symmetric covariance matrix. The complete derivation of the jGpdf-LRT for the general case of a symmetric covariance matrix is shown in terms of the Cholesky decomposition, which allows the VAD decision rule to be computed efficiently. An extensive analysis of the proposed methodology for a low-dimensional observation model demonstrates: i) the improved robustness of the proposed approach, by means of a clear reduction of the classification error as the number of observations is increased, and ii) the trade-off between the number of observations and the detection performance. The proposed strategy is also compared to different VAD methods, including the G.729, AMR and AFE standards, as well as other recently reported algorithms, showing a sustained advantage in speech/non-speech detection accuracy and speech recognition performance using the Aurora databases.
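The computational role of the Cholesky decomposition can be sketched as follows (a generic zero-mean real Gaussian LRT with illustrative covariances, not the chapter's complex-valued derivation): the factor gives the log-determinant from its diagonal and the quadratic form from one triangular solve, avoiding an explicit matrix inverse.

```python
import numpy as np

def gaussian_loglik(x, cov):
    """Log-density of a zero-mean multivariate Gaussian via Cholesky.

    With cov = L @ L.T, the log-determinant comes from diag(L) and
    the quadratic form from a single triangular solve."""
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, x)                 # solves L z = x
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    k = len(x)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + z @ z)

def lrt_decision(x, cov_speech, cov_noise, threshold=0.0):
    """Jointly Gaussian LRT: decide speech if the log-likelihood
    ratio between the two covariance models exceeds a threshold."""
    llr = gaussian_loglik(x, cov_speech) - gaussian_loglik(x, cov_noise)
    return llr > threshold, llr
```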
Page: 60-102 (43)
Author: Philipos C. Loizou
The need to remove or suppress acoustic noise arises in many situations in which the voice signal originates from a noisy location or is affected by noise over a communication channel. This chapter presents up-to-date coverage of all major noise suppression algorithms proposed over the past two decades, including spectral subtractive algorithms, Wiener filtering, statistical-model-based algorithms and subspace algorithms. It presents a comprehensive evaluation and comparison of major enhancement algorithms in terms of speech quality and speech intelligibility. Finally, the chapter concludes with a description of major objective measures used for predicting the subjective quality and intelligibility of noise-suppressed speech.
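The simplest member of the spectral subtractive family can be sketched in a few lines (a textbook form with oversubtraction and a spectral floor; the parameter values are illustrative, not taken from the chapter):

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, over=2.0, floor=0.02):
    """Basic magnitude spectral subtraction.

    Subtracts an (oversubtracted) noise power estimate from the noisy
    power spectrum; a spectral floor prevents negative values and
    limits the 'musical noise' artifact."""
    clean = noisy_mag ** 2 - over * noise_mag ** 2
    clean = np.maximum(clean, (floor * noisy_mag) ** 2)
    return np.sqrt(clean)
```

The more elaborate statistical-model-based and subspace algorithms surveyed in the chapter can be viewed as principled replacements for this crude gain rule.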
Page: 103-113 (11)
Author: Peter Jancovic, Xin Zou and Munevver Kokuer
This chapter presents our recent research on the use of Independent Component Analysis (ICA) for speech enhancement and speech representation. In the speech enhancement part, we consider a single-channel speech signal corrupted by an additive noise. We investigate novel algorithms for improving conventional ICA-based speech enhancement, referred to as Sparse Code Shrinkage (SCS). The proposed SCS-based algorithms incorporate multiple ICA transformations and distribution models of the speech signal. The speech enhancement algorithms are evaluated in terms of segmental SNR and spectral distortion on speech from the TIMIT database corrupted by Gaussian and real-world Subway noise. The proposed algorithms show significant improvements over conventional SCS and Wiener filtering. In the speech representation part, we present the use of ICA for speaker recognition in noisy environments. Finally, we show on a noisy speaker recognition task that the combination of the proposed ICA-based speech enhancement and ICA-based speech representation leads to recognition accuracy improvements compared to conventional enhancement and representation algorithms.
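The core nonlinearity in sparse code shrinkage is a soft-thresholding shrinkage applied in the ICA domain. A minimal sketch, assuming a Laplacian source prior (the variable names and the threshold parameterization here are illustrative assumptions, not the chapter's exact formulas):

```python
import numpy as np

def laplacian_shrinkage(u, noise_var, d):
    """Soft-thresholding shrinkage for a Laplacian source prior.

    u: coefficients of the noisy signal in the ICA-transformed domain;
    noise_var: noise variance in that domain; d: scale of the assumed
    Laplacian speech prior. Small-magnitude components (likely noise)
    are pulled to zero; large ones are kept, slightly attenuated."""
    thresh = np.sqrt(2.0) * noise_var / d
    return np.sign(u) * np.maximum(np.abs(u) - thresh, 0.0)
```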
Page: 114-132 (19)
Author: Nam Soo Kim and Joon-Hyuk Chang
Acoustic interferences such as background noise and reverberation are the major causes of quality degradation in speech communication. Over the past several decades, many attempts to reduce the effect of these interferences have been made using statistical-model-based techniques. In these techniques, not only the clean speech source but also the background noise and acoustic echo are assumed to be generated from a class of parametric distributions for which there exist efficient methods to estimate the relevant parameters. In this chapter, we review these parametric models and their application to voice activity detection, noise reduction, and echo suppression, which are important preprocessing components in robust speech communication systems.
Page: 133-140 (8)
Author: Antonio Miguel, Alfonso Ortega and Eduardo Lleida
Traditionally, in speech recognition, hidden Markov model state emission probability distributions are associated with continuous random variables, modeled by Gaussian mixtures. The inter-feature dependence is thus not accurately modeled by the covariance matrix, since it only considers pairs of variables; the mixture is the part of the model that usually captures this information, but it does so in a loose and inefficient way. Graphical models provide a precise and simple mechanism to model the dependencies among two or more variables. We propose the use of discrete random variables as observations and graphical models to extract the internal dependence structure in the feature vectors. A method to estimate a graphical model with a constrained number of dependencies, a special kind of Bayesian network, is presented in this chapter. Experimental results show that this method can be considered robust compared to standard baseline systems.
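One common way to constrain the number of dependencies (shown here purely for illustration; the chapter's estimation method may differ) is to rank feature pairs by empirical mutual information and keep only the strongest edges:

```python
import numpy as np
from itertools import combinations

def mutual_information(x, y, n_vals):
    """Empirical mutual information between two discrete variables
    taking values in {0, ..., n_vals-1}."""
    joint = np.zeros((n_vals, n_vals))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / np.outer(px, py)[mask]))

def select_dependencies(data, n_vals, k):
    """Rank feature pairs by mutual information and keep the top k,
    a simple way to bound the number of edges in the model."""
    n_feat = data.shape[1]
    scores = {(i, j): mutual_information(data[:, i], data[:, j], n_vals)
              for i, j in combinations(range(n_feat), 2)}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```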
Page: 141-154 (14)
Author: Yujun Wang, Maarten Van Segbroeck and Hugo Van hamme
Solutions for two important problems in the deployment of noise-robust large-vocabulary automatic speech recognizers using the missing data paradigm are presented. The first problem is the generation of missing data masks. We propose and evaluate a method based on vector quantization and harmonicity that successfully exploits the characteristics of speech while requiring only weak assumptions on the noise. The second problem addressed is computational efficiency. We advocate the use of PROSPECT features and the L-cluster-M-best method for Gaussian selection. In total, a speed-up by a factor of about six can be achieved with these methods.
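For readers unfamiliar with missing data masks, the basic idea can be sketched with a simple local-SNR criterion (the chapter's mask estimator uses vector quantization and harmonicity; this threshold rule is only a baseline illustration):

```python
import numpy as np

def snr_mask(noisy_psd, noise_psd, snr_db=0.0):
    """Binary missing-data mask: a time-frequency cell is 'reliable'
    when its estimated local SNR exceeds a threshold; unreliable
    cells are treated as missing by the recognizer."""
    local_snr = 10 * np.log10(np.maximum(noisy_psd - noise_psd, 1e-12)
                              / (noise_psd + 1e-12))
    return local_snr > snr_db
```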
Page: 155-168 (14)
Author: Berlin Chen and Shih-Hsiang Lin
The performance of current automatic speech recognition (ASR) systems often degrades dramatically when the input speech is corrupted by various kinds of noise sources. In this chapter, we first discuss several prominent and effective distribution-based feature compensation methods for improving ASR robustness, and then review two polynomial regression methods that have the merit of directly characterizing the relationship between speech features and their corresponding distribution characteristics to compensate for noise interference. All these methods were thoroughly investigated and compared using the Aurora-2 standard database and task. The empirical results demonstrate that most of these distribution-based feature compensation methods can achieve considerable word error rate reductions over the baseline system for either clean-condition or multi-condition training settings.
Page: 169-174 (6)
Author: Weifeng Li, Kazuya Takeda and Fumitada Itakura
Conventional single- and multi-channel speech enhancement methods aim at improving the signal-to-noise ratio (SNR) of the signals captured through distant microphones, and do not specifically target improvements in ASR performance. We investigate nonlinear multiple regression to extract robust features for automatic speech recognition (ASR). The idea is to approximate the log spectra of a close-talking microphone by effectively combining the log spectra of distant microphones. The devised system turns out to be a generalized log spectral subtraction framework for robust speech recognition. We demonstrate the effectiveness of the proposed approach through extensive single- and multi-channel isolated-word recognition experiments conducted in 15 real car-driving environments.
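The regression idea can be sketched as follows. For brevity this uses a linear least-squares fit per frequency bin, whereas the chapter uses a nonlinear regressor; the function names and data layout are illustrative assumptions.

```python
import numpy as np

def fit_log_spectral_regression(distant_logspec, close_logspec):
    """Least-squares fit approximating the close-talking log spectrum
    from stacked distant-microphone log spectra (one frequency bin).

    distant_logspec: (n_frames, n_mics) log spectra;
    close_logspec: (n_frames,) target close-talking log spectrum."""
    X = np.column_stack([distant_logspec,
                         np.ones(len(distant_logspec))])  # bias term
    w, *_ = np.linalg.lstsq(X, close_logspec, rcond=None)
    return w

def predict_close_talking(w, distant_logspec):
    """Apply the fitted weights to new distant-microphone frames."""
    X = np.column_stack([distant_logspec, np.ones(len(distant_logspec))])
    return X @ w
```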
Page: 175-189 (15)
Author: Chang-Wen Hsu and Lin-Shan Lee
Cepstral normalization has been widely used as a powerful approach to produce robust features for speech recognition. Good examples of this approach include Cepstral Mean Subtraction, and Cepstral Mean and Variance Normalization, in which either the first or both the first and the second moments of the Mel-frequency Cepstral Coefficients (MFCCs) are normalized. In this chapter, we propose the family of Higher Order Cepstral Moment Normalization, in which the MFCC parameters are normalized with respect to a few moments of orders higher than 1 or 2. The basic idea is that the higher order moments are more dominated by samples with larger values, which are very likely the primary sources of the asymmetry and abnormal flatness or tail size of the parameter distributions. Normalization with respect to these moments therefore puts more emphasis on these signal components and constrains the distributions to be more symmetric, with more reasonable flatness and tail size. The fundamental principles behind this approach are also analyzed and discussed based on the statistical properties of the distributions of the MFCC parameters. Experimental results based on the AURORA 2, AURORA 3, and AURORA 4 testing environments show that with the proposed approach, recognition accuracy can be significantly and consistently improved for all types of noise and all SNR conditions.
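The order-1/order-2 baseline that this family generalizes is straightforward to state in code (standard CMVN over an utterance; the higher-order normalization itself requires solving for per-coefficient scaling factors and is not reproduced here):

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization: each coefficient is
    shifted to zero mean and scaled to unit variance over the
    utterance, normalizing the first and second moments.

    features: (n_frames, n_ceps) MFCC matrix."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-12      # guard against zero variance
    return (features - mu) / sigma
```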
Page: 190-196 (7)
Author: Luz Garcia, Jose Carlos Segura and Angel de la Torre
The aim of robust speech recognition is to reduce as much as possible the environmental mismatch between the training and test conditions, in order to make optimal use of the acoustic models in the recognition process. Several factors produce such mismatch: inter-speaker variability, intra-speaker variability, and changes in the speaker environment or in the channel characteristics. Changes in the environment represent a challenging area of work and constitute one of the main driving forces of research in voice processing, which nowadays faces application scenarios such as mobile phones, moving cars, spontaneous speech, speech masked by other speech, and speech masked by music or non-stationary noises. This review summarizes the different strategies that combat the effects of additive noise on the voice signal and the recognition process, focusing on normalization techniques and particularly on nonlinear transformations of the MFCC features. Histogram Equalization and Parametric Histogram Equalization, with their variants and evolutions, are analyzed as main representatives of this family of nonlinear feature transformations.
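The core operation of histogram equalization can be sketched as a rank-based quantile mapping for a single cepstral coefficient (a minimal, assumption-laden illustration; practical HEQ implementations use smoothed cumulative histograms rather than this raw rank mapping):

```python
import numpy as np

def histogram_equalize(feature, reference_sorted):
    """Nonlinear HEQ transform for one cepstral coefficient: map each
    value through its empirical rank onto the quantiles of a reference
    distribution (e.g. the training set, or a Gaussian), so the
    transformed feature follows the reference distribution.

    feature: (n_frames,) values; reference_sorted: sorted reference
    samples defining the target distribution."""
    ranks = np.argsort(np.argsort(feature))       # rank of each frame
    quantiles = (ranks + 0.5) / len(feature)
    idx = (quantiles * len(reference_sorted)).astype(int)
    return reference_sorted[np.clip(idx, 0, len(reference_sorted) - 1)]
```

Because the mapping is monotone, the ordering of feature values is preserved while the overall distribution is reshaped, which is what makes HEQ a nonlinear generalization of mean/variance normalization.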
Advances in Human-Machine Systems for In-Vehicle Environments: Noise and Cognitive Stress/Distraction
Page: 197-210 (14)
Author: John H.L. Hansen, Pongtep Angkititrakul and Wooil Kim
As computing technology advances, the ability to integrate a wider range of personal services into in-vehicle environments increases. These technologies include hands-free wireless communications, video/data/internet within the vehicle, route planning and navigation, access to music and information download, command and control of vehicle instrumentation, as well as inter-vehicle communications. While these advances offer a diverse range of entertainment and information access opportunities, they are generally introduced into the vehicle with limited understanding of their impact on driver distraction and cognitive stress load. As the diversity of speech, video, biometric, and vehicle signals increases, improved corpora and system formulation are needed. In this study, we consider recent advances in in-vehicle human-machine systems for route navigation, noise suppression for robust speech recognition, and driver behavior modeling. Multi-microphone array processing based on combined fixed-adaptive beamforming is developed for noise suppression for hands-free communications as well as improved automatic speech recognition for route dialog interaction. Next, advances in modeling driver behavior are considered in the UT-Drive project, which is focused on advancing smart vehicle technologies for improved safety while driving. Finally, a general discussion considers next-generation advances for in-vehicle environments that sense driver cognitive stress/distraction to adapt interactive systems and improve safety.
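The fixed half of a fixed-adaptive beamformer is typically a delay-and-sum stage, which can be sketched as follows (an idealized integer-delay version for illustration; real systems use fractional delays and an adaptive sidelobe-cancelling stage on top):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Fixed delay-and-sum beamformer: time-align each microphone
    channel toward the assumed source direction and average, which
    reinforces the target signal and attenuates uncorrelated noise.

    channels: (n_mics, n_samples) array; delays: integer sample
    delays of the source at each microphone (circular shift used
    here for simplicity)."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)
```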
Page: 211-214 (4)
Author: Javier Ramirez and Juan Manuel Gorriz
This E-book is a collection of articles that describe advances in speech recognition technology. Robustness in speech recognition refers to the need to maintain high speech recognition accuracy even when the quality of the input speech is degraded, or when the acoustic, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustic degradations produced by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, impulsive interfering sources, and diminished accuracy caused by changes in articulation produced by the presence of high-intensity noise sources. Although progress over the past decade has been impressive, there are significant obstacles to overcome before speech recognition systems can reach their full potential. Automatic speech recognition (ASR) systems must be robust at all levels, so that they can handle background or channel noise, the occurrence of unfamiliar words, new accents, new users, or unanticipated inputs. They must exhibit more 'intelligence' and integrate speech with other modalities, deriving the user's intent by combining speech with facial expressions, eye movements, gestures, and other input features, and communicating back to the user through multimedia responses. Therefore, as speech recognition technology is transferred from the laboratory to the marketplace, robustness in recognition becomes increasingly significant. This E-book should be useful to computer engineers interested in recent developments in speech recognition technology.