Speech processing systems

Model
Digital Document
Publisher
Florida Atlantic University
Description
Society's increasing demand for communications motivates the search for techniques that conserve bandwidth. It has been observed that much of the time spent during telephone communication is actually idle time with no voice activity present. Detecting these idle periods and suppressing transmission during them can reduce bandwidth requirements during high-traffic periods. While techniques exist to perform this detection, certain types of noise can make reliable detection difficult. The use of wavelets with multiresolution subspaces can aid detection by providing noise whitening and signal matching. This thesis explores their use and proposes a detection technique.
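The idle-period detection described above can be illustrated with a minimal short-time-energy detector. This is only a sketch of the basic idea, not the wavelet-based method the thesis develops; the frame length and threshold are illustrative choices.

```python
import numpy as np

def frame_energy_vad(x, frame_len=160, threshold_db=-30.0):
    """Flag frames as active (True) or idle (False) by short-time energy.

    threshold_db is relative to the loudest frame; both the frame length
    and the threshold are illustrative, not values from the thesis.
    """
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    ref = energy.max()
    return 10.0 * np.log10(energy / ref + 1e-12) > threshold_db

# Toy signal: 0.5 s of a speech-like tone with a near-silent stretch inside.
fs = 8000
t = np.arange(fs // 2) / fs
x = np.sin(2 * np.pi * 300 * t)
x[len(x) // 4 : len(x) // 2] *= 0.001      # the "idle" stretch
active = frame_energy_vad(x)
```

Only the frames marked active would be transmitted; in noisy channels this simple energy rule breaks down, which is the motivation for the wavelet-domain approach.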
Glottal pulse models provide vocal tract excitation signals which are used in producing high-quality speech. Most currently used glottal pulse models are obtained by concatenating a small number of parametric functions over the pitch period. In this thesis, a new glottal pulse model is proposed. This alternative approach is based on the projection of the glottal volume velocity onto multiresolution subspaces spanned by wavelets and scaling functions. A detailed multiresolution analysis of the glottal models is performed using the compactly supported orthogonal Daubechies wavelets. The wavelet representation has been tested for optimality in terms of the reconstruction error and the energy compactness of the coefficients. It is demonstrated that by choosing proper parameters of the wavelet representation, high compression ratios and low RMS error can be achieved.
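The compression-versus-error trade-off above can be sketched with the simplest Daubechies wavelet (db1, the Haar wavelet) rather than the higher-order Daubechies filters used in the thesis, and with a synthetic pulse standing in for a measured glottal volume velocity. By Parseval's relation for an orthogonal transform, the reconstruction error energy equals the energy of the discarded coefficients.

```python
import numpy as np

def haar_dwt(x, levels):
    """Multilevel orthogonal DWT with the Haar (Daubechies-1) wavelet."""
    coeffs = []
    a = x.astype(float)
    for _ in range(levels):
        d = (a[0::2] - a[1::2]) / np.sqrt(2)   # detail coefficients
        a = (a[0::2] + a[1::2]) / np.sqrt(2)   # approximation coefficients
        coeffs.append(d)
    coeffs.append(a)
    return coeffs

def haar_idwt(coeffs):
    """Invert haar_dwt exactly (orthogonal, perfect reconstruction)."""
    a = coeffs[-1]
    for d in reversed(coeffs[:-1]):
        up = np.empty(2 * len(a))
        up[0::2] = (a + d) / np.sqrt(2)
        up[1::2] = (a - d) / np.sqrt(2)
        a = up
    return a

# Synthetic stand-in for one pitch period of glottal volume velocity.
n = np.arange(64)
pulse = np.sin(np.pi * n / 64) ** 2

coeffs = haar_dwt(pulse, levels=4)
flat = np.concatenate(coeffs)

# Keep only the largest 25% of coefficients (4:1 compression).
keep = len(flat) // 4
idx = np.argsort(np.abs(flat))[::-1][:keep]
compressed = np.zeros_like(flat)
compressed[idx] = flat[idx]

# Rebuild the coefficient list, reconstruct, and measure the RMS error.
sizes = [len(c) for c in coeffs]
split = np.split(compressed, np.cumsum(sizes)[:-1])
rec = haar_idwt(split)
rms = np.sqrt(np.mean((pulse - rec) ** 2))
```

Because most of the pulse energy concentrates in the coarse approximation coefficients, the 4:1 truncation leaves only a small RMS error, which is the energy-compactness property the thesis exploits.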
A waveform substitution technique using interpolation based on slowly varying speech parameters, such as short-time energy and average zero-crossing rate, is developed for a packetized speech communication system. The system uses 64 kbps conventional PCM for encoding and takes advantage of active talkspurts and silence intervals to increase the utilization efficiency of a digital link. The short-time energy and average zero-crossing rates calculated for the purpose of determining talkspurts are transmitted in a preceding packet. Hence, when a packet is pronounced "lost", its envelope and frequency characteristics are obtained from the previous packet and used to synthesize a substitution waveform which is free of the annoying sounds caused by abrupt changes in amplitude. Informal listening tests show that tolerable packet loss rates of up to 40% are achievable with these procedures.
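The substitution idea can be sketched as follows: measure the short-time energy and zero-crossing rate of the preceding packet, then synthesize a fill-in waveform matching both. The sinusoidal substitute and all parameter values here are illustrative simplifications, not the interpolation procedure of the thesis.

```python
import numpy as np

def packet_features(x):
    """Short-time energy and average zero-crossing rate of one packet."""
    energy = np.sum(x ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2.0  # crossings per sample
    return energy, zcr

def substitute_packet(energy, zcr, n, fs):
    """Synthesize a replacement whose energy and dominant frequency match
    the preceding packet (a sinusoidal stand-in for illustration)."""
    freq = zcr * fs / 2.0           # a sinusoid crosses zero 2f/fs times per sample
    amp = np.sqrt(2.0 * energy / n) # sinusoid of amplitude A has energy A^2 n / 2
    t = np.arange(n) / fs
    return amp * np.sin(2 * np.pi * freq * t)

fs, n = 8000, 160                   # 20 ms packet at 8 kHz (illustrative)
t = np.arange(n) / fs
prev_packet = 0.5 * np.sin(2 * np.pi * 440 * t)
energy, zcr = packet_features(prev_packet)
fill = substitute_packet(energy, zcr, n, fs)
```

Matching the envelope and zero-crossing rate of the previous packet is what avoids the abrupt amplitude changes that make simple silence substitution audible.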
This thesis deals with the application of Line Spectrum Pairs to tone detection. Linear Predictive Coding (LPC) is described as a background to deriving the Line Spectrum Pairs. Two sources of LPC prediction coefficients are used to calculate Line Spectrum Pairs. One source is the polynomial roots of an LPC inverse filter; various locations of up to 3 pairs of complex conjugate roots are used to provide filter coefficients. The radii of the conjugate roots are varied to see the effect on the calculated Line Spectrum Pairs. A second source of the filter coefficients is single and multiple sinusoidal tones that are LPC analyzed by the autocorrelation method to generate filter prediction coefficients. The frequencies and amplitudes of the summed sinusoids, and the length of the LPC analysis window, are varied to determine the ability to detect the sinusoids by calculating the related Line Spectrum Pairs.
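The first source of coefficients above, an inverse filter with specified conjugate-root locations, can be sketched directly. The standard LSP construction forms the symmetric and antisymmetric polynomials P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z), whose unit-circle root angles are the LSP frequencies; the pole radius and angle below are illustrative.

```python
import numpy as np

def lsp_from_lpc(a):
    """Line Spectrum Pair frequencies (radians) from LPC coefficients a,
    where A(z) = a[0] + a[1] z^-1 + ... + a[p] z^-p."""
    a = np.asarray(a, dtype=float)
    # P(z) = A(z) + z^-(p+1) A(1/z), Q(z) = A(z) - z^-(p+1) A(1/z)
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    freqs = []
    for poly in (p_poly, q_poly):
        for r in np.roots(poly):
            w = np.angle(r)
            if 1e-6 < w < np.pi - 1e-6:  # drop the fixed roots at z = +/-1
                freqs.append(w)
    return np.sort(np.array(freqs))

# Inverse filter with one conjugate root pair at radius 0.9, angle pi/3.
r, theta = 0.9, np.pi / 3
a = np.array([1.0, -2 * r * np.cos(theta), r * r])
lsp = lsp_from_lpc(a)
```

For a single root pair, the two LSP frequencies bracket the root angle, and they move closer together as the root radius approaches the unit circle, which is what makes them useful for tone detection.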
In speech analysis, a Voiced-Unvoiced-Silence (V/UV/S) decision is performed through pattern recognition, based on measurements made on the signal. The examined speech segment is assigned to a particular class, V/UV/S, based on a minimum probability-of-error decision rule which is obtained under the assumption that the measured parameters are distributed according to a multidimensional Gaussian probability density function. The means and covariances for the Gaussian distribution are determined from manually classified speech data included in a training set. If the recording conditions vary considerably, a new set of training data is required. With the assumption that all three classes exist in the incoming speech signal, this research describes an automatic parametric learning method. Such a method estimates the means and covariances from the incoming speech signal and provides a reliable classification in any reasonable acoustic environment. This approach eliminates the necessity for the manual classification of training data and has the capability of being self-adapting to the background acoustic environment as well as to speech level variations. Thus the presented approach can be readily applied to on-line continuous speech classification without prior recognition.
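The minimum probability-of-error rule under the Gaussian assumption can be sketched as below. Note that this sketch estimates the class means and covariances from manually labeled training data, i.e. the conventional approach the research improves upon, and the 2-D features and class means are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gaussian(x):
    """Sample mean and covariance of one class's feature vectors."""
    return x.mean(axis=0), np.cov(x, rowvar=False)

def log_gauss(x, mu, cov):
    """Log of the multivariate Gaussian density at the rows of x."""
    d = x - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', d, inv, d)
    return -0.5 * (quad + logdet + x.shape[1] * np.log(2 * np.pi))

# Synthetic 2-D features (e.g. energy, zero-crossing rate) per class;
# the class means below are invented for illustration.
means = {'V': [5.0, 1.0], 'UV': [2.0, 4.0], 'S': [0.5, 0.5]}
train = {c: rng.normal(means[c], 0.5, size=(200, 2)) for c in means}
params = {c: fit_gaussian(train[c]) for c in means}

def classify(x):
    """Assign each row of x to V, UV or S by maximum likelihood
    (equal priors make this the minimum probability-of-error rule)."""
    classes = list(params)
    scores = np.stack([log_gauss(x, *params[c]) for c in classes], axis=1)
    return [classes[i] for i in scores.argmax(axis=1)]

test_x = rng.normal(means['V'], 0.5, size=(50, 2))
pred = classify(test_x)
```

The automatic learning method proposed in the research would instead estimate `params` directly from the incoming, unlabeled signal, removing the dependence on this manually classified training set.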
Blind source separation (BSS) refers to a class of methods by which multiple sensor signals are combined with the aim of estimating the original source signals. Independent component analysis (ICA) is one such method that effectively resolves static linear combinations of independent non-Gaussian distributions. We propose a method that can track variations in the mixing system by seeking a compromise between adaptive and block methods through the use of mini-batches. The resulting permutation indeterminacy is resolved based on the correlation continuity principle. Methods employing higher-order cumulants in the separation criterion are susceptible to outliers in the finite-sample case. We propose a robust method based on low-order non-integer moments that exploits the Laplacian model of speech signals. We study separation methods for even- or over-determined linear convolutive mixtures in the frequency domain based on joint diagonalization of matrices employing time-varying second-order statistics. We investigate the factors affecting the sensitivity of the solution in the finite-sample case, such as the set size, the overlap amount, and the cross-spectrum estimation method.
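The static two-channel ICA problem can be sketched with the classic whiten-then-rotate decomposition: after whitening, a 2x2 unmixing matrix reduces to a rotation, and the separating angle is the one that maximizes the non-Gaussianity of the outputs. This kurtosis grid search is a textbook illustration, not the mini-batch or low-order-moment methods proposed in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Two independent non-Gaussian sources: Laplacian (speech-like) and uniform.
s = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.6], [0.4, 1.0]])        # static mixing matrix
x = A @ s

# Whitening: decorrelate and normalize the sensor signals.
cov = np.cov(x)
vals, vecs = np.linalg.eigh(cov)
z = (vecs / np.sqrt(vals)) @ vecs.T @ x       # cov(z) is the identity

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0

# Grid-search the rotation angle maximizing total non-Gaussianity.
best, y_best = -np.inf, None
for theta in np.linspace(0, np.pi / 2, 180):
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    y = R @ z
    score = abs(excess_kurtosis(y[0])) + abs(excess_kurtosis(y[1]))
    if score > best:
        best, y_best = score, y
```

The recovered components match the sources only up to permutation and sign, which is exactly the indeterminacy that the correlation-continuity principle resolves across successive mini-batches.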
This work explores the process of model-based classification of speech audio signals using low-level feature vectors. The process of extracting low-level features from audio signals is described along with a discussion of established techniques for training and testing mixture model-based classifiers and using these models in conjunction with feature selection algorithms to select optimal feature subsets. The results of a number of classification experiments using a publicly available speech database, the Berlin Database of Emotional Speech, are presented. This includes experiments in optimizing feature extraction parameters and comparing different feature selection results from over 700 candidate feature vectors for the tasks of classifying speaker gender, identity, and emotion. In the experiments, final classification accuracies of 99.5%, 98.0% and 79% were achieved for the gender, identity and emotion tasks respectively.
The goal of a speech enhancement algorithm is to remove noise and recover the original signal with as little distortion and residual noise as possible. Most successful real-time algorithms operate in the frequency domain, where the frequency amplitude of clean speech is estimated for each short-time frame of the noisy signal. State-of-the-art short-time spectral amplitude estimator algorithms estimate the clean spectral amplitude in terms of the power spectral density (PSD) function of the noisy signal. The PSD has to be computed from a large ensemble of signal realizations. However, in practice, it may only be estimated from a finite-length sample of a single realization of the signal. Estimation errors introduced by these limitations cause the solution to deviate from the optimum. Various spectral estimation techniques, many with added spectral smoothing, have been investigated for decades to reduce the estimation errors. These algorithms, however, do not significantly address the quality of the speech as perceived by a human listener. This dissertation presents analysis and techniques that offer spectral refinements toward speech enhancement. We present an analytical framework of the effect of spectral estimate variance on the performance of speech enhancement. We use the variance quality factor (VQF) as a quantitative measure of estimated spectra. We show that reducing the spectral estimator VQF significantly reduces the VQF of the enhanced speech. The Autoregressive Multitaper (ARMT) spectral estimate is proposed as a low-VQF spectral estimator for use in speech enhancement algorithms. An innovative method of incorporating a speech production model using multiband excitation is also presented as a technique to emphasize the harmonic components of the glottal speech input.
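The short-time spectral-amplitude framework described above can be sketched with a basic spectral-subtraction enhancer: the noise PSD is estimated from the raw periodogram of initial (assumed speech-free) frames, and a per-bin gain is applied to each frame. The high variance of this single-realization periodogram estimate is precisely the problem the ARMT estimator addresses; frame size, gain floor, and the noise-only assumption are illustrative.

```python
import numpy as np

def stft_frames(x, frame, hop, win):
    """Windowed short-time spectra of x, one row per frame."""
    n = (len(x) - frame) // hop + 1
    return np.stack([np.fft.rfft(x[i*hop:i*hop+frame] * win) for i in range(n)])

def enhance(noisy, frame=256, n_noise_frames=10, floor=0.1):
    """Short-time spectral subtraction with 50% overlap-add.

    Noise PSD = average periodogram of the first n_noise_frames
    (assumed speech-free); frame size and gain floor are illustrative.
    """
    hop = frame // 2
    win = np.hanning(frame)
    spec = stft_frames(noisy, frame, hop, win)
    noise_psd = np.mean(np.abs(spec[:n_noise_frames]) ** 2, axis=0)
    psd = np.abs(spec) ** 2
    gain = np.maximum(1.0 - noise_psd / np.maximum(psd, 1e-12), floor)
    out = np.zeros(len(noisy))
    for i, frame_spec in enumerate(spec * gain):
        # Hann analysis + synthesis at 50% overlap sums to a near-constant.
        out[i*hop:i*hop+frame] += np.fft.irfft(frame_spec, n=frame) * win
    return out

fs = 8000
t = np.arange(fs) / fs
clean = np.where(t >= 0.5, np.sin(2 * np.pi * 400 * t), 0.0)  # silence, then tone
noisy = clean + 0.3 * np.random.default_rng(2).normal(size=len(t))
enhanced = enhance(noisy)
```

Because the per-bin periodogram fluctuates strongly around the true PSD, the gain is itself noisy, producing the "musical noise" artifacts; a lower-variance spectral estimate in place of `psd` reduces the VQF of the enhanced speech.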