IMPROVING THE ACCURACY OF VOICE IDENTIFICATION THROUGH RATIONAL SELECTION OF IDENTIFICATION CHARACTERISTICS
12.11.2024 15:40
[3. Technical Sciences]
Author: Yana Bielozorova, Associate Professor, PhD, National Aviation University, Kyiv
Problem Statement. The issue of speech signal identification has been well-researched, with numerous foundational studies conducted. However, even today, the accuracy of speech signal identification is still deemed insufficient, and among various methods of personal identification, it remains one of the least reliable. Consequently, improving the accuracy of speech signal identification is a relevant problem, especially in areas related to human-computer interaction.
Literature Review and Analysis. Previous studies have shown that the human brain processes information from external receptors, such as auditory and visual perception, at a frequency of approximately 60 operations per second [1], which corresponds to time segments of 16.67 ms. Given that most studies employ varying window lengths and, in many cases, larger windows to prevent edge effects during signal decomposition, using a window length comparable to that processed by the human brain may improve identification accuracy, similar to other approaches discussed below.
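The correspondence between 60 operations per second and 16.67 ms segments can be made concrete with a minimal sketch. The sample rates and the helper names frame_length and split_into_frames are illustrative assumptions, not part of the cited work; frames are assumed to be non-overlapping, and a trailing fragment shorter than one frame is dropped.

```python
def frame_length(sample_rate, frames_per_second=60):
    """Samples per frame for a window of 1/frames_per_second seconds
    (1/60 s is approximately 16.67 ms)."""
    return round(sample_rate / frames_per_second)


def split_into_frames(samples, sample_rate, frames_per_second=60):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    A trailing fragment shorter than one frame is dropped.
    """
    n = frame_length(sample_rate, frames_per_second)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]


if __name__ == "__main__":
    one_second = [0.0] * 48000                  # 1 s of silence at 48 kHz (assumed rate)
    frames = split_into_frames(one_second, 48000)
    print(len(frames), len(frames[0]))          # 60 frames of 800 samples each
```

At sample rates that are not multiples of 60 the frame length is rounded to a whole number of samples, so the frame duration only approximates 16.67 ms.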
Let us consider the “edge effect” in the construction of speech signal identification systems. Suppose there is a normalized signal whose amplitude decreases from 1 at the beginning to 0 at the end of the time interval. The spectrum of this signal will show a reduction in spectral amplitude at 50 Hz, which does not accurately represent the actual signal spectrum. Increasing the window size gradually attenuates these spectral amplitudes, but they still appear in the signal’s spectrogram. This is an idealized case that assumes a perfect signal representation; with real speech signals, edge distortions are much more pronounced, producing the well-known “edge effect.” This effect significantly affects both the accuracy of signal decomposition and the interpretation of research results.
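The influence of frame edges on the spectrum can be illustrated with a small numerical experiment. This is a sketch under stated assumptions, not the experiment described above: a pure tone is used instead of a decaying signal, the sample rate (8000 Hz) and frame length (128 samples) are chosen arbitrarily, and the names dft_magnitudes, leakage_fraction and tone are hypothetical. The frame is cut with a rectangular window, and the share of spectral energy lying away from the peak is compared for a tone that fits a whole number of periods into the frame and a tone that does not.

```python
import cmath
import math

SAMPLE_RATE = 8000           # Hz, assumed
N = 128                      # frame length in samples (16 ms), assumed
BIN_HZ = SAMPLE_RATE / N     # spacing of DFT bins: 62.5 Hz


def dft_magnitudes(frame):
    """Magnitudes of the DFT of a real frame for bins 0..N/2 (naive O(N^2) sum)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def leakage_fraction(mags, guard=1):
    """Share of spectral energy lying more than `guard` bins away from the peak."""
    peak = max(range(len(mags)), key=mags.__getitem__)
    total = sum(m * m for m in mags)
    far = sum(m * m for k, m in enumerate(mags) if abs(k - peak) > guard)
    return far / total


def tone(freq_hz):
    """One rectangular-window frame of a unit-amplitude sine."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(N)]


if __name__ == "__main__":
    aligned = leakage_fraction(dft_magnitudes(tone(8 * BIN_HZ)))       # whole periods
    misaligned = leakage_fraction(dft_magnitudes(tone(8.5 * BIN_HZ)))  # cut mid-period
    print(f"aligned: {aligned:.6f}  misaligned: {misaligned:.6f}")
```

When the frame boundary cuts the tone mid-period, a noticeable part of the energy appears in bins far from the true frequency, although the signal itself contains a single spectral component.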
To reduce the “edge effect,” the signal is multiplied by a special function, known as a window function, before decomposition. This technique mitigates the described effect but involves two interdependent properties: the width of the main peak and the level of the spectral side lobes [2]. Lowering the side lobes broadens (blurs) the main peak, and vice versa. Depending on the specific application, an appropriate window is chosen (Hann, Hamming, Blackman, and others), though typically an improvement in one kind of resolution is achieved at the expense of the other.
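The interplay between the main peak and the side lobes can be checked numerically. The sketch below is illustrative only: the window length (64 samples), the periodic form of the Hann window, and the helper names window_response_db, first_null_index and describe are assumptions. It evaluates the magnitude response of a rectangular and a Hann window on a fine frequency grid and reports the half-width of the main lobe (in DFT bins) and the highest side lobe (in dB relative to the main peak).

```python
import cmath
import math


def window_response_db(window, oversample=16):
    """Magnitude response in dB relative to the DC gain, sampled at
    `oversample` points per DFT bin from 0 up to N/2 bins."""
    n = len(window)
    dc_gain = abs(sum(window))
    response = []
    for j in range(oversample * n // 2 + 1):
        f = j / (oversample * n)  # frequency in cycles per sample
        acc = sum(w * cmath.exp(-2j * math.pi * f * i) for i, w in enumerate(window))
        response.append(20 * math.log10(max(abs(acc) / dc_gain, 1e-12)))
    return response


def first_null_index(response):
    """Index of the first local minimum, taken as the edge of the main lobe."""
    for j in range(len(response) - 1):
        if response[j] < response[j + 1]:
            return j
    return len(response) - 1


def describe(window, oversample=16):
    """Return (main-lobe half-width in bins, peak side-lobe level in dB)."""
    response = window_response_db(window, oversample)
    edge = first_null_index(response)
    return edge / oversample, max(response[edge:])


N = 64  # window length in samples, assumed
rectangular = [1.0] * N
hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / N) for i in range(N)]  # periodic Hann

if __name__ == "__main__":
    for name, window in (("rectangular", rectangular), ("Hann", hann)):
        width, side_lobe = describe(window)
        print(f"{name}: main-lobe half-width {width:.2f} bins, side lobe {side_lobe:.1f} dB")
```

In this sketch the Hann window lowers the side lobes considerably compared with the rectangular window, while its main lobe is about twice as wide.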
Most existing window functions are based on Gaussian-type functions [2]. The Gaussian window has a unique advantage over other windows – it provides the most compact time-frequency representation of the spectrum. In other words, it minimizes the uncertainty product in both the time and frequency domains, linked to the Fourier Transform uncertainty principle. Notably, in physics, this property of the Fourier Transform serves as the mathematical foundation of Heisenberg’s uncertainty principle, which states that position and momentum cannot be known simultaneously with infinite precision.
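The minimal uncertainty product of the Gaussian window can be verified numerically. The following is a rough sketch under stated assumptions: the window g(t) is real and centered at zero, the time spread is the standard deviation of |g(t)|^2, and the angular-frequency spread is computed from the energy of the derivative g'(t), which is equivalent to the spectral definition by Parseval's theorem. The function name uncertainty_product, the integration grid, and the choice of the Hann window for comparison are illustrative. The theoretical lower bound of the product is 1/2.

```python
import math


def uncertainty_product(g, t_min, t_max, steps=4000):
    """Product of time spread and angular-frequency spread of a real window g
    centered at t = 0, estimated on a uniform grid over [t_min, t_max]."""
    dt = (t_max - t_min) / steps
    ts = [t_min + i * dt for i in range(steps + 1)]
    gs = [g(t) for t in ts]
    energy = sum(v * v for v in gs) * dt
    # Time variance: second moment of |g|^2 (the mean is zero by assumption).
    t_var = sum(t * t * v * v for t, v in zip(ts, gs)) * dt / energy
    # Frequency variance: energy of the derivative (finite differences) over energy of g.
    w_var = sum(((b - a) / dt) ** 2 for a, b in zip(gs, gs[1:])) * dt / energy
    return math.sqrt(t_var * w_var)


def gaussian(t):
    return math.exp(-0.5 * t * t)


def hann(t):
    """Hann window of unit duration on [-0.5, 0.5]."""
    return math.cos(math.pi * t) ** 2


if __name__ == "__main__":
    print("Gaussian:", round(uncertainty_product(gaussian, -8.0, 8.0), 4))
    print("Hann:    ", round(uncertainty_product(hann, -0.5, 0.5), 4))
```

The Gaussian reaches the bound of 0.5 to within the integration error, while the Hann window gives a slightly larger product.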
Objective. The aim of this study is to identify approaches to improving the accuracy of speech signal identification through a novel approach to describing the speech signal.
Research Results. The window-function approach discussed above has a significant drawback that considerably impacts the accuracy of spectral decomposition of the speech signal and, consequently, the accuracy of its identification.
The proposed method is based on the theory of wavelet frames [3]. Under this theory, the decomposition parameters in the time-frequency domain are determined on the principle that the extent of the Heisenberg rectangle along the time and frequency axes is proportional to the scale s and its inverse 1/s, respectively. Transformations are performed with a frequency step of 1 Hz, independent of window size. According to Heisenberg’s principle, frequency resolution does not impose significant limitations on analyzing signal structure over short time intervals. This relates to the fundamental frequency, typically within 100–550 Hz, and the spacing between formant peaks over sufficiently short time intervals. Experiments have shown [4] that the frequency resolution of two closely spaced peaks over a 16.67 ms transformation interval is approximately 50 Hz.
The Morlet wavelet was chosen as the basis for the wavelet transformation in this study, with selection criteria based on the following:
1. computational speed of the algorithm;
2. frequent use in speech signal description tasks due to its effective approximation compared to other bases;
3. the wavelet should represent a windowed transformation with a Gaussian function.
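A Morlet decomposition evaluated with a 1 Hz frequency step can be sketched as follows. This is a minimal illustration and not the wavelet-frame construction used in the study: the center parameter OMEGA0 = 6, the truncation of the Gaussian envelope at four standard deviations, the amplitude normalization, and the function names morlet_coefficient and scalogram_column are all assumptions. Each coefficient is the correlation of the signal with a Gaussian-windowed complex exponential whose time extent is proportional to the scale and therefore inversely proportional to the analyzed frequency.

```python
import cmath
import math

OMEGA0 = 6.0  # Morlet center parameter; a common default, assumed here


def morlet_coefficient(signal, sample_rate, centre, freq_hz):
    """Magnitude of the Morlet coefficient at sample index `centre` for one frequency."""
    scale = OMEGA0 / (2 * math.pi * freq_hz)   # seconds; time spread of the atom
    half = int(4 * scale * sample_rate)        # truncate the Gaussian at 4 sigma
    acc, norm = 0j, 0.0
    for k in range(-half, half + 1):
        i = centre + k
        if 0 <= i < len(signal):
            u = k / sample_rate / scale        # dimensionless time
            envelope = math.exp(-0.5 * u * u)
            # OMEGA0 * u equals 2*pi*freq_hz*t, so this is a windowed Fourier kernel.
            acc += signal[i] * envelope * cmath.exp(-1j * OMEGA0 * u)
            norm += envelope
    return abs(acc) / norm


def scalogram_column(signal, sample_rate, centre, f_lo=100, f_hi=550, step_hz=1):
    """Coefficient magnitudes at one time instant for f_lo..f_hi in steps of step_hz."""
    freqs = list(range(f_lo, f_hi + 1, step_hz))
    return freqs, [morlet_coefficient(signal, sample_rate, centre, f) for f in freqs]


if __name__ == "__main__":
    rate = 8000                                                  # Hz, assumed
    test_tone = [math.sin(2 * math.pi * 220 * i / rate) for i in range(800)]
    freqs, column = scalogram_column(test_tone, rate, centre=400)
    print("ridge frequency:", freqs[column.index(max(column))], "Hz")
```

The 1 Hz step sets only the spacing of the frequency grid; the ability to separate two close peaks is still governed by the width of the Gaussian envelope at the given scale.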
This study demonstrated that the arrangement of scalogram ridges along the time parameter in Figure 1 corresponds precisely to local amplitude extrema in the time domain of the sound wave, unlike FFT decomposition of the same region (Figure 2).
Fig. 1 Spectrogram of the decomposition of a fragment of a speech signal using the described method
These local extrema align with amplitude spikes in the sound wave, determined by the fundamental frequency. An essential aspect of the high affinity between the Morlet basis and self-similar structures in speech fragments is the higher smoothness of the scalogram compared to, for example, a Fourier transform (Figure 2). The higher degree of smoothness in the functions allows for an efficient mathematical analysis of ridge parameters.
Fig. 2 Spectrogram of the decomposition of a signal using FFT with a rectangular window
In this approach, the distances between local maxima of the scalogram in the frequency domain serve as estimates of the fundamental frequency. An important factor in the robustness and reliability of fundamental frequency estimates within this methodology is the ability to assess fundamental frequency not only through local maxima in the wavelet transform but also through correlation between fragments of maximum regions (Figure 3). On small intervals, these regions exhibit approximately self-similar structures. When analyzing the self-similarity of such structures, it is possible to extract identical-sized structures within the time window without focusing on scalogram maxima.
Fig. 3 Determination of the fundamental frequency based on local maxima of a fragment of the speech signal
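The estimation of the fundamental frequency from the distances between local maxima along the frequency axis can be sketched as follows. The code is illustrative and makes several assumptions: the input is a single column of transform magnitudes on a 1 Hz grid, maxima below a relative threshold of 0.2 are ignored, the median spacing is taken as the estimate, and the names local_maxima and f0_from_peak_spacing are hypothetical. The magnitude profile in the usage example is synthetic (three Gaussian bumps at multiples of 150 Hz), not measured data.

```python
import math
from statistics import median


def local_maxima(freqs, mags, rel_threshold=0.2):
    """Frequencies of local maxima whose height is at least
    rel_threshold times the global maximum."""
    floor = rel_threshold * max(mags)
    return [
        freqs[i]
        for i in range(1, len(mags) - 1)
        if mags[i] >= floor and mags[i - 1] < mags[i] >= mags[i + 1]
    ]


def f0_from_peak_spacing(freqs, mags, rel_threshold=0.2):
    """Median distance between neighboring maxima, or None if fewer than two are found."""
    peaks = local_maxima(freqs, mags, rel_threshold)
    if len(peaks) < 2:
        return None
    return median(b - a for a, b in zip(peaks, peaks[1:]))


def synthetic_profile(freqs, harmonics, width_hz=15.0):
    """Synthetic magnitude column: a Gaussian bump at each (frequency, amplitude) pair."""
    return [
        sum(a * math.exp(-0.5 * ((f - h) / width_hz) ** 2) for h, a in harmonics)
        for f in freqs
    ]


if __name__ == "__main__":
    freqs = list(range(100, 551))   # 1 Hz grid over 100..550 Hz
    mags = synthetic_profile(freqs, [(150, 1.0), (300, 0.7), (450, 0.4)])
    print("estimated F0:", f0_from_peak_spacing(freqs, mags), "Hz")
```

Using the median rather than the mean of the spacings keeps the estimate stable when one spurious maximum is detected between two true peaks.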
Conclusions. An approach has been proposed for defining the characteristic features of a speech signal based on the detection and use of self-similar signal structures as identifying features. The possible causes of identification accuracy loss in speech signals have been examined, and an effort has been made to mitigate them within the proposed identification method. Key components for enhancing signal identification accuracy include:
• Segmenting sound fragments into 16.67 ms frames, corresponding to the processing interval typical for the human brain;
• Extracting distinctive features of the voice signal in the frequency domain by analyzing the maxima of wavelet transform coefficients;
• Employing wavelet frames with a frequency resolution of 1 Hz for detailed analysis;
• Using not only the fundamental frequency but also a broader range of frequency characteristics derived from the wavelet transform as identifying features.
References:
1. Itzhak Fried, Ueli Rutishauser, Moran Cerf, Gabriel Kreiman. Single Neuron Studies of the Human Brain: Probing Cognition. - The MIT Press, 2014. - pp. 365
2. Skopina M., Krivoshein A., Protasov V. Multivariate Wavelet Frames. - Springer, 2017. - pp. 261
3. Benedetto J.J., Treiber O.M. (2001) Wavelet Frames: Multiresolution Analysis and Extension Principles. In: Debnath L. (eds) Wavelet Transforms and Time-Frequency Signal Analysis. Applied and Numerical Harmonic Analysis. Birkhäuser, Boston, MA. https://doi.org/10.1007/978-1-4612-0137-3_1
4. V. Solovyov, Y. Byelozorova. Multifractal approach in pattern recognition of an announcer’s voice. Polish Academy of Sciences University of Engineering and Economics in Rzeszów, Teka, Vol. 14, no. 2, pp. 164-170, 2014.