Researchers at HSE Campus in Nizhny Novgorod have introduced a new algorithm for precisely measuring the pitch frequency of a speech signal, a parameter crucial for identifying emotions and diagnosing illnesses. The method operates in noisy environments, in real time, and with fewer computing resources than existing alternatives. The results of the study have been published in the Journal of Communications Technology and Electronics.
This research was carried out as part of the 'Efficient audiovisual analysis of dynamical changes in emotional state based on an information-theoretic approach' project supported by the Russian Science Foundation.
Voice control is no longer limited to smartphones: even smart kettles and irons can now be operated by voice commands. Despite significant progress in machine learning and speech processing, accurately recognising emotions remains a major challenge. For AI-equipped devices to recognise human emotions, they need to process a person's voice more effectively. In particular, this concerns the pitch frequency, a parameter that reflects the vibration of the vocal cords when vowels are pronounced.
The study carried out by scientists at HSE Campus in Nizhny Novgorod aimed to develop an effective method for determining the pitch frequency of speech signals. The authors measured changes in pitch frequency, which varies widely, from 200 to 400 Hz in women and from 80 to 200 Hz in men, applying mathematical tools such as the fast Fourier transform (FFT) to analyse audio recordings and track these changes.
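The article does not disclose the researchers' implementation, but the basic idea of locating the pitch as the dominant FFT peak within the voiced range can be sketched as follows. The function name, sampling rate, and synthetic test signal are illustrative assumptions, not details from the paper:

```python
import numpy as np

def estimate_pitch_fft(frame, fs, fmin=80.0, fmax=400.0):
    """Return the frequency of the strongest spectral peak in [fmin, fmax] Hz."""
    windowed = frame * np.hanning(len(frame))    # taper to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)     # restrict search to the voiced range
    return freqs[band][np.argmax(spectrum[band])]

# Synthetic "vowel": a 120 Hz fundamental plus two weaker harmonics
fs = 16000
t = np.arange(int(0.2 * fs)) / fs                # one 200 ms analysis frame
frame = sum(np.sin(2 * np.pi * k * 120.0 * t) / k for k in (1, 2, 3))
print(round(estimate_pitch_fft(frame, fs), 1))
```

With a 200 ms frame the frequency resolution is 5 Hz, which is why longer analysis windows are preferred when the pitch must be tracked precisely.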
When there is background noise, or a low-quality microphone is used, a straightforward application of the FFT may be neither effective nor accurate. To address this, the authors applied additional processing to the audio spectrum: they developed a self-learning algorithm that combines a single-layer neural network with a whitening filter. The method focuses on the parts of an utterance that are associated with pitch frequency and, consequently, with the expression of emotions.
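The published summary gives no implementation details, but a self-learning stage of this kind is commonly realised as an adaptive linear predictor: a single linear neuron that learns, sample by sample, to predict the signal from its recent past, so that its prediction error is the whitened residual. A minimal sketch using the normalised LMS rule follows; all names and parameters here are illustrative assumptions, not the authors' code:

```python
import numpy as np

def nlms_whiten(x, order=12, mu=0.5):
    """Single-layer adaptive predictor: the weights w act as the neuron's
    synapses, and the prediction error e is the whitened output."""
    w = np.zeros(order)
    e = np.zeros(len(x))
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]                    # most recent sample first
        e[n] = x[n] - w @ past                         # prediction error
        w += mu * e[n] * past / (past @ past + 1e-8)   # normalised LMS update
    return e, w

# Strongly correlated input: a 150 Hz tone in mild white noise
rng = np.random.default_rng(1)
fs = 8000
t = np.arange(4000) / fs
x = np.sin(2 * np.pi * 150.0 * t) + 0.05 * rng.standard_normal(len(t))
e, w = nlms_whiten(x)
print(e[2000:].var() < 0.1 * x.var())   # residual is far weaker than the input
```

Once the predictor converges, the tonal component is removed and only a noise-like residual remains, which is exactly the behaviour a whitening stage needs.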
The whitening filter reverses the process of speech formation: it extracts parameters, the linear prediction coefficients, from the incoming speech signal and attempts to produce white noise at its output. The researchers propose estimating the parameters of the whitening filter so that the resulting power spectral density closely matches the Fourier transform, as judged by a specially developed spectral distortion measure.
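Linear-prediction whitening of this kind can be sketched as follows: LPC coefficients are fitted to the frame's autocorrelation by the Levinson-Durbin recursion, and the inverse FIR filter built from them flattens the spectrum, leaving a residual close to white noise. This is a generic textbook sketch, not the authors' estimator or their spectral distortion measure:

```python
import numpy as np

def lpc_coefficients(x, order):
    """Fit LPC coefficients a[0..order] (a[0] = 1) by Levinson-Durbin."""
    n = len(x)
    r = [float(np.dot(x[:n - k], x[k:])) for k in range(order + 1)]
    a, err = [1.0], r[0]
    for i in range(1, order + 1):
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= 1.0 - k * k
    return np.array(a)

def whiten(x, a):
    """Apply the inverse (whitening) filter: e[n] = sum_k a[k] * x[n-k]."""
    return np.convolve(x, a, mode="full")[:len(x)]

# AR(2) test signal: white noise coloured by a strong resonance
rng = np.random.default_rng(0)
noise = rng.standard_normal(8000)
x = np.zeros_like(noise)
for n in range(2, len(x)):
    x[n] = 1.6 * x[n - 1] - 0.9 * x[n - 2] + noise[n]

a = lpc_coefficients(x, order=2)    # should recover roughly [1, -1.6, 0.9]
residual = whiten(x, a)
print(np.round(a, 2), x.var() / residual.var())
```

Because the test signal is itself an autoregressive process, the recovered coefficients invert it almost exactly, and the residual variance drops back to that of the driving noise.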
The researchers suggest that this new tool for working with acoustic data has a wide range of applications, including in the fields of psychology and medicine. For instance, identifying the pitch frequency can aid in detecting voice pathologies during the diagnosis of neurodegenerative diseases.