GRaNN: feature selection with golden ratio-aided neural network for emotion, gender and speaker identification from voice signals
Author
dc.contributor.author
Garain, Avishek
Author
dc.contributor.author
Ray, Biswarup
Author
dc.contributor.author
Giampaolo, Fabio
Author
dc.contributor.author
Velásquez Silva, Juan Domingo
Author
dc.contributor.author
Singh, Pawan Kumar
Author
dc.contributor.author
Sarkar, Ram
Accession date
dc.date.accessioned
2022-06-08T20:58:45Z
Available date
dc.date.available
2022-06-08T20:58:45Z
Publication date
dc.date.issued
2022
Item citation
dc.identifier.citation
Neural Computing and Applications (2022)
Identifier
dc.identifier.other
10.1007/s00521-022-07261-x
Identifier
dc.identifier.uri
https://repositorio.uchile.cl/handle/2250/185938
Abstract
dc.description.abstract
Compared to other features of the human body, voice is quite complex and dynamic, in the sense that speech can be spoken in various languages, with different accents and in different emotional states. Recognizing the gender, i.e. male or female, from the voice of an individual is by all accounts a trivial task for human beings. The same goes for speaker identification if we have been acquainted with the speaker for a long time. Our ears function as the front end, receiving the sound signals, which our brain then processes to reach a decision. Although trivial for us, these tasks are challenging for any computing device to mimic. Automatic gender, emotion and speaker identification systems have many applications in surveillance, multimedia technology, robotics and social media. In this paper, we propose a Golden Ratio-aided Neural Network (GRaNN) architecture for the said purposes. As deciding the number of units for each layer in a deep neural network (NN) is a challenging issue, we have done this using the concept of the Golden Ratio. Prior to that, an optimal subset of features is selected from the feature vector, common to all three tasks, extracted from spectral images obtained from the input voice signals. We have used a wrapper-filter framework where features selected by minimum redundancy maximum relevance (mRMR) are fed to the Mayfly algorithm combined with the adaptive β-hill climbing (AβHC) algorithm. Our model achieves accuracies of 99.306% and 95.68% for gender identification on the RAVDESS and Voice Gender datasets, respectively, 95.27% for emotion identification on the RAVDESS dataset and 67.172% for speaker identification on the RAVDESS dataset. Performance comparison of this model with existing models on the publicly available datasets confirms its superiority over those models. Results also confirm that we have chosen the common feature set meticulously, as it works equally well on three different pattern classification tasks. The proposed wrapper-filter framework reduces the feature dimension significantly, thereby lessening the storage requirement and training time. Finally, strategically selecting the number of units in each layer of the NN helps increase the overall performance on all three pattern classification tasks.
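To illustrate the layer-sizing idea described in the abstract, below is a minimal Python sketch in which each hidden layer's width is obtained by dividing the previous width by the golden ratio φ ≈ 1.618. The function name, the shrink-from-input rule and the `min_units` floor are assumptions for illustration; the paper's exact sizing formula is not reproduced here.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, approximately 1.618

def golden_ratio_layer_sizes(input_dim, num_layers, min_units=2):
    """Hypothetical layer-sizing rule: each hidden layer has the
    previous layer's width divided by the golden ratio, rounded,
    and never below min_units. The paper's exact rule may differ."""
    sizes = []
    width = input_dim
    for _ in range(num_layers):
        width = max(min_units, round(width / PHI))
        sizes.append(width)
    return sizes

# Example: a 128-dimensional selected feature vector, 4 hidden layers
print(golden_ratio_layer_sizes(128, 4))  # -> [79, 49, 30, 19]
```

Under this assumed rule, the network narrows geometrically toward the output, which keeps the per-layer width a deterministic function of the input dimension rather than a free hyperparameter.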
Sponsor
dc.description.sponsorship
ANID PIA/APOYO AFB180003
Language
dc.language.iso
en
Publisher
dc.publisher
Springer
Type of license
dc.rights
Attribution-NonCommercial-NoDerivs 3.0 United States