Phoneme recognition with frequency compaction via centroids and stacked autoencoder networks.

Bibliographic details
Year of defense: 2024
Main author: PEREIRA, Bianca Valéria Lopes
Advisor: ALMEIDA NETO, Areolino de
Examination committee: ALMEIDA NETO, Areolino de; OLIVEIRA, Alexandre César Muniz de; SAMPAIO NETO, Nelson Cruz
Document type: Dissertation
Access type: Open access
Language: Portuguese
Defending institution: Universidade Federal do Maranhão
Graduate program: PROGRAMA DE PÓS-GRADUAÇÃO EM CIÊNCIA DA COMPUTAÇÃO/CCET
Department: DEPARTAMENTO DE INFORMÁTICA/CCET
Country: Brazil
Keywords in Portuguese:
Keywords in English:
CNPq knowledge area:
Access link: https://tedebc.ufma.br/jspui/handle/tede/5486
Abstract: Phoneme recognition is an area of linguistics and speech processing concerned with identifying and distinguishing the distinctive sounds that make up a language. It requires discerning and categorizing speech sounds despite variations in pronunciation, context, or intonation. This work proposes a phoneme recognition model based on a stacked autoencoder network called CollabNet. In contrast to the traditional stacking of autoencoders, CollabNet introduces a collaborative method for inserting new hidden layers: each new layer is added in a coordinated and gradual manner, which lets the designer control its influence on training. This collaboration ensures that the learning of the new layer is effectively integrated with that of the previous layers, resulting in more aligned and efficient training. To represent the phonemes, the frequencies were compacted using centroids so as to preserve the particularities of each sound. To create a geometric representation of the audio recordings in the databases, the fast Fourier transform (FFT) was computed for each audio sample, the resulting frequencies were grouped, and the centroid of each group was calculated. The deep stacked autoencoder network was then parameterized and trained to recognize phonetic syllables. This representation preserved the distinctive characteristics of the recordings, allowing CollabNet to identify the various sounds of Brazilian Portuguese with an accuracy of 75.96% and a phoneme error rate (PER) of 23.73%.
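
The feature-extraction pipeline described in the abstract (FFT, grouping of frequencies, one centroid per group) can be illustrated with a minimal Python sketch. The abstract does not say how the frequencies were grouped or how the centroid was defined, so everything specific here is an assumption made for illustration: the groups are equal-width contiguous frequency bands, the centroid is the magnitude-weighted mean frequency paired with the mean magnitude of the band, and the names `fft`, `centroid_features`, and `n_groups` are hypothetical. It is not the dissertation's implementation.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def centroid_features(samples, sample_rate, n_groups):
    """Compact a magnitude spectrum into one (frequency, magnitude) pair per group.

    Assumptions (not taken from the source): groups are equal-width contiguous
    bands, and each centroid is the magnitude-weighted mean frequency of the
    band together with the band's mean magnitude.
    """
    n = len(samples)
    spectrum = fft(samples)
    half = n // 2  # keep only the bins below the Nyquist frequency
    freqs = [k * sample_rate / n for k in range(half)]
    mags = [abs(spectrum[k]) for k in range(half)]
    size = half // n_groups  # bins per group; requires n_groups <= half
    features = []
    for g in range(n_groups):
        f = freqs[g * size:(g + 1) * size]
        m = mags[g * size:(g + 1) * size]
        total = sum(m)
        if total > 0:
            f_c = sum(fi * mi for fi, mi in zip(f, m)) / total
        else:
            f_c = sum(f) / len(f)  # silent band: fall back to the band midpoint
        features.append((f_c, total / len(m)))
    return features


# Usage: a pure 1000 Hz tone sampled at 8000 Hz, compacted into 4 groups.
# The tone falls in the second band, so that band's centroid should sit
# close to 1000 Hz and carry almost all of the magnitude.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
features = centroid_features(tone, sample_rate=8000, n_groups=4)
for f_c, mean_mag in features:
    print(f"{f_c:9.2f} Hz  {mean_mag:10.4f}")
```

Under these assumptions, a 128-bin half-spectrum is reduced to four pairs of numbers, which is the kind of compact input vector a stacked autoencoder could consume. A real pipeline would more likely use an optimized FFT routine and a grouping tuned to speech.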