Avaliação da qualidade da sintetização de fala gerada por modelos de redes neurais profundas (Evaluation of the quality of speech synthesis generated by deep neural network models)

Oliveira, Frederico Santos de

Bibliographic details
Year of defense:	2023
Main author:	Oliveira, Frederico Santos de
Advisor:	Soares, Anderson da Silva
Defense committee:	Soares, Anderson da Silva; Aluisio, Sandra Maria; Duarte, Julio Cesar; Laureano, Gustavo Teodoro; Galvão Filho, Arlindo Rodrigues
Document type:	Thesis
Access type:	Open access
Language:	por
Defending institution:	Universidade Federal de Goiás
Graduate program:	Programa de Pós-graduação em Ciência da Computação (INF)
Department:	Instituto de Informática - INF (RMG)
Country:	Brazil
Keywords in Portuguese:	Avaliação da fala; Avaliação da fala sintetizada; Predição de MOS; Redes neurais profundas; Predição da qualidade
Keywords in English:	Speech assessment; Synthesized speech assessment; MOS prediction; Deep neural networks; Quality prediction
CNPq knowledge area:	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
Access link:	http://repositorio.bc.ufg.br/tede/handle/tede/12916
Abstract:	With the emergence of intelligent personal assistants, the need for high-quality conversational interfaces has increased. While text-based chatbots are popular, the development of voice interfaces is equally important. However, voice-based conversational models are evaluated primarily through the Mean Opinion Score (MOS), which relies on a manual and subjective process. In this context, this thesis contributes a new methodology for evaluating voice-based conversational interfaces, with a case study conducted in Brazilian Portuguese. The proposed methodology includes an architecture for predicting the quality of synthesized speech in Brazilian Portuguese, correlated with MOS. To evaluate the methodology, Text-to-Speech models were trained to create a dataset called BRSpeechMOS. Details of the dataset's creation are presented, along with a qualitative and quantitative analysis of it. A series of experiments was conducted to train various architectures, based on supervised and self-supervised learning, on the BRSpeechMOS dataset. The results confirm the hypothesis that models pre-trained on voice processing tasks, such as speaker verification and automatic speech recognition, produce acoustic representations suitable for predicting speech quality, contributing to the advancement of the state of the art in evaluation methodologies for conversational models.
