Optimizing CleanUNet architecture for speech denoising

Bibliographic details
Year of defense: 2024
Main author: Silva, Matheus Vieira da
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Master's thesis (Dissertação)
Access type: Open access
Language: eng
Defense institution: Universidade Federal de Uberlândia
Brazil
Programa de Pós-graduação em Ciência da Computação
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Link de acesso: https://repositorio.ufu.br/handle/123456789/44653
http://doi.org/10.14393/ufu.di.2024.5523
Abstract: Speech enhancement techniques are crucial for recovering clean speech from signals degraded by noise and suboptimal acoustic conditions, such as background noise and echo. These challenges demand effective denoising methods to improve speech clarity. This work presents an optimized version of CleanUNet, a Convolutional Neural Network based on the U-Net architecture designed specifically for causal speech-denoising tasks. Our approach introduces the Mamba architecture as a novel alternative to the traditional transformer bottleneck, enabling more efficient handling of encoder outputs with linear complexity. Additionally, we integrated batch normalization across the convolutional layers, stabilizing and accelerating the training process. We also experimented with various activation functions to identify the most effective configuration for our model. By reducing the number of hidden channels in the convolutional layers, we significantly reduced the model's parameter count, thereby improving training and inference speed on a single GPU with only a slight degradation in performance. These improvements make the model particularly suitable for real-time applications. Our best model, 52.53% smaller than the baseline, achieves PESQ (WB), PESQ (NB), and STOI scores of 2.745, 3.288, and 0.911, respectively. Our smallest model uses only 1.36% of the original parameters and still achieves competitive results. To the best of our knowledge, this work is the first to integrate the Mamba architecture as a replacement for the vanilla transformer in CleanUNet and, in combination with architectural optimizations, offers a streamlined, computationally efficient solution for speech enhancement.
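The abstract's claim of linear complexity rests on the state-space recurrence at the core of Mamba: each timestep updates a fixed-size hidden state from the previous one, so a sequence of length T costs O(T) rather than the O(T²) of self-attention. The sketch below illustrates this with a simplified, non-selective linear state-space scan in NumPy; all dimensions and matrices are illustrative assumptions, not CleanUNet's actual bottleneck configuration or the full selective Mamba mechanism.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Causal linear state-space scan, O(T) in sequence length T.

    Simplified (non-selective) version of the recurrence underlying
    Mamba-style blocks:  h_t = A h_{t-1} + B x_t,  y_t = C h_t.
    Shapes and parameters here are purely illustrative.
    """
    T, _ = x.shape
    d_state = A.shape[0]
    h = np.zeros(d_state)
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]  # state update: depends only on past inputs
        y[t] = C @ h          # readout from the fixed-size state
    return y

# Toy dimensions (assumed for illustration only)
rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 4
A = 0.9 * np.eye(d_state)                       # stable diagonal dynamics
B = 0.1 * rng.standard_normal((d_state, d_in))
C = 0.1 * rng.standard_normal((d_out, d_state))
x = rng.standard_normal((T, d_in))

y = ssm_scan(x, A, B, C)

# Causality check: perturbing a future input leaves earlier outputs unchanged
x2 = x.copy()
x2[10] += 5.0
y2 = ssm_scan(x2, A, B, C)
assert np.allclose(y[:10], y2[:10])
```

Because the state h has fixed size regardless of T, memory and compute grow linearly with the sequence, which is what makes such a bottleneck attractive for the causal, real-time setting the thesis targets.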