Sound pressure level prediction from video frames using deep convolutional neural networks
Year of defense: | 2019 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Dissertation |
Access type: | Open access |
Language: | eng |
Defending institution: |
Universidade Federal do Rio de Janeiro
Brasil Instituto Alberto Luiz Coimbra de Pós-Graduação e Pesquisa de Engenharia Programa de Pós-Graduação em Engenharia Elétrica UFRJ |
Graduate program: |
Not provided by the institution
|
Department: |
Not provided by the institution
|
Country: |
Not provided by the institution
|
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/11422/14030 |
Abstract: | Some CCTV systems do not have microphones. As a result, sound pressure information is not available in such systems. A method to generate traffic sound pressure estimates using solely video frames as input data is presented. To that end, we trained several combinations of models based on pretrained convolutional networks using a dataset that was automatically generated by a single camera with a mono microphone pointing at a busy traffic crossroads with cars, trucks, and motorbikes. To train the networks on that dataset, color images are used as inputs and true sound pressure level values are used as targets. A correlation of 0.607 in preliminary results suggests that sound pressure level targets are sufficient for convolutional neural networks to detect sound-generating sources within a traffic scene. This hypothesis is tested by evaluating the class activation maps (CAM) of a model with the required structure of a global average pooling layer followed by a fully connected layer. We find that the CAM strongly highlights sources that produce large sound pressure values, such as buses, and faintly highlights objects associated with lower sound pressure, such as cars. The neural network with the lowest MSE was cross-validated with 6 folds, and the best model was evaluated on the test set. The best model attained a correlation of approximately 0.6 in three of the test videos and correlations of 0.272 and 0.207 in the other two. The low correlation in these two videos was associated with a traffic warden who constantly whistles, a characteristic not present in the training set. The overall correlation using the whole test set was 0.647. A correlation of 0.844 with a longer-term (1 minute) sound pressure level (Leq) estimate using all test videos indicates that estimation of longer-term sound pressure levels is less sensitive to sporadic noise in the dataset. |
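
The abstract evaluates predictions both as short-term sound pressure levels and as a 1-minute equivalent level (Leq), each scored by correlation against the measured values. The following is a minimal sketch of those two computations, not the author's code. It assumes the SPL values are equal-duration samples in dB and uses the standard energy-average definition of Leq. The function names `leq`, `windowed_leq`, and `pearson` and the sample values are hypothetical, chosen only for illustration.

```python
import math


def leq(spl_db):
    """Equivalent continuous level: energy-average of equal-duration SPL samples (dB)."""
    energies = [10.0 ** (level / 10.0) for level in spl_db]
    return 10.0 * math.log10(sum(energies) / len(energies))


def windowed_leq(spl_db, window):
    """Leq over consecutive non-overlapping windows; an incomplete tail is dropped."""
    full = len(spl_db) - len(spl_db) % window
    return [leq(spl_db[i:i + window]) for i in range(0, full, window)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Illustrative values only (assumption): measured and predicted SPL in dB.
    measured = [62.0, 65.0, 71.0, 64.0, 80.0, 66.0, 63.0, 69.0]
    predicted = [63.0, 64.0, 69.0, 66.0, 70.0, 67.0, 62.0, 68.0]

    # Short-term agreement, sample by sample.
    print("short-term r:", pearson(measured, predicted))

    # Longer-term agreement: aggregate both series into windowed Leq first.
    window = 2  # hypothetical window length, in samples
    print("windowed-Leq r:", pearson(windowed_leq(measured, window),
                                     windowed_leq(predicted, window)))
```

For a 1-minute Leq, `window` would be the number of SPL samples per minute, which depends on the sampling rate of the targets and is not stated in the abstract.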