Content-based video retrieval from natural language

Bibliographic details
Year of defense: 2022
Main author: Jorge, Oliver Cabral
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Master's thesis
Access type: Open access
Language: eng
Defending institution: Universidade Tecnológica Federal do Paraná (UTFPR), Curitiba, Brazil, Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: http://repositorio.utfpr.edu.br/jspui/handle/1/29964
Abstract: Videos are increasingly becoming the most common means of communication, driven by the popularization of affordable recording devices and of social networks such as TikTok and Instagram. The most common ways of searching for videos on these networks, as on search portals, rely on metadata attached to the videos through keywords and prior classification. Keyword search, however, depends on knowing exactly what one is looking for and is not necessarily effective when trying to find a particular video from a description, however superficial, of a particular scene, which can lead to frustrating search results. The objective of this work is to find a particular video within a list of available videos from a textual description in natural language, based only on the content of its scenes and without relying on previously cataloged metadata. From a dataset containing videos with a fixed number of descriptions of their scenes, a Siamese network with a triplet loss function was modeled to identify, in a shared embedding space, the similarities between two modalities: information extracted from a video and information extracted from a natural language text. The final architecture of the model, as well as the values of its parameters, was defined through tests guided by the best results obtained. Because the videos are not classified into groups or classes, and because the triplet loss function requires an anchor text and two video examples, one positive and one negative, selecting the negative examples needed for training proved difficult. Two strategies for choosing negative videos were therefore tested: a random choice and a directed choice based on the distances between the available video descriptions in the training phase, with the random choice proving the more effective. At the end of the tests, the searched video was retrieved exactly in 10.67% of the cases at top 1 and in 49.80% of the cases at top 10. Beyond the numerical results, a qualitative analysis was conducted. It showed that the model does not behave satisfactorily for single-word queries, performing better on more elaborate descriptions. Satisfactory results are also mainly associated with the use of verbs and nouns, and less with adjectives and adverbs. It was further observed that the returned videos share, in some way, scenes or topics with the query text, indicating that the network captured the meaning of the original text query. Overall, the results are promising and encourage continuing the research. Future work will include new models for extracting information from videos and texts, as well as further study of the controlled choice of negative video examples to reinforce training.
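
For illustration only, the minimal sketch below (in Python/PyTorch) shows one plausible way to set up the kind of model the abstract describes: two encoders project text features and video features into a shared embedding space and are trained with a triplet loss, with the negative video chosen at random for each anchor text. The layer sizes, feature dimensions, margin, and sampling details are assumptions made for the example, not the configuration used in the dissertation.

# Illustrative sketch: cross-modal Siamese setup with triplet loss and
# random negative sampling. All dimensions and hyperparameters are assumed.
import random
import torch
import torch.nn as nn

class Projection(nn.Module):
    """Maps one modality's features into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances are comparable across modalities
        return nn.functional.normalize(self.net(x), dim=-1)

text_encoder = Projection(in_dim=768)    # e.g. sentence-embedding features (assumed size)
video_encoder = Projection(in_dim=1024)  # e.g. pooled frame features (assumed size)
criterion = nn.TripletMarginLoss(margin=0.2)  # margin value is illustrative
params = list(text_encoder.parameters()) + list(video_encoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def training_step(text_feat, pos_video_feat, all_video_feats, pos_index):
    """One update: anchor text, its matching video, and a randomly drawn negative video."""
    neg_index = random.choice([i for i in range(len(all_video_feats)) if i != pos_index])
    anchor = text_encoder(text_feat)
    positive = video_encoder(pos_video_feat)
    negative = video_encoder(all_video_feats[neg_index])
    loss = criterion(anchor, positive, negative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At retrieval time, a query description would be encoded with the text encoder and compared against the precomputed video embeddings by distance, which is what the top-1 and top-10 figures in the abstract measure.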