Graph to sequence syntactic pattern recognition for image classification problems

Bibliographic details
Year of defense: 2021
Main author: Gilberto Astolfi
Advisor: Hemerson Pistori
Defense committee: Not informed by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese (por)
Defending institution: Fundação Universidade Federal de Mato Grosso do Sul
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Brazil
Keywords in Portuguese:
Access link: https://repositorio.ufms.br/handle/123456789/3913
Abstract: A growing interest in applying Natural Language Processing (NLP) models to computer vision problems has recently emerged. This interest is motivated by the success of NLP models in tasks such as translation and text summarization. In this work, a new method for applying NLP to image classification problems is proposed. The aim is to represent the visual patterns of objects as sequences of alphabet symbols and then train some form of Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), or Transformer on these sequences to classify objects. Representing the visual patterns of objects syntactically allows NLP models to be applied to image classification problems in a natural way, i.e., in the same way as they are applied to natural language problems. Two approaches to representing the visual patterns of objects syntactically were investigated: representation using keypoints and representation using component parts of objects. In the keypoint-based approach, keypoints are identified in the images, associated with alphabet symbols, and then related through a graph to derive strings from the images. These strings are the inputs for training an LSTM encoder. Experiments showed evidence that the syntactic pattern representation can capture visual variations in superpixel images acquired by Unmanned Aerial Vehicles, even when the training set is small. In the component-part-based approach, the component parts are provided as bounding boxes in the images. The component parts are associated with alphabet symbols and related to each other to derive a sequence of symbols that represents the object's visual pattern. Then, some form of GRU, LSTM, or Transformer is trained to learn the spatial relation between the component parts of the objects contained in the sequences.
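To make the keypoint-based pipeline concrete, the following is a minimal, hypothetical sketch of its first stage: each keypoint's descriptor is quantized to the nearest centroid of a small alphabet, and the keypoints are then chained into a string by a greedy nearest-neighbour walk standing in for the graph relation described above. All names, the toy data, and the scalar descriptors are illustrative assumptions, not the thesis's actual implementation.

```python
import math

# Illustrative alphabet: one symbol per descriptor centroid.
ALPHABET = "abcd"

def symbol_for(descriptor, centroids):
    """Assign the symbol of the centroid nearest to a scalar descriptor."""
    best = min(range(len(centroids)), key=lambda i: abs(descriptor - centroids[i]))
    return ALPHABET[best]

def string_from_keypoints(keypoints, centroids):
    """Derive a string from keypoints (x, y, descriptor) by visiting them
    along a greedy nearest-neighbour path starting at the leftmost point,
    a simple stand-in for the graph-based ordering used in the thesis."""
    remaining = sorted(keypoints)
    path = [remaining.pop(0)]
    while remaining:
        last = path[-1]
        nxt = min(remaining,
                  key=lambda p: math.hypot(p[0] - last[0], p[1] - last[1]))
        remaining.remove(nxt)
        path.append(nxt)
    return "".join(symbol_for(p[2], centroids) for p in path)

# Toy data: four keypoints and four descriptor centroids.
centroids = [0.0, 1.0, 2.0, 3.0]
keypoints = [(0, 0, 0.1), (1, 0, 1.2), (1, 1, 2.9), (0, 1, 2.1)]
print(string_from_keypoints(keypoints, centroids))  # prints "acdb" for this toy data
```

Strings produced this way would then be fed to a sequence model (e.g., an LSTM encoder) exactly as sentences are in NLP, which is the parallel the abstract draws.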
An extensive experimental evaluation using a limited number of training samples was conducted to compare our method with the ResNet-50 deep learning architecture. The proposed method outperformed ResNet-50 in all test scenarios; in one test, it achieved an average accuracy of 95.3%, against 89.9% for ResNet-50. Both experiments showed evidence that, from a finite set of primitive structures, it is possible to obtain many variations in the visual pattern of an object, even when there are few samples for training. Moreover, the experiments showed that NLP models can be applied in a natural way to image classification problems in computer vision.