Semiautomatic intent learning and labeling for natural language text classification models

Bibliographic details
Year of defense: 2024
Main author: Santos Júnior, Valmir Oliveira dos
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Defending institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://repositorio.ufc.br/handle/riufc/78233
Abstract: Chatbots are increasingly used as service interfaces. One of a chatbot's main components is the natural language understanding (NLU) module, which interprets the text, extracts the intent, and identifies the entities present. It is possible to focus on just one of these NLU tasks, such as intent classification. Training an NLU intent classification model usually requires a considerable amount of annotated data, in which each sentence in the dataset is labeled with an intent. Depending on the volume of data, manual labeling can be laborious and time-consuming. An unsupervised machine learning technique, such as data clustering, can therefore be applied to find and label patterns. This task requires an effective vector representation of text that captures semantic information and helps the machine understand the context, intent, and other nuances of the entire text. This work extensively evaluates different text embedding models for clustering and labeling. Some operations are also applied to improve the dataset's quality, discarding the least representative sentences of each generated group. Intent classification models are then trained with two neural network architectures, using service text from PPC. A dataset was also manually annotated to serve as validation data. The semiautomatic labeling studied here, implemented through data clustering and visual inspection, introduced some labeling errors into the intent classification models; however, manually annotating the entire dataset would have been unfeasible. Even so, the resulting models achieved over 98% accuracy on test data and over 96% on validation data.
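
The labeling pipeline described in the abstract (embed the sentences, cluster them, then discard the least representative sentences of each group) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes a toy bag-of-words vector as a stand-in for the text embedding models evaluated in the work, plain k-means as a stand-in for the clustering step (the abstract does not name the algorithm), and a hypothetical `keep_ratio` parameter to decide how many sentences survive per cluster. The example sentences are invented for illustration.

```python
import math
from collections import Counter


def embed(sentences):
    """Toy L2-normalised bag-of-words vectors (assumption: stands in for a real embedding model)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    vectors = []
    for s in sentences:
        counts = Counter(s.lower().split())
        v = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation (assumed clustering step)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(dist(v, c) for c in centroids)))
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: dist(v, centroids[j])) for v in vectors]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # leave the centroid of an empty cluster unchanged
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids


def prune(sentences, vectors, assign, centroids, keep_ratio=0.6):
    """Keep, per cluster, the keep_ratio fraction of sentences closest to the centroid.

    keep_ratio is a hypothetical parameter; the cluster index serves as a
    provisional intent label until a human names the cluster.
    """
    kept = []
    for j in range(len(centroids)):
        members = [i for i, a in enumerate(assign) if a == j]
        members.sort(key=lambda i: dist(vectors[i], centroids[j]))
        n_keep = max(1, math.ceil(len(members) * keep_ratio))
        kept.extend((sentences[i], j) for i in members[:n_keep])
    return kept


# Invented example sentences covering two intents.
sentences = [
    "i want to check my account balance",
    "show my account balance please",
    "what is my account balance",
    "i need to cancel my order",
    "please cancel my order now",
    "cancel the order i placed",
]
vectors = embed(sentences)
assign, centroids = kmeans(vectors, k=2)
for sentence, label in prune(sentences, vectors, assign, centroids):
    print(label, sentence)
```

In a workflow like the one the abstract outlines, each cluster would then be inspected visually and given an intent name, and the pruned, labeled sentences would become training data for the intent classifier. The distance-to-centroid criterion used here is only one plausible reading of "least representative"; the dissertation may define it differently.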