Análise de textos por meio de processos estocásticos na representação word2vec (Text analysis via stochastic processes in the word2vec representation)

Bibliographic details
Year of defense: 2021
Main author: Massoni, Gabriela
Advisor: Stern, Rafael Bassi
Defense committee: Not provided by the institution
Document type: Master's dissertation
Access type: Open access
Language: Portuguese
Institution of defense: Universidade Federal de São Carlos
São Carlos Campus
Graduate program: Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Keywords in English:
CNPq knowledge area:
Access link: https://repositorio.ufscar.br/handle/20.500.14289/14241
Abstract: In Natural Language Processing (NLP), the word2vec model has been extensively explored for the vector representation of words. It is a neural network based on the hypothesis that similar words appear in similar contexts. In the literature, a text is usually represented by the mean vector of the representations of its words, which is then used as an explanatory variable in predictive models. An alternative is to use other summary measures in addition to the mean, such as the standard deviation and position measures. However, these measures assume that the order of the words does not matter. In this dissertation we therefore explore the use of stochastic processes, in particular time series models and Hidden Markov Models (HMM), to incorporate the "chronological" order of words into the construction of explanatory variables from the vector representation given by word2vec. The impact of this approach is measured by the predictive quality of the resulting models on real data and compared with that of the usual approaches. For the data analysed, the proposed approaches perform better than or equivalently to the usual approaches in most cases.
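
The contrast drawn in the abstract between order-blind summaries and order-aware features can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two-dimensional `EMBEDDINGS` values are invented rather than real word2vec output, the helper names are hypothetical, and the lag-1 autocorrelation is only a simple stand-in for the time series and HMM features studied in the dissertation, not the author's method.

```python
from statistics import fmean, pstdev

# Toy 2-dimensional "embeddings" (invented values, NOT real word2vec output).
EMBEDDINGS = {
    "the": [0.1, 0.3],
    "movie": [0.7, 0.2],
    "was": [0.0, 0.4],
    "not": [-0.6, 0.1],
    "good": [0.8, 0.9],
}


def text_to_series(tokens):
    """Map tokens to their vectors, preserving word order."""
    return [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]


def mean_features(series):
    """Per-dimension mean: the usual order-blind text representation."""
    return [fmean(col) for col in zip(*series)]


def std_features(series):
    """Per-dimension standard deviation: also order-blind."""
    return [pstdev(col) for col in zip(*series)]


def lag1_autocorr(col):
    """Lag-1 autocorrelation of one embedding dimension along the text."""
    m = fmean(col)
    denom = sum((x - m) ** 2 for x in col)
    if denom == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(col, col[1:]))
    return num / denom


def order_features(series):
    """Order-sensitive stand-in feature, one value per dimension."""
    return [lag1_autocorr(col) for col in zip(*series)]


# The same five words in two different orders.
text_a = ["the", "movie", "was", "not", "good"]
text_b = ["not", "was", "the", "movie", "good"]
series_a = text_to_series(text_a)
series_b = text_to_series(text_b)

print("mean   a:", mean_features(series_a), " b:", mean_features(series_b))
print("std    a:", std_features(series_a), " b:", std_features(series_b))
print("lag-1  a:", order_features(series_a), " b:", order_features(series_b))
```

Because both texts contain the same words, the mean and standard deviation features coincide, while the lag-1 feature changes with the word order. That is the kind of information a sequence model can pass on to a downstream predictive model.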