Bibliographic details
Year of defense: 2022
Main author: Carmo, Paulo Ricardo Viviurka do
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: eng
Defense institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-11012023-172819/
Abstract:
To use text data in machine learning tasks, the data must be cleaned and transformed into a structured representation. Recently, neural embeddings have been used to encode text data in low-dimensional latent spaces. For example, pre-trained neural language models such as BERT can represent words, sentences, or documents as fixed-dimension embedding vectors. Another way to model text data is to use heterogeneous information networks. This structure models multi-typed data while preserving their relations and characteristics. Heterogeneous information networks also pose challenges for use with off-the-shelf machine learning methods. Network embedding methods allow the extraction of embedding vectors for each node in an information network. However, these methods usually use only the network topology and, sometimes, metadata for the relationships. Embedding propagation methods allow features previously generated with pre-trained methods to be propagated through all network nodes. Information networks that contain some nodes with textual information can use features from pre-trained neural language models for propagation. This master's dissertation presents an embedding propagation method for heterogeneous information networks with some textual nodes. The proposed method combines pre-trained neural language models with the topology of heterogeneous information networks through a regularization function to generate embeddings for non-textual nodes.
Three papers on use case experiments to evaluate and validate the proposed method are presented, where one paper extends the experiments from another: (1) Embedding Propagation over Heterogeneous Event Networks presents the results of the proposed method for event analysis, where it achieved the best performance by at least 3% MRR@k in all scenarios; (2) TRENCHANT: TRENd prediCtion on Heterogeneous informAtion NeTworks extends Commodities trend link prediction on heterogeneous information networks, where the proposed method is evaluated against network embeddings in the task of predicting price trends for commodities, and it achieved the best performance in some scenarios, with its best result being an 8% better F1 when predicting weekly soybean price trends; and (3) NatUKE: Benchmark for Natural Product Knowledge Extraction from Academic Literature, which evaluates the use of network embedding methods for unsupervised knowledge extraction, where the proposed method achieved the best performance in most scenarios, most notably 43% more Hits@1 than baselines when extracting the isolation process type used to obtain a molecule from a certain species. The presented papers show, in three different use cases and experiments, that the proposed method achieves the research goals of propagating the initial embedding from some textual nodes to the remaining nodes in a heterogeneous information network and allowing dynamic insertion of new nodes in the embedding propagation process.
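
The propagation idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's actual method: it assumes a generic regularization-style scheme in which textual nodes keep their pre-trained embeddings and every other node repeatedly takes the average of its neighbors. The function name `propagate_embeddings`, the toy network, and the two-dimensional vectors are all hypothetical; real language-model embeddings would have hundreds of dimensions.

```python
def propagate_embeddings(neighbors, initial, n_iter=100):
    """Spread embeddings from textual nodes to the remaining nodes.

    neighbors: dict mapping each node to the list of its adjacent nodes
    initial:   dict mapping textual nodes to their pre-trained embedding
    (Both structures are assumptions made for this illustration.)
    """
    dim = len(next(iter(initial.values())))
    # Non-textual nodes start from a zero vector.
    emb = {node: list(initial.get(node, [0.0] * dim)) for node in neighbors}
    for _ in range(n_iter):
        updated = {}
        for node, adjacent in neighbors.items():
            if node in initial or not adjacent:
                # Textual nodes stay clamped; isolated nodes are left unchanged.
                updated[node] = emb[node]
                continue
            # Non-textual nodes move to the mean of their neighbors.
            updated[node] = [
                sum(emb[other][k] for other in adjacent) / len(adjacent)
                for k in range(dim)
            ]
        emb = updated
    return emb


# Hypothetical toy network: two news items (textual) linked to a commodity
# node, which is linked to a trend node. Only the news items have text.
network = {
    "news_a": ["soybean"],
    "news_b": ["soybean"],
    "soybean": ["news_a", "news_b", "weekly_trend"],
    "weekly_trend": ["soybean"],
}
text_embeddings = {"news_a": [1.0, 0.0], "news_b": [0.0, 1.0]}

embeddings = propagate_embeddings(network, text_embeddings)
print(embeddings["soybean"], embeddings["weekly_trend"])
```

In this toy case the two non-textual nodes settle midway between the two textual embeddings, which is the intended effect: nodes without text inherit a position in the same space as the nodes that have it.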