On bipartite decision forests

Bibliographic details
Year of defense: 2024
Main author: Silva, Pedro de Carvalho Braga Ilidio
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: eng
Defending institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: https://www.teses.usp.br/teses/disponiveis/76/76133/tde-01072024-082057/
Abstract: The present study investigates decision forest algorithms for predicting interactions in bipartite networks. We concentrate on examples of such problems in the biological domain, such as drug-protein interactions, microRNA-gene interactions or long non-coding RNA-protein interactions. Nevertheless, the proposed methods apply to the broad range of tasks in which i) the goal is to predict interactions between two entities; ii) the interacting pairs are composed of two different types of entities; and iii) each type of entity has its own set of input features. We refer to this paradigm as bipartite interaction learning or bipartite learning. Predicting interactions in such networks poses fundamental challenges. For instance, the number of possible interactions is often very large in comparison to the number of known interactions. As a result, the data is frequently sparse, and negative annotations are unreliable. We explore a class of decision forest models specifically designed to address these challenges, which we broadly call bipartite forests. First, we demonstrate how these trees can be adapted to yield a log n speedup in training time. We also propose using weighted-neighbors approaches to determine each leaf's output, which resulted in improved generalization. Finally, we introduce semi-supervised impurity functions to bipartite forests. These functions result in trees that also consider clusters of instances in the feature space, rather than only their labels. This is shown to improve the forests' resilience to missing annotations. Our models display highly competitive performance across ten interaction prediction datasets. We believe the proposed methods can be a crucial step in developing effective and scalable machine learning models for interaction prediction. Further adaptations of these models could also impact other domains, such as recommendation systems, multilabel learning and weak-label learning.
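
To make conditions i) to iii) of the abstract concrete, the following is a minimal sketch of how a bipartite dataset is commonly laid out: one feature matrix per entity type plus an interaction matrix over all pairs. It is not taken from the dissertation and does not implement bipartite forests. The names `to_pairwise`, `X_drugs`, `X_proteins` and `Y`, along with the toy sizes and values, are assumptions made for illustration only.

```python
import numpy as np


def to_pairwise(X_rows, X_cols, Y):
    """Flatten a bipartite dataset into one sample per (row, column) pair.

    Hypothetical helper for illustration; not the dissertation's method.
    """
    n_rows, n_cols = Y.shape
    # Repeat each row entity n_cols times; cycle through the column entities.
    left = np.repeat(X_rows, n_cols, axis=0)
    right = np.tile(X_cols, (n_rows, 1))
    X_pairs = np.hstack([left, right])
    # Row-major flattening matches the repeat/tile ordering above.
    y_pairs = Y.reshape(-1)
    return X_pairs, y_pairs


rng = np.random.default_rng(0)
X_drugs = rng.normal(size=(4, 3))     # 4 hypothetical drugs, 3 features each
X_proteins = rng.normal(size=(5, 2))  # 5 hypothetical proteins, 2 features each

# Interaction matrix: 1 = known interaction, 0 = unknown (not a confirmed negative).
Y = np.zeros((4, 5), dtype=int)
Y[0, 1] = 1
Y[2, 4] = 1

X_pairs, y_pairs = to_pairwise(X_drugs, X_proteins, Y)
print(X_pairs.shape)  # one row per drug-protein pair
print(Y.mean())       # fraction of pairs with a known interaction
```

In this toy layout only 2 of the 20 possible pairs are annotated as interacting, which mirrors the sparsity the abstract describes. The number of pairs also grows as the product of the two entity counts, which is why the cost of training on such data matters.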