Discovering a domain-specific schema from general-purpose knowledge base

Detalhes bibliográficos
Ano de defesa: 2023
Autor(a) principal: SILVA NETO, Everaldo Costa
Orientador(a): SALGADO, Ana Carolina Brandão
Banca de defesa: Não Informado pela instituição
Tipo de documento: Tese
Tipo de acesso: Acesso aberto
Idioma: eng
Instituição de defesa: Universidade Federal de Pernambuco
Programa de Pós-Graduação: Programa de Pos Graduacao em Ciencia da Computacao
Departamento: Não Informado pela instituição
País: Brasil
Palavras-chave em Português:
Link de acesso: https://repositorio.ufpe.br/handle/123456789/51840
Resumo: General-purpose knowledge bases (KBs), e.g., DBpedia, YAGO, and Wikidata, store fac- tual data about a set of entities. These KBs have been constructed to store cross-domain knowledge, e.g., health, entertainment, industry, sports, and arts. Most applications that use data from general-purpose KBs are domain-specific. Some tasks, such as query formu- lation and information extraction, require a data schema to explore the contents of a KB. However, schema-related declarations are not mandatory and, sometimes, are not pro- vided. Therefore, these domain-specific applications face two issues: (1) they require only a subset of data that meets the domain of interest, but general-purpose KBs have a large volume of factual data within many distinct domains; and (2) the lack of schema-related information. In this thesis, we address the problem of domain-specific schema discov- ery from general-purpose KBs. Specifically, we build ANCHOR, an end-to-end pipeline to identify a domain-specific dataset as well as its schematic description in an automatic way. ANCHOR works in three steps: domain discovery, class identification and class schema discovery. First, it extracts a specific domain exploring category-category mappings from KB. From this, it identifies domain entities through entity-category mappings. Next, the class identification step discovers implicit classes within the dataset. For that, ANCHOR learns entity representation from entity-category mappings and uses it to identify im- plicit entities’ classes by grouping similar entities. Finally, the class schema discovery task builds the class schema, i.e., it identifies a set of relevant attributes that best describe the entities within the same class. For that, ANCHOR runs CoFFee, an approach based on attributes co-occurrence and frequency to identify a set of core attributes for each class discovered in the previous step. We have performed an extensive experimental evaluation on four distinct DBpedia domains. For the class identification task, we compare ANCHOR against some traditional and embedding-based baselines. The results show that applied to standard clustering algorithms, our entity representation outperforms the baselines and is effective for the class identification task. For the class schema discovery task, we compare CoFFee against two state-of-the-art approaches. The results show that CoFFee proved to be effective in filtering out less relevant attributes. It selects a set of core attributes keep- ing its retrieval rate high and producing a higher-quality schema class for the identified classes.