Detalhes bibliográficos
Ano de defesa: |
2014 |
Autor(a) principal: |
Lima, Rinaldo José de |
Outros Autores: |
Freitas, Frederico Luiz Gonçalves de |
Orientador(a): |
Não Informado pela instituição |
Banca de defesa: |
Não Informado pela instituição |
Tipo de documento: |
Tese
|
Tipo de acesso: |
Acesso aberto |
Idioma: |
eng |
Instituição de defesa: |
Universidade Federal de Pernambuco
|
Programa de Pós-Graduação: |
Não Informado pela instituição
|
Departamento: |
Não Informado pela instituição
|
País: |
Não Informado pela instituição
|
Palavras-chave em Português: |
|
Link de acesso: |
https://repositorio.ufpe.br/handle/123456789/12425
|
Resumo: |
Information Extraction (IE) consists in the task of discovering and structuring information found in a semi-structured or unstructured textual corpus. Named Entity Recognition (NER) and Relation Extraction (RE) are two important subtasks in IE. The former aims at finding named entities, including the name of people, locations, among others, whereas the latter consists in detecting and characterizing relations involving such named entities in text. Since the approach of manually creating extraction rules for performing NER and RE is an intensive and time-consuming task, researchers have turned their attention to how machine learning techniques can be applied to IE in order to make IE systems more adaptive to domain changes. As a result, a myriad of state-of-the-art methods for NER and RE relying on statistical machine learning techniques have been proposed in the literature. Such systems typically use a propositional hypothesis space for representing examples, i.e., an attribute-value representation. In machine learning, the propositional representation of examples presents some limitations, particularly in the extraction of binary relations, which mainly demands not only contextual and relational information about the involving instances, but also more expressive semantic resources as background knowledge. This thesis attempts to mitigate the aforementioned limitations based on the hypothesis that, to be efficient and more adaptable to domain changes, an IE system should exploit ontologies and semantic resources in a framework for IE that enables the automatic induction of extraction rules by employing machine learning techniques. In this context, this thesis proposes a supervised method to extract both entity and relation instances from textual corpora based on Inductive Logic Programming, a symbolic machine learning technique. The proposed method, called OntoILPER, benefits not only from ontologies and semantic resources, but also relies on a highly expressive relational hypothesis space, in the form of logical predicates, for representing examples whose structure is relevant to the information extraction task. OntoILPER automatically induces symbolic extraction rules that subsume examples of entity and relation instances from a tailored graph-based model of sentence representation, another contribution of this thesis. Moreover, this graph-based model for representing sentences also enables the exploitation of domain ontologies and additional background knowledge in the form of a condensed set of features including lexical, syntactic, semantic, and relational ones. Differently from most of the IE methods (a comprehensive survey is presented in this thesis, including the ones that also apply ILP), OntoILPER takes advantage of a rich text preprocessing stage which encompasses various shallow and deep natural language processing subtasks, including dependency parsing, coreference resolution, word sense disambiguation, and semantic role labeling. Further mappings of nouns and verbs to (formal) semantic resources are also considered. OntoILPER Framework, the OntoILPER implementation, was experimentally evaluated on both NER and RE tasks. This thesis reports the results of several assessments conducted using six standard evaluationcorpora from two distinct domains: news and biomedical. The obtained results demonstrated the effectiveness of OntoILPER on both NER and RE tasks. Actually, the proposed framework outperforms some of the state-of-the-art IE systems compared in this thesis. |