Semantic enrichment of American English corpora through automatic semantic annotation based on top-level ontologies using the CRF classification model

Bibliographic details
Year of defense: 2018
Main author: Andrade, Guidson Coelho de
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: eng
Defending institution: Universidade Federal de Viçosa
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://www.locus.ufv.br/handle/123456789/21639
Abstract: Textual databases carry human-perceived meanings, but those meanings are difficult for computers to interpret. For machines to understand the semantics attached to texts, and not only their syntax, extra information must be added to these corpora. Semantic annotation is the task of incorporating this information by adding metadata to lexical items. This information can consist of ontological concepts that help define the nature of a word and so give it meaning. However, annotating texts according to an ontology still requires time and effort from annotators trained for this purpose. An alternative is to use automatic semantic annotation tools that apply machine learning techniques to classify the terms to be annotated. This approach requires training data for the algorithms, which in this case are corpora pre-annotated according to the semantic dimension to be explored. However, this line of work has limited resources to meet the needs of learning methods. There is a severe shortage of semantically annotated corpora and an even greater shortage of ontologically annotated corpora, which hinders progress in automatic semantic annotation. The purpose of the present work is to assist in the semantic enrichment of American English texts by automatically annotating them based on a top-level ontology using the Conditional Random Fields (CRF) supervised learning model. After the Open American National Corpus was selected as the linguistic database and Schema.org as the ontology, the work was divided into two stages. In the first stage, the pre-processed and corrected corpus underwent a hybrid annotation, initially with a rule-based annotator and later manually. Both annotation tasks were driven by the concepts and definitions of the eight classes at the top level of the selected ontology.
In the second stage, once the corpus had been annotated ontologically, the automatic annotation process was carried out using the CRF learning method. The prediction model took into account the linguistic and structural features of the terms to classify them under the eight ontological types. The results obtained during the evaluation of the model were very satisfactory and met the objective of the research. Although the work is a new approach to semantic annotation with little basis for comparison, it presented promising results for advancing research in automatic semantic enrichment based on top-level ontologies.
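
Illustrative note: the abstract does not list the linguistic and structural features the model used, so the following is only a minimal sketch of what per-token feature extraction for a linear-chain CRF tagger might look like. The function names (`token_features`, `sentence_features`), the feature keys, and the sample sentence are all hypothetical and are not taken from the dissertation.

```python
from typing import Dict, List, Union

Feature = Union[str, bool]


def token_features(tokens: List[str], i: int) -> Dict[str, Feature]:
    """Build a feature dict for tokens[i].

    The feature set is an assumption made for illustration; it is not
    the feature set used in the dissertation.
    """
    word = tokens[i]
    feats: Dict[str, Feature] = {
        # Lexical ("linguistic") cues about the token itself.
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        # Positional ("structural") cues about the token in the sentence.
        "BOS": i == 0,
        "EOS": i == len(tokens) - 1,
    }
    # Context window of one token on each side, when available.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, Feature]]:
    """Return one feature dict per token of the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hypothetical sample sentence, not drawn from the annotated corpus.
sentence = ["Alice", "visited", "Boston", "in", "2018"]
features = sentence_features(sentence)
```

A sequence of such dicts per sentence, paired with one ontological label per token, is the kind of input that CRF toolkits such as sklearn-crfsuite accept for training. Whether the dissertation used that toolkit is not stated in this record.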