A formal specification for syntactic annotation and its usage in corpus development and maintenance: a case study in Universal Dependencies

Passos, Guilherme Paulino

Bibliographic details
Year of defense:	2018
Main author:	Passos, Guilherme Paulino
Advisor:	Not provided by the institution
Defense committee:	Not provided by the institution
Document type:	Master's dissertation
Access type:	Open access
Language:	eng
Defense institution:	Universidade Federal do Rio de Janeiro, Brasil, Instituto Alberto Luiz Coimbra de Pós-Graduação e Pesquisa de Engenharia, Programa de Pós-Graduação em Engenharia de Sistemas e Computação, UFRJ
Graduate program:	Not provided by the institution
Department:	Not provided by the institution
Country:	Not provided by the institution
Keywords in Portuguese:	Natural language processing; Syntactic parsing; Knowledge representation; CNPQ::ENGENHARIAS
Access link:	http://hdl.handle.net/11422/13142
Abstract:	Linguistically annotated data are currently crucial resources for natural language processing (NLP). They are necessary both for evaluation and as input for training machine learning models of language. However, producing new datasets is very time-consuming and labor-intensive. Annotators usually need some expertise in linguistics, and even so the annotation decision problem is far from trivial. This difficulty grows with scale: in projects with many annotators or spanning a long period of time, annotation consistency can be compromised. Furthermore, annotating data from a specific domain requires annotators with the corresponding knowledge. This is a serious problem for technical domains such as biomedical sciences, oil & gas, and law. In this work, we contribute to solving the problem of producing syntactically annotated texts (treebanks) by formal methods. We develop a formal specification of the syntactic annotation standard Universal Dependencies, a project of growing importance developed by the NLP community around the world. We argue that this formal specification is useful for improving the quality of treebanks and reducing annotation costs by enforcing consistency in the data. We discuss the features, design choices, and limitations of our ontology, implemented in the OWL2-DL language. We experimentally evaluate the usefulness of our ontology in the task of automatically detecting wrong analyses, showing high precision in four languages. Finally, we contextualize our contribution by surveying state-of-the-art methods for developing and maintaining treebanks.
