Automated data extraction from PDF documents: application to large sets of educational exams

Bibliographic details
Year of defense: 2021
Main author: Wiechork, Karina
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Institution of defense: Universidade Federal de Santa Maria
Brazil
Computer Science
UFSM
Programa de Pós-Graduação em Ciência da Computação
Centro de Tecnologia
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
PDF
Access link: http://repositorio.ufsm.br/handle/1/23130
Abstract: The massive production of documents in PDF has motivated research on the automated extraction of the data contained in these files. Many educational exams make their tests available in PDF format, and these serve as study and research material. Automatically segmenting, identifying and extracting the content of a test in PDF is a challenge, because the layout of this type of document can vary widely. Research in document analysis and recognition, computer vision and information retrieval has produced algorithms and tools that can be applied to this task, but determining their effectiveness for a given set of documents is not trivial. This work proposes an approach for evaluating tools that extract data from natively digital PDF files available in large educational test repositories. For this purpose, the tests administered in Enade from 2004 to 2019 were used. The files used for the evaluation comprise 343 tests, with 11,196 objective and discursive questions, in addition to all 396 answers, with 14,475 alternatives extracted from the objective questions. To build the ground truth for the tests, the Aletheia tool was used to define the regions of interest in each question. For the extractions, existing PDF data extraction tools were used, grouped into three categories: tabular data extraction, textual content extraction and region-of-interest extraction. The results reveal some limitations related to the diversity of layouts across the years in which the Enade test was administered, as well as the difficulty of identifying and extracting questions arranged in two columns on the same page or in multiple columns. The extracted data provide useful information that can assist students who intend to study for other tests, teachers who want to use these questions in classroom exercises, and course coordinators, by helping to map students’ difficulties from questions in reports.