Tree edit distance applied to web data extraction

Bibliographic details
Year of defense: 2005
Main author: Davi de Castro Reis
Advisor: Not informed by the institution
Examination committee: Not informed by the institution
Document type: Dissertation
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Federal de Minas Gerais (UFMG)
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/RVMR-6EAG8V
Abstract: The World Wide Web is the largest information repository available today, with billions of pages covering many topics, accessible to people of different nationalities. The content of these pages, however, is formatted for human consumption, and computer agents have great difficulty accessing and manipulating the data in them. One option to circumvent this problem is to manually write extractors for every web page of interest, thereby making those pages suitable for computer agents. Recently, new semi-automatic extractor generation tools have been developed, but even with these tools it is still not possible to extract data from a large collection of web pages, due to the need for human intervention. This dissertation presents a new strategy for building web data extraction systems. Systems created with the proposed strategy are completely automatic and can be used for large extraction tasks. In our experiments, we extracted, in a completely automatic fashion, the news found in the pages of 35 of the main Brazilian media outlets on the Web, totaling 4088 pages, with a precision of 87.71%. The key to this result is the use of the tree edit distance technique. Since web pages are serialized trees, we can use this technique to find the differences between the trees and then extract the data from the pages. Besides an extensive review of the tree edit distance problem, this dissertation presents a new algorithm for it. The algorithm, named Restricted Top-Down Mapping, or simply RTDM, is described in detail, including pseudo-code, asymptotic bounds, and an empirical analysis, which lead to the conclusion that it surpasses all other algorithms applicable to web data extraction available in the literature.
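To illustrate the general idea behind the abstract, the sketch below computes an edit distance between ordered labeled trees under a top-down restriction: two nodes may only be matched if their parents are matched, so deleting or inserting a node removes or adds its whole subtree. This is a minimal illustration in the spirit of restricted top-down mappings, not the RTDM algorithm from the dissertation; the `Node` class and all function names are invented here for the example.

```python
class Node:
    """An ordered labeled tree node (e.g. an HTML element)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = tuple(children or ())

def subtree_size(t):
    """Number of nodes in the subtree rooted at t."""
    return 1 + sum(subtree_size(c) for c in t.children)

def tree_edit_distance(a, b):
    """Top-down restricted edit distance between two ordered trees.

    Relabeling a node costs 1; deleting/inserting a node removes/adds
    its entire subtree, costing the subtree's size.
    """
    cost = 0 if a.label == b.label else 1
    return cost + forest_distance(a.children, b.children)

def forest_distance(xs, ys):
    """Sequence edit distance over two child forests (dynamic programming)."""
    m, n = len(xs), len(ys)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + subtree_size(xs[i - 1])   # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + subtree_size(ys[j - 1])   # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + subtree_size(xs[i - 1]),               # delete
                d[i][j - 1] + subtree_size(ys[j - 1]),               # insert
                d[i - 1][j - 1] + tree_edit_distance(xs[i - 1], ys[j - 1]),  # match/relabel
            )
    return d[m][n]
```

For two page-like trees that differ only in one tag, the distance is 1, which hints at how structurally similar pages from the same template can be aligned to locate the data regions that vary between them:

```python
t1 = Node("html", [Node("body", [Node("p")])])
t2 = Node("html", [Node("body", [Node("div")])])
tree_edit_distance(t1, t1)  # identical trees -> 0
tree_edit_distance(t1, t2)  # one relabeled node -> 1
```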