Uso de informação estrutural para melhorar qualidade de busca em coleções web [Using structural information to improve search quality in Web collections]

Bibliographic details
Year of defense: 2010
Main author: David Braga Fernandes de Oliveira
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por
Defending institution: Universidade Federal de Minas Gerais (UFMG)
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/1843/SLSS-85RKDG
Abstract: Unlike plain text documents, Web pages are commonly composed of distinct segments or blocks such as service channels, decoration skins, navigation bars, main sections, and copyright and privacy announcements. This is of interest because previous works have demonstrated that these segments or blocks, which can be automatically identified in Web pages, can be used to improve information retrieval tasks such as searching, Web link analysis and Web mining. For instance, block information can be used to estimate term weights according to the occurrence of the terms inside blocks (instead of inside pages). As a consequence, the importance of each term occurrence may vary depending on its location (or block) within the Web page. The motivation is that, for instance, an occurrence of a term in the main content section of a Web page is expected to be more important for ranking purposes than an occurrence of that same term in a menu of that page. In this thesis, we investigate how to improve retrieval tasks by exploiting the block structure of Web pages. To that end, we propose: (i) a new model for representing the content of Web sites in information retrieval systems that takes into account the internal structure of their Web pages and the relationships among the structural components found on the pages; (ii) a method to automatically identify the internal structure of Web pages, according to the Web site content representation model proposed in this work; and (iii) a set of 9 block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside whole pages. These functions, which are used to compile a modified BM25 ranking function, have the advantage of requiring neither a learning process nor any type of manual intervention to compute the ranking, unlike previous works.
Using 4 distinct Web collections, we ran extensive experiments to compare our block-weight ranking formulas with 3 baselines: (i) a BM25 ranking applied to full pages, (ii) a BM25 ranking applied to pages after template removal, and (iii) a BM25 ranking that takes into account best blocks. Our results suggest that our block-weighting ranking method is superior to all baselines across all collections we used, with average gains in precision ranging from 5% to 20%. Further, our methods decrease the cost of processing queries when compared to systems using no structural information, reducing index storage requirements and speeding up query processing.
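
Illustrative note: the idea of weighting term occurrences by block, rather than by whole page, can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis's method: the abstract does not specify the 9 block-weight functions, so the fixed weights in BLOCK_WEIGHTS, the block names, and the helper names weighted_tf and bm25_block_score are all hypothetical. The IDF form log(1 + (N - df + 0.5) / (df + 0.5)) is one common BM25 variant chosen here because it stays positive; the thesis may use a different one.

```python
import math

# Hypothetical per-block weights. The thesis computes block weights
# automatically; these constants are placeholders for illustration only.
BLOCK_WEIGHTS = {"main": 1.0, "menu": 0.2, "footer": 0.1}


def page_length(page):
    """Total number of tokens across all blocks of a page."""
    return sum(len(tokens) for tokens in page.values())


def weighted_tf(page, term):
    """Term frequency where each occurrence counts with its block's weight."""
    return sum(
        BLOCK_WEIGHTS.get(block, 1.0) * tokens.count(term)
        for block, tokens in page.items()
    )


def bm25_block_score(query, page, pages, k1=1.2, b=0.75):
    """BM25 score of `page` for `query`, using block-weighted term frequency.

    `page` maps a block name to its list of tokens; `pages` is the collection.
    """
    n = len(pages)
    avg_len = sum(page_length(p) for p in pages) / n
    norm = k1 * (1 - b + b * page_length(page) / avg_len)
    score = 0.0
    for term in query:
        # Document frequency: pages containing the term in any block.
        df = sum(1 for p in pages if any(term in toks for toks in p.values()))
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = weighted_tf(page, term)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score


# Two pages of equal length: the query term appears in the main block of the
# first page and only in the menu of the second.
pages = [
    {"main": ["block", "ranking"], "menu": ["home"]},
    {"main": ["contact", "page"], "menu": ["ranking"]},
]
score_main = bm25_block_score(["ranking"], pages[0], pages)
score_menu = bm25_block_score(["ranking"], pages[1], pages)
```

With these placeholder weights, the page containing the term in its main block scores higher than the page containing it only in a menu, which is the behavior the abstract motivates.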