Using structural information to improve search quality in Web collections (original title: Uso de informação estrutural para melhorar qualidade de busca em coleções web)
Defense year: | 2010 |
---|---|
Lead author: | |
Advisor: | |
Examination committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | por |
Defending institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not provided by the institution |
Department: | Not provided by the institution |
Country: | Not provided by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/SLSS-85RKDG |
Abstract: | Unlike plain text documents, Web pages are commonly composed of distinct segments or blocks such as service channels, decoration skins, navigation bars, main sections, and copyright and privacy announcements. This matters because previous work has shown that these blocks, which can be automatically identified in Web pages, can be used to improve information retrieval tasks such as searching, Web link analysis, and Web mining. For instance, block information can be used to estimate term weights according to the occurrence of terms inside blocks (instead of inside pages). As a consequence, the importance of each term occurrence may vary depending on its location (or block) within the Web page. The motivation is that, for instance, an occurrence of a term in the main content section of a Web page is expected to be more important for ranking purposes than an occurrence of that same term in a menu of that page. In this thesis, we investigate how to improve retrieval tasks by exploiting the block structure of Web pages. To that end, we propose: (i) a new model for representing the content of Web sites in information retrieval systems that takes into account the internal structure of their Web pages and the relationships among the structural components found on those pages; (ii) a method to automatically identify the internal structure of Web pages, according to the Web site content model proposed in this work; and (iii) a set of 9 block-weight functions to distinguish the impact of term occurrences inside page blocks, instead of inside whole pages. These functions, which are used to build a modified BM25 ranking function, have the advantage of requiring neither a learning process nor any manual intervention to compute the ranking, unlike previous works.
Using 4 distinct Web collections, we ran extensive experiments to compare our block-weight ranking formulas with 3 baselines: (i) BM25 ranking applied to full pages, (ii) BM25 ranking applied to pages after template removal, and (iii) BM25 ranking that takes into account best blocks. Our results suggest that our block-weighting ranking method is superior to all baselines across all collections we used, with average precision gains of 5% to 20%. Further, our methods decrease the cost of processing queries when compared to systems that use no structural information, reducing index storage requirements and increasing query processing speed. |
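
The idea of weighting term occurrences by block, rather than by whole page, can be illustrated with a minimal Python sketch. It is not one of the thesis's 9 block-weight functions: the block kinds, the values in `BLOCK_WEIGHTS`, the fallback weight of 0.5, and the helper names `weighted_tf` and `bm25_block_score` are all assumptions made for illustration. The sketch only shows how a standard BM25 formula might consume a block-weighted term frequency in place of the raw count.

```python
import math
from collections import Counter

# Hypothetical block weights, chosen only for illustration
# (the abstract does not give the actual weight functions).
BLOCK_WEIGHTS = {"main": 1.0, "menu": 0.2, "footer": 0.1}

def weighted_tf(page, term):
    """Sum the term's occurrences per block, scaled by the block's weight."""
    return sum(
        BLOCK_WEIGHTS.get(kind, 0.5) * Counter(tokens)[term]
        for kind, tokens in page
    )

def bm25_block_score(query, page, pages, k1=1.2, b=0.75):
    """Standard BM25, with raw term frequency replaced by weighted_tf."""
    n = len(pages)
    page_len = sum(len(tokens) for _, tokens in page)
    avg_len = sum(len(tokens) for p in pages for _, tokens in p) / n
    score = 0.0
    for term in query:
        # Document frequency: pages containing the term in any block.
        df = sum(1 for p in pages if any(term in tokens for _, tokens in p))
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = weighted_tf(page, term)
        norm = tf + k1 * (1 - b + b * page_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

# Each page is a list of (block kind, tokens) pairs.
page_a = [("main", ["block", "ranking", "model"]), ("menu", ["home", "about"])]
page_b = [("main", ["home", "page", "news"]), ("menu", ["block", "about"])]
pages = [page_a, page_b]

score_a = bm25_block_score(["block"], page_a, pages)
score_b = bm25_block_score(["block"], page_b, pages)
```

Under these assumed weights, `page_a` (where "block" appears in the main section) scores higher than `page_b` (where it appears only in a menu), even though both pages contain the term once and have the same length.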