Estratégias baseadas em exemplos para extração de dados semi-estruturados da web

Bibliographic details
Year of defense: 2002
Main author: Altigran Soares da Silva
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese (por)
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
web
Access link: http://hdl.handle.net/1843/SLBS-5KKKXX
Abstract: In this work we propose, implement and evaluate strategies and techniques for the problem of extracting semistructured data from Web data sources within the context of an approach we call DEByE (Data Extraction By Example). The results we have reached have been used in the implementation of a data extraction tool, also called DEByE, and have had their effectiveness verified through experiments.

The DEByE approach is semi-automatic, in the sense that the role of users (i.e., wrapper developers) is limited to providing examples of the data to be extracted, which shields them from having to be aware of specific formatting features of the target pages. The examples provided describe the structure of the objects being extracted by means of nested tables, which are simple and intuitive, and expressive enough to represent the structure of the data normally present in Web pages. To deal with typical variations of complex semistructured objects, we have extended the original concept of nested tables by relaxing the original assumption that all inner tables nested in a column should have the same internal structure.

Based on this extended form of nested tables, we formalize the concept of wrappers by means of tabular grammars. Such context-free grammars are formed by productions that lead to parse trees that can be directly mapped to nested tables. We have developed strategies for generating tabular grammars from a set of example objects provided by a user from a sample page. This includes: (1) the generation of terminal productions for extracting single values belonging to a specific domain (e.g., an item description, a price, etc.) and (2) the generation of non-terminal productions that represent the structure of the complex objects to be extracted.

The extraction of data from target pages is accomplished by parsing these pages using a tabular grammar. For this, we have developed an efficient bottom-up strategy. This strategy includes two distinct phases: an extraction phase, in which atomic attribute values are extracted based on local context information available in the extraction productions, and an assembling phase, in which such values are assembled to form complex objects according to the target structure supplied by the user through examples, which is encoded in the non-terminal productions. We experimentally demonstrate the effectiveness of the bottom-up strategy for dealing with multi-level objects presenting structural variations.

The general principle used by the bottom-up algorithm, that is, first extracting atomic values and then grouping these values to assemble complex objects, has been further exploited by the Hot Cycles algorithm we have developed. This algorithm aims at uncovering a plausible tabular structure for assembling complex objects with a given set of atomic values extracted from a target page. It is useful for deploying the DEByE approach in applications where the user is not available for assembling example tables.
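
As an illustration only, the two-phase principle described above (first extract atomic values from their local context, then assemble them into nested objects) might be sketched in Python as follows. This is not the DEByE implementation and does not use tabular grammars: the page snippet, the regular expressions in `PATTERNS`, the function names `extract` and `assemble`, and the author/title/price structure are all hypothetical, chosen just to make the idea concrete. The last book deliberately lacks a price, so the inner tables do not all share the same structure.

```python
import re

# Hypothetical "extraction patterns": each attribute is recognized by the
# local context (surrounding markup) of its value. These are assumptions
# made for this sketch, not patterns taken from the thesis.
PATTERNS = {
    "author": re.compile(r"<b>Author:</b>\s*([^<\n]+)"),
    "title": re.compile(r"<i>Title:</i>\s*([^<\n]+)"),
    "price": re.compile(r"<span>\$([0-9.]+)</span>"),
}


def extract(page):
    """Extraction phase: collect (position, attribute, value) triples."""
    found = []
    for attr, pattern in PATTERNS.items():
        for match in pattern.finditer(page):
            found.append((match.start(), attr, match.group(1).strip()))
    # Sorting by position restores the order in which values occur on the page.
    return sorted(found)


def assemble(values):
    """Assembling phase: group values, in page order, into nested objects.

    The target structure (author -> list of books with title and optional
    price) is hard-coded here; it stands in for the structure a user would
    supply through examples.
    """
    objects = []
    for _, attr, value in values:
        if attr == "author":
            # A new author value opens a new top-level object.
            objects.append({"author": value, "books": []})
        elif attr == "title" and objects:
            # A title opens a new row in the current author's inner table.
            objects[-1]["books"].append({"title": value})
        elif attr == "price" and objects and objects[-1]["books"]:
            # A price attaches to the most recently opened book row.
            objects[-1]["books"][-1]["price"] = value
    return objects


# Hypothetical sample page; the last book has no price.
PAGE = """\
<b>Author:</b> Ann Lee
<i>Title:</i> Web Data <span>$10.00</span>
<i>Title:</i> Wrappers <span>$12.50</span>
<b>Author:</b> Bob Ray
<i>Title:</i> Parsing
"""

result = assemble(extract(PAGE))
for obj in result:
    print(obj)
```

In this sketch the assembling step relies only on the order of the extracted values, which is what lets it tolerate the missing price in the last inner table.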