Extração de dados de produtos em páginas de comércio eletrônico (Product data extraction from e-commerce pages)

Bibliographic details
Year of defense: 2015
Main author: Godoy, Lucas Antonio Toledo [UNESP]
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Dissertation
Access type: Open access
Language: por
Defending institution: Universidade Estadual Paulista (Unesp)
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://hdl.handle.net/11449/127761
http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/14-09-2015/000845512.pdf
Abstract: Web data extraction is an important problem that became a strong line of research in the mid-1990s. One subdomain of this field is the extraction of product data from online sales pages, given the wealth of information that stores provide through their websites. Extracting product data from this kind of page, such as product names and prices, enables the creation of a wide variety of other tools that use such data to give it a semantic interpretation, for example price comparison among different stores and analysis of consumption habits. Several approaches have been applied to extract target data from Web pages. These approaches, in turn, use a wide range of techniques, among which the Tree Matching technique is particularly prominent because of its good results. This dissertation aimed to implement and evaluate the Tree Matching technique for extracting product data, specifically the product name, its price and, where present, the promotional price, from e-commerce pages, in order to determine its applicability to a commercial system. Improvements to the extraction process were proposed in order to reduce the response time and increase the accuracy of the Generalized Simple Tree Matching algorithm. Experimental results showed that the extraction process achieved an accuracy of about 93.6% on pages contained in the Ecommerce Database and an average gain in response time of about 36% when the pages were reduced by the methods proposed in this study.
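For readers unfamiliar with the technique named in the abstract, the sketch below illustrates the classic Simple Tree Matching recurrence, which counts the maximum number of matching node pairs between two ordered, labelled trees. It is not the Generalized Simple Tree Matching variant evaluated in the dissertation, and it includes none of the page-reduction methods proposed there. The `Node` class and the example trees are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM-like node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the size of the maximum matching between two ordered trees.

    Roots with different tags do not match at all. Otherwise the children
    are aligned with a dynamic program similar to longest common
    subsequence, where the gain of pairing two children is the matching
    score of their subtrees.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # table[i][j] = best score using the first i children of a
    # and the first j children of b.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    # +1 accounts for the matched roots themselves.
    return table[m][n] + 1


# Two hypothetical product blocks: one with a promotional price, one without.
with_promo = Node("div", [Node("h2"), Node("span"), Node("span")])
without_promo = Node("div", [Node("h2"), Node("span")])
score = simple_tree_matching(with_promo, without_promo)
```

A score close to the size of both trees suggests that the two subtrees share a template, which is the kind of signal a tree-matching extractor can use to locate repeated product records on a page.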