Um modelo para prototipagem rápida de aplicações de mineração na web

Bibliographic details
Year of defense: 2008
Main author: Alvaro Rodrigues Pereira Junior
Advisor: Not provided by the institution
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: Portuguese
Defending institution: Universidade Federal de Minas Gerais
UFMG
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
web
Access link: http://hdl.handle.net/1843/RVMR-7P8NTM
Abstract: Web mining can be seen as the process of discovering patterns from the Web by means of data mining techniques. Web mining is a computation-intensive task and most mining software is developed ad hoc, which makes scalability and reusability difficult for other mining tasks. Web mining is an iterative process and prototyping plays an essential role in experimenting with different alternatives, as well as in incorporating knowledge acquired in previous iterations of the process. The objective of this thesis is the development of a model for fast Web mining prototyping, referred to as WIM (Web Information Mining). The main motivation for developing the WIM model is the fact that its underlying conceptual model provides its users with a level of abstraction appropriate for prototyping and experimentation during the Web mining task. WIM is composed of a data model and an algebra. The WIM data model is a relational view of Web data. The three types of existing Web data, namely Web content, Web structure and Web usage, are represented by relations. The main input components for the WIM data model are the Web pages, the hyperlink structure linking Web pages and the query logs obtained from Web search engines. WIM is implemented with a declarative programming language provided by its algebra. The WIM programming language is based on dataflows, where sequences of operations are applied to relations. The operations are defined by the WIM algebra, which contains operators for data manipulation and for data mining. We present the WIM software architecture, its implementation issues, and discuss alternative architecture designs on which a forthcoming industrial-scale WIM software version could be implemented. We have applied WIM to a set of five real Web mining use cases, as a means to demonstrate the WIM features. The main use case, called Genealogical Trees on the Web, is a study of how Web content evolves in time. We have chosen this use case to perform a complete analysis of its results, which present evidence that some Web publishers actually performed queries using search engines in order to find content and then republished what was found as an answer to the query. The conclusion is that search engines bias the content of the Web. The experimentation of WIM in five real use cases has shown that it significantly facilitates fast Web mining prototyping. Experimental use of the WIM programming language has shown that it reduces the code size written for an application by orders of magnitude when compared with ad hoc implementations.