Detection of website replicas using semi-supervised learning based on expectation maximization
Year of defense: | 2014 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Master's thesis |
Access type: | Open access |
Language: | por |
Defense institution: | Universidade Federal de Minas Gerais (UFMG) |
Graduate program: | Not informed by the institution |
Department: | Not informed by the institution |
Country: | Not informed by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/ESBF-9TENX8 |
Abstract: | The Web contains a vast repository of information; according to the literature, about 29% of this repository is duplicate content. Duplication may occur within a single website (intra-site) or between different websites (inter-site). This thesis addresses the problem of detecting inter-site replicas, treating it as a classification task in which positive and negative replica examples are used to train a binary classifier. The proposed method uses a semi-supervised learning algorithm based on the Expectation-Maximization (EM) approach. EM is an iterative method for estimating the parameters of probabilistic models with latent or unobserved data. In replica detection, obvious replica and non-replica examples are easy to find; the EM algorithm is used to discover non-obvious examples and thus build a training set for the classifier at no manual-labeling cost. The quality of the results obtained by combining classifiers can be substantially improved by exploiting a central concept from Economics, Pareto efficiency: this technique selects results that excel in at least one of the classifiers used. The proposed algorithm yields significant gains over the state of the art in website replica detection. Combining the proposed algorithm, which eliminates inter-site replicas, with algorithms that eliminate intra-site replicated content leads to a more complete solution, effectively reducing the number of duplicate URLs in the collection. |
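
The abstract gives no implementation details, but the EM-based semi-supervised step it describes can be illustrated as a hard-EM (self-training) loop: seed a classifier with obvious replica and non-replica site pairs, then alternate between labeling the non-obvious pairs (E-step) and refitting the model (M-step) until the labels stabilize. The features, classifier, and synthetic data below are illustrative assumptions, not the dissertation's actual setup.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic site-pair features (e.g. content overlap, URL similarity);
# the real feature set used in the dissertation is not specified here.
X_seed = np.vstack([rng.normal(0.9, 0.05, (20, 2)),   # obvious replicas
                    rng.normal(0.1, 0.05, (20, 2))])  # obvious non-replicas
y_seed = np.array([1] * 20 + [0] * 20)
X_unlabeled = rng.uniform(0.0, 1.0, (500, 2))         # non-obvious pairs

clf = GaussianNB()
clf.fit(X_seed, y_seed)
y_pseudo = clf.predict(X_unlabeled)

# Hard-EM loop: the E-step labels the unlabeled pairs with the current
# model, the M-step refits on seed + pseudo-labeled data.
for _ in range(20):
    X_all = np.vstack([X_seed, X_unlabeled])
    y_all = np.concatenate([y_seed, y_pseudo])
    clf.fit(X_all, y_all)             # M-step
    y_new = clf.predict(X_unlabeled)  # E-step
    if np.array_equal(y_new, y_pseudo):
        break
    y_pseudo = y_new
```

The loop stops once the pseudo-labels no longer change, at which point the enlarged set of labeled pairs can train the final binary classifier without any manual labeling.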
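Likewise, the idea of keeping results that excel in at least one classifier corresponds to computing the Pareto front of the classifiers' scores: a candidate is discarded only if some other candidate scores at least as well on every classifier and strictly better on one. A minimal sketch, assuming each candidate pair receives one score per classifier:

```python
import numpy as np

def pareto_front(scores: np.ndarray) -> np.ndarray:
    """Return a boolean mask of Pareto-efficient rows of `scores`.

    A row is kept unless some other row is >= in every column and
    > in at least one (i.e. it is dominated on all classifier scores).
    """
    keep = np.ones(len(scores), dtype=bool)
    for i, row in enumerate(scores):
        dominated = np.all(scores >= row, axis=1) & np.any(scores > row, axis=1)
        keep[i] = not dominated.any()
    return keep

# Scores assigned by two hypothetical classifiers to five candidate pairs.
scores = np.array([[0.9, 0.2],
                   [0.8, 0.8],
                   [0.1, 0.9],
                   [0.5, 0.5],   # dominated by [0.8, 0.8]
                   [0.2, 0.1]])  # dominated by several rows
print(pareto_front(scores))     # [ True  True  True False False]
```

Each surviving row excels on at least one classifier relative to every row that could dominate it, which matches the selection criterion the abstract attributes to Pareto efficiency.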