Efficient partitioning strategies for distributed Web crawling

Bibliographic Details
Main Author: Exposto, José
Publication Date: 2007
Other Authors: Macedo, Joaquim; Pina, António Manuel Silva; Alves, Albano Agostinho Gomes; Amaro, José Carlos Rufino
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/1822/6634
Summary: This paper presents a multi-objective approach to Web space partitioning, aimed at improving distributed crawling efficiency. The investigation is supported by the construction of two different weighted graphs. The first models the topological communication infrastructure between crawlers and Web servers, and the second represents the number of link connections between servers’ pages. The edge weights represent, respectively, computed round-trip times (RTTs) and page links between nodes. The two graphs are then combined, using a multi-objective partitioning algorithm, to support Web space partitioning and load allocation for an adaptable number of geographically distributed crawlers. Partitioning strategies were evaluated by varying the number of partitions (crawlers) to obtain merit figures for: i) download time, ii) exchange time and iii) relocation time. The evaluation showed that our partitioning schemes outperform traditional hostname-hash-based counterparts in all evaluated metrics, achieving on average an 18% reduction in download time, a 78% reduction in exchange time and a 46% reduction in relocation time.
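The contrast drawn in the summary between hostname-hash partitioning and a partitioning that weighs RTTs against inter-server links can be illustrated with a minimal sketch. It is not the authors' algorithm: the greedy placement below is a simple stand-in for a multi-objective graph partitioner, and the host names, RTT values, link counts, the `alpha` weight and all function names are hypothetical, chosen only for illustration.

```python
import hashlib

def hash_partition(hosts, k):
    """Baseline: assign each host to one of k crawlers by a stable hash of its name."""
    return {h: int(hashlib.md5(h.encode()).hexdigest(), 16) % k for h in hosts}

def download_cost(assign, rtt):
    """Sum of crawler-to-host RTTs for the chosen assignment (rtt[crawler][host])."""
    return sum(rtt[c][h] for h, c in assign.items())

def exchange_cost(assign, links):
    """Total links whose two endpoints are handled by different crawlers."""
    return sum(n for (a, b), n in links.items() if assign[a] != assign[b])

def combined_cost(assign, rtt, links, alpha=0.5):
    """Weighted sum of the two objectives; alpha is an assumed trade-off knob."""
    return alpha * download_cost(assign, rtt) + (1 - alpha) * exchange_cost(assign, links)

def greedy_partition(hosts, k, rtt, links, alpha=0.5):
    """Place hosts one by one on the crawler with the lowest incremental cost."""
    assign = {}
    for h in hosts:
        best = None
        for c in range(k):
            # Links from h to already-placed hosts that would cross crawlers.
            cut = sum(n for (a, b), n in links.items()
                      if (a == h and b in assign and assign[b] != c)
                      or (b == h and a in assign and assign[a] != c))
            cost = alpha * rtt[c][h] + (1 - alpha) * cut
            if best is None or cost < best[0]:
                best = (cost, c)
        assign[h] = best[1]
    return assign

# Toy data (hypothetical): two crawlers, four servers.
hosts = ["a.example", "b.example", "c.example", "d.example"]
rtt = [
    {"a.example": 10, "b.example": 12, "c.example": 80, "d.example": 90},  # crawler 0
    {"a.example": 85, "b.example": 70, "c.example": 15, "d.example": 11},  # crawler 1
]
links = {
    ("a.example", "b.example"): 40,
    ("c.example", "d.example"): 30,
    ("b.example", "c.example"): 2,
}

plan = greedy_partition(hosts, 2, rtt, links)
baseline = hash_partition(hosts, 2)
print("greedy:", plan, combined_cost(plan, rtt, links))
print("hash:  ", baseline, combined_cost(baseline, rtt, links))
```

On this toy input the greedy plan keeps heavily linked servers together on the nearest crawler, whereas the hash baseline ignores both RTTs and links. Mixing milliseconds and link counts in one weighted sum is itself a simplification; a real setup would normalise the two objectives first.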
id RCAP_3eddd7ec07dbbd72ae4f7f9a35646f31
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/6634
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Efficient partitioning strategies for distributed Web crawling
dc.contributor.none.fl_str_mv Universidade do Minho
dc.contributor.author.fl_str_mv Exposto, José
Macedo, Joaquim
Pina, António Manuel Silva
Alves, Albano Agostinho Gomes
Amaro, José Carlos Rufino
dc.subject.por.fl_str_mv Databases
Computer communications and networks
publishDate 2007
dc.date.none.fl_str_mv 2007-01
2007-01-01T00:00:00Z
dc.type.driver.fl_str_mv conference paper
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1822/6634
url http://hdl.handle.net/1822/6634
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING, 21, Estoril, Portugal, 2007 – “ICOIN 2007 : proceedings of the 21st International Conference on Information Networking”. [S.l. : s.n., 2007?].
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833595685381865472