Algoritmos de espaço quase ótimo para hashing perfeito
Ano de defesa: | 2008 |
---|---|
Autor(a) principal: | |
Orientador(a): | |
Banca de defesa: | |
Tipo de documento: | Tese |
Tipo de acesso: | Acesso aberto |
Idioma: | por |
Instituição de defesa: |
Universidade Federal de Minas Gerais
UFMG |
Programa de Pós-Graduação: |
Não Informado pela instituição
|
Departamento: |
Não Informado pela instituição
|
País: |
Não Informado pela instituição
|
Palavras-chave em Português: | |
Link de acesso: | http://hdl.handle.net/1843/RVMR-7LAMRS |
Resumo: | A perfect hash function (PHF) h : S [0, m 1] for a key set S U of size n, where m n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets, such as words in natural languages, reserved words in programming languages or interactive systems, universal resource locations (URLs) in web search engines, or item sets in data mining techniques. In this thesis we present a simple, highly scalable and near-space optimal perfect hashing algorithm. Evaluation of a PHF on a given element of S requires constant time, and thedominating phase in the construction algorithm consists of sorting n fingerprints of O(log n) bits in O(n) time. The space usage depends on the relation between m and n. For m = n the space usage is in the range 2.62n to 3.3n bits, depending on the constants involved in the construction and in the evaluation phases. For m = 1.23n the space usage is in the range 1.95n to 2.7n bits. In all cases, this is within a small constant factor from the information theoretical minimum of approximately 1.44n bits for MPHFs and 0.89n bits for PHFs, something that has not been achieved by previous algorithms, except asymptotically for very large n. This small space usage opens up the use of MPHFs to applications forwhich they were not useful in the past. We demonstrate the scalability of our algorithm by constructing an MPHF for a set of 1.024 billion URLs from the World Wide Web of average length 64 characters in approximately 50 minutes, using a commodity PC. We also present a distributed and parallel implementation of the algorithm, which generates an MPHF for the same URL set, using a 14 computer cluster, in approximately 4 minutes, achieving an almost linear speedup.Also, for 14.336 billion 16-byte random integers distributed among the 14 participating machines, the algorithm outputs an MPHF in approximately 50 minutes, with a performance degradation of 20%. |