Optimizing Aho-Corasick for word counting.

Bibliographic details
Main author: LUCENA, Emerson Leonardo.
Publication date: 2020
Document type: Undergraduate thesis (Trabalho de conclusão de curso)
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da UFCG
Full text: https://dspace.sti.ufcg.edu.br/handle/riufcg/20128
Abstract: The Aho-Corasick algorithm finds all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length, and 𝑜 is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the 𝑜 term grows and the algorithm's performance degrades. In many domains, such as information retrieval, natural language processing, and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of each word of a set in a text. The new algorithm works offline, and its running time does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB, and 10 MB. The new algorithm was faster in every experiment, from 50% to 300% faster than Aho-Corasick.
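The abstract does not spell out the new algorithm, so the following is only a minimal sketch of one well-known way to count keyword occurrences offline without paying for each individual match. It is not taken from the thesis and may differ from the author's method. It assumes a standard Aho-Corasick automaton, records how often each state is reached while scanning the text, and then pushes those totals along the failure links in reverse breadth-first order. The function name `count_keywords` and all variable names are hypothetical.

```python
from collections import deque


def count_keywords(keywords, text):
    """Count occurrences (overlaps included) of each keyword in text.

    Illustrative sketch only; not the thesis's implementation.
    """
    # Trie stored as parallel lists; state 0 is the root.
    goto = [{}]        # goto[s][ch] -> child state
    fail = [0]         # failure link of each state
    word_at = [None]   # keyword ending exactly at this state, if any

    for word in keywords:
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                word_at.append(None)
            state = nxt
        if state:  # ignore empty keywords
            word_at[state] = word

    # Breadth-first traversal: set failure links and remember the BFS order.
    order = []
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        order.append(state)
        for ch, child in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            queue.append(child)

    # Scan the text once, counting how many times each state is reached.
    visits = [0] * len(goto)
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        visits[state] += 1

    # Offline step: deeper states hand their totals to their failure targets,
    # so each state ends up with the count of text positions where its string
    # occurs as a suffix of the text read so far.
    for state in reversed(order):
        visits[fail[state]] += visits[state]

    # Report only the keywords that actually occur.
    return {
        word_at[s]: visits[s]
        for s in order
        if word_at[s] is not None and visits[s]
    }
```

In this sketch the scan costs time proportional to the text length and the propagation touches each automaton state once, so the work after the scan depends on the size of the dictionary rather than on the total number of matches. Because the totals are only available after the whole text has been read, the approach is offline in the sense the abstract uses.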
id UFCG_61ba5851fc0d9d66dd1fea56450bc5ac
oai_identifier_str oai:dspace.sti.ufcg.edu.br:riufcg/20128
network_acronym_str UFCG
network_name_str Biblioteca Digital de Teses e Dissertações da UFCG
repository_id_str 4851
spelling Submitted by Emanuel Varela Cardoso (emanuel.varela@ufcg.edu.br) on 2021-07-20T13:15:32Z. No. of bitstreams: 1. EMERSON LEONARDO LUCENA - TCC CIÊNCIA DA COMPUTAÇÃO 2020.pdf: 1526013 bytes, checksum: 120b9a03bcda7345d197fefc65fb796d (MD5). Made available in DSpace on 2021-07-20T13:15:32Z (GMT). Previous issue date: 2020.
dc.title.pt_BR.fl_str_mv Optimizing Aho-Corasick for word counting.
dc.title.alternative.pt_BR.fl_str_mv Otimizando ahocorasick para contagem de palavras.
title Optimizing Aho-Corasick for word counting.
spellingShingle Optimizing Aho-Corasick for word counting.
LUCENA, Emerson Leonardo.
Ciência da Computação
Aho-Corasick algorithm
Pattern matching
Correspondência de padrões
Filtrage
Coincidencia de patrones
Word counting
Recuento de palabras
Comptage de mots
Contagem de palavras
Algoritmo offline
Algorithme hors ligne
Algoritmo sin conexión
Offline algorithm
Processamento de textos
Processing of texts
Procesamiento de textos
Traitement des textes
title_short Optimizing Aho-Corasick for word counting.
title_full Optimizing Aho-Corasick for word counting.
title_fullStr Optimizing Aho-Corasick for word counting.
title_full_unstemmed Optimizing Aho-Corasick for word counting.
title_sort Optimizing Aho-Corasick for word counting.
author LUCENA, Emerson Leonardo.
author_facet LUCENA, Emerson Leonardo.
author_role author
dc.contributor.advisor1.fl_str_mv GHEYI, Rohit.
dc.contributor.advisor1ID.fl_str_mv GHEYI, R.
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/2931270888717344
dc.contributor.referee1.fl_str_mv MONTEIRO, João Arthur Brunet.
dc.contributor.referee2.fl_str_mv MASSONI, Tiago Lima.
dc.contributor.authorID.fl_str_mv LUCENA, E. L.
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/5944567562075735
dc.contributor.author.fl_str_mv LUCENA, Emerson Leonardo.
contributor_str_mv GHEYI, Rohit.
MONTEIRO, João Arthur Brunet.
MASSONI, Tiago Lima.
dc.subject.cnpq.fl_str_mv Ciência da Computação
topic Ciência da Computação
Aho-Corasick algorithm
Pattern matching
Correspondência de padrões
Filtrage
Coincidencia de patrones
Word counting
Recuento de palabras
Comptage de mots
Contagem de palavras
Algoritmo offline
Algorithme hors ligne
Algoritmo sin conexión
Offline algorithm
Processamento de textos
Processing of texts
Procesamiento de textos
Traitement des textes
dc.subject.por.fl_str_mv Aho-Corasick algorithm
Pattern matching
Correspondência de padrões
Filtrage
Coincidencia de patrones
Word counting
Recuento de palabras
Comptage de mots
Contagem de palavras
Algoritmo offline
Algorithme hors ligne
Algoritmo sin conexión
Offline algorithm
Processamento de textos
Processing of texts
Procesamiento de textos
Traitement des textes
description The Aho-Corasick algorithm finds all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length, and 𝑜 is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the 𝑜 term grows and the algorithm's performance degrades. In many domains, such as information retrieval, natural language processing, and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of each word of a set in a text. The new algorithm works offline, and its running time does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB, and 10 MB. The new algorithm was faster in every experiment, from 50% to 300% faster than Aho-Corasick.
publishDate 2020
dc.date.issued.fl_str_mv 2020
dc.date.accessioned.fl_str_mv 2021-07-20T13:15:32Z
dc.date.available.fl_str_mv 2021-07-20
2021-07-20T13:15:32Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/bachelorThesis
format bachelorThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://dspace.sti.ufcg.edu.br/handle/riufcg/20128
dc.identifier.citation.fl_str_mv LUCENA, E. L. Optimizing Aho-Corasick for word counting. 12 leaves. Undergraduate thesis, article format (Bachelor's program in Computer Science), Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande, Paraíba, Brazil, 2020. Available at: https://dspace.sti.ufcg.edu.br/handle/riufcg/20128
url https://dspace.sti.ufcg.edu.br/handle/riufcg/20128
identifier_str_mv LUCENA, E. L. Optimizing Aho-Corasick for word counting. 12 leaves. Undergraduate thesis, article format (Bachelor's program in Computer Science), Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande, Paraíba, Brazil, 2020. Available at: https://dspace.sti.ufcg.edu.br/handle/riufcg/20128
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Campina Grande
dc.publisher.initials.fl_str_mv UFCG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Centro de Engenharia Elétrica e Informática - CEEI
publisher.none.fl_str_mv Universidade Federal de Campina Grande
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UFCG
instname:Universidade Federal de Campina Grande (UFCG)
instacron:UFCG
instname_str Universidade Federal de Campina Grande (UFCG)
instacron_str UFCG
institution UFCG
reponame_str Biblioteca Digital de Teses e Dissertações da UFCG
collection Biblioteca Digital de Teses e Dissertações da UFCG
bitstream.url.fl_str_mv https://dspace.sti.ufcg.edu.br/bitstream/riufcg/20128/3/EMERSON+LEONARDO+LUCENA+-+TCC+CIE%CC%82NCIA+DA+COMPUTAC%CC%A7A%CC%83O+2020.pdf.txt
https://dspace.sti.ufcg.edu.br/bitstream/riufcg/20128/2/license.txt
https://dspace.sti.ufcg.edu.br/bitstream/riufcg/20128/1/EMERSON+LEONARDO+LUCENA+-+TCC+CIE%CC%82NCIA+DA+COMPUTAC%CC%A7A%CC%83O+2020.pdf
bitstream.checksum.fl_str_mv 579da4150bb653a14e1e2de342fedbd8
8a4605be74aa9ea9d79846c1fba20a33
120b9a03bcda7345d197fefc65fb796d
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da UFCG - Universidade Federal de Campina Grande (UFCG)
repository.mail.fl_str_mv bdtd@setor.ufcg.edu.br
_version_ 1863362943159107584