Optimizing ahocorasick for word counting.
| Main author: | LUCENA, Emerson Leonardo |
|---|---|
| Publication date: | 2020 |
| Document type: | Undergraduate final project (Trabalho de conclusão de curso) |
| Language: | eng |
| Source title: | Biblioteca Digital de Teses e Dissertações da UFCG |
| Full text: | https://dspace.sti.ufcg.edu.br/handle/riufcg/20128 |
Abstract: | The Aho-Corasick algorithm is used to recognize all occurrences of a set of strings in a text. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑜), where 𝑚 is the sum of the lengths of the keywords, 𝑛 is the text length and 𝑜 is the number of occurrences of all keywords in the text. However, when the input contains a large number of matches with the dictionary patterns, the 𝑜 term dominates and the algorithm's performance decreases. In many domains, such as information retrieval, natural language processing and DNA sequence analysis, Aho-Corasick is used for word counting, possibly with many repetitions. In this paper, we improve the Aho-Corasick algorithm to count the number of occurrences of a set of words in a text. The new algorithm works offline and does not depend on the frequencies of the dictionary words. Its time complexity is 𝑂(𝑚 + 𝑛 + 𝑢), where 𝑢 is the number of distinct keywords found in the text, and its space complexity is the same as that of Aho-Corasick. We compare the performance of the original and the new algorithm on texts of up to 100 MB and dictionaries of 1 KB, 1 MB and 10 MB. The new algorithm outperformed Aho-Corasick in every experiment, running from 50% to 300% faster. |
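The counting idea in the abstract can be illustrated with a small sketch. This is not taken from the thesis and is not claimed to be the author's algorithm: it is a well-known offline variant of Aho-Corasick that records how often each automaton state is visited while scanning the text, and only afterwards pushes those visit counts along the failure links. The function name `count_words` and all variable names are hypothetical. In this sketch the final pass touches every trie state, so its cost is 𝑂(𝑚 + 𝑛) with hash-map transitions, independent of the number of occurrences.

```python
from collections import deque


def count_words(words, text):
    """Return, for each word, how many times it occurs in text (overlaps included)."""
    # Trie stored as parallel lists; state 0 is the root.
    goto = [{}]   # goto[s] maps a character to the child state
    fail = [0]    # failure link of each state
    ends = [[]]   # indices of the words that end exactly at each state
    for i, word in enumerate(words):
        s = 0
        for ch in word:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                ends.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        ends[s].append(i)

    # Breadth-first traversal computes failure links; `order` keeps BFS order.
    order = []
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            queue.append(t)

    # Online part: scan the text once, counting state visits only.
    visits = [0] * len(goto)
    s = 0
    for ch in text:
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        visits[s] += 1

    # Offline part: deepest states first, add each state's visits to its
    # failure target, so every state ends up with the number of text
    # positions at which its string occurs as a suffix of the text read so far.
    counts = [0] * len(words)
    for s in reversed(order):
        visits[fail[s]] += visits[s]
        for i in ends[s]:
            counts[i] = visits[s]
    return counts


print(count_words(["he", "she", "his", "hers"], "ushers"))  # [1, 1, 0, 1]
```

Because the scan never enumerates individual matches, a text such as `"aaaa"` with the dictionary `["a", "aa"]` costs the same as a text with no matches at all; the counts (4 and 3) fall out of the propagation step.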
| id |
UFCG_61ba5851fc0d9d66dd1fea56450bc5ac |
|---|---|
| oai_identifier_str |
oai:dspace.sti.ufcg.edu.br:riufcg/20128 |
| network_acronym_str |
UFCG |
| network_name_str |
Biblioteca Digital de Teses e Dissertações da UFCG |
| repository_id_str |
4851 |
| dc.title.pt_BR.fl_str_mv |
Optimizing ahocorasick for word counting. |
| dc.title.alternative.pt_BR.fl_str_mv |
Otimizando ahocorasick para contagem de palavras. |
| author |
LUCENA, Emerson Leonardo. |
| author_facet |
LUCENA, Emerson Leonardo. |
| author_role |
author |
| dc.contributor.advisor1.fl_str_mv |
GHEYI, Rohit. |
| dc.contributor.advisor1ID.fl_str_mv |
GHEYI, R. |
| dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/2931270888717344 |
| dc.contributor.referee1.fl_str_mv |
MONTEIRO, João Arthur Brunet. |
| dc.contributor.referee2.fl_str_mv |
MASSONI, Tiago Lima. |
| dc.contributor.authorID.fl_str_mv |
LUCENA, E. L. |
| dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/5944567562075735 |
| dc.contributor.author.fl_str_mv |
LUCENA, Emerson Leonardo. |
| contributor_str_mv |
GHEYI, Rohit.; MONTEIRO, João Arthur Brunet.; MASSONI, Tiago Lima. |
| dc.subject.cnpq.fl_str_mv |
Ciência da Computação |
| dc.subject.por.fl_str_mv |
Aho-Corasick algorithm; Pattern matching; Correspondência de padrões; Filtrage; Coincidencia de patrones; Word counting; Recuento de palabras; Comptage de mots; Contagem de palavras; Algoritmo offline; Algorithme hors ligne; Algoritmo sin conexión; Offline algorithm; Processamento de textos; Processing of texts; Procesamiento de textos; Traitement des textes |
| publishDate |
2020 |
| dc.date.issued.fl_str_mv |
2020 |
| dc.date.accessioned.fl_str_mv |
2021-07-20T13:15:32Z |
| dc.date.available.fl_str_mv |
2021-07-20 2021-07-20T13:15:32Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/bachelorThesis |
| format |
bachelorThesis |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://dspace.sti.ufcg.edu.br/handle/riufcg/20128 |
| dc.identifier.citation.fl_str_mv |
LUCENA, E. L. Optimizing ahocorasick for word counting. 12 f. Trabalho de Conclusão de Curso - Artigo (Curso de Bacharelado em Ciência da Computação) Graduação em Ciência da Computação, Centro de Engenharia Elétrica e Informática, Universidade Federal de Campina Grande - Paraíba - Brasil, 2020. Disponível em: https://dspace.sti.ufcg.edu.br/handle/riufcg/20128 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.publisher.none.fl_str_mv |
Universidade Federal de Campina Grande |
| dc.publisher.initials.fl_str_mv |
UFCG |
| dc.publisher.country.fl_str_mv |
Brasil |
| dc.publisher.department.fl_str_mv |
Centro de Engenharia Elétrica e Informática - CEEI |
| publisher.none.fl_str_mv |
Universidade Federal de Campina Grande |
| dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da UFCG instname:Universidade Federal de Campina Grande (UFCG) instacron:UFCG |
| instname_str |
Universidade Federal de Campina Grande (UFCG) |
| instacron_str |
UFCG |
| institution |
UFCG |
| reponame_str |
Biblioteca Digital de Teses e Dissertações da UFCG |
| collection |
Biblioteca Digital de Teses e Dissertações da UFCG |
| bitstream.url.fl_str_mv |
https://dspace.sti.ufcg.edu.br/bitstream/riufcg/20128/3/EMERSON+LEONARDO+LUCENA+-+TCC+CIE%CC%82NCIA+DA+COMPUTAC%CC%A7A%CC%83O+2020.pdf.txt https://dspace.sti.ufcg.edu.br/bitstream/riufcg/20128/2/license.txt https://dspace.sti.ufcg.edu.br/bitstream/riufcg/20128/1/EMERSON+LEONARDO+LUCENA+-+TCC+CIE%CC%82NCIA+DA+COMPUTAC%CC%A7A%CC%83O+2020.pdf |
| bitstream.checksum.fl_str_mv |
579da4150bb653a14e1e2de342fedbd8 8a4605be74aa9ea9d79846c1fba20a33 120b9a03bcda7345d197fefc65fb796d |
| bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
| repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da UFCG - Universidade Federal de Campina Grande (UFCG) |
| repository.mail.fl_str_mv |
bdtd@setor.ufcg.edu.br |
| _version_ |
1863362943159107584 |