Estudo de modelos de word embedding (Study of word embedding models)

Bibliographic details
Main author: Sousa, Samanta de
Publication date: 2016
Document type: Undergraduate thesis (Trabalho de conclusão de curso)
Language: Portuguese (por)
Source title: Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
Full text: http://repositorio.utfpr.edu.br/jspui/handle/1/12522
Abstract: The field of Artificial Intelligence (AI) seeks to build mechanisms that simulate human intelligence so that they can perform tasks that assist people. Natural Language Processing (NLP) is a sub-area of AI that seeks to understand and generate natural language; AI uses NLP as a means to improve mechanisms that rely on natural language, such as writing and producing text, translation, and learning and teaching, among others. Language follows an unstructured format that is difficult for a computer to process: morphological and syntactic variation, together with the ambiguity of natural language, hinders comprehension. Methods in the area therefore convert this information into forms that a computer can manipulate more easily. Among the existing representations, Word Embedding is currently a trending technique in NLP: information is represented as vectors whose values are similar when the words are similar, that is, the representation encodes similarity relations between words, and it has a low computational cost. The goal of this work was to compare three Word Embedding models, CBOW, Skip-gram, and GloVe, in order to identify which performs best at generating word representation vectors (embeddings). First, a corpus was built from Wikipedia and then pre-processed for use as a training set. The models were trained with scripts written using the Gensim and Glove Python libraries, and the embeddings were evaluated with the files made available by Pennington et al. (2014). In each evaluation/test, the parameters were modified to verify their influence on model performance. Some specific settings needed to run the training were identified and reported in the study. The results showed that CBOW performed best in most of the tests. The Word Embedding technique was found to encode similarity information between words reasonably well, even with parameter values that are small compared to those used in other works.
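The abstract's central idea, that similar words receive similar vectors and that embeddings can be scored against evaluation files, can be illustrated with a minimal sketch. It assumes the evaluation files hold word-analogy questions of the form "a b c d" (a is to b as c is to d), as distributed with GloVe; the record does not state the file format. The toy vectors, the question list, and the function names are hypothetical and are not taken from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of in-vocabulary questions (a, b, c, d) answered with d."""
    usable = [q for q in questions if all(w in emb for w in q)]
    if not usable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in usable)
    return correct / len(usable)


# Hypothetical 3-dimensional embeddings (royalty, male, female), illustration only.
TOY_EMBEDDINGS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

# Hypothetical questions; the last one is skipped because two words are unknown.
TOY_QUESTIONS = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
    ("man", "woman", "prince", "princess"),
]

if __name__ == "__main__":
    print(solve_analogy(TOY_EMBEDDINGS, "man", "woman", "king"))
    print(analogy_accuracy(TOY_EMBEDDINGS, TOY_QUESTIONS))
```

With trained models, the dictionary would be replaced by the vectors each model produces, so the same accuracy function could compare CBOW, Skip-gram, and GloVe on identical questions. Skipping out-of-vocabulary questions is a design choice of this sketch; counting them as errors would penalise models trained on smaller corpora.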
id UTFPR-12_772ca5b09b9242a245557b9ec7a6a4b2
oai_identifier_str oai:repositorio.utfpr.edu.br:1/12522
network_acronym_str UTFPR-12
network_name_str Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
repository_id_str
dc.title.none.fl_str_mv Estudo de modelos de word embedding
Study word embedding models
title Estudo de modelos de word embedding
author Sousa, Samanta de
author_facet Sousa, Samanta de
author_role author
dc.contributor.none.fl_str_mv Candido Junior, Arnaldo
Hartmann, Nathan Siegle
Candido Junior, Arnaldo
Aikes Junior, Jorge
Pessini, Evando Carlos
dc.contributor.author.fl_str_mv Sousa, Samanta de
dc.subject.por.fl_str_mv Inteligencia Artificial
Processamento de linguagem natural (Computação)
Bibliotecas digitais
Artificial intelligence
Natural language processing (Computer science)
Digital libraries
CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate 2016
dc.date.none.fl_str_mv 2016-11-16
2020-11-16T13:09:45Z
2020-11-16T13:09:45Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/bachelorThesis
format bachelorThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv SOUSA, Samanta de. Estudo de modelos de word embedding. 2016. 53 f. Trabalho de Conclusão de Curso (Graduação) - Universidade Tecnológica Federal do Paraná, Medianeira, 2016.
http://repositorio.utfpr.edu.br/jspui/handle/1/12522
identifier_str_mv SOUSA, Samanta de. Estudo de modelos de word embedding. 2016. 53 f. Trabalho de Conclusão de Curso (Graduação) - Universidade Tecnológica Federal do Paraná, Medianeira, 2016.
url http://repositorio.utfpr.edu.br/jspui/handle/1/12522
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Tecnológica Federal do Paraná
Medianeira
Brasil
Graduação em Ciência da Computação
UTFPR
dc.source.none.fl_str_mv reponame:Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
instname:Universidade Tecnológica Federal do Paraná (UTFPR)
instacron:UTFPR
instname_str Universidade Tecnológica Federal do Paraná (UTFPR)
instacron_str UTFPR
institution UTFPR
reponame_str Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
collection Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT))
repository.name.fl_str_mv Repositório Institucional da UTFPR (da Universidade Tecnológica Federal do Paraná (RIUT)) - Universidade Tecnológica Federal do Paraná (UTFPR)
repository.mail.fl_str_mv riut@utfpr.edu.br || sibi@utfpr.edu.br
_version_ 1850497890213953536