O código por trás do código-fonte: mudança de representação para a análise de similaridade

Bibliographic Details
Main Author: França, Allyson Bonetti
Publication Date: 2019
Format: Doctoral thesis
Language: por
Source: Repositório Institucional da Universidade Federal do Ceará (UFC)
Download full: http://www.repositorio.ufc.br/handle/riufc/44452
Summary: Similarity between source code submitted by students in programming courses is often treated as an indication of plagiarism by instructors and/or automated submission systems. Techniques for measuring similarity include the use of syntactic structures and lexical patterns. Despite the large number of tools that use this technique, few identify similarity effectively, owing to the inherent complexity of this type of analysis. One limitation of the technique is its sensitivity to changes, even subtle ones, in the organization of characters and symbols and in their arrangement within expressions. This work investigates the similarity between preprocessed codes, with preprocessing intended to remove irrelevant aspects of the coding and to highlight its relevant characteristics. The contributions lie in the preprocessing step, which results in a recoding, and in a transformation that changes the representation domain: the code behind the source code. These contributions aim to mitigate the complexity of analyzing code through syntactic structures and lexical patterns. From the recoding perspective, invasive adjustments are made to the structures of the original code and analyzed; even under deliberate modifications by the developer, these structures preserve characteristics that are difficult to compromise or corrupt. This analysis uses the Sherlock N-Overlap algorithm and two types of preprocessing, called normalizations, one of which is proposed in this work. From the transformation perspective, source codes are represented as digital images. For this purpose, I-Sim was developed; it translates the organization of a program's structures into visual arrangements that allow similarity between source codes to be identified. Experiments were conducted with 84 purposely modified source codes and a base of 2160 codes written by engineering students in programming classes. For the latter set, the similarity status was not known in advance, so a method was used to compute precision and recall in relative terms, using a set of reference tools as a kind of oracle. The results show that, in most cases, the similarity indexes of the developed solutions are higher than those of reference tools in the literature, such as SIM, JPlag and MOSS.
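The summary names two ideas that a short example can make concrete: comparing normalized code through overlapping token sequences, and scoring a detector relative to an oracle built from reference tools. The Python sketch below is illustrative only and is not taken from the thesis. The `normalize` rules, the keyword list, the Jaccard-style score in `overlap_similarity`, and the treatment of the oracle as a plain set of flagged pairs are all assumptions; the actual Sherlock N-Overlap algorithm, the normalizations studied, and the relative precision/recall method may differ.

```python
import re

# Hypothetical keyword list for C-like student code (an assumption).
KEYWORDS = {"if", "else", "for", "while", "return", "int", "float", "char", "void"}


def normalize(source):
    """Assumed normalization: strip comments, map identifiers to ID and numbers to NUM."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)                    # line comments
    tokens = re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", source)
    normalized = []
    for tok in tokens:
        if tok in KEYWORDS:
            normalized.append(tok)      # keep language keywords as they are
        elif re.match(r"[A-Za-z_]", tok):
            normalized.append("ID")     # renaming a variable no longer matters
        elif tok[0].isdigit():
            normalized.append("NUM")    # changing a constant no longer matters
        else:
            normalized.append(tok)      # operators and punctuation
    return normalized


def ngrams(tokens, n=3):
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_similarity(code_a, code_b, n=3):
    """Shared n-grams divided by all distinct n-grams (Jaccard index), in [0, 1]."""
    grams_a = ngrams(normalize(code_a), n)
    grams_b = ngrams(normalize(code_b), n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def relative_precision_recall(flagged_pairs, oracle_pairs):
    """Precision and recall of flagged pairs against pairs flagged by the oracle."""
    true_positives = len(flagged_pairs & oracle_pairs)
    precision = true_positives / len(flagged_pairs) if flagged_pairs else 0.0
    recall = true_positives / len(oracle_pairs) if oracle_pairs else 0.0
    return precision, recall


original = "int sum = 0; for (i = 0; i < n; i++) sum += v[i];"
renamed = "int total = 0; /* add */ for (k = 0; k < size; k++) total += data[k];"
print(overlap_similarity(original, renamed))
```

In this sketch, renaming variables and inserting a comment leave the normalized token stream unchanged, which is the kind of superficial modification the summary says lexical techniques are sensitive to when no preprocessing is applied.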
id UFC-7_6effc6cba2a9c9150b32a15b40b6bb33
oai_identifier_str oai:repositorio.ufc.br:riufc/44452
network_acronym_str UFC-7
network_name_str Repositório Institucional da Universidade Federal do Ceará (UFC)
repository_id_str
dc.title.pt_BR.fl_str_mv O código por trás do código-fonte: mudança de representação para a análise de similaridade
title O código por trás do código-fonte: mudança de representação para a análise de similaridade
author França, Allyson Bonetti
author_facet França, Allyson Bonetti
author_role author
dc.contributor.co-advisor.none.fl_str_mv Barroso, Giovanni Cordeiro
dc.contributor.author.fl_str_mv França, Allyson Bonetti
dc.contributor.advisor1.fl_str_mv Soares, José Marques
contributor_str_mv Soares, José Marques
dc.subject.por.fl_str_mv Teleinformática
Plágio - Identificação
Imagens digitais
Similarity between source codes
Plagiarism detection
Similarity investigation tool
Method of conformity
publishDate 2019
dc.date.accessioned.fl_str_mv 2019-08-05T14:25:59Z
dc.date.available.fl_str_mv 2019-08-05T14:25:59Z
dc.date.issued.fl_str_mv 2019-01-21
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv FRANÇA, A. B. O código por trás do código-fonte: mudança de representação para a análise de similaridade. 2019. 99 f. Tese (Doutorado em Engenharia de Teleinformática)-Centro de Tecnologia, Universidade Federal do Ceará, Fortaleza, 2019.
dc.identifier.uri.fl_str_mv http://www.repositorio.ufc.br/handle/riufc/44452
identifier_str_mv FRANÇA, A. B. O código por trás do código-fonte: mudança de representação para a análise de similaridade. 2019. 99 f. Tese (Doutorado em Engenharia de Teleinformática)-Centro de Tecnologia, Universidade Federal do Ceará, Fortaleza, 2019.
url http://www.repositorio.ufc.br/handle/riufc/44452
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositório Institucional da Universidade Federal do Ceará (UFC)
instname:Universidade Federal do Ceará (UFC)
instacron:UFC
instname_str Universidade Federal do Ceará (UFC)
instacron_str UFC
institution UFC
reponame_str Repositório Institucional da Universidade Federal do Ceará (UFC)
collection Repositório Institucional da Universidade Federal do Ceará (UFC)
bitstream.url.fl_str_mv http://repositorio.ufc.br/bitstream/riufc/44452/3/2019_tese_abfranca.pdf
http://repositorio.ufc.br/bitstream/riufc/44452/4/license.txt
bitstream.checksum.fl_str_mv 8d45c342809d860232b4d8edb93b2be2
8a4605be74aa9ea9d79846c1fba20a33
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC)
repository.mail.fl_str_mv bu@ufc.br || repositorio@ufc.br
_version_ 1847792106743005184