O código por trás do código-fonte: mudança de representação para a análise de similaridade

França, Allyson Bonetti

Bibliographic Details
Main Author:	França, Allyson Bonetti
Publication Date:	2019
Format:	Doctoral thesis
Language:	por
Source:	Repositório Institucional da Universidade Federal do Ceará (UFC)
Download full:	http://www.repositorio.ufc.br/handle/riufc/44452
Summary:	Similarity between source code submitted by students in programming courses is often taken as an indication of plagiarism by teachers and/or automated submission systems. Techniques for measuring similarity include the use of syntactic structures and lexical patterns. Despite the large number of tools that use this technique, few identify similarity effectively, owing to the inherent complexity of this type of analysis. One limitation of the technique is its sensitivity to changes, even subtle ones, in the organization of characters and symbols and in their arrangement within expressions. This work investigates similarity between preprocessed codes, with preprocessing intended to remove irrelevant aspects of the coding and to highlight its relevant characteristics. The contributions lie in the preprocessing step, which results in a recoding, and in a transformation that changes the representation domain: the code behind the source code. These contributions aim to mitigate the complexity of analyzing code through syntactic structures and lexical patterns. From the recoding perspective, invasive adjustments are made to the structures of the original code and analyzed; even when the developer has modified the code on purpose, these adjustments preserve characteristics that are difficult to compromise or corrupt. This analysis uses the Sherlock N-Overlap algorithm and two types of preprocessing, called normalizations, one of which is proposed in this work. From the transformation perspective, source codes are represented as digital images. To this end, I-Sim was developed; it translates the organization of a program's structures into visual arrangements that allow similarity between source codes to be identified. Experiments were conducted with 84 purposely modified source codes and a base of 2160 codes written by engineering students in programming classes. For the latter set, whether purposeful modifications had occurred was not known in advance, so a specific method was used to calculate precision and recall in relative terms, using a set of reference tools as oracles. The results show that, in most cases, the similarity indexes of the developed solutions are higher than those of SIM, JPlag and MOSS, tools used as references in the literature.
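The summary describes normalizing source code before comparing it and scoring results against oracle tools. As a rough illustration of that general idea only, the Python sketch below strips comments, replaces identifiers and numbers with placeholder tokens, and compares the sets of token n-grams of two programs. It is not the thesis's Sherlock N-Overlap implementation, nor either of its normalizations, nor I-Sim; the function names, the keyword list, the n-gram size and the Jaccard-style score are all assumptions made for this example.

```python
import re

# Assumed keyword list for a C-like language; illustrative only.
KEYWORDS = {"int", "float", "for", "while", "if", "else", "return"}


def normalize(source: str) -> list[str]:
    """Hypothetical normalization: drop comments, abstract names and numbers."""
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)  # line comments (naive: ignores strings)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    normalized = []
    for tok in tokens:
        if tok in KEYWORDS:
            normalized.append(tok)      # keep language keywords as they are
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            normalized.append("ID")     # any identifier becomes the same token
        elif tok.isdigit():
            normalized.append("NUM")    # any integer literal becomes the same token
        else:
            normalized.append(tok)      # punctuation and operators are kept
    return normalized


def ngrams(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
    """Set of contiguous token n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_similarity(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by all n-grams (a Jaccard-style score in [0, 1])."""
    ga, gb = ngrams(normalize(a), n), ngrams(normalize(b), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def relative_precision_recall(flagged: set, oracle: set) -> tuple[float, float]:
    """Precision and recall of flagged pairs relative to pairs flagged by an oracle."""
    true_positives = len(flagged & oracle)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(oracle) if oracle else 0.0
    return precision, recall


if __name__ == "__main__":
    original = "int soma(int a, int b) { return a + b; } // add"
    renamed = "int add(int x, int y) { /* sum */ return x + y; }"
    print(overlap_similarity(original, renamed))
```

In this sketch, renaming identifiers or editing comments leaves the normalized token stream unchanged, which is the kind of robustness to superficial edits that the summary attributes to preprocessing. Reordering statements or restructuring control flow would still lower the score.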

Item metadata

id	UFC-7_6effc6cba2a9c9150b32a15b40b6bb33
oai_identifier_str	oai:repositorio.ufc.br:riufc/44452
network_acronym_str	UFC-7
network_name_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
repository_id_str
dc.title.pt_BR.fl_str_mv	O código por trás do código-fonte: mudança de representação para a análise de similaridade
title	O código por trás do código-fonte: mudança de representação para a análise de similaridade
author	França, Allyson Bonetti
author_facet	França, Allyson Bonetti
author_role	author
dc.contributor.co-advisor.none.fl_str_mv	Barroso, Giovanni Cordeiro
dc.contributor.author.fl_str_mv	França, Allyson Bonetti
dc.contributor.advisor1.fl_str_mv	Soares, José Marques
contributor_str_mv	Soares, José Marques
dc.subject.por.fl_str_mv	Teleinformatics; Plagiarism - Identification; Digital images; Similarity between source codes; Plagiarism detection; Similarity investigation tool; Method of conformity
topic	Teleinformatics; Plagiarism - Identification; Digital images; Similarity between source codes; Plagiarism detection; Similarity investigation tool; Method of conformity
publishDate	2019
dc.date.accessioned.fl_str_mv	2019-08-05T14:25:59Z
dc.date.available.fl_str_mv	2019-08-05T14:25:59Z
dc.date.issued.fl_str_mv	2019-01-21
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	FRANÇA, A. B. O código por trás do código-fonte: mudança de representação para a análise de similaridade. 2019. 99 f. Tese (Doutorado em Engenharia de Teleinformática)-Centro de Tecnologia, Universidade Federal do Ceará, Fortaleza, 2019.
dc.identifier.uri.fl_str_mv	http://www.repositorio.ufc.br/handle/riufc/44452
identifier_str_mv	FRANÇA, A. B. O código por trás do código-fonte: mudança de representação para a análise de similaridade. 2019. 99 f. Tese (Doutorado em Engenharia de Teleinformática)-Centro de Tecnologia, Universidade Federal do Ceará, Fortaleza, 2019.
url	http://www.repositorio.ufc.br/handle/riufc/44452
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.source.none.fl_str_mv	reponame:Repositório Institucional da Universidade Federal do Ceará (UFC) instname:Universidade Federal do Ceará (UFC) instacron:UFC
instname_str	Universidade Federal do Ceará (UFC)
instacron_str	UFC
institution	UFC
reponame_str	Repositório Institucional da Universidade Federal do Ceará (UFC)
collection	Repositório Institucional da Universidade Federal do Ceará (UFC)
bitstream.url.fl_str_mv	http://repositorio.ufc.br/bitstream/riufc/44452/3/2019_tese_abfranca.pdf http://repositorio.ufc.br/bitstream/riufc/44452/4/license.txt
bitstream.checksum.fl_str_mv	8d45c342809d860232b4d8edb93b2be2 8a4605be74aa9ea9d79846c1fba20a33
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC)
repository.mail.fl_str_mv	bu@ufc.br || repositorio@ufc.br
_version_	1847792106743005184

Similar Items