The code behind the source code: a change of representation for similarity analysis

Bibliographic details
Year of defense: 2019
Main author: França, Allyson Bonetti
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: por
Institution of defense: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://www.repositorio.ufc.br/handle/riufc/44452
Abstract: The similarity identified between source code submitted by students in programming courses is often used as an indication of plagiarism by teachers and/or automated submission systems. Techniques for measuring similarity include the use of syntactic structures and lexical patterns. Despite the large number of available tools that use this technique, few identify similarity effectively, owing to the inherent complexity of this type of analysis. One limitation of this technique is its sensitivity to changes, even subtle ones, in the organization of characters and symbols and in their arrangement along the expressions. We investigate the similarity between codes that have been preprocessed to remove the irrelevant aspects of coding and to highlight their relevant characteristics. The contributions presented lie in the preprocessing step, which results in a recoding, and in a transformation that changes the representation domain: the code behind the source code. These contributions aim to mitigate the complexity of code analysis through the use of syntactic structures and lexical patterns. From the recoding perspective, invasive adjustments are made to, and analyzed in, the structures of the original code that, even under deliberate modification by the developer, preserve characteristics that are difficult to compromise or corrupt. This analysis is done with the Sherlock N-Overlap algorithm and two types of preprocessing, called normalizations, one of which is proposed in this work. From the transformation perspective, the source codes were represented as digital images. For this purpose, I-Sim was developed; it translates the organization of a program's structures into visual arrangements that allow similarity between source codes to be identified. Experiments were conducted with 84 purposely modified source codes and a dataset of 2160 codes written by students of engineering courses in programming classes. In this latter set, the similarity status was not known in advance, so precision and recall were calculated in a relative way, using a set of reference tools as a kind of oracle. The results show that, in most cases, the similarity indexes of the solutions developed are superior to those of reference tools in the literature, such as SIM, JPlag and MOSS.
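
Illustrative sketch (not part of the original record): the abstract describes normalizing source code before measuring lexical overlap. The Python below is a minimal sketch of that general idea only. It is not the Sherlock N-Overlap algorithm, nor the normalization proposed in the thesis. The function names, the keyword list, the C-like comment syntax, the trigram size and the Jaccard ratio are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword list for a C-like language; a real tool would use the full set.
KEYWORDS = {"int", "float", "for", "while", "if", "else", "return"}

# Identifiers, numeric literals, or any single non-whitespace symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def normalize(source: str) -> list[str]:
    """Drop comments and layout, and replace identifiers and numbers by placeholders."""
    # Strip // line comments and /* ... */ block comments (assumes C-like syntax;
    # comment markers inside string literals are not handled in this sketch).
    source = re.sub(r"//[^\n]*|/\*.*?\*/", " ", source, flags=re.DOTALL)
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)        # keywords carry structure, keep them
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")       # renaming a variable no longer matters
        elif tok[0].isdigit():
            tokens.append("NUM")      # changing a constant no longer matters
        else:
            tokens.append(tok)        # operators and punctuation
    return tokens


def ngrams(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard ratio of shared token n-grams, between 0.0 and 1.0."""
    ga, gb = ngrams(normalize(a), n), ngrams(normalize(b), n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)
```

Under these assumptions, two submissions that differ only in identifier names, whitespace and comments normalize to the same token sequence and therefore receive the maximum score, which is the kind of insensitivity to superficial edits that the abstract attributes to preprocessing.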