From source code identifiers to natural language terms

Bibliographic Details
Main Author: Carvalho, Nuno Ramos
Publication Date: 2015
Other Authors: Almeida, José João; Henriques, Pedro Rangel; Pereira, Maria João
Format: Article
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10198/11577
Summary: Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding the strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, which is likely to include application domain terms and specific abbreviations. This approach was applied to two software packages written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented.
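Illustration: the summary distinguishes hard splitting (on explicit marks such as CamelCase or underscores) from dictionary-based splitting and abbreviation expansion. The Python sketch below shows the general idea only. It is not the Lingua::IdSplitter implementation or its API: the toy dictionaries, the function names, and the "fewest terms" tie-breaking rule are all assumptions made for this example.

```python
import re

# Hypothetical toy dictionaries, for illustration only (not from the paper).
WORDS = {"time", "sort", "list", "file", "name", "get", "user"}
ABBREVIATIONS = {"usr": "user", "fn": "function", "str": "string", "buf": "buffer"}


def hard_split(identifier):
    """Split on explicit marks: underscores and case changes."""
    parts = []
    for chunk in re.split(r"[_\W]+", identifier):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def soft_split(token, vocabulary):
    """Segment a token with no explicit marks into vocabulary entries.

    Uses dynamic programming and prefers the segmentation with the fewest
    terms (an assumed heuristic). Returns [token] if none exists.
    """
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


def split_and_expand(identifier):
    """Hard split, then soft split each token, then expand abbreviations."""
    vocabulary = WORDS | set(ABBREVIATIONS)
    terms = []
    for token in hard_split(identifier):
        terms.extend(soft_split(token, vocabulary))
    return [ABBREVIATIONS.get(term, term) for term in terms]


if __name__ == "__main__":
    for name in ("getUserName", "timesort_list", "usrname"):
        print(name, "->", split_and_expand(name))
```

In this sketch, an identifier such as timesort_list is first cut at the underscore, and the remaining token timesort, which carries no explicit mark, is then segmented against the dictionary. A custom dictionary built from a project's own documentation, as the summary describes, would simply extend the two toy dictionaries here.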
id RCAP_c2d1c8e32e05bc2db2ad41256d33a9d9
oai_identifier_str oai:bibliotecadigital.ipb.pt:10198/11577
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling From source code identifiers to natural language terms
Program comprehension; Natural language processing; Identifier splitting
This work is funded by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project PEst-OE/EEI/UI0752/2014. We would like to thank the reviewers for their valuable insight and detailed comments, which aided in improving this paper.
We would like to thank Latifa Guerrouj, Philippe Galinier, Yann-Gaël Guéhéneuc, Giuliano Antoniol, and Massimiliano Di Penta for their work in Guerrouj et al. (2012), and Emily Hill, David Binkley, Dawn Lawrie, Lori Pollock and K. Vijay-Shanker for their work in Hill et al. (2013), which allowed the experimental comparison between approaches.
dc.title.none.fl_str_mv From source code identifiers to natural language terms
title From source code identifiers to natural language terms
spellingShingle From source code identifiers to natural language terms
Carvalho, Nuno Ramos
Program comprehension
Natural language processing
Identifier splitting
title_short From source code identifiers to natural language terms
title_full From source code identifiers to natural language terms
title_fullStr From source code identifiers to natural language terms
title_full_unstemmed From source code identifiers to natural language terms
title_sort From source code identifiers to natural language terms
author Carvalho, Nuno Ramos
author_facet Carvalho, Nuno Ramos
Almeida, José João
Henriques, Pedro Rangel
Pereira, Maria João
author_role author
author2 Almeida, José João
Henriques, Pedro Rangel
Pereira, Maria João
author2_role author
author
author
dc.contributor.none.fl_str_mv Biblioteca Digital do IPB
dc.contributor.author.fl_str_mv Carvalho, Nuno Ramos
Almeida, José João
Henriques, Pedro Rangel
Pereira, Maria João
dc.subject.por.fl_str_mv Program comprehension
Natural language processing
Identifier splitting
topic Program comprehension
Natural language processing
Identifier splitting
publishDate 2015
dc.date.none.fl_str_mv 2015-01-15T12:46:09Z
2015
2015-01-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10198/11577
url http://hdl.handle.net/10198/11577
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv Carvalho, Nuno; Almeida, José João; Henriques, Pedro; Pereira, Maria João (2015). From source code identifiers to natural language terms. Journal of Systems and Software. ISSN 0164-1212. 100, p. 117-128
0164-1212
10.1016/j.jss.2014.10.013
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833591926011461632