From source code identifiers to natural language terms

Carvalho, Nuno Alexandre Ramos; Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João

Bibliographic Details
Main Author:	Carvalho, Nuno Alexandre Ramos
Publication Date:	2015
Other Authors:	Almeida, J. J.; Henriques, Pedro Rangel; Varanda, Maria João
Format:	Article
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	https://hdl.handle.net/1822/53525
Summary:	Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented.
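
The distinction the summary draws between hard splitting and dictionary-based splitting can be illustrated with a minimal sketch. This is not the Lingua::IdSplitter implementation (a Perl module whose API and ranking strategy are not described in this record); the word list, the abbreviation table, and the function names hard_split, soft_split and split_and_expand are all hypothetical, and the "fewest terms" rule is an assumption chosen only to keep the example short.

```python
import re

# Hypothetical toy dictionaries; a real run would load much larger word lists.
WORDS = {"time", "sort", "file", "name", "user", "list"}
ABBREVIATIONS = {"fn": "function", "str": "string", "buf": "buffer"}


def hard_split(identifier):
    """Split on explicit marks only: underscores and CamelCase boundaries."""
    parts = []
    for chunk in identifier.split("_"):
        parts.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", chunk))
    return [p.lower() for p in parts]


def soft_split(token, vocabulary):
    """Segment a mark-free token into the fewest vocabulary terms, or keep it whole."""
    best = [None] * (len(token) + 1)  # best[i] holds the best split of token[:i]
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [token]


def split_and_expand(identifier):
    """Hard split first, then dictionary split each token and expand abbreviations."""
    vocabulary = WORDS | set(ABBREVIATIONS)
    terms = []
    for token in hard_split(identifier):
        terms.extend(soft_split(token, vocabulary))
    return [ABBREVIATIONS.get(term, term) for term in terms]
```

With these toy dictionaries, hard_split("timesort") leaves the identifier as the single token "timesort" because it carries no explicit mark, while split_and_expand("timesort") yields the terms "time" and "sort", and split_and_expand("strbuf_fileName") yields "string", "buffer", "file", "name".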

Item metadata

id	RCAP_c68e92d4386a23afa3578b768dc7981d
oai_identifier_str	oai:repositorium.sdum.uminho.pt:1822/53525
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
spelling	From source code identifiers to natural language terms Program comprehension Natural language processing Identifier splitting Science & Technology Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. This work is funded by National Funds through the FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) within project PEst-OE/EEI/UI0752/2014. info:eu-repo/semantics/publishedVersion Elsevier Universidade do Minho Carvalho, Nuno Alexandre Ramos Almeida, J. J. Henriques, Pedro Rangel Varanda, Maria João 2015 2015-01-01T00:00:00Z info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/article application/pdf https://hdl.handle.net/1822/53525 eng 0164-1212 1873-1228 10.1016/j.jss.2014.10.013 info:eu-repo/semantics/openAccess reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP 2025-04-12T04:03:20Z oai:repositorium.sdum.uminho.pt:1822/53525 Portal Agregador ONG https://www.rcaap.pt/oai/openaire info@rcaap.pt opendoar:https://opendoar.ac.uk/repository/7160 2025-05-28T14:49:44.260601 Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia false
dc.title.none.fl_str_mv	From source code identifiers to natural language terms
title	From source code identifiers to natural language terms
spellingShingle	From source code identifiers to natural language terms Carvalho, Nuno Alexandre Ramos Program comprehension Natural language processing Identifier splitting Science & Technology
title_short	From source code identifiers to natural language terms
title_full	From source code identifiers to natural language terms
title_fullStr	From source code identifiers to natural language terms
title_full_unstemmed	From source code identifiers to natural language terms
title_sort	From source code identifiers to natural language terms
author	Carvalho, Nuno Alexandre Ramos
author_facet	Carvalho, Nuno Alexandre Ramos Almeida, J. J. Henriques, Pedro Rangel Varanda, Maria João
author_role	author
author2	Almeida, J. J. Henriques, Pedro Rangel Varanda, Maria João
author2_role	author author author
dc.contributor.none.fl_str_mv	Universidade do Minho
dc.contributor.author.fl_str_mv	Carvalho, Nuno Alexandre Ramos Almeida, J. J. Henriques, Pedro Rangel Varanda, Maria João
dc.subject.por.fl_str_mv	Program comprehension Natural language processing Identifier splitting Science & Technology
topic	Program comprehension Natural language processing Identifier splitting Science & Technology
description	Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single or multiple domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented.
publishDate	2015
dc.date.none.fl_str_mv	2015 2015-01-01T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/1822/53525
url	https://hdl.handle.net/1822/53525
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	0164-1212 1873-1228 10.1016/j.jss.2014.10.013
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Elsevier
publisher.none.fl_str_mv	Elsevier
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833594915936796672

Similar Items