Semantic Similarity Match for Data Quality

Bibliographic details
Main author: Martins, Fernando
Publication date: 2007
Other authors: Falcão, André, Couto, Francisco M.
Document type: Report
Language: por
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10451/14158
Abstract: Data quality is a critical aspect of applications that support business operations. Often entities are represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture text sense. The difficulties increase when the duplicate detection must be performed using the text sense. This work presents a semantic similarity approach, based on a text sense matching mechanism, that performs the detection of text units which are similar in sense. The goal of the proposed semantic similarity approach is therefore to perform the duplicate detection task in a data quality process.
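The contrast the abstract draws between lexical and sense-based matching can be illustrated with a minimal sketch. This is not the method of the report: the sense table `TOY_SENSES`, the helper names, and the use of Jaccard overlap are all assumptions made here for illustration, and a real system would likely draw senses from a lexical database such as WordNet (listed among the record's keywords).

```python
import re

# Hypothetical toy sense inventory (an assumption for this sketch):
# words that share a meaning map to the same made-up sense label.
TOY_SENSES = {
    "car": "vehicle.auto",
    "automobile": "vehicle.auto",
    "bought": "act.purchase",
    "purchased": "act.purchase",
}


def tokens(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def lexical_similarity(t1, t2):
    """Compare surface tokens only, so synonyms never match."""
    return jaccard(tokens(t1), tokens(t2))


def sense_similarity(t1, t2):
    """Map each token to its toy sense label (or itself) before comparing."""
    s1 = {TOY_SENSES.get(w, w) for w in tokens(t1)}
    s2 = {TOY_SENSES.get(w, w) for w in tokens(t2)}
    return jaccard(s1, s2)


a = "John bought a car"
b = "John purchased an automobile"
print(f"lexical: {lexical_similarity(a, b):.2f}")
print(f"sense:   {sense_similarity(a, b):.2f}")
```

On this pair the two records share only one surface token, so the lexical score stays low, while mapping words to shared sense labels raises the score enough that a threshold could flag the pair as a candidate duplicate.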
id RCAP_0e9d53a6bf5198af2fa7f93a19890d3d
oai_identifier_str oai:repositorio.ulisboa.pt:10455/3050
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Semantic Similarity Match for Data Quality
title Semantic Similarity Match for Data Quality
author Martins, Fernando
author_facet Martins, Fernando
Falcão, André
Couto, Francisco M.
author_role author
author2 Falcão, André
Couto, Francisco M.
author2_role author
author
dc.contributor.none.fl_str_mv Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Martins, Fernando
Falcão, André
Couto, Francisco M.
dc.subject.por.fl_str_mv semantic similarity
data cleaning
data quality
wordnet
similarity match
publishDate 2007
dc.date.none.fl_str_mv 2007-10
2007-10-01T00:00:00Z
2009-02-10T13:12:03Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/report
format report
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/14158
url http://hdl.handle.net/10451/14158
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Department of Informatics, University of Lisbon
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833601431516479488