Document retrieval for question answering : a quantitative evaluation of text preprocessing

Bibliographic Details
Main Author: Carvalho, Gracinda
Publication Date: 2007
Other Authors: Matos, David Martins de, Rocio, Vitor
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10400.2/5966
Summary: Question Answering (QA) has attracted sustained research interest, motivated in part by the international QA evaluation forums: the Text REtrieval Conference (TREC) and, more recently, the Cross Language Evaluation Forum (CLEF) through QA@CLEF, which has included Portuguese since 2004. In these forums, participating systems are given a document collection and a set of questions to answer. Each system is evaluated as a whole on its ability to answer the questions, and relatively few published results examine the performance of individual components or their influence on overall system performance. This is the case for the Information Retrieval (IR) component, which is widely used in QA systems. Our work examines the options for preprocessing Portuguese text before it is passed to the IR component, and evaluates their impact on IR performance in the specific context of QA, so that the choice among them can be well founded. We conclude that the basic preprocessing techniques, case folding and removal of punctuation marks, bring a clear advantage. Of the other techniques considered, stop word removal improved the performance of the IR system, whereas stemming and lemmatization did not.
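As a rough illustration of the three preprocessing steps the summary reports as beneficial (case folding, punctuation removal, and stop word removal), the following minimal Python sketch applies them to a Portuguese question. It is not the authors' implementation: the function name `preprocess`, the tiny `STOP_WORDS` set, and the sample question are all assumptions made for the example, and the record does not say which stop word list or tokenizer the paper used.

```python
import unicodedata

# Tiny illustrative stop word list (an assumption, not the list used in the paper).
STOP_WORDS = {"a", "o", "as", "os", "de", "da", "do", "em", "e", "que", "um", "uma"}


def preprocess(text, remove_stop_words=True):
    """Case-fold, strip punctuation, and optionally drop stop words."""
    # Case folding: map every character to its caseless form.
    folded = text.casefold()
    # Punctuation removal: replace each Unicode punctuation character
    # (general category starting with "P") with a space.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in folded
    )
    # Whitespace tokenization.
    tokens = no_punct.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess("Quem foi o primeiro presidente da República Portuguesa?"))
# prints ['quem', 'foi', 'primeiro', 'presidente', 'república', 'portuguesa']
```

Note that this sketch treats hyphens as punctuation, so hyphenated Portuguese forms are split into separate tokens; whether that is desirable depends on the retrieval setup. Stemming and lemmatization are left out because, according to the summary, they did not improve retrieval performance.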
id RCAP_e052a683aa78b1afa1a70a06cb03e543
oai_identifier_str oai:repositorioaberto.uab.pt:10400.2/5966
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.contributor.none.fl_str_mv Repositório Aberto
dc.contributor.author.fl_str_mv Carvalho, Gracinda
Matos, David Martins de
Rocio, Vitor
dc.subject.por.fl_str_mv Information retrieval
Question answering
publishDate 2007
dc.date.none.fl_str_mv 2007
2007-01-01T00:00:00Z
2017-01-24T10:28:29Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.2/5966
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 978-1-59593-832-9
10.1145/1316874.1316894
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv ACM
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia