
Automatic parallel corpora and bilingual terminology extraction from parallel WebSites

Bibliographic Details
Main Author: Almeida, J. J.
Publication Date: 2010
Other Authors: Simões, Alberto
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/1822/16442
Summary: Parallel corpora are now so important that they need no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are large corpora on politics, legislation, medicine and other specific areas, but corpora for many other areas are missing. There is currently a large investment in using the Web as a corpus. This article presents GWB, a tool for the automatic construction of parallel corpora from the web. We argue that it is possible to build high-quality terminological corpora automatically, just by specifying a sensible Internet domain and an appropriate set of seed keywords. GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the web, sentence-level alignment and analysis of its quality, extraction of bilingual dictionaries and terminology, and construction of off-line dictionaries.
id RCAP_ce8bf8bd87ed654ae589b92045bef9fb
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/16442
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
title Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
spellingShingle Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
Almeida, J. J.
Parallel corpora
Bilingual terminology
Web as corpora
Social Sciences
author Almeida, J. J.
author_facet Almeida, J. J.
Simões, Alberto
author_role author
author2 Simões, Alberto
author2_role author
dc.contributor.none.fl_str_mv Universidade do Minho
dc.contributor.author.fl_str_mv Almeida, J. J.
Simões, Alberto
dc.subject.por.fl_str_mv Parallel corpora
Bilingual terminology
Web as corpora
Social Sciences
topic Parallel corpora
Bilingual terminology
Web as corpora
Social Sciences
publishDate 2010
dc.date.none.fl_str_mv 2010-05
2010-05-01T00:00:00Z
dc.type.driver.fl_str_mv conference paper
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1822/16442
url http://hdl.handle.net/1822/16442
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv European Language Resources Association (ELRA)
publisher.none.fl_str_mv European Language Resources Association (ELRA)
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833595022767816704