Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
Main Author: | Almeida, J. J. |
---|---|
Publication Date: | 2010 |
Other Authors: | Simões, Alberto |
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | http://hdl.handle.net/1822/16442 |
Summary: | The importance of parallel corpora is now so well established that it needs no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are large corpora on politics, legislation, medicine, and other specific areas, but corpora for many other areas are lacking. There is currently a huge investment in using the Web as a corpus. This article presents GWB, a tool that aims at the automatic construction of parallel corpora from the Web. We argue that it is possible to build high-quality terminological corpora automatically, just by specifying a sensible Internet domain and an appropriate set of seed keywords. GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the Web, sentence-level alignment and analysis of its quality, extraction of bilingual dictionaries and terminology, and construction of offline dictionaries. |
id |
RCAP_ce8bf8bd87ed654ae589b92045bef9fb |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/16442 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
dc.title.none.fl_str_mv |
Automatic parallel corpora and bilingual terminology extraction from parallel WebSites |
author |
Almeida, J. J. |
author_facet |
Almeida, J. J.; Simões, Alberto |
author_role |
author |
author2 |
Simões, Alberto |
author2_role |
author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.contributor.author.fl_str_mv |
Almeida, J. J.; Simões, Alberto |
dc.subject.por.fl_str_mv |
Parallel corpora; Bilingual terminology; Web as corpora; Social Sciences |
publishDate |
2010 |
dc.date.none.fl_str_mv |
2010-05 2010-05-01T00:00:00Z |
dc.type.driver.fl_str_mv |
conference paper |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1822/16442 |
dc.language.iso.fl_str_mv |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
European Language Resources Association (ELRA) |
dc.source.none.fl_str_mv |
reponame: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname: FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron: RCAAP |
instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |
_version_ |
1833595022767816704 |