Automatic parallel corpora and bilingual terminology extraction from parallel WebSites

Almeida, J. J.; Simões, Alberto

Bibliographic Details
Main Author:	Almeida, J. J.
Publication Date:	2010
Other Authors:	Simões, Alberto
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	http://hdl.handle.net/1822/16442
Summary:	Parallel corpora are now so widely used that their importance needs no special introduction. Unfortunately, publicly available parallel corpora are somewhat limited in range. There are large corpora on politics or legislation, on medicine and other specific areas, but corpora for other areas are missing. There is currently substantial investment in using the Web as a corpus. This article presents GWB, a tool that aims at the automatic construction of parallel corpora from the Web. We argue that it is possible to build high-quality terminological corpora automatically, just by specifying a sensible Internet domain and using an appropriate set of seed keywords. GWB is a web spider that works in conjunction with a set of other open-source tools, defining a pipeline that includes document retrieval from the Web, sentence-level alignment and analysis of its quality, extraction of bilingual dictionaries and terminology, and construction of off-line dictionaries.

Item metadata

id	RCAP_ce8bf8bd87ed654ae589b92045bef9fb
oai_identifier_str	oai:repositorium.sdum.uminho.pt:1822/16442
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv	Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
title	Automatic parallel corpora and bilingual terminology extraction from parallel WebSites
spellingShingle	Automatic parallel corpora and bilingual terminology extraction from parallel WebSites Almeida, J. J. Parallel corpora Bilingual terminology Web as corpora Social Sciences
author	Almeida, J. J.
author_facet	Almeida, J. J. Simões, Alberto
author_role	author
author2	Simões, Alberto
author2_role	author
dc.contributor.none.fl_str_mv	Universidade do Minho
dc.contributor.author.fl_str_mv	Almeida, J. J. Simões, Alberto
dc.subject.por.fl_str_mv	Parallel corpora Bilingual terminology Web as corpora Social Sciences
topic	Parallel corpora Bilingual terminology Web as corpora Social Sciences
publishDate	2010
dc.date.none.fl_str_mv	2010-05 2010-05-01T00:00:00Z
dc.type.driver.fl_str_mv	conference paper
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/1822/16442
url	http://hdl.handle.net/1822/16442
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	European Language Resources Association (ELRA)
publisher.none.fl_str_mv	European Language Resources Association (ELRA)
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833595022767816704

Similar Items