Export Ready — 

Web scraping technologies in an API world

Bibliographic Details
Main Author: Glez-Peña, Daniel
Publication Date: 2014
Other Authors: Lourenço, Anália, López-Fernández, Hugo, Reboiro-Jato, Miguel, Fdez-Riverola, Florentino
Format: Article
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/1822/32460
Summary: Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover for all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as means to cope with gene set enrichment analysis.
id RCAP_97c0a022f81aafc5dfc778d0fd2e7ff7
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/32460
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Web scraping technologies in an API worldWeb scrapingData integrationInteroperabilityDatabase interfacesScience & TechnologyWeb services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover for all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as means to cope with gene set enrichment analysis.This work was partially funded by (i) the [TIN2009-14057-C03-02] project from the Spanish Ministry of Science and Innovation, the Plan E from the Spanish Government and the European Union from the European Regional Development Fund (ERDF), (ii) the Portugal-Spain cooperation action sponsored by the Foundation of Portuguese Universities [E 48/11] and the Spanish Ministry of Science and Innovation [AIB2010PT-00353] and (iii) the Agrupamento INBIOMED [2012/273] from the DXPCTSUG (Direccion Xeral de Promocion Cientifica e Tecnoloxica do Sistema Universitario de Galicia) from the Galician Government and the European Union from the ERDF unha maneira de facer Europa. H. L. F. was supported by a pre-doctoral fellowship from the University of Vigo.Oxford University PressUniversidade do MinhoGlez-Peña, DanielLourenço, AnáliaLópez-Fernández, HugoReboiro-Jato, MiguelFdez-Riverola, Florentino20142014-01-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articleapplication/pdfhttp://hdl.handle.net/1822/32460eng1477-40541467-546310.1093/bib/bbt02623632294info:eu-repo/semantics/openAccessreponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologiainstacron:RCAAP2024-05-11T07:12:27Zoai:repositorium.sdum.uminho.pt:1822/32460Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireinfo@rcaap.ptopendoar:https://opendoar.ac.uk/repository/71602025-05-28T16:19:13.480303Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologiafalse
dc.title.none.fl_str_mv Web scraping technologies in an API world
title Web scraping technologies in an API world
spellingShingle Web scraping technologies in an API world
Glez-Peña, Daniel
Web scraping
Data integration
Interoperability
Database interfaces
Science & Technology
title_short Web scraping technologies in an API world
title_full Web scraping technologies in an API world
title_fullStr Web scraping technologies in an API world
title_full_unstemmed Web scraping technologies in an API world
title_sort Web scraping technologies in an API world
author Glez-Peña, Daniel
author_facet Glez-Peña, Daniel
Lourenço, Anália
López-Fernández, Hugo
Reboiro-Jato, Miguel
Fdez-Riverola, Florentino
author_role author
author2 Lourenço, Anália
López-Fernández, Hugo
Reboiro-Jato, Miguel
Fdez-Riverola, Florentino
author2_role author
author
author
author
dc.contributor.none.fl_str_mv Universidade do Minho
dc.contributor.author.fl_str_mv Glez-Peña, Daniel
Lourenço, Anália
López-Fernández, Hugo
Reboiro-Jato, Miguel
Fdez-Riverola, Florentino
dc.subject.por.fl_str_mv Web scraping
Data integration
Interoperability
Database interfaces
Science & Technology
topic Web scraping
Data integration
Interoperability
Database interfaces
Science & Technology
description Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover for all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as means to cope with gene set enrichment analysis.
publishDate 2014
dc.date.none.fl_str_mv 2014
2014-01-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1822/32460
url http://hdl.handle.net/1822/32460
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 1477-4054
1467-5463
10.1093/bib/bbt026
23632294
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Oxford University Press
publisher.none.fl_str_mv Oxford University Press
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833595872306266112