Web scraping technologies in an API world
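| Main Author: | Glez-Peña, Daniel |
|---|---|
| Publication Date: | 2014 |
| Other Authors: | Lourenço, Anália; López-Fernández, Hugo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino |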
| Format: | Article |
| Language: | eng |
| Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Download full: | http://hdl.handle.net/1822/32460 |
| Summary: | Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis. |
| id | RCAP_97c0a022f81aafc5dfc778d0fd2e7ff7 |
|---|---|
| oai_identifier_str | oai:repositorium.sdum.uminho.pt:1822/32460 |
| network_acronym_str | RCAP |
| network_name_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str | https://opendoar.ac.uk/repository/7160 |
| spelling | This work was partially funded by (i) the [TIN2009-14057-C03-02] project from the Spanish Ministry of Science and Innovation, the Plan E from the Spanish Government and the European Union from the European Regional Development Fund (ERDF), (ii) the Portugal-Spain cooperation action sponsored by the Foundation of Portuguese Universities [E 48/11] and the Spanish Ministry of Science and Innovation [AIB2010PT-00353] and (iii) the Agrupamento INBIOMED [2012/273] from the DXPCTSUG (Direccion Xeral de Promocion Cientifica e Tecnoloxica do Sistema Universitario de Galicia) from the Galician Government and the European Union from the ERDF ("a way of making Europe"). H. L. F. was supported by a pre-doctoral fellowship from the University of Vigo. |
| dc.title.none.fl_str_mv | Web scraping technologies in an API world |
| author | Glez-Peña, Daniel |
| author_facet | Glez-Peña, Daniel; Lourenço, Anália; López-Fernández, Hugo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino |
| author_role | author |
| author2 | Lourenço, Anália; López-Fernández, Hugo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino |
| author2_role | author; author; author; author |
| dc.contributor.none.fl_str_mv | Universidade do Minho |
| dc.contributor.author.fl_str_mv | Glez-Peña, Daniel; Lourenço, Anália; López-Fernández, Hugo; Reboiro-Jato, Miguel; Fdez-Riverola, Florentino |
| dc.subject.por.fl_str_mv | Web scraping; Data integration; Interoperability; Database interfaces; Science & Technology |
| topic | Web scraping; Data integration; Interoperability; Database interfaces; Science & Technology |
| publishDate | 2014 |
| dc.date.none.fl_str_mv | 2014; 2014-01-01T00:00:00Z |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
| format | article |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | http://hdl.handle.net/1822/32460 |
| url | http://hdl.handle.net/1822/32460 |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | 1477-4054; 1467-5463; 10.1093/bib/bbt026; 23632294 |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.publisher.none.fl_str_mv | Oxford University Press |
| publisher.none.fl_str_mv | Oxford University Press |
| dc.source.none.fl_str_mv | reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
| instname_str | FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str | RCAAP |
| institution | RCAAP |
| reponame_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv | info@rcaap.pt |
| _version_ | 1833595872306266112 |