Scalable transcriptomics analysis with Dask: applications in data science and machine learning
| Main Author: | Moreno, Marta |
|---|---|
| Publication Date: | 2022 |
| Other Authors: | Vilaça, Ricardo Manuel Pereira; Ferreira, Pedro G. |
| Format: | Article |
| Language: | eng |
| Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Download full: | https://hdl.handle.net/1822/90084 |
Summary: | Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science, and specifically machine learning, has many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science, including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures are made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures. |
| id |
RCAP_2c8ac84ab6d2cd739b27a97a2cc64563 |
|---|---|
| oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/90084 |
| network_acronym_str |
RCAP |
| network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str |
https://opendoar.ac.uk/repository/7160 |
| funding |
This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020, the Portuguese National Network for Advanced Computing to PGF for the Grant CPCA/A2/2640/2020, and the Portuguese Foundation for Science and Technology to MM for the Ph.D. scholarship (reference SFRH/BD/145707/2019). No funding body played any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript. |
| author |
Moreno, Marta |
| author_facet |
Moreno, Marta; Vilaça, Ricardo Manuel Pereira; Ferreira, Pedro G. |
| author_role |
author |
| author2 |
Vilaça, Ricardo Manuel Pereira; Ferreira, Pedro G. |
| author2_role |
author author |
| dc.contributor.none.fl_str_mv |
Universidade do Minho |
| dc.subject.por.fl_str_mv |
Data Science; Python; Dask; Transcriptomics analysis; Machine learning; Scalable data science; Gene expression; Transcriptomics; Data analysis; Medical Sciences::Health Sciences; Science & Technology; Quality health |
| publishDate |
2022 |
| dc.date.none.fl_str_mv |
2022-11-30 2022-11-30T00:00:00Z |
| dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
| format |
article |
| status_str |
publishedVersion |
| dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/1822/90084 |
| url |
https://hdl.handle.net/1822/90084 |
| dc.language.iso.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
Moreno, M., Vilaça, R., & Ferreira, P. G. (2022, November 30). Scalable transcriptomics analysis with Dask: applications in data science and machine learning. BMC Bioinformatics. Springer Science and Business Media LLC. http://doi.org/10.1186/s12859-022-05065-3; ISSN: 1471-2105; DOI: 10.1186/s12859-022-05065-3; PMID: 36451115; https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05065-3 |
| dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
| eu_rights_str_mv |
openAccess |
| dc.format.none.fl_str_mv |
application/pdf |
| dc.publisher.none.fl_str_mv |
BMC |
| dc.source.none.fl_str_mv |
reponame: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname: FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron: RCAAP |
| repository.mail.fl_str_mv |
info@rcaap.pt |