Scalable transcriptomics analysis with Dask: applications in data science and machine learning

Bibliographic Details
Main Author: Moreno, Marta
Publication Date: 2022
Other Authors: Vilaça, Ricardo Manuel Pereira, Ferreira, Pedro G.
Format: Article
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: https://hdl.handle.net/1822/90084
Summary: Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
id RCAP_2c8ac84ab6d2cd739b27a97a2cc64563
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/90084
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
Funding: This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project LA/P/0063/2020, the Portuguese National Network for Advanced Computing to PGF for the Grant CPCA/A2/2640/2020, and the Portuguese Foundation for Science and Technology to MM for the Ph.D. scholarship (reference SFRH/BD/145707/2019). No funding body played any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
dc.contributor.none.fl_str_mv Universidade do Minho
topic Data Science
Python
Dask
Transcriptomics analysis
Machine learning
Scalable data science
Gene expression
Transcriptomics
Data analysis
Ciências Médicas::Ciências da Saúde
Science & Technology
Saúde de qualidade
publishDate 2022
dc.date.none.fl_str_mv 2022-11-30
2022-11-30T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.relation.none.fl_str_mv Moreno, M., Vilaça, R., & Ferreira, P. G. (2022, November 30). Scalable transcriptomics analysis with Dask: applications in data science and machine learning. BMC Bioinformatics. Springer Science and Business Media LLC. http://doi.org/10.1186/s12859-022-05065-3
1471-2105
10.1186/s12859-022-05065-3
36451115
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-05065-3
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv BMC