Impact of sorting in DNA sequence compression

Bibliographic Details
Main Author: Fonseca, Tiago Rafael Soares da
Publication Date: 2023
Format: Master thesis
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10773/41955
Summary: The growth in genomic data production has created a growing need for efficient methods to store and analyze this type of data. Because genomic data is highly redundant and requires large amounts of storage, data compression is seen as an essential strategy for dealing with this problem. Various tools are available, ranging from general-purpose compressors to those designed specifically for this type of data. There is, however, still room to test new tools that can make this process more efficient. One such tool, presented here, is FASTA ANALYSIS, which sorts FASTA sequences (a representation format for genomic sequences) according to different criteria, taking advantage of placing similar blocks closer together. The benchmark compared the impact of sorting on two distinct file types. On one hand, nearly synthetic sequences were generated with the AlcoR toolkit, which builds sequences from specific input data. On the other hand, groups of specific genomes and collections of genomes organized by type (bacterial, viral, etc.) were used. The results show variable gains depending on the file tested, but the gains are more pronounced when the files are sorted by the absolute number of bases. Furthermore, among the file types tested, those that combine different types of datasets showed the highest percentage gains. FASTA ANALYSIS is a tool that improves the compression efficiency of FASTA sequences by sorting the sequences in Multi-FASTA files according to different criteria. FASTA ANALYSIS is distributed under GPLv3 and is available for free download at https://github.com/tiagof1993/FASTA-ANALYSIS.
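The sorting idea described in the summary can be illustrated with a minimal sketch. This is not the FASTA ANALYSIS implementation: the function names are hypothetical, "number of bases" is read here as plain sequence length (an assumption), and zlib stands in for whichever compressors the thesis actually benchmarked. Whether sorting helps depends on the input, so the sketch only makes the two sizes comparable and does not predict a gain.

```python
import zlib


def parse_fasta(text):
    """Split Multi-FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sort_by_length(records):
    """Order records by sequence length (assumed reading of 'number of bases')."""
    return sorted(records, key=lambda rec: len(rec[1]))


def to_fasta(records):
    """Serialize records back to Multi-FASTA, one sequence line per record."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


def compressed_size(text):
    """Size in bytes after zlib compression (a stand-in general-purpose compressor)."""
    return len(zlib.compress(text.encode("utf-8"), 9))
```

A comparison would then call `compressed_size(to_fasta(records))` and `compressed_size(to_fasta(sort_by_length(records)))` on the same input and report the difference. Sorting changes record order only, so the original order must be stored separately if it matters downstream.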
id RCAP_abca8fb8e447c285ad956c4077d00bf0
oai_identifier_str oai:ria.ua.pt:10773/41955
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv Impact of sorting in DNA sequence compression
title Impact of sorting in DNA sequence compression
spellingShingle Impact of sorting in DNA sequence compression
Fonseca, Tiago Rafael Soares da
Lossless data compression
DNA sequences
Genome
Sorting
Efficiency
title_short Impact of sorting in DNA sequence compression
title_full Impact of sorting in DNA sequence compression
title_fullStr Impact of sorting in DNA sequence compression
title_full_unstemmed Impact of sorting in DNA sequence compression
title_sort Impact of sorting in DNA sequence compression
author Fonseca, Tiago Rafael Soares da
author_facet Fonseca, Tiago Rafael Soares da
author_role author
dc.contributor.author.fl_str_mv Fonseca, Tiago Rafael Soares da
dc.subject.por.fl_str_mv Lossless data compression
DNA sequences
Genome
Sorting
Efficiency
topic Lossless data compression
DNA sequences
Genome
Sorting
Efficiency
publishDate 2023
dc.date.none.fl_str_mv 2023-12-18T00:00:00Z
2023-12-18
2024-05-28T13:39:22Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10773/41955
url http://hdl.handle.net/10773/41955
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833597060736090112