Impact of sorting in DNA sequence compression
Main Author: | Fonseca, Tiago Rafael Soares da |
---|---|
Publication Date: | 2023 |
Format: | Master thesis |
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | http://hdl.handle.net/10773/41955 |
Summary: | The growing production of genomic data has created a need for efficient methods to store and analyze it. Because genomic data is highly redundant and requires large amounts of storage, data compression is seen as an essential strategy for handling it. Various tools are available, ranging from general-purpose compressors to those designed specifically for this type of data. There is still room, however, to test new tools that can make this process more efficient. One such tool is presented here: FASTA ANALYSIS, which sorts FASTA sequences (FASTA being a representation format for genomic sequences) according to different criteria, exploiting the fact that sorting places similar blocks closer together. The benchmark aimed to compare the impact of sorting on two distinct file types. On the one hand, nearly synthetic sequences were generated with the AlcoR toolkit, which builds sequences from specific input data. On the other hand, groups of specific genomes and collections of genomes organized by type (bacterial, viral, etc.) were used. The results show gains that vary with the file being tested, but they are more pronounced when files are sorted by the absolute number of bases. Furthermore, among the file types tested, those that combine different types of datasets showed the highest percentage gains. FASTA ANALYSIS is a tool that improves the compression efficiency of FASTA sequences by sorting the sequences in Multi-FASTA files according to different criteria. It is distributed under GPLv3 and is available for free download at https://github.com/tiagof1993/FASTA-ANALYSIS. |
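The sorting idea described in the summary can be illustrated with a minimal Python sketch. This is not the FASTA ANALYSIS implementation: the function names are hypothetical, "absolute number of bases" is read here as total sequence length (an assumption), and `lzma` from the standard library stands in for whichever compressors the thesis actually benchmarked.

```python
import lzma


def parse_fasta(text):
    """Split Multi-FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sort_by_base_count(records):
    """Order records by sequence length, shortest first (assumed criterion)."""
    return sorted(records, key=lambda rec: len(rec[1]))


def to_fasta(records, width=60):
    """Serialize records back to Multi-FASTA with fixed-width sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def compressed_size(text):
    """Size in bytes of the LZMA-compressed text (stand-in compressor)."""
    return len(lzma.compress(text.encode("ascii")))
```

To compare the two orderings on a given file, one could compute `compressed_size(to_fasta(records))` and `compressed_size(to_fasta(sort_by_base_count(records)))`. Whether sorting helps, and by how much, depends on the file and the compressor, as the summary notes.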
id |
RCAP_abca8fb8e447c285ad956c4077d00bf0 |
---|---|
oai_identifier_str |
oai:ria.ua.pt:10773/41955 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
dc.title.none.fl_str_mv |
Impact of sorting in DNA sequence compression |
title |
Impact of sorting in DNA sequence compression |
spellingShingle |
Impact of sorting in DNA sequence compression Fonseca, Tiago Rafael Soares da Lossless data compression DNA sequences Genome Sorting Efficiency |
title_short |
Impact of sorting in DNA sequence compression |
title_full |
Impact of sorting in DNA sequence compression |
title_fullStr |
Impact of sorting in DNA sequence compression |
title_full_unstemmed |
Impact of sorting in DNA sequence compression |
title_sort |
Impact of sorting in DNA sequence compression |
author |
Fonseca, Tiago Rafael Soares da |
author_facet |
Fonseca, Tiago Rafael Soares da |
author_role |
author |
dc.contributor.author.fl_str_mv |
Fonseca, Tiago Rafael Soares da |
dc.subject.por.fl_str_mv |
Lossless data compression; DNA sequences; Genome; Sorting; Efficiency |
topic |
Lossless data compression; DNA sequences; Genome; Sorting; Efficiency |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-12-18T00:00:00Z 2023-12-18 2024-05-28T13:39:22Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10773/41955 |
url |
http://hdl.handle.net/10773/41955 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP |
instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |
_version_ |
1833597060736090112 |