Impact of sorting in DNA sequence compression

Fonseca, Tiago Rafael Soares da

Bibliographic Details
Main Author:	Fonseca, Tiago Rafael Soares da
Publication Date:	2023
Format:	Master thesis
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	http://hdl.handle.net/10773/41955
Summary:	The growth in genomic data production has created a growing need for efficient methods to store and analyze such data. Because genomic data is highly redundant and demands large amounts of storage, data compression is seen as an essential strategy for dealing with this problem. Many tools are available, ranging from general-purpose compressors to those designed specifically for this type of data. There is still room, however, to test new tools that can make the process more efficient. One such tool is presented here: FASTA ANALYSIS, which sorts FASTA sequences (a format for representing genomic sequences) according to different criteria, taking advantage of placing similar blocks closer together. The benchmark aimed to compare the impact of sorting on two distinct file types. On the one hand, nearly synthetic sequences were generated with the AlcoR toolkit, which builds sequences from specific input data. On the other hand, groups of specific genomes and collections of genomes organized by type (bacterial, viral, etc.) were used. The results show gains that vary with the file being tested and are more pronounced when files are sorted by the absolute number of bases. Furthermore, among the file types tested, those that combine different types of datasets showed the highest percentage gains. FASTA ANALYSIS improves the compression efficiency of FASTA sequences by sorting the sequences in Multi-FASTA files according to different criteria. It is distributed under GPLv3 and is available for free download at https://github.com/tiagof1993/FASTA-ANALYSIS.
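
The idea in the summary, reordering the records of a Multi-FASTA file before compressing it, might be sketched as follows. This is a minimal illustration and not the FASTA ANALYSIS implementation: the function names are hypothetical, "absolute number of bases" is assumed here to mean the total length of each sequence, and Python's standard `lzma` module stands in for whichever compressors the thesis benchmarked.

```python
import lzma


def parse_fasta(text):
    """Yield (header, sequence) pairs from Multi-FASTA text.

    Sequence lines belonging to one record are joined into a single string.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records):
    """Order records by sequence length (assumed sorting criterion)."""
    return sorted(records, key=lambda rec: len(rec[1]))


def to_fasta(records):
    """Serialize records back to Multi-FASTA, one line per sequence."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


def compressed_size(text):
    """Size in bytes after general-purpose LZMA compression (a stand-in)."""
    return len(lzma.compress(text.encode("utf-8")))
```

To compare a file before and after sorting, one could call `compressed_size(to_fasta(records))` and `compressed_size(to_fasta(sort_by_length(records)))` on the same list of records. Whether the sorted version is smaller depends on the data, as the summary itself notes. Sorting only changes the order of records, so the original set of sequences is preserved.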

Item metadata

id	RCAP_abca8fb8e447c285ad956c4077d00bf0
oai_identifier_str	oai:ria.ua.pt:10773/41955
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
dc.title.none.fl_str_mv	Impact of sorting in DNA sequence compression
title	Impact of sorting in DNA sequence compression
spellingShingle	Impact of sorting in DNA sequence compression Fonseca, Tiago Rafael Soares da Lossless data compression DNA sequences Genome Sorting Efficiency
title_short	Impact of sorting in DNA sequence compression
title_full	Impact of sorting in DNA sequence compression
title_fullStr	Impact of sorting in DNA sequence compression
title_full_unstemmed	Impact of sorting in DNA sequence compression
title_sort	Impact of sorting in DNA sequence compression
author	Fonseca, Tiago Rafael Soares da
author_facet	Fonseca, Tiago Rafael Soares da
author_role	author
dc.contributor.author.fl_str_mv	Fonseca, Tiago Rafael Soares da
dc.subject.por.fl_str_mv	Lossless data compression; DNA sequences; Genome; Sorting; Efficiency
topic	Lossless data compression; DNA sequences; Genome; Sorting; Efficiency
publishDate	2023
dc.date.none.fl_str_mv	2023-12-18T00:00:00Z 2023-12-18 2024-05-28T13:39:22Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10773/41955
url	http://hdl.handle.net/10773/41955
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833597060736090112

Similar Items