Supporting data for "BigSeqKit: a parallel Big Data toolkit to process FASTA and FASTQ files at scale"

Piñeiro, César; Pichel, Juan Carlos

doi:10.5524/102409

Publisher: GigaScience Database

Year of publication: 2023

Type: Dataset

CC0 1.0

DOI: 10.5524/102409 (open access via publisher)

Abstract

High-throughput sequencing technologies have led to an unprecedented explosion in the amount of sequencing data available, which is typically stored in FASTA and FASTQ files. The literature offers several tools to process and manipulate these types of files with the aim of transforming sequence data into biological knowledge. However, none of them is well suited to efficiently processing very large files, potentially on the order of terabytes, since they are based on sequential processing. Only some routines of the well-known seqkit tool are partly parallelized, and even then its scalability is limited to a few threads on a single computing node.

Our approach, BigSeqKit, takes advantage of an HPC-Big Data framework to parallelize and optimize the commands included in seqkit with the aim of speeding up the manipulation of FASTA/FASTQ files. As a result, in most cases it is tens to hundreds of times faster than several state-of-the-art tools. At the same time, our toolkit is easy to install and use on any kind of hardware platform (local server or cluster), and its routines can be used as a bioinformatics library or from the command line.

BigSeqKit is a comprehensive, ultra-fast toolkit to process and manipulate large FASTA and FASTQ files. It is publicly available at: https://github.com/citiususc/BigSeqKit.