Faster reading sequences from a fastq file
5
0
Entering edit mode
4.9 years ago
brs1111 ▴ 10

I have a FASTQ file of 450 GB (162,266,303 sequences) and I would like to parse the sequences. I am using Heng Li's C parser (https://github.com/lh3/readfq). I am wondering if it is the fastest option, as it takes more than an hour just to count the number of sequences using the following program.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"

KSEQ_INIT(gzFile, gzread)   /* instantiate the kseq reader over zlib */

int main(int argc, char **argv)
{
    gzFile fp;
    kseq_t *seq;
    int n = 0;
    fp = gzopen(argv[1], "r");
    seq = kseq_init(fp);
    while (kseq_read(seq) >= 0) {   /* read one record per iteration */
        ++n;
        /* DO SOME PROCESSING */
    }
    printf("Total n. of sequences: %d\n", n);
    kseq_destroy(seq);
    gzclose(fp);
    return 0;
}

fastq c • 3.7k views
4
Entering edit mode
4.9 years ago

That's a pretty big file. Perhaps profile the code and see where things are slowest. You might improve the speed of zlib decompression with https://github.com/ebiggers/libdeflate or similar. Or look at recompressing with bzip2 or other blocked compression schemes and using a parallel decompression routine like pbzip2. Likewise, you might look at how kseq_read() is reading in and parsing buffers of data, to see if any buffer sizes or other read tweaks can be made to improve I/O. If you have an SSD, you might use that for storage to enable parallel reads, using an asynchronous library to read and parse multiple streams.

0
Entering edit mode

Thanks Alex for the suggestions.

2
Entering edit mode
4.9 years ago
GenoMax 123k

You could look to see if you can use pigz (https://zlib.net/pigz/) instead of gzip.

BTW: 450 GB sounds rather large for 162 million sequences. Is that the uncompressed size?

0
Entering edit mode

Yes, it is not compressed.

0
Entering edit mode

If the file is uncompressed, then perhaps only @kloetzl's answer applies in your case. Is this non-Illumina data by any chance?

0
Entering edit mode

Just curious, but why are you using gzopen() if the input is not compressed?

1
Entering edit mode
4.9 years ago
Dan D 7.3k

fastp might be worth a look.

1
Entering edit mode
4.9 years ago
chen ★ 2.4k

You can try the code of fastp, which provides ultra-fast FASTQ file preprocessing. You may look into src/fastqreader.cpp, which provides an almost-standalone C++ class for reading FASTQ files and can be reused by anyone else.

0
Entering edit mode

Thanks Chen !! Let me try fastp

0
Entering edit mode

Hi Chen, I am not able to figure out how to use it. Could you please provide a simple main() ?

1
Entering edit mode
4.9 years ago
kloetzl ★ 1.1k

If your data is not gzipped, you can avoid going through zlib by using KSEQ_INIT(int, read). That should reduce latency. Make sure the buffer in kseq is at least 16384 bytes, and try increasing it to avoid syscall overheads (which became more pronounced with the Meltdown mitigations). The maximum throughput I could measure for kseq is 847 MiB/s. You can get a bit faster, ~1 GiB/s, if you start using intrinsics for special instructions (pcmpistri). Not sure if a mmapped approach would give you any advantage here.

0
Entering edit mode

Thanks for the suggestion. I modified the code and it works well now. It takes a little more than an hour just to read and count the number of sequences. But when I calculate a hash for all k-mers of all sequences, it takes more than 7 hours. I understand that there are additional operations for calculating the hash, but programs like KmerStream (https://github.com/pmelsted/KmerStream) and ntCard (https://github.com/bcgsc/ntCard) do that in less than 2 hours. Here is the sample code I am using.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include "kseq.h"

KSEQ_INIT(int, read)   /* plain read(), no zlib, since the input is uncompressed */

int main(int argc, char **argv)
{
    FILE *fp;
    kseq_t *seq;
    uint64_t seed = 0;          /* hash seed */
    fp = fopen(argv[1], "r");
    int k = atoi(argv[2]);      /* k-mer length */
    seq = kseq_init(fileno(fp));
    while (kseq_read(seq) >= 0) {
        int len = seq->seq.l;   /* use the stored length; calling strlen() in the loop condition rescans the string every iteration */
        for (int i = 0; i + k <= len; i++) {
            uint64_t hash = HASH(seq->seq.s + i, k, seed); /* HASH is either murmurHash (used in KmerStream) or ntHash (used in ntCard) */
            (void)hash;
        }
    }
    kseq_destroy(seq);
    fclose(fp);
    return 0;
}

0
Entering edit mode

Those tools are using a rolling hash, i.e. they use the previous hash value to quickly compute the next one.

0
Entering edit mode

Yes, I saw that in the ntCard paper. In fact, I used the sample code given in ntHash.

0
Entering edit mode

Maybe you forgot some compiler flags? -O3 -march=native. Sorry, I can't help much more than that remotely.

0
Entering edit mode

Thanks for your help !! Let me try these compiler flags. I will try to figure it out.