Faster reading sequences from a fastq file
5
0
Entering edit mode
6.3 years ago
brs1111 ▴ 10

I have a fastq file of size 450gb (162,266,303 sequences) and I would like to parse the sequences. I am using Heng Li's C program (https://github.com/lh3/readfq). I am wondering if it is the fastest one as it takes more than an hour to just count the number of sequences using the following program.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(gzFile, gzread)

int main(int argc, char **argv)
{
        gzFile fp;
        kseq_t *seq;
        int n = 0, slen = 0, qlen = 0;
        fp = gzopen(argv[1], "r");
        seq = kseq_init(fp);
        while (kseq_read(seq) >= 0){
                ++n;
               /* DO SOME PROCESSING */
        }
        printf("Total n. of sequences: %d\n", n);
        kseq_destroy(seq);
        gzclose(fp);
        return 0;
}
fastq c • 4.9k views
ADD COMMENT
4
Entering edit mode
6.3 years ago

That's a pretty big file. Perhaps profile the code and see where things are slowest. You might improve the speed of zlib decompression with https://github.com/ebiggers/libdeflate or similar. Or look at recompressing with bzip2 or other blocked compression schemes and using a parallel decompression routine like pbzip2. Likewise, you might look at how kseq_read() is reading in and parsing buffers of data, to see if any buffer sizes or other read tweaks can be made to improve I/O. If you have an SSD, you might use that for storage to enable parallel reads, using an asynchronous library to read and parse multiple streams.

ADD COMMENT
0
Entering edit mode

Thanks Alex for the suggestions.

ADD REPLY
2
Entering edit mode
6.3 years ago
GenoMax 141k

You could look to see if you can use pigz (https://zlib.net/pigz/ ) instead of gzip.

BTW: 450gb sounds rather large for 162 Million sequences. Is that the uncompressed size?

ADD COMMENT
0
Entering edit mode

Yes, it is not compressed.

ADD REPLY
0
Entering edit mode

If the file is uncompressed then perhaps only @kloetzl's answer applies in your case. Is this non-illumina data by any chance?

ADD REPLY
0
Entering edit mode

Just curious, but why are you using gzopen() if the input is not compressed?

ADD REPLY
1
Entering edit mode
6.3 years ago
Dan D 7.4k

fastp might be worth a look.

ADD COMMENT
1
Entering edit mode
6.3 years ago
chen ★ 2.5k

You can try the code of fastp, which provides ultra-fast FASTQ file preprocessing. You may look into the src/fastqreader.cpp which provides a almost-standalone C++ class to read FASTQ file, and can be reused by anyone else.

ADD COMMENT
0
Entering edit mode

Thanks Chen !! Let me try fastp

ADD REPLY
0
Entering edit mode

Hi Chen, I am not able to figure out how to use it. Could you please provide a simple main() ?

ADD REPLY
1
Entering edit mode
6.2 years ago
kloetzl ★ 1.1k

If your data is not gzipped, you can avoid going through zlib by just KSEQ_INIT(int, read). That should reduce latency. Make sure to have a buffer in kseq of atleast 16384 bytes. Try increasing it, to avoid syscall overheads (which became more pronounced by Meltdown mitigations). The maximum throughput I could measure for kseq is 847MiB/s. You can get a bit faster ~1GiB/s if you start using intrinsics, for special instructions (pcmpistri). Note sure if using a mmapped approach would give you any advantage here.

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion. I modified the code and it works good now. It takes little more than an hour just to read and count the number sequences. But when I calculate hash for all k-mers of all sequences , it takes more than 7 hours. I understand that there are additional operations for calculating the hash, but the programs like KmerStream(https://github.com/pmelsted/KmerStream) and ntCard(https://github.com/bcgsc/ntCard) does that in less than 2 hours. Here is the sample code I am using.

#include <zlib.h>
#include <stdio.h>
#include "kseq.h"
KSEQ_INIT(int, read)

int main(int argc, char **argv)
{
        FILE* fp;
        kseq_t *seq;
        int n = 0, slen = 0, qlen = 0;
        fp = fopen(argv[1], "r");
        int k = atoi(argv[2]); // K-mer length
        seq = kseq_init(fileno(fp));
        while (kseq_read(seq) >= 0){
           for(int i=0; i<strlen(seq->seq.s)- k+1; i++)
             uint64_t hash = HASH(seq->seq.s+i, k, seed);   // HASH is either murmerHash (used in KmerStream) or ntHash (used in ntCard)
        }
        kseq_destroy(seq);
        fclose(fp);
        return 0;
}
ADD REPLY
0
Entering edit mode

Those tools are using a rolling hash. I.e. they use the previous hash value to quickly compute the next one.

ADD REPLY
0
Entering edit mode

Yes, I saw that on ntCard paper. In fact, I used the sample code given in ntHash

ADD REPLY
0
Entering edit mode

Maybe you forgot some compiler flags? -O3 -march=native. Sorry, can't help much more than that, remotely.

ADD REPLY
0
Entering edit mode

Thanks for your help !! Let me try these compiler flags. I will try to figure it out.

ADD REPLY

Login before adding your answer.

Traffic: 1893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6