Question

proper way to alter bgzf-compressed fastq files using htslib, multithreaded

0

Entering edit mode

19 months ago

andyp • 0

If I were doing this to an uncompressed fastq file, I'd have a pretty easy task: read in fastq records, one by one, modify the header slightly, output them one by one to a new file.

These fastq files aren't sorted so the output's records need not be in the same order.

I'm wondering if anyone has an example of this sort of use-case with htslib. If not, I'm also just wondering on a basic level what strategy I should employ to properly read and write my compressed fastq files as fast as possible:

1) Use htslib's thread pool which is designed to compress/decompress bgzf blocks... which I think are independent of fastq record boundaries. This means I'd probably need the whole file in memory before looping through records. I think.

2) Use my own thread routines and leverage the .fai and .gzi files for random access into the compressed fastq, each thread assigned a more/less equal sized slice of the file: decompressing, reading, transforming, then waiting for a mutex on a write thread to unlock and then writing out uncompressed data to a file, which I compress with bgzip later.

Any advice on 1 vs 2? Anyone have a htslib "cookbook" somewhere with a bunch of different recipes? I'd be thrilled if that exists. I'm having a really tough time trying to figure out how to use htslib. My C/C++ isn't nearly as strong as the authors of that library.

Thanks.

bgfz fastq htslib • 1.4k views

ADD COMMENT • link 19 months ago by andyp • 0

0

Entering edit mode

sorry if off topic but what is the purpose of using bgzip fastq? do you utilize the random access aspect of bgzip?

ADD REPLY • link 19 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

Speed. It's much faster than regular gzip. And if compiled with libdeflate, even better. I'm looking for anything to help me code with the htslib library. Examples of apps leveraging it, etc.

The code itself is extremely dense. I can tell there's sort of a hierarchy of high-level functions and structs declared in hts.h and that's what I wish to stick to as much as possible.

The htslib docs (htslib.org) really seem to just show you how to use samtools or bcftools.

ADD REPLY • link 19 months ago by andyp • 0

0

Entering edit mode

is that really true ? in what circumstance is it faster? would be curious to see benchmark. sorry i'm not being helpful

ADD REPLY • link 19 months ago by cmdcolin ★ 3.8k

0

Entering edit mode

In the circumstance that you have a multithreaded CPU and the application speed is bound to the compression/decompression speed. Which will be often, certainly in situations with a mild transformation being applied. I don't know what's so controversial about this; bgzip has been around for a decade. Here, I got a 3.4 GB test.fastq:

$ time bgzip -@ 8 test.fastq # (using 8 threads)
bgzip -@ 8 test.fastq  68.27s user 3.17s system 818% cpu 8.727 total
# 513 MB test.fastq.gz
$ gunzip test.fastq.gz # (unzip)
$ time gzip test.fastq # (default gzip)
gzip test.fastq  71.36s user 0.81s system 99% cpu 1:12.45 total
# 498 MB test.fastq.gz
$ gunzip test.fastq.gz # (unzip)
$ time gzip --fast test.fastq # ("fast" gzip, worst compression)
gzip --fast test.fastq  14.21s user 0.68s system 97% cpu 15.299 total
# 646 MB test.fastq.gz

ADD REPLY • link 19 months ago by andyp • 0

score 0 · Answer 1 · 2022-09-22

0

Entering edit mode

19 months ago

Pierre Lindenbaum 161k

this code is a copy+paste of a program I wrote. It just add "HELLO WORLD" to the headers. I don't know if it's possible to add multithreading for fastq read/write.

ADD COMMENT • link 19 months ago by Pierre Lindenbaum 161k

score 0 · Answer 2 · 2022-09-22

Thanks but actually that is really similar to my starting point. I have currently have a program that leverages zlib+kseq already, and my goal is to improve it with multithreading. That solution reads and writes a compressed fastq file but it won't produce a bgzf-compatible one. My input is bgzf fastq, and I've been able to loop through a fastq file like:

#include "htslib/bgzf.h"
#include "htslib/hts.h"

#define BUFSZ 32768

int main(int argc, char** argv) {
  BGZF *fp;
  ssize_t bg_got;
  unsigned char bg_buf[BUFSZ];

  fp = bgzf_open("some.fq.gz", "r");
  nthreads = 4;
  bgzf_mt(fp, nthreads, 64);
  do {
        bg_got = bgzf_read(fp, bg_buf, BUFSZ);
  } while (bg_got > 0);
  bgzf_close(fp);
  return 0;
}

So this loops over the fastq blocks in 4 threads in 4 secs on a 500 MB fastq.gz while that'd be 15 seconds if nthreads = 1. So the htslib threadpool machinery is doing its job quite well. I'm just stuck now sorta like "ok now what?" I'm just reading into a buffer I'm not using, and I'm not sure how to effectively use kseq with that bg_buf buffer.