If I were doing this to an uncompressed fastq file, I'd have a pretty easy task: read in fastq records, one by one, modify the header slightly, output them one by one to a new file.
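For concreteness, the uncompressed version would look something like the minimal C sketch below. It assumes strict four-line records (no wrapped sequence lines). The name `rewrite_fastq_headers` and the choice of appending a suffix are hypothetical stand-ins for my actual header change.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line (newline stripped) into a growable buffer.
 * Returns 1 on success, 0 on EOF with nothing read, -1 on allocation failure. */
static int read_line(FILE *in, char **buf, size_t *cap)
{
    size_t len = 0;
    int c;

    if (*cap == 0) {
        *buf = malloc(256);
        if (!*buf) return -1;
        *cap = 256;
    }
    while ((c = fgetc(in)) != EOF && c != '\n') {
        if (len + 1 >= *cap) {              /* keep room for the terminator */
            char *tmp = realloc(*buf, *cap * 2);
            if (!tmp) return -1;
            *buf = tmp;
            *cap *= 2;
        }
        (*buf)[len++] = (char)c;
    }
    (*buf)[len] = '\0';
    return (c == EOF && len == 0) ? 0 : 1;
}

/* Copy four-line FASTQ records from `in` to `out`, appending `suffix`
 * (a placeholder transformation) to each header line.
 * Returns the number of records written, or -1 on malformed input
 * or allocation failure. */
long rewrite_fastq_headers(FILE *in, FILE *out, const char *suffix)
{
    char *line = NULL;
    size_t cap = 0;
    long n_records = 0;
    int rc;

    assert(in && out && suffix);
    while ((rc = read_line(in, &line, &cap)) == 1) {
        if (line[0] != '@') { rc = -1; break; }   /* header must start with '@' */
        fprintf(out, "%s%s\n", line, suffix);
        /* sequence, '+' separator and quality lines pass through unchanged */
        for (int i = 0; i < 3; i++) {
            if (read_line(in, &line, &cap) != 1) { rc = -1; break; }
            fprintf(out, "%s\n", line);
        }
        if (rc == -1) break;
        n_records++;
    }
    free(line);
    return rc == -1 ? -1 : n_records;
}
```

Calling `rewrite_fastq_headers(stdin, stdout, "/1")` would turn a header such as `@read1` into `@read1/1` and leave the other three lines of each record alone.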
These fastq files aren't sorted, so the output records need not be in the same order as the input.
Does anyone have an example of this sort of use case with htslib? If not, I'd like to know, at a basic level, what strategy I should use to read and write my compressed fastq files as fast as possible:
1) Use htslib's thread pool, which is designed to compress/decompress bgzf blocks. I think those blocks are independent of fastq record boundaries, which means I'd probably need the whole file in memory before looping through records. I think.
2) Use my own thread routines and leverage the .fai and .gzi files for random access into the compressed fastq. Each thread would be assigned a roughly equal-sized slice of the file, which it decompresses, reads, and transforms. It then waits for the mutex guarding the writer to unlock and writes its uncompressed data to a file, which I compress with bgzip afterwards.
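For option 2, the slicing itself seems like the easy part. Here is a small sketch, assuming the record count is known up front (for example by counting the entries in the .fai, one per read). `record_slice` and `slice_for_worker` are names I made up, not anything from htslib. Turning each record range into a seek position in the compressed file is the part I don't know how to do properly.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of record indices handled by one worker. */
typedef struct {
    size_t begin;
    size_t end;
} record_slice;

/* Split n_records into n_workers contiguous slices whose sizes differ by
 * at most one; the first (n_records % n_workers) workers get one extra. */
record_slice slice_for_worker(size_t n_records, size_t n_workers, size_t worker)
{
    assert(n_workers > 0 && worker < n_workers);

    size_t base = n_records / n_workers;
    size_t extra = n_records % n_workers;
    record_slice s;

    s.begin = worker * base + (worker < extra ? worker : extra);
    s.end = s.begin + base + (worker < extra ? 1 : 0);
    return s;
}
```

With 10 records and 4 workers this gives slices of 3, 3, 2 and 2 records. Because the slices are record-aligned, no thread should ever start in the middle of a record.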
Any advice on 1 vs 2? Does anyone have an htslib "cookbook" somewhere with a bunch of different recipes? I'd be thrilled if that exists. I'm having a really tough time figuring out how to use htslib. My C/C++ isn't nearly as strong as that of the library's authors.