It sounds like you are advocating something like storing sequencing data in HDF5 https://www.hdfgroup.org/genomics-2/
As with any "standard", moving to a different one takes a lot of time and is generally only done when there is a compelling reason for it. Your particular example of:
# Records: 39810657
is fine if people only ever want to count reads, but it comes at a cost. In this particular case, you've massively increased the cost and complexity of filtering your FASTQ+ file, since you now need to keep the header in sync with the rest of the file. You've also introduced a whole class of data-inconsistency errors that weren't there before. You can't use sed/awk/grep anymore, as that breaks the header. Your change also means you can't pipe the file between programs: if you're streaming the file through a pipe, you have to write the header first, which means you have to know how many records you have, which means you have to cache the entire file, which defeats the purpose of streaming it in the first place.
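To make the streaming problem concrete, here is a minimal sketch (Python; the `# Records:` header and the `write_fastq_plus` helper are hypothetical, carried over from the example above). Because the count comes first, the writer must buffer every record before it can emit a single byte:

```python
import io

def write_fastq_plus(records, out):
    """Write a hypothetical FASTQ+ file whose header states the record count.

    Because the count goes first, we cannot stream: every record must be
    buffered in memory (or the input read twice) before the header is known.
    """
    buffered = list(records)  # O(n) memory: this is what defeats streaming
    out.write(f"# Records: {len(buffered)}\n")
    for name, seq, qual in buffered:
        out.write(f"@{name}\n{seq}\n+\n{qual}\n")

buf = io.StringIO()
write_fastq_plus([("r1", "ACGT", "IIII"), ("r2", "TTGA", "FFFF")], buf)
print(buf.getvalue().splitlines()[0])  # prints "# Records: 2"
```

A plain FASTQ writer, by contrast, can emit each record as it arrives and pass it straight down a pipe.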
Even if that were part of the FASTQ+ specification, I wouldn't trust that every file in the pipeline actually respected it, so your operation that was O(1) in theory would turn out to be O(n) in practice anyway.
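A sketch of why the header can't be trusted in practice: a defensive consumer still has to count the records itself, so the supposedly O(1) lookup degrades to an O(n) scan. (The `# Records:` header and strict four-lines-per-record layout are assumptions carried over from the example above.)

```python
def count_records(lines):
    """Return (claimed, actual) record counts for a hypothetical FASTQ+ stream.

    Any upstream tool that filtered the file with grep/sed/awk may have left
    the header stale, so a careful consumer must re-count every record (O(n))
    instead of trusting the O(1) header lookup.
    """
    lines = iter(lines)
    header = next(lines)
    claimed = int(header.split(":")[1])
    actual = sum(1 for _ in lines) // 4  # assumes 4 lines per record
    return claimed, actual

# A file that was filtered without the header being updated:
stale = ["# Records: 39810657", "@r1", "ACGT", "+", "IIII"]
claimed, actual = count_records(stale)
print(claimed == actual)  # prints False: header is out of sync
```

The verification scan costs exactly what the header was supposed to save, which is the point: the extra structure buys nothing unless every producer in the pipeline is perfectly well behaved.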
The SAM/VCF specifications are good examples of the mess and proliferation of edge cases that can be inadvertently introduced by adding extra structure to a simple format. They look fine on the surface, but as soon as you start trying to write generic software you find there are far more edge cases to handle than with FASTA/Q.
modified 3.6 years ago
3.6 years ago by
d-cameron • 2.2k