Question

Forum:Modernising the FASTQ Format

0

Entering edit mode

8.4 years ago

dario.garvan ▴ 520

The FASTA and FASTQ formats were formed a long time ago when only a small number of sequences were ever available. Now that they are being used as the default format for most sequencing instruments, their limitations become apparent. To determine the number of reads in a FASTQ file, the entire file needs to be read to count the number of newline characters in it. It would be good if these formats kept pace with progress. They could allow a header section like:

# Records: 39810657

It would turn an O(n) operation into an O(1) one.

FASTA Efficiency FASTQ • 2.5k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 8.4 years ago by dario.garvan ▴ 520

1

Entering edit mode

Curious as to why knowing how many reads are in a file is important (without doing any additional operations)?

ADD REPLY • link 8.4 years ago by GenoMax 152k

0

Entering edit mode

It's useful for calculating the percentage of mapped reads, when also calculating the same number for a BAM file.

ADD REPLY • link 8.4 years ago by dario.garvan ▴ 520

0

Entering edit mode

I don't get this: all the reads, even the unmapped, are contained in the bam file.

ADD REPLY • link 8.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Often, aligners have an option to output the unmapped reads to a separate FASTA and the BAM contains only mapped reads. If the only data publicly available is the set of FASTQ files and the BAM files containing only the mapped reads, then some calculation is required.

ADD REPLY • link 8.4 years ago by dario.garvan ▴ 520

2

Entering edit mode

uBAM's (uBAM & metadata - the death of Fastq? ) are a special representation of fastq where all the original data is included as is (i.e. reads present in input fastq are directly converted to bam format, without loss of any information/alignment). Since BAM files allow for additional fields at the beginning one could potentially include any information you want (i.e. total number of reads etc).

Most aligners (bbmap being a simple example) will output stats you mention without needing any additional calculation after a standard alignment. You can also easily capture unmapped reads in a separate file.

ADD REPLY • link 8.4 years ago by GenoMax 152k

score 6 · Answer 1 · 2017-02-15

6

Entering edit mode

8.4 years ago

Pierre Lindenbaum 166k

Store you fastq in an unsorted BAM instead of fastq ( https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam ) and put whatever you want in the header...

ADD COMMENT • link 8.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Once more aligners support unmapped BAM files as input, this would be useful.

ADD REPLY • link 8.4 years ago by dario.garvan ▴ 520

2

Entering edit mode

no as converting a bam to an interleaved fastq is straightforward and many support fastq on standard input;

bam2fastq your.bam | bwa

ADD REPLY • link 8.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Is FastqtoSam the only option to do this?

Brian Bushnell there is streamsam.sh to convert sam/bam to fastq/fasta but nothing to do the reverse, correct?

ADD REPLY • link 8.4 years ago by GenoMax 152k

0

Entering edit mode

streamsam.sh or reformat.sh will work (streamsam is faster but only supports sam/bam input). reformat.sh will go from sam, bam, fasta, fastq, scarf, fqz, or oneline (tab-delimited header and sequence) to sam, bam, fasta, fastq, fqz, oneline, or header (sequence headers only). bam support requires samtools to be in the path, though.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

Would it make sense to give clumpify.sh the ability to write uBAM files? A 2-for-1, compression plus bam output?

ADD REPLY • link 8.4 years ago by GenoMax 152k

0

Entering edit mode

Clumpify can write bam files, it just can't read them (of course, I could change that without much problem). Not sure what happens when it writes temp files though; I'll have to look into that. I don't like "uBAM" as it's a lossy format; you lose the trailing " 1:N:ACATGTA" in the read headers. It also takes more space and is both slower to read and write compared to .fq.gz, when streaming.

ADD REPLY • link 8.4 years ago by Brian Bushnell 20k

1

Entering edit mode

. I don't like "uBAM" as it's a lossy format; you lose the trailing " 1:N:ACATGTA".

the '1' and the 'N' are saved in the sam flag
the sequence index would be saved in the read-group.

ADD REPLY • link 8.4 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Oh so bam writes are already there in clumpify? Since you save the entire header in BBMap alignments can you do the same with clumpify. If that is not uBAM spec compliant then so be it.

To recover the original fastq files you could ask people to use streamsam.sh and then use the points noted by @Pierre below to reconstruct normal illumina headers (if you made the uBAM spec compliant).

BTW: @Heng Li is also against using unaligned BAM's for the same reasons (overhead) in the thread I had linked above.

ADD REPLY • link 8.4 years ago by GenoMax 152k

score 3 · Answer 2 · 2017-02-15

It sounds like you are advocating something like storing sequencing data in HDF5 https://www.hdfgroup.org/genomics-2/

As with any "standard", moving to different ones take a lot of time and is generally only done when there is a compelling reason for it. Your particular example of:

# Records: 39810657

is fine if people only ever wanted to count reads, but it comes at a cost. In this particular case, you've just massively increased the cost and complexity of filtering your FASTQ+ file as you need to keep the header in sync with the rest of the file. You have also introduced a whole class of data inconsistency errors that weren't there before. You can't use sed/awk/grep anymore as that breaks the header. Your change also means you can't pipe the file between programs: if you are streaming the file through a pipe you have to write the header first which means you have to know how many records you have which means to you have to cache the entire file which defeats the purpose of streaming it in the first place.

Even if that was part of a FASTQ+ specifications, I wouldn't trust that every file in the pipeline actually respected the specifications so your operation which was in theory O(1) would be turn out to be O(n) in practice anyway.

The SAM/VCF specifications are good examples of the mess and proliferation of edge cases that can be inadvertently introduced by adding additional structure to a simple format. They look fine on the surface, but as soon as you start trying to write generic software you find that there's a whole lot more edge cases that need to be handled than with FASTA/Q.