Forum: Modernising the FASTQ Format
0
gravatar for dario.garvan
2.3 years ago by
dario.garvan450
Australia
dario.garvan450 wrote:

The FASTA and FASTQ formats were formed a long time ago when only a small number of sequences were ever available. Now that they are being used as the default format for most sequencing instruments, their limitations become apparent. To determine the number of reads in a FASTQ file, the entire file needs to be read to count the number of newline characters in it. It would be good if these formats kept pace with progress. They could allow a header section like:

# Records: 39810657

It would turn an O(n) operation into an O(1) one.

efficiency forum fastq fasta • 650 views
ADD COMMENTlink modified 2.3 years ago by d-cameron2.0k • written 2.3 years ago by dario.garvan450
1

Curious as to why knowing how many reads are in a file is important (without doing any additional operations)?

ADD REPLYlink written 2.3 years ago by genomax68k

It's useful for calculating the percentage of mapped reads, when also calculating the same number for a BAM file.

ADD REPLYlink written 2.3 years ago by dario.garvan450

I don't get this: all the reads, even the unmapped, are contained in the bam file.

ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum120k

Often, aligners have an option to output the unmapped reads to a separate FASTA and the BAM contains only mapped reads. If the only data publicly available is the set of FASTQ files and the BAM files containing only the mapped reads, then some calculation is required.

ADD REPLYlink written 2.3 years ago by dario.garvan450
2

uBAM's (uBAM & metadata - the death of Fastq? ) are a special representation of fastq where all the original data is included as is (i.e. reads present in input fastq are directly converted to bam format, without loss of any information/alignment). Since BAM files allow for additional fields at the beginning one could potentially include any information you want (i.e. total number of reads etc).

Most aligners (bbmap being a simple example) will output stats you mention without needing any additional calculation after a standard alignment. You can also easily capture unmapped reads in a separate file.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax68k
6
gravatar for Pierre Lindenbaum
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Store you fastq in an unsorted BAM instead of fastq ( https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam ) and put whatever you want in the header...

ADD COMMENTlink written 2.3 years ago by Pierre Lindenbaum120k

Once more aligners support unmapped BAM files as input, this would be useful.

ADD REPLYlink written 2.3 years ago by dario.garvan450
2

no as converting a bam to an interleaved fastq is straightforward and many support fastq on standard input;

bam2fastq your.bam | bwa
ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum120k

Is FastqtoSam the only option to do this?

Brian Bushnell there is streamsam.sh to convert sam/bam to fastq/fasta but nothing to do the reverse, correct?

ADD REPLYlink written 2.3 years ago by genomax68k

streamsam.sh or reformat.sh will work (streamsam is faster but only supports sam/bam input). reformat.sh will go from sam, bam, fasta, fastq, scarf, fqz, or oneline (tab-delimited header and sequence) to sam, bam, fasta, fastq, fqz, oneline, or header (sequence headers only). bam support requires samtools to be in the path, though.

ADD REPLYlink written 2.3 years ago by Brian Bushnell16k

Would it make sense to give clumpify.sh the ability to write uBAM files? A 2-for-1, compression plus bam output?

ADD REPLYlink written 2.3 years ago by genomax68k

Clumpify can write bam files, it just can't read them (of course, I could change that without much problem). Not sure what happens when it writes temp files though; I'll have to look into that. I don't like "uBAM" as it's a lossy format; you lose the trailing " 1:N:ACATGTA" in the read headers. It also takes more space and is both slower to read and write compared to .fq.gz, when streaming.

ADD REPLYlink written 2.3 years ago by Brian Bushnell16k
1

. I don't like "uBAM" as it's a lossy format; you lose the trailing " 1:N:ACATGTA".

  • the '1' and the 'N' are saved in the sam flag
  • the sequence index would be saved in the read-group.
ADD REPLYlink written 2.3 years ago by Pierre Lindenbaum120k

Oh so bam writes are already there in clumpify? Since you save the entire header in BBMap alignments can you do the same with clumpify. If that is not uBAM spec compliant then so be it.

To recover the original fastq files you could ask people to use streamsam.sh and then use the points noted by @Pierre below to reconstruct normal illumina headers (if you made the uBAM spec compliant).

BTW: @Heng Li is also against using unaligned BAM's for the same reasons (overhead) in the thread I had linked above.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax68k
3
gravatar for d-cameron
2.3 years ago by
d-cameron2.0k
Australia
d-cameron2.0k wrote:

It sounds like you are advocating something like storing sequencing data in HDF5 https://www.hdfgroup.org/genomics-2/

As with any "standard", moving to different ones take a lot of time and is generally only done when there is a compelling reason for it. Your particular example of:

# Records: 39810657

is fine if people only ever wanted to count reads, but it comes at a cost. In this particular case, you've just massively increased the cost and complexity of filtering your FASTQ+ file as you need to keep the header in sync with the rest of the file. You have also introduced a whole class of data inconsistency errors that weren't there before. You can't use sed/awk/grep anymore as that breaks the header. Your change also means you can't pipe the file between programs: if you are streaming the file through a pipe you have to write the header first which means you have to know how many records you have which means to you have to cache the entire file which defeats the purpose of streaming it in the first place.

Even if that was part of a FASTQ+ specifications, I wouldn't trust that every file in the pipeline actually respected the specifications so your operation which was in theory O(1) would be turn out to be O(n) in practice anyway.

The SAM/VCF specifications are good examples of the mess and proliferation of edge cases that can be inadvertently introduced by adding additional structure to a simple format. They look fine on the surface, but as soon as you start trying to write generic software you find that there's a whole lot more edge cases that need to be handled than with FASTA/Q.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by d-cameron2.0k

For example, the following are valid VCFv4.1 constructs that I'm pretty sure your favourite variant calling pipeline doesn't correctly handle:

  • U+0000 (aka \0, or C string null terminator) as part of the input file
  • Mixed CR (unix-style) and CR+LR (windows-style) line terminators (or even RS, CR, 0x9B, LF+CR)
  • A[<DEL>[, or even C[<=>:-0[ as ALT alleles (yes, your ALT alleles can contain a smiley face)
  • numerical values such as "123.456.789,00" (aka 123,456,789.00 in an EN-US locale)
  • percent encoded newlines INFO strings
  • newline character in a quote-escaped section of header line

And there's a whole lot more - https://github.com/samtools/hts-specs/issues

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by d-cameron2.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1853 users visited in the last hour