Question

Forum:uBAM & metadata - the death of Fastq?

3


8.4 years ago

John 13k

Hello all!

I would very much like to hear people's opinions on the uBAM format (http://gatkforums.broadinstitute.org/discussion/6484/).

The idea of the uBAM format is that instead of storing all your FASTQ metadata in the file name or the FASTQ read header (which is then lost on mapping), you ditch the FASTQ format entirely and store the unmapped reads (paired or not) in a single standard BAM file, and tag your reads with the appropriate metadata.

In essence, this means the metadata moves with the data - and I think this means you can combine reads from different flow cells/lanes/multiplex adaptors into a single BAM file (myExperiment.bam) but still have at the read-level all the metadata about which sequencing machine was used, which flow cell, which lane, which multiplex barcode, which expected fragment length, etc. If that's true, I think that's really great and worth talking about.
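To make the idea concrete, here is a minimal sketch of what such a file looks like in SAM text form, assuming the metadata is carried by read groups: each `@RG` header line describes one flow cell/lane/barcode combination, and each read points at its group with an `RG:Z:` tag. The helper names and all values (`fc1.lane1`, `libA`, `sample1`, the barcodes) are made up for illustration; only the field layout follows the SAM specification.

```python
def rg_header_line(rg_id, **fields):
    """Build one @RG header line; keyword names are two-letter SAM header tags."""
    parts = ["@RG", f"ID:{rg_id}"] + [f"{key}:{value}" for key, value in fields.items()]
    return "\t".join(parts)


def unmapped_sam_record(name, seq, qual, rg_id):
    """Build one unmapped single-end SAM record tagged with its read group."""
    # FLAG 4 = segment unmapped; reference, position and CIGAR columns hold placeholders.
    cols = [name, "4", "*", "0", "0", "*", "*", "0", "0", seq, qual, f"RG:Z:{rg_id}"]
    return "\t".join(cols)


# Hypothetical experiment: the same library sequenced on two flow cells.
header = [
    "@HD\tVN:1.6\tSO:unsorted",
    rg_header_line("fc1.lane1", PL="ILLUMINA", PU="fc1.1.ACGT", LB="libA", SM="sample1"),
    rg_header_line("fc2.lane3", PL="ILLUMINA", PU="fc2.3.TTAG", LB="libA", SM="sample1"),
]
records = [
    unmapped_sam_record("read1", "ACGTN", "IIII#", "fc1.lane1"),
    unmapped_sam_record("read2", "GGCAT", "IIIII", "fc2.lane3"),
]
sam_text = "\n".join(header + records) + "\n"
print(sam_text)
```

Converting this text to BAM (for example with samtools) would give a small uBAM in which both reads sit in one file yet remain traceable to their flow cell and lane.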

Furthermore, I can imagine a scenario in the future where all your reads for a particular experiment are in a single BAM file (all your technical/biological replicates, even different assays such as RNA, ChIP, etc.), and the downstream tools programmatically determine which reads to use for signal, normalization, etc., by reading these BAM metadata tags.

But I must admit I'm not sure how some of this works. The Picard tool FastqToSam (https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) has a number of potential metadata names which it can add to your BAM, but I'm not sure about the following:

Are these standardized names, or are we just using the user-definable X?/Y?/Z? SAM tags? If the latter, is there a consensus for metadata in BAM files somewhere?
I know some tags are definitely read-level, but perhaps not all of them are. Maybe some live in the BAM header alone, in which case merging two files with different PLATFORM values, say, would confuse things?
Obviously the GATK tools can understand this metadata, but does anyone know of any other tools that are metadata-aware?

GATK • 5.0k views

updated 21 months ago by Ram 43k • written 8.4 years ago by John 13k

1


Ion Torrent PGM default output is unmapped BAM.

8.4 years ago by 5heikki 11k

1


I think PacBio is switching to BAM, too.

8.4 years ago by lh3 33k

score 5 · Answer 1 · 2015-12-17

There are a bunch of standard SAM tags created purely for metadata. New tags may be added when there is a strong need.
Read groups have a bunch of standard tags, too, again for metadata. A BAM may have multiple read groups.
Whether any of this matters depends on whether your tool cares about them. For example, most duplicate removers use library info when present.
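As a rough illustration of how the header-level and read-level pieces fit together, the sketch below parses `@RG` header lines from SAM text and resolves each read's library and platform through its `RG:Z:` tag. The function names and all sample values are hypothetical; the point is only that platform and library are stored per read group, so two groups with different `PL` values can coexist in one file.

```python
def parse_read_groups(header_lines):
    """Map read-group ID -> dict of that group's header fields (PL, LB, SM, ...)."""
    groups = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        fields = dict(field.split(":", 1) for field in line.split("\t")[1:])
        groups[fields["ID"]] = fields
    return groups


def read_group_of(record_line):
    """Return the value of the RG:Z: optional tag, or None if the read has none."""
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in record_line.split("\t")[11:]:
        if tag.startswith("RG:Z:"):
            return tag[5:]
    return None


# Hypothetical merged file: two read groups from different platforms and libraries.
sam_lines = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@RG\tID:rgA\tPL:ILLUMINA\tLB:lib1\tSM:sample1",
    "@RG\tID:rgB\tPL:IONTORRENT\tLB:lib2\tSM:sample1",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:rgA",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\tRG:Z:rgB",
]
header = [line for line in sam_lines if line.startswith("@")]
records = [line for line in sam_lines if not line.startswith("@")]

groups = parse_read_groups(header)
# Per-read lookup of the library, the kind of key a duplicate remover could group by.
library_by_read = {
    rec.split("\t")[0]: groups[read_group_of(rec)]["LB"] for rec in records
}
platform_by_read = {
    rec.split("\t")[0]: groups[read_group_of(rec)]["PL"] for rec in records
}
print(library_by_read, platform_by_read)
```

A tool that ignores read groups simply never performs this lookup, which is why the metadata only helps when the downstream software is written to use it.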

Unmapped BAM is a pretty old thing. It has been around for 5+ years and is used by several big sequencing centers, but it is not often seen elsewhere. I don't like uBAM because it is too heavy. If all I care about is read sequences, which is common, fastq is much more convenient. I don't need to learn new APIs. Users don't need to install extra dependencies. It is a win for many (of course not for everyone – that is why big centers prefer uBAM).
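The convenience argument can be seen in a minimal sketch of a FASTQ reader that uses only the Python standard library. It assumes the common four-lines-per-record layout (no line-wrapped sequences); the function name and demo reads are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


# Demo input held in memory; a real file opened with open() works the same way.
demo = io.StringIO("@read1 lane=1\nACGT\n+\nIIII\n@read2 lane=1\nGGCA\n+\nII#I\n")
reads = list(read_fastq(demo))
for name, seq, qual in reads:
    print(name, seq, qual)
```

Reading the same sequences from a BAM file would instead require a BAM-aware library or an external tool, which is the extra dependency referred to above.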