Forum: uBAM & metadata - the death of Fastq?
John (Germany) wrote 4.6 years ago:

Hello all!

I would very much like to know what people's opinions are about the uBAM format.
(http://gatkforums.broadinstitute.org/discussion/6484/)


The idea of the uBAM format is that instead of storing all your FASTQ metadata in the file name or the FASTQ read header (which is then lost on mapping), you ditch the FASTQ format entirely, store the unmapped reads (paired or not) in a single standard BAM file, and tag your reads with the appropriate metadata.

In essence, this means the metadata moves with the data, and I *think* this means you can combine reads from different flowcells/lanes/multiplex adapters into a single BAM file (myExperiment.bam) but still have, at the read level, all the metadata about which sequencing machine was used, which flowcell, which lane, which multiplex barcode, which expected fragment length, etc. If that's true, I think it's really great and worth talking about.
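As a rough illustration of what that conversion amounts to, here is a minimal sketch in plain Python that writes SAM text (the uncompressed text form of BAM) rather than a real BAM file. The function names, the read-group values and the example read are all made up for illustration; the tag names (ID, SM, LB, PL, PU) and the per-read RG:Z tag are the ones defined in the SAM specification. It only handles single-end reads with flag 4 (unmapped); paired reads would need the paired/mate flags as well.

```python
# Sketch: turn FASTQ records into unmapped SAM lines that carry a read-group tag.
# All read-group values below are hypothetical examples, not from a real run.

def rg_header_line(rg):
    """Build an @RG header line from a dict of standard read-group tags."""
    fields = ["@RG", "ID:" + rg["ID"]]
    fields += [f"{tag}:{value}" for tag, value in rg.items() if tag != "ID"]
    return "\t".join(fields)

def fastq_records(lines):
    """Yield (name, sequence, qualities) from simple 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        name = lines[i][1:].split()[0]  # drop '@' and anything after whitespace
        yield name, lines[i + 1], lines[i + 3]

def unmapped_sam_line(name, seq, qual, rg_id, flag=4):
    """Format one unmapped read: no reference, no position, no CIGAR."""
    return "\t".join(
        [name, str(flag), "*", "0", "0", "*", "*", "0", "0", seq, qual,
         "RG:Z:" + rg_id]  # per-read pointer to the @RG header line
    )

read_group = {
    "ID": "flowcellA.lane1",   # hypothetical read-group ID
    "SM": "sample1",           # sample
    "LB": "lib1",              # library
    "PL": "ILLUMINA",          # platform
    "PU": "flowcellA.1.ACGT",  # platform unit (flowcell, lane, barcode)
}
fastq = ["@read1 extra\n", "ACGT\n", "+\n", "IIII\n"]

sam_lines = [rg_header_line(read_group)] + [
    unmapped_sam_line(name, seq, qual, read_group["ID"])
    for name, seq, qual in fastq_records(fastq)
]
print("\n".join(sam_lines))
```

In this layout the descriptive metadata sits once in the header, and each read carries only the read-group ID.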

Furthermore, I can imagine a scenario in the future where all your reads for a particular experiment are in a single BAM file (all your technical/biological replicates, even different assays such as RNA, ChIP, etc.), and the downstream tools programmatically determine which reads to use for signal, normalization, etc., by reading these BAM metadata tags.

But I must admit I'm not sure how some of this works. The Picard tool FastqToSam (https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) has a number of potential metadata names which it can add to your BAM, but I'm not sure about the following:

  1. Are these standardized names, or are we just using the X?/Y?/Z? user-definable SAM tags? If the latter, is there a consensus for metadata in BAM files somewhere?
  2. I know some tags are definitely read-level, but perhaps not all of them are. Maybe some are in the BAM header alone, and thus merging two files with different PLATFORM values, say, would confuse things?
  3. Obviously the GATK tools can understand this metadata, but does anyone know of any other tools that are metadata-aware?

Ion Torrent PGM default output is unmapped BAM.

Reply written 4.6 years ago by 5heikki

I think PacBio is switching to BAM, too.

Reply written 4.6 years ago by lh3

lh3 (United States) wrote 4.6 years ago:
  1. There are a bunch of standard SAM tags created purely for metadata. New tags may be added when there is a strong need.
  2. Read groups have a bunch of standard tags, too, again for metadata. A BAM may have multiple read groups.
  3. It all depends on whether your tool cares about them. For example, most duplicate removers use library info when present.
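To make points 2 and 3 concrete, here is a minimal sketch in plain Python, working on SAM text rather than binary BAM. It assumes a merged file whose header has two @RG lines; the read-group IDs, library names and reads are invented for illustration, and the helper names are hypothetical. Each read is resolved to its read group through its RG:Z tag and then grouped by library (LB), which is roughly the kind of lookup a library-aware tool would need.

```python
from collections import defaultdict

def parse_read_groups(header_lines):
    """Map read-group ID -> dict of its @RG tags (SM, LB, PL, PU, ...)."""
    groups = {}
    for line in header_lines:
        if line.startswith("@RG"):
            fields = line.rstrip("\n").split("\t")[1:]
            tags = dict(field.split(":", 1) for field in fields)
            groups[tags["ID"]] = tags
    return groups

def read_group_id(sam_line):
    """Return the value of the read's RG:Z tag, or None if it has none."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:  # optional fields
        if field.startswith("RG:Z:"):
            return field[5:]
    return None

def reads_by_library(header_lines, read_lines):
    """Group read names by the library (LB) of their read group."""
    groups = parse_read_groups(header_lines)
    by_library = defaultdict(list)
    for line in read_lines:
        rg = groups.get(read_group_id(line), {})
        by_library[rg.get("LB", "unknown")].append(line.split("\t")[0])
    return dict(by_library)

# Hypothetical merged file: two read groups from different platforms.
header = [
    "@RG\tID:rg1\tSM:s1\tLB:libA\tPL:ILLUMINA",
    "@RG\tID:rg2\tSM:s1\tLB:libB\tPL:IONTORRENT",
]
reads = [
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:rg1",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\tRG:Z:rg2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCA\tIIII\tRG:Z:rg1",
]

result = reads_by_library(header, reads)
print(result)
```

Because the platform is recorded per read group, two different PL values can sit side by side in one header without conflict.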

Unmapped BAM is a pretty old thing. It has been around for 5+ years and is used by several big sequencing centers, but it is not often seen elsewhere. I don't like uBAM because it is too heavy. If all I care about is the read sequences, which is common, FASTQ is much more convenient. I don't need to learn new APIs. Users don't need to install extra dependencies. It is a win for many (of course not for everyone – that is why big centers prefer uBAM).
