I would very much like to know what people's opinions are about the uBAM format.
The idea of the uBAM format is that instead of storing all your FASTQ metadata in the file name or the FASTQ read header (which is then lost on mapping), you ditch the FASTQ format entirely and store the unmapped reads (paired or not) in a single standard BAM file, and tag your reads with the appropriate metadata.
In essence, this means the metadata moves with the data - and I *think* this means you can combine reads from different flowcells/lanes/multiplex adaptors into a single BAM file (myExperiment.bam) but still have at the read-level all the metadata about which sequencing machine was used, which flowcell, which lane, which multiplex barcode, which expected fragment length, etc. If that's true, I think that's really great and worth talking about.
Furthermore, I can imagine a scenario in the future where all your reads for a particular experiment are in a single BAM file (all your technical/biological replicates, even different assays such as RNA, ChIP, etc.), and the downstream tools programmatically determine which reads to use for signal, normalization, etc., by reading these BAM metadata tags.
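To make concrete what I mean by metadata travelling with each read, here is a rough sketch in plain Python that writes SAM text (not real BAM, and not using any library). It assumes the read-group mechanism from the SAM spec, where an `@RG` header line holds the metadata and each read carries an `RG:Z` tag pointing at it. The helper names and all the values (IDs, barcodes, library and sample names) are made up for illustration.

```python
# Sketch only: plain-text SAM lines, no BAM compression and no real library.
# ID, PL, PU, LB and SM are @RG fields from the SAM spec; every value below
# is a made-up placeholder.

def rg_header_line(rg_id, platform, platform_unit, library, sample):
    """Build one @RG header line describing a flowcell/lane/barcode combination."""
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"PL:{platform}",
        f"PU:{platform_unit}",
        f"LB:{library}",
        f"SM:{sample}",
    ]
    return "\t".join(fields)


def unmapped_record(read_name, seq, qual, rg_id):
    """Build one unmapped, unpaired SAM record tagged with its read group."""
    if len(seq) != len(qual):
        raise ValueError("SEQ and QUAL must be the same length")
    columns = [
        read_name,
        "4",  # FLAG: 0x4 = read unmapped
        "*",  # RNAME: no reference
        "0",  # POS
        "0",  # MAPQ
        "*",  # CIGAR
        "*",  # RNEXT
        "0",  # PNEXT
        "0",  # TLEN
        seq,
        qual,
        f"RG:Z:{rg_id}",  # pointer back to the @RG line with the same ID
    ]
    return "\t".join(columns)


header = [
    "@HD\tVN:1.6\tSO:unsorted",
    rg_header_line("fc1.lane1", "ILLUMINA", "FC1.1.ACGTAC", "libA", "sample1"),
    rg_header_line("fc2.lane3", "ILLUMINA", "FC2.3.TTGCAA", "libA", "sample1"),
]
records = [
    unmapped_record("read1", "ACGT", "IIII", "fc1.lane1"),
    unmapped_record("read2", "TTGA", "IIII", "fc2.lane3"),
]
sam_text = "\n".join(header + records) + "\n"
print(sam_text)
```

Here two reads from different flowcells sit in one file, and each can still be traced to its own flowcell/lane/barcode through its `RG` tag.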
But I must admit I'm not sure how some of this works. The Picard tool FastqToSam (https://broadinstitute.github.io/picard/command-line-overview.html#FastqToSam) has a number of potential metadata names which it can add to your BAM, but I'm not sure if:
- These are standardized names, or we're just using the X?/Y?/Z? user-definable SAM tags? If so, is there a consensus for metadata in BAM files somewhere?
- I know some tags are definitely read-level, but perhaps not all of them are. Maybe some live only in the BAM header, in which case merging two files with different PLATFORM values, say, would confuse things?
- Obviously the GATK tools can understand this metadata, but does anyone know of any other tools that are metadata-aware?
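To make the merging worry in the second bullet concrete: as far as I can tell from the SAM spec, the platform (`PL`) lives on the `@RG` header line, and each read only carries an `RG:Z` tag that points at it. If that reading is right, two files with different platforms could coexist after a merge as long as their read-group IDs stay distinct. Below is a rough plain-Python sketch of that lookup. The file contents, IDs and helper names are all hypothetical, and a real merge tool would also have to deal with colliding read-group IDs, which this ignores.

```python
# Sketch only: two tiny SAM texts stand in for two uBAM files; values are made up.
FILE_A = (
    "@RG\tID:runA\tPL:ILLUMINA\tSM:sample1\n"
    "readA1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:runA\n"
)
FILE_B = (
    "@RG\tID:runB\tPL:PACBIO\tSM:sample1\n"
    "readB1\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\tRG:Z:runB\n"
)


def naive_merge(*sam_texts):
    """Collect header lines and records separately (assumes unique RG IDs)."""
    header, records = [], []
    for text in sam_texts:
        for line in text.splitlines():
            if line.startswith("@"):
                header.append(line)
            else:
                records.append(line)
    return header, records


def read_groups(header):
    """Map each @RG ID to a dict of that header line's fields."""
    groups = {}
    for line in header:
        if not line.startswith("@RG"):
            continue
        fields = dict(item.split(":", 1) for item in line.split("\t")[1:])
        groups[fields["ID"]] = fields
    return groups


def platform_of(record, groups):
    """Follow a record's RG:Z tag back to the PL value in the header."""
    for tag in record.split("\t")[11:]:  # optional tags start at column 12
        if tag.startswith("RG:Z:"):
            return groups[tag[len("RG:Z:"):]]["PL"]
    return None


header, records = naive_merge(FILE_A, FILE_B)
groups = read_groups(header)
platforms = {rec.split("\t")[0]: platform_of(rec, groups) for rec in records}
print(platforms)
```

In this toy case each read resolves to its own platform after the merge, which is the behaviour I'd hope real tools give, but I'd welcome correction.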