Whilst fiddling around with the BAM format, I was a little surprised to see that since it supports the IUPAC codes when encoding DNA in 4-bits-per-base, and the IUPAC only supports T OR U, but not both, that seems to mean that DNA in a BAM file can only have T; with U getting mapped to N.
This got me thinking:
- How do we currently store DNA-RNA hybrid fragments?
- How will we store DNA-modifications now that sequencing modifications directly has become feasible (SMRT-Seq). For example, how would we store a 5mC?
- What does an "=" mean? Is it IUPAC for gap, or SAM for "the same"?
Also, this is somewhat unrelated, but today I decompressed a typical BAM file, split it into its constituent columns, and then applied the usual compression on each column (to see how much smaller it would be). I found that after compression, the unique names we give to each read takes up roughly 4x more space that the actual DNA sequencing it describes. Since 20bp is often enough to describe a unique location in a 3billion bp genome, and these reads are 36bp long, I found it a rather poetic demonstration of the difference between Nature's engineering, and our own :)
It also makes me think sequencing machines should be more careful with what they label reads as, and future file format designers should totally ignore what their data looks like uncompressed. It is essentially irrelevant. We could have used 8 bits to encode each letter of DNA instead of 4, allowing for 256 kinds of DNA letter rather than 16, and the typical BAM filesize wouldn't have changed at all :D