I'm a complete novice with zero background in biology; I spent all of yesterday trying to answer this question, without any luck.
The paper describing the SAM format says that the number of base pairs (bp) in an alignment set can exceed 100 billion for deep resequencing of a single human. Given that the human genome has about 3.3 billion bp, I would assume the reference sequence is upper-bounded by this number. Assuming that "deep" means coverage of about 10x, we get 33 billion bp, far below the 100 billion we were supposed to exceed. Diploid sequencing doubles this to 66 billion bp, but that still falls short. Questions:
- What would cause us to exceed 100 billion bp?
- Does a deep resequencing of a human require alignment against the 98% of the reference genome that is shared by all humans?
- At 2 bits per nucleotide, a SAM file holding 100 billion bp should be about 25 GB, but these files are often 500+ GB. Why?
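To make my arithmetic explicit, here is a minimal sketch in Python. The constants are just the rough figures I quoted above (3.3 billion bp, 10x coverage, 2 bits per nucleotide), and the function names are my own, purely for illustration. It only reproduces my estimates; it does not model what a real SAM file actually stores.

```python
# Rough figures from my question; all of these are assumptions, not measured values.
GENOME_BP = 3_300_000_000        # approximate size of the human genome in base pairs
TARGET_BP = 100_000_000_000      # the "100 billion" figure quoted from the SAM paper
BITS_PER_BASE = 2                # A, C, G, T encoded in 2 bits each


def aligned_bases(genome_bp, coverage):
    """Total bases sequenced if every position is covered `coverage` times."""
    return genome_bp * coverage


def packed_size_gb(bases, bits_per_base=BITS_PER_BASE):
    """Size in gigabytes (10^9 bytes) if only the bases were stored, bit-packed."""
    return bases * bits_per_base / 8 / 1e9


def coverage_needed(total_bases, genome_bp):
    """Coverage that would be required to reach `total_bases`."""
    return total_bases / genome_bp


if __name__ == "__main__":
    print(aligned_bases(GENOME_BP, 10))        # my 10x estimate
    print(aligned_bases(GENOME_BP, 10) * 2)    # doubled for the diploid case
    print(packed_size_gb(TARGET_BP))           # my 2-bits-per-base size estimate
    print(f"{coverage_needed(TARGET_BP, GENOME_BP):.1f}x")  # coverage implied by 100 billion bp
```

By this arithmetic, 100 billion bp would correspond to roughly 30x coverage rather than the 10x I assumed, so my guess at what "deep" means may be part of the problem.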
To reiterate, I'm a complete novice. If you respond to this question, I would be deeply in your debt for an answer in simple terminology.