Question: Why Are SAM/BAM Files So Large?
Fixee wrote:

I'm a complete novice with zero background in bio; I spent the day yesterday trying to answer this question without any luck.

Reading the paper describing the SAM format, I see it says that the number of bps in an alignment set can exceed 100 billion for deep resequencing of a single human. Given that the human genome has about 3.3 billion bps, I would assume the reference string is upper-bounded by this number. And assuming that "deep" means coverage of about 10x, we get 33 billion bps, far below the number we were supposed to exceed. Diploid sequencing doubles this, but we still fall short. Questions:

  • What would cause us to exceed 100 billion bps?
  • Does a deep resequencing of a human require alignment against the 98% of the reference genome that is shared by all humans?
  • At 2 bits per nucleotide, the SAM file should be about 25 GB for 100 billion bps, but these files are often 500+ GB. Why? (See the quick arithmetic check below.)
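
A quick arithmetic check of the figures I'm assuming above (the genome size and coverage numbers are rough values, not exact):

    # Rough check of the coverage and file-size arithmetic (approximate figures).
    genome_bp = 3.3e9                       # approximate human genome size used above
    coverage = 10                           # assuming "deep" means ~10x
    print(genome_bp * coverage / 1e9)       # ~33 (billion aligned bases)

    bits_per_base = 2                       # 2 bits can encode A/C/G/T
    print(100e9 * bits_per_base / 8 / 1e9)  # ~25 (GB for 100 billion bases)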

To reiterate, I'm a complete novice. If you respond to this question, I would be deeply in your debt if you could use simple terminology.

Gww wrote:

SAM/BAM files contain a lot more than just the read sequence. There is the quality string, which is 1 byte per read base (in both SAM and BAM files), the CIGAR string, the read ID, the flag, and optional tags. Furthermore, in SAM files the sequence itself is stored as 1 byte per base, while in BAM files it is packed into 4 bits per base. These numbers aren't exact, because BAM files are also block compressed, so the number of bytes per base will be smaller, especially if the BAM file is sorted by read target and offset.
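
As a rough illustration of how those fields add up per read before compression (all of the sizes below are approximate, and real records also carry variable-length optional tags):

    # Very rough per-read size estimate for an uncompressed BAM record.
    # All numbers are approximate; tags and read names vary in length.
    read_len = 100                    # assumed read length in bases

    seq_bytes   = read_len / 2        # sequence packed at 4 bits per base in BAM
    qual_bytes  = read_len            # quality string: 1 byte per base
    name_bytes  = 30                  # read ID, length varies by instrument
    cigar_bytes = 4                   # one 32-bit CIGAR operation for a simple match
    fixed_bytes = 32                  # flag, position, MAPQ, etc. (approximate)

    total = seq_bytes + qual_bytes + name_bytes + cigar_bytes + fixed_bytes
    print(total, "bytes per read ->", 8 * total / read_len, "bits per base")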

Ketil wrote:

To try to answer your questions:

  • 100 Gbp isn't difficult to achieve: an Illumina HiSeq produces something like 100M reads - times two for paired ends - and I think that's just a single lane (see the rough calculation after this list).

  • No, you're probably not interested in a lot of that for diagnostic purposes. But it's probably cheaper and simpler to sequence it all, rather than to try to PCR out the bits you are interested in.

  • Somebody already pointed to quality data, which take up a significant (and poorly compressible) chunk of the SAM format. Use 'samtools view' on a BAM file to see the contents in detail (remember to pipe output to less).
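
For the first point, a rough throughput calculation (the read count and read length are plausible figures for that era, not exact specifications):

    # Rough throughput estimate for HiSeq-era sequencing (illustrative figures).
    read_pairs  = 100e6                  # ~100 million read pairs per lane
    read_length = 100                    # bases per read
    bases = read_pairs * 2 * read_length
    print(bases / 1e9, "Gbp per lane")   # ~20 Gbp; a handful of lanes reaches 100 Gbp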

But basically, the reason files are large is that they contain lots of data. Sequencing is cheap, so we get lots of sequences.

Gww replied:

@Ketil: Illumina HiSeq-2000 produces almost 80 million paired-end reads in a single lane

Ketil replied:

Yes, but we see a large variation, with FASTQ files ranging from (2x) 8 GB to almost 30 GB, the largest containing over 100M reads.

2184687-1231-83- wrote:

In many resequencing settings, "deep" means coverage of about 30-40x. Part of the reason is sequencing errors, but another part is low-frequency and heterozygous SNPs that need enough coverage to be detected.
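
To make that concrete (using the ~3.3 Gbp genome size from the question as a round figure):

    # At 30-40x coverage the aligned base count reaches the 100 billion range.
    genome_bp = 3.3e9                 # approximate human genome size
    for coverage in (30, 40):
        print(coverage, "x ->", genome_bp * coverage / 1e9, "billion bases")
    # 30x -> 99 billion bases, 40x -> 132 billion bases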

Jeremy Leipzig wrote:

The Hsi paper shows a 1000 Genomes BAM file for one human chromosome using 17.48 bits/base (for everything, not just the sequence), and that a somewhat lossy, reference-based compression scheme could bring that down to an amazing 0.74 bits/base.

That's a huge improvement; the main drawbacks are presumably the processing time to create such a file and the time penalty to work with it.
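
To put those two rates in perspective, applied to the 100 billion bases from the question (the sizes are just the straight bits-to-bytes conversion):

    # File sizes implied by the two compression rates for 100 billion bases.
    bases = 100e9
    for label, bits_per_base in (("BAM as stored", 17.48), ("reference-based", 0.74)):
        print(label, "->", bases * bits_per_base / 8 / 1e9, "GB")
    # BAM as stored -> ~219 GB, reference-based -> ~9 GB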

lh3 replied:

The 1000g files are huge because they keep two quality strings and a lot of other unnecessary information. If we do it right, it should cost ~10 bits/base in its current form, or <8 bits when we merge samples. Ultimately, reference-based compression is the future. More tools will be designed to work directly with such files, just as more and more tools work with SAM/BAM.
