Question: Two bams with similar flagstat output have very different file size
0
gravatar for freeheart.steven
12 weeks ago by
Singapore
freeheart.steven10 wrote:

Hi,

Here is the context, I tried to align one of whole genome sequencing (WGS) library to two different references using bwa (0.7.17) mem program, one is hg38 reference genome, the other is our own human assembly. After that, we got two bam file, let's call them hg38.bam and our.bam for hg38 alignment and our assembly alignment separately. The file sizes are quite different:

-rw-r--r-- 1 aaa aaa 68G Oct 11 19:15 hg38.bam
-rw-r--r-- 1 aaa aaa 34G Sep 23 18:00 our.bam

I found that hg38.bam and our.bam showed very similar samtools flagstat output as follows:

## samtools/1.9 is loaded
> samtools flagstat -@ 10 hg38.bam
739122995 + 0 in total (QC-passed reads + QC-failed reads)
5012623 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
736826700 + 0 mapped (99.69% : N/A)
734110372 + 0 paired in sequencing
367055186 + 0 read1
367055186 + 0 read2
713973832 + 0 properly paired (97.26% : N/A)
730652966 + 0 with itself and mate mapped
1161111 + 0 singletons (0.16% : N/A)
12730864 + 0 with mate mapped to a different chr
6698018 + 0 with mate mapped to a different chr (mapQ>=5)

> samtools flagstat -@ 10 our.bam
739259679 + 0 in total (QC-passed reads + QC-failed reads)
5149307 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
734647166 + 0 mapped (99.38% : N/A)
734110372 + 0 paired in sequencing
367055186 + 0 read1
367055186 + 0 read2
707687886 + 0 properly paired (96.40% : N/A)
727855208 + 0 with itself and mate mapped
1642651 + 0 singletons (0.22% : N/A)
17931634 + 0 with mate mapped to a different chr
9748969 + 0 with mate mapped to a different chr (mapQ>=5)

More informations are: 1. the two bams are generated using same version of samtools (1.8) with the following command:

bwa mem -M -R "@RG\tID:$lib\tLB:$lib\tPL:illumina\tSM:$lib" -t $nthreads $refGenome $fastq1 $fastq2 | $samtools view -bS - > $lib.bam

And, 2. our.bam (smaller file) is sorted by coordinates, while hg38.bam (larger file) is not.

My question is why the two files' size are so different.

I've checked length of each line of two bams, they are also similar (~ 450 bytes).

Thank you very much!

bam bwa alignment flagstat • 229 views
ADD COMMENTlink modified 12 weeks ago by a.alnawfal.1992110 • written 12 weeks ago by freeheart.steven10

Compression level most likely introduced by the sorting. How did you sort?

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by ATpoint44k

OIC, I didn't realize the sorting will introduce compression, I think that can explain what I observed, because one of two files are not sorted! Will try to sort and re-compare.

For the bam sorted, I sorted it by coordinates using samtools sort:

samtools sort -o $sorted_bam $input_bam

Thank you!

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by freeheart.steven10

it will be interesting to know whether this is indeed caused by sorting. Makes sense intuitively, similar data gets placed closer next to one another hence compresses better. It probably depends a lot on the coverage of the data. The higher the coverage the more space savings.

ADD REPLYlink written 12 weeks ago by Istvan Albert ♦♦ 86k

I tend to remember that samtools uses different compression levels depending on the numbers of threads used but this is a faint memory. Why would higher coverafe save space in a bam file, It is one row per entry regardless of the coverage, I think in a bedGraph-like file coverage would matter a lot.

ADD REPLYlink written 12 weeks ago by ATpoint44k

To follow up, I've sorted the hg38.bam file which is previously unsorted, the size decreased from 68G to 34G (become similar to the sorted our.bam which). Therefore, for @Istvan Albert question: "it will be interesting to know whether this is indeed caused by sorting", the answer is yes.

For another great point raised by @ATpoint about whether diff compression levels will be used for diff number of cpus used, I've tested by running the sorting using different cores (12 vs 4), it gave me same size of resulting sorted output.

Thank you guys again, for your great input!

ADD REPLYlink written 12 weeks ago by freeheart.steven10
1

thanks for following up, the best questions/answers are those that everyone learns something from

ADD REPLYlink written 12 weeks ago by Istvan Albert ♦♦ 86k

are you aligned reads from human to reference genome that you generated by your own ?! (Not GRCh38 and Not GRCh37) !! if yes! why would you do that ?! and how do you suspect to have the same result?!

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by a.alnawfal.1992110

What does this comment has to do with the question?

ADD REPLYlink written 12 weeks ago by ATpoint44k

He Said: Here is the context, I tried to align one of whole genome sequencing (WGS) library to two different references using bwa (0.7.17) mem program, one is hg38 reference genome, the other is our own human assembly. After that, we got two bam file, let's call them hg38.bam and our.bam for hg38 alignment and our assembly alignment separately.

i just want to ensure that i understand the issue !!

ADD REPLYlink modified 12 weeks ago • written 12 weeks ago by a.alnawfal.1992110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1830 users visited in the last hour
_