Two bams with similar flagstat output have very different file size
0
0
Entering edit mode
7 months ago

Hi,

Here is the context, I tried to align one of whole genome sequencing (WGS) library to two different references using bwa (0.7.17) mem program, one is hg38 reference genome, the other is our own human assembly. After that, we got two bam file, let's call them hg38.bam and our.bam for hg38 alignment and our assembly alignment separately. The file sizes are quite different:

-rw-r--r-- 1 aaa aaa 68G Oct 11 19:15 hg38.bam
-rw-r--r-- 1 aaa aaa 34G Sep 23 18:00 our.bam


I found that hg38.bam and our.bam showed very similar samtools flagstat output as follows:

## samtools/1.9 is loaded
> samtools flagstat -@ 10 hg38.bam
5012623 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
736826700 + 0 mapped (99.69% : N/A)
734110372 + 0 paired in sequencing
713973832 + 0 properly paired (97.26% : N/A)
730652966 + 0 with itself and mate mapped
1161111 + 0 singletons (0.16% : N/A)
12730864 + 0 with mate mapped to a different chr
6698018 + 0 with mate mapped to a different chr (mapQ>=5)

> samtools flagstat -@ 10 our.bam
5149307 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
734647166 + 0 mapped (99.38% : N/A)
734110372 + 0 paired in sequencing
707687886 + 0 properly paired (96.40% : N/A)
727855208 + 0 with itself and mate mapped
1642651 + 0 singletons (0.22% : N/A)
17931634 + 0 with mate mapped to a different chr
9748969 + 0 with mate mapped to a different chr (mapQ>=5)


More informations are: 1. the two bams are generated using same version of samtools (1.8) with the following command:

bwa mem -M -R "@RG\tID:$lib\tLB:$lib\tPL:illumina\tSM:$lib" -t$nthreads $refGenome$fastq1 $fastq2 |$samtools view -bS - > lib.bam  And, 2. our.bam (smaller file) is sorted by coordinates, while hg38.bam (larger file) is not. My question is why the two files' size are so different. I've checked length of each line of two bams, they are also similar (~ 450 bytes). Thank you very much! alignment flagstat bam bwa • 353 views ADD COMMENT 0 Entering edit mode Compression level most likely introduced by the sorting. How did you sort? ADD REPLY 0 Entering edit mode OIC, I didn't realize the sorting will introduce compression, I think that can explain what I observed, because one of two files are not sorted! Will try to sort and re-compare. For the bam sorted, I sorted it by coordinates using samtools sort: samtools sort -osorted_bam \$input_bam


Thank you!

0
Entering edit mode

it will be interesting to know whether this is indeed caused by sorting. Makes sense intuitively, similar data gets placed closer next to one another hence compresses better. It probably depends a lot on the coverage of the data. The higher the coverage the more space savings.

0
Entering edit mode

I tend to remember that samtools uses different compression levels depending on the numbers of threads used but this is a faint memory. Why would higher coverafe save space in a bam file, It is one row per entry regardless of the coverage, I think in a bedGraph-like file coverage would matter a lot.

0
Entering edit mode

To follow up, I've sorted the hg38.bam file which is previously unsorted, the size decreased from 68G to 34G (become similar to the sorted our.bam which). Therefore, for @Istvan Albert question: "it will be interesting to know whether this is indeed caused by sorting", the answer is yes.

For another great point raised by @ATpoint about whether diff compression levels will be used for diff number of cpus used, I've tested by running the sorting using different cores (12 vs 4), it gave me same size of resulting sorted output.

Thank you guys again, for your great input!

1
Entering edit mode

thanks for following up, the best questions/answers are those that everyone learns something from

0
Entering edit mode

are you aligned reads from human to reference genome that you generated by your own ?! (Not GRCh38 and Not GRCh37) !! if yes! why would you do that ?! and how do you suspect to have the same result?!

0
Entering edit mode

What does this comment has to do with the question?

0
Entering edit mode

He Said: Here is the context, I tried to align one of whole genome sequencing (WGS) library to two different references using bwa (0.7.17) mem program, one is hg38 reference genome, the other is our own human assembly. After that, we got two bam file, let's call them hg38.bam and our.bam for hg38 alignment and our assembly alignment separately.

i just want to ensure that i understand the issue !!