file size after sorting the BAM file using samtools
4.2 years ago
neranjan ▴ 60

Hi All,

I wanted to create a sorted BAM file from the SAM file.

So these are the steps I took

samtools view -@ 8 -bhS input.sam -o mapped.bam
samtools sort -@ 8 mapped.bam -o sorted.bam

Once I had created mapped.bam and sorted.bam, I looked at the file sizes and saw a discrepancy. I had assumed the two files would be the same size, but in fact they were not.

1.5G   mapped.bam 
974M   sorted.bam

My question is:
1) What am I doing wrong here?
2) Is there a way to check that the contents of these two files are the same? (I am assuming the contents should ideally be the same, since sorting only changes the order of the records.)

Thank you very much.

alignment samtools next-gen • 2.7k views

Thanks Pierre Lindenbaum

Does it make any difference if I use Picard to sort the BAM files?


Picard is slower, and it does not sort the same way as samtools when sorting by queryname (as opposed to coordinate).


The answer about sizes has already been given so I won't repeat it.

However, in answer to part 2, we locally use Biobambam's bamseqchksum tool to validate that a file operation hasn't lost data in the process, or that it has lost only the bits we know will be lost. For example, it can compute checksums of all the sequences and quality strings irrespective of order, and hence validate that they still exist and haven't been modified.

https://manpages.debian.org/unstable/biobambam2/bamseqchksum.1.en.html

That may look complex, but just do "bamseqchksum < input.bam" and you'll get some stats.
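
If bamseqchksum is not installed, a rougher check is possible with samtools and standard tools. The sketch below is only an illustration: the function names bam_record_digest and same_records are made up for this example, and it assumes samtools and md5sum (GNU coreutils) are on the PATH. It dumps the alignment records without the header (sorting rewrites the header, so the headers are expected to differ), sorts the text lines, and compares the checksums.

```shell
# Digest of all alignment records, independent of their order in the file.
# Hypothetical helper; assumes samtools and md5sum are installed.
bam_record_digest() {
    # "samtools view" without -h prints the records only, not the header.
    samtools view "$1" | LC_ALL=C sort | md5sum | cut -d ' ' -f 1
}

# Succeeds (exit status 0) when both files give the same record digest.
same_records() {
    [ "$(bam_record_digest "$1")" = "$(bam_record_digest "$2")" ]
}

# Usage with the file names from the question:
#   same_records mapped.bam sorted.bam && echo "same records"
```

Note that this does not check that samtools itself ran successfully, so it is worth also comparing the record counts from "samtools view -c" on each file.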

4.2 years ago

1) what I am doing wrong here ?

Nothing is wrong. BAM files are gzip-compressed (in BGZF blocks), and after sorting the reads are ordered by coordinate. As a consequence, very similar data (the DNA sequences of overlapping reads) end up close together, which makes compression much more efficient.


Thanks for the reply,

So you are saying that the size of a BAM file will differ before and after sorting?

I can understand that BAM and SAM files differ in size.

11G  input.sam
1.5G   mapped.bam 
974M   sorted.bam

So you are saying that the size of a BAM file will differ before and after sorting?

Yes, sorting changes the size of the compressed data.

$ wget -q -O - "https://en.wikipedia.org" |fold -w 1 | gzip -c | wc -c
23105
$ wget -q -O - "https://en.wikipedia.org" |fold -w 1 | sort | gzip -c | wc -c
851
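
The same effect can be reproduced offline with data that looks a bit more like reads. This is only a toy sketch under stated assumptions: the "reference" and "reads" are random made-up data, the file names reads_unsorted.txt and reads_sorted.txt are arbitrary, and plain gzip stands in for BAM's block compression. Overlapping reads are written once in scrambled order and once sorted by start position, and the compressed sizes are compared.

```shell
# Build a pseudo-random 20,000-base "reference", then cut 4,000 overlapping
# 100-base "reads" from random start positions (toy data, not real reads).
awk 'BEGIN {
    srand(42)                                  # fixed seed for repeatable data
    split("A C G T", base, " ")
    ref = ""
    for (i = 0; i < 20000; i++) ref = ref base[int(rand() * 4) + 1]
    for (i = 1; i <= 4000; i++) {
        pos = int(rand() * 19900)              # random start position
        printf "%06d\t%s\n", pos, substr(ref, pos + 1, 100)
    }
}' > reads_unsorted.txt

# Zero-padded positions make a plain text sort equal to a coordinate sort.
LC_ALL=C sort reads_unsorted.txt > reads_sorted.txt

# Compare the compressed sizes of the two orderings.
unsorted_bytes=$(gzip -c reads_unsorted.txt | wc -c | tr -d ' ')
sorted_bytes=$(gzip -c reads_sorted.txt | wc -c | tr -d ' ')
echo "unsorted: $unsorted_bytes bytes, sorted: $sorted_bytes bytes"
```

Both files contain exactly the same lines. In the sorted file each read overlaps heavily with the line just before it, so the compressor can reuse what it has just seen, and the sorted file should come out smaller.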

Thank you very much Pierre Lindenbaum


Please accept the answer to close the question (green tick on the left).
