Samtools not sorting a BAM file correctly?
5.8 years ago
pinn ▴ 210

Hi,

I'm not able to sort my BAM file correctly using samtools sort: the BAM file size increases after sorting. Aligner: LAST. Dataset: human genome. I generated a MAF file, converted it to SAM, converted the SAM to BAM, and then sorted the BAM (the problematic step).

1) Converting LAST output from MAF to SAM

./maf-convert sam last-941/src/SRR2928269.test.maf > SRR2928269.last.sam

2) Adding a header to the SAM and converting it to BAM

samtools view -bT test/hg38.fa SRR2928269.last.sam > SRR2928269.last.bam

3) BAM sorting

samtools sort -@ 5 SRR2928269.last.bam -T /tmp/SRR2928269.last.bam.sort -o SRR2928269.last.sorted.bam

(This sometimes does NOT work on bigger datasets.)

samtools sort -@ 10 SRR2928269.last.bam -o SRR2928269.last.sorted.bam

(This command generates a much bigger BAM file than the unsorted BAM file.)


BAM file sizes:
Dataset 1
60G ----> SRR2928269.last.bam
91G ----> SRR2928269.last.sorted.bam **(after sorting bam)**
Dataset 2
61G ----> SRR2928268.last.bam
95G ----> SRR2928268.last.sorted.bam **(after sorting bam)**
Dataset 3
83G ----> SRR2928267.last.bam
122G ----> SRR2928267.last.sorted.bam **(after sorting bam)**

Can anyone comment on this and suggest how to sort it out?

Assembly genome next-gen • 2.9k views

@OP: my understanding is that a sorted BAM is slightly smaller than the unsorted BAM. Try taking a small subset of your unsorted BAM, sorting it, and comparing the sizes.
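A rough way to run that small-scale check, sketched as a shell function. The function name `subset_bam`, the file names, and the line count are all placeholders chosen for illustration, not anything from the original post.

```shell
# Write the first N SAM lines (header lines plus alignment records) of a BAM
# file into a new, smaller BAM file.
#   $1 = input BAM, $2 = output BAM, $3 = number of SAM lines to keep
subset_bam() {
    samtools view -h "$1" | head -n "$3" | samtools view -b -o "$2" -
}

# Example usage (hypothetical file names):
#   subset_bam SRR2928269.last.bam subset.bam 1000000
#   samtools sort -o subset.sorted.bam subset.bam
#   ls -lh subset.bam subset.sorted.bam
```

Because the line count includes the header, the number of alignment records kept is somewhat smaller than N.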

No, the sorted BAM is much bigger than the unsorted BAM file. Why? I tried sorting once again and it generated a BAM file of the same size.

I cannot reproduce this. My sorted BAMs are always slightly smaller than the unsorted ones. Check whether the number of reads is the same, and then proceed with your analysis.

Then something is wrong.

Just out of interest, @OP: is it correct that you use hg38.fa together with SRR2928269, as in your command? SRR2928269 is RNA-seq from a monkey.

pinninti1991reddy : Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

5.8 years ago
GenoMax 141k

NEVER use file sizes as a metric for any NGS QC. At best they can be used as a guide to see whether a process or step failed.

An appropriate @HD-SO sort order header tag will be added or an existing one updated when the file is sorted so check that.

Are you sure there is enough space in /tmp for -T to work right? Perhaps that is your problem. Use a regular directory for -T option instead.
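Both suggestions can be sketched as small shell functions. The function names and the scratch directory argument are hypothetical; only `samtools view -H` and the `-@`, `-T`, and `-o` options of `samtools sort` are taken from samtools itself.

```shell
# Print the sort order recorded in the SO tag of the @HD header line,
# or "unknown" if the header carries no SO tag.
#   $1 = BAM file
bam_sort_order() {
    so=$(samtools view -H "$1" |
        awk -F'\t' '$1 == "@HD" { for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) print substr($i, 4) }')
    echo "${so:-unknown}"
}

# Sort while writing temporary files under a regular directory instead of /tmp.
#   $1 = input BAM, $2 = output BAM
#   $3 = scratch directory with enough free space (hypothetical, pick your own)
sort_with_tmpdir() {
    mkdir -p "$3"
    samtools sort -@ 5 -T "$3/sort_tmp" -o "$2" "$1"
}

# Example usage (hypothetical paths):
#   sort_with_tmpdir SRR2928269.last.bam SRR2928269.last.sorted.bam ./sort_scratch
#   bam_sort_order SRR2928269.last.sorted.bam
```

A coordinate-sorted file should report `coordinate` here; anything else suggests the sort did not complete as expected.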

5.8 years ago
ATpoint 82k

As GenoMax said, do not use file size as an indicator for anything. Compression levels do matter. Use samtools flagstat to see whether the number of reads is the same.
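One way to do that comparison, as a minimal sketch. The function name `same_read_counts` is made up for illustration; it relies on `samtools flagstat` output containing only counts and no file names, so identical content gives identical text.

```shell
# Return success (0) if two BAM files give identical samtools flagstat output,
# 1 if the output differs, and 2 if flagstat itself failed on either file.
#   $1 = unsorted BAM, $2 = sorted BAM
same_read_counts() {
    before=$(samtools flagstat "$1") || return 2
    after=$(samtools flagstat "$2") || return 2
    [ "$before" = "$after" ]
}

# Example usage (file names from the question):
#   if same_read_counts SRR2928269.last.bam SRR2928269.last.sorted.bam; then
#       echo "read counts match"
#   else
#       echo "read counts differ or flagstat failed"
#   fi
```

If the counts match, a size difference is more likely down to compression than to lost or duplicated reads; `samtools sort` accepts `-l` to set the output compression level.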

"NOT working on bigger datasets, sometimes"

That is not too surprising. You gave no details on what is not working, so I am guessing here, but specifying /tmp in -T is a bad idea: on many systems /tmp is mounted as tmpfs, which means it is backed by the memory of your machine. In that case your temporary files are written to and kept in memory the whole time, possibly killing the process. The purpose of these temporary files is to free up memory once it is full, so change -T to a path on disk.
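Whether /tmp is actually RAM-backed depends on the system, so it is worth checking before deciding. A minimal sketch, assuming a Linux-style setup where a tmpfs mount shows `tmpfs` in the first column of `df -P` (the function name `tmp_is_ram_backed` is hypothetical):

```shell
# Return success (0) if the given directory sits on a tmpfs (RAM-backed) mount.
# Assumes the mount source is reported as "tmpfs", which is the usual case on
# Linux but not guaranteed everywhere.
#   $1 = directory to check (defaults to /tmp)
tmp_is_ram_backed() {
    fs=$(df -P "${1:-/tmp}" | awk 'NR == 2 { print $1 }')
    [ "$fs" = "tmpfs" ]
}

# Example usage:
#   if tmp_is_ram_backed /tmp; then
#       echo "/tmp is tmpfs: point -T at a directory on disk instead"
#   fi
```

Separately, `samtools sort` has a `-m` option that limits the memory used per thread, which matters when running with many threads via `-@`.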
