Issue while sorting a file
4.3 years ago
ricfoz ▴ 80

Hello everyone,

I am working with some BAM files and trying to retrieve a FASTA sequence from them.

My workflow is sorted out, but I have a question because the file sizes don't seem to match. The issue is as follows:

My original BAM file is 2 GB. When I sort it so that I can retrieve a sequence with "samtools view", like this:

samtools sort inputfile.bam -o inputfile.sort.bam

the resulting file is only 390 KB.

Is that normal? I have inspected a region of the genome with "less" and it does contain data. Still, I am unsure whether the sorted file is truncated, since I expected it to be about 2 GB like the original.

Does anyone have an idea what may be happening, or whether the sorted file can be relied on?

sort trouble inconsistency • 1.1k views

Hello ricfoz,

It is normal for the file to be smaller after sorting, as compression works better on sorted data. But this difference is too large. Are you sure that your inputfile.bam is really a BAM file and not a SAM file? Try

$ head inputfile.bam

If you get something human-readable, it is a SAM file, which could explain why this file is so much larger.
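
A scripted variant of the same check, as a minimal sketch: BAM files are BGZF-compressed, and BGZF is gzip-compatible, so a BAM file starts with the gzip magic bytes 1f 8b, while a SAM file starts with plain text. The function name `looks_gzipped` is made up for illustration and is not part of samtools. Passing the check only shows that the file is gzip-compressed, not that it is a valid BAM.

```shell
# Hypothetical helper (not a samtools command): print whether a file
# begins with the gzip magic bytes 1f 8b, as BGZF-compressed BAM files do.
looks_gzipped() {
    # Read the first two bytes and render them as lowercase hex, e.g. "1f8b".
    magic=$(head -c 2 "$1" | od -An -v -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "compressed"
    else
        echo "not compressed"
    fi
}

# Usage (file name taken from the question):
#   looks_gzipped inputfile.bam
```

If this reports "not compressed" for inputfile.bam, the input is probably SAM text despite its extension.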

fin swimmer

4.3 years ago
drkennetz ▴ 560

No, this is not normal. Your file has shrunk by a factor of roughly 5000, and samtools sort doesn't remove any reads. As a quick check, you can pipe samtools view into wc -l for both the unsorted and the sorted file:

samtools view unsorted.bam | wc -l
samtools view sorted.bam | wc -l

This will tell you the number of lines (alignment records) in each of your BAM files. I would imagine the sorted one will be much smaller. A way to track down the real issue would be to run samtools under gdb, the GNU debugger, to see whether something actually goes wrong with the file. You can do this by typing the following on the command line:

gdb samtools

Some text will print to the screen (GNU gdb etc.). Your normal command-line prompt [user]$ will then be replaced by the gdb prompt, at which you enter:

(gdb) run sort input.bam -o input.sort.bam

It will most likely print some errors that you may not understand, related to a file compression failure or some formatting issue. Post these to the samtools GitHub issues and they should help with the problems you are seeing.

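Edit: a note on quality scores. BAQ (base alignment quality) is computed by samtools mpileup, not by samtools sort, so it cannot explain the size difference here; if you go on to use mpileup, passing -B disables the BAQ calculation. I don't know if quality is something you want to look at downstream, but I have seen quality scores be an issue with long-read sequencing technology.

Dennis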
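
For scripted use, the record-count comparison from the two samtools view commands above could be wrapped roughly as follows. This is a sketch, assuming samtools is on the PATH; it uses `samtools view -c`, which prints the record count directly instead of piping through wc -l. The function name `same_record_count` is hypothetical.

```shell
# Hypothetical helper: compare the number of alignment records in two files.
# Assumes samtools is installed; `samtools view -c` prints only the count.
same_record_count() {
    before=$(samtools view -c "$1") || return 2   # samtools failed on file 1
    after=$(samtools view -c "$2") || return 2    # samtools failed on file 2
    echo "before=$before after=$after"
    # Exit status 0 if the counts match, 1 otherwise.
    [ "$before" -eq "$after" ]
}

# Usage:
#   same_record_count unsorted.bam sorted.bam && echo "no records lost"
```

Equal counts only show that no records were dropped; they say nothing about why the two files differ in size.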


Thank you. I ran the pipe with wc -l as you suggested. Both files report the same number of lines! I think that means the information may be the same?

It is very weird. Still, I don't think there is any problem with compression in my samtools installation, since I have run the tool on other files and they don't shrink the way this one does.


I also ran the gdb command you suggested. It didn't give any error message; it just ran as usual, printing what the program is doing in a verbose manner, and everything ran smoothly.
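
One further way to rule out truncation, sketched here under assumptions: per the SAM/BAM specification, a complete BAM file ends with a fixed 28-byte empty BGZF block (the EOF marker), and a file cut off during writing normally lacks it. The helper name `has_bgzf_eof` is made up for illustration; `samtools quickcheck` should cover a similar check where it is available.

```shell
# Hypothetical helper: succeed if the file ends with the 28-byte BGZF
# end-of-file marker that the SAM/BAM specification defines for BAM files.
has_bgzf_eof() {
    # The EOF marker as lowercase hex (28 bytes = 56 hex digits).
    expected="1f8b08040000000000ff0600424302001b0003000000000000000000"
    # Render the last 28 bytes of the file as one hex string.
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$actual" = "$expected" ]
}

# Usage:
#   has_bgzf_eof inputfile.sort.bam && echo "EOF marker present"
```

A present marker means the writer finished the file cleanly; it does not validate the records in between.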
