Issue while sorting a file
4.3 years ago
ricfoz ▴ 80

Hello everyone,

I am working with some BAM files and trying to retrieve a FASTA sequence from them.

I have my workflow sorted out, but I have a question because the file sizes don't seem to match. My issue is as follows:

My original BAM file is 2 GB. I sort it so that I can retrieve a sequence with the "samtools view" tool, like this:

samtools sort inputfile.bam -o inputfile.sort.bam

The resulting file is 390 KB in size.

Is that normal? I have checked a region of the genome with "less", and it does contain information. Still, I am unsure whether the sorted file is truncated, since I think it should be about 2 GB like the original file.

Does anyone have an idea what may be happening, or whether my data can be relied on?

sort trouble inconsistency • 1.1k views

Hello ricfoz,

It is normal for the file to be smaller after sorting, as compression works better on sorted data. But the difference is too huge. Are you sure that your inputfile.bam is really a BAM file, or is it a SAM file? Try

$ head inputfile.bam

If you get something human-readable, it is a SAM file, which could explain why this file is much larger.

fin swimmer

4.3 years ago
drkennetz ▴ 560

No, this is not normal: your file has shrunk by a factor of roughly 5000 (2 GB to 390 KB), and samtools sort doesn't actually remove any reads. For a quick check, you can pipe samtools view into wc -l for both your presorted file and your sorted file:

samtools view unsorted.bam | wc -l
samtools view sorted.bam | wc -l

This will tell you the number of alignment lines in each of your BAM files. I would imagine the sorted one will be much smaller.

A way to track down your real issue would be to use Linux gdb to see if there is an actual issue with the file. You can do this by typing the following into the command line:

gdb samtools

Some stuff will print to the screen, like GNU gdb etc. You will then have a (gdb) prompt in place of your normal command-line prompt [user]$, after which you will enter:

(gdb) run sort input.bam -o input.sort.bam


It will most likely print some errors you don't understand that are related to a file compression failure or some formatting issue. Post the output to the samtools GitHub issues and they should help with the problems you are seeing.

Edit: you can also try disabling BAQ (base alignment quality) calculation with -B. As far as I know, that flag belongs to samtools mpileup rather than samtools sort, so it only applies if you run mpileup downstream. I don't know if quality is something you want to look at downstream, but I have seen quality scores be an issue for long-read sequencing technology.

Dennis
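
The wc -l quick check suggested above can be wrapped in a small shell function so the comparison is done for you. This is only a sketch: it assumes samtools is on your PATH, it uses samtools view -c (which counts alignment records and should give the same number as piping into wc -l), and the function name same_record_count and the file names are made up for illustration.

```shell
# Compare the number of alignment records in two BAM files.
# Assumption: samtools is installed; the function name is illustrative only.
same_record_count() {
    # "samtools view -c" prints the number of alignment records (no header lines).
    unsorted_count=$(samtools view -c "$1") || return 2
    sorted_count=$(samtools view -c "$2") || return 2
    echo "unsorted: $unsorted_count  sorted: $sorted_count"
    # Succeeds only when both files hold the same number of records.
    [ "$unsorted_count" -eq "$sorted_count" ]
}

# Example call (hypothetical file names):
# same_record_count inputfile.bam inputfile.sort.bam && echo "no records lost"
```

If the counts match, sorting did not drop any reads, and the size difference has to come from something else, such as how the input was compressed.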


Thank you. I ran the pipe with wc -l as you suggested. The number of lines it reports is the same for both files! I think that means the information may be the same?

It is very weird. Still, I don't think there is any trouble with compression in my samtools, since I have run the tool on other files and they don't shrink the way this one does.
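
Since the read counts match, one thing worth ruling out is the earlier suggestion that inputfile.bam is not really compressed. BAM files are BGZF-compressed and therefore start with the gzip magic bytes 1f 8b, while a plain-text SAM file does not. A minimal sketch of that check, where the function name looks_compressed is made up for illustration:

```shell
# Report whether a file starts with the gzip magic bytes (1f 8b).
# A real BAM file is BGZF-compressed, so it starts with them; plain-text SAM does not.
# Note: this does not detect a BAM written with compression level 0.
looks_compressed() {
    # Read the first two bytes and print them as hex without offsets or spaces.
    magic=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Example call (hypothetical file name):
# looks_compressed inputfile.bam && echo "gzip/BGZF header found" || echo "not compressed"
```

If the original turns out to be plain text, a sorted and properly compressed output would be expected to be much smaller, although that alone may not account for the whole difference.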


I also ran the gdb command you suggested. It didn't give any error message; it just worked as usual, printing what the program was doing in a verbose manner, and everything ran smoothly.
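
For the original worry about truncation, a complete BAM file normally ends with a fixed 28-byte BGZF end-of-file block, so its absence is a strong hint that the file was cut short. samtools has a built-in check for this (samtools quickcheck), and the sketch below does the same comparison by hand. The function name has_bgzf_eof is made up for illustration, and the marker bytes are the ones given in the SAM/BAM format specification.

```shell
# Check whether a file ends with the 28-byte BGZF end-of-file marker.
# A BAM file written completely normally ends with this empty block.
has_bgzf_eof() {
    # Marker bytes as hex, grouped for readability; spaces are stripped below.
    expected=$(printf '%s' "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000" | tr -d ' ')
    # Last 28 bytes of the file as one hex string (-v stops od collapsing repeated lines).
    actual=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$actual" = "$expected" ]
}

# Example call (hypothetical file name):
# has_bgzf_eof inputfile.sort.bam && echo "EOF marker present" || echo "possibly truncated"
```

A present marker together with matching read counts would suggest the sorted file is complete.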