Multiple iterations of sorted bam output files with picard SortSam
4 weeks ago
shpak.max ▴ 50

I'm running the standard cleaning/sorting picard functions on a set of bam files, and one of the output files is suspiciously small in comparison to its input. The command I'm running is:

SortSam.jar SORT_ORDER=coordinate INPUT=myfile.bam OUTPUT=myfile_sort.bam

I re-ran just the SortSam step of my pipeline to see why I was getting such a small file in comparison to the 20 other bam files. One thing I noticed is that SortSam's read phase would run for some time and then print the console message

INFO    2024-04-02 16:01:07     SortSam Finished reading inputs, merging and writing to  output now.

At that point, myfile_sort.bam would be generated. Immediately afterwards, however, SortSam's read phase would start running again, and myfile_sort.bam would drop back to approximately 0 MB in size. It would continue to generate these "temporary" myfile_sort.bam files for 2 or 3 iterations before finally terminating and producing a final version.

My question is this: in each case, the bam files were created by merging files from different runs of the same sample. Could it be that for some strange reason (anomalous headers), SortSam is treating this particular bam file as three separate entities and overwriting each time, or does SortSam always generate temporary sorted files with preliminary sorting? Due to the long run time, I didn't check whether there were multiple iterations of this kind for the sorted bam files whose sizes more closely matched the inputs.

picard • 280 views
4 weeks ago

What you are looking at are the files generated by the external sort algorithm (see Wikipedia).

P.S. : your picard is very old and you should just use samtools.
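To make the external sort idea concrete, here is a minimal sketch in Python. It is a generic illustration of the technique, not Picard's actual implementation: integers stand in for reads, and the names `external_sort` and `max_in_ram` are made up for this example. Records are read in bounded-size chunks, each chunk is sorted in memory and spilled to a temporary file, and the spilled chunks are then merged into the final sorted output.

```python
import heapq
import os
import tempfile


def external_sort(records, max_in_ram, tmp_dir=None):
    """Sort integers while holding at most max_in_ram of them in memory.

    Returns the sorted values and the number of temporary chunk files used.
    (Hypothetical helper for illustration; not part of Picard.)
    """
    chunk_paths = []
    chunk = []

    def spill():
        # Sort the in-memory chunk and write it to its own temporary file.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".chunk", dir=tmp_dir)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{value}\n" for value in chunk)
        chunk_paths.append(path)
        chunk.clear()

    # Phase 1: read the input, spilling a sorted chunk whenever memory is "full".
    for value in records:
        chunk.append(value)
        if len(chunk) >= max_in_ram:
            spill()
    if chunk:
        spill()

    # Phase 2: k-way merge of the sorted chunk files.
    handles = [open(path) for path in chunk_paths]
    try:
        streams = [(int(line) for line in handle) for handle in handles]
        # Collected into a list only to keep the demo short; a real tool
        # would stream the merged records straight to the output file.
        merged = list(heapq.merge(*streams))
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)
    return merged, len(chunk_paths)
```

For example, `external_sort([9, 3, 7, 1, 8, 2, 6, 5, 4, 0], max_in_ram=4)` spills three chunk files before merging them. The point is that the intermediate files are a normal by-product of sorting input that does not fit in memory.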


Thanks - evidently the temporary files are given the same name as the terminal file.

Unfortunately, I'm more or less stuck using outdated software packages and versions for the time being. The lab I'm working in generated a large number of sequenced genomes nearly a decade ago, and in order to avoid artifactual differences introduced by different mapping/genotyping tools when comparing across genomes, I'm using the decade-old pipeline for consistency. That is why I post questions about bwa aln, stampy, older versions of picard and GATK, etc.


Are you running the 20 sorts simultaneously? You could try assigning a separate temporary directory to each job (--tmp-dir, or TMP_DIR=... in the old key=value picard syntax) and see if that mitigates the issue of "temporary files are given the same name as the terminal file."
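A rough sketch of that suggestion in Python, assuming the legacy key=value Picard syntax from the question, where the temporary-directory option is, as far as I know, spelled `TMP_DIR=`. The helper name `build_sortsam_command`, the jar location, and the scratch root are all hypothetical; it only builds the command line and creates a unique scratch directory per job, it does not run anything.

```python
import os
import tempfile


def build_sortsam_command(input_bam, output_bam, tmp_root):
    """Build a SortSam command that uses its own scratch directory.

    Hypothetical helper: the jar path and directory layout are assumptions.
    """
    # mkdtemp creates a uniquely named directory, so concurrent jobs
    # never share a location for their intermediate sort files.
    job_tmp = tempfile.mkdtemp(prefix="sortsam_", dir=tmp_root)
    command = [
        "java", "-jar", "SortSam.jar",  # assumed jar location; adjust as needed
        "SORT_ORDER=coordinate",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"TMP_DIR={job_tmp}",           # per-job temporary directory
    ]
    return command, job_tmp
```

The returned list can be passed to `subprocess.run(command, check=True)`, and the scratch directory removed once the job has finished.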
