Question

Multiple iterations of sorted bam output files with picard SortSam

4 weeks ago

shpak.max ▴ 50

I'm running the standard Picard cleaning/sorting tools on a set of BAM files, and one of the output files is suspiciously small compared with its input. The command is:

SortSam.jar SORT_ORDER=coordinate INPUT=myfile.bam OUTPUT=myfile_sort.bam

I re-ran just the SortSam step of my pipeline to see why this output file was so small compared with the 20 other BAM files. I noticed that SortSam would read the input for some time and then print the console message

INFO    2024-04-02 16:01:07     SortSam Finished reading inputs, merging and writing to  output now.

At that point, myfile_sort.bam would be generated. Immediately afterwards, however, SortSam would start reading again, and myfile_sort.bam would drop back to approximately 0 MB in size. It would generate these "temporary" myfile_sort.bam files for 2 or 3 iterations before finally terminating and producing a final version.

My question is this: in each case, the BAM files were created by merging files from different runs of the same sample. Could it be that, for some strange reason (e.g. anomalous headers), SortSam is treating this particular BAM file as three separate entities and overwriting the output each time? Or does SortSam always write temporary sorted files during preliminary sorting? Because of the long run time, I didn't check whether the same repeated iterations occurred for the sorted BAM files whose sizes more closely matched their inputs.
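One way to check the "anomalous headers" hypothesis is to look at the header of the merged file directly. The following is a minimal sketch, assuming the header text has already been dumped (for example with `samtools view -H myfile.bam`); the function name `summarize_sam_header` and the example header are hypothetical and only illustrate the idea.

```python
from collections import Counter


def summarize_sam_header(header_text):
    """Count SAM header record types and collect sort orders and read-group IDs."""
    counts = Counter()
    sort_orders = []
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@"):
            continue  # not a header line
        fields = line.split("\t")
        record_type = fields[0]  # e.g. @HD, @SQ, @RG, @PG, @CO
        counts[record_type] += 1
        # Header fields have the form TAG:VALUE.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if record_type == "@HD" and "SO" in tags:
            sort_orders.append(tags["SO"])
        elif record_type == "@RG" and "ID" in tags:
            read_groups.append(tags["ID"])
    return {
        "counts": dict(counts),
        "sort_orders": sort_orders,
        "read_groups": read_groups,
    }


# Hypothetical header of a BAM merged from two runs of one sample.
example_header = (
    "@HD\tVN:1.4\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@RG\tID:run1\tSM:sample1\n"
    "@RG\tID:run2\tSM:sample1\n"
)
summary = summarize_sam_header(example_header)
print(summary)
```

More than one @HD line, or duplicated read-group IDs, would be worth a closer look; several distinct @RG lines are expected for a file merged from different runs.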

picard • 280 views

score 0 · Answer 1 · 2024-04-02

4 weeks ago

Pierre Lindenbaum 161k

What you are seeing are the files generated by the EXTERNAL SORT algorithm (see Wikipedia).
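For readers unfamiliar with the idea, here is a toy sketch of an external merge sort. It is not Picard's implementation: plain integers stand in for read coordinates, and the names `external_sort` and `max_in_ram` are made up for the example. It only shows the two phases that match the log message above: sorted chunks are spilled to temporary files while reading, then merged into the final output.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_ram, tmp_dir=None):
    """Sort integers while holding at most `max_in_ram` of them in memory during reading."""
    chunk_paths = []

    def spill(chunk):
        # Sort the in-memory chunk and write it to its own temporary file.
        fd, path = tempfile.mkstemp(suffix=".chunk", dir=tmp_dir)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(f"{v}\n" for v in sorted(chunk))
        chunk_paths.append(path)

    # Phase 1: read the input, spilling a sorted chunk whenever memory is "full".
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) >= max_in_ram:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    # Phase 2: "finished reading inputs, merging and writing to output now".
    handles = [open(path) for path in chunk_paths]
    try:
        streams = [(int(line) for line in handle) for handle in handles]
        merged = list(heapq.merge(*streams))  # k-way merge of the sorted chunks
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)  # temporary chunks are deleted after the merge
    return merged


result = external_sort([5, 3, 9, 1, 7, 2, 8], max_in_ram=3)
print(result)
```

With `max_in_ram=3`, the seven input values are written as three temporary chunk files before the merge produces the single sorted result.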

P.S.: your Picard version is very old, and you should just use samtools.

Thanks - evidently the temporary files are given the same name as the terminal file.

Unfortunately, I'm more or less stuck using outdated software packages and versions for the time being. The lab I'm working in sequenced a large number of genomes nearly a decade ago, and to avoid artifactual differences introduced by different mapping/genotyping tools when comparing across genomes, I'm using the decade-old pipeline for consistency. That is why I post questions about bwa aln, stampy, older versions of Picard and GATK, etc.

ADD REPLY • link 4 weeks ago by shpak.max ▴ 50

Are you running the 20 sorts simultaneously? You could try assigning a separate --tmp-dir to each job and see if that mitigates the issue that "temporary files are given the same name as the terminal file."
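A sketch of that suggestion, under some assumptions: with the old KEY=VALUE Picard syntax used in the question, the temporary directory is, to my knowledge, set with TMP_DIR= rather than --tmp-dir. The `java -jar` invocation, the helper names, and the output naming below are illustrative guesses, not taken from the original pipeline. The code only builds the command lines, so it can be inspected before anything is run.

```python
import os
import tempfile


def build_sortsam_command(input_bam, output_bam, tmp_dir, jar="SortSam.jar"):
    """Build one SortSam command line (assumed old-style KEY=VALUE arguments)."""
    return [
        "java", "-jar", jar,  # assumption: the jar is launched via java -jar
        "SORT_ORDER=coordinate",
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"TMP_DIR={tmp_dir}",  # a private scratch directory for this job
    ]


def plan_jobs(bam_files, tmp_root):
    """Create a separate temporary directory per BAM and return the commands."""
    jobs = []
    for bam in bam_files:
        stem = os.path.splitext(os.path.basename(bam))[0]
        tmp_dir = tempfile.mkdtemp(prefix=f"{stem}_", dir=tmp_root)
        jobs.append(build_sortsam_command(bam, f"{stem}_sort.bam", tmp_dir))
    return jobs


# Hypothetical input names; the scratch root is a throwaway directory here.
scratch_root = tempfile.mkdtemp(prefix="sortsam_tmp_")
jobs = plan_jobs(["sample_a.bam", "sample_b.bam"], scratch_root)
for command in jobs:
    print(" ".join(command))
```

Each command could then be launched with `subprocess.run(command, check=True)`; since every job has its own scratch directory, concurrent sorts cannot clash over temporary file names.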

ADD REPLY • link 4 weeks ago by GenoMax 142k