Question: samtools sort by something other than position/name?
1
gravatar for anderspitman
9 months ago by
anderspitman40
United States
anderspitman40 wrote:

I'm trying to automatically compare BAM files being output by bowtie2 (for a continuous integration system). Currently using bamUtil diff which works well, except I'm noticing sometimes 2 lines in the BAM file will have the same name/start position, but represent different reads. For whatever reason, bowtie2 sometimes reverses the order of the lines, so bamUtil's algorithm calls a mismatch. I see several potential solutions to this problem:

  1. Get samtools to sort by another dimension in addition to position/name. Pretty much anything should do. Can this be done? I don't see any options in the man page.

  2. Use another comparision tool than bamUtil which doesn't have this problem. Honestly a stand-alone program for diffing would be best for me anyway and I'm considering writing one.

  3. Find some way to get bowtie2 to be more deterministic in its ordering. Any ideas here?

samtools • 520 views
ADD COMMENTlink modified 8 months ago • written 9 months ago by anderspitman40

For whatever reason, bowtie2 sometimes reverses the order of the lines, so bamUtil's algorithm calls a mismatch.

show us an example please.

ADD REPLYlink written 9 months ago by Pierre Lindenbaum112k

I confirmed that bamHash shows the files as matching. I think my solution for now is to just use that. I'm still a little concerned that I don't know which stage of my pipeline is interleaving the reads. From what I've read I should be accounting for all sources of non-determinism (using single core, etc). I might dig deeper into this at some point but after reading up on bamHash and testing it out I think that's what I need. Thanks for the help. If you move your comment suggesting bamHash to an answer I'll accept it.

ADD REPLYlink written 9 months ago by anderspitman40

I've moved my comment to an answer. Please double check, though, that BamHash is really doing the comparison you want and that you don't care about mapping positions being the same (if you just want to ensure the same entries are contained in both files, regardless of whether/how they're mapped then you're good to go with BamHash).

ADD REPLYlink written 9 months ago by Devon Ryan85k

Now that I understand the problem better, you may actually be right about mappings being important for my application. Read names/sequences might be enough of a guarantee for us, but I'm waiting to hear back on that.

ADD REPLYlink written 8 months ago by anderspitman40
2
gravatar for Devon Ryan
9 months ago by
Devon Ryan85k
Freiburg, Germany
Devon Ryan85k wrote:

Exactly which parts of the alignments do you want to be diffed? If you just care about name/sequence then you can use bamHash. If you want more than that then you'll need a custom solution. There are no options to have samtools sort in a 3rd way (no one has ever needed that).

ADD COMMENTlink written 9 months ago by Devon Ryan85k
1
gravatar for anderspitman
8 months ago by
anderspitman40
United States
anderspitman40 wrote:

I kinda feel like an idiot right now. Upon further reflection on my task, I realized that in this particular case there's no reason to limit myself to samtools' method of sorting. I can simply convert my two BAM files to SAM (without the headers), then feed them to GNU coreutils sort, then diff them normally. Seems to be working. Here's a bash script that does it:

#!/bin/bash
# bamdiff.sh    

f1=${1}
f2=${2}

f1TempHeaderlessSAM=.bamdiff_f1TempHeaderlessSAM
f2TempHeaderlessSAM=.bamdiff_f2TempHeaderlessSAM

samtools view ${f1} | sort > ${f1TempHeaderlessSAM}
samtools view ${f2} | sort > ${f2TempHeaderlessSAM}

diff ${f1TempHeaderlessSAM} ${f2TempHeaderlessSAM}

diffExitCode=$?

rm ${f1TempHeaderlessSAM}
rm ${f2TempHeaderlessSAM}

exit ${diffExitCode}
ADD COMMENTlink written 8 months ago by anderspitman40
0
gravatar for Pierre Lindenbaum
9 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

Get samtools to sort by another dimension in addition to position/name.

easy in java with htsjdk, I've written a program sorting on REF/Name : http://lindenb.github.io/jvarkit/SortSamRefName.html https://github.com/lindenb/jvarkit/blob/master/src/main/java/com/github/lindenb/jvarkit/tools/misc/SortSamRefName.java

Use another comparision tool than bamUtil which doesn't have this problem

search biostars : https://www.google.fr/search?q=compare+bam+site%3Abiostars.org

ADD COMMENTlink written 9 months ago by Pierre Lindenbaum112k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2005 users visited in the last hour