Question: Samtools: How To Find Tandem Hits Of Same Query Sequence In An Indexed Bam/Sam File?
4
gravatar for Ahdf-Lell-Kocks
7.3 years ago by
Ahdf-Lell-Kocks1.6k
Ahdf-Lell-Kocks1.6k wrote:

I would like to know how can I find tandem hits of the same query sequence in an indexed bam/sam file?

By this I mean hits of the same sequence in the same location, one hit next to each other, non-overlapping, given a maximum distance D between them, with no other hits from other sequences in the BAM file in between. I assume that given this would be queried in an indexed BAM files, these should be two consecutive entries. As a graphical example below, I am interested in finding the two consecutive hits of sequence 'aaaaaaaaaa' in the target genome:

----------------------------------------------------------------------------
    xxxxxxxxx  aaaaaaaaaa       aaaaaaaaaa       bbbbbbbbb    ccccccccc
  ddddddddd                                       eeeeeeeeeeeee
bam samtools • 1.6k views
ADD COMMENTlink modified 7.3 years ago by Fred700 • written 7.3 years ago by Ahdf-Lell-Kocks1.6k
1

Are you really interested in finding repeated hits, or is there a biological question that you are trying to answer?

ADD REPLYlink written 7.3 years ago by Sean Davis25k
1

how about dummying up a sam file which has some such hits in it?

ADD REPLYlink written 7.3 years ago by Malcolm.Cook1.0k

It's not clear to me. Can you give a graphical example please ?

ADD REPLYlink written 7.3 years ago by Pierre Lindenbaum119k

It sounds like: find all non-overlapping intervals in the BAM which are not more distant than D from each other. You are correct that these should be adjacent in a sorted file, but you have to account for overlaps if these can exist.

ADD REPLYlink written 7.3 years ago by Michael Dondrup46k

I still don't get "hits of the same sequence in the same location". What is the "sequence" ? what is the "hit" ?. Can your replace those words with "Reference Sequence" and "Short Read" please.

ADD REPLYlink written 7.3 years ago by Pierre Lindenbaum119k

Pierre, agreed more detail would be valuable. My "solution" below assumes "sequence" is that of the "Short Read" and "hit" is the "alignment" (being one row in the sam file). Hoping I'm right....

ADD REPLYlink written 7.3 years ago by Malcolm.Cook1.0k
1
gravatar for Fred
7.3 years ago by
Fred700
Paris, France
Fred700 wrote:

I'm not sure to understand exactly what you want to do, but the following script may help.

It takes a sorted input sam file and outputs the lines when the consecutive ones corresponds to identical sequences with different coordinates.

#!/usr/bin/perl -w
use strict;
my $prevChr   = -1;
my $prevStart = -1;
my $prevSequence = -1;
my $prevLine = -1;

while(<>){
    if(!/^@/){
        chomp;
        my @tab = split(/\t/);
        my ($chr,$start,$sequence) = @tab[2,3,9];
        if($prevStart ne -1 && $prevChr ne -1 && 
           $prevSequence eq $sequence && 
           $prevChr eq $chr && $start != $prevStart){
            print $prevLine."\n";
            print $_."\n";
        }

        $prevChr   = $chr;
        $prevStart = $start;
        $prevSequence = $sequence;
        $prevLine = $_;
    }
}

You just have to do something like that:

$ samtools view sorted_file.bam | script.pl
ADD COMMENTlink modified 7.3 years ago • written 7.3 years ago by Fred700
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1643 users visited in the last hour