Samtools Indels -- Filtering Only Hits With Insertion In The Reference In The Middle Of The Sequence Hit
1
0
Entering edit mode
9.8 years ago

Can anyone suggest how to use samtools to filter only hits where there is a insertion in the reference that splits the sequence hit roughly by the middle?

My sequences are in the range of 100-1000bp and were aligned using "bwa bwasw -z 100".

I don't expect perfect hits, so mismatches and small indels can occur at both ends of the hit, but I am looking for insertions in the reference in the middle that are 10x+ bigger than any small indels at both ends.

I don't have paired ends.

samtools indel • 2.3k views
1
Entering edit mode

So you are assuming that only one read aligns around a given indel? If so, I think you might need to write some code. If, on the other hand, your depth is more than that, it might be useful to call the indels using something like Dindel rather than relying on ad hoc processing.

0
Entering edit mode

What kind of coverage are you talking about here? What data processing do you use to produce the alignments?

0
Entering edit mode

@Sean Davis: added comment - My sequences are in the range of 100-1000bp and were aligned using "bwa bwasw -z 100".

1
Entering edit mode
9.8 years ago

I think your best bet is to parse out the CIGAR string and make a decision on that. The definition of 'roughly by the middle' is fuzzy enough to make it unlikely that such functionality would be implemented by default, a possible python code:

import re

patt  = re.compile( '\d+M|\d+D|\d+I' )
cigar = "1I12M1D12M"
vals  =  patt.findall( cigar )

print vals

# decide here how roughly the middle is defined


prints:

['1I', '12M', '1D', '12M']

0
Entering edit mode

I changed the description a bit, I want to allow d+Ds at both ends, but the one in the middle had got to be 10x+ bigger than the ones at both ends.