Hi,
I have a set of BAM files (pair-end reads) from which I want to identify reads having more than one mutation. I already have the mutation calls for the same data. I am thinking of the following steps to achieve this.
- Identify pairs of "nearby" mutations (based on a maximum distance, say 400-500).
- Construct a BED file from above containing chromosome, start and end information, where 'start' will be the first and 'end' will be the second of above pair.
- Find intersecting regions between above BED and BAM file to filter putative reads.
- Query each read for presence of mutation pair
I have two questions. Firstly, I am wondering if I have to somehow take the strand information into account while finding intersected regions (I don't have the strand information in the variant-calling file)? Secondly, is there an easier alternative that I may use for this analysis? Apologies if some of the info is not-correct/too-basic as I am newbie to this kind of analysis.
Thanks!
Ikram
If you know a bit of python, its fairly simple with
pysam
which supports reading bam and vcf file.For each SNP ( as you already have the calls), find the overlapping reads ( with pileup function ) and check if the base on read matched with the ref/atl allele, and push the read names to a list/dictionary. Then check if you encounter the same read again for another SNP. There could be other optimised ways.
Note: Be careful that both the read and its mate has the same name.