I'm hoping to identify read pairs in which the two mates disagree with each other at a particular position in my genome. I have a BAM file of PE reads mapped to my genome. It was a pretty shoddy library prep (supposed to be 150bp PE with an insert size of 150bp), so there's massive variation in insert size, and many of the mates overlap with each other.
I'm hoping to make the best of a bad situation and use this data to identify mismatches between the two reads of a pair at a specific position in my genome. The key point is that I'm looking for mismatches between the mates of a pair rather than mismatches to the genome.
My thinking was as follows:
1) Identify read pairs that map correctly
2) Of the read pairs from 1, find pairs for which both mates cover the position of interest.
3) Count mismatches between the mates of each pair at this position.
I've made a quick drawing which may help.
In these three pairs, there is one mismatch at the position, which is between the mates of Pair2. I'm hoping to count the number of mismatches at a position for as many pairs as I can. So here I would have a mismatch count of 1.
Any suggestions on how I can do this?
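The three steps above could be sketched roughly as follows. This is only a minimal sketch, assuming pysam is installed and the BAM file is coordinate-sorted and indexed. The function names (`base_at`, `count_mate_mismatches`, `count_from_bam`) are made up for illustration, and positions are 0-based as in pysam. Base quality is ignored here, so low-quality calls are counted like any other.

```python
from collections import defaultdict


def base_at(read, ref_pos):
    """Return the read base aligned to 0-based ref_pos, or None if the read
    has no aligned base there (deletion, soft clip, or not covered)."""
    for query_pos, r_pos in read.get_aligned_pairs(matches_only=True):
        if r_pos == ref_pos:
            return read.query_sequence[query_pos]
    return None


def count_mate_mismatches(reads, ref_pos):
    """Return (pairs with both mates covering ref_pos, pairs whose mates disagree)."""
    bases = defaultdict(dict)  # query_name -> {"R1": base, "R2": base}
    for read in reads:
        # Step 1: keep only primary alignments from properly paired reads.
        if not read.is_proper_pair or read.is_secondary or read.is_supplementary:
            continue
        base = base_at(read, ref_pos)
        if base is None:
            continue
        bases[read.query_name]["R1" if read.is_read1 else "R2"] = base
    # Step 2: keep pairs where both mates have a base at the position.
    both = [b for b in bases.values() if len(b) == 2]
    # Step 3: count pairs whose two mates show different bases.
    mismatches = sum(1 for b in both if b["R1"] != b["R2"])
    return len(both), mismatches


def count_from_bam(bam_path, contig, ref_pos):
    """Convenience wrapper; bam_path and contig are whatever your data uses."""
    import pysam  # third-party; imported here so the functions above work without it

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        # fetch() yields every read overlapping the position (needs a .bai index).
        return count_mate_mismatches(bam.fetch(contig, ref_pos, ref_pos + 1), ref_pos)
```

Calling `count_from_bam` with the BAM path, the contig name, and the 0-based position would return the number of usable pairs and the number of discordant pairs. For the three pairs in the drawing that would be 3 usable pairs and 1 mismatch.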
Can you say why you want to do that? You are essentially (and quite artificially, I think) looking for sequencing errors.
I have a mitochondrial genome that has two chromosomes. One is a dimer consisting of two 'typical' mitochondrial genomes fused together as inverted repeats (i.e. two 14kb mitochondrial genomes fused together into a 28kb circular chromosome), while the other chromosome is just the linearized 14kb aspect of the circular chromosome. It's possible that the linear chromosome is the result of self-renaturation of the circular chromosome during replication.
Because there are three differences between the two sides of the circular chromosome, if the linear chromosome is the result of self-renaturation, then at these sites of difference we would expect to see mismatches in pairs originating from the linear chromosome. If this were the case, I'd expect the frequency of these mismatches to be much higher than the frequency of sequencing errors. Or at least, that's my thinking at the moment.
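One rough way to check whether a mismatch count is higher than sequencing error alone would explain is a one-sided binomial tail probability. The sketch below assumes a per-pair discordance rate from sequencing error that I would have to estimate separately (for example from positions where the two sides of the circular chromosome do not differ). All the numbers in it are hypothetical placeholders, not results.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers for illustration only.
n_pairs = 200       # pairs with both mates covering the site
n_mismatch = 12     # pairs whose mates disagree at the site
error_rate = 0.01   # assumed per-pair discordance rate from sequencing error

p_value = binom_tail(n_mismatch, n_pairs, error_rate)
print(f"P(>= {n_mismatch} discordant pairs by error alone) = {p_value:.3g}")
```

A very small probability would suggest the discordant pairs are not just sequencing error, though it says nothing by itself about which chromosome the pairs came from.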
The following paper (https://www.ncbi.nlm.nih.gov/pubmed/28679546) did something similar, but was able to leverage PacBio hairpins to distinguish between chromosomes.