Question

Identifying mismatches between read mates at position in genome

0

Entering edit mode

4.0 years ago

wpearman1996 • 0

I'm hoping to identify reads pairs that have a mismatch between each read in the pair at a particular position in my genome. I have a bam file of PE reads mapped to my genome. It was a pretty shoddy library prep (supposed to be 150bp PE with insert size of 150bp), so there's massive variation in insert size, and many of mates overlap with each other.

I'm hoping to make the best of a bad situation and try to use this data to identify mismatches between each read in a pair at a specific position in my genome. The problem is very much that I'm looking at mismatches between mates of a pair rather than mismatches to the genome.

My thinking was as follows:

1) Identify read pairs that map correctly

2) Of these read pairs from 1, find pairs for which both reads map to the same position.

3) Count mismatches between each pair at this position.

I've made a quick drawing which may help enter image description here

In these three pairs, there is one mismatch at the position, which is between mates Pair2. I'm hoping to count the number of mismatches at a position for as many pairs as I can. So I would have a mismatch count of 1.

Any suggestions on how I can do this?

alignment sequencing • 1.3k views

ADD COMMENT • link updated 4.0 years ago by i.sudbery 19k • written 4.0 years ago by wpearman1996 • 0

0

Entering edit mode

Can you say why you want to do that? You are essentially (and quite artifically I think) looking for sequencing errors.

ADD REPLY • link 4.0 years ago by ATpoint 82k

0

Entering edit mode

I have a mitochondrial genome that has two chromosomes, one of which is a dimer consisting of two 'typical' mitochondrial genomes fused together as inverted repeats (i.e 2 14kb mitochondrial genomes fused together into a 28kb circular chromosome), while the other chromosome is just the linearized 14kb aspect of the circular chromosome. It's possible that the linear chromosome is the result of self-renaturation of the circular chromosome during replication.

Because there are three differences between sides of the circular chromosome, if the linear chromosome is the result of self-renaturation then at these sites of difference, we would expect to see mismatches in pairs originating from the linear chromosome. If this were the case, then I'd expect to see the frequency of these mismatches to be much higher than the frequency of sequencing errors. Or at least, that's my thinking at the moment.

The following paper (https://www.ncbi.nlm.nih.gov/pubmed/28679546 ) did something similar but was able to leverage PacBio hairpins to be able to distinguish between chromosomes.

ADD REPLY • link 4.0 years ago by wpearman1996 • 0

score 0 · Answer 1 · 2020-04-14

Pretty sure you could do this with pysam if you are up for some python scripting.

The a pysam.AlignedSegment reprsents a read, and the method get_aligned_pairs returns a tuple of (read_position, genome_position) if you use get_aligned_pairs(with_seq=True) then the reference sequence is also returned. Importantly, it is returned lower case if there is a mismatch.

You can get the mate of a read with pysam.AlignmentFile.mate(read)

You can now build up some sort of dictionary of the base for a read at each genome location, fetch the mate and then test if the sequence is the same in the mate.

Something like:

mismatch_count = 0
bamfile = pysam.AlignmentFile("mybam.bam")

# multiple_iterators=True is neccesary because bamfile.mate resets the file pointer. 
for read in bamfile.fetch(chr="chrMT", start = something, end = somthingelse, multiple_iterators=True):

    # always compare read1 to read2 rather than vice-versa to prevent double counting.
    if read.is_read2:
        continue

    read1_bases = {ref_pos: ref_base 
                   for _, ref_pos, ref_base in read.get_aligned_pairs(with_seq=True)}

    mate = bamfile.mate(read)
    mismatches = [ref_pos in read1_bases and read1_bases[ref_pos] != ref_base
                  for _, ref_pos, ref_base in mate.get_aligned_pairs(with_seq=True)]

    mismatch_count += sum(mismatches)