Question

Filter overlapping reads in pysam

1

Entering edit mode

9.9 years ago

Ron ★ 1.2k

Hi everyone,

I want to count the reads per allele using pysam.The script I have pasted here is only for a variant,but I am doing for many variants in the main script.I want to filter out the reads that overlap,so that I am not double counting them.Basically I want to check the difference in the count of reads per allele with vs without overlap.This can not be done using pysam functionality,so I am trying to do something without it.Below is my script:

#! /usr/bin/python
import sys
import random
import pysam
samfile = pysam.Samfile("f1.bam", "rb")
out = open('outfile', 'w')
f= open("f2.vcf", "r")
ref_counter=0
alt_counter=0
ref='G'
alt='A'
total=0
for pileupcolumn in samfile.pileup( 'chr1',881626,881628):
    if int(pileupcolumn.pos) == 881626:
        if int(pileupcolumn.n) <= 300:
            for pileupread in pileupcolumn.pileups:
                if str(pileupread.alignment.seq[pileupread.qpos]) == str(ref):
                    ref_counter = float(ref_counter) + 1
                elif str(pileupread.alignment.seq[pileupread.qpos]) == str(alt):
                    alt_counter = float(alt_counter) + 1
            total=ref_counter+alt_counter 
            print(str(ref) + '\t' + str(ref_counter) +'\t' + str(alt) + '\t' + str(alt_counter) + '\t' + str(total) + '\n')

Any suggestions?

bam reads pysam • 5.7k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by Ron ★ 1.2k

0

Entering edit mode

I reformatted your post so that it should now display correctly. While I think the source code is properly formatted, you might want to double check.

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2014-05-30

3

Entering edit mode

9.9 years ago

Devon Ryan 104k

I assume you have paired-end reads and don't want to double count things if both read of a pair cover some position of interest. The simplest method would be to just add read names to a list as you process them. You can then check that before processing and skip an alignment if the read name is already in there.

for pileupcolumn in samfile.pileup('chr1',881626,881628):
    found = []
    ...
    for pileupread in pileupcolumn.pileups:
        try:
            found.index(pileupread.alignment.qname)
        except ValueError :
            if str(pileupread.alignment.seq[pileupread.qpos]) == str(ref):
                ...
            found.append(pileupread.alignment.qname)
        ...

That's the general idea at least.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

Hey Devon,

thanks for your response.Can you please add explanation,as I am a bit new to Python especially dictionaries.I have the idea that creating a hash/dictionary for the reads and then checking.Also,do mate pairs have same read names?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Ron ★ 1.2k

1

Entering edit mode

Sure, the general idea is to create a list of read names and check if you've already seen the current one before processing its alignment.

found = [] just creates an empty list. found.index(pileupread.alignment.qname) then tries to get the index (position number) of that name in the found list. While you don't actually care about the position in the list, this is convenient because if the read name isn't in the list, then the index method will fail with a ValueError. This is then caught with the try...except structure and things are processed identically to your original code. found.append(pileupread.alignment.qname) then adds the read's name to the list if it's not already there. Note that the list would be reinitialized if you were iterating over more than one position, which is what you want.

Regarding read names, yes, they're normally identical within a pair. In fact, a number of tools will break if this isn't the case. Note that I wrote "normally" before, since it's possible (though pretty unlikely these days) that they're different. At least if the raw reads follow one of the standard formats, then the aligner is likely to make the names identical.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

I have the code. I am also checking the mapping quality and base quality,so was wondering if adding the read name to the list should be done prior to checking the mapping quality and base quality or after?

if int(pileupread.alignment.mapq)>=int(mapqlim):
    if int(ord(pileupread.alignment.qual[pileupread.qpos])-33)>=int(baseqlim):
        if str(pileupread.alignment.seq[pileupread.qpos]) == str(ref):
            ref_counter = float(ref_counter) + 1
            found.append(pileupread.alignment.qname)
.....................

Currently, I am adding read name to the list after checking the mapping quality and base quality.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Ron ★ 1.2k

1

Entering edit mode

Doing it after makes the most sense. If you filter out read #1 of a pair at a position but read #2 has sufficient quality there then it makes sense to keep it. Adding names to the list before filtering would just ignore good data, so stick with your current implementation.

ADD REPLY • link 9.9 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks Devon for a succinct answer!

ADD REPLY • link 9.9 years ago by Ron ★ 1.2k

0

Entering edit mode

Hi Devon,

Just a quick query.I want to filter the reads with gaps.for this I have added this if statement from this link in the code before filtering for mapping quality.

http://pysam.readthedocs.org/en/latest/faq.html

if (pileupread.indel == 0 and pileupread.is_del == 0):

Just wanted to make sure if this is the right way to do this?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Ron ★ 1.2k