Question: Filter overlapping reads in pysam
1
gravatar for Ron
5.0 years ago by
Ron930
United States
Ron930 wrote:

Hi everyone,

I want to count the reads per allele using pysam.The script I have pasted here is only for a variant,but I am doing for many variants in the main script.I want to filter out the reads that overlap,so that I am not double counting them.Basically I want to check the difference in the count of reads per allele with vs without overlap.This can not be done using pysam functionality,so I am trying to do something without it.Below is my script:

 

#! /usr/bin/python
import sys
import random
import pysam
samfile = pysam.Samfile("f1.bam", "rb")
out = open('outfile', 'w')
f= open("f2.vcf", "r")
ref_counter=0
alt_counter=0
ref='G'
alt='A'
total=0
for pileupcolumn in samfile.pileup( 'chr1',881626,881628):
    if int(pileupcolumn.pos) == 881626:
        if int(pileupcolumn.n) <= 300:
            for pileupread in pileupcolumn.pileups:
                if str(pileupread.alignment.seq[pileupread.qpos]) == str(ref):
                    ref_counter = float(ref_counter) + 1
                elif str(pileupread.alignment.seq[pileupread.qpos]) == str(alt):
                    alt_counter = float(alt_counter) + 1
            total=ref_counter+alt_counter 
            print(str(ref) + '\t' + str(ref_counter) +'\t' + str(alt) + '\t' + str(alt_counter) + '\t' + str(total) + '\n')

 

Any suggestions?

pysam bam reads • 3.3k views
ADD COMMENTlink modified 5.0 years ago by Devon Ryan90k • written 5.0 years ago by Ron930

I reformatted your post so that it should now display correctly. While I think the source code is properly formatted, you might want to double check.

ADD REPLYlink written 5.0 years ago by Devon Ryan90k
3
gravatar for Devon Ryan
5.0 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

I assume you have paired-end reads and don't want to double count things if both read of a pair cover some position of interest. The simplest method would be to just add read names to a list as you process them. You can then check that before processing and skip an alignment if the read name is already in there.

 

for pileupcolumn in samfile.pileup( 'chr1',881626,881628):
    found = []
    ...
    for pileupread in pileupcolumn.pileups:
        try:
            found.index(pileupread.alignment.qname)
        except ValueError :
            if str(pileupread.alignment.seq[pileupread.qpos]) == str(ref):
                ...
            found.append(pileupread.alignment.qname)
        ...

That's the general idea at least.

  

ADD COMMENTlink written 5.0 years ago by Devon Ryan90k

Hey Devon,

thanks for your response.Can you please add explanation,as I am a bit new to Python especially dictionaries.I have the idea that creating a hash/dictionary for the reads and then checking.Also,do mate pairs have same read names?

ADD REPLYlink written 5.0 years ago by Ron930
1

Sure, the general idea is to create a list of read names and check if you've already seen the current one before processing its alignment.

found = [] just creates an empty list. found.index(pileupread.alignment.qname) then tries to get the index (position number) of that name in the found list. While you don't actually care about the position in the list, this is convenient because if the read name isn't in the list, then the index method will fail with a ValueError. This is then caught with the try...except structure and things are processed identically to your original code. found.append(pileupread.alignment.qname) then adds the read's name to the list if it's not already there. Note that the list would be reinitialized if you were iterating over more than one position, which is what you want.

Regarding read names, yes, they're normally identical within a pair. In fact, a number of tools will break if this isn't the case. Note that I wrote "normally" before, since it's possible (though pretty unlikely these days) that they're different. At least if the raw reads follow one of the standard formats, then the aligner is likely to make the names identical.

ADD REPLYlink written 5.0 years ago by Devon Ryan90k

I have the code.I am also checking the mapping quality and base quality,so was wondering if adding the read name to the list should be done prior to checking the mapping quality and base quality or after?

 

     if int(pileupread.alignment.mapq)>=int(mapqlim):

          if int(ord(pileupread.alignment.qual[pileupread.qpos])-33)>=int(baseqlim):
                          if str(pileupread.alignment.seq[pileupread.qpos]) == str(ref):

                             ref_counter = float(ref_counter) + 1

                             found.append(pileupread.alignment.qname)

.....................

 

 

Currently,I am adding read name to the list after checking the mapping quality and base quality.

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by Ron930
1

Doing it after makes the most sense. If you filter out read #1 of a pair at a position but read #2 has sufficient quality there then it makes sense to keep it. Adding names to the list before filtering would just ignore good data, so stick with your current implementation.

ADD REPLYlink written 5.0 years ago by Devon Ryan90k

Thanks Devon for a succinct answer!

ADD REPLYlink written 5.0 years ago by Ron930

Hi Devon,

Just a quick query.I want to filter the reads with gaps.for this I have added this if statement from this link in the code before filtering for mapping quality.

http://pysam.readthedocs.org/en/latest/faq.html

if (pileupread.indel == 0 and pileupread.is_del == 0):

Just wanted to make sure if this is the right way to do this?

ADD REPLYlink written 5.0 years ago by Ron930
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 992 users visited in the last hour