Question: Reads in BAM file overlapping coordinates
0
gravatar for madhavkurup1991
4.7 years ago by
madhavkurup19910 wrote:

Hi guys,

I'm trying to create a program that can count the number of reads in a BAM file that overlap a specific genomic coordinate(for example, chr15:28365618) and which nucleotides are present in those reads(with a minimum base quality of 30). This is my program so far:

import pysam

bamfile=pysam.AlignmentFile("file1.bam","rb")

for read in bamfile.fetch("chr15",28365618,28365619)
     a=0
     c=0
     g=0
     t=0
     n=0
    pos=read.reference.start
    nucl=read.query_sequence[28365618-pos]
    for i in nucl
        if i=="A":
            a+=1
        elif i=="C":
            c+=1
        elif i=="G":
            g+=1
        elif i=="T":
            t+=1
       elif i=="N":
            n+=1
print a+c+g+t+n #total no. of reads
print "A;",a
print "C:" c
print "G",g
print "T",t
print "N",n

However,my output is way off when I compare it to IGV, and I can't quite figure out why.Maybe because of duplicate reads? Any help will be greatly appreciated.

snp pysam bam reads • 1.3k views
ADD COMMENTlink modified 4.7 years ago by Jorge Amigo12k • written 4.7 years ago by madhavkurup19910
2

BTW, I corrected the formatting and added the missing : after if i=="T".

ADD REPLYlink written 4.7 years ago by Devon Ryan98k
1

Hi You can user Counter from the collections module to do the counting.

from collections import Counter

counts = Counter(sequence)

Here is a one liner which should work:

all_bases = [Counter([aln.seq[i] for i in range(len(aln.seq)) if aln.qual[i] > 30]) for aln in bamfile.fetch('chr5', 1234, 12345)]

[Counter({'C': 1, 'G': 1}), Counter({'C': 1, 'G': 1})]

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by kezcleal160

OK, is it working? Is it not? Are you asking for help with something or just showing what you can do?

ADD REPLYlink written 4.7 years ago by Devon Ryan98k

Not quite, the output is way off and I'm not sure what's wrong with the code. Also, to find the base quality I used the following code: [ord(c)-33 < _30 for c in list(read.qual)]):

However, it doesn't seem to have any effect .

ADD REPLYlink written 4.7 years ago by madhavkurup19910
1

Something like the following would seem to do what you want:

counts = {"A": 0, "C": "G": 0, "T": 0, "N": 0}
for b, q in zip(read.query_alignment_sequence, read.query_alignment_qualities):
    if q >= 30:
        counts[b] += 1
print(counts)
ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Devon Ryan98k

This seems to be working a lot better ,thanks so much. Just a quick query: Is there a way of knowing how many of these reads are mapped to the forward strand? I tried using if read.is_Reverse, but it doesn't seem to do the trick. Also,I'm trying to remove duplicate reads as well.

Again, thanks a lot.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by madhavkurup19910
1

if read.is_reverse, spelling and capitalization are rather important in programming. If duplicates have been marked, then if read.is_duplicate.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Devon Ryan98k

in your for loop you haven't defined n. and I guess you need to change three if s to elif!

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by reza.jabal440

Yeah, edited that. Still, the program doesn't seem to give the correct output.

ADD REPLYlink written 4.7 years ago by madhavkurup19910
2
gravatar for Jorge Amigo
4.7 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

this would be a very fast samtools + perl alternative to your python code:

samtools mpileup -uv -t AD -Q 30 -r chr15:28365618-28365618 file.bam \
| perl -lane 'unless (/^#/) {
$F[7] =~ /^DP=(\d+)/ and print "A+C+G+T+N: $1";
@a = split ",", $F[4];
$F[9] =~ /([\d,]+)$/ and @ad = split ",", $1;
for ($i=0; $i<=$#a; $i++) { print "$a[$i]: $ad[$i+1]" }
}'

samtools mpileup is used to interrogate the position of interest; -uv is for uncompressed vcf output; -t AD adds allele depth to the output; -Q 30 limits minimum base quality to 30; the perl script just parses the vcf output's last line, which contains all the information needed.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Jorge Amigo12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2431 users visited in the last hour
_