Question

Reads in BAM file overlapping coordinates

0

Entering edit mode

9.2 years ago

madhavkurup1991 • 0

Hi guys,

I'm trying to create a program that can count the number of reads in a BAM file that overlap a specific genomic coordinate(for example, chr15:28365618) and which nucleotides are present in those reads(with a minimum base quality of 30). This is my program so far:

import pysam

bamfile=pysam.AlignmentFile("file1.bam","rb")

for read in bamfile.fetch("chr15",28365618,28365619)
     a=0
     c=0
     g=0
     t=0
     n=0
    pos=read.reference.start
    nucl=read.query_sequence[28365618-pos]
    for i in nucl
        if i=="A":
            a+=1
        elif i=="C":
            c+=1
        elif i=="G":
            g+=1
        elif i=="T":
            t+=1
       elif i=="N":
            n+=1
print a+c+g+t+n #total no. of reads
print "A;",a
print "C:" c
print "G",g
print "T",t
print "N",n

However,my output is way off when I compare it to IGV, and I can't quite figure out why.Maybe because of duplicate reads? Any help will be greatly appreciated.

snp bam pysam reads • 3.1k views

ADD COMMENT • link updated 9.2 years ago by Jorge Amigo 14k • written 9.2 years ago by madhavkurup1991 • 0

2

Entering edit mode

BTW, I corrected the formatting and added the missing : after if i=="T".

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

1

Entering edit mode

Hi You can user Counter from the collections module to do the counting.

from collections import Counter

counts = Counter(sequence)

Here is a one liner which should work:

all_bases = [Counter([aln.seq[i] for i in range(len(aln.seq)) if aln.qual[i] > 30]) for aln in bamfile.fetch('chr5', 1234, 12345)]

[Counter({'C': 1, 'G': 1}), Counter({'C': 1, 'G': 1})]

ADD REPLY • link 9.2 years ago by kezcleal ▴ 160

0

Entering edit mode

OK, is it working? Is it not? Are you asking for help with something or just showing what you can do?

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

Not quite, the output is way off and I'm not sure what's wrong with the code. Also, to find the base quality I used the following code: [ord(c)-33 < _30 for c in list(read.qual)]):

However, it doesn't seem to have any effect .

ADD REPLY • link 9.2 years ago by madhavkurup1991 • 0

1

Entering edit mode

Something like the following would seem to do what you want:

counts = {"A": 0, "C": "G": 0, "T": 0, "N": 0}
for b, q in zip(read.query_alignment_sequence, read.query_alignment_qualities):
    if q >= 30:
        counts[b] += 1
print(counts)

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

This seems to be working a lot better ,thanks so much. Just a quick query: Is there a way of knowing how many of these reads are mapped to the forward strand? I tried using if read.is_Reverse, but it doesn't seem to do the trick. Also,I'm trying to remove duplicate reads as well.

Again, thanks a lot.

ADD REPLY • link 9.2 years ago by madhavkurup1991 • 0

1

Entering edit mode

if read.is_reverse, spelling and capitalization are rather important in programming. If duplicates have been marked, then if read.is_duplicate.

ADD REPLY • link 9.2 years ago by Devon Ryan 105k

0

Entering edit mode

in your for loop you haven't defined n. and I guess you need to change three if s to elif!

ADD REPLY • link 9.2 years ago by reza.jabal ▴ 580

0

Entering edit mode

Yeah, edited that. Still, the program doesn't seem to give the correct output.

ADD REPLY • link 9.2 years ago by madhavkurup1991 • 0

score 2 · Answer 1 · 2016-05-10

this would be a very fast samtools + perl alternative to your python code:

samtools mpileup -uv -t AD -Q 30 -r chr15:28365618-28365618 file.bam \
| perl -lane 'unless (/^#/) {
$F[7] =~ /^DP=(\d+)/ and print "A+C+G+T+N: $1";
@a = split ",", $F[4];
$F[9] =~ /([\d,]+)$/ and @ad = split ",", $1;
for ($i=0; $i<=$#a; $i++) { print "$a[$i]: $ad[$i+1]" }
}'

samtools mpileup is used to interrogate the position of interest; -uv is for uncompressed vcf output; -t AD adds allele depth to the output; -Q 30 limits minimum base quality to 30; the perl script just parses the vcf output's last line, which contains all the information needed.