Reads in BAM file overlapping coordinates
1
0
Entering edit mode
7.9 years ago

Hi guys,

I'm trying to create a program that can count the number of reads in a BAM file that overlap a specific genomic coordinate(for example, chr15:28365618) and which nucleotides are present in those reads(with a minimum base quality of 30). This is my program so far:

import pysam

bamfile=pysam.AlignmentFile("file1.bam","rb")

for read in bamfile.fetch("chr15",28365618,28365619)
     a=0
     c=0
     g=0
     t=0
     n=0
    pos=read.reference.start
    nucl=read.query_sequence[28365618-pos]
    for i in nucl
        if i=="A":
            a+=1
        elif i=="C":
            c+=1
        elif i=="G":
            g+=1
        elif i=="T":
            t+=1
       elif i=="N":
            n+=1
print a+c+g+t+n #total no. of reads
print "A;",a
print "C:" c
print "G",g
print "T",t
print "N",n

However,my output is way off when I compare it to IGV, and I can't quite figure out why.Maybe because of duplicate reads? Any help will be greatly appreciated.

snp bam pysam reads • 2.3k views
ADD COMMENT
2
Entering edit mode

BTW, I corrected the formatting and added the missing : after if i=="T".

ADD REPLY
1
Entering edit mode

Hi You can user Counter from the collections module to do the counting.

from collections import Counter

counts = Counter(sequence)

Here is a one liner which should work:

all_bases = [Counter([aln.seq[i] for i in range(len(aln.seq)) if aln.qual[i] > 30]) for aln in bamfile.fetch('chr5', 1234, 12345)]

[Counter({'C': 1, 'G': 1}), Counter({'C': 1, 'G': 1})]

ADD REPLY
0
Entering edit mode

OK, is it working? Is it not? Are you asking for help with something or just showing what you can do?

ADD REPLY
0
Entering edit mode

Not quite, the output is way off and I'm not sure what's wrong with the code. Also, to find the base quality I used the following code: [ord(c)-33 < _30 for c in list(read.qual)]):

However, it doesn't seem to have any effect .

ADD REPLY
1
Entering edit mode

Something like the following would seem to do what you want:

counts = {"A": 0, "C": "G": 0, "T": 0, "N": 0}
for b, q in zip(read.query_alignment_sequence, read.query_alignment_qualities):
    if q >= 30:
        counts[b] += 1
print(counts)
ADD REPLY
0
Entering edit mode

This seems to be working a lot better ,thanks so much. Just a quick query: Is there a way of knowing how many of these reads are mapped to the forward strand? I tried using if read.is_Reverse, but it doesn't seem to do the trick. Also,I'm trying to remove duplicate reads as well.

Again, thanks a lot.

ADD REPLY
1
Entering edit mode

if read.is_reverse, spelling and capitalization are rather important in programming. If duplicates have been marked, then if read.is_duplicate.

ADD REPLY
0
Entering edit mode

in your for loop you haven't defined n. and I guess you need to change three if s to elif!

ADD REPLY
0
Entering edit mode

Yeah, edited that. Still, the program doesn't seem to give the correct output.

ADD REPLY
2
Entering edit mode
7.9 years ago

this would be a very fast samtools + perl alternative to your python code:

samtools mpileup -uv -t AD -Q 30 -r chr15:28365618-28365618 file.bam \
| perl -lane 'unless (/^#/) {
$F[7] =~ /^DP=(\d+)/ and print "A+C+G+T+N: $1";
@a = split ",", $F[4];
$F[9] =~ /([\d,]+)$/ and @ad = split ",", $1;
for ($i=0; $i<=$#a; $i++) { print "$a[$i]: $ad[$i+1]" }
}'

samtools mpileup is used to interrogate the position of interest; -uv is for uncompressed vcf output; -t AD adds allele depth to the output; -Q 30 limits minimum base quality to 30; the perl script just parses the vcf output's last line, which contains all the information needed.

ADD COMMENT

Login before adding your answer.

Traffic: 2669 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6