Generate percent of bases with quality > x from BAM file
2
0
Entering edit mode
8.7 years ago
Andrew ▴ 60

I am interesting in finding the percent of bases with quality > x from a BAM file. I want to do this multiple times for a BAM file. So for example I might want to find percent of bases with quality > 0 and also percent of bases with quality > 50.

I know I could parse data from the BAM file using samtools view, but that is really slow. I could also use samtools mpileup, but again it is slow and generates more information that I need.

Are there any tools out there that can achieve this?

bam • 2.6k views
ADD COMMENT
0
Entering edit mode

Did you try FASTQC? I think it can reads BAM files http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

ADD REPLY
3
Entering edit mode
8.7 years ago

Using my tool bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

The javascript file:

var qual2count={};
var total=0.0;
var maxqual=0;
while(iter.hasNext())
    {
    var rec = iter.next();

    var quals = rec.getBaseQualities();


    for(var i=0;i< quals.length;++i)
        {
        var qual = quals[i];
        if( maxqual <= qual) maxqual=qual+1;
        if(!(qual in qual2count)) qual2count[qual]=0;
        qual2count[qual]++;
        total++;
        }
    }
for(var i=0;i< maxqual;++i)
    {
    var n=0;
    for(var j=i;j< maxqual;++j)
        {
        if(!(j in qual2count)) continue;
        n += qual2count[j];
        }
    out.println("qual("+i+")\t"+(100.0* (n/total)));
    }

Invoke:

$ java -jar dist-1.133/bioalcidae.jar -f input.js samtools-0.1.19/examples/ex1.bam

qual(0)    100
qual(1)    99.77434771044436
qual(2)    99.77434771044436
qual(3)    99.77348971694795
qual(4)    99.75632984701977
qual(5)    99.6036070046589
qual(6)    99.10854475723075
qual(7)    98.96869181731603
qual(8)    98.80138308551622
qual(9)    98.55513895204673
qual(10)    98.30975281207368
qual(11)    97.57788435963656
qual(12)    97.32906624567785
qual(13)    97.01418263249565
qual(14)    96.65039339001811
qual(15)    96.35867560123894
qual(16)    96.01976817015728
qual(17)    95.50154009832606
qual(18)    94.98588600698406
qual(19)    94.39043852047601
qual(20)    93.61996036070046
qual(21)    92.94729345951558
qual(22)    91.90483136137829
qual(23)    89.93316230662971
qual(24)    88.37676210414325
qual(25)    86.75601238942609
qual(26)    83.89889404638313
qual(27)    74.77584919906307
qual(28)    5.675626978747501
qual(29)    0.14671688788598983
qual(30)    0.01801786342459524
qual(31)    0.0025739804892278917
ADD COMMENT
0
Entering edit mode

This is exactly what I am looking for, thank you.

ADD REPLY
1
Entering edit mode

fixed the code

ADD REPLY
0
Entering edit mode

wait... there's something wrong in my javascript, numbers should be decreasing....

ADD REPLY
2
Entering edit mode
8.7 years ago

If speed is an issue, I think the easiest way to deal with large bam files is to parse a read every n-th. After all once you have collected hundreds of thousands of nearly random reads you have a pretty good approximation of the quality distribution. This simple python script should do the trick:

Example usage, parse one every 100th reads:

samtools view aln.bam | ./baseQualityDistr.py 100

Code:

#!/usr/bin/env python

import sys
import collections

docstring= """
Generate percent of bases with quality > x from BAM file.
Read sam file from stdin. Optionally parse every n-th read.
USAGE
samtools view aln.bam | baseQualityDistr.py [parse-every-nth-read]
"""

if len(sys.argv) > 1 and sys.argv[1] in ['-h', '--help']:
    print(docstring)
    sys.exit()

if len(sys.argv) == 1:
    nth= 1
else:
    nth= int(sys.argv[1])

quals= collections.Counter()
n= 0
for line in sys.stdin:
    n += 1
    if n % nth == 0:
        qstr= line.split('\t')[10]
        for b in qstr:
            quals[b] += 1

q33= dict()
for k in quals:
    q33[ord(k) - 33]= quals[k]

q33sorted= collections.OrderedDict(sorted(q33.items(), reverse= True))
cdf= dict()
cumsum= 0
tot= 0
for k in q33sorted:
    cumsum += q33sorted[k]
    tot += q33sorted[k]
    cdf[k]= cumsum
cdfSorted= collections.OrderedDict(sorted(cdf.items(), reverse= True))

tot= float(tot)
print('Q33\tcumsum\tpct')
for k in cdfSorted:
    print('\t'.join([str(k), str(cdfSorted[k]), str(100 * cdfSorted[k] / tot)]))

sys.exit()
ADD COMMENT

Login before adding your answer.

Traffic: 2419 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6