Question: Generate percent of bases with quality > x from BAM file
0
gravatar for Andrew
3.7 years ago by
Andrew50
Canada
Andrew50 wrote:

I am interesting in finding the percent of bases with quality > x from a BAM file.  I want to do this multiple times for a BAM file.  So for example I might want to find percent of bases with quality > 0 and also percent of bases with quality > 50.

I know I could parse data from the BAM file using samtools view, but that is really slow.  I could also use samtools mpileup, but again it is slow and generates more information that I need.

Are there any tools out there that can achieve this?

bam quality • 1.3k views
ADD COMMENTlink modified 3.7 years ago by dariober9.9k • written 3.7 years ago by Andrew50

did you try FASTQC ?  I think it can reads BAM files http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum118k
3
gravatar for Pierre Lindenbaum
3.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

Using my tool bioalcidae : https://github.com/lindenb/jvarkit/wiki/BioAlcidae

the javascript file:
var qual2count={};
var total=0.0;
var maxqual=0;
while(iter.hasNext())
    {
    var rec = iter.next();
    
    var quals = rec.getBaseQualities();
    

    for(var i=0;i< quals.length;++i)
        {
        var qual = quals[i];
        if( maxqual <= qual) maxqual=qual+1;
        if(!(qual in qual2count)) qual2count[qual]=0;
        qual2count[qual]++;
        total++;
        }
    }
for(var i=0;i< maxqual;++i)
    {
    var n=0;
    for(var j=i;j< maxqual;++j)
        {
        if(!(j in qual2count)) continue;
        n += qual2count[j];
        }
    out.println("qual("+i+")\t"+(100.0* (n/total)));
    }

 

 

 

 

invoke:

$ java -jar dist-1.133/bioalcidae.jar -f input.js samtools-0.1.19/examples/ex1.bam


qual(0)    100
qual(1)    99.77434771044436
qual(2)    99.77434771044436
qual(3)    99.77348971694795
qual(4)    99.75632984701977
qual(5)    99.6036070046589
qual(6)    99.10854475723075
qual(7)    98.96869181731603
qual(8)    98.80138308551622
qual(9)    98.55513895204673
qual(10)    98.30975281207368
qual(11)    97.57788435963656
qual(12)    97.32906624567785
qual(13)    97.01418263249565
qual(14)    96.65039339001811
qual(15)    96.35867560123894
qual(16)    96.01976817015728
qual(17)    95.50154009832606
qual(18)    94.98588600698406
qual(19)    94.39043852047601
qual(20)    93.61996036070046
qual(21)    92.94729345951558
qual(22)    91.90483136137829
qual(23)    89.93316230662971
qual(24)    88.37676210414325
qual(25)    86.75601238942609
qual(26)    83.89889404638313
qual(27)    74.77584919906307
qual(28)    5.675626978747501
qual(29)    0.14671688788598983
qual(30)    0.01801786342459524
qual(31)    0.0025739804892278917

 

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Pierre Lindenbaum118k

This is exactly what I am looking for, thank you.

ADD REPLYlink written 3.7 years ago by Andrew50
1

fixed the code

ADD REPLYlink written 3.7 years ago by Pierre Lindenbaum118k

wait... there's something wrong in my javascript, numbers should be decreasing....

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Pierre Lindenbaum118k
2
gravatar for dariober
3.7 years ago by
dariober9.9k
WCIP | Glasgow | UK
dariober9.9k wrote:

If speed is an issue, I think the easiest way to deal with large bam files is to parse a read every n-th. After all once you have collected hundreds of thousands of nearly random reads you have a pretty good approximation of the quality distribution. This simple python script should do the trick:

Example usage, parse one every 100th reads:

samtools view aln.bam | ./baseQualityDistr.py 100

Code:

#!/usr/bin/env python

import sys
import collections

docstring= """
Generate percent of bases with quality > x from BAM file.
Read sam file from stdin. Optionally parse every n-th read.
USAGE
samtools view aln.bam | baseQualityDistr.py [parse-every-nth-read]
"""

if len(sys.argv) > 1 and sys.argv[1] in ['-h', '--help']:
    print(docstring)
    sys.exit()

if len(sys.argv) == 1:
    nth= 1
else:
    nth= int(sys.argv[1])

quals= collections.Counter()
n= 0
for line in sys.stdin:
    n += 1
    if n % nth == 0:
        qstr= line.split('\t')[10]
        for b in qstr:
            quals[b] += 1

q33= dict()
for k in quals:
    q33[ord(k) - 33]= quals[k]

q33sorted= collections.OrderedDict(sorted(q33.items(), reverse= True))
cdf= dict()
cumsum= 0
tot= 0
for k in q33sorted:
    cumsum += q33sorted[k]
    tot += q33sorted[k]
    cdf[k]= cumsum
cdfSorted= collections.OrderedDict(sorted(cdf.items(), reverse= True))

tot= float(tot)
print('Q33\tcumsum\tpct')
for k in cdfSorted:
    print('\t'.join([str(k), str(cdfSorted[k]), str(100 * cdfSorted[k] / tot)]))

sys.exit()
ADD COMMENTlink written 3.7 years ago by dariober9.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1251 users visited in the last hour