Question: FASTQ Phred33 average base quality score
0
gravatar for oars
10 months ago by
oars150
oars150 wrote:

I have a FASTQ dataset where I'm trying to find the average base quality score. I found this old link that helped somewhat (https://www.biostars.org/p/47751/).

Here is my script (I'm trying to stick to awk, bioawk or python):

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz

I was expecting a single average output for the entire set, instead I got a huge dump of data with many rows looking like this:

    >FCD23GWACXX:2:1101:3183:2494
36.45

I assume that the 36.45 is my average phred quality score for that sequence? I was hoping for an average score for the entire file. Could such a script be written?

phred33 ascii fastq • 1.0k views
ADD COMMENTlink modified 10 months ago by WouterDeCoster35k • written 10 months ago by oars150

I was hoping for an average score for the entire file

I don't think average quality score is useful for any practical purpose. If it is really bad then perhaps to confirm that you have horrible data.

ADD REPLYlink written 10 months ago by genomax59k
2
gravatar for lieven.sterck
10 months ago by
lieven.sterck3.3k
VIB, Ghent, Belgium
lieven.sterck3.3k wrote:

bioawk is to some extent still awk based (though with -c you get a diff read mode, as in 'per entry' rather then 'per line'). You can easily extent this cmdline to get the desired result:

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | paste - - | awk '{sum += $3} END {print sum/NR}'

should do the trick

ADD COMMENTlink modified 10 months ago • written 10 months ago by lieven.sterck3.3k

Many thanks. Where you have | paste - - |, this is where I re-enter my file name?

because as written I just return the following: >

ADD REPLYlink modified 10 months ago • written 10 months ago by oars150

no, that's a paste 'trick' : it will take two consecutive lines from the stream and print them on a single line (tab delineated) . so to get the mean qual values . You can also filter the stream with awk to only take each second line.

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | awk '{ if (NR%2==0) sum += $1} END {print sum/NR}'

You only need to put the input file once in the bioawk part . This all assume you just have results print to screen from bioawk though and thus not to a file!

ADD REPLYlink modified 10 months ago • written 10 months ago by lieven.sterck3.3k
1

and or BUT do have a look at what @WouterDeCoster writes , 'cus indeed simply taking the avg of phred scores has not much meaning

ADD REPLYlink modified 10 months ago • written 10 months ago by lieven.sterck3.3k

This worked, thank you. The output was 18.035 which I believe translates to at ASCII value of ~ 51.

ADD REPLYlink written 10 months ago by oars150
2
gravatar for WouterDeCoster
10 months ago by
Belgium
WouterDeCoster35k wrote:

Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.

Below I've added my function to average quality scores. I use Biopython for parsing fastq files:

from math import log

def ave_qual(quals):
    """Calculate average basecall quality of a read.
    Receive the integer quality scores of a read and return the average quality for that read
    First convert Phred scores to probabilities,
    calculate average error probability
    convert average back to Phred scale
    """
    if quals:
        return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
    else:
        return None

I use this function as below, in which fq is the name of a fastq file:

from Bio import SeqIO

def extract_from_fastq(fq):
    """Extract quality score from a fastq file."""
    for rec in SeqIO.parse(fq, "fastq"):
        yield ave_qual(rec.letter_annotations["phred_quality"]))

This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.

ADD COMMENTlink written 10 months ago by WouterDeCoster35k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 851 users visited in the last hour