FASTQ Phred33 average base quality score
2
0
Entering edit mode
4.3 years ago
oars ▴ 180

I have a FASTQ dataset where I'm trying to find the average base quality score. I found this old link that helped somewhat (https://www.biostars.org/p/47751/).

Here is my script (I'm trying to stick to awk, bioawk or python):

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz


I was expecting a single average output for the entire set, instead I got a huge dump of data with many rows looking like this:

    >FCD23GWACXX:2:1101:3183:2494
36.45


I assume that the 36.45 is my average phred quality score for that sequence? I was hoping for an average score for the entire file. Could such a script be written?

FASTQ Phred33 ASCII • 5.2k views
0
Entering edit mode

I was hoping for an average score for the entire file

I don't think average quality score is useful for any practical purpose. If it is really bad then perhaps to confirm that you have horrible data.

2
Entering edit mode
4.3 years ago

bioawk is to some extent still awk based (though with -c you get a diff read mode, as in 'per entry' rather then 'per line'). You can easily extent this cmdline to get the desired result:

bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | paste - - | awk '{sum += $3} END {print sum/NR}'  should do the trick ADD COMMENT 0 Entering edit mode Many thanks. Where you have | paste - - |, this is where I re-enter my file name? because as written I just return the following: > ADD REPLY 0 Entering edit mode no, that's a paste 'trick' : it will take two consecutive lines from the stream and print them on a single line (tab delineated) . so to get the mean qual values . You can also filter the stream with awk to only take each second line. bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | awk '{ if (NR%2==0) sum +=$1} END {print sum/NR}'

You only need to put the input file once in the bioawk part . This all assume you just have results print to screen from bioawk though and thus not to a file!

1
Entering edit mode

and or BUT do have a look at what @WouterDeCoster writes , 'cus indeed simply taking the avg of phred scores has not much meaning

0
Entering edit mode

This worked, thank you. The output was 18.035 which I believe translates to at ASCII value of ~ 51.

6
Entering edit mode
4.3 years ago

Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.

Below I've added my function to average quality scores. I use Biopython for parsing fastq files:

from math import log

def ave_qual(quals):
"""Calculate average basecall quality of a read.
Receive the integer quality scores of a read and return the average quality for that read
First convert Phred scores to probabilities,
calculate average error probability
convert average back to Phred scale
"""
if quals:
return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
else:
return None


I use this function as below, in which fq is the name of a fastq file:

from Bio import SeqIO

def extract_from_fastq(fq):
"""Extract quality score from a fastq file."""
for rec in SeqIO.parse(fq, "fastq"):
yield ave_qual(rec.letter_annotations["phred_quality"]))


This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.

0
Entering edit mode

Two years later... +1! I spent the last 2 hours trying to figure out why my dumb average of phred scores didn't match NanoPlot's (+1 also to that) until I dug into its code and eventually found this thread.

0
Entering edit mode

:)