I have a FASTQ dataset where I'm trying to find the average base quality score. I found this old link that helped somewhat (https://www.biostars.org/p/47751/).
Here is my script (I'm trying to stick to awk, bioawk or python):
I was expecting a single average output for the entire set, instead I got a huge dump of data with many rows looking like this:
>FCD23GWACXX:2:1101:3183:2494
36.45
I assume that the 36.45 is my average phred quality score for that sequence? I was hoping for an average score for the entire file. Could such a script be written?
bioawk is to some extent still awk based (though with -c you get a diff read mode, as in 'per entry' rather then 'per line').
You can easily extent this cmdline to get the desired result:
no, that's a paste 'trick' : it will take two consecutive lines from the stream and print them on a single line (tab delineated) . so to get the mean qual values . You can also filter the stream with awk to only take each second line.
bioawk -c fastx '{print ">"$name; print meanqual($qual)}' XXX.cln.fastq.gz | awk '{ if (NR%2==0) sum += $1} END {print sum/NR}'
You only need to put the input file once in the bioawk part .
This all assume you just have results print to screen from bioawk though and thus not to a file!
Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.
Below I've added my function to average quality scores. I use Biopython for parsing fastq files:
from math import log
def ave_qual(quals):
"""Calculate average basecall quality of a read.
Receive the integer quality scores of a read and return the average quality for that read
First convert Phred scores to probabilities,
calculate average error probability
convert average back to Phred scale
"""
if quals:
return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
else:
return None
I use this function as below, in which fq is the name of a fastq file:
from Bio import SeqIO
def extract_from_fastq(fq):
"""Extract quality score from a fastq file."""
for rec in SeqIO.parse(fq, "fastq"):
yield ave_qual(rec.letter_annotations["phred_quality"]))
This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.
Two years later... +1! I spent the last 2 hours trying to figure out why my dumb average of phred scores didn't match NanoPlot's (+1 also to that) until I dug into its code and eventually found this thread.
This was very useful for me today. fastqc showed I had loads of reads >Q30 but chopper gave none. Turns out that fastqc is calculating the per read Q values wrong.
I don't think average quality score is useful for any practical purpose. If it is really bad then perhaps to confirm that you have horrible data.