Phred scores are log probabilities, so simply taking an average of those is wrong. For some more background I'd like to refer to a blog post I wrote: Averaging basecall quality scores the right way.
Below I've added my function to average quality scores. I use Biopython for parsing fastq files:
from math import log
"""Calculate average basecall quality of a read.
Receive the integer quality scores of a read and return the average quality for that read
First convert Phred scores to probabilities,
calculate average error probability
convert average back to Phred scale
return -10 * log(sum([10**(q / -10) for q in quals]) / len(quals), 10)
I use this function as below, in which
fq is the name of a fastq file:
from Bio import SeqIO
"""Extract quality score from a fastq file."""
for rec in SeqIO.parse(fq, "fastq"):
This will get you the average quality per read. I'm not sure if your request for an average quality score per file makes sense. What does that mean? It doesn't say a lot about the dataset. But it's certainly feasible using the functions above to write an additional line of code to get your average quality per file.