Weird fastq quality distributions
23 months ago
evissc ▴ 30

Hi! I'm new to bioinformatics and am working with some FASTQ files that have strange base quality distributions (see image). We see only 4 unique Phred scores across the whole file, which is surprising given that Illumina sequencing has 41 possible scores. This happens across multiple files, and the files are straight from the sequencing company. The four quality characters in the files are "F", ",", ":" and "#".
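For reference, here is a minimal sketch of how those four characters map to numeric Phred scores. It assumes the files use the Phred+33 (Sanger / Illumina 1.8+) ASCII offset, which I have not confirmed for these files; the helper name `phred33` is just for illustration.

```python
def phred33(char: str) -> int:
    """Convert one FASTQ quality character to a Phred score.

    Assumes a +33 ASCII offset (Phred+33); older Illumina files used +64.
    """
    return ord(char) - 33


# The four quality characters observed in the files.
observed = ["F", ":", ",", "#"]
scores = {c: phred33(c) for c in observed}

for char, q in scores.items():
    # Phred definition: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10)
    print(f"{char!r} -> Q{q} (error probability ~{10 ** (-q / 10):.4f})")
```

Under that assumption the characters decode to Q37, Q25, Q11 and Q2.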

I have confirmed this behaviour with multiple people in my team, so this is not an analysis problem but an issue with the files themselves (it is also obvious when looking at the quality strings of the raw reads).

I also found this other question on Biostars. It's hard to tell, but they seem to see the same behaviour, which suggests it may be a common issue. Does anyone have any idea what is happening? The reads themselves seem normal when compared to the reference genome.

We have contacted the sequencing company, but they haven't really provided clarity, so I thought people here might be able to provide some insight.

Thanks in advance!

Aggregation of base qualities over fastq files for same sample

23 months ago
evissc ▴ 30

Reposting from lieven.sterck's comment:

"If I remember correctly, was Illumina not going to change its quality scores to a binned approach to reduce file size?"

That led me to this answer, which seems to explain it.


Although you probably won't need it, the BBMap suite includes a tool, calctruequality.sh, that lets you recalibrate the Q scores based on alignments.

23 months ago
ATpoint 82k

It means that the vast majority of bases are of the best quality, which is a good thing. So what is the issue here? Be happy about it.


The issue is that the distribution is surprising: why are there only 4 distinct values, and why isn't there a range of high qualities instead of just this one high value? I am unsure what a standard distribution looks like for WGS, but I was expecting something like the image included (ignore the fact that the bars are of equal height).

Image taken from the GATK website.


What kind of data is this? Is it very recent? Is it Illumina? That sort of thing ...

If I remember correctly, was Illumina not going to change its quality scores to a binned approach to reduce file size?


Ah, thank you @lievensterck! I found this, which seems to be it!


That is probably just how the tool you used to make this graph draws it. Run the files through FastQC if you want the "standard" view, or write a script that collects the metrics into a proper histogram. Without information on the tool I cannot comment. Honestly though, the quality is good, so go ahead with the analysis and don't waste time on base qualities.
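As a rough illustration of the "write a script" option, a minimal sketch might look like the following. It assumes Phred+33 encoding and strict four-line FASTQ records (no wrapped sequence lines); the function names `quality_histogram` and `open_fastq` are made up for this example, and the in-memory reads are toy data, not taken from the files in question.

```python
import gzip
from collections import Counter
from typing import Iterable


def quality_histogram(lines: Iterable[str], offset: int = 33) -> Counter:
    """Count Phred scores over all bases of a FASTQ stream.

    Assumes 4-line records, so every 4th line is a quality string.
    """
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 3:  # quality line of each record
            counts.update(ord(c) - offset for c in line.rstrip("\r\n"))
    return counts


def open_fastq(path: str):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


# Toy example: two made-up reads held in memory.
example = [
    "@read1", "ACGT", "+", "FF:#",
    "@read2", "ACGT", "+", "F,,F",
]
hist = quality_histogram(example)
for q in sorted(hist):
    print(f"Q{q}\t{hist[q]}")
```

For a real file you would pass the handle instead, e.g. `with open_fastq(path) as fh: hist = quality_histogram(fh)`. With binned qualities, the resulting histogram would be expected to contain only a handful of distinct scores.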
