I tested NovaSeq data. It can actually be very accurate! In my tests, R1 matched the reference perfectly 86% of the time, which is great! Unfortunately, R2 matched the reference perfectly 0% of the time, presumably because of an illumination failure. I expect that this will be fixed in the future.
That said, NovaSeq has a huge problem with index misassignment, so never, ever use NovaSeq when you care about cross-contamination. Ever - for now. My analysis indicated that ~1.5% of single-indexed reads were contaminants (I was surprised; I was expecting 10%, so it's really not that bad), and ~0.08% of dual-indexed reads were contaminants (note, though, that these dual-indexed reads did not have entirely unique index pairs; it was the usual 8x12=96 arrangement). Contaminants, in this context, mean reads that mapped to the wrong genome; with a pure sample the rate is 0%. I think that using unique index pairs for libraries (IDT and Kapa are working on this) might solve the problem in the future, but since I have not tested it, I can't say for sure.
Also, the quality scores of NovaSeq reads are worthless. My analysis indicated an average discrepancy of 20 Phred points between the stated and actual quality of reads. If every read had completely random quality scores, and the accuracy was also completely random (on the Phred scale of 0-41), you would be exactly correct 1/42 of the time on average, with an average discrepancy of about 14 Phred points. So you could generate far more accurate quality scores by throwing darts while blindfolded than by using Illumina's NovaSeq quality-score generation software.
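For what it's worth, the "blindfolded darts" numbers can be checked exactly: with both stated and actual quality drawn uniformly from the 42-point Phred scale, the expected absolute discrepancy comes out just under 14 points. A quick sketch (the uniform-independent model here is my assumption):

```python
# "Blindfolded darts" model: stated and actual quality are both
# independent uniform draws from the 42-point Phred scale 0..41.
N = 42

# Chance a random guess matches the actual score exactly.
p_exact = 1 / N

# Exact expected absolute discrepancy between stated and actual quality.
expected_discrepancy = sum(abs(s - a) for s in range(N) for a in range(N)) / N**2

print(p_exact)               # ~0.0238, i.e. 1/42
print(expected_discrepancy)  # ~13.99 Phred points
```

Either way, random guessing lands roughly 6 Phred points closer to the truth, on average, than the 20-point discrepancy measured here.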
I'm not saying that a blind monkey is better than Illumina's quality-score algorithm. Illumina is a big company, and they care about their users, so obviously, a huge amount of work has gone into base-calling and quality-assignment. But, personally, I'd prefer the blind monkey.
To clarify -
This is the result of mapping NovaSeq reads to a reference and comparing the stated quality scores to the measured rate of matches and mismatches.
This is the link to the empirical data.
That is scary... How did you calculate the real quality score?
I align the reads to the reference and count the number of matches and mismatches at each stated quality score. The actual quality is simply the mismatch fraction converted to the Phred scale (deletions adjacent to called bases get a 50% impact on each flanking base, while insertions get a 100% impact on that base only).
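As a minimal sketch of that conversion (the function name is mine, and the indel half-weighting described above is omitted):

```python
import math

def empirical_phred(matches: int, mismatches: int) -> float:
    """Observed quality: -10*log10 of the measured error rate.
    (Hypothetical helper; indel weighting is omitted for simplicity.)"""
    error_rate = mismatches / (matches + mismatches)
    return -10 * math.log10(error_rate)

# E.g., bases all labeled Q37 that show 999 matches and 1 mismatch
# actually measure out at Q30.
print(round(empirical_phred(999, 1)))  # 30
```

Comparing that measured value against the stated score, per stated-quality bin, gives the discrepancy figures quoted above.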
Very interesting. Mind providing some more evidence? Do you think this might be happening with X Tens or HiSeq 4000s as well?
I have no data for X Tens, so no, I can't help you there. They are human-only, and JGI does not deal with human data.
As for more evidence - this is currently all I have; we got a NovaSeq recently and this is the first run. I expect that the problem with read 2 being useless will be resolved, since that seems to have been an illumination failure. The problem with quality scores being useless will not be resolved, since that is intentional on Illumina's part. There are two factors at play: A) they want gzipped fastq files to be as small as possible (which is why they quantize quality scores), and B) they unfortunately have no interest in making their quality scores accurate, while they do have an interest in ensuring most of their quality scores are >=30 (because that's the metric they use to decide whether or not to compensate people for failed runs).
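Factor A is easy to demonstrate: a quality string quantized to a handful of values compresses far better than one spanning the full Phred range. A rough illustration (the specific bin values below are just for demonstration, not necessarily Illumina's actual bins):

```python
import random
import zlib

random.seed(0)
n = 100_000  # simulated quality-string length

# Full-range quality string: any of 42 Phred values per base (ASCII offset 33).
full_range = bytes(random.randrange(33, 33 + 42) for _ in range(n))

# Quantized quality string: only 4 distinct values per base.
quantized = bytes(random.choice([35, 45, 56, 70]) for _ in range(n))

print(len(zlib.compress(full_range)))  # larger
print(len(zlib.compress(quantized)))   # much smaller
```

With only 4 symbols instead of 42, the entropy per base drops from ~5.4 bits to at most 2, and gzip-style compressors take full advantage.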
Just learned how to test this using bbmap:
bbduk.sh in=trimmed.fq out=recalibrated.fq recalibrate
I will give that a go and check the results.
Thanks a lot!
Bear in mind - you first need to map all of your reads, and then run calctruequality.sh. For example, something like:
bbmap.sh ref=reference.fa in=reads.fq out=mapped.sam
Once the reads are mapped, you can run:
calctruequality.sh in=mapped.sam
You can run that on the raw reads for the true quality accuracy, or after recalibration for the corrected accuracy. Recalibration works like this:
bbduk.sh in=trimmed.fq out=recalibrated.fq recalibrate
If the R2 data did not align at all, then this was a faulty/failed run. Are you overestimating the problem? Illumina provides test datasets downloadable from BaseSpace - have you tested that data with the same method? I can give this a try if you don't have that data any longer.
It was definitely a failed run, but the problem is that Illumina assigned Q37 to most of the bases in the failed read 2, so their quality-score algorithm is completely broken.
And it's not like read 2 didn't align at all... something like 97% of them aligned properly. The illumination failed only in the last 30% of read 2.
What interests me the most here is not that our first NovaSeq run was a massive failure, since I'm pretty sure that will be fixed, but rather...
1) Read 1 had excellent quality, and if there are no illumination failures in the future, I expect both reads will have excellent quality
2) Illumina's algorithm for assigning quality scores to bases is inferior to a blind monkey randomly throwing darts
I mean, illumination failure or not, #2 is simply inexcusable. If there is no signal, Illumina should assign those bases a low quality score, but it seems like they simply hard-code Q37 onto most bases to pass their so-called "Q30" metrics. Note that NovaSeq only has 4 quality scores, as you can see in the image I posted. But ironically, only Q0 bases have a 100% rate of being correct.
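To put numbers on the no-signal case: a base call with zero signal is essentially a random pick from four letters, so it's wrong ~75% of the time. The honest Phred score for that is barely above 1, while Q37 claims an error rate of roughly 2 in 10,000. A quick back-of-the-envelope check (function names are mine):

```python
import math

def phred_from_error(p_error: float) -> float:
    """Phred score implied by an error probability."""
    return -10 * math.log10(p_error)

def error_from_phred(q: float) -> float:
    """Error probability implied by a Phred score."""
    return 10 ** (-q / 10)

# No signal: a random 4-letter base call is wrong 3/4 of the time.
print(round(phred_from_error(0.75), 2))  # ~1.25, the honest score
print(error_from_phred(37))              # ~0.0002, what Q37 claims
```

That's a three-orders-of-magnitude gap between the claimed and actual error rate for signal-free bases.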