Dear all,
I got some Illumina data in from an external provider. It seems it's from a NovaSeq (instrument ID starts with @A00, and according to that post...)
So, it's paired-end RNA sequencing data (2x151bp), prepared with the TruSeq Stranded mRNA LT Sample Prep Kit following the protocol "TruSeq Stranded mRNA Sample Preparation Guide, Part #15031047 Rev. E". I'm surprised to see the quality scores look like that: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFFF
or like that: FFFF:,FFFFFFFFFF:FFFFFF:FFFFF,:FF::F,:FFFFFFFFFFFF,FFF:F:FF:FF,FFFFFFFFF,FFFFF:F,F:FF,,FFFFFF,FF:F:F,FF:FFFF:F:F FFFFF,F,FFF:FFFF:FFF:FFFFFFFFFFF:FF::FF
or like that: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF::F:FFFF:FF:FFF,F:FFF:::F,:F:F:FFFFF,FFFFFF:FF,FFFFFFF:FF,FFFFFF,::F,:F:::F,FFF,FFF:FFF,FFF::F,:FFFFFFFF,,:F,FFF
Of course 'F' is a very good score (Q37), but ':' is only Q25, and ',' is even worse (Q11). Have you guys observed such very short basecalling quality dips in otherwise good-looking RNA-based data?
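For reference, the Q values quoted above follow from the standard Phred+33 encoding (ASCII code minus 33). A minimal sketch to decode a quality string and count how often each symbol occurs; the function names are only illustrative:

```python
from collections import Counter


def phred33_scores(qual: str) -> list[int]:
    """Convert a Phred+33 quality string into integer Q scores."""
    return [ord(ch) - 33 for ch in qual]


def symbol_counts(qual: str) -> Counter:
    """Count how many times each quality symbol appears in the string."""
    return Counter(qual)


if __name__ == "__main__":
    example = "FFFF:,FFFF"  # short made-up string, not taken from the real data
    print(phred33_scores(example))
    print(symbol_counts(example))
```

Seeing only a handful of distinct symbols in the counts suggests the quality scores are binned rather than continuous.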
With a billion reads, one does not get caught up in minor issues affecting a few reads. If that dip is present in every read (at the same cycle, and perhaps for a set of tiles), then it is possible that something went wrong with the flowcell (e.g. a bubble). Ask your sequencing provider if they can clarify.
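One way to check whether the dips cluster at a particular cycle or tile is to tabulate, per tile, the fraction of low-quality bases at each cycle. The sketch below is only a rough illustration: it assumes uncompressed 4-line FASTQ records, Phred+33 qualities, and the usual Illumina header layout in which the tile is the fifth colon-separated field. The function name and the threshold of Q20 are arbitrary choices, not anything from the provider.

```python
from collections import defaultdict
from itertools import islice


def low_quality_by_cycle(fastq_lines, threshold=20):
    """Return {tile: [fraction of bases with Q < threshold at each cycle]}.

    Assumes 4-line FASTQ records and headers of the form
    @instrument:run:flowcell:lane:tile:x:y (tile = fifth field).
    """
    low = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    it = iter(fastq_lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        header, _seq, _plus, qual = (line.rstrip("\n") for line in record)
        fields = header.split(":")
        tile = fields[4] if len(fields) > 4 else "unknown"
        for cycle, ch in enumerate(qual):
            total[tile][cycle] += 1
            if ord(ch) - 33 < threshold:
                low[tile][cycle] += 1
    return {
        tile: [low[tile][c] / total[tile][c] for c in sorted(total[tile])]
        for tile in total
    }


if __name__ == "__main__":
    # "reads_R1.fastq" is a placeholder filename
    with open("reads_R1.fastq") as handle:
        for tile, fractions in sorted(low_quality_by_cycle(handle).items()):
            worst = max(range(len(fractions)), key=fractions.__getitem__)
            print(tile, "worst cycle:", worst + 1, "fraction:", fractions[worst])
```

If the same cycle stands out across many tiles, that points to a run-level event; if the low-quality positions are scattered, it is more likely ordinary per-read noise. For gzipped files, `gzip.open(path, "rt")` can replace `open`.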