Question

Deleted:mPileup base quality values (was Illumina Novaseq 6000 sequencer base quality values)

0

Entering edit mode

14 months ago

Randy H ▴ 110

How does one interpret the quality score in the ~~FASTQ (or BAM)~~ mPileup results coming out from the ~~Illumina Novaseq 6000 Sequencer and~~ DRAGEN pipeline. Any ideas or pointers?

UPDATE: By finally looking at the BAM and FASTQ files, we have modified the question title and content here to address pileup. The original table below is based on mPileup output (with or without the -B) and are adjusted quality values. So the question becomes not for the sequencer but how to interpret the quality scores out of pileup that range from 2 to 74? Comments below may be based on the original post. But thought it better to clarify the original question here.

     Occur  ASCII    ASC-to-Num     PHRED Q value?
         82   *     (42-33)  or  9   Q10?   Q0?
         65   5                 20 
        152   7                 22
      37377   :     (58-33)  or 25   Q20?  Q10?
        216   >                 29
        337   E                 36
     452991   F     (70-33)  or 37   Q30?  Q20?
       2218   Q                 48
        259   S                 50
       3395   _                 62
      19558   k     (107-33) or 74   Q40?  Q30?

A typical result from Novaseq 6000 on a WGS is 113-4 values ranging from 5 to 37. After pileup, the above table is given. I have given roughly the # of occurrences in a 100,000 sample (see below for command). These ASCII values from pileup represent a numerical range of 9 to 74 it appears. Far more than the typical 40 of both standard PHRED+33 or even Illumina's old Solexa coding system. Appears Illumina are binning. They have posted white papers about how that dramatically reduces the storage requirements while not affecting the final accuracy. But these ASCII values are unevenly spread. ASCII F is by far the majority of the reads. And hence why I was guessing it represents Q30 or similar. It is roughly midrange though so maybe Q20?

I know there are lots of posts on PHRED quality scores from sequencers. And how (historically) Illumina was different that what others considered the standard in the industry. But after searching deeply here and elsewhere; I am no closer to an answer on interpreting the above values.

I could try and code-spelunk into various pileup tools to see what they are doing. But my concern is I will find the callers are getting it wrong and just assuming Phred+33 with higher numbers just Q40+..

The first two columns of the above chart were generated by the command below. Just simpler than processing the BAM read segment quality values to see what was going on. And to see it post other quality metrics. ~~The code in the read segments in the BAM, FASTQs and mPileup appear to be the same. But it is possible mpileup has filtered out below Q10.~~ The "head" and "tail" commands are simply to skip past the first 400,000 read segments and then take the next 200,000. This is much quicker and easier than doing a random sampling. If it matters, I also give the BAM header tags indicating the DRAGEN version.:

{ samtools view -H your.bam ; samtools view your.bam | head -400000 | tail +200000 ; }  |  samtools mpileup - | cut -f6 | cut -c1 | sort | uniq -c

@PG ID: DRAGEN HW build VN: 00000000
@PG ID: DRAGEN SW build VN: 05.121.645.4.0.3

(I should add that the MGI DNBSEQ sequencers (complete genomics) with the Megabolt pipeline also put out a range of "*" to "F" (37) but not binned like Illumina. You see just about every ASCII value in between. Post pileup, they are then spread to 74 as a high also.

UPDATE: the commands to look at the BAM and FASTQ file quality scores; respectively, are:

samtools view *bam | head -8000 | tail +4000 | cut -f11 | sed 's/\(.\)/\1\n/g' | sort | uniq -c | tail -n +2
     1 #
 13711 ,
 24534 :
555853 F

zcat *R1*fastq.gz | awk 'NR % 4 == 0' | head -8000 | tail +4000 | cut -c2- | sed 's/\(.\)/\1\n/g' | sort | uniq -c | tail -n +2
   340 #
 14769 ,
 24446 :
548044 F

I realize I am not sampling the same values in the 3 different runs. But these are from the same WGS result (FASTQ and BAM from an Illumina Novaseq 6000 run).

PHRED Illumina pileup MGI WGS • 1.2k views

ADD COMMENT • link 14 months ago by Randy H ▴ 110