Question

What could be the best explanation for a dip in phred quality score at the near beginning of reads?

6.9 years ago

lakhujanivijay 5.8k

I am looking at the reverse reads (R2) of a 2x150 paired-end Illumina transcriptome dataset. In the plot below (per-base distribution of mean Phred scores across the read), there is a sudden dip at bases 5, 6 and 7.

[Figure: mean Phred score per base position, R2]

I am wondering:

1. What is the most likely explanation for this dip: a problem with library preparation, or a technical problem with the sequencer? Another observation: a large share of the datasets are affected by this issue, and they all come from the same sequencing batch.
2. To get rid of the dip, I ran Trimmomatic HEADCROP to remove the first 7-8 bases. As expected, this considerably improved the quality distribution. However, it also changed the "Sequence Duplication Levels" metric: the "Percent of sequences remaining after deduplication" dropped from 71.7% to 32.8%, as shown here:

Before trimming: [Figure: Sequence Duplication Levels plot]

After trimming: [Figure: Sequence Duplication Levels plot]

What could explain this? I also went through this Biostars post, which helped a little.
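For anyone who wants to reproduce the per-base plot without a QC tool, here is a minimal sketch in Python. It assumes Phred+33 encoding and strict four-line FASTQ records (no wrapped sequences). The function name `mean_phred_per_position` and the toy reads are made up for illustration; they are not taken from the dataset discussed here.

```python
from io import StringIO

def mean_phred_per_position(fastq_handle, offset=33):
    """Return the mean Phred score at each read position (index 0 = base 1)."""
    totals, counts = [], []
    for i, line in enumerate(fastq_handle):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first time we see a read this long
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(ch) - offset  # ASCII code minus offset = Phred score
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]

# Toy input (hypothetical): two 10-base reads whose quality drops at bases 5-7.
toy_fastq = StringIO(
    "@read1\nACGTACGTAC\n+\nIIII###III\n"
    "@read2\nACGTACGTAC\n+\nIIII+++III\n"
)
means = mean_phred_per_position(toy_fastq)
for base, q in enumerate(means, start=1):
    print(base, q)
```

On real data you would pass an open file handle (for example from `gzip.open(path, "rt")`) instead of the `StringIO` object.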

phred fastqc illumina quality trimmomatic • 2.1k views

Is the dip present across all lanes and tiles? If so, it's a machine error (focusing issues or similar). If not, it's probably a bubble (or a series of them).
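One way to check this is to group the quality at the affected bases by lane and tile, both of which are encoded in the read header. The sketch below assumes the standard Illumina (CASAVA 1.8+) header layout `@instrument:run:flowcell:lane:tile:x:y`, Phred+33 encoding and four-line records. The function name `dip_quality_by_tile` and the two toy reads are hypothetical.

```python
from collections import defaultdict
from io import StringIO

def dip_quality_by_tile(fastq_handle, positions=(5, 6, 7), offset=33):
    """Mean Phred over the given 1-based positions, grouped by (lane, tile)."""
    sums = defaultdict(lambda: [0, 0])  # (lane, tile) -> [score total, base count]
    key = None
    for i, line in enumerate(fastq_handle):
        line = line.rstrip("\n")
        if i % 4 == 0:
            # Header: @instrument:run:flowcell:lane:tile:x:y [read:filtered:control:index]
            fields = line[1:].split(" ")[0].split(":")
            key = (int(fields[3]), int(fields[4]))
        elif i % 4 == 3:
            for p in positions:
                if p <= len(line):  # skip positions beyond the end of short reads
                    sums[key][0] += ord(line[p - 1]) - offset
                    sums[key][1] += 1
    return {k: total / n for k, (total, n) in sums.items() if n}

# Toy input (hypothetical): tile 1101 shows the dip, tile 1102 does not.
toy_fastq = StringIO(
    "@M1:1:FC1:1:1101:100:200 2:N:0:ACGT\nACGTACGTAC\n+\nIIII###III\n"
    "@M1:1:FC1:1:1102:100:200 2:N:0:ACGT\nACGTACGTAC\n+\nIIIIIIIIII\n"
)
per_tile = dip_quality_by_tile(toy_fastq)
for (lane, tile), q in sorted(per_tile.items()):
    print(lane, tile, q)
```

If every (lane, tile) pair shows a similarly low value, that points to the machine-wide case; if only a few tiles are low, the localized explanation is more plausible.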

6.9 years ago by Devon Ryan 104k

I suspect the sequence duplication level after cropping is closer to the truth; before cropping, it was possibly masked by low-quality bases / sequencing errors.
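A toy simulation illustrates the masking effect. It is only a sketch: duplicates are collapsed by exact match on the whole read, which is a simplification of what a QC tool does, and the templates, the helper names (`unique_fraction`, `head_crop`, `with_error`) and the use of `N` as the miscalled base are all assumptions made for illustration.

```python
def unique_fraction(reads):
    """Fraction of reads left after collapsing exact duplicates."""
    return len(set(reads)) / len(reads)

def head_crop(reads, n):
    """Drop the first n bases of every read (the effect of a HEADCROP:n step)."""
    return [r[n:] for r in reads]

def with_error(read, pos):
    """Return the read with the base at 0-based position pos miscalled as N."""
    return read[:pos] + "N" + read[pos + 1:]

# Four hypothetical fragments, each sequenced five times (true duplicates).
templates = [
    "AAAAAAAAAACCCCCCCCCC",
    "CCCCCCCCCCGGGGGGGGGG",
    "GGGGGGGGGGTTTTTTTTTT",
    "TTTTTTTTTTAAAAAAAAAA",
]
reads = []
for t in templates:
    # Two clean copies plus three copies with a miscall at base 5, 6 or 7.
    reads += [t, with_error(t, 4), with_error(t, 5), with_error(t, 6), t]

before = unique_fraction(reads)               # errors make duplicates look distinct
after = unique_fraction(head_crop(reads, 7))  # errors removed, duplicates collapse
print(before, after)  # 0.8 0.2
```

Here the 20 reads come from only 4 fragments, yet 16 look unique before cropping because the miscalls at bases 5-7 break exact matching. After cropping, only 4 unique sequences remain, which mirrors the direction of the drop reported in the question.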

6.9 years ago by h.mon 35k