Having been tasked with assembling a phage genome, my colleague and I ran into several problems.
- First, the sequence duplication levels are abnormally high, reaching up to ~90% and ~40% for forward and reverse reads, respectively.
- Second, per tile sequence quality displays a mixture of alarming patterns.
This is per-tile sequence quality for raw R1 reads.
And per base sequence quality for raw R1 reads as well.
This is per-tile sequence quality for raw R2 reads.
And per base sequence quality scores for raw R2 reads.
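For anyone who wants to cross-check the duplication levels independently of the QC report, here is a minimal Python sketch that counts exact duplicate sequences in a FASTQ file. The helper names (`read_fastq_sequences`, `duplication_fraction`) and the file name in the usage note are hypothetical, and it assumes plain four-line FASTQ records. QC tools may estimate duplication differently (for example from a subsample of reads), so the numbers need not match the report exactly.

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator


def read_fastq_sequences(path: str) -> Iterator[str]:
    """Yield the sequence line of every record (assumes 4 lines per record)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of each record is the sequence
                yield line.strip()


def duplication_fraction(sequences: Iterable[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every distinct sequence contributes one "original"; the rest are duplicates.
    return 1.0 - len(counts) / total
```

Usage would look like `duplication_fraction(read_fastq_sequences("sample_R1.fastq.gz"))`, where the file name is a placeholder.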
Initially, we came up with several possible explanations:
1) The flow cell might've been overloaded;
2) The duplication level might've been too high, hence the 4 distinct low quality bands;
3) Something is wrong with the sequencing platform, hence the long red bands in the per-tile sequence quality report.
After several unsuccessful rounds of fiddling with trimmomatic, I ended up specifying very strict quality control options:
HEADCROP:10 SLIDINGWINDOW:3:32 MINLEN:230.
With these options, only ~22% (~250k out of ~1.1 million) of the forward reads and ~4% (~41k out of ~1.1 million) of the reverse reads survived. I also specified the Nextera transposase adapter sequences, because the samples were badly contaminated. The adapters have been removed, but the problem with per tile sequence quality persists.
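To make it clearer how strict these options are, here is a simplified Python sketch of the three steps applied to the quality scores of a single read. It is only an illustration under my own assumptions, not trimmomatic's actual implementation: the function name `trim_read` is hypothetical, and the sliding-window step simply cuts the read at the start of the first window whose mean quality drops below the threshold.

```python
from typing import List, Optional


def trim_read(
    quals: List[int],
    headcrop: int = 10,
    window: int = 3,
    min_avg: float = 32,
    minlen: int = 230,
) -> Optional[List[int]]:
    """Return the surviving quality scores, or None if the read is dropped."""
    # HEADCROP: remove a fixed number of bases from the start of the read.
    quals = quals[headcrop:]

    # SLIDINGWINDOW (simplified): scan from the 5' end and cut the read at the
    # first window whose mean quality falls below the threshold.
    keep = len(quals)
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            keep = start
            break
    quals = quals[:keep]

    # MINLEN: discard reads that ended up shorter than the minimum length.
    return quals if len(quals) >= minlen else None
```

With 250 bp reads, HEADCROP:10 leaves 240 bases, so MINLEN:230 tolerates at most 10 bases being clipped from the 3' end. A single low-quality window anywhere before that point is enough to discard the whole read, which is consistent with so few reads surviving.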
Per tile sequence quality for R1 reads after running trimmomatic.
Together with the sequence quality scores.
Per tile sequence quality for R2 reads after running trimmomatic. The red tiles didn't disappear, and seemingly random bad-quality patterns emerged toward the left of the flow cell.
As well as the sequence quality scores.
Sequencing was performed on an Illumina HiSeq T1500 machine in rapid-run mode, producing ~1.2 million 250 bp paired-end reads.
What do you think might be the cause of such duplication levels, per tile sequence quality patterns and overall data quality?
Was something wrong with the sequencing procedure or the machine?
Could this be due to rapid-run mode? (We haven't seen anyone use it before.)
Should we raise an alarm and contact the sequencing facility or are we being overly cautious?
We've just received additional information regarding the run from the sequencing facility:
Our samples make up only ~1.3% of the run.
0.5% of phiX was spiked in.
Cluster density is 1200.
200 GB of data were produced by the sequencer in this run (although the upper limit should be 150 GB according to the machine's specification).