Please take a look at these sequence quality histograms from FastQC.
This is WGS data sequenced on an Illumina HiSeq4000. We intend to call SNPs and indels, and possibly structural variants. In the future we may also use the data set for imputation.
I have four options. I'm not experienced enough to make the call myself, so I'd like some informed opinions:
- Perform another size selection step to narrow the spread in the library pool so the HiSeq4000 can accommodate it without read2 quality dropping as it did in the first run. We have QC’ed the library following a second round of sizing and it does look much better in terms of suitability for the HiSeq4000. However, 10X do not recommend this for fear of losing diversity in the library.
- Run the library again on the HiSeq4000 with adjusted loading to improve overall yield. The likelihood here is that the read2 issue will continue.
- Run the library on the NextSeq500. This is an unknown but it is believed this could accommodate the size of the library better than the HiSeq4000. The data yield would be lower.
- Just use the data as is - the sequencing quality is still quite good - maybe consider trimming, but how much should be trimmed?
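To make the trimming question concrete, this is roughly what I have in mind: a minimal sketch that removes trailing bases below a Phred threshold from the 3' end of a read. The function name `trim_trailing`, the Q20 cutoff and the Phred+33 offset are my own assumptions for illustration, not a recommendation or the behaviour of any particular trimming tool.

```python
def trim_trailing(seq, qual, min_q=20, offset=33):
    """Drop 3' bases whose Phred quality is below min_q.

    seq    -- read sequence
    qual   -- FASTQ quality string (assumed Phred+33 encoded)
    min_q  -- hypothetical cutoff; the right value is the open question
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the cutoff.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(trim_trailing("ACGTACGT", "IIIII###"))  # ('ACGTA', 'IIIII')
```

Running something like this over read2 at a few cutoffs would show how many bases each threshold costs.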
Appreciate any input from experienced eyes.
EDIT: I'm expecting around 20X coverage (150 bp read length, paired end, 250M reads per sample (125M per fq), 3 Gb genome)
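For reference, here is a quick sketch of the arithmetic behind that estimate, using coverage = number of reads × read length / genome size. The function name `expected_coverage` is just illustrative. Taking the figures above literally, 250M reads of 150 bp over a 3 Gb genome gives about 12.5X. It would reach 25X only if 250M meant read pairs rather than reads, so the 20X figure may need double-checking.

```python
def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Mean depth assuming every base of every read maps uniquely."""
    return n_reads * read_length_bp / genome_size_bp


READ_LENGTH = 150              # bp
GENOME_SIZE = 3_000_000_000    # 3 Gb
READS = 250_000_000            # 125M in each of the two fastq files

# Interpretation 1: 250M is the total number of reads (R1 + R2).
print(f"{expected_coverage(READS, READ_LENGTH, GENOME_SIZE):.1f}X")      # 12.5X

# Interpretation 2 (assumption): 250M is the number of read pairs.
print(f"{expected_coverage(2 * READS, READ_LENGTH, GENOME_SIZE):.1f}X")  # 25.0X
```

Trimming, duplicates and unmapped reads would all push the usable depth below these upper bounds.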