Hi guys,
I am analyzing a NIPT data set for one of our collaborators. The reads are 50 bp long. In some of the samples, I observe two strange things:
Many of the reads are just incomplete TruSeq indexed adapter sequences. The fraction of these adapter reads varies by lane and sample, from 0.2% to 8%. I wonder whether this is because there were clusters containing only index adapters on the flowcell (and if so, why), or because the Illumina base-calling and data-filtering software did not trim the sequences correctly.
Usually the GC bias curve is unimodal, meaning that both high-GC and high-AT bins have lower coverage. But in the samples that have a high fraction of index adapter reads, it is almost an ascending line (see the picture below, or click this Google Drive link): coverage increases with GC content. This is very different from what I have read in the papers. Why does it seem to be correlated with the over-represented indexed adapter sequences?
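For anyone who wants to check the adapter-read percentages on their own lanes, here is a minimal sketch of how such a fraction could be counted. It is not the exact method behind the numbers above. It assumes adapter-only reads start (within one base) with the common TruSeq indexed adapter prefix `GATCGGAAGAGCACACGTCTGAACTCCAGTCAC`; the function names, `seed_len`, and `max_offset` are made up for illustration.

```python
# Assumed TruSeq indexed adapter prefix (the part shared by all indexes).
ADAPTER = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()

def adapter_fraction(seqs, adapter=ADAPTER, seed_len=20, max_offset=1):
    """Fraction of reads whose first bases are the adapter itself.

    A read counts as an adapter read when the first `seed_len` bases of
    the adapter occur in the read at position 0..max_offset (exact match).
    """
    seed = adapter[:seed_len]
    total = hits = 0
    for seq in seqs:
        total += 1
        pos = seq.find(seed)
        if 0 <= pos <= max_offset:
            hits += 1
    return hits / total if total else 0.0
```

Usage would be something like `adapter_fraction(fastq_sequences(open("lane1.fastq")))`, where the file name is hypothetical. Exact matching is deliberately strict; a real adapter trimmer tolerates mismatches, so this will slightly undercount.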
Thanks,
Ning
Thanks for helping, Karl.
I am sorry, I forgot to mention that the GC statistics are based on the aligned reads. So the GC content of the index adapter reads, which do not align, should not have been counted.
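To make clear what I mean by a GC bias curve from aligned reads, here is a rough sketch of one common way to compute it. It is not necessarily how the plot above was produced. It assumes the genome has already been cut into fixed-size bins and that each bin comes with its GC percentage and the number of aligned reads falling in it; `gc_percent`, `gc_bias_curve`, and the 5% class width are illustrative choices.

```python
from collections import defaultdict

def gc_percent(seq):
    """Integer GC percentage of a sequence, ignoring N; None if no A/C/G/T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        return None
    return (100 * gc) // acgt

def gc_bias_curve(bins, width=5):
    """Mean aligned-read count per GC class, relative to the overall mean.

    `bins` is an iterable of (gc_percent, aligned_read_count) pairs, one
    per genomic bin. Returns {class_start_percent: relative_coverage}.
    """
    groups = defaultdict(list)
    counts = []
    for gc, count in bins:
        if gc is None:
            continue  # bin made entirely of N
        key = (min(gc, 99) // width) * width  # 100% goes in the top class
        groups[key].append(count)
        counts.append(count)
    if not counts or sum(counts) == 0:
        return {}
    overall = sum(counts) / len(counts)
    return {k: (sum(v) / len(v)) / overall for k, v in sorted(groups.items())}
```

A unimodal curve would show values near 1 in the middle GC classes and lower values at both ends; the samples in question would instead show values that keep rising with the class.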
I don't know many details about their library prep. NIPT is, however, definitely not targeted sequencing; it is basically low-coverage whole-genome sequencing. There should be no PCR amplification, only genomic DNA fragmentation (using ultrasound or an enzyme) before the indexed library prep.