NIPT data analysis - TruSeq Indexed Adapter sequence reads and GC bias
7.6 years ago
Ning ▴ 20

Hi guys,

I am analyzing a NIPT data set for one of our collaborators. The reads are 50 bp long. In some of the samples, I see two strange things:

  1. Many of the reads are just incomplete TruSeq indexed adapter sequences. The fraction of adapter-sequence reads varies by lane and sample, from 0.2% to 8%. I wonder whether this happened because there were DNA clusters on the flowcell that contained only index adapters (and if so, why), or because the Illumina base calling and data filtering software did not trim the sequences correctly.

  2. Usually the GC bias curve is unimodal, meaning that both high-GC and high-AT bins have lower coverage. But in the samples with a high fraction of adapter-sequence reads, it is almost an ascending line (see the picture below): the coverage increases with GC content. This is very different from what I have read in the papers. Why does it seem to be correlated with over-represented indexed adapter sequences?

Left plot: original GC stats; right plot: corrected GC stats.
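For anyone who wants to reproduce this kind of curve, here is a minimal sketch of how a GC bias curve can be computed from per-bin read counts. It assumes you already have, for each genomic bin, its GC fraction and the number of aligned reads; the function name `gc_bias_curve` and the input layout are made up for illustration and are not taken from any particular NIPT pipeline.

```python
from collections import defaultdict

def gc_bias_curve(bins):
    """Mean read count per GC stratum, relative to the overall mean.

    bins: iterable of (gc_fraction, read_count) pairs, one per genomic bin.
    Returns a dict {gc_percent: relative_coverage}, sorted by GC percent.
    """
    bins = list(bins)
    if not bins:
        raise ValueError("no bins given")
    overall_mean = sum(count for _, count in bins) / len(bins)
    if overall_mean == 0:
        raise ValueError("all bins have zero reads")

    # Group bins by GC content rounded to the nearest whole percent.
    by_gc = defaultdict(list)
    for gc, count in bins:
        by_gc[round(gc * 100)].append(count)

    # A value of 1.0 means the stratum has average coverage.
    return {
        gc: (sum(counts) / len(counts)) / overall_mean
        for gc, counts in sorted(by_gc.items())
    }

# Toy input: two bins at 40% GC, two bins at 60% GC.
curve = gc_bias_curve([(0.40, 100), (0.40, 100), (0.60, 200), (0.60, 200)])
```

A unimodal curve would peak at intermediate GC and fall off on both sides; an ascending line means the relative coverage keeps growing with GC percent.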


7.6 years ago

If the TruSeq adapter sequences are 55-60% GC, they would drive the GC distribution, so your problem 2 is a symptom of problem 1.
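You can check the GC content of whatever adapter sequence you are actually seeing with a few lines of code. This is only a sketch: the 13-base sequence below is the commonly used Illumina adapter prefix and is an assumption here, so substitute the full adapter plus index sequence from your own reads.

```python
def gc_fraction(seq):
    """Fraction of G/C among the called (A, C, G, T) bases of seq."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)

# Assumption: common Illumina adapter prefix, not the full indexed adapter.
ADAPTER_PREFIX = "AGATCGGAAGAGC"
print(f"GC fraction of adapter prefix: {gc_fraction(ADAPTER_PREFIX):.3f}")
```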

Regarding overrepresentation of adapter sequences, talk to the wetlab to see if it is possible that indexes could self-bind and amplify. Sometimes you get primer-dimers and the adapters become a large portion of sequenced reads. All you can do is filter them away and hope it doesn't impact the results.
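As a rough illustration of "filter them away", the sketch below drops every read that contains the start of the adapter and reports the fraction removed. The adapter prefix, the `min_match` threshold, and the function name `split_adapter_reads` are assumptions for the example. It only does exact matching, so a dedicated trimmer such as cutadapt, which tolerates mismatches, is probably the better choice for real data.

```python
def split_adapter_reads(reads, adapter="AGATCGGAAGAGC", min_match=12):
    """Separate adapter-containing reads from the rest.

    reads: iterable of read sequences (strings).
    Returns (kept_reads, adapter_fraction).
    """
    seed = adapter[:min_match]  # exact-match seed taken from the adapter start
    kept, dropped = [], 0
    for seq in reads:
        if seed in seq.upper():
            dropped += 1
        else:
            kept.append(seq)
    total = len(kept) + dropped
    fraction = dropped / total if total else 0.0
    return kept, fraction

reads = [
    "AGATCGGAAGAGCACACGTC",  # starts with the adapter prefix
    "ACGTACGTACGTACGTACGT",
    "TTTTAGATCGGAAGAGCTTT",  # adapter prefix inside the read
    "GGGGCCCCAAAATTTTGGGG",
]
kept, fraction = split_adapter_reads(reads)
```

Dropping whole reads makes sense for adapter-dimer reads that are mostly adapter. Reads with only a short adapter read-through at the 3' end are usually trimmed rather than discarded.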

To help clarify, we need to know the protocol, including what exactly is being sequenced. When you say NIPT, do you mean WGS? NIPT could be a lot of different things, with different sequencing parameters. When I do WES, I get a bimodal distribution of GC% because not many baited regions are between 47% and 53% GC; we have peaks at 45% and 55%.

Thanks for helping, Karl.

I am sorry I forgot to mention that the GC statistics are based on the aligned reads. So actually the GC of the index sequence reads should not have been counted.

I don't know many details about their library prep. NIPT is, however, definitely not targeted sequencing; it is basically low-coverage whole genome sequencing. There should be no PCR amplification, only genomic DNA fragmentation (using ultrasound or an enzyme) before the indexed library prep.


