I am analyzing an NIPT data set for one of our collaborators. The reads are 50 bp long. In some of the samples, I observe two strange things:
Many of the reads are just incomplete TruSeq indexed adapter sequences. The fraction of these adapter reads varies by lane and sample, from 0.2% to 8%. I wonder whether this is because some DNA clusters on the flow cell contained only index adapters (and if so, why), or because the Illumina base-calling and data-filtering software did not trim the sequences correctly.
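A minimal sketch of how such an adapter fraction could be estimated, in case it helps to compare numbers. The adapter prefix below is the commonly cited start of the TruSeq index adapter and is an assumption here; it should be checked against the kit documentation. The function name `adapter_fraction` and the 20-base seed length are hypothetical choices for illustration only.

```python
# Assumed prefix of the TruSeq index adapter (verify against the kit documentation).
TRUSEQ_INDEX_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def adapter_fraction(reads, adapter=TRUSEQ_INDEX_PREFIX, min_match=20):
    """Return the fraction of reads containing the first `min_match` adapter bases.

    `reads` is any iterable of read sequences (strings). Matching is exact,
    so reads with sequencing errors inside the seed are not counted.
    """
    seed = adapter[:min_match].upper()
    total = 0
    hits = 0
    for read in reads:
        total += 1
        if seed in read.upper():
            hits += 1
    return hits / total if total else 0.0
```

Running this per lane and per sample (for example over the sequence lines of each FASTQ file) would give a fraction comparable to the range quoted above. Exact matching slightly undercounts, so a dedicated trimming tool would give a more careful estimate.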
Usually the GC bias curve is unimodal, meaning that both high-GC and high-AT bins have lower coverage. But in the samples with a high fraction of index adapter reads, it is almost an ascending line (see the picture below, or this Google Drive link): coverage increases with GC content. This is very different from what I have read in the papers. Why does it seem to be correlated with the over-represented indexed adapter sequences?
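For reference, a minimal sketch of one way a GC bias curve like this could be computed, assuming the genome has already been split into fixed-size bins with a GC fraction and a read count per bin. The function names `gc_fraction` and `gc_bias_curve` and the choice of 20 GC strata are hypothetical; this is not necessarily how the plot above was produced.

```python
from collections import defaultdict


def gc_fraction(seq):
    """GC fraction of a sequence, ignoring non-ACGT characters; None if no ACGT bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc_bias_curve(windows, n_strata=20):
    """Mean read count per GC stratum, normalized by the overall mean.

    `windows` is an iterable of (gc_fraction, read_count) pairs, one per
    genomic bin. Returns {stratum_index: normalized_coverage}; a value of
    1.0 means the stratum has average coverage.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    total = 0.0
    n = 0
    for gc, depth in windows:
        # Clamp so that gc == 1.0 falls into the last stratum.
        stratum = min(int(gc * n_strata), n_strata - 1)
        sums[stratum] += depth
        counts[stratum] += 1
        total += depth
        n += 1
    if n == 0 or total == 0:
        return {}
    overall = total / n
    return {s: (sums[s] / counts[s]) / overall for s in sorted(sums)}
```

A unimodal curve would show values below 1.0 at both the low and high strata, while the pattern described above would show values rising steadily with the stratum index.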